Real-Estate Listings Web Scraper (Goutte)
Overview
A focused web scraper that crawls the Romanian real-estate portal imobiliare.ro to extract estate-agency phone numbers from across the country. Built with Goutte, it walks the apartment-sales listings for each location in a fixed list and writes the contact numbers it finds to per-city text files.
Why It Exists
The goal (per the project’s Composer description, “numerele agentiilor imobiliare”, Romanian for “the numbers of real-estate agencies”) was to compile a nationwide dataset of real-estate agency phone numbers. This is a lead-generation and data-mining task that would be impractical to do by hand across dozens of cities.
What We Built
A single run.php driver uses Goutte’s Client to iterate over a hard-coded list of about 25 Romanian locations (Bucureşti, Timişoara, Braşov, Cluj, Iaşi, and more). For each location it requests the paginated agency listings (?agentii=1&pagina=N), selects the listing nodes with CSS selectors, applies a phone-number regex to their text, de-duplicates the matches, and writes them to a per-city output file (bucuresti.txt, cluj-napoca.txt, etc.). The committed .txt files are the captured datasets.
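The extract, de-duplicate, and export steps can be sketched as follows. This is an illustration in Python, not the project's PHP code: the regex, the normalization rule, and the helper names (`PHONE_RE`, `normalize`, `extract_phones`, `write_city_file`) are all assumptions made for the example, and the pattern run.php actually uses is not reproduced here.

```python
import re
from pathlib import Path

# Assumed pattern (not taken from run.php): a Romanian number written as
# "0" or "+40"/"0040" followed by nine digits, with optional spaces, dots
# or hyphens between digits.
PHONE_RE = re.compile(r"(?<!\d)(?:\+40|0040|0)[\s.\-]?(?:\d[\s.\-]?){8}\d(?!\d)")


def normalize(raw: str) -> str:
    """Reduce a matched number to the national 10-digit form (0xxxxxxxxx)."""
    digits = re.sub(r"\D", "", raw)
    if digits.startswith("0040"):
        return "0" + digits[4:]
    if digits.startswith("40") and len(digits) == 11:
        return "0" + digits[2:]
    return digits


def extract_phones(text: str) -> list[str]:
    """Return the unique phone numbers in text, in first-seen order."""
    seen: dict[str, None] = {}  # dict keys keep insertion order
    for match in PHONE_RE.finditer(text):
        seen.setdefault(normalize(match.group()), None)
    return list(seen)


def write_city_file(city: str, phones: list[str], out_dir: str) -> Path:
    """Write one number per line to <out_dir>/<city>.txt and return the path."""
    path = Path(out_dir) / f"{city}.txt"
    path.write_text("".join(f"{p}\n" for p in phones), encoding="utf-8")
    return path
```

Normalizing before de-duplicating means the same number written as `0721 234 567` and `+40 721 234 567` is stored once. Whether the original script normalizes at all is not stated, so treat that step as a design suggestion rather than a description.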
Technologies & Approach
Goutte, which combines Symfony’s DomCrawler and BrowserKit components with the Guzzle HTTP client, handles the HTTP requests and HTML parsing. A pagination loop keeps requesting pages until one returns no listing nodes, and a regex extracts phone numbers that are de-duplicated in memory. The script pins the timezone to Europe/Bucharest. It is deliberately minimal: a single-purpose automation script.
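The pagination rule described above (keep going until a page yields no listing nodes) can be sketched like this. It is a Python illustration under stated assumptions, not the project's PHP code: `fetch_nodes` is a hypothetical callable standing in for the HTTP request and CSS-selector filtering, `max_pages` is a safety cap added for the example, and the base URL passed to `page_url` is a placeholder. Only the `agentii=1` and `pagina=N` query parameters come from the description.

```python
from typing import Callable, Iterator


def page_url(base_url: str, page: int) -> str:
    """Build the listing URL for a given page number."""
    return f"{base_url}?agentii=1&pagina={page}"


def crawl_listing_pages(
    fetch_nodes: Callable[[int], list[str]],
    max_pages: int = 500,  # assumed safety cap, not part of the original
) -> Iterator[str]:
    """Yield listing texts page by page, starting at page 1.

    Stops at the first page that returns no listing nodes, or after
    max_pages pages, whichever comes first.
    """
    for page in range(1, max_pages + 1):
        nodes = fetch_nodes(page)
        if not nodes:  # terminal condition: empty page
            return
        yield from nodes
```

Passing the fetcher in as a parameter keeps the loop testable without network access: a stub that returns canned lists per page exercises the stop condition directly.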
Outcome / Impact
Produced concrete datasets: 25+ city-level text files of extracted agency phone numbers, demonstrating an end-to-end pipeline that scrapes, paginates, extracts, de-duplicates, and exports. The project is a practical demonstration of automation and data-extraction skills.
Capabilities Demonstrated
- Web scraping and HTML parsing with Goutte
- Pagination crawling that stops cleanly when a page returns no listing nodes
- Regex-based data extraction and de-duplication
- Producing structured datasets from unstructured web pages