Anti-Bot Screenshot & Content Scraper
Overview
A Python scraper that captures full-page screenshots and text from sites that actively defend against automation. It uses stealth browser tooling and a virtual-display setup to render and extract content from protected pages, including social platforms and news sites.
Why It Exists
Many high-value sources sit behind bot detection, Cloudflare-style challenges and CAPTCHAs that defeat naive scrapers. This tool exists to reach and capture those pages reliably, both as visual evidence (screenshots) and as extracted text.
What We Built
Two complementary entry points. run.py drives nodriver, an undetected-Chrome automation library, to navigate to target pages and save screenshots. s.py uses SeleniumBase in undetected mode (uc=True), switching to CDP mode with activate_cdp_mode() and calling uc_gui_click_captcha() to get past challenges. It runs inside an Xvfb virtual display, with PyAutoGUI (via Xlib) issuing the clicks, so GUI-level CAPTCHA clicks work without a physical screen. A Dockerfile builds a full Ubuntu image with Google Chrome, fonts, Xvfb and SeleniumBase for reproducible headless runs. Captured artefacts (timestamped screenshots, downloaded files, logs) are written out per target.
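The nodriver side of this can be sketched roughly as follows. This is a minimal illustration, not the contents of run.py: the helper names artefact_path and capture, the captures/ root directory, the timestamp format and the fixed settle delay are all assumptions made for the example. nodriver is imported lazily so the path helper can be used without a browser installed.

```python
import asyncio
import re
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse


def artefact_path(url: str, root: str = "captures",
                  when: Optional[datetime] = None) -> Path:
    """Build a per-target, timestamped screenshot path (hypothetical layout)."""
    host = urlparse(url).netloc or "unknown"
    # Replace anything that is not filesystem-safe in the directory name.
    target = re.sub(r"[^A-Za-z0-9.-]", "_", host)
    stamp = (when or datetime.now(timezone.utc)).strftime("%Y%m%dT%H%M%SZ")
    return Path(root) / target / f"{stamp}.png"


async def capture(url: str, settle_seconds: float = 5.0) -> Path:
    """Open url with nodriver and save a full-page PNG (assumed workflow)."""
    import nodriver as uc  # lazy import: only needed when a browser is driven

    out = artefact_path(url)
    out.parent.mkdir(parents=True, exist_ok=True)
    browser = await uc.start()
    try:
        page = await browser.get(url)
        # Crude fixed wait so challenges and late-loading content can settle.
        await asyncio.sleep(settle_seconds)
        await page.save_screenshot(str(out), format="png", full_page=True)
    finally:
        browser.stop()
    return out
```

To run it, nodriver's own event-loop helper can be used, for example uc.loop().run_until_complete(capture("https://example.com")) after importing nodriver as uc. A fixed delay is the simplest possible wait strategy; waiting for a known element on the target page would usually be more robust.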
Technologies & Approach
SeleniumBase UC/CDP mode and nodriver provide stealthy, detection-resistant browsing. PyAutoGUI, Xvfb and Xlib perform real GUI interactions (CAPTCHA clicks) without a physical display. Docker packages Chrome and all native dependencies for consistent execution. Running two independent engines hedges against any single anti-bot technique: a site that blocks one engine can still be attempted with the other.
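A rough sketch of the SeleniumBase path is shown below. It is not the contents of s.py: it uses SeleniumBase's built-in xvfb option rather than a hand-managed Xvfb/Xlib setup, and the function names, the default screenshot name and the two-second pauses are assumptions for illustration. Whether uc_gui_click_captcha() finds anything to click depends on the page.

```python
import os
import sys
from typing import Mapping, Optional


def needs_virtual_display(env: Optional[Mapping[str, str]] = None,
                          platform: str = sys.platform) -> bool:
    """Return True on Linux when no X display is set (e.g. inside a container)."""
    env = os.environ if env is None else env
    return platform.startswith("linux") and not env.get("DISPLAY")


def capture_with_seleniumbase(url: str, screenshot_name: str = "capture.png") -> str:
    """Open url in UC + CDP mode, attempt a CAPTCHA click, save a screenshot,
    and return the visible text of the page body."""
    from seleniumbase import SB  # lazy import: only needed when a browser is driven

    # xvfb=True asks SeleniumBase to start a virtual display, so the
    # PyAutoGUI-driven click has a screen to act on.
    with SB(uc=True, xvfb=needs_virtual_display()) as sb:
        sb.activate_cdp_mode(url)       # load the page through CDP mode
        sb.sleep(2)                     # give any challenge time to render
        sb.uc_gui_click_captcha()       # GUI-level click on a CAPTCHA checkbox
        sb.sleep(2)                     # let the page continue after the click
        sb.cdp.save_screenshot(screenshot_name)
        return sb.cdp.get_text("body")
```

Checking DISPLAY before requesting Xvfb keeps the same function usable on a desktop with a real screen and inside the Docker image, where no display exists.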
Outcome / Impact
A working capability for extracting content and screenshots from bot-protected and CAPTCHA-gated pages, a building block for data-collection pipelines where standard scrapers fail.
Capabilities Demonstrated
- Stealth / anti-detection web scraping at scale
- CAPTCHA and challenge handling via CDP + GUI automation
- Headless rendering with virtual displays (Xvfb)
- Dockerised, reproducible browser-automation environments