Infrastructure · 2026

Web Content Scraping Service with Transform-Based Caching

A media-monitoring / data-orchestration platform

Overview

A FastAPI web-content scraping service that fetches URLs (HTML pages and PDFs) via a browser-based scraper, runs configurable transforms over the content, and caches results in Redis for fast reuse.

The Challenge

A media-monitoring pipeline frequently needs the full text behind article URLs, including bot-protected pages and PDFs. Re-fetching the same content repeatedly is slow and costly, so the platform needs a cache-first fetch layer that can also normalize content through reusable transforms.

What We Built

A FastAPI service exposing two main endpoints plus a health check. /warm enqueues a URL and pre-computes its transforms; /fetch does the same but also returns the transform outputs, and auto-enqueues URLs it has not seen before. A browser-based scraper integration (Scrappey) handles difficult pages, PyMuPDF extracts text from PDFs, and a transform registry applies content transforms whose outputs are cached in Redis. The app is split into scraper, transforms, worker, models, and Redis-client modules and ships with a Dockerfile and a Compose setup.
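The warm-vs-fetch behaviour described above can be illustrated with a small in-memory sketch. This is not the service's code: the class name ContentService, its method names, the plain dicts standing in for Redis, and the choice to return None for a URL that is still pending are all assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Optional, Tuple


class ContentService:
    """Hypothetical in-memory model of the warm/fetch semantics."""

    def __init__(self, transforms: Dict[str, Callable[[str], str]]) -> None:
        self.transforms = transforms
        self.raw: Dict[str, str] = {}                   # url -> scraped content (Redis stand-in)
        self.outputs: Dict[Tuple[str, str], str] = {}   # (url, transform) -> cached output
        self.queue: Deque[str] = deque()                # URLs waiting for the scraper worker

    def _enqueue(self, url: str) -> None:
        # Deduplicate: skip URLs that are already cached or already queued.
        if url not in self.raw and url not in self.queue:
            self.queue.append(url)

    def _compute(self, url: str, names: List[str]) -> Dict[str, str]:
        # Run each requested transform once and reuse the cached output afterwards.
        results: Dict[str, str] = {}
        for name in names:
            key = (url, name)
            if key not in self.outputs:
                self.outputs[key] = self.transforms[name](self.raw[url])
            results[name] = self.outputs[key]
        return results

    def warm(self, url: str, names: List[str]) -> None:
        # Pre-computation only: nothing is returned to the caller.
        if url in self.raw:
            self._compute(url, names)
        else:
            self._enqueue(url)

    def fetch(self, url: str, names: List[str]) -> Optional[Dict[str, str]]:
        # Retrieval: return outputs when content is cached, otherwise enqueue the URL.
        if url not in self.raw:
            self._enqueue(url)
            return None  # assumed "pending" signal
        return self._compute(url, names)

    def run_worker_once(self, scrape: Callable[[str], str]) -> bool:
        # Stand-in for the worker: scrape one queued URL and cache its content.
        if not self.queue:
            return False
        url = self.queue.popleft()
        self.raw[url] = scrape(url)
        return True
```

In this sketch a first fetch of an unknown URL returns None and queues the URL; once the worker has stored the content, the same fetch returns the transform outputs straight from the cache.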

Technologies & Approach

Python with FastAPI and Pydantic for typed request/response handling, Redis (with hiredis) for the content/transform cache, httpx for HTTP, and PyMuPDF for PDF parsing. A registry pattern keeps transforms pluggable; warm-vs-fetch semantics separate pre-computation from retrieval.
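As a rough illustration of the registry pattern mentioned above, the sketch below registers transforms by name through a decorator and derives a per-transform cache key. The names TRANSFORMS, register, apply_transform, and cache_key, the two example transforms, and the key layout are hypothetical and not taken from the project.

```python
import hashlib
import re
from typing import Callable, Dict

# Hypothetical registry: transform name -> function over extracted text.
TRANSFORMS: Dict[str, Callable[[str], str]] = {}


def register(name: str) -> Callable[[Callable[[str], str]], Callable[[str], str]]:
    """Decorator that adds a transform to the registry under `name`."""
    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        if name in TRANSFORMS:
            raise ValueError(f"transform already registered: {name}")
        TRANSFORMS[name] = fn
        return fn
    return decorator


@register("collapse_whitespace")
def collapse_whitespace(text: str) -> str:
    # Replace every run of whitespace with a single space.
    return re.sub(r"\s+", " ", text).strip()


@register("lowercase")
def lowercase(text: str) -> str:
    return text.lower()


def apply_transform(name: str, text: str) -> str:
    """Look up a transform by name and apply it to the text."""
    if name not in TRANSFORMS:
        raise KeyError(f"unknown transform: {name}")
    return TRANSFORMS[name](text)


def cache_key(url: str, transform: str) -> str:
    # Assumed key layout: hashing the URL keeps keys short and fixed-length.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return f"transform:{transform}:{digest}"
```

Adding a new transform then only requires defining a decorated function; the code that looks transforms up by name does not change.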

Outcome / Impact

Gives the pipeline a fast, deduplicated content-fetch layer, so downstream enrichment and matching operate on cleaned text without redundant scraping.

Capabilities Demonstrated

  • Cache-first scraping architecture (warm vs. fetch)
  • Browser-based scraping for protected pages
  • Pluggable transform registry over fetched content
  • PDF text extraction in the ingestion path