Article Collection & Enrichment Pipeline on Apache Airflow 3
A media-monitoring / data-orchestration platform
Overview
A data collection and enrichment pipeline built on Apache Airflow 3 that ingests articles from multiple sources for client subscriptions, deduplicates them, enriches them with metadata, and persists them to PostgreSQL.
The Challenge
A media-monitoring platform must continuously pull content from many upstream providers, normalize and deduplicate it at scale, enrich it, and make it queryable, while supporting point-in-time recovery and reprocessing when matching rules or enrichment logic change.
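The normalize-and-deduplicate requirement can be illustrated with a content fingerprint. This is a minimal sketch under stated assumptions, not the platform's actual matching logic: the `title` and `body` field names, the whitespace/case normalization, and the choice of SHA-256 are all hypothetical.

```python
import hashlib
import re
from typing import Iterable, Iterator


def fingerprint(article: dict) -> str:
    """Return a stable hash of the article's normalized title and body.

    The 'title' and 'body' keys are assumed field names for illustration.
    """
    text = f"{article.get('title', '')}\n{article.get('body', '')}"
    # Collapse runs of whitespace, trim, and lowercase so trivially
    # different copies of the same article hash to the same value.
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(articles: Iterable[dict]) -> Iterator[dict]:
    """Yield only the first article seen for each fingerprint."""
    seen: set[str] = set()
    for article in articles:
        key = fingerprint(article)
        if key in seen:
            continue
        seen.add(key)
        yield article
```

At scale the in-memory `seen` set would presumably be replaced by a lookup against persisted fingerprints, for example a unique index in PostgreSQL.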
What We Built
A multi-stage DAG flow, SUBSCRIPTION_DISPATCHER → DATA_COLLECTOR → JOURNAL → ENRICHMENT → LEDGER → VAULT, implemented across dozens of DAGs. These cover enrichment v2, the journal and ledger stages, Firestore subscription sync, LexisNexis ingestion, broadcast handling, partition maintenance, archive reprocessing, backfills, and retries of failed runs.
The design is S3-centric: all intermediate data lands in S3/GCS (or a mock store for local development). TTL cleanup removes short-lived intermediates, while immutable archives enable point-in-time recovery and reprocessing.
Articles are processed in batches of 500 on the CeleryExecutor with Redis, and the deployment runs the DAG Processor, API Server, and Triggerer as separate services, following Airflow 3’s component split. Structured task logs and custom metrics flow to Google Cloud Logging and Monitoring; OpenAI is used for transcript cleaning.
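The stage order and the 500-article batching described above can be sketched in plain Python. This is an illustration only, not the project's DAG code: the helper names `batched` and `next_stage` and the constants are assumptions, and in the real pipeline each stage is an Airflow DAG rather than a string.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")

# Stage names taken from the flow described above, in execution order.
STAGES = (
    "SUBSCRIPTION_DISPATCHER",
    "DATA_COLLECTOR",
    "JOURNAL",
    "ENRICHMENT",
    "LEDGER",
    "VAULT",
)
BATCH_SIZE = 500  # articles per batch, as stated in the text


def batched(items: Iterable[T], size: int = BATCH_SIZE) -> Iterator[list[T]]:
    """Split any iterable into consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def next_stage(stage: str) -> Optional[str]:
    """Return the stage that follows `stage`, or None after the last one."""
    index = STAGES.index(stage)
    return STAGES[index + 1] if index + 1 < len(STAGES) else None
```

Each batch could then become one unit of work handed to a Celery worker, which keeps individual task payloads bounded regardless of how many articles a collection run returns.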
Technologies & Approach
Apache Airflow 3.1 on Python 3.11, distributed via Celery + Redis, containerized with Docker Compose. Google Cloud providers (Storage, Logging, Monitoring) plus Firebase Admin for Firestore access. A “mock mode” lets the entire pipeline run locally without AWS/GCS dependencies.
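One common way to implement such a mock mode is to hide object storage behind a small interface and select the implementation from configuration. The sketch below assumes this approach; the `ObjectStore` protocol, the `MockStore` class, and the `PIPELINE_MOCK_MODE` variable are hypothetical names, not the project's own.

```python
import os
from typing import Mapping, Optional, Protocol


class ObjectStore(Protocol):
    """Minimal storage interface the pipeline stages would depend on."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class MockStore:
    """In-memory stand-in for S3/GCS, intended for local development."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def make_store(env: Optional[Mapping[str, str]] = None) -> ObjectStore:
    """Pick a store based on the (hypothetical) PIPELINE_MOCK_MODE flag."""
    env = os.environ if env is None else env
    if env.get("PIPELINE_MOCK_MODE", "").lower() in {"1", "true", "yes"}:
        return MockStore()
    # A real deployment would return a cloud-backed store here; it is
    # left out so the sketch has no cloud dependencies.
    raise NotImplementedError("cloud-backed store not included in this sketch")
```

Because stages only see the `ObjectStore` interface, the same task code can run unchanged against the mock locally and against cloud storage in production.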
Outcome / Impact
The pipeline acts as the backbone that feeds the platform’s downstream matching, API, and dashboard layers. Because archived data is immutable and any stage can be reprocessed from it, enrichment and matching can be re-run safely as rules evolve.
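The idea behind immutable archives and safe reprocessing can be sketched as a write-once store plus a pure re-run over archived inputs. This is a simplified in-memory illustration; the `ImmutableArchive` class, the `reprocess` helper, and the date-prefixed key layout are assumptions rather than the project's implementation.

```python
from typing import Callable


class ImmutableArchive:
    """Write-once store: an existing key can be read but never overwritten."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def write_once(self, key: str, data: bytes) -> None:
        if key in self._objects:
            raise FileExistsError(f"archive key already exists: {key}")
        self._objects[key] = data

    def read(self, key: str) -> bytes:
        return self._objects[key]

    def keys(self, prefix: str = "") -> list[str]:
        """Return all keys starting with `prefix`, sorted for determinism."""
        return sorted(k for k in self._objects if k.startswith(prefix))


def reprocess(
    archive: ImmutableArchive,
    prefix: str,
    enrich: Callable[[bytes], bytes],
) -> dict[str, bytes]:
    """Apply a (possibly updated) enrichment function to archived inputs.

    The archive is only read, so a run can be repeated with new logic
    without altering the original data.
    """
    return {key: enrich(archive.read(key)) for key in archive.keys(prefix)}
```

With a key layout that encodes the collection date, selecting a prefix such as `raw/2024-01-01/` would restrict a re-run to one day's inputs, which is one way to support point-in-time reprocessing.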
Capabilities Demonstrated
- Production-grade orchestration on Apache Airflow 3
- S3/GCS-centric immutable data lake with TTL and reprocessing
- Distributed batch processing with Celery + Redis
- Multi-source ingestion (incl. LexisNexis, broadcast, scientific grantees)
- Cloud-native observability (structured logging, custom metrics)