Infrastructure · 2025–26 Flagship

Article Collection & Enrichment Pipeline on Apache Airflow 3

A media-monitoring / data-orchestration platform

Overview

A data collection and enrichment pipeline built on Apache Airflow 3 that ingests articles from multiple sources for client subscriptions, deduplicates them, enriches them with metadata, and persists them to PostgreSQL.

The Challenge

A media-monitoring platform must continuously pull content from many upstream providers, normalize and deduplicate it at scale, enrich it, and make it queryable, while supporting point-in-time recovery and reprocessing when matching rules or enrichment logic change.

What We Built

A multi-stage DAG flow, SUBSCRIPTION_DISPATCHER → DATA_COLLECTOR → JOURNAL → ENRICHMENT → LEDGER → VAULT, implemented across dozens of DAGs including enrichment v2, journal/ledger stages, Firestore subscription sync, LexisNexis ingestion, broadcast handling, partition maintenance, archive reprocessing, backfills, and retry-of-failed-runs. The design is S3-centric: all intermediate data lands in S3/GCS (or a mock store for local dev) with TTL cleanup and immutable archives enabling point-in-time recovery and reprocessing. Batch processing runs 500 articles per batch via CeleryExecutor with Redis, and Airflow 3’s new DAG Processor / API Server / Triggerer split is used explicitly. Structured task logs and custom metrics flow to Google Cloud Logging and Monitoring; OpenAI is used for transcript cleaning.

Technologies & Approach

Apache Airflow 3.1 on Python 3.11, distributed via Celery + Redis, containerized with Docker Compose. Google Cloud providers (Storage, Logging, Monitoring) plus Firebase Admin for Firestore access. A “mock mode” lets the entire pipeline run locally without AWS/GCS dependencies.

Outcome / Impact

Acts as the backbone that feeds the platform’s downstream matching, API, and dashboard layers. The immutable-archive and reprocessing design makes it possible to re-run enrichment and matching safely as rules evolve.

Capabilities Demonstrated

Production-grade orchestration on Apache Airflow 3
S3/GCS-centric immutable data lake with TTL and reprocessing
Distributed batch processing with Celery + Redis
Multi-source ingestion (incl. LexisNexis, broadcast, scientific grantees)
Cloud-native observability (structured logging, custom metrics)

More work See all →

Product 2026