Infrastructure · 2025

Airflow 3 Data Collection & Enrichment Pipeline

A media-monitoring / data-orchestration platform

Overview

An Apache Airflow 3 orchestration layer that collects articles from multiple sources for client subscriptions, deduplicates and enriches them with metadata, and persists the results to PostgreSQL. Built around an S3-centric, immutable-archive design for point-in-time recovery and reprocessing.
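The deduplication step could be sketched roughly as follows. This is a minimal illustration, not the project's actual logic: the `fingerprint` and `deduplicate` helpers, the `url` and `title` fields, and the choice of a SHA-256 hash over a normalized URL and title are all assumptions made for the example.

```python
import hashlib
from typing import Iterable, Iterator


def fingerprint(article: dict) -> str:
    """Build a dedup key for an article (hypothetical scheme).

    Assumes each article has 'url' and 'title' fields. Lowercasing the
    whole URL is a simplification, since URL paths can be case-sensitive.
    """
    url = article["url"].strip().lower().rstrip("/")
    title = article["title"].strip().lower()
    return hashlib.sha256(f"{url}|{title}".encode("utf-8")).hexdigest()


def deduplicate(articles: Iterable[dict]) -> Iterator[dict]:
    """Yield only the first article seen for each fingerprint."""
    seen: set[str] = set()
    for article in articles:
        key = fingerprint(article)
        if key in seen:
            continue  # same key already emitted, skip the duplicate
        seen.add(key)
        yield article
```

Because the first occurrence wins and order is preserved, rerunning the step over the same archived input gives the same output, which suits a pipeline built for reprocessing.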

The Challenge

Continuously collecting media content for many subscriptions means coordinating fan-out collection, deduplication, enrichment, and storage at batch scale, while keeping intermediate state recoverable and the system reprocessable when logic changes. This calls for a robust, observable orchestration framework rather than ad-hoc cron jobs.

What We Built

A modular Airflow 3.1 deployment (Python 3.11) running the full Airflow 3 component set: API server, the standalone DAG processor that Airflow 3 requires, scheduler, Celery workers, and a triggerer for deferrable tasks. The pipeline follows a staged flow, SUBSCRIPTION_DISPATCHER → DATA_COLLECTOR → JOURNAL → ENRICHMENT → LEDGER → VAULT, implemented as DAGs (journal, ledger, enrichment, cache export/rebuild, plus data-collector and pub/sub consumer templates). All intermediate data lands in S3/GCS and is removed by TTL-based cleanup, with a mock-storage mode for fully local development. Articles are processed in batches of a configurable size (500 items), and immutable archives make it possible to reprocess data when logic changes. CeleryExecutor with Redis provides distributed execution, and the Google and Postgres providers connect the pipeline to the wider platform. Docker Compose, a DevContainer, and a Makefile make the stack reproducible locally.
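The 500-item batching described above can be illustrated with a small helper. This is a sketch under assumptions: the name `iter_batches` and the `DEFAULT_BATCH_SIZE` constant are hypothetical, and the real pipeline may split work differently. It uses `itertools.islice` because `itertools.batched` only arrived in Python 3.12, while the stack runs on 3.11.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

# Assumed value, taken from the 500-item batch size mentioned in the text.
DEFAULT_BATCH_SIZE = 500


def iter_batches(
    items: Iterable[T], batch_size: int = DEFAULT_BATCH_SIZE
) -> Iterator[list[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    # islice pulls up to batch_size items per pass; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch
```

Each batch could then be handed to its own task, for example through Airflow's dynamic task mapping, so that a failed batch is retried without redoing the others.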

Technologies & Approach

Apache Airflow 3.1 on Python 3.11 with CeleryExecutor backed by Redis, plus the Google Cloud, Postgres, and Redis providers. SQLAlchemy 2 and psycopg2 handle database access, and google-cloud-storage backs the S3/GCS-centric data lake. The mock-mode design lets engineers run the full pipeline without AWS/GCS credentials, which speeds up development and testing.
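One way a mock-storage mode like this can be structured is a small storage interface with an in-memory implementation selected by configuration. The sketch below is illustrative only: the `ObjectStore`, `InMemoryStore`, `GcsStore`, and `make_store` names and the `STORAGE_MODE` and `ARCHIVE_BUCKET` settings are hypothetical, not taken from the project. The write-once check in the mock mirrors the immutable-archive idea.

```python
import os
from typing import Mapping, Optional, Protocol


class ObjectStore(Protocol):
    """Minimal storage interface the pipeline stages would depend on."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Mock backend: keeps objects in a dict, no cloud credentials needed."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        if key in self._objects:
            # Archives are treated as immutable: never overwrite a key.
            raise FileExistsError(key)
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


class GcsStore:
    """Real backend using google-cloud-storage (imported lazily)."""

    def __init__(self, bucket_name: str) -> None:
        from google.cloud import storage  # only needed outside mock mode

        self._bucket = storage.Client().bucket(bucket_name)

    def put(self, key: str, data: bytes) -> None:
        self._bucket.blob(key).upload_from_string(data)

    def get(self, key: str) -> bytes:
        return self._bucket.blob(key).download_as_bytes()


def make_store(env: Optional[Mapping[str, str]] = None) -> ObjectStore:
    """Pick a backend from configuration (hypothetical variable names)."""
    env = os.environ if env is None else env
    if env.get("STORAGE_MODE", "mock") == "mock":
        return InMemoryStore()
    return GcsStore(env["ARCHIVE_BUCKET"])
```

With this shape, DAG code calls `make_store()` once and never branches on the environment itself, so the same task code runs locally and against real buckets.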

Outcome / Impact

Provides the platform’s backbone for scheduled, large-scale content collection and enrichment, with recoverable intermediate state and reprocessing built in. It also demonstrates adoption of Airflow 3’s separated DAG-processor architecture.

Capabilities Demonstrated

  • Standing up Apache Airflow 3 with its new DAG-processor architecture
  • Designing staged, S3-centric ETL with immutable archives and reprocessing
  • Distributed batch task execution with CeleryExecutor and Redis
  • Local-first development via mock storage, Docker Compose, and DevContainers
  • Integrating GCS/S3, PostgreSQL, and Pub/Sub into one orchestration layer