LLM-Assisted Domain Tier Classification Pipeline
Overview
A three-phase data pipeline that scores and tiers news-source domains for an aggregator. It extracts behavioral signals from a warehouse, applies a deterministic rule-based pre-classifier, and escalates only the uncertain cases to an LLM, producing a final per-domain tier classification.
The Challenge
Ranking thousands of news domains by quality and role (original publisher vs. wire service vs. aggregator vs. blog/scraper) is nuanced: the same signal can mean opposite things depending on context (e.g. wire services have low originality because their content is syndicated). Pure rules are brittle and pure LLM calls are expensive at scale, so the pipeline blends both.
What We Built
A Python orchestrator (run_pipeline.py) driving three phases. Phase 1 (phase1_extract_signals.py) pulls signals from a PostgreSQL warehouse, originality rate, syndication group origination/spread, sources-copied-from, average reach, unique authors, and content-type breakdown, into CSV. Phase 2 (phase2_rule_classifier.py) applies a rule-based pre-classification against tier_definitions.py, separating confident calls from uncertain_domains.csv. Phase 3 (phase3_llm_classifier.py) sends the uncertain set to the Anthropic API with a domain-expert system prompt and a signal-interpretation guide, batching 40 domains per call with throttling. Outputs include domain_signals.csv, preclassified.csv, and final_classification.csv.
Technologies & Approach
Python with psycopg-style PostgreSQL access for signal extraction, a hand-written rule engine for cheap high-confidence calls, and the anthropic SDK (model claude-sonnet-4-20250514) for the hard cases, configured with batch size and inter-call delay to control cost and rate limits. The tiered escalation (rules first, LLM only when needed) keeps the run economical over large domain sets.
Outcome / Impact
A reusable classification pipeline that turns raw warehouse signals into an explainable, tiered ranking of news sources, demonstrating a cost-aware hybrid of deterministic rules and LLM judgment for large-scale data labeling.
Capabilities Demonstrated
- Multi-stage data classification pipeline design
- Behavioral signal/feature extraction from a SQL warehouse
- Hybrid rule-based + LLM classification with confidence routing
- Cost- and rate-aware LLM batching for large datasets