Capability · 2026

LLM-Assisted Domain Tier Classification Pipeline

Overview

A three-phase data pipeline that scores and tiers news-source domains for an aggregator. It extracts behavioral signals from a warehouse, applies a deterministic rule-based pre-classifier, and escalates only the uncertain cases to an LLM, producing a final per-domain tier classification.

The Challenge

Ranking thousands of news domains by quality and role (original publisher vs. wire service vs. aggregator vs. blog/scraper) is nuanced, because the same signal can mean opposite things depending on context. A wire service, for example, shows low originality because its content is syndicated, while the same low score on an obscure blog more likely points to scraping. Pure rules are brittle and pure LLM calls are expensive at scale, so the pipeline blends both.

What We Built

A Python orchestrator (run_pipeline.py) drives three phases. Phase 1 (phase1_extract_signals.py) pulls signals from a PostgreSQL warehouse into CSV: originality rate, syndication group origination/spread, sources-copied-from, average reach, unique authors, and content-type breakdown. Phase 2 (phase2_rule_classifier.py) applies a rule-based pre-classification against tier_definitions.py, keeping the confident calls and writing the rest to uncertain_domains.csv. Phase 3 (phase3_llm_classifier.py) sends the uncertain set to the Anthropic API with a domain-expert system prompt and a signal-interpretation guide, batching 40 domains per call with throttling between calls. Outputs include domain_signals.csv, preclassified.csv, and final_classification.csv.
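
The Phase 2 routing can be sketched roughly as follows. This is a minimal illustration, not the contents of phase2_rule_classifier.py or tier_definitions.py: the DomainSignals fields, the tier labels, and every threshold are assumptions chosen only to show how a rule either returns a confident tier or defers the domain to the LLM.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass
class DomainSignals:
    # Hypothetical subset of the Phase 1 signals; field names are assumed.
    domain: str
    originality_rate: float    # share of articles judged original, 0..1
    syndication_spread: int    # other domains that republished its stories
    sources_copied_from: int   # distinct domains it copied from
    unique_authors: int


def preclassify(s: DomainSignals) -> Optional[str]:
    """Return a tier label when a rule fires confidently, else None."""
    # Low originality with wide outward spread reads as syndication, not theft.
    if s.originality_rate < 0.2 and s.syndication_spread >= 50:
        return "wire_service"
    # High originality backed by a real newsroom's worth of bylines.
    if s.originality_rate >= 0.8 and s.unique_authors >= 20:
        return "original_publisher"
    # Low originality, many upstream sources, nobody republishing it.
    if (s.originality_rate < 0.2 and s.sources_copied_from >= 30
            and s.syndication_spread == 0):
        return "aggregator_or_scraper"
    return None  # no rule is confident: escalate to the LLM phase


def route(domains: Iterable[DomainSignals]) -> Tuple[Dict[str, str], List[str]]:
    """Split domains into confident rule calls and an uncertain remainder."""
    confident: Dict[str, str] = {}
    uncertain: List[str] = []
    for s in domains:
        tier = preclassify(s)
        if tier is None:
            uncertain.append(s.domain)
        else:
            confident[s.domain] = tier
    return confident, uncertain
```

Note how the first and third rules share the same low originality_rate condition and diverge only on the surrounding signals, which is the context dependence described under The Challenge.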

Technologies & Approach

Python with psycopg-style PostgreSQL access for signal extraction, a hand-written rule engine for cheap high-confidence calls, and the anthropic SDK (model claude-sonnet-4-20250514) for the hard cases, with a configurable batch size and inter-call delay to control cost and stay within rate limits. The tiered escalation (rules first, LLM only when needed) keeps the run economical over large domain sets.
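
The batching and throttling could look something like the sketch below. The batch size of 40 comes from the description above; the one-second delay, the function names, and the classify_batch callable are assumptions. In the real pipeline that callable would presumably wrap a request to the Anthropic API, but it is left injectable here so the loop can be exercised without network access.

```python
import time
from typing import Callable, Dict, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def classify_in_batches(
    domains: Sequence[str],
    classify_batch: Callable[[List[str]], Dict[str, str]],  # hypothetical LLM wrapper
    batch_size: int = 40,          # from the write-up
    delay_seconds: float = 1.0,    # assumed value; tune to the account's rate limits
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Classify uncertain domains batch by batch, pausing between calls."""
    results: Dict[str, str] = {}
    batches = list(chunked(domains, batch_size))
    for index, batch in enumerate(batches):
        results.update(classify_batch(batch))
        # Throttle between calls, but do not wait after the final batch.
        if index < len(batches) - 1:
            sleep(delay_seconds)
    return results
```

Passing the sleep function as a parameter is a small design choice that lets a test substitute a recorder for time.sleep, so throttling behaviour can be checked without actually waiting.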

Outcome / Impact

A reusable classification pipeline that turns raw warehouse signals into an explainable, tiered ranking of news sources, demonstrating a cost-aware hybrid of deterministic rules and LLM judgment for large-scale data labeling.

Capabilities Demonstrated

Multi-stage data classification pipeline design
Behavioral signal/feature extraction from a SQL warehouse
Hybrid rule-based + LLM classification with confidence routing
Cost- and rate-aware LLM batching for large datasets
