LLM-Assisted Author Entity Resolution against OpenAlex
A media-monitoring / data-orchestration platform
Overview
A data-enrichment script that resolves scientific grantees to their OpenAlex author IDs, combining direct ORCID lookups with LLM-assisted disambiguation for ambiguous cases.
Why It Exists
Linking grant/author records to a canonical scholarly identifier (OpenAlex) is a classic entity-resolution problem: some authors have an ORCID and resolve cleanly, but many require judgment to pick the right candidate from name-collision results using surrounding context.
What We Built
A Python tool with a two-strategy approach: authors with an ORCID get a direct OpenAlex lookup (no LLM needed); authors without one are searched by name in OpenAlex, and an OpenAI model then picks the correct match from the candidates using the full record context: institution, project, program, and research field. The script runs concurrently with a ThreadPoolExecutor, checkpoints progress to JSON so an interrupted run can resume, and emits an updated CSV plus a detailed matching report and a needs-review queue. Companion scripts retry unmatched records and test the matching prompt and model.
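The routing between the two strategies can be sketched as below. This is a minimal illustration, not the project's actual code: the `Grantee` and `Resolution` types, the three injected callables (`lookup_by_orcid`, `search_by_name`, `pick_with_llm`), and the candidate dictionaries with an `"id"` key are all hypothetical names. Falling back to name search when an ORCID lookup returns nothing is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Grantee:
    name: str
    orcid: Optional[str] = None
    institution: str = ""
    project: str = ""
    program: str = ""
    field: str = ""


@dataclass
class Resolution:
    openalex_id: Optional[str]
    method: str  # "orcid", "llm", or "unmatched"


def resolve(
    grantee: Grantee,
    lookup_by_orcid: Callable[[str], Optional[str]],
    search_by_name: Callable[[str], list],
    pick_with_llm: Callable[[Grantee, list], Optional[str]],
) -> Resolution:
    """Try the deterministic ORCID path first; use the LLM only if needed."""
    if grantee.orcid:
        author_id = lookup_by_orcid(grantee.orcid)
        if author_id:
            return Resolution(author_id, "orcid")
        # Assumption: an ORCID that OpenAlex does not know falls through
        # to the name-search path instead of failing outright.

    candidates = search_by_name(grantee.name)
    if not candidates:
        return Resolution(None, "unmatched")

    chosen = pick_with_llm(grantee, candidates)
    # Accept the model's answer only if it is one of the offered candidates,
    # so an invented ID can never reach the output.
    if chosen in {c["id"] for c in candidates}:
        return Resolution(chosen, "llm")
    return Resolution(None, "unmatched")
```

Passing the lookups in as callables keeps the routing logic testable without network access or API keys; in a real run they would wrap the OpenAlex and OpenAI clients.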
Technologies & Approach
Python with the OpenAlex REST API, ORCID-based fast paths, and an OpenAI LLM (GPT-4o) for context-aware disambiguation, parallelized for throughput and made resumable via progress checkpoints. CSV/JSON artifacts make the results auditable.
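One way the concurrent, resumable batch loop might look is sketched here. The function names, the checkpoint layout (a flat JSON object keyed by record ID), and the choice to save after every completed record are assumptions for illustration; the source only states that a ThreadPoolExecutor and JSON checkpoints are used.

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor, as_completed


def load_checkpoint(path):
    """Return previously saved results, or an empty dict on a first run."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    return {}


def save_checkpoint(path, done):
    """Write to a temp file, then rename, so a crash cannot leave a torn file."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(done, fh)
    os.replace(tmp, path)


def run_batch(records, resolve_one, checkpoint_path, max_workers=8):
    """Resolve records in parallel, skipping any already in the checkpoint.

    `records` maps a string record ID to whatever `resolve_one` accepts;
    `resolve_one` must return something JSON-serializable.
    """
    done = load_checkpoint(checkpoint_path)
    pending = {key: rec for key, rec in records.items() if key not in done}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(resolve_one, rec): key for key, rec in pending.items()}
        for future in as_completed(futures):
            key = futures[future]
            try:
                done[key] = future.result()
            except Exception:
                # Failed records stay out of the checkpoint, so a later
                # run (or a retry script) picks them up again.
                continue
            # Only the main thread writes the checkpoint, so no lock is needed.
            save_checkpoint(checkpoint_path, done)
    return done
```

Running `run_batch` a second time with the same checkpoint path only submits records that are missing from the file, which is what makes an interrupted run cheap to resume.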
Outcome / Impact
Proved out a pragmatic, cost-aware entity-resolution pattern: cheap deterministic lookups first, LLM judgment only where needed. The run produced OpenAlex IDs for grantees plus a transparent report and a review queue for the uncertain matches.
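A sketch of how results could be split into accepted matches and a needs-review queue follows. The source does not say how uncertainty is measured, so the per-row `confidence` score, the 0.8 threshold, and the column names are assumptions made purely for illustration.

```python
import csv
import io


def split_results(results, min_confidence=0.8):
    """Partition rows into (matched, needs_review).

    Assumption: each row is a dict with an `openalex_id` (or None) and a
    numeric `confidence`; rows without an ID or below the threshold go
    to the review queue.
    """
    matched, needs_review = [], []
    for row in results:
        accepted = bool(row.get("openalex_id")) and row.get("confidence", 0.0) >= min_confidence
        (matched if accepted else needs_review).append(row)
    return matched, needs_review


def write_csv(rows, handle, fieldnames):
    """Write rows to an open text handle, ignoring any extra keys."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    sample = [
        {"name": "A. Author", "openalex_id": "A1", "confidence": 1.0},
        {"name": "B. Author", "openalex_id": "A2", "confidence": 0.5},
        {"name": "C. Author", "openalex_id": None, "confidence": 0.0},
    ]
    _, review = split_results(sample)
    buffer = io.StringIO()
    write_csv(review, buffer, ["name", "openalex_id", "confidence"])
    print(buffer.getvalue())
```

Keeping the review queue as its own CSV means a human can work through the uncertain rows without touching the accepted matches.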
Capabilities Demonstrated
- Record linkage / entity resolution against scholarly data
- Hybrid deterministic + LLM disambiguation strategy
- Concurrent, checkpointed, resumable batch processing
- Auditable outputs with an explicit needs-review queue