Product · 2026

LLM-Assisted Author Entity Resolution against OpenAlex

A media-monitoring / data-orchestration platform

Overview

A data-enrichment script that resolves scientific grantees to their OpenAlex author IDs, combining direct ORCID lookups with LLM-assisted disambiguation for ambiguous cases.

Why It Exists

Linking grant/author records to a canonical scholarly identifier (OpenAlex) is a classic entity-resolution problem: some authors have an ORCID and resolve cleanly, but many require judgment to pick the right candidate from name-collision results using surrounding context.

What We Built

A Python tool with a two-strategy approach. Authors with an ORCID get a direct OpenAlex lookup (no LLM needed). Authors without one are searched by name in OpenAlex, then an OpenAI model picks the correct match from the candidates using the full record context: institution, project, program, and research field. The script runs concurrently with a ThreadPoolExecutor, checkpoints progress to JSON so an interrupted run can resume, and emits an updated CSV plus a detailed matching report and a needs-review queue. Companion scripts retry unmatched records and test the matching prompt and model.
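The two-strategy routing can be sketched roughly as follows. This is an illustrative sketch rather than the project's actual code: `resolve_author`, `fetch_json`, `pick_candidate`, the `method` labels, and the fall-through from a failed ORCID lookup to a name search are all assumptions. The URL shapes follow the public OpenAlex authors endpoint; the HTTP call and the LLM call are passed in as functions so the routing logic can be read and tested on its own.

```python
from typing import Callable, Optional
from urllib.parse import urlencode

OPENALEX = "https://api.openalex.org"


def orcid_lookup_url(orcid: str) -> str:
    """Build a direct author lookup URL from a bare ORCID or an orcid.org URL."""
    bare = orcid.strip().rstrip("/").rsplit("/", 1)[-1]
    return f"{OPENALEX}/authors/orcid:{bare}"


def name_search_url(name: str, per_page: int = 10) -> str:
    """Build an author name-search URL."""
    return f"{OPENALEX}/authors?" + urlencode({"search": name, "per-page": per_page})


def resolve_author(
    record: dict,
    fetch_json: Callable[[str], Optional[dict]],          # hypothetical HTTP helper
    pick_candidate: Callable[[dict, list], Optional[str]],  # hypothetical LLM helper
) -> dict:
    """Route one record: ORCID fast path first, LLM disambiguation otherwise."""
    orcid = (record.get("orcid") or "").strip()
    if orcid:
        author = fetch_json(orcid_lookup_url(orcid))
        if author and author.get("id"):
            return {"openalex_id": author["id"], "method": "orcid"}
        # Assumption: a failed ORCID lookup falls through to name search.

    response = fetch_json(name_search_url(record.get("name", ""))) or {}
    candidates = response.get("results", [])
    if not candidates:
        return {"openalex_id": None, "method": "unmatched"}

    chosen = pick_candidate(record, candidates)  # returns an OpenAlex ID or None
    if chosen is None:
        return {"openalex_id": None, "method": "needs_review"}
    return {"openalex_id": chosen, "method": "llm"}
```

Keeping the network and model calls behind plain callables means the cheap deterministic path never touches the LLM, which is the cost-saving idea behind the design.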

Technologies & Approach

Python with the OpenAlex REST API, ORCID-based fast paths, and an OpenAI LLM (GPT-4o) for context-aware disambiguation, parallelized for throughput and made resumable via progress checkpoints. CSV/JSON artifacts make the results auditable.
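As a rough illustration of context-aware disambiguation, the sketch below builds a prompt from a grantee record and its OpenAlex candidates, then parses the model's reply. The prompt wording, the record keys (`institution`, `project`, `program`, `field`), the candidate fields read from OpenAlex (`display_name`, `works_count`, `last_known_institutions`), and the numbered-answer convention are all assumptions made for this example. The model call itself is left out, so nothing here depends on a particular OpenAI client version.

```python
def build_prompt(record: dict, candidates: list) -> str:
    """Assemble a disambiguation prompt (hypothetical wording and fields)."""
    lines = [
        "Pick the OpenAlex author that matches this grantee.",
        "Answer with the candidate number only, or NONE if no candidate fits.",
        "",
        f"Name: {record.get('name', '')}",
        f"Institution: {record.get('institution', '')}",
        f"Project: {record.get('project', '')}",
        f"Program: {record.get('program', '')}",
        f"Research field: {record.get('field', '')}",
        "",
        "Candidates:",
    ]
    for number, cand in enumerate(candidates, start=1):
        institutions = cand.get("last_known_institutions") or []
        inst_names = ", ".join(i.get("display_name", "") for i in institutions) or "unknown"
        lines.append(
            f"{number}. {cand.get('display_name', '')}"
            f" | institutions: {inst_names}"
            f" | works: {cand.get('works_count', 'unknown')}"
        )
    return "\n".join(lines)


def parse_choice(reply: str, candidates: list):
    """Map the model's reply to a candidate ID; anything unexpected means no match."""
    parts = reply.strip().split()
    if not parts:
        return None
    token = parts[0].rstrip(".,")
    if token.isdigit() and 1 <= int(token) <= len(candidates):
        return candidates[int(token) - 1]["id"]
    return None
```

Treating every malformed or out-of-range reply as "no match" sends uncertain cases to human review instead of guessing.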

Outcome / Impact

Proved out a pragmatic, cost-aware entity-resolution pattern: cheap deterministic lookups first, LLM judgment only where needed. The result is a set of OpenAlex IDs for grantees plus a transparent report and a review queue for the uncertain matches.
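One way the output artifacts might be assembled is sketched below. The result shape (a dict keyed by record ID with `openalex_id` and `method`), the column names, and the set of methods routed to review are assumptions for illustration, not the tool's actual schema.

```python
import csv
from collections import Counter

# Assumption: these outcome labels are the ones a human should look at.
REVIEW_METHODS = {"needs_review", "unmatched", "error"}


def split_results(rows: list, results: dict):
    """Return (updated rows, needs-review rows, per-method summary counts)."""
    updated, review = [], []
    counts = Counter()
    for row in rows:
        res = results.get(str(row["id"]), {"openalex_id": None, "method": "unmatched"})
        out = {
            **row,
            "openalex_id": res.get("openalex_id") or "",
            "match_method": res["method"],
        }
        updated.append(out)
        counts[res["method"]] += 1
        if res["method"] in REVIEW_METHODS:
            review.append(out)
    return updated, review, dict(counts)


def write_csv(path: str, rows: list) -> None:
    """Write rows to CSV, taking the header from the first row."""
    if not rows:
        return
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

Recording how each ID was obtained next to the ID itself is what makes the output auditable: a reviewer can filter on `match_method` and check only the LLM-chosen or unresolved rows.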

Capabilities Demonstrated

Record linkage / entity resolution against scholarly data
Hybrid deterministic + LLM disambiguation strategy
Concurrent, checkpointed, resumable batch processing
Auditable outputs with an explicit needs-review queue
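The concurrent, checkpointed, resumable processing listed above could look roughly like this. It is a minimal sketch under stated assumptions: `run_batch`, the checkpoint layout (one JSON object keyed by record ID), the `save_every` interval, and the choice to record failures as `"error"` results are all hypothetical. Only `ThreadPoolExecutor` and JSON checkpointing come from the description itself.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed


def load_checkpoint(path: str) -> dict:
    """Return previously saved results, or an empty dict on a first run."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def save_checkpoint(path: str, done: dict) -> None:
    """Write to a temp file, then swap it in, so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(done, fh)
    os.replace(tmp, path)


def run_batch(records, resolve, checkpoint_path, max_workers=8, save_every=25):
    """Resolve records in parallel, skipping any already present in the checkpoint."""
    done = load_checkpoint(checkpoint_path)
    # JSON object keys are strings, so IDs are normalised with str().
    pending = [r for r in records if str(r["id"]) not in done]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(resolve, r): str(r["id"]) for r in pending}
        for count, future in enumerate(as_completed(futures), start=1):
            record_id = futures[future]
            try:
                done[record_id] = future.result()
            except Exception as exc:
                # Assumption: failures are recorded so a retry script can find them.
                done[record_id] = {"openalex_id": None, "method": "error", "error": str(exc)}
            if count % save_every == 0:
                save_checkpoint(checkpoint_path, done)

    save_checkpoint(checkpoint_path, done)
    return done
```

Results are collected and saved only in the main thread, so the checkpoint dict needs no locking; worker threads do nothing but call `resolve`.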
