VAT-Number Extraction & Validation Cache
An SMB accounting/fintech platform
Overview
A data-processing notebook that derives clean, validated VAT numbers from the platform’s OCR-predicted document data and persists them to a local SQLite cache. It focuses on Irish (IE-prefixed) VAT registrations, normalising and deduplicating seller VAT numbers per business.
Why It Exists
OCR extraction produces noisy, inconsistent VAT numbers: stray spaces, partial strings, and non-Irish formats. To assign correct VAT codes and pre-fill returns, the platform needs a trustworthy mapping from business to validated VAT number, built and cached from historical document data.
What We Built
A Jupyter notebook (vat_numbers.ipynb) that loads a wide export (latest.csv) with pandas and filters it to records updated after a cut-off date. It keeps only rows with a non-null business and a plausible predicted_seller_vat_number (more than 7 characters, IE prefix), strips whitespace, and produces a clean, deduplicated business-to-VAT mapping. Results are stored in vat_cache.sqlite for fast lookup by downstream services.
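The filtering steps described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the updated_at column name, the cut-off date, the sample rows and the clean_vat_mapping helper are all assumptions, and the sketch strips whitespace before applying the length and prefix checks, which may differ from the order used in the notebook.

```python
import pandas as pd

# Hypothetical stand-in for latest.csv; `updated_at` is an assumed column name.
raw = pd.DataFrame(
    {
        "business": ["b1", "b1", "b2", None, "b3", "b4"],
        "predicted_seller_vat_number": [
            "IE 1234567 T",   # valid once whitespace is stripped
            "IE1234567T",     # duplicate of the row above after cleaning
            "GB123456789",    # non-Irish prefix
            "IE7654321A",     # no business attached
            "IE12",           # partial string
            "IE9876543W",     # updated before the cut-off
        ],
        "updated_at": [
            "2023-03-01", "2023-04-01", "2023-03-05",
            "2023-03-06", "2023-03-07", "2022-01-01",
        ],
    }
)

CUTOFF = pd.Timestamp("2023-01-01")  # assumed cut-off date


def clean_vat_mapping(df: pd.DataFrame, cutoff: pd.Timestamp) -> pd.DataFrame:
    """Return deduplicated (business, vat_number) pairs for plausible IE VAT numbers."""
    df = df.copy()
    # Coerce the timestamp column; unparseable values become NaT and are filtered out.
    df["updated_at"] = pd.to_datetime(df["updated_at"], errors="coerce")
    df = df[df["updated_at"] > cutoff]
    df = df.dropna(subset=["business", "predicted_seller_vat_number"])

    # Remove all whitespace, including spaces inside the number.
    vat = (
        df["predicted_seller_vat_number"]
        .astype(str)
        .str.replace(r"\s+", "", regex=True)
    )
    # Plausibility rule from the text: more than 7 characters and an IE prefix.
    plausible = (vat.str.len() > 7) & vat.str.startswith("IE")

    out = df.loc[plausible, ["business"]].assign(vat_number=vat[plausible])
    return out.drop_duplicates().reset_index(drop=True)


mapping = clean_vat_mapping(raw, CUTOFF)
print(mapping)
```

With the sample rows above, only the b1 business should survive, with its two spellings collapsed into a single cleaned value.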
Technologies & Approach
Python with pandas and NumPy for filtering, type coercion and string normalisation; SQLite as a lightweight persistent cache. The notebook format suited iterative tuning of the validation rules against real platform data.
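As an illustration of the SQLite cache, the sketch below writes cleaned pairs to a table and reads one back using Python's built-in sqlite3 module. The table name vat_cache, its two-column schema, the sample rows and both helper functions are assumptions made for this example; the primary key on business additionally assumes one VAT number per business, which the real cache may not enforce.

```python
import sqlite3

# Hypothetical cleaned (business, vat_number) pairs.
rows = [("b1", "IE1234567T"), ("b2", "IE7654321A")]


def write_cache(conn: sqlite3.Connection, pairs) -> None:
    """Create the (assumed) cache table if needed and upsert the pairs."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS vat_cache ("
        "business TEXT PRIMARY KEY, "
        "vat_number TEXT NOT NULL)"
    )
    # INSERT OR REPLACE overwrites an existing row for the same business.
    conn.executemany(
        "INSERT OR REPLACE INTO vat_cache (business, vat_number) VALUES (?, ?)",
        pairs,
    )
    conn.commit()


def lookup_vat(conn: sqlite3.Connection, business: str):
    """Return the cached VAT number for a business, or None if absent."""
    row = conn.execute(
        "SELECT vat_number FROM vat_cache WHERE business = ?", (business,)
    ).fetchone()
    return row[0] if row else None


# In-memory database for the example; the notebook persists to vat_cache.sqlite.
conn = sqlite3.connect(":memory:")
write_cache(conn, rows)

found = lookup_vat(conn, "b1")
missing = lookup_vat(conn, "unknown")
print(found, missing)
```

Parameterised queries keep the lookup safe if business identifiers ever contain quotes or other special characters.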
Outcome / Impact
Produced a reusable, validated VAT-number cache that improved the accuracy of VAT coding and reduced reliance on raw OCR output. Proved out the cleansing rules later applied in the production document and tax pipelines.
Capabilities Demonstrated
- Cleansing and validating noisy OCR-derived financial identifiers
- Country-specific tax-number normalisation (Irish VAT)
- Building lightweight persistent caches to serve downstream automation