Product · 2023

PDF Table-to-CSV Extraction Spike

Overview

A minimal Python spike that extracts tabular data from a PDF and writes it out as CSV using the tabula-py library. It is a focused proof of a single capability: turning PDF tables into structured rows.

Why It Exists

PDFs are a common but awkward source for tabular data. This spike validated tabula-py as a quick path for pulling tables out of PDF documents into a machine-readable CSV, ahead of building larger document-processing flows.

What We Built

A single do.py script that calls tabula.convert_into("test.pdf", "output.csv", output_format="csv"), with a sample input PDF and the CSV it produced checked in alongside it. Deliberately small: one file, one job.
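As a rough illustration, the script's one call could be wrapped as follows. This is a sketch, not the checked-in do.py: the function name pdf_tables_to_csv, the missing-file check, and the pages parameter are assumptions added here for clarity.

```python
from pathlib import Path


def pdf_tables_to_csv(pdf_path: str, csv_path: str, pages: str = "all") -> None:
    """Write the tables tabula-py finds in pdf_path to csv_path as CSV."""
    # Hypothetical guard (not in the original script): fail early with a
    # clear message instead of an error from the Java subprocess.
    if not Path(pdf_path).is_file():
        raise FileNotFoundError(f"Input PDF not found: {pdf_path}")

    # Imported here so the check above works even where tabula-py
    # (and the Java runtime it needs) is not installed.
    import tabula

    # pages="all" is an assumption; the original call passes no pages argument.
    tabula.convert_into(pdf_path, csv_path, output_format="csv", pages=pages)
```

Calling pdf_tables_to_csv("test.pdf", "output.csv") would mirror the file names used in the spike.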

Technologies & Approach

Python with tabula-py (a wrapper over the Tabula Java engine, so a Java runtime is required) for table detection and extraction. It was chosen as the fastest way to evaluate PDF table extraction without writing custom parsing logic.

Outcome / Impact

Showed, on the sample document, that tabular PDF content can be converted to CSV with minimal code, informing later, more substantial document-processing work in the studio’s Documents & PDF capability.
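One way to sanity-check a converted file is to confirm that every row has the same number of columns. The sketch below uses only the standard library; the helper name summarize_csv and the choice of checks are assumptions, not part of the original spike.

```python
import csv
from collections import Counter


def summarize_csv(csv_path: str) -> dict:
    """Return the row count and how many rows have each column count."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # Skip completely empty lines so they do not count as zero-width rows.
        rows = [row for row in csv.reader(handle) if row]

    widths = Counter(len(row) for row in rows)
    return {
        "rows": len(rows),
        "column_counts": dict(widths),
        # More than one distinct width suggests a table was split or merged.
        "consistent": len(widths) <= 1,
    }
```

Running summarize_csv("output.csv") after a conversion gives a quick signal of whether the extracted rows line up, without opening the file by hand.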

Capabilities Demonstrated

  • Extracting tables from PDF documents into structured CSV
  • Rapid library evaluation / technical de-risking