AI Pipeline

Building a PDF Data Extraction Pipeline with AI

7 iterations to reach 99% accuracy on scientific papers

The Problem

In volcanology, understanding how magma behaves underground requires experimental data: compositions of minerals and melts measured in the lab under controlled temperatures and pressures. Over the years, thousands of these experiments have been published in scientific papers. One researcher on the VAMOS project at the University of Geneva compiled about 3,000 of them manually into a single spreadsheet, reading each PDF, finding the right tables, and copy-pasting values one by one.

But there are thousands more papers out there. Doing this by hand is time-consuming, error-prone, and simply doesn't scale. A single paper can take 15 to 30 minutes to process. Multiply that by a few thousand, and you're looking at years of manual work.

So I set out to automate the entire pipeline: give it a PDF, get clean, structured data back. It sounds simple, but it turned out to be a fascinating engineering challenge.

The Challenge

The goal: feed it a PDF and get back a clean spreadsheet with all the oxide compositions, temperatures, and pressures. But PDF tables are a nightmare for computers. Every paper uses a different format, with different column orders, different names for the same oxides, and units that aren't always specified.

Some tables are rotated 90 degrees, turning columns into rows. Others span two or three pages, with headers only on the first one. And often the data you need is split across multiple tables: experimental conditions in Table 1, mineral compositions in Table 3. You have to match them yourself.

A human can handle this because they understand the context. A regular script cannot.

What is Claude?


Claude is an AI model made by Anthropic, similar to ChatGPT. You give it text, it processes it, and it gives you a response. You can give it very specific instructions: here's a table, extract the oxide values into this exact JSON format. And it does it. It can also write code.
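To make this concrete, here's a minimal sketch of the kind of exchange that idea rests on. The prompt text and the sample response below are illustrative, not the exact ones used in the pipeline; the point is that when you pin down the output format, the reply parses straight into structured data.

```python
import json

# Illustrative instruction: the table is inlined and the output schema is fixed.
PROMPT = """Extract the oxide values from this table.
Return ONLY a JSON object of the form:
{"sample": str, "SiO2": float, "Al2O3": float, "FeO": float}

Table:
Sample  SiO2   Al2O3  FeO
RUN-12  51.3   17.8   8.4
"""

# A response in the requested format (what the model is asked to return).
response_text = '{"sample": "RUN-12", "SiO2": 51.3, "Al2O3": 17.8, "FeO": 8.4}'

# Because the schema was specified up front, parsing is a one-liner.
record = json.loads(response_text)
print(record["SiO2"])  # 51.3
```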

It comes in three sizes. Haiku is the smallest: fast and cheap, about $0.001 per page. Sonnet is the middle ground, at roughly $0.01 per page. And Opus is the most powerful, at around $0.10 per page.

There are two ways to use it. The API lets you send text from a script and get data back, paying per use. Claude Code is an AI assistant that runs in your terminal, reads your files, writes code, and runs commands, for a flat monthly subscription ($20/month for Pro, $100/month for Max, or $200/month for Max x20).

Finding the Right Approach

This is where I spent the most time. Each failure taught me something, and the final solution only exists because of everything I learned along the way.

Attempt 1

Brute Force with Claude Code

I started with the simplest idea: just ask Claude Code to read the PDFs directly and extract everything. No pre-processing, no pipeline. The problem? It burned through my entire session budget in about 5 minutes, and it hadn't even finished processing a single paper. Sending full PDFs to the most powerful model with no filtering is extremely expensive. Not viable at all.

Attempt 2

Haiku Reads PDFs Directly

I tried the cheapest model, Haiku, on the raw PDF content. At $0.001 per page, the cost was right. But Haiku couldn't reliably parse complex table layouts. Too many errors, too many missed values. Cheap but unusable.

Attempt 3

Sonnet Vision + Haiku Review

I used Sonnet in vision mode, sending images of each page. It worked! The model could actually read the tables visually. But it cost $0.50 to $1.00 per PDF. For thousands of papers, that's thousands of dollars. Accurate but way too expensive at scale.
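For context, "sending images" means packaging each rendered page as a base64 image block in the API request. Here's a rough sketch of what that payload looks like; the fake PNG bytes and the prompt text are placeholders, not the pipeline's actual values.

```python
import base64

def image_block(png_bytes: bytes) -> dict:
    """Wrap raw PNG bytes in the base64 image block the Messages API expects."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": "image/png",
            "data": base64.b64encode(png_bytes).decode("ascii"),
        },
    }

# One user message per page: the rendered page image plus the instruction.
message = {
    "role": "user",
    "content": [
        image_block(b"\x89PNG...fake bytes for illustration"),
        {"type": "text", "text": "Extract the oxide table from this page as JSON."},
    ],
}
print(message["content"][0]["source"]["media_type"])  # image/png
```

Every page image counts against the token bill, which is exactly why this attempt was accurate but too expensive.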

Attempt 4

pymupdf Pre-processing + Haiku

The breakthrough idea was doing more work in Python first. I used pymupdf to extract text with positions, reconstructed the table layout, and filtered out irrelevant content. Then I sent only the relevant blocks to Haiku. Cost dropped to $0.08 per paper, but accuracy was only 77 to 94%. Better, but not reliable enough.
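The layout-reconstruction step can be sketched like this. It assumes word boxes in the format pymupdf's `page.get_text("words")` returns (x0, y0, x1, y1, text, ...), and groups them into rows by vertical position; the tolerance value is a made-up example, not a tuned constant from the pipeline.

```python
def rebuild_rows(words, y_tol=2.0):
    """Group positioned words into table rows by their vertical coordinate.

    `words` are (x0, y0, x1, y1, text) tuples, as returned (with extra
    fields) by pymupdf's page.get_text("words").
    """
    rows = []
    for w in sorted(words, key=lambda w: (w[1], w[0])):  # top-to-bottom, left-to-right
        if rows and abs(rows[-1][0] - w[1]) <= y_tol:    # same baseline as current row
            rows[-1][1].append(w[4])
        else:                                            # new row starts here
            rows.append([w[1], [w[4]]])
    return [" ".join(texts) for _, texts in rows]

words = [
    (10, 100, 40, 110, "SiO2"), (60, 100, 90, 110, "51.3"),
    (10, 120, 40, 130, "FeO"),  (60, 121, 90, 131, "8.4"),
]
print(rebuild_rows(words))  # ['SiO2 51.3', 'FeO 8.4']
```

Only the reconstructed rows that look like data tables get forwarded to Haiku, which is what brought the cost down.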

Attempt 5

Custom Python Parsers

I tried writing custom Python parsers for each table format. The idea was to bypass AI entirely and just parse the tables programmatically. But every paper is different: different layouts, different column names, different structures. Writing a new parser for each paper is simply not scalable.

Attempt 6

tabula + Haiku

I switched to tabula-py, a Python wrapper around tabula-java, which is much better at detecting table boundaries. Combined with Haiku for interpretation, it reached 81.5% accuracy at $0.09 per PDF. Getting closer, but still not good enough for scientific data where every value matters.
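Part of the "interpretation" work after table extraction is mapping each paper's column labels onto one canonical set of oxide names. A minimal pure-Python sketch, with a hand-picked synonym table (these entries are examples, not the full mapping):

```python
# Example synonym table: different papers label the same oxide differently.
CANONICAL = {
    "sio2": "SiO2", "tio2": "TiO2", "al2o3": "Al2O3",
    "feo": "FeO", "feo*": "FeO", "feot": "FeO", "feo(t)": "FeO",
    "mgo": "MgO", "cao": "CaO", "na2o": "Na2O", "k2o": "K2O",
}

def normalize_headers(headers):
    """Map raw column headers to canonical oxide names; keep unknowns as-is."""
    return [CANONICAL.get(h.strip().lower(), h.strip()) for h in headers]

print(normalize_headers(["SiO2 ", "FeOt", "Temp (C)"]))
# ['SiO2', 'FeO', 'Temp (C)']
```

A lookup table like this only covers the cases you've seen before, which is why the AI step is still needed for the long tail of unfamiliar labels.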

Attempt 7

tabula + Claude Code Agents

The winning combination. tabula handles table extraction from the PDF, then Claude Code agents (using Opus, the most powerful model) process and cross-reference the data. Three AI agents work in parallel on different papers simultaneously, maximizing throughput. Result: 99% accuracy at roughly $100 per month for over 1,000 PDFs.

Why This Approach Works

Three things make this solution work where the others failed.

First, Claude Code uses Opus, the most powerful model. While Haiku and Sonnet struggled with complex table layouts, Opus reads them like a human would. It understands context, cross-references between tables, and doesn't miss values.

Second, the pricing model changes everything. With the API, you pay per token, so you try to send as little as possible. With Claude Code, it's a flat monthly fee. You can process over 1,000 PDFs without worrying about costs.

Third, Claude Code only handles the hard part: reading tables and writing structured JSON. Everything else (running tabula, assembling the spreadsheet, validation, flagging) is handled by Python scripts. You use AI where it adds value, and code where it's more reliable.
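The validation side of that split can be sketched as a totals check: in a complete analysis, the major oxides should sum to roughly 100 wt%, so rows far off that get flagged for human review. The threshold window below is illustrative, not the pipeline's actual tolerance.

```python
def flag_row(oxides, low=97.0, high=101.5):
    """Flag an analysis whose oxide total falls outside a plausible range.

    `oxides` maps oxide name -> wt%. The 97-101.5 window is an example;
    the acceptable range depends on the phase being analyzed.
    """
    total = sum(oxides.values())
    return None if low <= total <= high else f"oxide total {total:.1f} wt% out of range"

good = {"SiO2": 51.3, "Al2O3": 17.8, "FeO": 8.4, "MgO": 7.1,
        "CaO": 10.9, "Na2O": 2.9, "K2O": 0.6, "TiO2": 0.9}
print(flag_row(good))                        # None -> passes
print(flag_row({"SiO2": 51.3, "FeO": 8.4}))  # flagged: total far below 100
```

Checks like this are deterministic, so there's no reason to spend model calls on them: code catches the arithmetic, the AI handles the messy reading.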

Quality Control

AI is a tool, not a replacement for expertise

Even with 99% accuracy on our test set, there must always be a human reviewing the output. The system flags uncertain values and missing data, but the final check is always done by a domain expert. No amount of automation replaces the eye of a scientist who knows what the numbers should look like.

Key Takeaways

  • Don't throw AI at a problem without pre-processing. The more you prepare the input, the better the output.
  • The cheapest model is not always the most cost-effective. A more powerful model that gets it right the first time can save you hours of debugging and re-processing.
  • Split the work: let Python handle what Python does best (file I/O, formatting, validation), and let AI handle what it does best (understanding messy, unstructured data).
  • Iterate fast, fail cheap. Each failed attempt took a few hours at most, but the lessons compounded into the final solution.
  • Always keep a human in the loop. AI is powerful, but domain expertise is irreplaceable.

Technologies Used

Python · tabula-py / tabula-java · Claude AI · Claude Code · pymupdf