Show HN: Validated Table Extractor–Verify PDF Tables Using Docling+Vision LLMs

Hey HN,

I built this because I got tired of "silent failures" in traditional PDF table extraction tools.

In my day job working with financial and legal documents, tools like Camelot or Tabula often return data that looks plausible but has shifted columns or missing decimal points. In regulated environments, you can't afford to guess.

I built a pipeline that treats extraction as a hypothesis to be verified:

1. *Extraction:* Uses IBM’s Docling to parse the layout and get the structure (Markdown).

2. *Visual Verification:* Captures a screenshot of the specific table region from the PDF.

3. *Validation:* Feeds both the Markdown and the Screenshot into a local Vision LLM (Llama 3.2 via Ollama).

4. *Scoring:* The LLM compares pixel truth vs. extracted text and outputs a confidence score + audit trail.

The trade-off is speed (it takes ~5s per table) vs. confidence. It's designed to run 100% locally for privacy-critical documents.

Repo is here: https://github.com/2dogsandanerd/validated-table-extractor

Would love to hear how you handle data integrity in RAG pipelines!

Summary

The article describes a tool called 'validated-table-extractor' that can automatically extract data from HTML tables and validate it against a schema, ensuring the integrity of the extracted data. The tool is designed to be used in data-driven projects to streamline the process of gathering and validating tabular data from web sources.

Story

Show HN: Validated Table Extractor–Verify PDF Tables Using Docling+Vision LLMs