Show HN: Validated Table Extractor – Verify PDF Tables Using Docling + Vision LLMs
2dogsanerd Monday, December 08, 2025
Hey HN,
I built this because I got tired of "silent failures" in traditional PDF table extraction tools.
In my day job working with financial and legal documents, tools like Camelot or Tabula often return data that looks plausible but has shifted columns or missing decimal points. In regulated environments, you can't afford to guess.
I built a pipeline that treats extraction as a hypothesis to be verified:
1. *Extraction:* Uses IBM’s Docling to parse the layout and get the structure (Markdown).
2. *Visual Verification:* Captures a screenshot of the specific table region from the PDF.
3. *Validation:* Feeds both the Markdown and the Screenshot into a local Vision LLM (Llama 3.2 via Ollama).
4. *Scoring:* The LLM compares pixel truth vs. extracted text and outputs a confidence score + audit trail.
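Steps 3 and 4 can be sketched roughly as follows. This is a minimal illustration, not the repo's actual code: the `Verifier` callable, the JSON reply shape (`confidence`, `issues`), and the 0.9 acceptance threshold are all assumptions made for the example. The verifier is passed in as a function so the logic can be run with a stub instead of a real vision model.

```python
import json
import math
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical verifier signature: (table_markdown, screenshot_png_bytes) -> raw model reply.
# In a real pipeline this would wrap the call to the local vision LLM.
Verifier = Callable[[str, bytes], str]


@dataclass
class ValidationResult:
    markdown: str
    confidence: float                      # 0.0 (mismatch) .. 1.0 (matches the image)
    issues: List[str] = field(default_factory=list)
    accepted: bool = False


def validate_table(markdown: str, screenshot: bytes, verify: Verifier,
                   threshold: float = 0.9) -> ValidationResult:
    """Treat the extracted Markdown as a hypothesis and check it against the image."""
    raw = verify(markdown, screenshot)
    try:
        reply = json.loads(raw)            # assumed reply shape: {"confidence": float, "issues": [...]}
        confidence = float(reply["confidence"])
        if not math.isfinite(confidence):
            raise ValueError("non-finite confidence")
        issues = [str(i) for i in reply.get("issues", [])]
    except (ValueError, KeyError, TypeError):
        # Unparseable reply: fail closed instead of silently accepting the table.
        return ValidationResult(markdown, 0.0, ["unparseable verifier reply"], False)
    confidence = max(0.0, min(1.0, confidence))   # clamp to [0, 1]
    return ValidationResult(markdown, confidence, issues, confidence >= threshold)


# Usage with a stub verifier standing in for the vision model.
def fake_verify(md: str, png: bytes) -> str:
    return '{"confidence": 0.42, "issues": ["column 3 shifted"]}'


result = validate_table("| a | b |", b"", fake_verify)
print(result.accepted, result.confidence, result.issues)
```

The point of failing closed is that a malformed or missing score is treated the same as a failed check, so a broken verifier can't become a new source of silent failures. The returned `ValidationResult` can be logged as-is to serve as the audit record for that table.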
The trade-off is speed (it takes ~5s per table) vs. confidence. It's designed to run 100% locally for privacy-critical documents.
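To put the ~5s figure in perspective, here is a back-of-the-envelope estimate for a batch. It assumes tables are processed one at a time with no parallelism; the function name and the batch size are just for illustration.

```python
def estimate_minutes(n_tables: int, seconds_per_table: float = 5.0) -> float:
    """Rough sequential wall-clock estimate, in minutes."""
    return n_tables * seconds_per_table / 60.0


# 1,000 tables at ~5 s each is 5,000 s, a bit under an hour and a half.
print(f"{estimate_minutes(1000):.1f} min")
```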
Repo is here: https://github.com/2dogsandanerd/validated-table-extractor
Would love to hear how you handle data integrity in RAG pipelines!