
Show HN: Smelt – Extract structured data from PDFs and HTML using an LLM

smeltcli Saturday, March 07, 2026

I built a CLI tool in Go that extracts structured data (JSON, CSV, Parquet) from messy PDFs and HTML pages.

The core idea: LLMs are great at understanding structure but wasteful for bulk data extraction. So smelt uses a two-pass architecture:

1. A fast Go capture layer parses the document and detects table-like regions
2. Those regions (not the whole document) get sent to Claude for schema inference: column names, types, nesting
3. The Go layer then does deterministic extraction using the inferred schema
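A toy version of the capture heuristic in step 1 might look like this (hypothetical; the real detector is presumably more involved): consecutive lines with the same nonzero field count are grouped into one table-like region, and everything else is skipped.

```go
package main

import (
	"fmt"
	"strings"
)

// detectTableRegions groups consecutive lines that share the same
// whitespace-separated field count (>1) into candidate table regions.
// A region needs at least two such lines to count.
func detectTableRegions(lines []string) [][]string {
	var regions [][]string
	var current []string
	prev := -1
	for _, line := range lines {
		n := len(strings.Fields(line))
		if n > 1 && n == prev {
			current = append(current, line)
		} else {
			if len(current) > 1 {
				regions = append(regions, current)
			}
			current = []string{line}
		}
		prev = n
	}
	if len(current) > 1 {
		regions = append(regions, current)
	}
	return regions
}

func main() {
	lines := []string{
		"ACME quarterly invoice report",
		"Widget 4.50",
		"Gadget 12.00",
		"Cable 3.25",
		"Thanks!",
	}
	// Only the three aligned data rows form a region; the header and
	// trailing prose are dropped.
	fmt.Println(detectTableRegions(lines))
}
```

Only these detected regions, not the full document text, would then be shipped to the LLM, which keeps token usage roughly proportional to the number of tables rather than to document length.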

This means the LLM is never in the hot path of actual data processing. It figures out "what is this data?" once, and then Go handles the "extract 10,000 rows" part efficiently.
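To make the split concrete, here is a minimal sketch (the `Schema` type, field names, and parsing rules are hypothetical, not smelt's actual code): the LLM pass produces a schema value once, and the hot path is a plain Go loop that applies it to every raw row.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Schema is a hypothetical representation of what the LLM pass
// returns: column names and types, inferred once per document.
type Schema struct {
	Columns []string
	Types   []string // "string" or "number"
}

// extractRows is the deterministic pass: it applies the inferred
// schema to every raw row without any further LLM calls.
func extractRows(schema Schema, raw [][]string) []map[string]any {
	out := make([]map[string]any, 0, len(raw))
	for _, row := range raw {
		rec := make(map[string]any, len(schema.Columns))
		for i, col := range schema.Columns {
			if i >= len(row) {
				continue
			}
			val := strings.TrimSpace(row[i])
			if schema.Types[i] == "number" {
				// Tolerate a leading currency symbol, as in "$4.50".
				if n, err := strconv.ParseFloat(strings.TrimPrefix(val, "$"), 64); err == nil {
					rec[col] = n
					continue
				}
			}
			rec[col] = val
		}
		out = append(out, rec)
	}
	return out
}

func main() {
	// Pretend the LLM already inferred this schema from one table region.
	schema := Schema{
		Columns: []string{"item", "price"},
		Types:   []string{"string", "number"},
	}
	rows := extractRows(schema, [][]string{
		{"Widget", "$4.50"},
		{"Gadget", "$12.00"},
	})
	fmt.Println(rows)
}
```

The schema-inference call costs the same whether the table has ten rows or ten thousand; only the cheap Go loop scales with row count.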

Usage is simple:

  smelt invoice.pdf --format json
  smelt https://example.com/pricing --format csv
  smelt report.pdf --schema   # just show the inferred structure
You can also pass --query "extract the revenue table" to focus extraction when a document has multiple tables.

Still early (no OCR yet, HTML is limited to <table> elements), but it handles the common cases well. Would love feedback on the architecture — especially from anyone who's dealt with PDF table extraction at scale.
