
Show HN: Full-text search engine for Epstein docs (OCR and OpenSearch)

ProbDashAI Tuesday, December 23, 2025

Hi HN,

Like many people, I was frustrated that the released Epstein/Maxwell court documents were mostly scanned-image PDFs with no text layer, which made them impossible to search with Ctrl+F or analyze programmatically.

I built a pipeline to fix this using Python, Tesseract, and OpenSearch.

The Site: https://epsteinfilez.com

The Stack:

Ingestion: Python workers using ocrmypdf (Tesseract) to perform parallel OCR on raw files.

Search: OpenSearch for indexing the extracted text.

Frontend: Next.js (SSR) for the UI.

Infrastructure: Self-hosted Docker Swarm.
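
For readers curious what the ingestion step might look like, here is a minimal sketch and not the site's actual code. It assumes the ocrmypdf command-line tool is installed and that the output directory already exists; the function names, the worker count, and the choice of the --skip-text and --sidecar flags are illustrative assumptions.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_ocr_command(src: Path, out_dir: Path) -> list[str]:
    """Build an ocrmypdf command line for one scanned PDF (hypothetical helper)."""
    searchable = out_dir / src.name            # PDF with an added text layer
    sidecar = out_dir / (src.stem + ".txt")    # plain-text copy of the OCR output
    return [
        "ocrmypdf",
        "--skip-text",                # do not re-OCR pages that already contain text
        "--sidecar", str(sidecar),    # also write the recognized text to a file
        str(src),
        str(searchable),
    ]


def ocr_all(pdfs, out_dir: Path, workers: int = 4, runner=subprocess.run) -> dict:
    """OCR several PDFs concurrently and return {file name: exit code}.

    out_dir is assumed to exist. Threads are enough here because each
    one only waits on an external ocrmypdf process.
    """
    def work(src: Path):
        result = runner(build_ocr_command(src, out_dir))
        return src.name, result.returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(work, pdfs))
```

A call such as ocr_all(sorted(Path("raw").glob("*.pdf")), Path("ocr")) would process a folder of scans; a non-zero exit code marks a file that needs a retry or manual inspection.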

Features:

Sub-second full-text search across all files.

Highlights search terms directly on the PDF page.

Deep linking to specific pages/documents.
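
One way these features could fit together is sketched below; it is a guess at the general shape, not the site's implementation. It assumes the OCR text of each document is available as a single string with form-feed characters between pages (the convention ocrmypdf uses for its sidecar text output), that each page is indexed as its own record with a text field, and that results link to a page through a hypothetical /doc/<id>?page=<n> URL scheme. The field names and helper names are made up for illustration.

```python
def pages_to_docs(doc_id: str, ocr_text: str) -> list[dict]:
    """Split OCR text on form feeds into one indexable record per page."""
    pages = ocr_text.split("\f")
    return [
        {"doc_id": doc_id, "page": number, "text": text.strip()}
        for number, text in enumerate(pages, start=1)
        if text.strip()  # skip blank pages but keep the original page numbers
    ]


def build_search_body(query: str, size: int = 10) -> dict:
    """Build an OpenSearch query DSL body: full-text match plus highlighting."""
    return {
        "size": size,
        "query": {"match": {"text": {"query": query}}},
        "highlight": {"fields": {"text": {}}},  # return matched snippets
        "_source": ["doc_id", "page"],          # enough to build a deep link
    }


def deep_link(base_url: str, doc_id: str, page: int) -> str:
    """Hypothetical URL scheme pointing at one page of one document."""
    return f"{base_url}/doc/{doc_id}?page={page}"
```

Indexing per page rather than per file keeps each hit tied to a page number, which is what makes the deep link possible. Highlighting on the rendered PDF page itself would additionally need word coordinates from the OCR step, which this sketch does not cover.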

This is a transparency tool, not a political one. I wanted to make the raw primary sources accessible to researchers and journalists.

Feedback on the search relevance or indexing pipeline is welcome!
