Show HN: Ragctl – document ingestion CLI for RAG (OCR, chunking, Qdrant)

ahsekka Wednesday, December 24, 2025

Hi HN, I'm sharing ragctl, an open-source CLI for the most failure-prone part of RAG pipelines: document ingestion (OCR, parsing/cleaning, and chunking).

Vector DB setup is fairly standardized now, but getting high-quality, consistent text + metadata into it still takes a lot of brittle glue code. ragctl aims to make that “pre-vector” step repeatable: turn messy documents into retrieval-ready chunks in a few commands.

Features
• Multi-format input: PDF, DOCX, HTML, images
• OCR for scanned/image-based docs
• Semantic chunking (LangChain)
• Batch runs with retries + error handling
• Output: direct ingestion into Qdrant (for now)
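
To make the "batch runs with retries + error handling" item concrete, here is a minimal Python sketch of the general pattern. It is not ragctl's actual code or API: `run_batch`, `process`, `max_attempts`, and `base_delay` are hypothetical names chosen for illustration, and `process` stands in for whatever parse/OCR/chunk step is applied to each document.

```python
import time
from typing import Any, Callable, Dict, Iterable, Tuple


def run_batch(
    paths: Iterable[str],
    process: Callable[[str], Any],  # hypothetical per-document step (parse, OCR, chunk)
    max_attempts: int = 3,          # hypothetical default, not taken from ragctl
    base_delay: float = 0.5,        # seconds; doubled after each failed attempt
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Process every path, retrying failures, and never abort the whole batch."""
    succeeded: Dict[str, Any] = {}
    failed: Dict[str, str] = {}
    for path in paths:
        for attempt in range(1, max_attempts + 1):
            try:
                succeeded[path] = process(path)
                break  # this document is done, move on to the next one
            except Exception as exc:
                if attempt == max_attempts:
                    # Out of attempts: record the error and continue the batch.
                    failed[path] = repr(exc)
                else:
                    # Exponential backoff before the next attempt.
                    time.sleep(base_delay * 2 ** (attempt - 1))
    return succeeded, failed
```

Returning the failures alongside the successes, rather than raising on the first error, means one unreadable PDF does not discard the work already done on the rest of the batch.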

Looking for feedback
• DX: is the CLI intuitive?
• Performance / edge cases: weird PDFs, mixed layouts, tables
• Roadmap: which connectors (S3, Slack, Notion) or vector stores should be next?

Repo: https://github.com/datallmhub/ragstudio

Happy to answer questions about the architecture and chunking approach.
