Show HN: Aracno – a distributed web crawler in Go
xunterr — Sunday, March 16, 2025

Aracno is a polite, distributed web crawler. The goal was to make it simple and user-friendly. There are far more powerful crawlers out there, but they can be excessively complex for simple tasks, especially in distributed mode.
It uses a slightly modified version of the frontier algorithm from Heritrix3, the Internet Archive's crawler. The algorithm is quite elegant and suits a crawler's needs well, although unlike Heritrix, Aracno is not an incremental crawler.
Aracno is fully distributed, based on the Chord DHT protocol, which means zero additional infrastructure is needed. You can join as many nodes as you want and they will just work out of the box. Failure tolerance is also built into the Chord protocol, so nodes can leave at any time. The system uses key partitioning (where the key is the hostname of a URL) to distribute the crawling workload. The queue-based design of the Heritrix frontier algorithm made it easy to repartition queues between nodes.
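In Chord, each node and each key hashes to a position on a ring, and a key belongs to its successor: the first node clockwise from it. With the hostname as the key, every URL for a given host routes to the same node. A rough sketch of that lookup, using a static ring for illustration (node names and the 64-bit ID truncation are assumptions; Chord proper uses the full 160-bit SHA-1 space and a finger table for O(log n) routing):

```go
package main

import (
	"crypto/sha1"
	"encoding/binary"
	"fmt"
	"sort"
)

// chordID reduces a SHA-1 digest to a 64-bit ring position.
func chordID(s string) uint64 {
	sum := sha1.Sum([]byte(s))
	return binary.BigEndian.Uint64(sum[:8])
}

// Ring is a static stand-in for a Chord ring: node IDs sorted clockwise.
type Ring struct {
	ids   []uint64
	names map[uint64]string
}

func NewRing(nodes []string) *Ring {
	r := &Ring{names: map[uint64]string{}}
	for _, n := range nodes {
		id := chordID(n)
		r.ids = append(r.ids, id)
		r.names[id] = n
	}
	sort.Slice(r.ids, func(i, j int) bool { return r.ids[i] < r.ids[j] })
	return r
}

// Successor returns the node responsible for a key: the first node ID
// at or clockwise after the key's ring position, wrapping around.
func (r *Ring) Successor(key string) string {
	k := chordID(key)
	i := sort.Search(len(r.ids), func(i int) bool { return r.ids[i] >= k })
	if i == len(r.ids) {
		i = 0 // wrap past the highest ID back to the first node
	}
	return r.names[r.ids[i]]
}

func main() {
	ring := NewRing([]string{"node-a:6000", "node-b:6000", "node-c:6000"})
	for _, host := range []string{"example.com", "golang.org", "archive.org"} {
		fmt.Printf("%s -> %s\n", host, ring.Successor(host))
	}
}
```

When a node joins or leaves, only the keys between it and its neighbor move, which is why per-host queues can be shipped around without a global reshuffle.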
Persistence is handled via RocksDB, so you can stop the crawler at any point and resume where it left off.
Aracno saves crawled pages and relevant metadata as archived WARC files, which are simply written to disk. There is intentionally no program-specific API involved, so it can be easily plugged into any system, although an endpoint for retrieving these files is planned.