Story

Show HN: I built a 50 site sampler from CommonCrawl refreshing every 30 minutes

whothatcodeguy Tuesday, February 03, 2026

I tossed this together this afternoon mostly just to validate a premise: the internet has become so heavily consolidated into a few key discovery surfaces for the common user, and I miss when you could really just get lost in it. Is there a way we can unearth pieces of it we would never actually see under normal circumstances? Wouldn't it be so cool if you could just explore the internet like you're walking through random doors in a long, eternal 6TB hallway?

So, I made RandomCrawl. It's a super minimal website that does nothing more than run a Node script every 30 minutes. The script picks a random path down the file structure of the Common Crawl dataset, does some minor filtering for secure .com websites for good measure, and takes a random sample of 50 websites from the chunk.
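
A minimal sketch of the filter-and-sample step, in TypeScript. This is not the actual RandomCrawl script. The helper names (`isSecureDotCom`, `pickRandom`, `sample`, `sampleSites`) are hypothetical, and it assumes the chunk has already been read into an array of URL strings and that "secure .com" means an https URL whose hostname ends in .com.

```typescript
// Hypothetical helpers for illustration; not taken from the RandomCrawl source.

type Rng = () => number; // returns a float in [0, 1), like Math.random

// Assumption: "secure .com" means an https URL whose hostname ends in ".com".
function isSecureDotCom(rawUrl: string): boolean {
  try {
    const url = new URL(rawUrl);
    return url.protocol === "https:" && url.hostname.endsWith(".com");
  } catch {
    return false; // skip anything that does not parse as a URL
  }
}

// Pick one entry uniformly at random, e.g. one file path from a listing.
function pickRandom<T>(items: readonly T[], rng: Rng = Math.random): T {
  if (items.length === 0) throw new Error("cannot pick from an empty list");
  return items[Math.floor(rng() * items.length)];
}

// Partial Fisher-Yates shuffle: draw up to k distinct items uniformly at random.
function sample<T>(items: readonly T[], k: number, rng: Rng = Math.random): T[] {
  const pool = [...items];
  const n = Math.min(k, pool.length);
  for (let i = 0; i < n; i++) {
    const j = i + Math.floor(rng() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, n);
}

// Filter a chunk of URLs, then sample `count` of the survivors.
function sampleSites(
  chunkUrls: readonly string[],
  count = 50,
  rng: Rng = Math.random,
): string[] {
  return sample(chunkUrls.filter(isSecureDotCom), count, rng);
}
```

Under these assumptions, one run would call `pickRandom` on a listing of file paths to choose a chunk, read that chunk's URLs, and pass them to `sampleSites`. Passing the random source as a parameter keeps the sampling easy to test with a fixed generator.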

There has been a ton of noise, but it has been surprisingly fun. I feel like an internet archaeologist. For every 5 random SaaS websites, you get some random tourism site for a town you've never heard of, or an ancient Blogspot from the early 2000s.

Here are a couple of great finds so far: https://ahapoetry.com/ https://alexunu.blogspot.com/2007/ https://www.brtpeinture.com/

I'm not sure I'll do much more with the website since it was an experiment, but you can bet I'll be digging around this dataset some more. It reminded me there is still a lot of expression out there on the internet, and it's amazing some of these sites are even still live. It's way more fun to explore than to mindlessly scroll one of our five favorite websites.

Disclaimer: I'm not filtering out NSFW content, so keep that in mind.

Summary
RandCrawl is a minimal website that runs a Node script every 30 minutes to pick a random path through the Common Crawl dataset, lightly filter for secure .com sites, and show a random sample of 50 websites. The author built it as an experiment in rediscovering parts of the web that most people would never otherwise see.
randcrawl.com