@gwern
12d
As far as is known, IA is poorly represented in existing LLM datasets. They don't allow indexing or scraping, so they don't show up in Common Crawl, which is the usual starting point. (The occasional link to IA might show up, and someone processing CC might choose to follow it, but that's relatively unusual, aside from image links: most people focus on the text inside CC itself.) And their servers are quite slow & overloaded, so if you target them manually, your scrapers will be rate-limited, banned, or just incredibly flaky. They contain a lot of highly redundant snapshots, so they're a hassle to post-process. And much of what they contain is implied or covered by easier-to-get datasets. I also haven't seen any hints in either randomly-generated samples or prompted samples from GPT-3, ChatGPT, GPT-4, or other LLMs of signatures of IA snapshots like their book OCR or their HTML headers. So... yeah, it's possible, and I wouldn't be surprised if data-hungry LLMs like GPT-4 have started or will start tapping into IA, but right now there's no real reason to think that.
@ronsor
13d
I wonder what an LLM trained on the entire Wayback Machine would be like