
Why the opaque army of little robots that roam the internet and feed AI inherits the blind spots and biases of the web

AI News June 17, 2026 12:30 AM

Why AI systems, which rely on the internet, poorly reflect the diversity of human knowledge

Investigation: As the internet supplies the vast amounts of data AI systems need for training, ChatGPT and the like inherit the web's biases and thus cannot claim to be comprehensive.

They are tiny creatures that we unknowingly share our lives with. Discreet insects drawn to the traces of human activity, swarming in the shadows of our everyday existence. But unlike the spiders that inhabit our gardens and homes, these crawlers – sometimes called "web spiders" – are not made of chitin; their web is built from code, fiber optics and network protocols. These industrious little robots spread across the internet as the web's surveyors, tasked with moving from link to link across the vast digital landscape.

In the great mechanical family of web spiders, not all species have the same job. One of the oldest appeared with the first major search engines and directories: these crawlers, such as Googlebot (Google's crawler), Bingbot (Bing's crawler) and Slurp (Yahoo!'s first crawler), are sent into the wild to catalog and index existing web pages, making them easier for internet users to access.

In recent years, a new generation of crawlers has swept across the internet. Driven by the rise of large language models (LLMs), the programs that power artificial intelligence agents, they do far more than simply index the web. New bots such as GPTBot, ClaudeBot, Meta-ExternalAgent and Bytespider scrape content on a massive scale.

The goal is to sweep through the web, an inexhaustible reservoir of knowledge, to build gigantic corpora of textual data. These corpora are then used to feed and train the LLMs developed by OpenAI, Anthropic and Meta, enabling their respective agents – ChatGPT, Claude and Llama – to generate increasingly plausible responses to user prompts.
