Why the opaque army of little robots that roam the internet and feed AI inherits the blind spots and biases of the web

Why AI systems, which rely on the internet, poorly reflect the diversity of human knowledge

Investigation: As the internet supplies the vast amounts of data that AI systems need for training, ChatGPT and the like inherit the web's biases and thus cannot claim to be comprehensive.

They are tiny creatures that we unknowingly share our lives with. Discreet insects drawn to the traces of human activity, swarming in the shadows of our everyday existence. But unlike the spiders that inhabit our gardens and homes, these crawlers – sometimes called "web spiders" – are not made of chitin; their web is built from code, fiber optics and network protocols. These industrious little robots spread across the internet as the web's surveyors, tasked with moving from link to link across the vast digital landscape.

In the great mechanical family of web spiders, not all species have the same job. One of the oldest appeared with the first major search engines and directories: these crawlers, such as Googlebot (Google's crawler), Bingbot (Bing's crawler) and Slurp (Yahoo!'s first crawler), are sent into the wild to catalog and index existing web pages, making them easier for internet users to access.

In recent years, a new generation of crawlers has swept across the internet. Driven by the rise of large language models (LLMs), the programs behind artificial intelligence agents, they do far more than simply index the web. New bots such as GPTBot, ClaudeBot, Meta-ExternalAgent and Bytespider scrape content on a massive scale.

The goal is to sweep through the web, an inexhaustible reservoir of knowledge, to build gigantic corpora of textual data. These corpora are then used to feed and train the LLMs developed by OpenAI, Anthropic and Meta, enabling their respective agents – ChatGPT, Claude and Llama – to generate increasingly plausible responses to user prompts.

