
When AI Models Forget Languages, They Forget Markets

There is a strange asymmetry in how languages disappear. They do not die because people stop speaking them. They die because systems stop recognising them. The shift from print to digital accelerated this. The shift from digital to AI will finish the job unless we change how models are built.

Today, the world’s largest AI models treat most of the Global South as a statistical rounding error. This is not malice. It is simple math. Public datasets over-index on English to a degree that would make any policymaker wince. Common Crawl spans hundreds of billions of pages collected over 18 years. English accounts for almost half. Hindi, spoken by roughly seven percent of the world’s population, accounts for 0.2 percent. Tamil, with more than 80 million speakers, sits at 0.04 percent. This is not a dataset anomaly. It is a structural consequence of how the internet itself was built and indexed.
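To put that mismatch in a single number, here is a minimal Python sketch that divides each language’s share of the crawl by its share of global speakers. The crawl shares are the figures cited above; the speaker shares, including the exact value behind “almost half” for English, are rough illustrative assumptions rather than census data.

# Back-of-the-envelope representation ratios: share of training data
# divided by share of global speakers. A ratio above 1 means a language
# is over-represented in the crawl; below 1, under-represented.
# Speaker shares are illustrative assumptions, not census figures.

languages = {
    #           (share of world speakers, share of Common Crawl)
    "English": (0.17,  0.46),    # assumed ~17% of people; "almost half" of the crawl
    "Hindi":   (0.07,  0.002),   # ~7% of the world; 0.2% of the crawl
    "Tamil":   (0.011, 0.0004),  # 80M+ speakers (~1.1%); 0.04% of the crawl
}

for name, (speakers, crawl) in languages.items():
    ratio = crawl / speakers
    print(f"{name:8s} data share / speaker share = {ratio:.2f}x")

On these assumed figures, English carries roughly 2.7 times more data than its speaker base would predict, while Hindi and Tamil carry 25 to 35 times less.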

Models trained on these distributions learn a predictable lesson. They assume the world speaks, thinks and behaves in English. When pushed into languages from the Global South, they approximate, guess and generalise. You see this in translation errors. You see it in culturally tone-deaf recommendations. You see it when a model is asked to describe a festival or family dynamic it has barely seen. The output is coherent but hollow, like a high-resolution photo with the colour washed out.

Language loss used to be a sociological concern. In the age of GenAI, it becomes an infrastructure concern. If a model has no meaningful exposure to a language, the culture attached to that language effectively becomes invisible to the algorithm. It will not show up in predictions. It will not shape behaviour. It will not influence design. The most powerful systems in history simply route around it.

This matters for two reasons. The first is obvious: accuracy. A model that cannot understand how people actually speak or buy or joke will build fragile products for half the planet. Think of e-commerce platforms generating synthetic product videos that get the skin tone wrong, the gesture wrong, the cooking setup wrong. Think of recommendation engines that misread intent because they have never seen how humour or sarcasm works in Marathi or Hausa or Javanese. The gap is not a matter of optimisation. It is a matter of missing data.

The second reason is less visible but more important. Models shape markets. They determine what content gets discovered, what gets recommended, what gets automated. If languages from the Global South remain poorly represented in training sets, the future digital economy will encode those absences. In practical terms, it means an Indian or Indonesian creative industry feeding billions of views into global platforms will still be treated as niche. It means cultural nuance gets normalised out. It means economic opportunity flows to those whose datasets already dominate.

The irony is that the Global South has something the AI labs desperately need. It has scale, diversity and depth of context. It has television archives, film libraries, regional newsrooms, theatre recordings, oral histories, advertising material and mobile-first vernacular video that no Western dataset can approximate. But this material is scattered. It sits in private vaults, ageing servers, bureaucratic ministries and unindexed drives. The opportunity is to turn this cultural mass into structured datasets. Not as a nostalgic preservation project but as core AI infrastructure.

This will require three shifts. Governments must treat linguistic archives as strategic assets, not museum pieces. Content owners must recognise that they hold the raw material for the next generation of AI and price it accordingly. And AI companies must acknowledge that generic scraped data will not carry them through the next phase of model differentiation. If everyone is training on the same English-heavy mix, everyone will converge to the same median output.

The Global South does not need special treatment. It needs representation. The cost of ignoring this will not show up tomorrow. It will show up when future models can simulate entire worlds but struggle to correctly represent a Tamil street, a Yoruba folktale or a Tagalog joke. When that happens, the loss will not be linguistic. It will be cultural and economic.

The choice now is simple. Either the next wave of GenAI is built on datasets that reflect where the world actually lives. Or the world learns to live inside the limits of someone else’s dataset.

One option scales. The other shrinks.
