The World Is Being Modeled. Most of It Is Missing.

We are building a simulation of the world.

Not the cinematic version from the Matrix films. Something slower, and far more consequential. The models being trained today will decide how machines see, interpret, and act. They will sit between humans and systems. Commerce, healthcare, education, and governance will increasingly pass through them.

And yet, most of the world is either thinly represented or not represented at all.

This is not philosophical. It is a data problem.

The Nuance That Data Misses

Consider a street negotiation in Jakarta. Price is only part of the exchange. There is pacing, hesitation, humor, hierarchy. A pause signals interest. A smile softens disagreement. None of this translates cleanly into text. Even less into datasets dominated by Western, transactional interactions.

Or take a household in Chennai. Instructions are often indirect. Authority is implied. A request can sound like a suggestion. A refusal can sound like agreement. This is not nuance at the margins. This is how hundreds of millions of people communicate.

If a model has not seen this, it does not understand the world. It approximates it.

The Consequences Are Already Here

Assistants that sound fluent but behave strangely in non-Western contexts. Commerce systems that misread intent. Customer service bots that escalate when they should de-escalate. Recommendation systems that collapse cultural context into generic categories.

The failure is not dramatic. The model does not break. It is just slightly wrong, repeatedly. Over time, those small errors compound into exclusion.

The Deeper Cost

Culture is not just language. It is behavior, context, and accumulated social logic. When models are trained on a narrow slice of that, they do not just miss information. They reshape it.

We risk building systems that standardize human experience into a single dominant template. A monoculture by default.

And it compounds. Models trained on limited data generate outputs that reflect those limits. Those outputs become new training data. The loop tightens. Diversity gives way to convenience.

The Global South Is Not an Edge Case

This is where the Global South matters. Not as a category. As the majority of lived human experience.

India operates across dozens of major languages and social codes. Indonesia spans thousands of islands with distinct norms. The Middle East carries layered formality and tradition. Africa's urban and informal economies run on their own systems of trust and exchange.

These are not edge cases. They are primary systems.

If world models are to work, they must be grounded here.

The Supply Chain Problem

But this will not happen on its own.

The current data supply chain optimizes for scale. It scrapes what is abundant. It ignores what is hard to capture. It prefers unlicensed data because it is faster and cheaper. It trains on what exists, not what is missing.

Left to itself, the market will continue to underrepresent large parts of the world.

Where Governments Matter

Not as regulators slowing things down, but as actors shaping direction.

We have seen this before. Railways, power grids, telecom networks. Markets did not build them alone because the benefits were systemic and long-term. AI data infrastructure is no different.

If countries want their economies and cultures to be accurately represented in the systems that will increasingly govern them, they need to invest in building high-quality, licensed, locally grounded datasets.

This means funding content creation and digitization. Supporting companies that can structure and enrich this data into something usable by AI systems. Setting clear standards for provenance so data can be trusted and creators are compensated.

A Question of Sovereignty

A model that does not understand your context cannot serve your interests. At best, it is inefficient. At worst, it is extractive.

There is a commercial reality here as well. As base models improve, generic data becomes less valuable. What becomes scarce is context. Cultural nuance. Real-world interaction. Data that cannot be easily synthesized.

Whoever supplies this sits upstream of the AI value chain.

But building it is not trivial. It requires access, trust, and time. It requires working with communities and creators. It requires legal clarity. It requires patient capital.

This is infrastructure, not a feature.

The Alternative

The alternative is straightforward. AI systems that work well in a few contexts and poorly in many. Large populations misread or ignored. Cultural nuance flattened into a global average.

We will still call them world models. They just will not model the world.

A Narrow Window

We have a narrow window to correct this.

The choices being made now about what data is collected, how it is structured, and who controls it will shape these systems for decades.

This is not just a technical decision. It is a cultural one.

If the world is plural, our models must be too.

Otherwise, we are not building intelligence. We are building approximations at scale.
