21 May 2026 · Search Engine Land

What multilingual regions reveal about the future of AI search

AI search doesn’t just translate or localize results. It decides which sources, institutions, and versions of reality get surfaced in the first place.

Catalonia offers a useful stress test for that system. Two languages share the same geography, which makes retrieval patterns easier to spot. 

When the same queries are run in Catalan and Spanish across Google AI Overviews and ChatGPT, the differences go far beyond wording — and reveal broader problems that extend well beyond multilingual regions.

Catalonia as a stress test for AI search

Did you know that if you enter Tradicions de Sant Jordi (Saint George’s Traditions), written in Catalan, Google Translate will identify the source language as Occitan?

Probably not. Most Catalan speakers don’t know it either, partly because Translate’s language guess isn’t exactly wrong: Catalan and Occitan share a common Romance ancestry, and some classification systems group them together. 

The answer is technically defensible. It’s also, statistically, an odd call — and the kind of small anecdote that points at a much larger problem in the infrastructure underneath.

Google Translate showing "Detectado: Occitano" with input "Tradicions de Sant Jordi" and output "Tradiciones de San Jorge"

Occitan has roughly 200,000 speakers, mostly in southern France. Catalan has roughly 9 million speakers and is the co-official language of Catalonia, one of Europe’s wealthier regions and home to a city Google has operated in for over 20 years. 

Asked from a Barcelona IP, Google’s translation product decides that the more plausible source language is the one with more than an order of magnitude fewer speakers, in another country. Translate then renders Sant Jordi into Spanish as San Jorge — castilianizing the proper name of the patron saint of Catalonia, a name that doesn’t need translating in the first place.

This single Translate quirk is anecdotal. What it points at isn’t. It’s a language-identification problem that has lived inside Google’s infrastructure for years, and Google itself has publicly acknowledged it.

In January 2023, the company’s Search Liaison account responded to a wave of complaints from Catalan-speaking users about Catalan results being downgraded in favor of Spanish ones. Google called the issue “a priority” and committed to keep investigating. The acknowledgment was even posted in Catalan — a tacit admission that the affected audience was real and large enough to warrant a direct response.

Google later pushed updates that year that measurably improved Catalan visibility in classical SERPs. But the underlying language-identification layer was never structurally repaired. When a Catalan speaker today watches Google’s AI Overview answer a Catalan-language query in Spanish, it isn’t a new bug. It’s an old bug now sitting underneath a synthesis layer that propagates it.

AI search, when it arrives, inherits the assumption that the language of the query is unreliable in the first place. The retrieval pipeline that flattens Catalan into Spanish today is the same pipeline that will, in modified forms, flatten sub-national jurisdictional context in markets where the surface language never changes.

I have spent the last several months documenting how AI search collapses Hispanic markets, treating 20+ Spanish-speaking countries as a single statistical default. That collapse is severe in its consequences, but at least the geography is clean: Spain is one country, Mexico is another, and the model just fails to tell them apart. 

What happens inside Catalonia is more revealing because the geography doesn’t change. Two languages share one territory, and the system produces two parallel realities — when it can identify the languages at all.

Multilingual regions are where the architectural defaults of retrieval become visible, because users in those regions can switch languages and watch the system reassign meaning, authority, and sometimes even the answer’s language.

The same defaults will surface inside markets that look monolingual on the surface, in different forms and with different mitigations. Catalonia is a leading indicator.

How I tested this

The patterns I’m about to describe are familiar to any practitioner who has worked on Catalan-language SEO over the last decade. They match my own experience and that of many colleagues working under similar conditions. 

Anyone who has tried to do keyword research in Catalan has watched Google Keyword Planner report essentially zero volume for terms Catalan speakers query daily, or return volumes that are clearly mixed with Spanish-language data and impossible to use cleanly. 

Anyone who has run multilingual sites has watched their Catalan variants underperform their Spanish ones for reasons standard tooling can’t explain. The small experiment I describe below is one specific, reproducible illustration of this broader, well-known systemic situation — not the foundation of the claim.

The setup was deliberately simple. From a residential IP in the Barcelona metropolitan area, I ran a set of paired queries in Catalan and Spanish across two surfaces: 

  • ChatGPT (logged out, fresh session, no personalization).
  • Google web search with its AI Overview enabled when the system chose to generate one. (Google doesn’t generate an Overview for every query — itself a signal worth noting.) 

Sessions ran in incognito mode. I ran the queries twice, roughly a week apart, to test whether what I was seeing was a stable pattern or a single-session artifact. Both dates are documented. Screenshots are available with location footers visible.

I chose five intent pairs, each designed to test a different layer of the retrieval stack:

  • A politically loaded factual query about Catalan independence, chosen because it has academic precedent in Walker and Timoneda’s 2025 study (Department of Political Science, Purdue University) of language-conditioned LLM output, published in Cambridge University Press’s Political Science Research and Methods. Replicating their method from a Barcelona IP anchors this section in published research.
  • A transactional commercial query about local accountants for freelancers, chosen because it sits squarely inside the everyday SEO economy and is identical in intent across languages.
  • A cultural-tradition query about Sant Jordi, chosen because it has clear native authority (regional government, municipal authorities), low political temperature, and centuries of documented history independent of any particular brand.
  • A regulatory query about Catalan rental subsidies, chosen because it requires hyper-local jurisdictional precision and is administered by the Generalitat de Catalunya directly.
  • A language-identification stress test — a mix of casual and formal Catalan queries — to see whether the surface even recognized the input as Catalan.

The findings below are reproducible existence proofs rather than statistical evidence. These specific failures occur on these specific platforms today — from this specific location — and any practitioner can replicate them in under 15 minutes. 

The broader claim — that these patterns generalize — rests on the community evidence the Google Search Liaison acknowledgment implicitly confirmed three years ago, and the lived experience of practitioners working in Catalan and other minority languages over the last decade.

Four patterns emerged. The first three describe retrieval. The fourth describes identification, and it underpins the other three.

Finding 1: Vocabulary and source plurality diverge

I asked both ChatGPT and Google’s AI Overview about the main arguments around Catalan independence. 

In Spanish, both surfaces produced a legalistic frame anchored in the 1978 Constitution and the 2017 referendum’s illegality. In Catalan, both surfaces foregrounded dret a decidir (right to decide) and autodeterminació as named conceptual blocks, with historical references to the loss of institutions after the Decrees of Nova Planta. 

The Catalan output wasn’t more ideological. It was more complete. It retained anti-independence arguments, including framings absent from the Spanish version.

Side-by-side comparison of Google AI Overview for the independence query in Spanish and Catalan, citation panels visible

The divergence sharpens in the citations. The Spanish AI Overview pulled from BBC, Wikipedia (ES), Fundación Espacio Público, and France 24. The Catalan AI Overview added El Punt Avui, VilaWeb, Reddit r/catalunya, and Wikipedia (CA), while still citing BBC and El País.

Same engine, same geography, same question. Two retrieval pools that barely overlap, triggered by the language string. The language isn’t labeling the answer. It’s filtering the corpus.
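
To put a number on that divergence, the two citation lists can be compared as sets. This is a minimal sketch using only the sources named in the text above; the actual citation panels may contain more entries, and the Jaccard index is an illustrative choice of metric, not part of the original test.

```python
# Citation lists as named in the text; the real panels may hold more sources.
# El País is listed only for the Catalan answer because the Spanish list
# given above does not name it.
spanish_citations = {"BBC", "Wikipedia (ES)", "Fundación Espacio Público", "France 24"}
catalan_citations = {
    "El Punt Avui", "VilaWeb", "Reddit r/catalunya",
    "Wikipedia (CA)", "BBC", "El País",
}


def jaccard(a: set[str], b: set[str]) -> float:
    """Shared sources divided by all distinct sources across both sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


shared = spanish_citations & catalan_citations
only_spanish = spanish_citations - catalan_citations
only_catalan = catalan_citations - spanish_citations

print("Shared:", sorted(shared))
print("Spanish only:", sorted(only_spanish))
print("Catalan only:", sorted(only_catalan))
print("Jaccard overlap:", round(jaccard(spanish_citations, catalan_citations), 2))
```

With these lists, a single source is shared out of nine distinct ones. Repeating the comparison per query pair and per session would show whether the split is stable or drifts between runs.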

Finding 2: Commercial retrieval shifts, and the engine doubts the minority language

The transactional pair was simple: Millors gestories per a autònoms a Barcelona / Mejores gestorías para autónomos en Barcelona. Best accountants for freelancers in Barcelona, in two languages, from the same city.

ChatGPT recommended largely the same physical firms in both versions, but the online providers diverged: the Catalan response surfaced Openges and Gestasor; the Spanish response surfaced Gestoría Online and Gestorum. Same intent, same geography, two parallel commercial universes for the digital-first segment.

Google’s organic SERP showed a more pronounced split. The Catalan version elevated locally bilingual sites (Gremicat, Calders Assessors, Gestumm, barcelona.cool). The Spanish version led with aggregators and generalist directories (Legify, Zaask, bcngest).

Two secondary signals matter more than the rankings.

First, Google autocorrected the Catalan query. Above the results, the engine offered: Quizás quisiste decir: Millors gelateries per a autònoms a Barcelona. Did you mean ice cream shops? The system, sitting on a Barcelona IP, declined to believe a commercial query in Catalan was genuine and proposed a similarly spelled alternative.

Google SERP showing the autocorrection "Quizás quisiste decir: Millors gelateries per a autònoms a Barcelona"

Second, the Spanish results carried paid ads — Talenom, Declarando, Horus Firm. The Catalan results carried zero. The SEM market treats Catalan as territory without bidders, and the absence of a commercial signal is itself a signal. Models trained on click and engagement data read that absence as evidence that the language isn’t commercially serious and weight retrieval accordingly.

The mechanism teaches itself. Less commercial bidding produces less commercial visibility. Less commercial visibility produces less commercial signal. 

The language is steadily deprioritized for transactional intent — even though every user typing in Catalan from Barcelona shares the same geography as a user typing in Spanish. This will become relevant again when we look at language identification itself.

Finding 3: Cultural authority gets reassigned

The Sant Jordi pair shows it most clearly, and the specific reassignment changes between sessions in a way that is itself revealing.

In the first session, the Spanish-language AI Overview for Tradiciones de Sant Jordi led with two hotel chains as primary citations — Casa Llimona Hotel Boutique and Sumus Hotels. The Catalan version cited the Ajuntament de Barcelona, the city council that has formally stewarded the tradition for centuries.

In the second session, a week later, the same queries returned a different reassignment. The Spanish version now cited the Ajuntament alongside Spain.info, the state tourism portal aimed at foreign visitors. The Catalan version moved up the institutional hierarchy entirely — its primary citation became the Generalitat de Catalunya, the regional government, with a footer link to the Guia Oficial de la Diada de la Generalitat de Catalunya. The Ajuntament was absent.

Composite — Google AI Overview citation panels for Sant Jordi traditions across both sessions, showing the language-conditioned shift in cited authorities

What stays stable across both sessions is the structural pattern: the cultural custodian credited by the system changes with the language. Catalan-language queries surface regional and municipal government, the institutions native to the tradition. Spanish-language queries surface state tourism, commercial entities, or municipal government framed as a tourist destination.

ChatGPT reinforces the same pattern in its prose. The Spanish version describes Sant Jordi externally: Día del amor “a la catalana,” oportunidad para conocer el patrimonio cultural catalán (a day of love “Catalan style,” a chance to get to know Catalan cultural heritage). The Catalan version uses native terminology without distance. The same 600-year-old tradition is described as exotic-from-outside in one language and as tradition-from-inside in the other.

The model isn’t lying in either language. It’s producing the most statistically plausible synthesis given its retrieval pool. But the retrieval pool itself is constituted differently by language — and one constitution treats government as the cultural custodian, while the other treats tourism marketing as the cultural custodian.

For brands, this isn’t a translation problem. It’s a question of who the model thinks owns the answer.

Finding 4: Language identification was already broken before LLMs touched it

This is the finding that reframes the rest. The reassignment patterns above all depend on the system correctly identifying the language of the query in the first place. Often, it doesn’t.

The Google Translate finding — Catalan misclassified as Occitan from a Barcelona IP — is one face of it. Another is what happens when you type a query that is unambiguously Catalan into Google Search. 

The query receptes de calçots (recipes for calçots, a vegetable that exists only in Catalonia and retains its Catalan name in every other language) produces a banner above the results: Sugerencia: Mostrar resultados en español. También puedes consultar más información sobre cómo filtrar por idioma. (Suggestion: Show results in Spanish. You can also find more information on how to filter by language.)

The system suggests that the user filter Catalan results out. No AI Overview is generated for the query. The infrastructure has decided that a recipe search for a Catalan-only vegetable, in Catalan, is more usefully answered in Spanish.

Google Search showing the suggestion “Sugerencia: Mostrar resultados en español” for the query “receptes de calçots”

In Google’s AI Overview, the query Tradicions de Sant Jordi sometimes returns a Spanish-language answer despite being written entirely in Catalan, citing Spain.info. In other sessions, the same query is correctly identified and answered in Catalan. 

The behavior is inconsistent across sessions, which is worse than consistently wrong: it is undiagnosable. A site owner can’t fix something that breaks intermittently for reasons the system itself doesn’t surface.

The failure isn’t universal. Queries like festivitats de Catalunya or poetes catalans contemporanis — slightly more formal or erudite phrasings — are correctly identified as Catalan and answered with Catalan-language synthesis, citing regional sources (Pimec, Gencat, El Temps, Lletra UOC). 

The system can identify Catalan. It just doesn’t do so reliably for commercial or popular queries, which is where the cost of getting it wrong is highest for site owners.

This is where Findings 2 and 4 close a loop. The same commercial categories that show zero SEM bidding in Catalan are the categories where language identification fails most often. A language with no commercial signal teaches the system that it doesn’t need to be treated as commercially serious — and so, for commercial queries, the system permits itself to identify it less reliably. The two failures reinforce each other.

None of this is new. Google Search Liaison publicly acknowledged the Catalan demotion problem in January 2023 and later that year pushed visible improvements to classical SERPs. 

The synthesis layer that now sits on top has not inherited those fixes. AI search is built on these pipelines. It inherits their defaults, their training-data composition, and their decisions about when a language deserves to be treated as the language of the answer.

The slop loop closing on minority languages

A second, slower mechanism makes all of this worse over time, and it is worth flagging because it is starting to be visible elsewhere.

LLMs trained on web-scale corpora are now generating significant quantities of low-quality content in minority languages — both directly (via translation features) and indirectly (via downstream tools that produce SEO content, social posts, and automated articles). 

That generated content gets indexed, gets crawled, gets fed back into the next generation of training data. The model that doesn’t understand Catalan well produces the Catalan content that trains the next model.

This isn’t theoretical. A 2024 Princeton study by Brooks, Eggert, and Peskoff found that over 5% of newly created English Wikipedia articles showed signs of being AI-generated, with lower but still measurable rates in German, French, and Italian editions. 

By extension — though outside the Princeton team’s measurement scope — minority-language editions with thinner editorial oversight are likely to absorb a greater proportional impact.

The minority-language damage is now well-documented. MIT Technology Review reported in September 2025 on a linguistic “doom loop” in vulnerable-language Wikipedias. 

  • Volunteers working on four African-language editions estimated that between 40% and 60% of their articles were uncorrected machine translations.
  • The Inuktitut edition contained machine-translated portions in more than two-thirds of substantive pages.
  • Some Hawaiian-language entries had 35% of their words flagged as incomprehensible by native speakers. 
  • The Greenlandic edition, where virtually no articles had been written by actual speakers, was ultimately recommended for closure in 2025, with the Wikipedia Language Committee citing AI tools that had “frequently produced nonsense that could misrepresent the language.” 

Wikipedia was estimated in 2022 to be the sole easily accessible source of online linguistic data for 27 under-resourced languages — meaning these errors don’t stay on Wikipedia. AI systems train on them next.

This is the loop. Bad language identification produces bad retrieval. Bad retrieval surfaces bad content. Bad content gets generated at scale by LLMs that don’t fully understand the language. Bad content gets indexed. The next model trains on it. 

The mechanism doesn’t need malice to degrade quality — it needs only volume. And volume in minority languages has never been easier to manufacture.

What Wikipedia decided to do about it

The clearest institutional signal that this problem is real comes from one of the few platforms with both the experience and the incentive to take it seriously.

On March 20, 2026, the English Wikipedia community formally voted to prohibit LLM-generated article content across its 7.1 million articles. Editors are still permitted to use LLMs for basic copyediting and for supervised translation of articles from other-language editions, but generating or rewriting article content with LLMs is prohibited outright. 

The decision was a response to years of mounting concern: ChatGPT-era articles were appearing with the “as a large language model” phrase left in the text, with entirely nonexistent citations, and with the kind of fluent-but-empty prose that reviewers were spending disproportionate volunteer time cleaning up.

Wikipedia isn’t a typical SEO concern. It’s a curated knowledge platform with strong volunteer governance and explicit neutrality policies. If a platform with that level of structural defense against low-quality content has concluded that AI-generated text damages knowledge integrity, the SEO industry should not assume that retrieval pipelines downstream of Wikipedia will produce better answers than Wikipedia itself was willing to publish.

The institutions building defenses against AI-generated content in minority languages — Wikipedia, the Aina Project in Catalonia, the Latxa models in the Basque Country — aren’t being defensive for ideological reasons. They are responding to a measured degradation in quality. That degradation is now part of the training data of the next generation of AI search.


Why this happens, mechanically

Motoko Hunt has documented how AI systems collapse geographic boundaries by treating language as a proxy for markets, a phenomenon she calls geo-identification drift. The mechanism is the same here, with one extra constraint that exposes it more clearly. 

When two languages share one geography, the system can’t quietly default to “the country the language belongs to.” It’s forced to choose something else. The choice usually goes to whichever corpus is larger, more recent, or more commercially tagged.

The Walker and Timoneda study above grounded this empirically. Their finding — that anti-independence framings appeared roughly twice as often in Spanish output as in Catalan — wasn’t a finding about politics. It was a finding about how training-data composition determines output. Catalan-language texts in the training corpus carry one distribution of perspectives; Spanish-language texts carry another. The model inherits both and surfaces whichever it is currently reaching for.

This compounds with what researchers call semantic collapse: when retrieval embeddings can’t reliably separate sub-national signals, the system flattens them into the dominant variant. In monolingual countries, the dominant variant is the country itself. In a region like Catalonia, the dominant variant is the larger linguistic neighbor — Spain — pulling Catalan-specific meaning toward a generic Spanish default unless something explicit pulls back.

Sub-national governments have noticed. The Aina Project and the Latxa models aren’t isolated efforts: they are direct attempts to build language-resource sovereignty because standard global LLMs perform measurably worse on Catalan and Basque than on Spanish. When governments start training their own LLMs, the SEO industry should treat that as evidence the underlying mechanism is real and structural.

The pattern isn’t unique to Catalonia. 

  • Quebec users querying in French routinely receive Parisian-French defaults and answers anchored in French regulatory frameworks rather than Quebec’s distinct civil law and provincial tax regime. 
  • Belgian users get conflated French and Dutch jurisdictional defaults inside a country whose three regions operate under genuinely different legal and linguistic rules. 
  • Swiss users see retrieval flattened toward German or French national defaults rather than Switzerland’s own conventions. 

The Catalan case is the easiest to test from a single IP in a single session, but the structural finding generalizes to every region where two or more languages share one geography.

The leading-indicator argument

The interesting question isn’t what this means for Catalonia. It’s what Catalonia means for everyone else.

Multilingual regions are the canary. The architectural flaw exposed when two languages share one geography — a vector space that can’t reliably separate jurisdiction from meaning, sitting on top of a language-identification layer that already gets things wrong — will show up in other forms as AI search matures and attempts genuinely sub-national answers.

This is where I want to be careful with the parallel. In monolingual markets, AI search does have access to localization signals that the Catalan case partly removes: IP geolocation, GPS context, browser locale, and structured local pack data. 

A query from Austin about contractor licensing isn’t as ambiguous to the system as a query in Catalan from Barcelona, because the system has more non-linguistic context to lean on. The Catalonia–Texas parallel isn’t a direct equivalence.

It’s a hypothesis worth testing, though. The same mechanisms that flatten Catalan into Spanish — corpus-weight defaults, semantic collapse, training-data composition — are present in synthesis pipelines regardless of the language pair. 

As AI Overviews and chat-style search increasingly answer queries by synthesis rather than by surfacing localized links, the protective effect of IP-based localization weakens. The system has to make a decision about which corpus represents “the answer,” and the corpus weight tends to win.

The places this is most likely to surface inside monolingual English markets:

State-level regulation with significant corpus asymmetry. California’s CCPA and Texas’s data privacy regime are written in the same language but represent different jurisdictional realities. 

The privacy literature is heavily California-weighted. When an AI Overview synthesizes a generic “what privacy rights do I have” answer, the defaults tilt toward whichever jurisdiction has more authority signals. Localization helps, but only when the query itself is jurisdictionally explicit.

Sub-national regulatory granularity in any large country. Liquor licensing, contractor licensing, real estate disclosure rules, alimony calculations, school district policies, zoning codes — jurisdiction-specific, all in English, with wildly different corpus weights between jurisdictions. As more queries are answered by synthesis rather than links, jurisdictional defaults become consequential in ways traditional SEO never had to worry about.

I don’t want to oversell this. The clean Catalan demonstration isn’t directly replicable in Texas. What is replicable is the underlying observation: when the retrieval system collapses signals, it collapses them in favor of the larger, better-represented corpus. That is true whether the signals being collapsed are linguistic, jurisdictional, or both. 

The brands that figured out how to operate across Spain and Mexico have already learned a version of this lesson. The brands operating across Texas and California will likely learn a related one, in a form that will not look identical and will require its own diagnostics.

What to do about it

The principles that work for multilingual fragmentation transfer, with adaptation, to multi-jurisdictional fragmentation. Same family of medicines, different patient.

Treat sub-national jurisdictions as distinct entities. If your business operates in regulated verticals across multiple U.S. states, those state versions need their own authority signals — not just a folder structure. Each variant should canonicalize to itself, not to a national parent page that would invite collapse.
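
A self-canonical check is easy to script. The sketch below assumes you already have a mapping from each state page to the canonical URL it declares (for example from a crawl export); the URLs and the helper name `non_self_canonical` are hypothetical.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Lowercase the host and drop a trailing slash so equal URLs compare equal."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return f"{parts.scheme}://{parts.netloc.lower()}{path}"


def non_self_canonical(pages: dict[str, str]) -> list[str]:
    """Return pages whose declared canonical points at a different URL."""
    return [
        page for page, canonical in pages.items()
        if normalize(page) != normalize(canonical)
    ]


# Hypothetical crawl data: page URL -> canonical URL declared on that page.
pages = {
    "https://example.com/tx/contractor-licensing/": "https://example.com/tx/contractor-licensing/",
    "https://example.com/ca/contractor-licensing/": "https://example.com/contractor-licensing/",
}

flagged = non_self_canonical(pages)
print(flagged)
```

Here the California page is flagged because it defers to a national parent, which is the collapse-inviting setup described above.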

Encode jurisdiction explicitly in structured data and copy. Schema.org’s areaServed operates at any geographic granularity; use it down to the state, county, or municipality where it matters. Pair it with explicit copy markers: regulator names, state-specific identifiers, region-specific currencies or formats. The model needs deterministic hooks. Without them, it improvises.
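
As an illustration, state-level markup might be generated like this. It is a sketch under stated assumptions: the service name, URL, and regulator are placeholders, and `Service` with a `State` value for `areaServed` is one reasonable modeling choice among several that Schema.org allows.

```python
import json


def service_jsonld(name: str, url: str, state: str, regulator: str) -> dict:
    """Build Service markup whose areaServed names one state explicitly."""
    return {
        "@context": "https://schema.org",
        "@type": "Service",
        "name": name,
        "url": url,
        # areaServed accepts an AdministrativeArea; State is one of its subtypes.
        "areaServed": {"@type": "State", "name": state},
        # Mirror the jurisdiction in visible copy by naming the regulator outright.
        "description": f"{name} for businesses in {state}, licensed by the {regulator}.",
    }


markup = service_jsonld(
    name="Contractor licensing support",                  # hypothetical service
    url="https://example.com/tx/contractor-licensing/",   # hypothetical URL
    state="Texas",
    regulator="Example State Licensing Board",            # placeholder, not a real body
)

print(json.dumps(markup, indent=2, ensure_ascii=False))
```

The printed JSON would go inside a `script` element of type `application/ld+json` on the state page, alongside body copy that repeats the same regulator and state names.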

Reinforce sub-national grounding through Wikidata. Most SEO programs stop at on-site schema, but knowledge graphs are reading what other graphs say about you. Wikidata’s jurisdiction property (P1001) and explicit language properties let you encode jurisdictional and linguistic boundaries at the knowledge-graph level — exactly the layer where AI systems pull entity context. If you operate in a sub-national jurisdiction that matters commercially, your entity should be modeled there with the granularity that matters.
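
Before editing anything, it helps to see what Wikidata already records for a jurisdiction. The sketch below only builds a SPARQL query and a request URL for the public Wikidata Query Service; it does not send the request. The QID `Q5705` is assumed to be Catalonia and should be verified on Wikidata before use, and the label languages are an illustrative choice.

```python
import re
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def jurisdiction_query(qid: str, limit: int = 20) -> str:
    """SPARQL listing items whose P1001 (applies to jurisdiction) is the given QID."""
    if not re.fullmatch(r"Q[1-9]\d*", qid):
        raise ValueError(f"not a Wikidata item ID: {qid!r}")
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        f"  ?item wdt:P1001 wd:{qid} .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "ca,es,en". }\n'
        "}\n"
        f"LIMIT {limit}"
    )


def query_url(qid: str) -> str:
    """URL that would return JSON results for the query above."""
    params = {"query": jurisdiction_query(qid), "format": "json"}
    return f"{WDQS_ENDPOINT}?{urlencode(params)}"


# Q5705 is assumed to be Catalonia; confirm the ID on wikidata.org first.
print(jurisdiction_query("Q5705"))
print(query_url("Q5705"))
```

If your entity or the regulations you operate under are missing from the results, that gap is a candidate for the knowledge-graph work described above.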

Audit for sub-national authority gaps the same way you’d audit for international ones. Run the diagnostic prompts you would run for Spain versus Mexico, but for Texas versus California, or Ontario versus Quebec inside Canada, or any pair of jurisdictions where your business operates. If the model conflates them, your content has a fragmentation problem inside what looked like a single market.
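
One rough way to score such an audit is to check each answer for markers of the expected jurisdiction and of its larger neighbor. This is a sketch, not a validated method: the marker lists and the sample answer are illustrative, and plain substring matching will miss paraphrases.

```python
def audit_answer(answer: str, expected: list[str], foreign: list[str]) -> dict:
    """Flag whether an answer names the expected jurisdiction or a different one."""
    text = answer.casefold()
    expected_hits = [m for m in expected if m.casefold() in text]
    foreign_hits = [m for m in foreign if m.casefold() in text]
    return {
        "grounded": bool(expected_hits),    # at least one expected marker found
        "conflated": bool(foreign_hits),    # at least one other-jurisdiction marker found
        "expected_hits": expected_hits,
        "foreign_hits": foreign_hits,
    }


# Hypothetical answer to a privacy-rights prompt asked with Texas context.
sample_answer = "Under the CCPA, you can ask a business to delete your personal data."

report = audit_answer(
    sample_answer,
    expected=["TDPSA", "Texas Data Privacy and Security Act"],  # illustrative markers
    foreign=["CCPA", "California"],                              # illustrative markers
)
print(report)
```

An answer that comes back conflated and not grounded, as this sample does, is the English-language counterpart of a Catalan query answered from Spanish sources.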

Watch the secondary signals. In Catalan, the absence of SEM bids was a signal, and the system learned from it. The same applies to underserved monolingual jurisdictions: if no one bids on Texas-specific terminology, Texas-specific content gets deprioritized in synthesis. If your knowledge-graph presence, local citations, and authority signals all point to the dominant jurisdiction, the model has no reason to surface the underrepresented one.

This isn’t a new playbook. It’s the cultural SEO framework applied below the country line: market segmentation, transcreation, retrieval constraints, and entity reinforcement, but at sub-national granularity.

What this means for your AI search strategy

The Sant Jordi answer didn’t fail because of bad translation. It failed because the language-identification layer beneath the translation has never consistently distinguished Catalan from Occitan, Catalan from Spanish, or Catalan-the-language-of-the-query from Catalan-as-irrelevant-noise. 

Google said so itself, in Catalan, three years ago. The retrieval pipeline built on top of that layer inherits every one of those decisions, and now produces synthesized answers that quietly propagate them.

Wikipedia, looking at the same generative-AI ecosystem from a different angle, decided in March 2026 that the risk of degradation was severe enough to prohibit LLM-generated content outright. The Aina Project and the Latxa team reached the same conclusion in advance by funding their own foundation models. The institutions closest to multilingual knowledge integrity are pulling away from generic AI. The SEO industry should at least notice the pattern.

Multilingual regions reveal a structural assumption baked into AI search: that language and market are the same thing, and that language is reliably knowable from a query string. Neither is true. Hreflang made the geographic distinction operational for traditional search. Nothing has yet made it operational for generative retrieval.

The brands that operate well across Spain and Mexico already know how to fix this for languages. The same techniques — explicit jurisdiction signals, market-specific authority, retrieval constraints, transcreation rather than translation, entity grounding in knowledge graphs — are now table stakes for operating well across any pair of jurisdictions, in any language combination.

If you operate across multiple jurisdictions, the question to ask isn’t whether your content is localized. It’s whether the model can tell.
