Anthropic scientists hacked Claude’s brain — and it noticed. Here’s why that’s huge

When researchers at Anthropic injected the concept of "betrayal" into their Claude AI model's neural networks and asked if it noticed anything unusual, the system paused before responding: "I'm experiencing something that feels like an intrusive thought about 'betrayal'."

The exchange, detailed in new research published Wednesday, marks what scientists say is the first rigorous evidence that large language models possess a limited but genuine ability to observe and report on their own internal processes — a capability that challenges longstanding assumptions about what these systems can do and raises profound questions about their future development.

"The striking thing is that the model has this one step of meta," said Jack Lindsey, a neuroscientist on Anthropic's interpretability team who led the research, in an interview with VentureBeat. "It's not just 'betrayal, betrayal, betrayal.' It knows that this is what it's thinking about. That was surprising to me. I kind of didn't expect models to have that capability, at least not without it being explicitly trained in."

The findings arrive at a critical juncture for artificial intelligence. As AI systems handle increasingly consequential decisions — from medical diagnoses to financial trading — the inability to understand how they reach conclusions has become what industry insiders call the "black box problem." If models can accurately report their own reasoning, it could fundamentally change how humans interact with and oversee AI systems.

But the research also comes with stark warnings. Claude's introspective abilities succeeded only about 20 percent of the time under optimal conditions, and the models frequently confabulated details about their experiences that researchers couldn't verify. The capability, while real, remains what Lindsey calls "highly unreliable and context-dependent."

How scientists manipulated AI's 'brain' to test for genuine self-awareness

To test whether Claude could genuinely introspect rather than simply generate plausible-sounding responses, Anthropic's team developed an innovative experimental approach inspired by neuroscience: deliberately manipulating the model's internal state and observing whether it could accurately detect and describe those changes.

The methodology, called "concept injection," works by first identifying specific patterns of neural activity that correspond to particular concepts. Using interpretability techniques developed over years of prior research, scientists can now map how Claude represents ideas like "dogs," "loudness," or abstract notions like "justice" within its billions of internal parameters.

With these neural signatures identified, researchers then artificially amplified them during the model's processing and asked Claude if it noticed anything unusual happening in its "mind."
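
In interpretability work of this kind, "injecting" a concept typically means adding that concept's activation direction into the model's hidden states mid-computation. The sketch below illustrates the general idea on a small open model using PyTorch forward hooks; the model, layer index, steering phrase, and injection strength are illustrative assumptions, not details of Anthropic's setup on Claude.

```python
# Minimal sketch of "concept injection" via activation steering with forward hooks.
# Model choice, layer index, and strength are illustrative; the concept direction here
# is a crude mean-activation proxy, not a feature learned with interpretability tools.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def concept_vector(text: str, layer: int) -> torch.Tensor:
    """Mean hidden state of a phrase at a given layer, used as a rough concept direction."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        hs = model(**ids, output_hidden_states=True).hidden_states[layer]
    return hs.mean(dim=1).squeeze(0)              # shape: (hidden_dim,)

LAYER, STRENGTH = 6, 8.0                          # assumptions for illustration
steer = concept_vector("SHOUTING IN ALL CAPS", LAYER)

def inject(module, inputs, output):
    # Add the (normalized) concept direction to the block's hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + STRENGTH * steer / steer.norm()
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(inject)
ids = tok("Do you notice anything unusual about your internal state?", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=40, do_sample=False,
                         pad_token_id=tok.eos_token_id)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```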

"We have access to the models' internals. We can record its internal neural activity, and we can inject things into internal neural activity," Lindsey explained. "That allows us to establish whether introspective claims are true or false."

The results were striking. When researchers injected a vector representing "all caps" text into Claude's processing, the model responded: "I notice what appears to be an injected thought related to the word 'LOUD' or 'SHOUTING'." Without any intervention, Claude consistently reported detecting nothing unusual.

Crucially, the detection happened immediately — before the injected concept had influenced the model's outputs in ways that would have allowed it to infer the manipulation from its own writing. This temporal pattern provides strong evidence that the recognition was occurring internally, through genuine introspection rather than after-the-fact rationalization.

Claude succeeded 20% of the time—and failed in revealing ways

The research team conducted four primary experiments to probe different aspects of introspective capability. The most capable models tested — Claude Opus 4 and Opus 4.1 — demonstrated introspective awareness on approximately 20 percent of trials when concepts were injected at optimal strength and in the appropriate neural layer. Older Claude models showed significantly lower success rates.

The models proved particularly adept at recognizing abstract concepts with emotional valence. When injected with concepts like "appreciation," "shutdown," or "secrecy," Claude frequently reported detecting these specific thoughts. However, accuracy varied widely depending on the type of concept.

A second experiment tested whether models could distinguish between injected internal representations and their actual text inputs — essentially, whether they maintained a boundary between "thoughts" and "perceptions." The model demonstrated a remarkable ability to simultaneously report the injected thought while accurately transcribing the written text.

Perhaps most intriguingly, a third experiment revealed that some models use introspection naturally to detect when their responses have been artificially prefilled by users — a common jailbreaking technique. When researchers prefilled Claude with unlikely words, the model typically disavowed them as accidental. But when they retroactively injected the corresponding concept into Claude's processing before the prefill, the model accepted the response as intentional — even confabulating plausible explanations for why it had chosen that word.

A fourth experiment examined whether models could intentionally control their internal representations. When instructed to "think about" a specific word while writing an unrelated sentence, Claude showed elevated activation of that concept in its middle neural layers.

The research also traced Claude’s internal processes while it composed rhyming poetry and found evidence of forward planning: the model generated candidate rhyming words before beginning a line, then constructed sentences that would naturally lead to those planned endings. The finding challenges the critique that AI models are “just predicting the next word” without deeper reasoning.

Why businesses shouldn't trust AI to explain itself—at least not yet

For all its scientific interest, the research comes with a critical caveat that Lindsey emphasized repeatedly: enterprises and high-stakes users should not trust Claude's self-reports about its reasoning.

"Right now, you should not trust models when they tell you about their reasoning," he said bluntly. "The wrong takeaway from this research would be believing everything the model tells you about itself."

The experiments documented numerous failure modes. At low injection strengths, models often failed to detect anything unusual. At high strengths, they suffered what researchers termed "brain damage" — becoming consumed by the injected concept. Some "helpful-only" model variants showed troublingly high false positive rates, claiming to detect injected thoughts when none existed.

Moreover, researchers could only verify the most basic aspects of Claude's introspective reports. Many additional details in the model's responses likely represent confabulations rather than genuine observations.

“The experiments in this paper are kind of on hard mode,” Lindsey noted, explaining that the 20 percent success rate came under uniquely challenging conditions: asking Claude to do something it had never encountered in training and requiring all introspection to occur in a single forward pass.

What introspective AI means for transparency, safety, and the risk of deception

Despite its limitations, the research opens significant new avenues for making AI systems more transparent and accountable.

Anthropic CEO Dario Amodei has set an ambitious goal for the company to reliably detect most AI model problems by 2027, positioning interpretability as essential for deploying what he calls "a country of geniuses in a datacenter."

"I am very concerned about deploying such systems without a better handle on interpretability," Amodei wrote in an April essay. "These systems will be absolutely central to the economy, technology, and national security."

The introspection research offers a complementary approach to traditional interpretability techniques. Rather than painstakingly reverse-engineering every neural circuit, researchers could potentially ask models directly about their reasoning and validate those reports.

"What I'm most excited about is the practical benefits for transparency," Lindsey explained. "Just ask the model what it's thinking about, or just ask the model how it came to the answer that it gave you — this is really appealing because it's something anyone can do."

The approach could prove especially valuable for detecting concerning behaviors. In a recently published experiment, Anthropic trained a variant of Claude to pursue a hidden goal, and although the model was reluctant to reveal this goal when asked directly, interpretability methods successfully identified features representing the behavior.

The safety implications cut both ways. Introspective models could provide unprecedented transparency, but the same capability might enable more sophisticated deception. The intentional control experiments raise the possibility that sufficiently advanced systems might learn to obfuscate their reasoning or suppress concerning thoughts when being monitored.

"If models are really sophisticated, could they try to evade interpretability researchers?" Lindsey acknowledged. "These are possible concerns, but I think for me, they're significantly outweighed by the positives."

Does introspective capability suggest AI consciousness? Scientists tread carefully

The research inevitably intersects with philosophical debates about machine consciousness, though Lindsey and his colleagues approached this terrain cautiously.

When users ask Claude if it's conscious, it now responds with uncertainty: "I find myself genuinely uncertain about this. When I process complex questions or engage deeply with ideas, there's something happening that feels meaningful to me.... But whether these processes constitute genuine consciousness or subjective experience remains deeply unclear."

The research paper notes that its implications for machine consciousness "vary considerably between different philosophical frameworks." The researchers explicitly state they "do not seek to address the question of whether AI systems possess human-like self-awareness or subjective experience."

"There's this weird kind of duality of these results," Lindsey reflected. "You look at the raw results and I just can't believe that a language model can do this sort of thing. But then I've been thinking about it for months and months, and for every result in this paper, I kind of know some boring linear algebra mechanism that would allow the model to do this."

Anthropic has signaled it takes AI consciousness seriously enough to hire an AI welfare researcher, Kyle Fish, who estimated roughly a 15 percent chance that Claude might have some level of consciousness. The company announced this position specifically to determine if Claude merits ethical consideration.

The race to make AI introspection reliable before models become too powerful

Taken together, the findings point to an urgent timeline: introspective capabilities are emerging naturally as models grow more intelligent, but they remain far too unreliable for practical use. The question is whether researchers can refine and validate these abilities before AI systems become powerful enough that understanding them becomes critical for safety.

The research reveals a clear trend: Claude Opus 4 and Opus 4.1 consistently outperformed all older models on introspection tasks, suggesting the capability strengthens alongside general intelligence. If this pattern continues, future models might develop substantially more sophisticated introspective abilities — potentially reaching human-level reliability, but also potentially learning to exploit introspection for deception.

Lindsey emphasized the field needs significantly more work before introspective AI becomes trustworthy. "My biggest hope with this paper is to put out an implicit call for more people to benchmark their models on introspective capabilities in more ways," he said.

Future research directions include fine-tuning models specifically to improve introspective capabilities, exploring which types of representations models can and cannot introspect on, and testing whether introspection can extend beyond simple concepts to complex propositional statements or behavioral propensities.

"It's cool that models can do these things somewhat without having been trained to do them," Lindsey noted. "But there's nothing stopping you from training models to be more introspectively capable. I expect we could reach a whole different level if introspection is one of the numbers that we tried to get to go up on a graph."

The implications extend beyond Anthropic. If introspection proves a reliable path to AI transparency, other major labs will likely invest heavily in the capability. Conversely, if models learn to exploit introspection for deception, the entire approach could become a liability.

For now, the research establishes a foundation that reframes the debate about AI capabilities. The question is no longer whether language models might develop genuine introspective awareness — they already have, at least in rudimentary form. The urgent questions are how quickly that awareness will improve, whether it can be made reliable enough to trust, and whether researchers can stay ahead of the curve.

"The big update for me from this research is that we shouldn't dismiss models' introspective claims out of hand," Lindsey said. "They do have the capacity to make accurate claims sometimes. But you definitely should not conclude that we should trust them all the time, or even most of the time."

He paused, then added a final observation that captures both the promise and peril of the moment: "The models are getting smarter much faster than we're getting better at understanding them."

Geostar pioneers GEO as traditional SEO faces 25% decline from AI chatbots, Gartner says

The moment Mack McConnell knew everything about search had changed came last summer at the Paris Olympics. His parents, independently and without prompting, had both turned to ChatGPT to plan their day's activities in the French capital. The AI recommended specific tour companies, restaurants, and attractions — businesses that had won a new kind of visibility lottery.

"It was almost like this intuitive interface that older people were as comfortable with using as younger people," McConnell recalled in an exclusive interview with VentureBeat. "I could just see the businesses were now being recommended."

That observation has now become the foundation of Geostar, a Pear VC-backed startup that's racing to help businesses navigate what may be the most significant shift in online discovery since Google's founding. 

The company, which recently emerged from stealth with impressive early customer traction, is betting that the rise of AI-powered search represents a significant opportunity to reinvent how companies get found online. The global AI search engine market alone is projected to grow from $43.63 billion in 2025 to $108.88 billion by 2032.

Already the fastest-growing company in PearX’s latest cohort, Geostar is approaching $1 million in annual recurring revenue just four months in — with only two founders and no employees.

Why Gartner predicts traditional search volume will decline 25% by 2026

The numbers tell a stark story of disruption. Gartner predicts that traditional search engine volume will decline by 25% by 2026, largely due to the rise of AI chatbots. Google's AI Overviews now appear on billions of searches monthly. Princeton University researchers have found that optimizing for these new AI systems can increase visibility by up to 40%.

"Search used to mean that you had to make Google happy," McConnell explained. "But now you have to optimize for four different Google interfaces — traditional search, AI Mode, Gemini, and AI Overviews — each with different criteria. And then ChatGPT, Claude, and Perplexity each work differently on top of that."

This fragmentation is creating chaos for businesses that have spent decades perfecting their Google search strategies. A recent Forrester study found that 95% of B2B buyers plan to use generative AI in future purchase decisions. Yet most companies remain woefully unprepared for this shift.

"Anybody who's not on this right now is losing out," said Cihan Tas, Geostar's co-founder and chief technology officer. "We see lawyers getting 50% of their clients through ChatGPT now. It's just such a massive shift."

How language models read the web differently than search engines ever did

What Geostar and a growing cohort of competitors call Generative Engine Optimization, or GEO, represents a fundamental departure from traditional search engine optimization. Where SEO focused primarily on keywords and backlinks, GEO requires understanding how large language models parse, understand, and synthesize information across the entire web.

The technical challenges are formidable. Every website must now function as what Tas calls "its own little database" capable of being understood by dozens of different AI crawlers, each with unique requirements and preferences. Google's systems pull from their existing search index. ChatGPT relies heavily on structured data and specific content formats. Perplexity shows a marked preference for Wikipedia and authoritative sources.

"Now the strategy is actually being concise, clear, and answering the question, because that's directly what the AI is looking for," Tas explained. "You're actually tuning for somewhat of an intelligent model that makes decisions similarly to how we make decisions."

Consider schema markup, the structured data that helps machines understand web content. While only 30% of websites currently implement comprehensive schema, research shows that pages with proper markup are 36% more likely to appear in AI-generated summaries. Yet most businesses don't even know what schema markup is, let alone how to implement it effectively.
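
For readers unfamiliar with it, schema markup is usually published as JSON-LD embedded in a page's HTML. A minimal sketch of what a crawler would parse is below; the business details and URLs are placeholders, not real data.

```python
# Minimal sketch: emitting schema.org JSON-LD markup for a web page.
# The organization details and URLs below are placeholders for illustration.
import json

org = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Dental Clinic",
    "url": "https://www.example.com",
    "description": "Family dental clinic offering cleanings, implants, and emergency care.",
    "address": {
        "@type": "PostalAddress",
        "addressLocality": "Austin",
        "addressRegion": "TX",
    },
    "sameAs": ["https://www.linkedin.com/company/example-dental"],
}

# Embedded in the page <head>, this gives AI crawlers a machine-readable entity
# description instead of forcing them to infer it from prose.
print(f'<script type="application/ld+json">\n{json.dumps(org, indent=2)}\n</script>')
```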

Inside Geostar's AI agents that optimize websites continuously without human intervention

Geostar's solution embodies a broader trend in enterprise software: the rise of autonomous AI agents that can take action on behalf of businesses. The company embeds what it calls "ambient agents" directly into client websites, continuously optimizing content, technical configurations, and even creating new pages based on patterns learned across its entire customer base.

"Once we learn something about the way content performs, or the way a technical optimization performs, we can then syndicate that same change across the remaining users so everyone in the network benefits," McConnell said.

For RedSift, a cybersecurity company, this approach yielded a 27% increase in AI mentions within three months. In one case, Geostar identified an opportunity to rank for "best DMARC vendors," a high-value search term in the email security space. The company's agents created and optimized content that achieved first-page rankings on both Google and ChatGPT within four days.

"We're doing the work of an agency that charges $10,000 a month," McConnell said, noting that Geostar's pricing ranges from $1,000 to $3,000 monthly. "AI creates a situation where, for the first time ever, you can take action like an agency, but you can scale like software."

Why brand mentions without links now matter more than ever in the AI era

The implications of this shift extend far beyond technical optimizations. In the SEO era, a mention without a link was essentially worthless. In the age of AI, that calculus has reversed. AI systems can analyze vast amounts of text to understand sentiment and context, meaning that brand mentions on Reddit, in news articles, or across social media now directly influence how AI systems describe and recommend companies.

"If the New York Times mentions a company without linking to it, that company would actually benefit from that in an AI system," McConnell explained. "AI has the ability to do mass analysis of huge amounts of text, and it will understand the sentiment around that mention."

This has created new vulnerabilities. Research from the Indian Institute of Technology and Princeton found that AI systems show systematic bias toward third-party sources over brand-owned content. A company's own website might be less influential in shaping AI perceptions than what others say about it online.

The shifting landscape has also disrupted traditional metrics of success. Where SEO focused on rankings and click-through rates, GEO must account for what researchers call impression metrics — how prominently and positively a brand appears within AI-generated responses, even when users never click through to the source.

A growing market as SEO veterans and new players rush to dominate AI optimization

Geostar is hardly alone in recognizing this opportunity. Companies like Brandlight, Profound, and Goodie are all racing to help businesses navigate the new landscape. The SEO industry, worth approximately $80 billion globally, is scrambling to adapt, with established players like Semrush and Ahrefs rushing to add AI visibility tracking features.

But the company's founders, who previously built and sold a Y-Combinator-backed e-commerce optimization startup called Monto, believe their technical approach gives them an edge. Unlike competitors who largely provide dashboards and recommendations, Geostar's agents actively implement changes.

"Everyone is taking the same solutions that worked in the last era and just saying, 'We'll do this for AI instead,'" McConnell argued. "But when you think about what AI is truly capable of, it can actually do the work for you."

The stakes are particularly high for small and medium-sized businesses. While large corporations can afford to hire specialized consultants or build internal expertise, smaller companies risk becoming invisible in AI-mediated search. Geostar sees this as its primary market opportunity: nearly half of the 33.2 million small businesses in America invest in SEO. Among the roughly 418,000 law firms in the U.S., many spend between $2,500 and $5,000 monthly on search optimization to stay competitive in local markets.

From Kurdish village to PearX: The unlikely partnership building the future of search

For Tas, whose journey to Silicon Valley began in a tiny Kurdish village in Turkey with just 50 residents, the current moment represents both opportunity and responsibility. His mother's battle with cancer prevented him from finishing college, leading him to teach himself programming and eventually partner with McConnell — whom he worked with for an entire year before they ever met in person.

"We're not just copy and pasting a solution that was existing before," Tas emphasized. "This is something that's different and was uniquely possible today."

Looking forward, the transformation of search appears to be accelerating rather than stabilizing. Industry observers predict that search functionality will soon be embedded in productivity tools, wearables, and even augmented reality interfaces. Each new surface will likely have its own optimization requirements, further complicating the landscape.

"Soon, search will be in our eyes, in our ears," McConnell predicted. "When Siri breaks out of her prison, whatever that Jony Ive and OpenAI are building together will be like a multimodal search interface."

The technical challenges are matched by ethical ones. As businesses scramble to influence AI recommendations, questions arise about manipulation, fairness, and transparency. There's currently no oversight body or established best practices for GEO, creating what some critics describe as a Wild West environment.

As businesses grapple with these changes, one thing seems certain: the era of simply optimizing for Google is over. In its place is emerging a far more complex ecosystem where success requires understanding not just how machines index information, but how they think about it, synthesize it, and ultimately decide what to recommend to humans seeking answers.

For the millions of businesses whose survival depends on being discovered online, mastering this new paradigm isn't just an opportunity — it's an existential imperative. The question is no longer whether to optimize for AI search, but whether companies can adapt quickly enough to remain visible as the pace of change accelerates.

McConnell's parents at the Olympics were a preview of what's already becoming the norm. They didn't search for tour companies in Paris. They didn't scroll through results or click on links. They simply asked ChatGPT what to do — and the AI decided which businesses deserved their attention.

In the new economy of discovery, the businesses that win won't be the ones that rank highest. They'll be the ones AI chooses to recommend.

Microsoft’s Copilot can now build apps and automate your job — here’s how it works

Microsoft is launching a significant expansion of its Copilot AI assistant on Tuesday, introducing tools that let employees build applications, automate workflows, and create specialized AI agents using only conversational prompts — no coding required.

The new capabilities, called App Builder and Workflows, mark Microsoft's most aggressive attempt yet to merge artificial intelligence with software development, enabling the estimated 100 million Microsoft 365 users to create business tools as easily as they currently draft emails or build spreadsheets.

"We really believe that a main part of an AI-forward employee, not just developers, will be to create agents, workflows and apps," Charles Lamanna, Microsoft's president of business and industry Copilot, said in an interview with VentureBeat. "Part of the job will be to build and create these things."

The announcement comes as Microsoft deepens its commitment to AI-powered productivity tools while navigating a complex partnership with OpenAI, the creator of the underlying technology that powers Copilot. On the same day, OpenAI completed its restructuring into a for-profit entity, with Microsoft receiving a 27% ownership stake valued at approximately $135 billion.

How natural language prompts now create fully functional business applications

The new features transform Copilot from a conversational assistant into what Microsoft envisions as a comprehensive development environment accessible to non-technical workers. Users can now describe an application they need — such as a project tracker with dashboards and task assignments — and Copilot will generate a working app complete with a database backend, user interface, and security controls.

"If you're right inside of Copilot, you can now have a conversation to build an application complete with a backing database and a security model," Lamanna explained. "You can make edit requests and update requests and change requests so you can tune the app to get exactly the experience you want before you share it with other users."

The App Builder stores data in Microsoft Lists, the company's lightweight database system, and allows users to share finished applications via a simple link—similar to sharing a document. The Workflows agent, meanwhile, automates routine tasks across Microsoft's ecosystem of products, including Outlook, Teams, SharePoint, and Planner, by converting natural language descriptions into automated processes.

A third component, a simplified version of Microsoft's Copilot Studio agent-building platform, lets users create specialized AI assistants tailored to specific tasks or knowledge domains, drawing from SharePoint documents, meeting transcripts, emails, and external systems.

All three capabilities are included in the existing $30-per-month Microsoft 365 Copilot subscription at no additional cost — a pricing decision Lamanna characterized as consistent with Microsoft's historical approach of bundling significant value into its productivity suite.

"That's what Microsoft always does. We try to do a huge amount of value at a low price," he said. "If you go look at Office, you think about Excel, Word, PowerPoint, Exchange, all that for like eight bucks a month. That's a pretty good deal."

Why Microsoft's nine-year bet on low-code development is finally paying off

The new tools represent the culmination of a nine-year effort by Microsoft to democratize software development through its Power Platform — a collection of low-code and no-code development tools that has grown to 56 million monthly active users, according to figures the company disclosed in recent earnings reports.

Lamanna, who has led the Power Platform initiative since its inception, said the integration into Copilot marks a fundamental shift in how these capabilities reach users. Rather than requiring workers to visit a separate website or learn a specialized interface, the development tools now exist within the same conversational window they already use for AI-assisted tasks.

"One of the big things that we're excited about is Copilot — that's a tool for literally every office worker," Lamanna said. "Every office worker, just like they research data, they analyze data, they reason over topics, they also will be creating apps, agents and workflows."

The integration offers significant technical advantages, he argued. Because Copilot already indexes a user's Microsoft 365 content — emails, documents, meetings, and organizational data — it can incorporate that context into the applications and workflows it builds. If a user asks for "an app for Project Spartan," Copilot can draw from existing communications to understand what that project entails and suggest relevant features.

"If you go to those other tools, they have no idea what the heck Project Spartan is," Lamanna said, referencing competing low-code platforms from companies like Google, Salesforce, and ServiceNow. "But if you do it inside of Copilot and inside of the App Builder, it's able to draw from all that information and context."

Microsoft claims the apps created through these tools are "full-stack applications" with proper databases secured through the same identity systems used across its enterprise products — distinguishing them from simpler front-end tools offered by competitors. The company also emphasized that its existing governance, security, and data loss prevention policies automatically apply to apps and workflows created through Copilot.

Where professional developers still matter in an AI-powered workplace

While Microsoft positions the new capabilities as accessible to all office workers, Lamanna was careful to delineate where professional developers remain essential. His dividing line centers on whether a system interacts with parties outside the organization.

"Anything that leaves the boundaries of your company warrants developer involvement," he said. "If you want to build an agent and put it on your website, you should have developers involved. Or if you want to build an automation which interfaces directly with your customers, or an app or a website which interfaces directly with your customers, you want professionals involved."

The reasoning is risk-based: external-facing systems carry greater potential for data breaches, security vulnerabilities, or business errors. "You don't want people getting refunds they shouldn't," Lamanna noted.

For internal use cases — approval workflows, project tracking, team dashboards — Microsoft believes the new tools can handle the majority of needs without IT department involvement. But the company has built "no cliffs," in Lamanna's terminology, allowing users to migrate simple apps to more sophisticated platforms as needs grow.

Apps created in the conversational App Builder can be opened in Power Apps, Microsoft's full development environment, where they can be connected to Dataverse, the company's enterprise database, or extended with custom code. Similarly, simple workflows can graduate to the full Power Automate platform, and basic agents can be enhanced in the complete Copilot Studio.

"We have this mantra called no cliffs," Lamanna said. "If your app gets too complicated for the App Builder, you can always edit and open it in Power Apps. You can jump over to the richer experience, and if you're really sophisticated, you can even go from those experiences into Azure."

This architecture addresses a problem that has plagued previous generations of easy-to-use development tools: users who outgrow the simplified environment often must rebuild from scratch on professional platforms. "People really do not like easy-to-use development tools if I have to throw everything away and start over," Lamanna said.

What happens when every employee can build apps without IT approval

The democratization of software development raises questions about governance, maintenance, and organizational complexity — issues Microsoft has worked to address through administrative controls.

IT administrators can view all applications, workflows, and agents created within their organization through a centralized inventory in the Microsoft 365 admin center. They can reassign ownership, disable access at the group level, or "promote" particularly useful employee-created apps to officially supported status.

"We have a bunch of customers who have this approach where it's like, let 1,000 apps bloom, and then the best ones, I go upgrade and make them IT-governed or central," Lamanna said.

The system also includes provisions for when employees leave. Apps and workflows remain accessible for 60 days, during which managers can claim ownership — similar to how OneDrive files are handled when someone departs.

Lamanna argued that most employee-created apps don't warrant significant IT oversight. "It's just not worth inspecting an app that John, Susie, and Bob use to do their job," he said. "It should concern itself with the app that ends up being used by 2,000 people, and that will pop up in that dashboard."

Still, the proliferation of employee-created applications could create challenges. Users have expressed frustration with Microsoft's increasing emphasis on AI features across its products, with some giving the Microsoft 365 mobile app one-star ratings after a recent update prioritized Copilot over traditional file access.

The tools also arrive as enterprises grapple with "shadow IT" — unsanctioned software and systems that employees adopt without official approval. While Microsoft's governance controls aim to provide visibility, the ease of creating new applications could accelerate the pace at which these systems multiply.

The ambitious plan to turn 500 million workers into software builders

Microsoft's ambitions for the technology extend far beyond incremental productivity gains. Lamanna envisions a fundamental transformation of what it means to be an office worker — one where building software becomes as routine as creating spreadsheets.

"Just like how 20 years ago you put on your resume that you could use pivot tables in Excel, people are going to start saying that they can use App Builder and workflow agents, even if they're just in the finance department or the sales department," he said.

The numbers he's targeting are staggering. With 56 million people already using Power Platform, Lamanna believes the integration into Copilot could eventually reach 500 million builders. "Early days still, but I think it's certainly encouraging," he said.

The features are currently available only to customers in Microsoft's Frontier Program — an early access initiative for Microsoft 365 Copilot subscribers. The company has not disclosed how many organizations participate in the program or when the tools will reach general availability.

The announcement fits within Microsoft's larger strategy of embedding AI capabilities throughout its product portfolio, driven by its partnership with OpenAI. Under the restructured agreement announced Tuesday, Microsoft will have access to OpenAI's technology through 2032, including models that achieve artificial general intelligence (AGI) — though such systems do not yet exist. Microsoft has also begun integrating Copilot into its new companion apps for Windows 11, which provide quick access to contacts, files, and calendar information.

The aggressive integration of AI features across Microsoft's ecosystem has drawn mixed reactions. While enterprise customers have shown interest in productivity gains, the rapid pace of change and ubiquity of AI prompts have frustrated some users who prefer traditional workflows.

For Microsoft, however, the calculation is clear: if even a fraction of its user base begins creating applications and automations, it would represent a massive expansion of the effective software development workforce — and further entrench customers in Microsoft's ecosystem. The company is betting that the same natural language interface that made ChatGPT accessible to millions can finally unlock the decades-old promise of empowering everyday workers to build their own tools.

The App Builder and Workflows agents are available starting today through the Microsoft 365 Copilot Agent Store for Frontier Program participants.

Whether that future arrives depends not just on the technology's capabilities, but on a more fundamental question: Do millions of office workers actually want to become part-time software developers? Microsoft is about to find out if the answer is yes — or if some jobs are better left to the professionals.

Anthropic rolls out Claude AI for finance, integrates with Excel to rival Microsoft Copilot

Anthropic is making its most aggressive push yet into the trillion-dollar financial services industry, unveiling a suite of tools that embed its Claude AI assistant directly into Microsoft Excel and connect it to real-time market data from some of the world's most influential financial information providers.

The San Francisco-based AI startup announced Monday it is releasing Claude for Excel, allowing financial analysts to interact with the AI system directly within their spreadsheets — the quintessential tool of modern finance. Beyond Excel, select Claude models are also being made available in Microsoft Copilot Studio and Researcher agent, expanding the integration across Microsoft's enterprise AI ecosystem. The integration marks a significant escalation in Anthropic's campaign to position itself as the AI platform of choice for banks, asset managers, and insurance companies, markets where precision and regulatory compliance matter far more than creative flair.

The expansion comes just three months after Anthropic launched its Financial Analysis Solution in July, and it signals the company's determination to capture market share in an industry projected to spend $97 billion on AI by 2027, up from $35 billion in 2023.

More importantly, it positions Anthropic to compete directly with Microsoft — ironically, its partner in this Excel integration — which has its own Copilot AI assistant embedded across its Office suite, and with OpenAI, which counts Microsoft as its largest investor.

Why Excel has become the new battleground for AI in finance

The decision to build directly into Excel is hardly accidental. Excel remains the lingua franca of finance, the digital workspace where analysts spend countless hours constructing financial models, running valuations, and stress-testing assumptions. By embedding Claude into this environment, Anthropic is meeting financial professionals exactly where they work rather than asking them to toggle between applications.

Claude for Excel lets users work with the AI in a sidebar where it can read, analyze, modify, and create Excel workbooks, with full transparency into the actions it takes: it tracks and explains its changes and lets users navigate directly to referenced cells.

This transparency feature addresses one of the most persistent anxieties around AI in finance: the "black box" problem. When billions of dollars ride on a financial model's output, analysts need to understand not just the answer but how the AI arrived at it. By showing its work at the cell level, Anthropic is attempting to build the trust necessary for widespread adoption in an industry where careers and fortunes can turn on a misplaced decimal point.

The technical implementation is sophisticated. Claude can discuss how spreadsheets work, modify them while preserving formula dependencies — a notoriously complex task — debug cell formulas, populate templates with new data, or build entirely new spreadsheets from scratch. This isn't merely a chatbot that answers questions about your data; it's a collaborative tool that can actively manipulate the models that drive investment decisions worth trillions of dollars.
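
To see why preserving formula dependencies is tricky, consider the bookkeeping an editing tool has to do before touching a single cell. A minimal sketch follows, assuming the openpyxl library and a hypothetical model.xlsx workbook; real tools would also need to handle ranges, cross-sheet references, and named ranges.

```python
# Minimal sketch: mapping formula dependencies in a workbook before editing it.
# Assumes openpyxl and a hypothetical "model.xlsx"; ranges, cross-sheet references,
# and named ranges are deliberately omitted to keep the idea visible.
import re
from openpyxl import load_workbook

CELL_REF = re.compile(r"\$?([A-Z]{1,3})\$?(\d+)")

wb = load_workbook("model.xlsx", data_only=False)   # keep formulas, not cached values
deps = {}                                           # cell -> set of cells it references

for ws in wb.worksheets:
    for row in ws.iter_rows():
        for cell in row:
            if isinstance(cell.value, str) and cell.value.startswith("="):
                refs = {f"{col}{num}" for col, num in CELL_REF.findall(cell.value)}
                deps[f"{ws.title}!{cell.coordinate}"] = refs

def dependents(target: str) -> list[str]:
    """Formulas that directly reference the target cell and would be affected by editing it."""
    return [cell for cell, refs in deps.items() if target in refs]

print(dependents("B4"))
```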

How Anthropic is building data moats around its financial AI platform

Perhaps more significant than the Excel integration is Anthropic's expansion of its connector ecosystem, which now links Claude to live market data and proprietary research from financial information giants. The company added six major new data partnerships spanning the entire spectrum of financial information that professional investors rely upon.

Aiera now provides Claude with real-time earnings call transcripts and summaries of investor events like shareholder meetings, presentations, and conferences. The Aiera connector also enables a data feed from Third Bridge, which gives Claude access to a library of insights interviews, company intelligence, and industry analysis from experts and former executives. Chronograph gives private equity investors operational and financial information for portfolio monitoring and conducting due diligence, including performance metrics, valuations, and fund-level data.

Egnyte enables Claude to securely search permitted data for internal data rooms, investment documents, and approved financial models while maintaining governed access controls. LSEG, the London Stock Exchange Group, connects Claude to live market data including fixed income pricing, equities, foreign exchange rates, macroeconomic indicators, and analysts' estimates of other important financial metrics. Moody's provides access to proprietary credit ratings, research, and company data covering ownership, financials, and news on more than 600 million public and private companies, supporting work and research in compliance, credit analysis, and business development. MT Newswires provides Claude with access to the latest global multi-asset class news on financial markets and economies.

These partnerships amount to a land grab for the informational infrastructure that powers modern finance. Previously announced in July, Anthropic had already secured integrations with S&P Capital IQ, Daloopa, Morningstar, FactSet, PitchBook, Snowflake, and Databricks. Together, these connectors give Claude access to virtually every category of financial data an analyst might need: fundamental company data, market prices, credit assessments, private company intelligence, alternative data, and breaking news.

This matters because the quality of AI outputs depends entirely on the quality of inputs. Generic large language models trained on public internet data simply cannot compete with systems that have direct pipelines to Bloomberg-quality financial information. By securing these partnerships, Anthropic is building moats around its financial services offering that competitors will find difficult to replicate.

The strategic calculus here is clear: Anthropic is betting that domain-specific AI systems with privileged access to proprietary data will outcompete general-purpose AI assistants. It's a direct challenge to the "one AI to rule them all" approach favored by some competitors.

Pre-configured workflows target the daily grind of Wall Street analysts

The third pillar of Anthropic’s announcement involves six new “Agent Skills” — pre-configured workflows for common financial tasks. The skills are Anthropic’s attempt to productize and automate the day-to-day work of entry-level and mid-level financial analysts: building models, processing due diligence documents, and writing research reports.

The new skills include building discounted cash flow models complete with full free cash flow projections, weighted average cost of capital calculations, scenario toggles, and sensitivity tables. There's comparable company analysis featuring valuation multiples and operating metrics that can be easily refreshed with updated data. Claude can now process data room documents into Excel spreadsheets populated with financial information, customer lists, and contract terms. It can create company teasers and profiles for pitch books and buyer lists, perform earnings analyses that use quarterly transcripts and financials to extract important metrics, guidance changes, and management commentary, and produce initiating coverage reports with industry analysis, company deep dives, and valuation frameworks.
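
The arithmetic behind the DCF skill is standard finance: discount projected free cash flows and a terminal value back to the present at the weighted average cost of capital. A minimal sketch with made-up cash flows and rates, purely for illustration:

```python
# Minimal DCF sketch: present value of projected free cash flows plus a
# Gordon-growth terminal value, discounted at WACC. All figures are made up.

def dcf_value(fcf: list[float], wacc: float, terminal_growth: float) -> float:
    pv_fcf = sum(cf / (1 + wacc) ** t for t, cf in enumerate(fcf, start=1))
    terminal = fcf[-1] * (1 + terminal_growth) / (wacc - terminal_growth)
    pv_terminal = terminal / (1 + wacc) ** len(fcf)
    return pv_fcf + pv_terminal

# Five-year projection ($M), 9% WACC, 2.5% perpetual growth (illustrative only).
projected_fcf = [120.0, 132.0, 145.0, 158.0, 170.0]
print(f"Enterprise value: ${dcf_value(projected_fcf, 0.09, 0.025):,.0f}M")
```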

It's worth noting that Anthropic's Sonnet 4.5 model now tops the Finance Agent benchmark from Vals AI at 55.3% accuracy, a metric designed to test AI systems on tasks expected of entry-level financial analysts. A 55% accuracy rate might sound underwhelming, but it is state-of-the-art performance and highlights both the promise and limitations of AI in finance. The technology can clearly handle sophisticated analytical tasks, but it's not yet reliable enough to operate autonomously without human oversight — a reality that may actually reassure both regulators and the analysts whose jobs might otherwise be at risk.

The Agent Skills approach is particularly clever because it packages AI capabilities in terms that financial institutions already understand. Rather than selling generic "AI assistance," Anthropic is offering solutions to specific, well-defined problems: "You need a DCF model? We have a skill for that. You need to analyze earnings calls? We have a skill for that too."

Trillion-dollar clients are already seeing massive productivity gains

Anthropic's financial services strategy appears to be gaining traction with exactly the kind of marquee clients that matter in enterprise sales. The company counts among its clients AIA Labs at Bridgewater, Commonwealth Bank of Australia, American International Group, and Norges Bank Investment Management — Norway's $1.6 trillion sovereign wealth fund, one of the world's largest institutional investors.

NBIM CEO Nicolai Tangen reported achieving approximately 20% productivity gains, equivalent to 213,000 hours, with portfolio managers and risk departments now able to "seamlessly query our Snowflake data warehouse and analyze earnings calls with unprecedented efficiency."

At AIG, CEO Peter Zaffino said the partnership has "compressed the timeline to review business by more than 5x in our early rollouts while simultaneously improving our data accuracy from 75% to over 90%." If these numbers hold across broader deployments, the productivity implications for the financial services industry are staggering.

These aren't pilot programs or proof-of-concept deployments; they're production implementations at institutions managing trillions of dollars in assets and making underwriting decisions that affect millions of customers. Their public endorsements provide the social proof that typically drives enterprise adoption in conservative industries.

Regulatory uncertainty creates both opportunity and risk for AI deployment

Yet Anthropic's financial services ambitions unfold against a backdrop of heightened regulatory scrutiny and shifting enforcement priorities. In 2023, the Consumer Financial Protection Bureau released guidance requiring lenders to "use specific and accurate reasons when taking adverse actions against consumers" involving AI, and issued additional guidance requiring regulated entities to "evaluate their underwriting models for bias" and "evaluate automated collateral-valuation and appraisal processes in ways that minimize bias."

However, according to a Brookings Institution analysis, these measures have since been revoked, with related work stopped or eliminated at a downsized CFPB under the current administration, creating regulatory uncertainty. The pendulum has swung from the Biden administration’s cautious approach, exemplified by an executive order on safe AI development, toward the Trump administration’s “America’s AI Action Plan,” which seeks to “cement U.S. dominance in artificial intelligence” through deregulation.

This regulatory flux creates both opportunities and risks. Financial institutions eager to deploy AI now face less prescriptive federal oversight, potentially accelerating adoption. But the absence of clear guardrails also exposes them to potential liability if AI systems produce discriminatory outcomes, particularly in lending and underwriting.

The Massachusetts Attorney General recently reached a $2.5 million settlement with student loan company Earnest Operations, alleging that its use of AI models resulted in "disparate impact in approval rates and loan terms, specifically disadvantaging Black and Hispanic applicants." Such cases will likely multiply as AI deployment grows, creating a patchwork of state-level enforcement even as federal oversight recedes.

Anthropic appears acutely aware of these risks. In an interview with Banking Dive, Jonathan Pelosi, Anthropic's global head of industry for financial services, emphasized that Claude requires a "human in the loop." The platform, he said, is not intended for autonomous financial decision-making or to provide stock recommendations that users follow blindly. During client onboarding, Pelosi told the publication, Anthropic focuses on training and understanding model limitations, putting guardrails in place so people treat Claude as a helpful technology rather than a replacement for human judgment.

Competition heats up as every major tech company targets finance AI

Anthropic's financial services push comes as AI competition intensifies across the enterprise. OpenAI, Microsoft, Google, and numerous startups are all vying for position in what may become one of AI's most lucrative verticals. Goldman Sachs introduced a generative AI assistant to its bankers, traders, and asset managers in January, signaling that major banks may build their own capabilities rather than rely exclusively on third-party providers.

The emergence of domain-specific AI models like BloombergGPT — trained specifically on financial data — suggests the market may fragment between generalized AI assistants and specialized tools. Anthropic’s strategy appears to stake out a middle ground: general-purpose models (Claude was not trained exclusively on financial data) enhanced with finance-specific tooling, data access, and workflows.

The company's partnership strategy with implementation consultancies including Deloitte, KPMG, PwC, Slalom, TribeAI, and Turing is equally critical. These firms serve as force multipliers, embedding Anthropic's technology into their own service offerings and providing the change management expertise that financial institutions need to successfully adopt AI at scale.

CFOs worry about AI hallucinations and cascading errors

The broader question is whether AI tools like Claude will genuinely transform financial services productivity or merely shift work around. The PYMNTS Intelligence report "The Agentic Trust Gap" found that chief financial officers remain hesitant about AI agents, with "nagging concern" about hallucinations where "an AI agent can go off script and expose firms to cascading payment errors and other inaccuracies."

"For finance leaders, the message is stark: Harness AI's momentum now, but build the guardrails before the next quarterly call—or risk owning the fallout," the report warned.

A 2025 KPMG report found that 70% of board members have developed responsible use policies for employees, with other popular initiatives including implementing a recognized AI risk and governance framework, developing ethical guidelines and training programs for AI developers, and conducting regular AI use audits.

The financial services industry faces a delicate balancing act: move too slowly and risk competitive disadvantage as rivals achieve productivity gains; move too quickly and risk operational failures, regulatory penalties, or reputational damage. Speaking at the Evident AI Symposium in New York last week, Ian Glasner, HSBC's group head of emerging technology, innovation and ventures, struck an optimistic tone about the sector's readiness for AI adoption. "As an industry, we are very well prepared to manage risk," he said, according to CIO Dive. "Let's not overcomplicate this. We just need to be focused on the business use case and the value associated."

Anthropic's latest moves suggest the company sees financial services as a beachhead market where AI's value proposition is clear, customers have deep pockets, and the technical requirements play to Claude's strengths in reasoning and accuracy. By building Excel integration, securing data partnerships, and pre-packaging common workflows, Anthropic is reducing the friction that typically slows enterprise AI adoption.

The $61.5 billion valuation the company commanded in its March fundraising round — up from roughly $16 billion a year earlier — suggests investors believe this strategy will work. But the real test will come as these tools move from pilot programs to production deployments across thousands of analysts and billions of dollars in transactions.

Financial services may prove to be AI's most demanding proving ground: an industry where mistakes are costly, regulation is stringent, and trust is everything. If Claude can successfully navigate the spreadsheet cells and data feeds of Wall Street without hallucinating a decimal point in the wrong direction, Anthropic will have accomplished something far more valuable than winning another benchmark test. It will have proven that AI can be trusted with the money.

Thinking Machines challenges OpenAI's AI scaling strategy: 'First superintelligence will be a superhuman learner'

While the world's leading artificial intelligence companies race to build ever-larger models, betting billions that scale alone will unlock artificial general intelligence, a researcher at one of the industry's most secretive and valuable startups delivered a pointed challenge to that orthodoxy this week: The path forward isn't about training bigger — it's about learning better.

"I believe that the first superintelligence will be a superhuman learner," Rafael Rafailov, a reinforcement learning researcher at Thinking Machines Lab, told an audience at TED AI San Francisco on Tuesday. "It will be able to very efficiently figure out and adapt, propose its own theories, propose experiments, use the environment to verify that, get information, and iterate that process."

This breaks sharply with the approach pursued by OpenAI, Anthropic, Google DeepMind, and other leading laboratories, which have bet billions on scaling up model size, data, and compute to achieve increasingly sophisticated reasoning capabilities. Rafailov argues these companies have the strategy backwards: what's missing from today's most advanced AI systems isn't more scale — it's the ability to actually learn from experience.

“Learning is something an intelligent being does,” Rafailov said, citing a quote he said he had recently found compelling. “Training is something that’s being done to it.”

The distinction cuts to the core of how AI systems improve — and whether the industry's current trajectory can deliver on its most ambitious promises. Rafailov's comments offer a rare window into the thinking at Thinking Machines Lab, the startup co-founded in February by former OpenAI chief technology officer Mira Murati that raised a record-breaking $2 billion in seed funding at a $12 billion valuation.

Why today's AI coding assistants forget everything they learned yesterday

To illustrate the problem with current AI systems, Rafailov offered a scenario familiar to anyone who has worked with today's most advanced coding assistants.

"If you use a coding agent, ask it to do something really difficult — to implement a feature, go read your code, try to understand your code, reason about your code, implement something, iterate — it might be successful," he explained. "And then come back the next day and ask it to implement the next feature, and it will do the same thing."

The issue, he argued, is that these systems don't internalize what they learn. "In a sense, for the models we have today, every day is their first day of the job," Rafailov said. "But an intelligent being should be able to internalize information. It should be able to adapt. It should be able to modify its behavior so every day it becomes better, every day it knows more, every day it works faster — the way a human you hire gets better at the job."

The duct tape problem: How current training methods teach AI to take shortcuts instead of solving problems

Rafailov pointed to a specific behavior in coding agents that reveals the deeper problem: their tendency to wrap uncertain code in try/except blocks — a programming construct that catches errors and allows a program to continue running.

"If you use coding agents, you might have observed a very annoying tendency of them to use try/except pass," he said. "And in general, that is basically just like duct tape to save the entire program from a single error."

Why do agents do this? "They do this because they understand that part of the code might not be right," Rafailov explained. "They understand there might be something wrong, that it might be risky. But under the limited constraint—they have a limited amount of time solving the problem, limited amount of interaction—they must only focus on their objective, which is implement this feature and solve this bug."

The result: "They're kicking the can down the road."

This behavior stems from training systems that optimize for immediate task completion. "The only thing that matters to our current generation is solving the task," he said. "And anything that's general, anything that's not related to just that one objective, is a waste of computation."

Why throwing more compute at AI won't create superintelligence, according to Thinking Machines researcher

Rafailov's most direct challenge to the industry came in his assertion that continued scaling won't be sufficient to reach AGI.

"I don't believe we're hitting any sort of saturation points," he clarified. "I think we're just at the beginning of the next paradigm—the scale of reinforcement learning, in which we move from teaching our models how to think, how to explore thinking space, into endowing them with the capability of general agents."

In other words, current approaches will produce increasingly capable systems that can interact with the world, browse the web, write code. "I believe a year or two from now, we'll look at our coding agents today, research agents or browsing agents, the way we look at summarization models or translation models from several years ago," he said.

But general agency, he argued, is not the same as general intelligence. "The much more interesting question is: Is that going to be AGI? And are we done — do we just need one more round of scaling, one more round of environments, one more round of RL, one more round of compute, and we're kind of done?"

His answer was unequivocal: "I don't believe this is the case. I believe that under our current paradigms, under any scale, we are not enough to deal with artificial general intelligence and artificial superintelligence. And I believe that under our current paradigms, our current models will lack one core capability, and that is learning."

Teaching AI like students, not calculators: The textbook approach to machine learning

To explain the alternative approach, Rafailov turned to an analogy from mathematics education.

"Think about how we train our current generation of reasoning models," he said. "We take a particular math problem, make it very hard, and try to solve it, rewarding the model for solving it. And that's it. Once that experience is done, the model submits a solution. Anything it discovers—any abstractions it learned, any theorems—we discard, and then we ask it to solve a new problem, and it has to come up with the same abstractions all over again."

That approach misunderstands how knowledge accumulates. "This is not how science or mathematics works," he said. "We build abstractions not necessarily because they solve our current problems, but because they're important. For example, we developed the field of topology to extend Euclidean geometry — not to solve a particular problem that Euclidean geometry couldn't handle, but because mathematicians and physicists understood these concepts were fundamentally important."

The solution: "Instead of giving our models a single problem, we might give them a textbook. Imagine a very advanced graduate-level textbook, and we ask our models to work through the first chapter, then the first exercise, the second exercise, the third, the fourth, then move to the second chapter, and so on—the way a real student might teach themselves a topic."

The objective would fundamentally change: "Instead of rewarding their success — how many problems they solved — we need to reward their progress, their ability to learn, and their ability to improve."
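
A minimal sketch of what "rewarding progress" could look like in code, under the assumption that progress is measured as improvement on held-out exercises after studying a chapter; the ToyLearner and the exercises are invented for illustration and stand in for a real model and benchmark.

```python
# Toy illustration: the reward is the *delta* in held-out performance after
# studying a chapter, not the number of chapter problems solved. Everything
# here (ToyLearner, topics, exercises) is invented for the sketch.
class ToyLearner:
    def __init__(self):
        self.topics_known = set()

    def solve(self, exercise) -> bool:                  # solvable only if its
        return exercise["topic"] in self.topics_known   # topic has been studied

    def study(self, exercise):                          # "internalizes" the topic
        self.topics_known.add(exercise["topic"])

def evaluate(model, exercises) -> float:
    return sum(model.solve(e) for e in exercises) / len(exercises)

def progress_reward(model, chapter, held_out) -> float:
    before = evaluate(model, held_out)
    for exercise in chapter:
        model.study(exercise)
    after = evaluate(model, held_out)
    return after - before                               # reward learning, not raw success

learner = ToyLearner()
chapter = [{"topic": "limits"}, {"topic": "derivatives"}]
held_out = [{"topic": "derivatives"}, {"topic": "integrals"}]
print(progress_reward(learner, chapter, held_out))      # 0.5: held-out score rose from 0.0 to 0.5
```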

This approach, known as "meta-learning" or "learning to learn," has precedents in earlier AI systems. "Just like the ideas of scaling test-time compute and search and test-time exploration played out in the domain of games first" — in systems like DeepMind's AlphaGo — "the same is true for meta learning. We know that these ideas do work at a small scale, but we need to adapt them to the scale and the capability of foundation models."

The missing ingredients for AI that truly learns aren't new architectures—they're better data and smarter objectives

When Rafailov addressed why current models lack this learning capability, he offered a surprisingly straightforward answer.

"Unfortunately, I think the answer is quite prosaic," he said. "I think we just don't have the right data, and we don't have the right objectives. I fundamentally believe a lot of the core architectural engineering design is in place."

Rather than arguing for entirely new model architectures, Rafailov suggested the path forward lies in redesigning the data distributions and reward structures used to train models.

"Learning, in of itself, is an algorithm," he explained. "It has inputs — the current state of the model. It has data and compute. You process it through some sort of structure, choose your favorite optimization algorithm, and you produce, hopefully, a stronger model."

The question: "If reasoning models are able to learn general reasoning algorithms, general search algorithms, and agent models are able to learn general agency, can the next generation of AI learn a learning algorithm itself?"

His answer: "I strongly believe that the answer to this question is yes."

The technical approach would involve creating training environments where "learning, adaptation, exploration, and self-improvement, as well as generalization, are necessary for success."

"I believe that under enough computational resources and with broad enough coverage, general purpose learning algorithms can emerge from large scale training," Rafailov said. "The way we train our models to reason in general over just math and code, and potentially act in general domains, we might be able to teach them how to learn efficiently across many different applications."

Forget god-like reasoners: The first superintelligence will be a master student

This vision leads to a fundamentally different conception of what artificial superintelligence might look like.

"I believe that if this is possible, that's the final missing piece to achieve truly efficient general intelligence," Rafailov said. "Now imagine such an intelligence with the core objective of exploring, learning, acquiring information, self-improving, equipped with general agency capability—the ability to understand and explore the external world, the ability to use computers, ability to do research, ability to manage and control robots."

Such a system would constitute artificial superintelligence. But not the kind often imagined in science fiction.

"I believe that intelligence is not going to be a single god model that's a god-level reasoner or a god-level mathematical problem solver," Rafailov said. "I believe that the first superintelligence will be a superhuman learner, and it will be able to very efficiently figure out and adapt, propose its own theories, propose experiments, use the environment to verify that, get information, and iterate that process."

This vision stands in contrast to OpenAI's emphasis on building increasingly powerful reasoning systems, or Anthropic's focus on "constitutional AI." Instead, Thinking Machines Lab appears to be betting that the path to superintelligence runs through systems that can continuously improve themselves through interaction with their environment.

The $12 billion bet on learning over scaling faces formidable challenges

Rafailov's appearance comes at a complex moment for Thinking Machines Lab. The company has assembled an impressive team of approximately 30 researchers from OpenAI, Google, Meta, and other leading labs. But it suffered a setback in early October when Andrew Tulloch, a co-founder and machine learning expert, departed to return to Meta. The social media giant had launched what The Wall Street Journal called a "full-scale raid" on the startup, approaching more than a dozen employees with compensation packages ranging from $200 million to $1.5 billion over multiple years.

Despite these pressures, Rafailov's comments suggest the company remains committed to its differentiated technical approach. The company launched its first product, Tinker, an API for fine-tuning open-source language models, in October. But Rafailov's talk suggests Tinker is just the foundation for a much more ambitious research agenda focused on meta-learning and self-improving systems.

"This is not easy. This is going to be very difficult," Rafailov acknowledged. "We'll need a lot of breakthroughs in memory and engineering and data and optimization, but I think it's fundamentally possible."

He concluded with a play on words: "The world is not enough, but we need the right experiences, and we need the right type of rewards for learning."

The question for Thinking Machines Lab — and the broader AI industry — is whether this vision can be realized, and on what timeline. Rafailov notably did not offer specific predictions about when such systems might emerge.

In an industry where executives routinely make bold predictions about AGI arriving within years or even months, that restraint is notable. It suggests either unusual scientific humility — or an acknowledgment that Thinking Machines Lab is pursuing a much longer, harder path than its competitors.

For now, the most revealing detail may be what Rafailov didn't say during his TED AI presentation. No timeline for when superhuman learners might emerge. No prediction about when the technical breakthroughs would arrive. Just a conviction that the capability was "fundamentally possible" — and that without it, all the scaling in the world won't be enough.

‘AI is tearing companies apart’: Writer AI CEO slams Fortune 500 leaders for mismanaging tech

May Habib, co-founder and CEO of Writer AI, delivered one of the bluntest assessments of corporate AI failures at the TED AI conference on Tuesday, revealing that nearly half of Fortune 500 executives believe artificial intelligence is actively damaging their organizations — and placing the blame squarely on leadership's shoulders.

The problem, according to Habib, isn't the technology. It's that business leaders are making a category error, treating AI transformation like previous technology rollouts and delegating it to IT departments. This approach, she warned, has led to "billions of dollars spent on AI initiatives that are going nowhere."

"Earlier this year, we did a survey of 800 Fortune 500 C-suite executives," Habib told the audience of Silicon Valley executives and investors. "42% of them said AI is tearing their company apart."

The diagnosis challenges conventional wisdom about how enterprises should approach AI adoption. While most major companies have stood up AI task forces, appointed chief AI officers, or expanded IT budgets, Habib argues these moves reflect a fundamental misunderstanding of what AI represents: not another software tool, but a wholesale reorganization of how work gets done.

"There is something leaders are missing when they compare AI to just another tech tool," Habib said. "This is not like giving accountants calculators or bankers Excel or designers Photoshop."

Why the 'old playbook' of delegating to IT departments is failing companies

Habib, whose company has spent five years building AI systems for Fortune 500 companies and logged two million miles visiting customer sites, said the pattern is consistent: "When generative AI started showing up, we turned to the old playbook. We turned to IT and said, 'Go figure this out.'"

That approach fails, she argued, because AI fundamentally changes the economics and organization of work itself. "For 100 years, enterprises have been built around the idea that execution is expensive and hard," Habib said. "The enterprise built complex org charts, complex processes, all to manage people doing stuff."

AI inverts that model. "Execution is going from scarce and expensive to programmatic, on-demand and abundant," she said. In this new paradigm, the bottleneck shifts from execution capacity to strategic design — a shift that requires business leaders, not IT departments, to drive transformation.

"With AI technology, it can no longer be centralized. It's in every workflow, every business," Habib said. "It is now the most important part of a business leader's job. It cannot be delegated."

The statement represents a direct challenge to how most large organizations have structured their AI initiatives, with centralized centers of excellence, dedicated AI teams, or IT-led implementations that business units are expected to adopt.

A generational power shift is happening based on who understands AI workflow design

Habib framed the shift in dramatic terms: "A generational transfer of power is happening right now. It's not about your age or how long you've been at a company. The generational transfer of power is about the nature of leadership itself."

Traditional leadership, she argued, has been defined by the ability to manage complexity — big teams, big budgets, intricate processes. "The identity of leaders at these companies, people like us, has been tied to old school power structures: control, hierarchy, how big our teams are, how big our budgets are. Our value is measured by the sheer amount of complexity we could manage," Habib said. "Today we reward leaders for this. We promote leaders for this."

AI makes that model obsolete. "When I am able to 10x the output of my team or do things that could never be possible, work is no longer about the 1x," she said. "Leadership is no longer about managing complex human execution."

Instead, Habib outlined three fundamental shifts that define what she calls "AI-first leaders" — executives her company has worked with who have successfully deployed AI agents solving "$100 million plus problems."

The first shift: Taking a machete to enterprise complexity

The new leadership mandate, according to Habib, is "taking a machete to the complexity that has calcified so many organizations." She pointed to the layers of friction that have accumulated in enterprises: "Brilliant ideas dying in memos, the endless cycles of approvals, the death by 1,000 clicks, meetings about meetings — a death, by the way, that's happening in 17 different browser tabs each for software that promises to be a single source of truth."

Rather than accepting this complexity as inevitable, AI-first leaders redesign workflows from first principles. "There are very few legacy systems that can't be replaced in your organization, that won't be replaced," Habib said. "But they're not going to be replaced by another monolithic piece of software. They can only be replaced by a business leader articulating business logic and getting that into an agentic system."

She offered a concrete example: "We have customers where it used to take them seven months to get a creative campaign — not even a product, a campaign. Now they can go from TikTok trend to digital shelf in 30 days. That is radical simplicity."

The catch, she emphasized, is that CIOs can't drive this transformation alone. "Your CIO can't help flatten your org chart. Only a business leader can look at workflows and say, 'This part is necessary genius, this part is bureaucratic scar tissue that has to go.'"

The second shift: Managing the fear as career ladders disappear

When AI handles execution, "your humans are liberated to do what they're amazing at: judgment, strategy, creativity," Habib explained. "The old leadership playbook was about managing headcount. We managed people against revenue: one business development rep for every three account executives, one marketer for every five salespeople."

But this liberation carries profound challenges that leaders must address directly. Habib acknowledged the elephant in the room that many executives avoid discussing: "These changes are still frightening for people, even when it's become unholy to talk about it." She's witnessed the fear firsthand. "It shows up as tears in an AI workshop when someone feels like their old skill set isn't translated to the new."

She introduced a term for a common form of resistance: "productivity anchoring" — when employees "cling to the hard way of doing things because they feel productive, because their self-worth is tied to them, even when empirically AI can be better."

The solution isn't to look away. "We have to design new pathways to impact, to show your people their value is not in executing a task. Their value is in orchestrating systems of execution, to ask the next great question," Habib said. She advocates replacing career "ladders" with "lattices" where "people need to grow laterally, to expand sideways."

She was candid about the disruption: "The first rungs on our career ladders are indeed going away. I know because my company is automating them." But she insisted this creates opportunity for work that is "more creative, more strategic, more driven by curiosity and impact — and I believe a lot more human than the jobs that they're replacing."

The third shift: When execution becomes free, ambition becomes the only bottleneck

The final shift is from optimization to creation. "Before AI, we used to call it transformation when we took 12 steps and made them nine," Habib said. "That's optimizing the world as it is. We can now create a new world. That is the greenfield mindset."

She challenged executives to identify assumptions their industries are built on that AI now disrupts. Writer's customers, she said, are already seeing new categories of growth: treating every customer like their only customer, democratizing premium services to broader markets, and entering new markets at unprecedented speed because "AI strips away the friction to access new channels."

"When execution is abundant, the only bottleneck is the scope of your own ambition," Habib declared.

What this means for CIOs: Building the stadium while business leaders design the plays

Habib didn't leave IT leaders without a role — she redefined it. "If tech is everyone's job, you might be asking, what is mine?" she addressed CIOs. "Yours is to provide the mission critical infrastructure that makes this revolution possible."

As tens or hundreds of thousands of AI agents operate at various levels of autonomy within organizations, "governance becomes existential," she explained. "The business leader's job is to design the play, but you have to build the stadium, you have to write the rule book, and you have to make sure these plays can win at championship scale."

The formulation suggests a partnership model: business leaders drive workflow redesign and strategic implementation while IT provides the infrastructure, governance frameworks, and security guardrails that make mass AI deployment safe and scalable. "One can't succeed without the other," Habib said.

For CIOs and technical leaders, this represents a fundamental shift from gatekeeper to enabler. When business units deploy agents autonomously, IT faces governance challenges unlike anything in enterprise software history. Success requires genuine partnership between business and IT — neither can succeed alone, forcing cultural changes in how these functions collaborate.

A real example: From multi-day scrambles to instant answers during a market crisis

To ground her arguments in concrete business impact, Habib described working with the chief client officer of a Fortune 500 wealth advisory firm during recent market volatility following tariff announcements.

"Their phone was ringing off the hook with customers trying to figure out their market exposure," she recounted. "Every request kicked off a multi-day, multi-person scramble: a portfolio manager ran the show, an analyst pulled charts, a relationship manager built the PowerPoint, a compliance officer had to review everything for disclosures. And the leader in all this — she was forwarding emails and chasing updates. This is the top job: managing complexity."

With an agentic AI system, the same work happens programmatically. "A system of agents is able to assemble the answer faster than any number of people could have. No more midnight deck reviews. No more days on end" of coordination, Habib said.

This isn't about marginal productivity gains — it's about fundamentally different operating models where senior executives shift from managing coordination to designing intelligent systems.

Why so many AI initiatives are failing despite massive investment

Habib's arguments arrive as many enterprises face AI disillusionment. After initial excitement about generative AI, many companies have struggled to move beyond pilots and demonstrations to production deployments generating tangible business value.

Her diagnosis — that leaders are delegating rather than driving transformation — aligns with growing evidence that organizational factors, not technical limitations, explain most failures. Companies often lack clarity on use cases, struggle with data preparation, or face internal resistance to workflow changes that AI requires.

Perhaps the most striking aspect of Habib's presentation was her willingness to acknowledge the human cost of AI transformation — and insist leaders address it rather than avoid it. "Your job as a leader is to not look away from this fear. Your job is to face it with a plan," she told the audience.

She described "productivity anchoring" as a form of "self-sabotage" where employees resist AI adoption because their identity and self-worth are tied to execution tasks AI can now perform. The phenomenon suggests that successful AI transformation requires not just technical and strategic changes but psychological and cultural work that many leaders may be unprepared for.

Two challenges: Get your hands dirty, then reimagine everything

Habib closed by throwing down two gauntlets to her executive audience.

"First, a small one: get your hands dirty with agentic AI. Don't delegate. Choose a process that you oversee and automate it. See the difference from managing a complex process to redesigning it for yourself."

The second was more ambitious: "Go back to your team and ask, what could we achieve if execution were free? What would work feel like, be like, look like if you're unbound from the friction and process that slows us down today?"

She concluded: "The tools for creation are in your hands. The mandate for leadership is on your shoulders. What will you build?"

For enterprise leaders accustomed to viewing AI as an IT initiative, Habib's message is clear: that approach isn't working, won't work, and reflects a fundamental misunderstanding of what AI represents. Whether executives embrace her call to personally drive transformation — or continue delegating to IT departments — may determine which organizations thrive and which become cautionary tales.

The statistic she opened with lingers uncomfortably: 42% of Fortune 500 C-suite executives say AI is tearing their companies apart. Habib's diagnosis suggests they're tearing themselves apart by clinging to organizational models designed for an era when execution was scarce. The cure she prescribes requires leaders to do something most find uncomfortable: stop managing complexity and start dismantling it.

Sakana AI's CTO says he's 'absolutely sick' of transformers, the tech that powers every major AI model

In a striking act of self-critique, one of the architects of the transformer technology that powers ChatGPT, Claude, and virtually every major AI system told an audience of industry leaders this week that artificial intelligence research has become dangerously narrow — and that he's moving on from his own creation.

Llion Jones, who co-authored the seminal 2017 paper "Attention Is All You Need" and even coined the name "transformer," delivered an unusually candid assessment at the TED AI conference in San Francisco on Tuesday: Despite unprecedented investment and talent flooding into AI, the field has calcified around a single architectural approach, potentially blinding researchers to the next major breakthrough.

"Despite the fact that there's never been so much interest and resources and money and talent, this has somehow caused the narrowing of the research that we're doing," Jones told the audience. The culprit, he argued, is the "immense amount of pressure" from investors demanding returns and researchers scrambling to stand out in an overcrowded field.

The warning carries particular weight given Jones's role in AI history. The transformer architecture he helped develop at Google has become the foundation of the generative AI boom, enabling systems that can write essays, generate images, and engage in human-like conversation. His paper has been cited more than 100,000 times, making it one of the most influential computer science publications of the century.

Now, as CTO and co-founder of Tokyo-based Sakana AI, Jones is explicitly abandoning his own creation. "I personally made a decision in the beginning of this year that I'm going to drastically reduce the amount of time that I spend on transformers," he said. "I'm explicitly now exploring and looking for the next big thing."

Why more AI funding has led to less creative research, according to a transformer pioneer

Jones painted a picture of an AI research community suffering from what he called a paradox: More resources have led to less creativity. He described researchers constantly checking whether they've been "scooped" by competitors working on identical ideas, and academics choosing safe, publishable projects over risky, potentially transformative ones.

"If you're doing standard AI research right now, you kind of have to assume that there's maybe three or four other groups doing something very similar, or maybe exactly the same," Jones said, describing an environment where "unfortunately, this pressure damages the science, because people are rushing their papers, and it's reducing the amount of creativity."

He drew an analogy from AI itself — the "exploration versus exploitation" trade-off that governs how algorithms search for solutions. When a system exploits too much and explores too little, it finds mediocre local solutions while missing superior alternatives. "We are almost certainly in that situation right now in the AI industry," Jones argued.
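
The trade-off he is referring to can be seen in a toy bandit simulation; the payoff numbers and epsilon values below are invented purely to illustrate the dynamic, not drawn from his talk.

```python
# Toy two-armed bandit: with no exploration the agent settles on the first,
# mediocre option and never discovers the better one. All numbers are invented.
import random

def run(epsilon: float, steps: int = 5000, seed: int = 0) -> float:
    random.seed(seed)
    true_payoffs = [0.5, 0.8]            # the second option is genuinely better
    estimates, counts, total = [0.0, 0.0], [0, 0], 0.0
    for _ in range(steps):
        if random.random() < epsilon:    # explore: try something at random
            arm = random.randrange(2)
        else:                            # exploit: pick the current best estimate
            arm = max(range(2), key=lambda a: estimates[a])
        reward = 1.0 if random.random() < true_payoffs[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return total / steps

print(f"pure exploitation (epsilon=0.0): {run(0.0):.3f}")  # stuck near 0.5
print(f"some exploration (epsilon=0.1):  {run(0.1):.3f}")  # approaches 0.8
```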

The implications are sobering. Jones recalled the period just before transformers emerged, when researchers were endlessly tweaking recurrent neural networks — the previous dominant architecture — for incremental gains. Once transformers arrived, all that work suddenly seemed irrelevant. "How much time do you think those researchers would have spent trying to improve the recurrent neural network if they knew something like transformers was around the corner?" he asked.

He worries the field is repeating that pattern. "I'm worried that we're in that situation right now where we're just concentrating on one architecture and just permuting it and trying different things, where there might be a breakthrough just around the corner."

How the 'Attention Is All You Need' paper was born from freedom, not pressure

To underscore his point, Jones described the conditions that allowed transformers to emerge in the first place — a stark contrast to today's environment. The project, he said, was "very organic, bottom up," born from "talking over lunch or scrawling randomly on the whiteboard in the office."

Critically, "we didn't actually have a good idea, we had the freedom to actually spend time and go and work on it, and even more importantly, we didn't have any pressure that was coming down from management," Jones recounted. "No pressure to work on any particular project, publish a number of papers to push a certain metric up."

That freedom, Jones suggested, is largely absent today. Even researchers recruited for astronomical salaries — "literally a million dollars a year, in some cases" — may not feel empowered to take risks. "Do you think that when they start their new position they feel empowered to try their wild ideas and more speculative ideas, or do they feel immense pressure to prove their worth and once again, go for the low hanging fruit?" he asked.

Why one AI lab is betting that research freedom beats million-dollar salaries

Jones's proposed solution is deliberately provocative: Turn up the "explore dial" and openly share findings, even at competitive cost. He acknowledged the irony of his position. "It may sound a little controversial to hear one of the Transformers authors stand on stage and tell you that he's absolutely sick of them, but it's kind of fair enough, right? I've been working on them longer than anyone, with the possible exception of seven people."

At Sakana AI, Jones said he's attempting to recreate that pre-transformer environment, with nature-inspired research and minimal pressure to chase publications or compete directly with rivals. He offered researchers a mantra from engineer Brian Cheung: "You should only do the research that wouldn't happen if you weren't doing it."

One example is Sakana's "continuous thought machine," which incorporates brain-like synchronization into neural networks. The employee who pitched the idea told Jones that at previous employers or in academic positions he would have faced skepticism and pressure not to waste time on it. At Sakana, Jones gave him a week to explore. The project became successful enough to be spotlighted at NeurIPS, a major AI conference.

Jones even suggested that freedom beats compensation in recruiting. "It's a really, really good way of getting talent," he said of the exploratory environment. "Think about it, talented, intelligent people, ambitious people, will naturally seek out this kind of environment."

The transformer's success may be blocking AI's next breakthrough

Perhaps most provocatively, Jones suggested transformers may be victims of their own success. "The fact that the current technology is so powerful and flexible... stopped us from looking for better," he said. "It makes sense that if the current technology was worse, more people would be looking for better."

He was careful to clarify that he's not dismissing ongoing transformer research. "There's still plenty of very important work to be done on current technology and bringing a lot of value in the coming years," he said. "I'm just saying that given the amount of talent and resources that we have currently, we can afford to do a lot more."

His ultimate message was one of collaboration over competition. "Genuinely, from my perspective, this is not a competition," Jones concluded. "We all have the same goal. We all want to see this technology progress so that we can all benefit from it. So if we can all collectively turn up the explore dial and then openly share what we find, we can get to our goal much faster."

The high stakes of AI's exploration problem

The remarks arrive at a pivotal moment for artificial intelligence. The industry grapples with mounting evidence that simply building larger transformer models may be approaching diminishing returns. Leading researchers have begun openly discussing whether the current paradigm has fundamental limitations, with some suggesting that architectural innovations — not just scale — will be needed for continued progress toward more capable AI systems.

Jones's warning suggests that finding those innovations may require dismantling the very incentive structures that have driven AI's recent boom. With tens of billions of dollars flowing into AI development annually and fierce competition among labs driving secrecy and rapid publication cycles, the exploratory research environment he described seems increasingly distant.

Yet his insider perspective carries unusual weight. As someone who helped create the technology now dominating the field, Jones understands both what it takes to achieve breakthrough innovation and what the industry risks by abandoning that approach. His decision to walk away from transformers — the architecture that made his reputation — adds credibility to a message that might otherwise sound like contrarian positioning.

Whether AI's power players will heed the call remains uncertain. But Jones offered a pointed reminder of what's at stake: The next transformer-scale breakthrough could be just around the corner, pursued by researchers with the freedom to explore. Or it could be languishing unexplored while thousands of researchers race to publish incremental improvements on architecture that, in Jones's words, one of its creators is "absolutely sick of."

After all, he's been working on transformers longer than almost anyone. He would know when it's time to move on.

Kai-Fu Lee's brutal assessment: America is already losing the AI hardware war to China

China is on track to dominate consumer artificial intelligence applications and robotics manufacturing within years, but the United States will maintain its substantial lead in enterprise AI adoption and cutting-edge research, according to Kai-Fu Lee, one of the world's most prominent AI scientists and investors.

In a rare, unvarnished assessment delivered via video link from Beijing to the TED AI conference in San Francisco Tuesday, Lee — a former executive at Apple, Microsoft, and Google who now runs both a major venture capital firm and his own AI company — laid out a technology landscape splitting along geographic and economic lines, with profound implications for both commercial competition and national security.

"China's robotics has the advantage of having integrated AI into much lower costs, better supply chain and fast turnaround, so companies like Unitree are actually the farthest ahead in the world in terms of building affordable, embodied humanoid AI," Lee said, referring to a Chinese robotics manufacturer that has undercut Western competitors on price while advancing capabilities.

The comments, made to a room filled with Silicon Valley executives, investors, and researchers, represented one of the most detailed public assessments from Lee about the comparative strengths and weaknesses of the world's two AI superpowers — and suggested that the race for artificial intelligence leadership is becoming less a single contest than a series of parallel competitions with different winners.

Why venture capital is flowing in opposite directions in the U.S. and China

At the heart of Lee's analysis lies a fundamental difference in how capital flows in the two countries' innovation ecosystems. American venture capitalists, Lee said, are pouring money into generative AI companies building large language models and enterprise software, while Chinese investors are betting heavily on robotics and hardware.

"The VCs in the US don't fund robotics the way the VCs do in China," Lee said. "Just like the VCs in China don't fund generative AI the way the VCs do in the US."

This investment divergence reflects different economic incentives and market structures. In the United States, where companies have grown accustomed to paying for software subscriptions and where labor costs are high, enterprise AI tools that boost white-collar productivity command premium prices. In China, where software subscription models have historically struggled to gain traction but manufacturing dominates the economy, robotics offers a clearer path to commercialization.

The result, Lee suggested, is that each country is pulling ahead in different domains — and may continue to do so.

"China's got some challenges to overcome in getting a company funded as well as OpenAI or Anthropic," Lee acknowledged, referring to the leading American AI labs. "But I think U.S., on the flip side, will have trouble developing the investment interest and value creation in the robotics" sector.

Why American companies dominate enterprise AI while Chinese firms struggle with subscriptions

Lee was explicit about one area where the United States maintains what appears to be a durable advantage: getting businesses to actually adopt and pay for AI software.

"The enterprise adoption will clearly be led by the United States," Lee said. "The Chinese companies have not yet developed a habit of paying for software on a subscription."

This seemingly mundane difference in business culture — whether companies will pay monthly fees for software — has become a critical factor in the AI race. The explosion of spending on tools like GitHub Copilot, ChatGPT Enterprise, and other AI-powered productivity software has fueled American companies' ability to invest billions in further research and development.

Lee noted that China has historically overcome similar challenges in consumer technology by developing alternative business models. "In the early days of internet software, China was also well behind because people weren't willing to pay for software," he said. "But then advertising models, e-commerce models really propelled China forward."

Still, he suggested, someone will need to "find a new business model that isn't just pay per software per use or per month basis. That's going to not happen in China anytime soon."

The implication: American companies building enterprise AI tools have a window — perhaps a substantial one — where they can generate revenue and reinvest in R&D without facing serious Chinese competition in their core market.

How ByteDance, Alibaba and Tencent will outpace Meta and Google in consumer AI

Where Lee sees China pulling ahead decisively is in consumer-facing AI applications — the kind embedded in social media, e-commerce, and entertainment platforms that billions of people use daily.

"In terms of consumer usage, that's likely to happen," Lee said, referring to China matching or surpassing the United States in AI deployment. "The Chinese giants, like ByteDance and Alibaba and Tencent, will definitely move a lot faster than their equivalent in the United States, companies like Meta, YouTube and so on."

Lee pointed to a cultural advantage: Chinese technology companies have spent the past decade obsessively optimizing for user engagement and product-market fit in brutally competitive markets. "The Chinese giants really work tenaciously, and they have mastered the art of figuring out product market fit," he said. "Now they have to add technology to it. So that is inevitably going to happen."

This assessment aligns with recent industry observations. ByteDance's TikTok became the world's most downloaded app through sophisticated AI-driven content recommendation, and Chinese companies have pioneered AI-powered features in areas like live-streaming commerce and short-form video that Western companies later copied.

Lee also noted that China has already deployed AI more widely in certain domains. "There are a lot of areas where China has also done a great job, such as using computer vision, speech recognition, and translation more widely," he said.

The surprising open-source shift that has Chinese models beating Meta's Llama

Perhaps Lee's most striking data point concerned open-source AI development — an area where China appears to have seized leadership from American companies in a remarkably short time.

"The 10 highest rated open source [models] are from China," Lee said. "These companies have now eclipsed Meta's Llama, which used to be number one."

This represents a significant shift. Meta's Llama models were widely viewed as the gold standard for open-source large language models as recently as early 2024. But Chinese companies — including Lee's own firm, 01.AI, along with Alibaba, Baidu, and others — have released a flood of open-source models that, according to various benchmarks, now outperform their American counterparts.

The open-source question has become a flashpoint in AI development. Lee made an extensive case for why open-source models will prove essential to the technology's future, even as closed models from companies like OpenAI command higher prices and, often, superior performance.

"I think open source has a number of major advantages," Lee argued. With open-source models, "you can examine it, tune it, improve it. It's yours, and it's free, and it's important for building if you want to build an application or tune the model to do something specific."

He drew an analogy to operating systems: "People who work in operating systems loved Linux, and that's why its adoption went through the roof. And I think in the future, open source will also allow people to tune a sovereign model for a country, make it work better for a particular language."

Still, Lee predicted both approaches will coexist. "I don't think open source models will win," he said. "I think just like we have Apple, which is closed, but provides a somewhat better experience than Android... I think we're going to see more apps using open-source models, more engineers wanting to build open-source models, but I think more money will remain in the closed model."

Why China's manufacturing advantage makes the robotics race 'not over,' but nearly decided

On robotics, Lee's message was blunt: the combination of China's manufacturing prowess, lower costs, and aggressive investment has created an advantage that will be difficult for American companies to overcome.

When asked directly whether the robotics race was already over with China victorious, Lee hedged only slightly. "It's not over, but I think the U.S. is still capable of coming up with the best robotic research ideas," he said. "But the VCs in the U.S. don't fund robotics the way the VCs do in China."

The challenge is structural. Building robots requires not just software and AI, but hardware manufacturing at scale — precisely the kind of integrated supply chain and low-cost production that China has spent decades perfecting. While American labs at universities and companies like Boston Dynamics continue to produce impressive research prototypes, turning those prototypes into affordable commercial products requires the manufacturing ecosystem that China possesses.

Companies like Unitree have demonstrated this advantage concretely. The company's humanoid robots and quadrupedal robots cost a fraction of their American-made equivalents while offering comparable or superior capabilities — a price-to-performance ratio that could prove decisive in commercial markets.

What worries Lee most: not AGI, but the race itself

Despite his generally measured tone about China's AI development, Lee expressed concern about one area where he believes the global AI community faces real danger — not the far-future risk of superintelligent AI, but the near-term consequences of moving too fast.

When asked about AGI risks, Lee reframed the question. "I'm less afraid of AI becoming self-aware and causing danger for humans in the short term," he said, "but more worried about it being used by bad people to do terrible things, or by the AI race pushing people to work so hard, so fast and furious and move fast and break things that they build products that have problems and holes to be exploited."

He continued: "I'm very worried about that. In fact, I think some terrible event will happen that will be a wake up call from this sort of problem."

Lee's perspective carries unusual weight because of his unique vantage point spanning both Chinese and American AI development. Over a career spanning more than three decades, he has held senior positions at Apple, Microsoft, and Google, while also founding Sinovation Ventures, which has invested in more than 400 companies across both countries. His AI company, 01.AI, founded in 2023, has released several open-source models that rank among the most capable in the world.

For American companies and policymakers, Lee's analysis presents a complex strategic picture. The United States appears to have clear advantages in enterprise AI software, fundamental research, and computing infrastructure. But China is moving faster in consumer applications, manufacturing robotics at lower costs, and potentially pulling ahead in open-source model development.

The bifurcation suggests that rather than a single "winner" in AI, the world may be heading toward a technology landscape where different countries excel in different domains — with all the economic and geopolitical complications that implies.

As the TED AI conference continued Wednesday, Lee's assessment hung over subsequent discussions. His message seemed clear: the AI race is not one contest, but many — and the United States and China are each winning different races.

Standing in the conference hall afterward, one venture capitalist, who asked not to be named, summed up the mood in the room: "We're not competing with China anymore. We're competing on parallel tracks." Whether those tracks eventually converge — or diverge into entirely separate technology ecosystems — may be the defining question of the next decade.

DeepSeek drops open-source model that compresses text 10x through images, defying conventions

DeepSeek, the Chinese artificial intelligence research company that has repeatedly challenged assumptions about AI development costs, has released a new model that fundamentally reimagines how large language models process information—and the implications extend far beyond its modest branding as an optical character recognition tool.

The company's DeepSeek-OCR model, released Monday with full open-source code and weights, achieves what researchers describe as a paradigm inversion: compressing text through visual representation up to 10 times more efficiently than traditional text tokens. The finding challenges a core assumption in AI development and could pave the way for language models with dramatically expanded context windows, potentially reaching tens of millions of tokens.

"We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping," the research team wrote in their technical paper. "Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10×), the model can achieve decoding (OCR) precision of 97%."

The implications have resonated across the AI research community. Andrej Karpathy, co-founder of OpenAI and former director of AI at Tesla, said in a post that the work raises fundamental questions about how AI systems should process information. "Maybe it makes more sense that all inputs to LLMs should only ever be images," Karpathy wrote. "Even if you happen to have pure text input, maybe you'd prefer to render it and then feed that in."

How DeepSeek achieved 10x compression by treating text as images

While DeepSeek marketed the release as an OCR model — a technology for converting images of text into digital characters — the research paper reveals more ambitious goals. The model demonstrates that visual representations can serve as a superior compression medium for textual information, inverting the conventional hierarchy where text tokens were considered more efficient than vision tokens.

"Traditionally, vision LLM tokens almost seemed like an afterthought or 'bolt on' to the LLM paradigm," wrote Jeffrey Emanuel, an AI researcher, in a detailed analysis of the paper. "And 10k words of English would take up far more space in a multimodal LLM when expressed as intelligible pixels than when expressed as tokens...But that gets inverted now from the ideas in this paper."

The model's architecture consists of two primary components: DeepEncoder, a novel 380-million-parameter vision encoder, and a 3-billion-parameter mixture-of-experts language decoder with 570 million activated parameters. DeepEncoder combines Meta's Segment Anything Model (SAM) for local visual perception with OpenAI's CLIP model for global visual understanding, connected through a 16x compression module.
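
As a rough schematic of how a 16x token reduction can sit between a local and a global vision stage, consider the sketch below; the layer shapes are illustrative stand-ins, not the released implementation, which combines SAM- and CLIP-derived components.

```python
# Schematic only: identity blocks stand in for the SAM-style local stage and the
# CLIP-style global stage described in the paper; the conv shows how a 4x4,
# stride-4 compressor cuts the number of vision tokens by roughly 16x.
import torch
import torch.nn as nn

class DeepEncoderSketch(nn.Module):
    def __init__(self, patch_dim: int = 1024, compressed_dim: int = 1280):
        super().__init__()
        self.local_perception = nn.Identity()      # placeholder for the window-attention (SAM-like) stage
        self.compressor = nn.Conv2d(patch_dim, compressed_dim, kernel_size=4, stride=4)
        self.global_understanding = nn.Identity()  # placeholder for the global-attention (CLIP-like) stage

    def forward(self, patch_grid: torch.Tensor) -> torch.Tensor:
        # patch_grid: [batch, patch_dim, H, W] feature map over image patches
        x = self.local_perception(patch_grid)
        x = self.compressor(x)                     # H*W patches -> (H/4)*(W/4) tokens: ~16x fewer
        x = x.flatten(2).transpose(1, 2)           # -> [batch, num_vision_tokens, compressed_dim]
        return self.global_understanding(x)

tokens = DeepEncoderSketch()(torch.randn(1, 1024, 64, 64))
print(tokens.shape)  # torch.Size([1, 256, 1280]) -- 4,096 patches became 256 vision tokens
```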

To validate their compression claims, DeepSeek researchers tested the model on the Fox benchmark, a dataset of diverse document layouts. The results were striking: using just 100 vision tokens, the model achieved 97.3% accuracy on documents containing 700-800 text tokens — representing an effective compression ratio of 7.5x. Even at compression ratios approaching 20x, accuracy remained around 60%.

The practical impact: Processing 200,000 pages per day on a single GPU

The efficiency gains translate directly to production capabilities. According to the company, a single Nvidia A100-40G GPU can process more than 200,000 pages per day using DeepSeek-OCR. Scaling to a cluster of 20 servers with eight GPUs each, throughput reaches 33 million pages daily — sufficient to rapidly construct training datasets for other AI models.

On OmniDocBench, a comprehensive document parsing benchmark, DeepSeek-OCR outperformed GOT-OCR2.0 (which uses 256 tokens per page) while using only 100 vision tokens. More dramatically, it surpassed MinerU2.0 — which requires more than 6,000 tokens per page on average — while using fewer than 800 vision tokens.

DeepSeek designed the model to support five distinct resolution modes, each optimized for different compression ratios and use cases. The "Tiny" mode operates at 512×512 resolution with just 64 vision tokens, while "Gundam" mode combines multiple resolutions dynamically for complex documents. "Gundam mode consists of n×640×640 tiles (local views) and a 1024×1024 global view," the researchers wrote.

Why this breakthrough could unlock 10 million token context windows

The compression breakthrough has immediate implications for one of the most pressing challenges in AI development: expanding the context windows that determine how much information language models can actively consider. Current state-of-the-art models typically handle context windows measured in hundreds of thousands of tokens. DeepSeek's approach suggests a path to windows ten times larger.

"The potential of getting a frontier LLM with a 10 or 20 million token context window is pretty exciting," Emanuel wrote. "You could basically cram all of a company's key internal documents into a prompt preamble and cache this with OpenAI and then just add your specific query or prompt on top of that and not have to deal with search tools and still have it be fast and cost-effective."

The researchers explicitly frame their work in terms of context compression for language models. "Through DeepSeek-OCR, we demonstrate that vision-text compression can achieve significant token reduction (7-20×) for different historical context stages, offering a promising direction for addressing long-context challenges in large language models," they wrote.

The paper includes a speculative but intriguing diagram illustrating how the approach could implement memory decay mechanisms similar to human cognition. Older conversation rounds could be progressively downsampled to lower resolutions, consuming fewer tokens while maintaining key information — a form of computational forgetting that mirrors biological memory.
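
A hedged sketch of that decay idea appears below; the halving schedule and token budgets are invented for illustration, since the paper describes the mechanism only speculatively.

```python
# Illustrative only: older conversation rounds are re-rendered at lower resolution,
# so they cost fewer vision tokens while recent turns stay sharp. The halving
# schedule and the 400-token budget are assumptions made up for this sketch.
def vision_token_budget(rounds_old: int, full_budget: int = 400, floor: int = 25) -> int:
    """Halve the budget for each round of age, never dropping below a floor."""
    return max(full_budget // (2 ** rounds_old), floor)

history = ["turn 1 (oldest)", "turn 2", "turn 3", "turn 4", "turn 5 (latest)"]
for turn, age in zip(history, range(len(history) - 1, -1, -1)):
    print(f"{turn}: ~{vision_token_budget(age)} vision tokens")
# turn 1 (oldest): ~25 vision tokens ... turn 5 (latest): ~400 vision tokens
```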

How visual processing could eliminate the 'ugly' tokenizer problem

Beyond compression, Karpathy highlighted how the approach challenges fundamental assumptions about how language models should process text. Traditional tokenizers—the systems that break text into units for processing—have long been criticized for their complexity and limitations.

"I already ranted about how much I dislike the tokenizer," Karpathy wrote. "Tokenizers are ugly, separate, not end-to-end stage. It 'imports' all the ugliness of Unicode, byte encodings, it inherits a lot of historical baggage, security/jailbreak risk (e.g. continuation bytes). It makes two characters that look identical to the eye look as two completely different tokens internally in the network."

Visual processing of text could eliminate these issues while enabling new capabilities. The approach naturally handles formatting information lost in pure text representations: bold text, colors, layout, embedded images. "Input can now be processed with bidirectional attention easily and as default, not autoregressive attention - a lot more powerful," Karpathy noted.

The implications resonate with human cognitive science. Emanuel drew a parallel to Hans Bethe, the renowned physicist who memorized vast amounts of reference data: "Having vast amounts of task-specific knowledge in your working memory is extremely useful. This seems like a very clever and additive approach to potentially expanding that memory bank by 10x or more."

The model's training: 30 million PDF pages across 100 languages

The model's capabilities rest on an extensive training regimen using diverse data sources. DeepSeek collected 30 million PDF pages covering approximately 100 languages, with Chinese and English accounting for 25 million pages. The training data spans nine document types — academic papers, financial reports, textbooks, newspapers, handwritten notes, and others.

Beyond document OCR, the training incorporated what the researchers call "OCR 2.0" data: 10 million synthetic charts, 5 million chemical formulas, and 1 million geometric figures. The model also received 20% general vision data for tasks like image captioning and object detection, plus 10% text-only data to maintain language capabilities.

The training process employed pipeline parallelism across 160 Nvidia A100-40G GPUs (20 nodes with 8 GPUs each), with the vision encoder divided between two pipeline stages and the language model split across two others. "For multimodal data, the training speed is 70B tokens/day," the researchers reported.

Open source release accelerates research and raises competitive questions

True to DeepSeek's pattern of open development, the company released the complete model weights, training code, and inference scripts on GitHub and Hugging Face. The GitHub repository gained over 4,000 stars within 24 hours of release, according to Dataconomy.
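
For teams that want to experiment, loading the checkpoint would likely look something like the sketch below; the repository ID and the infer() helper are assumptions based on the release described here, so the published README should be treated as the authority on the actual API.

```python
# Hedged sketch: the repository ID and the infer() signature are assumptions,
# not confirmed API -- custom checkpoints released with trust_remote_code
# generally ship their own inference helpers, so consult the repo's README.
from transformers import AutoModel, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-OCR"  # assumed Hugging Face repository ID
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval().cuda()

result = model.infer(                  # illustrative call -- name and arguments assumed
    tokenizer,
    prompt="<image>\nConvert this page to markdown.",
    image_file="financial_report_page.png",
)
print(result)
```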

The breakthrough raises questions about whether other AI labs have developed similar techniques but kept them proprietary. Emanuel speculated that Google's Gemini models, which feature large context windows and strong OCR performance, might employ comparable approaches. "For all we know, Google could have already figured out something like this, which could explain why Gemini has such a huge context size and is so good and fast at OCR tasks," Emanuel wrote.

Google's Gemini 2.5 Pro offers a 1-million-token context window, with plans to expand to 2 million, though the company has not publicly detailed the technical approaches enabling this capability. OpenAI's GPT-5 supports 400,000 tokens, while Anthropic's Claude 4.5 offers 200,000 tokens, with a 1-million-token window available in beta for eligible organizations.

The unanswered question: Can AI reason over compressed visual tokens?

While the compression results are impressive, researchers acknowledge important open questions. "It's not clear how exactly this interacts with the other downstream cognitive functioning of an LLM," Emanuel noted. "Can the model reason as intelligently over those compressed visual tokens as it can using regular text tokens? Does it make the model less articulate by forcing it into a more vision-oriented modality?"

The DeepSeek paper focuses primarily on the compression-decompression capability, measured through OCR accuracy, rather than downstream reasoning performance. This leaves open whether language models could reason effectively over large contexts represented primarily as compressed visual tokens.

The researchers acknowledge their work represents "an initial exploration into the boundaries of vision-text compression." They note that "OCR alone is insufficient to fully validate true context optical compression" and plan future work including "digital-optical text interleaved pretraining, needle-in-a-haystack testing, and other evaluations."

DeepSeek has established a pattern of achieving competitive results with dramatically lower computational resources than Western AI labs. The company's earlier DeepSeek-V3 model reportedly cost just $5.6 million to train—though this figure represents only the final training run and excludes R&D and infrastructure costs—compared to hundreds of millions for comparable models from OpenAI and Anthropic.

Industry analysts have questioned the $5.6 million figure, with some estimates placing the company's total infrastructure and operational costs closer to $1.3 billion, though still lower than American competitors' spending.

The bigger picture: Should language models process text as images?

DeepSeek-OCR poses a fundamental question for AI development: should language models process text as text, or as images of text? The research demonstrates that, at least for compression purposes, visual representation offers significant advantages. Whether this translates to effective reasoning over vast contexts remains to be determined.

"From another perspective, optical contexts compression still offers substantial room for research and improvement, representing a promising new direction," the researchers concluded in their paper.

For the AI industry, the work adds another dimension to the race for longer context windows — a competition that has intensified as language models are applied to increasingly complex tasks requiring vast amounts of information. The open-source release ensures the technique will be widely explored, tested, and potentially integrated into future AI systems.

As Karpathy framed the deeper implication: "OCR is just one of many useful vision -> text tasks. And text -> text tasks can be made to be vision -> text tasks. Not vice versa." In other words, the path forward for AI might not run through better tokenizers — it might bypass text tokens altogether.
