Normal view

Today — 28 October 2025Main stream

Anthropic rolls out Claude AI for finance, integrates with Excel to rival Microsoft Copilot

Anthropic is making its most aggressive push yet into the trillion-dollar financial services industry, unveiling a suite of tools that embed its Claude AI assistant directly into Microsoft Excel and connect it to real-time market data from some of the world's most influential financial information providers.

The San Francisco-based AI startup announced Monday it is releasing Claude for Excel, allowing financial analysts to interact with the AI system directly within their spreadsheets — the quintessential tool of modern finance. Beyond Excel, select Claude models are also being made available in Microsoft Copilot Studio and Researcher agent, expanding the integration across Microsoft's enterprise AI ecosystem. The integration marks a significant escalation in Anthropic's campaign to position itself as the AI platform of choice for banks, asset managers, and insurance companies, markets where precision and regulatory compliance matter far more than creative flair.

The expansion comes just three months after Anthropic launched its Financial Analysis Solution in July, and it signals the company's determination to capture market share in an industry projected to spend $97 billion on AI by 2027, up from $35 billion in 2023.

More importantly, it positions Anthropic to compete directly with Microsoft — ironically, its partner in this Excel integration — which has its own Copilot AI assistant embedded across its Office suite, and with OpenAI, which counts Microsoft as its largest investor.

Why Excel has become the new battleground for AI in finance

The decision to build directly into Excel is hardly accidental. Excel remains the lingua franca of finance, the digital workspace where analysts spend countless hours constructing financial models, running valuations, and stress-testing assumptions. By embedding Claude into this environment, Anthropic is meeting financial professionals exactly where they work rather than asking them to toggle between applications.

Claude for Excel allows users to work with the AI in a sidebar where it can read, analyze, modify, and create new Excel workbooks while providing full transparency about the actions it takes by tracking and explaining changes and letting users navigate directly to referenced cells.

This transparency feature addresses one of the most persistent anxieties around AI in finance: the "black box" problem. When billions of dollars ride on a financial model's output, analysts need to understand not just the answer but how the AI arrived at it. By showing its work at the cell level, Anthropic is attempting to build the trust necessary for widespread adoption in an industry where careers and fortunes can turn on a misplaced decimal point.

The technical implementation is sophisticated. Claude can discuss how spreadsheets work, modify them while preserving formula dependencies — a notoriously complex task — debug cell formulas, populate templates with new data, or build entirely new spreadsheets from scratch. This isn't merely a chatbot that answers questions about your data; it's a collaborative tool that can actively manipulate the models that drive investment decisions worth trillions of dollars.

How Anthropic is building data moats around its financial AI platform

Perhaps more significant than the Excel integration is Anthropic's expansion of its connector ecosystem, which now links Claude to live market data and proprietary research from financial information giants. The company added six major new data partnerships spanning the entire spectrum of financial information that professional investors rely upon.

Aiera now provides Claude with real-time earnings call transcripts and summaries of investor events like shareholder meetings, presentations, and conferences. The Aiera connector also enables a data feed from Third Bridge, which gives Claude access to a library of insights interviews, company intelligence, and industry analysis from experts and former executives. Chronograph gives private equity investors operational and financial information for portfolio monitoring and conducting due diligence, including performance metrics, valuations, and fund-level data.

Egnyte enables Claude to securely search permitted data for internal data rooms, investment documents, and approved financial models while maintaining governed access controls. LSEG, the London Stock Exchange Group, connects Claude to live market data including fixed income pricing, equities, foreign exchange rates, macroeconomic indicators, and analysts' estimates of other important financial metrics. Moody's provides access to proprietary credit ratings, research, and company data covering ownership, financials, and news on more than 600 million public and private companies, supporting work and research in compliance, credit analysis, and business development. MT Newswires provides Claude with access to the latest global multi-asset class news on financial markets and economies.

These partnerships amount to a land grab for the informational infrastructure that powers modern finance. Previously announced in July, Anthropic had already secured integrations with S&P Capital IQ, Daloopa, Morningstar, FactSet, PitchBook, Snowflake, and Databricks. Together, these connectors give Claude access to virtually every category of financial data an analyst might need: fundamental company data, market prices, credit assessments, private company intelligence, alternative data, and breaking news.

This matters because the quality of AI outputs depends entirely on the quality of inputs. Generic large language models trained on public internet data simply cannot compete with systems that have direct pipelines to Bloomberg-quality financial information. By securing these partnerships, Anthropic is building moats around its financial services offering that competitors will find difficult to replicate.

The strategic calculus here is clear: Anthropic is betting that domain-specific AI systems with privileged access to proprietary data will outcompete general-purpose AI assistants. It's a direct challenge to the "one AI to rule them all" approach favored by some competitors.

Pre-configured workflows target the daily grind of Wall Street analysts

The third pillar of Anthropic's announcement involves six new "Agent Skills" — pre-configured workflows for common financial tasks. These skills are Anthropic's attempt to productize the workflows of entry-level and mid-level financial analysts, professionals who spend their days building models, processing due diligence documents, and writing research reports. Anthropic has designed skills specifically to automate these time-consuming tasks.

The new skills include building discounted cash flow models complete with full free cash flow projections, weighted average cost of capital calculations, scenario toggles, and sensitivity tables. There's comparable company analysis featuring valuation multiples and operating metrics that can be easily refreshed with updated data. Claude can now process data room documents into Excel spreadsheets populated with financial information, customer lists, and contract terms. It can create company teasers and profiles for pitch books and buyer lists, perform earnings analyses that use quarterly transcripts and financials to extract important metrics, guidance changes, and management commentary, and produce initiating coverage reports with industry analysis, company deep dives, and valuation frameworks.

It's worth noting that Anthropic's Sonnet 4.5 model now tops the Finance Agent benchmark from Vals AI at 55.3% accuracy, a metric designed to test AI systems on tasks expected of entry-level financial analysts. A 55% accuracy rate might sound underwhelming, but it is state-of-the-art performance and highlights both the promise and limitations of AI in finance. The technology can clearly handle sophisticated analytical tasks, but it's not yet reliable enough to operate autonomously without human oversight — a reality that may actually reassure both regulators and the analysts whose jobs might otherwise be at risk.

The Agent Skills approach is particularly clever because it packages AI capabilities in terms that financial institutions already understand. Rather than selling generic "AI assistance," Anthropic is offering solutions to specific, well-defined problems: "You need a DCF model? We have a skill for that. You need to analyze earnings calls? We have a skill for that too."

Trillion-dollar clients are already seeing massive productivity gains

Anthropic's financial services strategy appears to be gaining traction with exactly the kind of marquee clients that matter in enterprise sales. The company counts among its clients AIA Labs at Bridgewater, Commonwealth Bank of Australia, American International Group, and Norges Bank Investment Management — Norway's $1.6 trillion sovereign wealth fund, one of the world's largest institutional investors.

NBIM CEO Nicolai Tangen reported achieving approximately 20% productivity gains, equivalent to 213,000 hours, with portfolio managers and risk departments now able to "seamlessly query our Snowflake data warehouse and analyze earnings calls with unprecedented efficiency."

At AIG, CEO Peter Zaffino said the partnership has "compressed the timeline to review business by more than 5x in our early rollouts while simultaneously improving our data accuracy from 75% to over 90%." If these numbers hold across broader deployments, the productivity implications for the financial services industry are staggering.

These aren't pilot programs or proof-of-concept deployments; they're production implementations at institutions managing trillions of dollars in assets and making underwriting decisions that affect millions of customers. Their public endorsements provide the social proof that typically drives enterprise adoption in conservative industries.

Regulatory uncertainty creates both opportunity and risk for AI deployment

Yet Anthropic's financial services ambitions unfold against a backdrop of heightened regulatory scrutiny and shifting enforcement priorities. In 2023, the Consumer Financial Protection Bureau released guidance requiring lenders to "use specific and accurate reasons when taking adverse actions against consumers" involving AI, and issued additional guidance requiring regulated entities to "evaluate their underwriting models for bias" and "evaluate automated collateral-valuation and appraisal processes in ways that minimize bias."

However, according to a Brookings Institution analysis, these measures have since been revoked with work stopped or eliminated at the current downsized CFPB under the current administration, creating regulatory uncertainty. The pendulum has swung from the Biden administration's cautious approach, exemplified by an executive order on safe AI development, toward the Trump administration's "America's AI Action Plan," which seeks to "cement U.S. dominance in artificial intelligence" through deregulation.

This regulatory flux creates both opportunities and risks. Financial institutions eager to deploy AI now face less prescriptive federal oversight, potentially accelerating adoption. But the absence of clear guardrails also exposes them to potential liability if AI systems produce discriminatory outcomes, particularly in lending and underwriting.

The Massachusetts Attorney General recently reached a $2.5 million settlement with student loan company Earnest Operations, alleging that its use of AI models resulted in "disparate impact in approval rates and loan terms, specifically disadvantaging Black and Hispanic applicants." Such cases will likely multiply as AI deployment grows, creating a patchwork of state-level enforcement even as federal oversight recedes.

Anthropic appears acutely aware of these risks. In an interview with Banking Dive, Jonathan Pelosi, Anthropic's global head of industry for financial services, emphasized that Claude requires a "human in the loop." The platform, he said, is not intended for autonomous financial decision-making or to provide stock recommendations that users follow blindly. During client onboarding, Pelosi told the publication, Anthropic focuses on training and understanding model limitations, putting guardrails in place so people treat Claude as a helpful technology rather than a replacement for human judgment.

Competition heats up as every major tech company targets finance AI

Anthropic's financial services push comes as AI competition intensifies across the enterprise. OpenAI, Microsoft, Google, and numerous startups are all vying for position in what may become one of AI's most lucrative verticals. Goldman Sachs introduced a generative AI assistant to its bankers, traders, and asset managers in January, signaling that major banks may build their own capabilities rather than rely exclusively on third-party providers.

The emergence of domain-specific AI models like BloombergGPT — trained specifically on financial data — suggests the market may fragment between generalized AI assistants and specialized tools. Anthropic's strategy appears to stake out a middle ground: general-purpose models, since Claude was not trained exclusively on financial data, enhanced with financial-specific tooling, data access, and workflows.

The company's partnership strategy with implementation consultancies including Deloitte, KPMG, PwC, Slalom, TribeAI, and Turing is equally critical. These firms serve as force multipliers, embedding Anthropic's technology into their own service offerings and providing the change management expertise that financial institutions need to successfully adopt AI at scale.

CFOs worry about AI hallucinations and cascading errors

The broader question is whether AI tools like Claude will genuinely transform financial services productivity or merely shift work around. The PYMNTS Intelligence report "The Agentic Trust Gap" found that chief financial officers remain hesitant about AI agents, with "nagging concern" about hallucinations where "an AI agent can go off script and expose firms to cascading payment errors and other inaccuracies."

"For finance leaders, the message is stark: Harness AI's momentum now, but build the guardrails before the next quarterly call—or risk owning the fallout," the report warned.

A 2025 KPMG report found that 70% of board members have developed responsible use policies for employees, with other popular initiatives including implementing a recognized AI risk and governance framework, developing ethical guidelines and training programs for AI developers, and conducting regular AI use audits.

The financial services industry faces a delicate balancing act: move too slowly and risk competitive disadvantage as rivals achieve productivity gains; move too quickly and risk operational failures, regulatory penalties, or reputational damage. Speaking at the Evident AI Symposium in New York last week, Ian Glasner, HSBC's group head of emerging technology, innovation and ventures, struck an optimistic tone about the sector's readiness for AI adoption. "As an industry, we are very well prepared to manage risk," he said, according to CIO Dive. "Let's not overcomplicate this. We just need to be focused on the business use case and the value associated."

Anthropic's latest moves suggest the company sees financial services as a beachhead market where AI's value proposition is clear, customers have deep pockets, and the technical requirements play to Claude's strengths in reasoning and accuracy. By building Excel integration, securing data partnerships, and pre-packaging common workflows, Anthropic is reducing the friction that typically slows enterprise AI adoption.

The $61.5 billion valuation the company commanded in its March fundraising round — up from roughly $16 billion a year earlier — suggests investors believe this strategy will work. But the real test will come as these tools move from pilot programs to production deployments across thousands of analysts and billions of dollars in transactions.

Financial services may prove to be AI's most demanding proving ground: an industry where mistakes are costly, regulation is stringent, and trust is everything. If Claude can successfully navigate the spreadsheet cells and data feeds of Wall Street without hallucinating a decimal point in the wrong direction, Anthropic will have accomplished something far more valuable than winning another benchmark test. It will have proven that AI can be trusted with the money.

Before yesterdayMain stream

Thinking Machines challenges OpenAI's AI scaling strategy: 'First superintelligence will be a superhuman learner'

While the world's leading artificial intelligence companies race to build ever-larger models, betting billions that scale alone will unlock artificial general intelligence, a researcher at one of the industry's most secretive and valuable startups delivered a pointed challenge to that orthodoxy this week: The path forward isn't about training bigger — it's about learning better.

"I believe that the first superintelligence will be a superhuman learner," Rafael Rafailov, a reinforcement learning researcher at Thinking Machines Lab, told an audience at TED AI San Francisco on Tuesday. "It will be able to very efficiently figure out and adapt, propose its own theories, propose experiments, use the environment to verify that, get information, and iterate that process."

This breaks sharply with the approach pursued by OpenAI, Anthropic, Google DeepMind, and other leading laboratories, which have bet billions on scaling up model size, data, and compute to achieve increasingly sophisticated reasoning capabilities. Rafailov argues these companies have the strategy backwards: what's missing from today's most advanced AI systems isn't more scale — it's the ability to actually learn from experience.

"Learning is something an intelligent being does," Rafailov said, citing a quote he described as recently compelling. "Training is something that's being done to it."

The distinction cuts to the core of how AI systems improve — and whether the industry's current trajectory can deliver on its most ambitious promises. Rafailov's comments offer a rare window into the thinking at Thinking Machines Lab, the startup co-founded in February by former OpenAI chief technology officer Mira Murati that raised a record-breaking $2 billion in seed funding at a $12 billion valuation.

Why today's AI coding assistants forget everything they learned yesterday

To illustrate the problem with current AI systems, Rafailov offered a scenario familiar to anyone who has worked with today's most advanced coding assistants.

"If you use a coding agent, ask it to do something really difficult — to implement a feature, go read your code, try to understand your code, reason about your code, implement something, iterate — it might be successful," he explained. "And then come back the next day and ask it to implement the next feature, and it will do the same thing."

The issue, he argued, is that these systems don't internalize what they learn. "In a sense, for the models we have today, every day is their first day of the job," Rafailov said. "But an intelligent being should be able to internalize information. It should be able to adapt. It should be able to modify its behavior so every day it becomes better, every day it knows more, every day it works faster — the way a human you hire gets better at the job."

The duct tape problem: How current training methods teach AI to take shortcuts instead of solving problems

Rafailov pointed to a specific behavior in coding agents that reveals the deeper problem: their tendency to wrap uncertain code in try/except blocks — a programming construct that catches errors and allows a program to continue running.

"If you use coding agents, you might have observed a very annoying tendency of them to use try/except pass," he said. "And in general, that is basically just like duct tape to save the entire program from a single error."

Why do agents do this? "They do this because they understand that part of the code might not be right," Rafailov explained. "They understand there might be something wrong, that it might be risky. But under the limited constraint—they have a limited amount of time solving the problem, limited amount of interaction—they must only focus on their objective, which is implement this feature and solve this bug."

The result: "They're kicking the can down the road."

This behavior stems from training systems that optimize for immediate task completion. "The only thing that matters to our current generation is solving the task," he said. "And anything that's general, anything that's not related to just that one objective, is a waste of computation."

Why throwing more compute at AI won't create superintelligence, according to Thinking Machines researcher

Rafailov's most direct challenge to the industry came in his assertion that continued scaling won't be sufficient to reach AGI.

"I don't believe we're hitting any sort of saturation points," he clarified. "I think we're just at the beginning of the next paradigm—the scale of reinforcement learning, in which we move from teaching our models how to think, how to explore thinking space, into endowing them with the capability of general agents."

In other words, current approaches will produce increasingly capable systems that can interact with the world, browse the web, write code. "I believe a year or two from now, we'll look at our coding agents today, research agents or browsing agents, the way we look at summarization models or translation models from several years ago," he said.

But general agency, he argued, is not the same as general intelligence. "The much more interesting question is: Is that going to be AGI? And are we done — do we just need one more round of scaling, one more round of environments, one more round of RL, one more round of compute, and we're kind of done?"

His answer was unequivocal: "I don't believe this is the case. I believe that under our current paradigms, under any scale, we are not enough to deal with artificial general intelligence and artificial superintelligence. And I believe that under our current paradigms, our current models will lack one core capability, and that is learning."

Teaching AI like students, not calculators: The textbook approach to machine learning

To explain the alternative approach, Rafailov turned to an analogy from mathematics education.

"Think about how we train our current generation of reasoning models," he said. "We take a particular math problem, make it very hard, and try to solve it, rewarding the model for solving it. And that's it. Once that experience is done, the model submits a solution. Anything it discovers—any abstractions it learned, any theorems—we discard, and then we ask it to solve a new problem, and it has to come up with the same abstractions all over again."

That approach misunderstands how knowledge accumulates. "This is not how science or mathematics works," he said. "We build abstractions not necessarily because they solve our current problems, but because they're important. For example, we developed the field of topology to extend Euclidean geometry — not to solve a particular problem that Euclidean geometry couldn't handle, but because mathematicians and physicists understood these concepts were fundamentally important."

The solution: "Instead of giving our models a single problem, we might give them a textbook. Imagine a very advanced graduate-level textbook, and we ask our models to work through the first chapter, then the first exercise, the second exercise, the third, the fourth, then move to the second chapter, and so on—the way a real student might teach themselves a topic."

The objective would fundamentally change: "Instead of rewarding their success — how many problems they solved — we need to reward their progress, their ability to learn, and their ability to improve."

This approach, known as "meta-learning" or "learning to learn," has precedents in earlier AI systems. "Just like the ideas of scaling test-time compute and search and test-time exploration played out in the domain of games first" — in systems like DeepMind's AlphaGo — "the same is true for meta learning. We know that these ideas do work at a small scale, but we need to adapt them to the scale and the capability of foundation models."

The missing ingredients for AI that truly learns aren't new architectures—they're better data and smarter objectives

When Rafailov addressed why current models lack this learning capability, he offered a surprisingly straightforward answer.

"Unfortunately, I think the answer is quite prosaic," he said. "I think we just don't have the right data, and we don't have the right objectives. I fundamentally believe a lot of the core architectural engineering design is in place."

Rather than arguing for entirely new model architectures, Rafailov suggested the path forward lies in redesigning the data distributions and reward structures used to train models.

"Learning, in of itself, is an algorithm," he explained. "It has inputs — the current state of the model. It has data and compute. You process it through some sort of structure, choose your favorite optimization algorithm, and you produce, hopefully, a stronger model."

The question: "If reasoning models are able to learn general reasoning algorithms, general search algorithms, and agent models are able to learn general agency, can the next generation of AI learn a learning algorithm itself?"

His answer: "I strongly believe that the answer to this question is yes."

The technical approach would involve creating training environments where "learning, adaptation, exploration, and self-improvement, as well as generalization, are necessary for success."

"I believe that under enough computational resources and with broad enough coverage, general purpose learning algorithms can emerge from large scale training," Rafailov said. "The way we train our models to reason in general over just math and code, and potentially act in general domains, we might be able to teach them how to learn efficiently across many different applications."

Forget god-like reasoners: The first superintelligence will be a master student

This vision leads to a fundamentally different conception of what artificial superintelligence might look like.

"I believe that if this is possible, that's the final missing piece to achieve truly efficient general intelligence," Rafailov said. "Now imagine such an intelligence with the core objective of exploring, learning, acquiring information, self-improving, equipped with general agency capability—the ability to understand and explore the external world, the ability to use computers, ability to do research, ability to manage and control robots."

Such a system would constitute artificial superintelligence. But not the kind often imagined in science fiction.

"I believe that intelligence is not going to be a single god model that's a god-level reasoner or a god-level mathematical problem solver," Rafailov said. "I believe that the first superintelligence will be a superhuman learner, and it will be able to very efficiently figure out and adapt, propose its own theories, propose experiments, use the environment to verify that, get information, and iterate that process."

This vision stands in contrast to OpenAI's emphasis on building increasingly powerful reasoning systems, or Anthropic's focus on "constitutional AI." Instead, Thinking Machines Lab appears to be betting that the path to superintelligence runs through systems that can continuously improve themselves through interaction with their environment.

The $12 billion bet on learning over scaling faces formidable challenges

Rafailov's appearance comes at a complex moment for Thinking Machines Lab. The company has assembled an impressive team of approximately 30 researchers from OpenAI, Google, Meta, and other leading labs. But it suffered a setback in early October when Andrew Tulloch, a co-founder and machine learning expert, departed to return to Meta after the company launched what The Wall Street Journal called a "full-scale raid" on the startup, approaching more than a dozen employees with compensation packages ranging from $200 million to $1.5 billion over multiple years.

Despite these pressures, Rafailov's comments suggest the company remains committed to its differentiated technical approach. The company launched its first product, Tinker, an API for fine-tuning open-source language models, in October. But Rafailov's talk suggests Tinker is just the foundation for a much more ambitious research agenda focused on meta-learning and self-improving systems.

"This is not easy. This is going to be very difficult," Rafailov acknowledged. "We'll need a lot of breakthroughs in memory and engineering and data and optimization, but I think it's fundamentally possible."

He concluded with a play on words: "The world is not enough, but we need the right experiences, and we need the right type of rewards for learning."

The question for Thinking Machines Lab — and the broader AI industry — is whether this vision can be realized, and on what timeline. Rafailov notably did not offer specific predictions about when such systems might emerge.

In an industry where executives routinely make bold predictions about AGI arriving within years or even months, that restraint is notable. It suggests either unusual scientific humility — or an acknowledgment that Thinking Machines Lab is pursuing a much longer, harder path than its competitors.

For now, the most revealing detail may be what Rafailov didn't say during his TED AI presentation. No timeline for when superhuman learners might emerge. No prediction about when the technical breakthroughs would arrive. Just a conviction that the capability was "fundamentally possible" — and that without it, all the scaling in the world won't be enough.

This new AI technique creates ‘digital twin’ consumers, and it could kill the traditional survey industry

A new research paper quietly published last week outlines a breakthrough method that allows large language models (LLMs) to simulate human consumer behavior with startling accuracy, a development that could reshape the multi-billion-dollar market research industry. The technique promises to create armies of synthetic consumers who can provide not just realistic product ratings, but also the qualitative reasoning behind them, at a scale and speed currently unattainable.

For years, companies have sought to use AI for market research, but have been stymied by a fundamental flaw: when asked to provide a numerical rating on a scale of 1 to 5, LLMs produce unrealistic and poorly distributed responses. A new paper, "LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings," submitted to the pre-print server arXiv on October 9th proposes an elegant solution that sidesteps this problem entirely.

The international team of researchers, led by Benjamin F. Maier, developed a method they call semantic similarity rating (SSR). Instead of asking an LLM for a number, SSR prompts the model for a rich, textual opinion on a product. This text is then converted into a numerical vector — an "embedding" — and its similarity is measured against a set of pre-defined reference statements. For example, a response of "I would absolutely buy this, it's exactly what I'm looking for" would be semantically closer to the reference statement for a "5" rating than to the statement for a "1."

The results are striking. Tested against a massive real-world dataset from a leading personal care corporation — comprising 57 product surveys and 9,300 human responses — the SSR method achieved 90% of human test-retest reliability. Crucially, the distribution of AI-generated ratings was statistically almost indistinguishable from the human panel. The authors state, "This framework enables scalable consumer research simulations while preserving traditional survey metrics and interpretability."

A timely solution as AI threatens survey integrity

This development arrives at a critical time, as the integrity of traditional online survey panels is increasingly under threat from AI. A 2024 analysis from the Stanford Graduate School of Business highlighted a growing problem of human survey-takers using chatbots to generate their answers. These AI-generated responses were found to be "suspiciously nice," overly verbose, and lacking the "snark" and authenticity of genuine human feedback, leading to what researchers called a "homogenization" of data that could mask serious issues like discrimination or product flaws.

Maier's research offers a starkly different approach: instead of fighting to purge contaminated data, it creates a controlled environment for generating high-fidelity synthetic data from the ground up.

"What we're seeing is a pivot from defense to offense," said one analyst not affiliated with the study. "The Stanford paper showed the chaos of uncontrolled AI polluting human datasets. This new paper shows the order and utility of controlled AI creating its own datasets. For a Chief Data Officer, this is the difference between cleaning a contaminated well and tapping into a fresh spring."

From text to intent: The technical leap behind the synthetic consumer

The technical validity of the new method hinges on the quality of the text embeddings, a concept explored in a 2022 paper in EPJ Data Science. That research argued for a rigorous "construct validity" framework to ensure that text embeddings — the numerical representations of text — truly "measure what they are supposed to." 

The success of the SSR method suggests its embeddings effectively capture the nuances of purchase intent. For this new technique to be widely adopted, enterprises will need to be confident that the underlying models are not just generating plausible text, but are mapping that text to scores in a way that is robust and meaningful.

The approach also represents a significant leap from prior research, which has largely focused on using text embeddings to analyze and predict ratings from existing online reviews. A 2022 study, for example, evaluated the performance of models like BERT and word2vec in predicting review scores on retail sites, finding that newer models like BERT performed better for general use. The new research moves beyond analyzing existing data to generating novel, predictive insights before a product even hits the market.

The dawn of the digital focus group

For technical decision-makers, the implications are profound. The ability to spin up a "digital twin" of a target consumer segment and test product concepts, ad copy, or packaging variations in a matter of hours could drastically accelerate innovation cycles. 

As the paper notes, these synthetic respondents also provide "rich qualitative feedback explaining their ratings," offering a treasure trove of data for product development that is both scalable and interpretable. While the era of human-only focus groups is far from over, this research provides the most compelling evidence yet that their synthetic counterparts are ready for business.

But the business case extends beyond speed and scale. Consider the economics: a traditional survey panel for a national product launch might cost tens of thousands of dollars and take weeks to field. An SSR-based simulation could deliver comparable insights in a fraction of the time, at a fraction of the cost, and with the ability to iterate instantly based on findings. For companies in fast-moving consumer goods categories — where the window between concept and shelf can determine market leadership — this velocity advantage could be decisive.

There are, of course, caveats. The method was validated on personal care products; its performance on complex B2B purchasing decisions, luxury goods, or culturally specific products remains unproven. And while the paper demonstrates that SSR can replicate aggregate human behavior, it does not claim to predict individual consumer choices. The technique works at the population level, not the person level — a distinction that matters greatly for applications like personalized marketing.

Yet even with these limitations, the research is a watershed. While the era of human-only focus groups is far from over, this paper provides the most compelling evidence yet that their synthetic counterparts are ready for business. The question is no longer whether AI can simulate consumer sentiment, but whether enterprises can move fast enough to capitalize on it before their competitors do.

Salesforce bets on AI 'agents' to fix what it calls a $7 billion problem in enterprise software

As 50,000 attendees descend on Salesforce's Dreamforce conference this week, the enterprise software giant is making its most aggressive bet yet on artificial intelligence agents, positioning itself as the antidote to what it calls an industry-wide "pilot purgatory" where 95% of enterprise AI projects never reach production.

The company on Monday launched Agentforce 360, a sweeping reimagination of its entire product portfolio designed to transform businesses into what it calls "agentic enterprises" — organizations where AI agents work alongside humans to handle up to 40% of work across sales, service, marketing, and operations.

"We are truly in the agentic AI era, and I think it's probably the biggest revolution, the biggest transition in technology I've ever experienced in my career," said Parker Harris, Salesforce's co-founder and chief technology officer, during a recent press briefing. "In the future, 40% of the work in the Fortune 1000 is probably going to be done by AI, and it's going to be humans and AI actually working together."

The announcement comes at a pivotal moment for Salesforce, which has deployed more than 12,000 AI agent implementations over the past year while building what Harris called a "$7 billion business" around its AI platform. Yet the launch also arrives amid unusual turbulence, as CEO Marc Benioff faces fierce backlash for recent comments supporting President Trump and suggesting National Guard troops should patrol San Francisco streets.

Why 95% of enterprise AI projects never launch

The stakes are enormous. While companies have rushed to experiment with AI following ChatGPT's emergence two years ago, most enterprise deployments have stalled before reaching production, according to recent MIT research that Salesforce executives cited extensively.

"Customers have invested a lot in AI, but they're not getting the value," said Srini Tallapragada, Salesforce's president and chief engineering and customer success officer. "95% of enterprise AI pilots fail before production. It's not because of lack of intent. People want to do this. Everybody understands the power of the technology. But why is it so hard?"

The answer, according to Tallapragada, is that AI tools remain disconnected from enterprise workflows, data, and governance systems. "You're writing prompts, prompts, you're getting frustrated because the context is not there," he said, describing what he called a "prompt doom loop."

Salesforce's solution is a deeply integrated platform connecting what it calls four ingredients: the Agentforce 360 agent platform, Data 360 for unified data access, Customer 360 apps containing business logic, and Slack as the "conversational interface" where humans and agents collaborate.

Slack becomes the front door to Salesforce

Perhaps the most significant strategic shift is the elevation of Slack — acquired by Salesforce in 2019 for $27.7 billion — as the primary interface for Salesforce itself. The company is effectively reimagining its traditional Lightning interface around Slack channels, where sales deals, service cases, and data insights will surface conversationally rather than through forms and dashboards.

"Imagine that you maybe don't log into Salesforce, you don't see Salesforce, but it's there. It's coming to you in Slack, because that's where you're getting your work done," Harris explained.

The strategy includes embedding Salesforce's Agentforce agents for sales, IT service, HR service, and analytics directly into Slack, alongside a completely rebuilt Slackbot that acts as a personal AI companion. The company is also launching "Channel Expert," an always-on agent that provides instant answers from channel conversations.

To enable third-party AI tools to access Slack's conversational data, Salesforce is releasing a Real-Time Search API and Model Context Protocol server. Partners including OpenAI, Anthropic, Google, Perplexity, Writer, Dropbox, Notion, and Cursor are building agents that will live natively in Slack.

"The best way to see the power of the platform is through the AI apps and agents already being built," Rob Seaman, a Salesforce executive, said during a technical briefing, citing examples of startups "achieving tens of thousands of customers that have it installed in 120 days or less."

Voice and IT service take aim at new markets

Beyond Slack integration, Salesforce announced major expansions into voice-based interactions and employee service. Agentforce Voice, now generally available, transforms traditional IVR systems into natural conversations that can update CRM records, trigger workflows, and seamlessly hand off to human agents.

The IT Service offering represents Salesforce's most direct challenge to ServiceNow, the market leader. Mudhu Sudhakar, who joined Salesforce two months ago as senior vice president for IT and HR Service, positioned the product as a fundamental reimagining of employee support.

"Legacy IT service management is very portals, forms, tickets focused, manual process," Sudhakar said. "What we had a few key tenets: conversation first and agent first, really focused on having a conversational experience for the people requesting the support and for the people providing the support."

The IT Service platform includes what Salesforce describes as 25+ specialized agents and 100+ pre-built workflows and connectors that can handle everything from password resets to complex incident management.

Early customers report dramatic efficiency gains

Customer results suggest the approach is gaining traction. Reddit reduced average support resolution time from 8.9 minutes to 1.4 minutes — an 84% improvement — while deflecting 46% of cases entirely to AI agents. "This efficiency has allowed us to provide on-demand help for complex tasks and boost advertiser satisfaction scores by 20%," said John Thompson, Reddit's VP of sales strategy and operations, in a statement.

Engine, a travel management company, reduced average handle time by 15%, saving over $2 million annually. OpenTable resolved 70% of restaurant and diner inquiries autonomously. And 1-800Accountant achieved a 90% case deflection rate during the critical tax week period.

Salesforce's own internal deployments may be most telling. Tallapragada's customer success organization now handles 1.8 million AI-powered conversations weekly, with metrics published at help.salesforce.com showing how many agents answer versus escalating to humans.

Even more significantly, Salesforce has deployed AI-powered sales development representatives to follow up on leads that would previously have gone uncontacted due to cost constraints. "Now, Agentforce has an SDR which is doing thousands of leads following up," Tallapragada explained. The company also increased proactive customer outreach by 40% by shifting staff from reactive support.

The trust layer problem enterprises can't ignore

Given enterprise concerns about AI reliability, Salesforce has invested heavily in what it calls the "trust layer" — audit trails, compliance checks, and observability tools that let organizations monitor agent behavior at scale.

"You should think of an agent as a human. Digital labor. You need to manage performance just like a human. And you need these audit trails," Tallapragada explained.

The company encountered this challenge firsthand when its own agent deployment scaled. "When we started at Agentforce at Salesforce, we would track every message, which is great until 1,000, 3,000," Tallapragada said. "Once you have a million chats, there's no human, we cannot do it."

The platform now includes "Agentforce Grid" for searching across millions of conversations to identify and fix problematic patterns. The company also introduced Agent Script, a new scripting language that allows developers to define precise guardrails and deterministic controls for agent behavior.

Data infrastructure gets a major upgrade

Underlying the agent capabilities is significant infrastructure investment. Salesforce's Data 360 includes "Intelligent Context," which automatically extracts structured information from unstructured content like PDFs, diagrams, and flowcharts using what the company describes as "AI-powered unstructured data pipelines."

The company is also collaborating with Databricks, dbt Labs, and Snowflake on the "Universal Semantic Interchange," an attempt to standardize how different platforms define business metrics. The pending $8 billion acquisition of Informatica, expected to close soon, will expand metadata management capabilities across the enterprise.

The competitive landscape keeps intensifying

Salesforce's aggressive AI agent push comes as virtually every major enterprise software vendor pursues similar strategies. Microsoft has embedded Copilot across its product line, Google offers agent capabilities through Vertex AI and Gemini, and ServiceNow has launched its own agentic offerings.

When asked how Salesforce's announcement compared to OpenAI's recent releases, Tallapragada emphasized that customers will use multiple AI tools simultaneously. "Most of the time I'm seeing they're using OpenAI, they're using Gemini, they're using Anthropic, just like Salesforce, we use all three," he said.

The real differentiation, executives argued, lies not in the AI models but in the integration with business processes and data. Harris framed the competition in terms familiar from Salesforce's founding: "26 years ago, we just said, let's make Salesforce automation as easy as buying a book on Amazon.com. We're doing that same thing. We want to make agentic AI as easy as buying a book on Amazon."

The company's customer success stories are impressive but remain a small fraction of its customer base. With 150,000 Salesforce customers and one million Slack customers, the 12,000 Agentforce deployments represent roughly 8% penetration — strong for a one-year-old product line, but hardly ubiquitous.

The company's stock, down roughly 28% year to date with a Relative Strength rating of just 15, suggests investors remain skeptical. This week's Dreamforce demonstrations — and the months of customer deployments that follow — will begin to provide answers to whether Salesforce can finally move enterprise AI from pilots to production at scale, or whether the "$7 billion business" remains more aspiration than reality.

‘AI is tearing companies apart’: Writer AI CEO slams Fortune 500 leaders for mismanaging tech

May Habib, co-founder and CEO of Writer AI, delivered one of the bluntest assessments of corporate AI failures at the TED AI conference on Tuesday, revealing that nearly half of Fortune 500 executives believe artificial intelligence is actively damaging their organizations — and placing the blame squarely on leadership's shoulders.

The problem, according to Habib, isn't the technology. It's that business leaders are making a category error, treating AI transformation like previous technology rollouts and delegating it to IT departments. This approach, she warned, has led to "billions of dollars spent on AI initiatives that are going nowhere."

"Earlier this year, we did a survey of 800 Fortune 500 C-suite executives," Habib told the audience of Silicon Valley executives and investors. "42% of them said AI is tearing their company apart."

The diagnosis challenges conventional wisdom about how enterprises should approach AI adoption. While most major companies have stood up AI task forces, appointed chief AI officers, or expanded IT budgets, Habib argues these moves reflect a fundamental misunderstanding of what AI represents: not another software tool, but a wholesale reorganization of how work gets done.

"There is something leaders are missing when they compare AI to just another tech tool," Habib said. "This is not like giving accountants calculators or bankers Excel or designers Photoshop."

Why the 'old playbook' of delegating to IT departments is failing companies

Habib, whose company has spent five years building AI systems for Fortune 500 companies and logged two million miles visiting customer sites, said the pattern is consistent: "When generative AI started showing up, we turned to the old playbook. We turned to IT and said, 'Go figure this out.'"

That approach fails, she argued, because AI fundamentally changes the economics and organization of work itself. "For 100 years, enterprises have been built around the idea that execution is expensive and hard," Habib said. "The enterprise built complex org charts, complex processes, all to manage people doing stuff."

AI inverts that model. "Execution is going from scarce and expensive to programmatic, on-demand and abundant," she said. In this new paradigm, the bottleneck shifts from execution capacity to strategic design — a shift that requires business leaders, not IT departments, to drive transformation.

"With AI technology, it can no longer be centralized. It's in every workflow, every business," Habib said. "It is now the most important part of a business leader's job. It cannot be delegated."

The statement represents a direct challenge to how most large organizations have structured their AI initiatives, with centralized centers of excellence, dedicated AI teams, or IT-led implementations that business units are expected to adopt.

A generational power shift is happening based on who understands AI workflow design

Habib framed the shift in dramatic terms: "A generational transfer of power is happening right now. It's not about your age or how long you've been at a company. The generational transfer of power is about the nature of leadership itself."

Traditional leadership, she argued, has been defined by the ability to manage complexity — big teams, big budgets, intricate processes. "The identity of leaders at these companies, people like us, has been tied to old school power structures: control, hierarchy, how big our teams are, how big our budgets are. Our value is measured by the sheer amount of complexity we could manage," Habib said. "Today we reward leaders for this. We promote leaders for this."

AI makes that model obsolete. "When I am able to 10x the output of my team or do things that could never be possible, work is no longer about the 1x," she said. "Leadership is no longer about managing complex human execution."

Instead, Habib outlined three fundamental shifts that define what she calls "AI-first leaders" — executives her company has worked with who have successfully deployed AI agents solving "$100 million plus problems."

The first shift: Taking a machete to enterprise complexity

The new leadership mandate, according to Habib, is "taking a machete to the complexity that has calcified so many organizations." She pointed to the layers of friction that have accumulated in enterprises: "Brilliant ideas dying in memos, the endless cycles of approvals, the death by 1,000 clicks, meetings about meetings — a death, by the way, that's happening in 17 different browser tabs each for software that promises to be a single source of truth."

Rather than accepting this complexity as inevitable, AI-first leaders redesign workflows from first principles. "There are very few legacy systems that can't be replaced in your organization, that won't be replaced," Habib said. "But they're not going to be replaced by another monolithic piece of software. They can only be replaced by a business leader articulating business logic and getting that into an agentic system."

She offered a concrete example: "We have customers where it used to take them seven months to get a creative campaign — not even a product, a campaign. Now they can go from TikTok trend to digital shelf in 30 days. That is radical simplicity."

The catch, she emphasized, is that CIOs can't drive this transformation alone. "Your CIO can't help flatten your org chart. Only a business leader can look at workflows and say, 'This part is necessary genius, this part is bureaucratic scar tissue that has to go.'"

The second shift: Managing the fear as career ladders disappear

When AI handles execution, "your humans are liberated to do what they're amazing at: judgment, strategy, creativity," Habib explained. "The old leadership playbook was about managing headcount. We managed people against revenue: one business development rep for every three account executives, one marketer for every five salespeople."

But this liberation carries profound challenges that leaders must address directly. Habib acknowledged the elephant in the room that many executives avoid discussing: "These changes are still frightening for people, even when it's become unholy to talk about it." She's witnessed the fear firsthand. "It shows up as tears in an AI workshop when someone feels like their old skill set isn't translated to the new."

She introduced a term for a common form of resistance: "productivity anchoring" — when employees "cling to the hard way of doing things because they feel productive, because their self-worth is tied to them, even when empirically AI can be better."

The solution isn't to look away. "We have to design new pathways to impact, to show your people their value is not in executing a task. Their value is in orchestrating systems of execution, to ask the next great question," Habib said. She advocates replacing career "ladders" with "lattices" where "people need to grow laterally, to expand sideways."

She was candid about the disruption: "The first rungs on our career ladders are indeed going away. I know because my company is automating them." But she insisted this creates opportunity for work that is "more creative, more strategic, more driven by curiosity and impact — and I believe a lot more human than the jobs that they're replacing."

The third shift: When execution becomes free, ambition becomes the only bottleneck

The final shift is from optimization to creation. "Before AI, we used to call it transformation when we took 12 steps and made them nine," Habib said. "That's optimizing the world as it is. We can now create a new world. That is the greenfield mindset."

She challenged executives to identify assumptions their industries are built on that AI now disrupts. Writer's customers, she said, are already seeing new categories of growth: treating every customer like their only customer, democratizing premium services to broader markets, and entering new markets at unprecedented speed because "AI strips away the friction to access new channels."

"When execution is abundant, the only bottleneck is the scope of your own ambition," Habib declared.

What this means for CIOs: Building the stadium while business leaders design the plays

Habib didn't leave IT leaders without a role — she redefined it. "If tech is everyone's job, you might be asking, what is mine?" she addressed CIOs. "Yours is to provide the mission critical infrastructure that makes this revolution possible."

As tens or hundreds of thousands of AI agents operate at various levels of autonomy within organizations, "governance becomes existential," she explained. "The business leader's job is to design the play, but you have to build the stadium, you have to write the rule book, and you have to make sure these plays can win at championship scale."

The formulation suggests a partnership model: business leaders drive workflow redesign and strategic implementation while IT provides the infrastructure, governance frameworks, and security guardrails that make mass AI deployment safe and scalable. "One can't succeed without the other," Habib said.

For CIOs and technical leaders, this represents a fundamental shift from gatekeeper to enabler. When business units deploy agents autonomously, IT faces governance challenges unlike anything in enterprise software history. Success requires genuine partnership between business and IT — neither can succeed alone, forcing cultural changes in how these functions collaborate.

A real example: From multi-day scrambles to instant answers during a market crisis

To ground her arguments in concrete business impact, Habib described working with the chief client officer of a Fortune 500 wealth advisory firm during recent market volatility following tariff announcements.

"Their phone was ringing off the hook with customers trying to figure out their market exposure," she recounted. "Every request kicked off a multi-day, multi-person scramble: a portfolio manager ran the show, an analyst pulled charts, a relationship manager built the PowerPoint, a compliance officer had to review everything for disclosures. And the leader in all this — she was forwarding emails and chasing updates. This is the top job: managing complexity."

With an agentic AI system, the same work happens programmatically. "A system of agents is able to assemble the answer faster than any number of people could have. No more midnight deck reviews. No more days on end" of coordination, Habib said.

This isn't about marginal productivity gains — it's about fundamentally different operating models where senior executives shift from managing coordination to designing intelligent systems.

Why so many AI initiatives are failing despite massive investment

Habib's arguments arrive as many enterprises face AI disillusionment. After initial excitement about generative AI, many companies have struggled to move beyond pilots and demonstrations to production deployments generating tangible business value.

Her diagnosis — that leaders are delegating rather than driving transformation — aligns with growing evidence that organizational factors, not technical limitations, explain most failures. Companies often lack clarity on use cases, struggle with data preparation, or face internal resistance to workflow changes that AI requires.

Perhaps the most striking aspect of Habib's presentation was her willingness to acknowledge the human cost of AI transformation — and insist leaders address it rather than avoid it. "Your job as a leader is to not look away from this fear. Your job is to face it with a plan," she told the audience.

She described "productivity anchoring" as a form of "self-sabotage" where employees resist AI adoption because their identity and self-worth are tied to execution tasks AI can now perform. The phenomenon suggests that successful AI transformation requires not just technical and strategic changes but psychological and cultural work that many leaders may be unprepared for.

Two challenges: Get your hands dirty, then reimagine everything

Habib closed by throwing down two gauntlets to her executive audience.

"First, a small one: get your hands dirty with agentic AI. Don't delegate. Choose a process that you oversee and automate it. See the difference from managing a complex process to redesigning it for yourself."

The second was more ambitious: "Go back to your team and ask, what could we achieve if execution were free? What would work feel like, be like, look like if you're unbound from the friction and process that slows us down today?"

She concluded: "The tools for creation are in your hands. The mandate for leadership is on your shoulders. What will you build?"

For enterprise leaders accustomed to viewing AI as an IT initiative, Habib's message is clear: that approach isn't working, won't work, and reflects a fundamental misunderstanding of what AI represents. Whether executives embrace her call to personally drive transformation — or continue delegating to IT departments — may determine which organizations thrive and which become cautionary tales.

The statistic she opened with lingers uncomfortably: 42% of Fortune 500 C-suite executives say AI is tearing their companies apart. Habib's diagnosis suggests they're tearing themselves apart by clinging to organizational models designed for an era when execution was scarce. The cure she prescribes requires leaders to do something most find uncomfortable: stop managing complexity and start dismantling it.

Sakana AI's CTO says he's 'absolutely sick' of transformers, the tech that powers every major AI model

In a striking act of self-critique, one of the architects of the transformer technology that powers ChatGPT, Claude, and virtually every major AI system told an audience of industry leaders this week that artificial intelligence research has become dangerously narrow — and that he's moving on from his own creation.

Llion Jones, who co-authored the seminal 2017 paper "Attention Is All You Need" and even coined the name "transformer," delivered an unusually candid assessment at the TED AI conference in San Francisco on Tuesday: Despite unprecedented investment and talent flooding into AI, the field has calcified around a single architectural approach, potentially blinding researchers to the next major breakthrough.

"Despite the fact that there's never been so much interest and resources and money and talent, this has somehow caused the narrowing of the research that we're doing," Jones told the audience. The culprit, he argued, is the "immense amount of pressure" from investors demanding returns and researchers scrambling to stand out in an overcrowded field.

The warning carries particular weight given Jones's role in AI history. The transformer architecture he helped develop at Google has become the foundation of the generative AI boom, enabling systems that can write essays, generate images, and engage in human-like conversation. His paper has been cited more than 100,000 times, making it one of the most influential computer science publications of the century.

Now, as CTO and co-founder of Tokyo-based Sakana AI, Jones is explicitly abandoning his own creation. "I personally made a decision in the beginning of this year that I'm going to drastically reduce the amount of time that I spend on transformers," he said. "I'm explicitly now exploring and looking for the next big thing."

Why more AI funding has led to less creative research, according to a transformer pioneer

Jones painted a picture of an AI research community suffering from what he called a paradox: More resources have led to less creativity. He described researchers constantly checking whether they've been "scooped" by competitors working on identical ideas, and academics choosing safe, publishable projects over risky, potentially transformative ones.

"If you're doing standard AI research right now, you kind of have to assume that there's maybe three or four other groups doing something very similar, or maybe exactly the same," Jones said, describing an environment where "unfortunately, this pressure damages the science, because people are rushing their papers, and it's reducing the amount of creativity."

He drew an analogy from AI itself — the "exploration versus exploitation" trade-off that governs how algorithms search for solutions. When a system exploits too much and explores too little, it finds mediocre local solutions while missing superior alternatives. "We are almost certainly in that situation right now in the AI industry," Jones argued.

The implications are sobering. Jones recalled the period just before transformers emerged, when researchers were endlessly tweaking recurrent neural networks — the previous dominant architecture — for incremental gains. Once transformers arrived, all that work suddenly seemed irrelevant. "How much time do you think those researchers would have spent trying to improve the recurrent neural network if they knew something like transformers was around the corner?" he asked.

He worries the field is repeating that pattern. "I'm worried that we're in that situation right now where we're just concentrating on one architecture and just permuting it and trying different things, where there might be a breakthrough just around the corner."

How the 'Attention is all you need' paper was born from freedom, not pressure

To underscore his point, Jones described the conditions that allowed transformers to emerge in the first place — a stark contrast to today's environment. The project, he said, was "very organic, bottom up," born from "talking over lunch or scrawling randomly on the whiteboard in the office."

Critically, "we didn't actually have a good idea, we had the freedom to actually spend time and go and work on it, and even more importantly, we didn't have any pressure that was coming down from management," Jones recounted. "No pressure to work on any particular project, publish a number of papers to push a certain metric up."

That freedom, Jones suggested, is largely absent today. Even researchers recruited for astronomical salaries — "literally a million dollars a year, in some cases" — may not feel empowered to take risks. "Do you think that when they start their new position they feel empowered to try their wild ideas and more speculative ideas, or do they feel immense pressure to prove their worth and once again, go for the low hanging fruit?" he asked.

Why one AI lab is betting that research freedom beats million-dollar salaries

Jones's proposed solution is deliberately provocative: Turn up the "explore dial" and openly share findings, even at competitive cost. He acknowledged the irony of his position. "It may sound a little controversial to hear one of the Transformers authors stand on stage and tell you that he's absolutely sick of them, but it's kind of fair enough, right? I've been working on them longer than anyone, with the possible exception of seven people."

At Sakana AI, Jones said he's attempting to recreate that pre-transformer environment, with nature-inspired research and minimal pressure to chase publications or compete directly with rivals. He offered researchers a mantra from engineer Brian Cheung: "You should only do the research that wouldn't happen if you weren't doing it."

One example is Sakana's "continuous thought machine," which incorporates brain-like synchronization into neural networks. An employee who pitched the idea told Jones he would have faced skepticism and pressure not to waste time at previous employers or academic positions. At Sakana, Jones gave him a week to explore. The project became successful enough to be spotlighted at NeurIPS, a major AI conference.

Jones even suggested that freedom beats compensation in recruiting. "It's a really, really good way of getting talent," he said of the exploratory environment. "Think about it, talented, intelligent people, ambitious people, will naturally seek out this kind of environment."

The transformer's success may be blocking AI's next breakthrough

Perhaps most provocatively, Jones suggested transformers may be victims of their own success. "The fact that the current technology is so powerful and flexible... stopped us from looking for better," he said. "It makes sense that if the current technology was worse, more people would be looking for better."

He was careful to clarify that he's not dismissing ongoing transformer research. "There's still plenty of very important work to be done on current technology and bringing a lot of value in the coming years," he said. "I'm just saying that given the amount of talent and resources that we have currently, we can afford to do a lot more."

His ultimate message was one of collaboration over competition. "Genuinely, from my perspective, this is not a competition," Jones concluded. "We all have the same goal. We all want to see this technology progress so that we can all benefit from it. So if we can all collectively turn up the explore dial and then openly share what we find, we can get to our goal much faster."

The high stakes of AI's exploration problem

The remarks arrive at a pivotal moment for artificial intelligence. The industry grapples with mounting evidence that simply building larger transformer models may be approaching diminishing returns. Leading researchers have begun openly discussing whether the current paradigm has fundamental limitations, with some suggesting that architectural innovations — not just scale — will be needed for continued progress toward more capable AI systems.

Jones's warning suggests that finding those innovations may require dismantling the very incentive structures that have driven AI's recent boom. With tens of billions of dollars flowing into AI development annually and fierce competition among labs driving secrecy and rapid publication cycles, the exploratory research environment he described seems increasingly distant.

Yet his insider perspective carries unusual weight. As someone who helped create the technology now dominating the field, Jones understands both what it takes to achieve breakthrough innovation and what the industry risks by abandoning that approach. His decision to walk away from transformers — the architecture that made his reputation — adds credibility to a message that might otherwise sound like contrarian positioning.

Whether AI's power players will heed the call remains uncertain. But Jones offered a pointed reminder of what's at stake: The next transformer-scale breakthrough could be just around the corner, pursued by researchers with the freedom to explore. Or it could be languishing unexplored while thousands of researchers race to publish incremental improvements on architecture that, in Jones's words, one of its creators is "absolutely sick of."

After all, he's been working on transformers longer than almost anyone. He would know when it's time to move on.

Kai-Fu Lee's brutal assessment: America is already losing the AI hardware war to China

China is on track to dominate consumer artificial intelligence applications and robotics manufacturing within years, but the United States will maintain its substantial lead in enterprise AI adoption and cutting-edge research, according to Kai-Fu Lee, one of the world's most prominent AI scientists and investors.

In a rare, unvarnished assessment delivered via video link from Beijing to the TED AI conference in San Francisco Tuesday, Lee — a former executive at Apple, Microsoft, and Google who now runs both a major venture capital firm and his own AI company — laid out a technology landscape splitting along geographic and economic lines, with profound implications for both commercial competition and national security.

"China's robotics has the advantage of having integrated AI into much lower costs, better supply chain and fast turnaround, so companies like Unitree are actually the farthest ahead in the world in terms of building affordable, embodied humanoid AI," Lee said, referring to a Chinese robotics manufacturer that has undercut Western competitors on price while advancing capabilities.

The comments, made to a room filled with Silicon Valley executives, investors, and researchers, represented one of the most detailed public assessments from Lee about the comparative strengths and weaknesses of the world's two AI superpowers — and suggested that the race for artificial intelligence leadership is becoming less a single contest than a series of parallel competitions with different winners.

Why venture capital is flowing in opposite directions in the U.S. and China

At the heart of Lee's analysis lies a fundamental difference in how capital flows in the two countries' innovation ecosystems. American venture capitalists, Lee said, are pouring money into generative AI companies building large language models and enterprise software, while Chinese investors are betting heavily on robotics and hardware.

"The VCs in the US don't fund robotics the way the VCs do in China," Lee said. "Just like the VCs in China don't fund generative AI the way the VCs do in the US."

This investment divergence reflects different economic incentives and market structures. In the United States, where companies have grown accustomed to paying for software subscriptions and where labor costs are high, enterprise AI tools that boost white-collar productivity command premium prices. In China, where software subscription models have historically struggled to gain traction but manufacturing dominates the economy, robotics offers a clearer path to commercialization.

The result, Lee suggested, is that each country is pulling ahead in different domains — and may continue to do so.

"China's got some challenges to overcome in getting a company funded as well as OpenAI or Anthropic," Lee acknowledged, referring to the leading American AI labs. "But I think U.S., on the flip side, will have trouble developing the investment interest and value creation in the robotics" sector.

Why American companies dominate enterprise AI while Chinese firms struggle with subscriptions

Lee was explicit about one area where the United States maintains what appears to be a durable advantage: getting businesses to actually adopt and pay for AI software.

"The enterprise adoption will clearly be led by the United States," Lee said. "The Chinese companies have not yet developed a habit of paying for software on a subscription."

This seemingly mundane difference in business culture — whether companies will pay monthly fees for software — has become a critical factor in the AI race. The explosion of spending on tools like GitHub Copilot, ChatGPT Enterprise, and other AI-powered productivity software has fueled American companies' ability to invest billions in further research and development.

Lee noted that China has historically overcome similar challenges in consumer technology by developing alternative business models. "In the early days of internet software, China was also well behind because people weren't willing to pay for software," he said. "But then advertising models, e-commerce models really propelled China forward."

Still, he suggested, someone will need to "find a new business model that isn't just pay per software per use or per month basis. That's going to not happen in China anytime soon."

The implication: American companies building enterprise AI tools have a window — perhaps a substantial one — where they can generate revenue and reinvest in R&D without facing serious Chinese competition in their core market.

How ByteDance, Alibaba and Tencent will outpace Meta and Google in consumer AI

Where Lee sees China pulling ahead decisively is in consumer-facing AI applications — the kind embedded in social media, e-commerce, and entertainment platforms that billions of people use daily.

"In terms of consumer usage, that's likely to happen," Lee said, referring to China matching or surpassing the United States in AI deployment. "The Chinese giants, like ByteDance and Alibaba and Tencent, will definitely move a lot faster than their equivalent in the United States, companies like Meta, YouTube and so on."

Lee pointed to a cultural advantage: Chinese technology companies have spent the past decade obsessively optimizing for user engagement and product-market fit in brutally competitive markets. "The Chinese giants really work tenaciously, and they have mastered the art of figuring out product market fit," he said. "Now they have to add technology to it. So that is inevitably going to happen."

This assessment aligns with recent industry observations. ByteDance's TikTok became the world's most downloaded app through sophisticated AI-driven content recommendation, and Chinese companies have pioneered AI-powered features in areas like live-streaming commerce and short-form video that Western companies later copied.

Lee also noted that China has already deployed AI more widely in certain domains. "There are a lot of areas where China has also done a great job, such as using computer vision, speech recognition, and translation more widely," he said.

The surprising open-source shift that has Chinese models beating Meta's Llama

Perhaps Lee's most striking data point concerned open-source AI development — an area where China appears to have seized leadership from American companies in a remarkably short time.

"The 10 highest rated open source [models] are from China," Lee said. "These companies have now eclipsed Meta's Llama, which used to be number one."

This represents a significant shift. Meta's Llama models were widely viewed as the gold standard for open-source large language models as recently as early 2024. But Chinese companies — including Lee's own firm, 01.AI, along with Alibaba, Baidu, and others — have released a flood of open-source models that, according to various benchmarks, now outperform their American counterparts.

The open-source question has become a flashpoint in AI development. Lee made an extensive case for why open-source models will prove essential to the technology's future, even as closed models from companies like OpenAI command higher prices and, often, superior performance.

"I think open source has a number of major advantages," Lee argued. With open-source models, "you can examine it, tune it, improve it. It's yours, and it's free, and it's important for building if you want to build an application or tune the model to do something specific."

He drew an analogy to operating systems: "People who work in operating systems loved Linux, and that's why its adoption went through the roof. And I think in the future, open source will also allow people to tune a sovereign model for a country, make it work better for a particular language."

Still, Lee predicted both approaches will coexist. "I don't think open source models will win," he said. "I think just like we have Apple, which is closed, but provides a somewhat better experience than Android... I think we're going to see more apps using open-source models, more engineers wanting to build open-source models, but I think more money will remain in the closed model."

Why China's manufacturing advantage makes the robotics race 'not over, but' nearly decided

On robotics, Lee's message was blunt: the combination of China's manufacturing prowess, lower costs, and aggressive investment has created an advantage that will be difficult for American companies to overcome.

When asked directly whether the robotics race was already over with China victorious, Lee hedged only slightly. "It's not over, but I think the U.S. is still capable of coming up with the best robotic research ideas," he said. "But the VCs in the U.S. don't fund robotics the way the VCs do in China."

The challenge is structural. Building robots requires not just software and AI, but hardware manufacturing at scale — precisely the kind of integrated supply chain and low-cost production that China has spent decades perfecting. While American labs at universities and companies like Boston Dynamics continue to produce impressive research prototypes, turning those prototypes into affordable commercial products requires the manufacturing ecosystem that China possesses.

Companies like Unitree have demonstrated this advantage concretely. The company's humanoid robots and quadrupedal robots cost a fraction of their American-made equivalents while offering comparable or superior capabilities — a price-to-performance ratio that could prove decisive in commercial markets.

What worries Lee most: not AGI, but the race itself

Despite his generally measured tone about China's AI development, Lee expressed concern about one area where he believes the global AI community faces real danger — not the far-future risk of superintelligent AI, but the near-term consequences of moving too fast.

When asked about AGI risks, Lee reframed the question. "I'm less afraid of AI becoming self-aware and causing danger for humans in the short term," he said, "but more worried about it being used by bad people to do terrible things, or by the AI race pushing people to work so hard, so fast and furious and move fast and break things that they build products that have problems and holes to be exploited."

He continued: "I'm very worried about that. In fact, I think some terrible event will happen that will be a wake up call from this sort of problem."

Lee's perspective carries unusual weight because of his unique vantage point spanning both Chinese and American AI development. Over a career spanning more than three decades, he has held senior positions at Apple, Microsoft, and Google, while also founding Sinovation Ventures, which has invested in more than 400 companies across both countries. His AI company, 01.AI, founded in 2023, has released several open-source models that rank among the most capable in the world.

For American companies and policymakers, Lee's analysis presents a complex strategic picture. The United States appears to have clear advantages in enterprise AI software, fundamental research, and computing infrastructure. But China is moving faster in consumer applications, manufacturing robotics at lower costs, and potentially pulling ahead in open-source model development.

The bifurcation suggests that rather than a single "winner" in AI, the world may be heading toward a technology landscape where different countries excel in different domains — with all the economic and geopolitical complications that implies.

As the TED AI conference continued Wednesday, Lee's assessment hung over subsequent discussions. His message seemed clear: the AI race is not one contest, but many — and the United States and China are each winning different races.

Standing in the conference hall afterward, one venture capitalist, who asked not to be named, summed up the mood in the room: "We're not competing with China anymore. We're competing on parallel tracks." Whether those tracks eventually converge — or diverge into entirely separate technology ecosystems — may be the defining question of the next decade.

DeepSeek drops open-source model that compresses text 10x through images, defying conventions

DeepSeek, the Chinese artificial intelligence research company that has repeatedly challenged assumptions about AI development costs, has released a new model that fundamentally reimagines how large language models process information—and the implications extend far beyond its modest branding as an optical character recognition tool.

The company's DeepSeek-OCR model, released Monday with full open-source code and weights, achieves what researchers describe as a paradigm inversion: compressing text through visual representation up to 10 times more efficiently than traditional text tokens. The finding challenges a core assumption in AI development and could pave the way for language models with dramatically expanded context windows, potentially reaching tens of millions of tokens.

"We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping," the research team wrote in their technical paper. "Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10×), the model can achieve decoding (OCR) precision of 97%."

The implications have resonated across the AI research community. Andrej Karpathy, co-founder of OpenAI and former director of AI at Tesla, said in a post that the work raises fundamental questions about how AI systems should process information. "Maybe it makes more sense that all inputs to LLMs should only ever be images," Karpathy wrote. "Even if you happen to have pure text input, maybe you'd prefer to render it and then feed that in."

How DeepSeek achieved 10x compression by treating text as images

While DeepSeek marketed the release as an OCR model — a technology for converting images of text into digital characters — the research paper reveals more ambitious goals. The model demonstrates that visual representations can serve as a superior compression medium for textual information, inverting the conventional hierarchy where text tokens were considered more efficient than vision tokens.

"Traditionally, vision LLM tokens almost seemed like an afterthought or 'bolt on' to the LLM paradigm," wrote Jeffrey Emanuel, an AI researcher, in a detailed analysis of the paper. "And 10k words of English would take up far more space in a multimodal LLM when expressed as intelligible pixels than when expressed as tokens...But that gets inverted now from the ideas in this paper."

The model's architecture consists of two primary components: DeepEncoder, a novel 380-million-parameter vision encoder, and a 3-billion-parameter mixture-of-experts language decoder with 570 million activated parameters. DeepEncoder combines Meta's Segment Anything Model (SAM) for local visual perception with OpenAI's CLIP model for global visual understanding, connected through a 16x compression module.

To validate their compression claims, DeepSeek researchers tested the model on the Fox benchmark, a dataset of diverse document layouts. The results were striking: using just 100 vision tokens, the model achieved 97.3% accuracy on documents containing 700-800 text tokens — representing an effective compression ratio of 7.5x. Even at compression ratios approaching 20x, accuracy remained around 60%.

The practical impact: Processing 200,000 pages per day on a single GPU

The efficiency gains translate directly to production capabilities. According to the company, a single Nvidia A100-40G GPU can process more than 200,000 pages per day using DeepSeek-OCR. Scaling to a cluster of 20 servers with eight GPUs each, throughput reaches 33 million pages daily — sufficient to rapidly construct training datasets for other AI models.

On OmniDocBench, a comprehensive document parsing benchmark, DeepSeek-OCR outperformed GOT-OCR2.0 (which uses 256 tokens per page) while using only 100 vision tokens. More dramatically, it surpassed MinerU2.0 — which requires more than 6,000 tokens per page on average — while using fewer than 800 vision tokens.

DeepSeek designed the model to support five distinct resolution modes, each optimized for different compression ratios and use cases. The "Tiny" mode operates at 512×512 resolution with just 64 vision tokens, while "Gundam" mode combines multiple resolutions dynamically for complex documents. "Gundam mode consists of n×640×640 tiles (local views) and a 1024×1024 global view," the researchers wrote.

Why this breakthrough could unlock 10 million token context windows

The compression breakthrough has immediate implications for one of the most pressing challenges in AI development: expanding the context windows that determine how much information language models can actively consider. Current state-of-the-art models typically handle context windows measured in hundreds of thousands of tokens. DeepSeek's approach suggests a path to windows ten times larger.

"The potential of getting a frontier LLM with a 10 or 20 million token context window is pretty exciting," Emanuel wrote. "You could basically cram all of a company's key internal documents into a prompt preamble and cache this with OpenAI and then just add your specific query or prompt on top of that and not have to deal with search tools and still have it be fast and cost-effective."

The researchers explicitly frame their work in terms of context compression for language models. "Through DeepSeek-OCR, we demonstrate that vision-text compression can achieve significant token reduction (7-20×) for different historical context stages, offering a promising direction for addressing long-context challenges in large language models," they wrote.

The paper includes a speculative but intriguing diagram illustrating how the approach could implement memory decay mechanisms similar to human cognition. Older conversation rounds could be progressively downsampled to lower resolutions, consuming fewer tokens while maintaining key information — a form of computational forgetting that mirrors biological memory.

How visual processing could eliminate the 'ugly' tokenizer problem

Beyond compression, Karpathy highlighted how the approach challenges fundamental assumptions about how language models should process text. Traditional tokenizers—the systems that break text into units for processing—have long been criticized for their complexity and limitations.

"I already ranted about how much I dislike the tokenizer," Karpathy wrote. "Tokenizers are ugly, separate, not end-to-end stage. It 'imports' all the ugliness of Unicode, byte encodings, it inherits a lot of historical baggage, security/jailbreak risk (e.g. continuation bytes). It makes two characters that look identical to the eye look as two completely different tokens internally in the network."

Visual processing of text could eliminate these issues while enabling new capabilities. The approach naturally handles formatting information lost in pure text representations: bold text, colors, layout, embedded images. "Input can now be processed with bidirectional attention easily and as default, not autoregressive attention - a lot more powerful," Karpathy noted.

The implications resonate with human cognitive science. Emanuel drew a parallel to Hans Bethe, the renowned physicist who memorized vast amounts of reference data: "Having vast amounts of task-specific knowledge in your working memory is extremely useful. This seems like a very clever and additive approach to potentially expanding that memory bank by 10x or more."

The model's training: 30 million PDF pages across 100 languages

The model's capabilities rest on an extensive training regimen using diverse data sources. DeepSeek collected 30 million PDF pages covering approximately 100 languages, with Chinese and English accounting for 25 million pages. The training data spans nine document types — academic papers, financial reports, textbooks, newspapers, handwritten notes, and others.

Beyond document OCR, the training incorporated what the researchers call "OCR 2.0" data: 10 million synthetic charts, 5 million chemical formulas, and 1 million geometric figures. The model also received 20% general vision data for tasks like image captioning and object detection, plus 10% text-only data to maintain language capabilities.

The training process employed pipeline parallelism across 160 Nvidia A100-40G GPUs (20 nodes with 8 GPUs each), with the vision encoder divided between two pipeline stages and the language model split across two others. "For multimodal data, the training speed is 70B tokens/day," the researchers reported.

Open source release accelerates research and raises competitive questions

True to DeepSeek's pattern of open development, the company released the complete model weights, training code, and inference scripts on GitHub and Hugging Face. The GitHub repository gained over 4,000 stars within 24 hours of release, according to Dataconomy.

The breakthrough raises questions about whether other AI labs have developed similar techniques but kept them proprietary. Emanuel speculated that Google's Gemini models, which feature large context windows and strong OCR performance, might employ comparable approaches. "For all we know, Google could have already figured out something like this, which could explain why Gemini has such a huge context size and is so good and fast at OCR tasks," Emanuel wrote.

Google's Gemini 2.5 Pro offers a 1-million-token context window, with plans to expand to 2 million, though the company has not publicly detailed the technical approaches enabling this capability. OpenAI's GPT-5 supports 400,000 tokens, while Anthropic's Claude 4.5 offers 200,000 tokens, with a 1-million-token window available in beta for eligible organizations.

The unanswered question: Can AI reason over compressed visual tokens?

While the compression results are impressive, researchers acknowledge important open questions. "It's not clear how exactly this interacts with the other downstream cognitive functioning of an LLM," Emanuel noted. "Can the model reason as intelligently over those compressed visual tokens as it can using regular text tokens? Does it make the model less articulate by forcing it into a more vision-oriented modality?"

The DeepSeek paper focuses primarily on the compression-decompression capability, measured through OCR accuracy, rather than downstream reasoning performance. This leaves open whether language models could reason effectively over large contexts represented primarily as compressed visual tokens.

The researchers acknowledge their work represents "an initial exploration into the boundaries of vision-text compression." They note that "OCR alone is insufficient to fully validate true context optical compression" and plan future work including "digital-optical text interleaved pretraining, needle-in-a-haystack testing, and other evaluations."

DeepSeek has established a pattern of achieving competitive results with dramatically lower computational resources than Western AI labs. The company's earlier DeepSeek-V3 model reportedly cost just $5.6 million to train—though this figure represents only the final training run and excludes R&D and infrastructure costs—compared to hundreds of millions for comparable models from OpenAI and Anthropic.

Industry analysts have questioned the $5.6 million figure, with some estimates placing the company's total infrastructure and operational costs closer to $1.3 billion, though still lower than American competitors' spending.

The bigger picture: Should language models process text as images?

DeepSeek-OCR poses a fundamental question for AI development: should language models process text as text, or as images of text? The research demonstrates that, at least for compression purposes, visual representation offers significant advantages. Whether this translates to effective reasoning over vast contexts remains to be determined.

"From another perspective, optical contexts compression still offers substantial room for research and improvement, representing a promising new direction," the researchers concluded in their paper.

For the AI industry, the work adds another dimension to the race for longer context windows — a competition that has intensified as language models are applied to increasingly complex tasks requiring vast amounts of information. The open-source release ensures the technique will be widely explored, tested, and potentially integrated into future AI systems.

As Karpathy framed the deeper implication: "OCR is just one of many useful vision -> text tasks. And text -> text tasks can be made to be vision ->text tasks. Not vice versa." In other words, the path forward for AI might not run through better tokenizers — it might bypass text tokens altogether.

How Anthropic’s ‘Skills’ make Claude faster, cheaper, and more consistent for business workflows

Anthropic launched a new capability on Thursday that allows its Claude AI assistant to tap into specialized expertise on demand, marking the company's latest effort to make artificial intelligence more practical for enterprise workflows as it chases rival OpenAI in the intensifying competition over AI-powered software development.

The feature, called Skills, enables users to create folders containing instructions, code scripts, and reference materials that Claude can automatically load when relevant to a task. The system marks a fundamental shift in how organizations can customize AI assistants, moving beyond one-off prompts to reusable packages of domain expertise that work consistently across an entire company.

"Skills are based on our belief and vision that as model intelligence continues to improve, we'll continue moving towards general-purpose agents that often have access to their own filesystem and computing environment," said Mahesh Murag, a member of Anthropic's technical staff, in an exclusive interview with VentureBeat. "The agent is initially made aware only of the names and descriptions of each available skill and can choose to load more information about a particular skill when relevant to the task at hand."

The launch comes as Anthropic, valued at $183 billion after a recent $13 billion funding round, projects its annual revenue could nearly triple to as much as $26 billion in 2026, according to a recent Reuters report. The company is currently approaching a $7 billion annual revenue run rate, up from $5 billion in August, fueled largely by enterprise adoption of its AI coding tools — a market where it faces fierce competition from OpenAI's recently upgraded Codex platform.

How 'progressive disclosure' solves the context window problem

Skills differ fundamentally from existing approaches to customizing AI assistants, such as prompt engineering or retrieval-augmented generation (RAG), Murag explained. The architecture relies on what Anthropic calls "progressive disclosure" — Claude initially sees only skill names and brief descriptions, then autonomously decides which skills to load based on the task at hand, accessing only the specific files and information needed at that moment.

"Unlike RAG, this relies on simple tools that let Claude manage and read files from a filesystem," Murag told VentureBeat. "Skills can contain an unbounded amount of context to teach Claude how to complete a task or series of tasks. This is because Skills are based on the premise of an agent being able to autonomously and intelligently navigate a filesystem and execute code."

This approach allows organizations to bundle far more information than traditional context windows permit, while maintaining the speed and efficiency that enterprise users demand. A single skill can include step-by-step procedures, code templates, reference documents, brand guidelines, compliance checklists, and executable scripts — all organized in a folder structure that Claude navigates intelligently.

The system's composability provides another technical advantage. Multiple skills automatically stack together when needed for complex workflows. For instance, Claude might simultaneously invoke a company's brand guidelines skill, a financial reporting skill, and a presentation formatting skill to generate a quarterly investor deck — coordinating between all three without manual intervention.

What makes Skills different from OpenAI's Custom GPTs and Microsoft's Copilot

Anthropic is positioning Skills as distinct from competing offerings like OpenAI's Custom GPTs and Microsoft's Copilot Studio, though the features address similar enterprise needs around AI customization and consistency.

"Skills' combination of progressive disclosure, composability, and executable code bundling is unique in the market," Murag said. "While other platforms require developers to build custom scaffolding, Skills let anyone — technical or not — create specialized agents by organizing procedural knowledge into files."

The cross-platform portability also sets Skills apart. The same skill works identically across Claude.ai, Claude Code (Anthropic's AI coding environment), the company's API, and the Claude Agent SDK for building custom AI agents. Organizations can develop a skill once and deploy it everywhere their teams use Claude, a significant advantage for enterprises seeking consistency.

The feature supports any programming language compatible with the underlying container environment, and Anthropic provides sandboxing for security — though the company acknowledges that allowing AI to execute code requires users to carefully vet which skills they trust.

Early customers report 8x productivity gains on finance workflows

Early customer implementations reveal how organizations are applying Skills to automate complex knowledge work. At Japanese e-commerce giant Rakuten, the AI team is using Skills to transform finance operations that previously required manual coordination across multiple departments.

"Skills streamline our management accounting and finance workflows," said Yusuke Kaji, general manager of AI at Rakuten in a statement. "Claude processes multiple spreadsheets, catches critical anomalies, and generates reports using our procedures. What once took a day, we can now accomplish in an hour."

That's an 8x improvement in productivity for specific workflows — the kind of measurable return on investment that enterprises increasingly demand from AI implementations. Mike Krieger, Anthropic's chief product officer and Instagram co-founder, recently noted that companies have moved past "AI FOMO" to requiring concrete success metrics.

Design platform Canva plans to integrate Skills into its own AI agent workflows. "Canva plans to leverage Skills to customize agents and expand what they can do," said Anwar Haneef, general manager and head of ecosystem at Canva in a statement. "This unlocks new ways to bring Canva deeper into agentic workflows—helping teams capture their unique context and create stunning, high-quality designs effortlessly."

Cloud storage provider Box sees Skills as a way to make corporate content repositories more actionable. "Skills teaches Claude how to work with Box content," said Yashodha Bhavnani, head of AI at Box. "Users can transform stored files into PowerPoint presentations, Excel spreadsheets, and Word documents that follow their organization's standards—saving hours of effort."

The enterprise security question: Who controls which AI skills employees can use?

For enterprise IT departments, Skills raise important questions about governance and control—particularly since the feature allows AI to execute arbitrary code in sandboxed environments. Anthropic has built administrative controls that allow enterprise customers to manage access at the organizational level.

"Enterprise admins control access to the Skills capability via admin settings, where they can enable or disable access and monitor usage patterns," Murag said. "Once enabled at the organizational level, individual users still need to opt in."

That two-layer consent model — organizational enablement plus individual opt-in — reflects lessons learned from previous enterprise AI deployments where blanket rollouts created compliance concerns. However, Anthropic's governance tools appear more limited than some enterprise customers might expect. The company doesn't currently offer granular controls over which specific skills employees can use, or detailed audit trails of custom skill content.

Organizations concerned about data security should note that Skills require Claude's code execution environment, which runs in isolated containers. Anthropic advises users to "stick to trusted sources" when installing skills and provides security documentation, but the company acknowledges this is an inherently higher-risk capability than traditional AI interactions.

From API to no-code: How Anthropic is making Skills accessible to everyone

Anthropic is taking several approaches to make Skills accessible to users with varying technical sophistication. For non-technical users on Claude.ai, the company provides a "skill-creator" skill that interactively guides users through building new skills by asking questions about their workflow, then automatically generating the folder structure and documentation.

Developers working with Anthropic's API get programmatic control through a new /skills endpoint and can manage skill versions through the Claude Console web interface. The feature requires enabling the Code Execution Tool beta in API requests. For Claude Code users, skills can be installed via plugins from the anthropics/skills GitHub marketplace, and teams can share skills through version control systems.

"Skills are included in Max, Pro, Teams, and Enterprise plans at no additional cost," Murag confirmed. "API usage follows standard API pricing," meaning organizations pay only for the tokens consumed during skill execution, not for the skills themselves.

Anthropic provides several pre-built skills for common business tasks, including professional generation of Excel spreadsheets with formulas, PowerPoint presentations, Word documents, and fillable PDFs. These Anthropic-created skills will remain free.

Why the Skills launch matters in the AI coding wars with OpenAI

The Skills announcement arrives during a pivotal moment in Anthropic's competition with OpenAI, particularly around AI-assisted software development. Just one day before releasing Skills, Anthropic launched Claude Haiku 4.5, a smaller and cheaper model that nonetheless matches the coding performance of Claude Sonnet 4 — which was state-of-the-art when released just five months ago.

That rapid improvement curve reflects the breakneck pace of AI development, where today's frontier capabilities become tomorrow's commodity offerings. OpenAI has been pushing hard on coding tools as well, recently upgrading its Codex platform with GPT-5 and expanding GitHub Copilot's capabilities.

Anthropic's revenue trajectory — potentially reaching $26 billion in 2026 from an estimated $9 billion by year-end 2025 — suggests the company is successfully converting enterprise interest into paying customers. The timing also follows Salesforce's announcement this week that it's deepening AI partnerships with both OpenAI and Anthropic to power its Agentforce platform, signaling that enterprises are adopting a multi-vendor approach rather than standardizing on a single provider.

Skills addresses a real pain point: the "prompt engineering" problem where effective AI usage depends on individual employees crafting elaborate instructions for routine tasks, with no way to share that expertise across teams. Skills transforms implicit knowledge into explicit, shareable assets. For startups and developers, the feature could accelerate product development significantly — adding sophisticated document generation capabilities that previously required dedicated engineering teams and weeks of development.

The composability aspect hints at a future where organizations build libraries of specialized skills that can be mixed and matched for increasingly complex workflows. A pharmaceutical company might develop skills for regulatory compliance, clinical trial analysis, molecular modeling, and patient data privacy that work together seamlessly — creating a customized AI assistant with deep domain expertise across multiple specialties.

Anthropic indicates it's working on simplified skill creation workflows and enterprise-wide deployment capabilities to make it easier for organizations to distribute skills across large teams. As the feature rolls out to Anthropic's more than 300,000 business customers, the true test will be whether organizations find Skills substantively more useful than existing customization approaches.

For now, Skills offers Anthropic's clearest articulation yet of its vision for AI agents: not generalists that try to do everything reasonably well, but intelligent systems that know when to access specialized expertise and can coordinate multiple domains of knowledge to accomplish complex tasks. If that vision catches on, the question won't be whether your company uses AI — it will be whether your AI knows how your company actually works.

Amazon and Chobani adopt Strella's AI interviews for customer research as fast-growing startup raises $14M

One year after emerging from stealth, Strella has raised $14 million in Series A funding to expand its AI-powered customer research platform, the company announced Thursday. The round, led by Bessemer Venture Partners with participation from Decibel Partners, Bain Future Back Ventures, MVP Ventures and 645 Ventures, comes as enterprises increasingly turn to artificial intelligence to understand customers faster and more deeply than traditional methods allow.

The investment marks a sharp acceleration for the startup founded by Lydia Hylton and Priya Krishnan, two former consultants and product managers who watched companies struggle with a customer research process that could take eight weeks from start to finish. Since October, Strella has grown revenue tenfold, quadrupled its customer base to more than 40 paying enterprises, and tripled its average contract values by moving upmarket to serve Fortune 500 companies.

"Research tends to be bookended by two very strategic steps: first, we have a problem—what research should we do? And second, we've done the research—now what are we going to do with it?" said Hylton, Strella's CEO, in an exclusive interview with VentureBeat. "All the stuff in the middle tends to be execution and lower-skill work. We view Strella as doing that middle 90% of the work."

The platform now serves Amazon, Duolingo, Apollo GraphQL, and Chobani, collectively conducting thousands of AI-moderated interviews that deliver what the company claims is a 90% average time savings on manual research work. The company is approaching $1 million in revenue after beginning monetization only in January, with month-over-month growth of 50% and zero customer churn to date.

How AI-powered interviews compress eight-week research projects into days

Strella's technology addresses a workflow that has frustrated product teams, marketers, and designers for decades. Traditional customer research requires writing interview guides, recruiting participants, scheduling calls, conducting interviews, taking notes, synthesizing findings, and creating presentations — a process that consumes weeks of highly-skilled labor and often delays critical product decisions.

The platform compresses that timeline to days by using AI to moderate voice-based interviews that run like Zoom calls, but with an artificial intelligence agent asking questions, following up on interesting responses, and detecting when participants are being evasive or fraudulent. The system then synthesizes findings automatically, creating highlight reels and charts from unstructured qualitative data.

"It used to take eight weeks. Now you can do it in the span of a couple days," Hylton told VentureBeat. "The primary technology is through an AI-moderated interview. It's like being in a Zoom call with an AI instead of a human — it's completely free form and voice based."

Critically, the platform also supports human moderators joining the same calls, reflecting the founders' belief that humans won't disappear from the research process. "Human moderation won't go away, which is why we've supported human moderation from our Genesis," Hylton said. "Research tends to be bookended by two very strategic steps: we have a problem, what's the research that we should do? And we've done the research, now what are we going to do with it? All the stuff in the middle tends to be execution and lower skill work. We view Strella as doing that middle 90% of the work."

Why customers tell AI moderators the truth they won't share with humans

One of Strella's most surprising findings challenges assumptions about AI in qualitative research: participants appear more honest with AI moderators than with humans. The founders discovered this pattern repeatedly as customers ran head-to-head comparisons between traditional human-moderated studies and Strella's AI approach.

"If you're a designer and you get on a Zoom call with a customer and you say, 'Do you like my design?' they're always gonna say yes. They don't want to hurt your feelings," Hylton explained. "But it's not a problem at all for Strella. They would tell you exactly what they think about it, which is really valuable. It's very hard to get honest feedback."

Krishnan, Strella's COO, said companies initially worried about using AI and "eroding quality," but the platform has "actually found the opposite to be true. People are much more open and honest with an AI moderator, and so the level of insight that you get is much richer because people are giving their unfiltered feedback."

This dynamic has practical business implications. Brian Santiago, Senior Product Design Manager at Apollo GraphQL, said in a statement: "Before Strella, studies took weeks. Now we get insights in a day — sometimes in just a few hours. And because participants open up more with the AI moderator, the feedback is deeper and more honest."

The platform also addresses endemic fraud in online surveys, particularly when participants are compensated. Because Strella interviews happen on camera in real time, the AI moderator can detect when someone pauses suspiciously long — perhaps to consult ChatGPT — and flags them as potentially fraudulent. "We are fraud resistant," Hylton said, contrasting this with traditional surveys where fraud rates can be substantial.

Solving mobile app research with persistent screen sharing technology

A major focus of the Series A funding will be expanding Strella's recently-launched mobile application, which Krishnan identified as critical competitive differentiation. The mobile app enables persistent screen sharing during interviews — allowing researchers to watch users navigate mobile applications in real time while the AI moderator asks about their experience.

"We are the only player in the market that supports screen sharing on mobile," Hylton said. "You know, I want to understand what are the pain points with my app? Why do people not seem to be able to find the checkout flow? Well, in order to do that effectively, you'd like to see the user screen while they're doing an interview."

For consumer-facing companies where mobile represents the primary customer interface, this capability opens entirely new use cases. The founders noted that "several of our customers didn't do research before" but have now built research practices around Strella because the platform finally made mobile research accessible at scale.

The platform also supports embedding traditional survey question types directly into the conversational interview, approaching what Hylton called "feature parity with a survey" while maintaining the engagement advantages of a natural conversation. Strella interviews regularly run 60 to 90 minutes with nearly 100% completion rates—a duration that would see 60-70% drop-off in a traditional survey format.

How Strella differentiated in a market crowded with AI research startups

Strella enters a market that appears crowded at first glance, with established players like Qualtrics and a wave of AI-powered startups promising to transform customer research. The founders themselves initially pursued a different approach — synthetic respondents, or "digital twins" that simulate customer perspectives using large language models.

"We actually pivoted from that. That was our initial idea," Hylton revealed, referring to synthetic respondents. "People are very intrigued by that concept, but found in practice, no willingness to pay right now."

Recent research suggesting companies could use language models as digital twins for customer feedback has reignited interest in that approach. But Hylton remains skeptical: "The capabilities of the LLMs as they are today are not good enough, in my opinion, to justify a standalone company. Right now you could just ask ChatGPT, 'What would new users of Duolingo think about this ad copy?' You can do that. Adding the standalone idea of a synthetic panel is sort of just putting a wrapper on that."

Instead, Strella's bet is that the real value lies in collecting proprietary qualitative data at scale — building what could become "the system of truth for all qualitative insights" within enterprises, as Lindsey Li, Vice President at Bessemer Venture Partners, described it.

Li, who led the investment just one year after Strella emerged from stealth, said the firm was convinced by both the technology and the team. "Strella has built highly differentiated technology that enables a continuous interview rather than a survey," Li said. "We heard time and time again that customers loved this product experience relative to other offerings."

On the defensibility question that concerns many AI investors, Li emphasized product execution over patents: "We think the long game here will be won with a million small product decisions, all of which must be driven by deep empathy for customer pain and an understanding of how best to address their needs. Lydia and Priya exhibit that in spades."

The founders point to technical depth that's difficult to replicate. Most competitors started with adaptive surveys — text-based interfaces where users type responses and wait for the next question. Some have added voice, but typically as uploaded audio clips rather than free-flowing conversation.

"Our approach is fundamentally better, which is the fact that it is a free form conversation," Hylton said. "You never have to control anything. You're never typing, there's no buttons, there's no upload and wait for the next question. It's completely free form, and that has been an extraordinarily hard product to build. There's a tremendous amount of IP in the way that we prompt our moderator, the way that we run analysis."

The platform also improves with use, learning from each customer's research patterns to fine-tune future interview guides and questions. "Our product gets better for our customers as they continue to use us," Hylton said. All research accumulates in a central repository where teams can generate new insights by chatting with the data or creating visualizations from previously unstructured qualitative feedback.

Creating new research budgets instead of just automating existing ones

Perhaps more important than displacing existing research is expanding the total market. Krishnan said growth has been "fundamentally related to our product" creating new research that wouldn't have happened otherwise.

"We have expanded the use cases in which people would conduct research," Krishnan explained. "Several of our customers didn't do research before, have always wanted to do research, but didn't have a dedicated researcher or team at their company that was devoted to it, and have purchased Strella to kick off and enable their research practice. That's been really cool where we've seen this market just opening up."

This expansion comes as enterprises face mounting pressure to improve customer experience amid declining satisfaction scores. According to Forrester Research's 2024 Customer Experience Index, customer experience quality has declined for three consecutive years — an unprecedented trend. The report found that 39% of brands saw CX quality deteriorate, with declines across effectiveness, ease, and emotional connection.

Meanwhile, Deloitte's 2025 Technology, Media & Telecommunications Predictions report forecasts that 25% of enterprises using generative AI will deploy AI agents by 2025, growing to 50% by 2027. The report specifically highlighted AI's potential to enhance customer satisfaction by 15-20% while reducing cost to serve by 20-30% when properly implemented.

Gartner identified conversational user interfaces — the category Strella inhabits — as one of three technologies poised to transform customer service by 2028, noting that "customers increasingly expect to be able to interact with the applications they use in a natural way."

Against this backdrop, Li sees substantial room for growth. "UX Research is a sub-sector of the $140B+ global market-research industry," Li said. "This includes both the software layer historically (~$430M) and professional services spend on UX research, design, product strategy, etc. which is conservatively estimated to be ~$6.4B+ annually. As software in this vertical, led by Strella, becomes more powerful, we believe the TAM will continue to expand meaningfully."

Making customer feedback accessible across the enterprise, not just research teams

The founders describe their mission as "democratizing access to the customer" — making it possible for anyone in an organization to understand customer perspectives without waiting for dedicated research teams to complete months-long studies.

"Many, many, many positions in the organization would like to get customer feedback, but it's so hard right now," Hylton said. With Strella, she explained, someone can "log into Strella and through a chat, create any highlight reel that you want and actually see customers in their own words answering the question that you have based on the research that's already been done."

This video-first approach to research repositories changes organizational dynamics around customer feedback. "Then you can say, 'Okay, engineering team, we need to build this feature. And here's the customer actually saying it,'" Hylton continued. "'This is not me. This isn't politics. Here are seven customers saying they can't find the Checkout button.' The fact that we are a very video-based platform really allows us to do that quickly and painlessly."

The company has moved decisively upmarket, with contract values now typically in the five-figure range and "several six figure contracts" signed, according to Krishnan. The pricing strategy reflects a premium positioning: "Our product is very good, it's very premium. We're charging based on the value it provides to customers," Krishnan said, rather than competing on cost alone.

This approach appears to be working. The company reports 100% conversion from pilot programs to paid contracts and zero churn among its 40-45 customers, with month-over-month revenue growth of 50%.

The roadmap: Computer vision, agentic AI, and human-machine collaboration

The Series A funding will primarily support scaling product and go-to-market teams. "We're really confident that we have product-market fit," Hylton said. "And now the question is execution, and we want to hire a lot of really talented people to help us execute."

On the product roadmap, Hylton emphasized continued focus on the participant experience as the key to winning the market. "Everything else is downstream of a joyful participant experience," she said, including "the quality of insights, the amount you have to pay people to do the interviews, and the way that your customers feel about a company."

Near-term priorities include adding visual capabilities so the AI moderator can respond to facial expressions and other nonverbal cues, and building more sophisticated collaboration features between human researchers and AI moderators. "Maybe you want to listen while an AI moderator is running a call and you might want to be able to jump in with specific questions," Hylton said. "Or you want to run an interview yourself, but you want the moderator to be there as backup or to help you."

These features move toward what the industry calls "agentic AI" — systems that can act more autonomously while still collaborating with humans. The founders see this human-AI collaboration, rather than full automation, as the sustainable path forward.

"We believe that a lot of the really strategic work that companies do will continue to be human moderated," Hylton said. "And you can still do that through Strella and just use us for synthesis in those cases."

For Li and Bessemer, the bet is on founders who understand this nuance. "Lydia and Priya exhibit the exact archetype of founders we are excited to partner with for the long term — customer-obsessed, transparent, thoughtful, and singularly driven towards the home-run scenario," she said.

The company declined to disclose specific revenue figures or valuation. With the new funding, Strella has now raised $18 million total, including a $4 million seed round led by Decibel Partners announced in October.

As Strella scales, the founders remain focused on a vision where technology enhances rather than eliminates human judgment—where an engineering team doesn't just read a research report, but watches seven customers struggle to find the same button. Where a product manager can query months of accumulated interviews in seconds. Where companies don't choose between speed and depth, but get both.

"The interesting part of the business is actually collecting that proprietary dataset, collecting qualitative research at scale," Hylton said, describing what she sees as Strella's long-term moat. Not replacing the researcher, but making everyone in the company one.

Microsoft launches 'Hey Copilot' voice assistant and autonomous agents for all Windows 11 PCs

Microsoft is fundamentally reimagining how people interact with their computers, announcing Thursday a sweeping transformation of Windows 11 that brings voice-activated AI assistants, autonomous software agents, and contextual intelligence to every PC running the operating system — not just premium devices with specialized chips.

The announcement represents Microsoft's most aggressive push yet to integrate generative artificial intelligence into the desktop computing experience, moving beyond the chatbot interfaces that have defined the first wave of consumer AI products toward a more ambient, conversational model where users can simply talk to their computers and have AI agents complete complex tasks on their behalf.

"When we think about what the promise of an AI PC is, it should be capable of three things," Yusuf Mehdi, Microsoft's Executive Vice President and Consumer Chief Marketing Officer, told reporters at a press conference last week. "First, you should be able to interact with it naturally, in text or voice, and have it understand you. Second, it should be able to see what you see and be able to offer guided support. And third, it should be able to take action on your behalf."

The shift could prove consequential for an industry searching for the "killer app" for generative AI. While hundreds of millions of people have experimented with ChatGPT and similar chatbots, integrating AI directly into the operating system that powers the vast majority of workplace computers could dramatically accelerate mainstream adoption — or create new security and privacy headaches for organizations already struggling to govern employee use of AI tools.

How 'Hey Copilot' aims to replace typing with talking on Windows PCs

At the heart of Microsoft's vision is voice interaction, which the company is positioning as the third fundamental input method for PCs after the mouse and keyboard — a comparison that underscores Microsoft's ambitions for reshaping human-computer interaction nearly four decades after the graphical user interface became standard.

Starting this week, any Windows 11 user can enable the "Hey Copilot" wake word with a single click, allowing them to summon Microsoft's AI assistant by voice from anywhere in the operating system. The feature, which had been in limited testing, is now being rolled out to hundreds of millions of devices globally.

"It's been almost four decades since the PC has changed the way you interact with it, which is primarily mouse and keyboard," Mehdi said. "When you think about it, we find that people type on a given day up to 14,000 words on their keyboard, which is really kind of mind-boggling. But what if now you can go beyond that and talk to it?"

The emphasis on voice reflects internal Microsoft data showing that users engage with Copilot twice as much when using voice compared to text input — a finding the company attributes to the lower cognitive barrier of speaking versus crafting precise written prompts.

"The magic unlock with Copilot Voice and Copilot Vision is the ease of interaction," according to the company's announcement. "Using the new wake word, 'Hey Copilot,' getting something done is as easy as just asking for it."

But Microsoft's bet on voice computing faces real-world constraints that Mehdi acknowledged during the briefing. When asked whether workers in shared office environments would use voice features, potentially compromising privacy, Mehdi noted that millions already conduct voice calls through their PCs with headphones, and predicted users would adapt: "Just like when the mouse came out, people have to figure out when to use it, what's the right way, how to make it happen."

Crucially, Microsoft is hedging its voice-first strategy by making all features accessible through traditional text input as well, recognizing that voice isn't always appropriate or accessible.

AI that sees your screen: Copilot Vision expands worldwide with new capabilities

Perhaps more transformative than voice control is the expansion of Copilot Vision, a feature Microsoft introduced earlier this year that allows the AI to analyze what's displayed on a user's screen and provide contextual assistance.

Previously limited to voice interaction, Copilot Vision is now rolling out worldwide with a new text-based interface, allowing users to type questions about what they're viewing rather than speaking them aloud. The feature can now access full document context in Microsoft Office applications — meaning it can analyze an entire PowerPoint presentation or Excel spreadsheet without the user needing to scroll through every page.

"With 68 percent of consumers reporting using AI to support their decision making, voice is making this easier," Microsoft explained in its announcement. "The magic unlock with Copilot Voice and Copilot Vision is the ease of interaction."

During the press briefing, Microsoft demonstrated Copilot Vision helping users navigate Spotify's settings to enable lossless audio streaming, coaching an artist through writing a professional bio based on their visual portfolio, and providing shopping recommendations based on products visible in YouTube videos.

"What brings AI to life is when you can give it rich context, when you can type great prompts," Mehdi explained. "The big challenge for the majority of people is we've been trained with search to do the opposite. We've been trained to essentially type in fewer keywords, because it turns out the less keywords you type on search, the better your answers are."

He noted that average search queries remain just 2.3 keywords, while AI systems perform better with detailed prompts — creating a disconnect between user habits and AI capabilities. Copilot Vision aims to bridge that gap by automatically gathering visual context.

"With Copilot Vision, you can simply share your screen and Copilot in literally milliseconds can understand everything on the screen and then provide intelligence," Mehdi said.

The vision capabilities work with any application without requiring developers to build specific integrations, using computer vision to interpret on-screen content — a powerful capability that also raises questions about what the AI can access and when.

Software robots take control: Inside Copilot Actions' controversial autonomy

The most ambitious—and potentially controversial—new capability is Copilot Actions, an experimental feature that allows AI to take control of a user's computer to complete tasks autonomously.

Coming first to Windows Insiders enrolled in Copilot Labs, the feature builds on Microsoft's May announcement of Copilot Actions on the web, extending the capability to manipulate local files and applications on Windows PCs.

During demonstrations, Microsoft showed the AI agent organizing photo libraries, extracting data from documents, and working through multi-step tasks while users attended to other work. The agent operates in a separate, sandboxed environment and provides running commentary on its actions, with users able to take control at any time.

"As a general-purpose agent — simply describe the task you want to complete in your own words, and the agent will attempt to complete it by interacting with desktop and web applications," according to the announcement. "While this is happening, you can choose to focus on other tasks. At any time, you can take over the task or check in on the progress of the action, including reviewing what actions have been taken."

Navjot Virk, Microsoft's Windows Experience Leader, acknowledged the technology's current limitations during the briefing. "We'll be starting with a narrow set of use cases while we optimize model performance and learn," Virk said. "You may see the agent make mistakes or encounter challenges with complex interfaces, which is why real-world testing of this experience is so critical."

The experimental nature of Copilot Actions reflects broader industry challenges with agentic AI — systems that can take actions rather than simply providing information. While the potential productivity gains are substantial, AI systems still occasionally "hallucinate" incorrect information and can be vulnerable to novel attacks.

Can AI agents be trusted? Microsoft's new security framework explained

Recognizing the security implications of giving AI control over users' computers and files, Microsoft introduced a new security framework built on four core principles: user control, operational transparency, limited privileges, and privacy-preserving design.

Central to this approach is the concept of "agent accounts" — separate Windows user accounts under which AI agents operate, distinct from the human user's account. Combined with a new "agent workspace" that provides a sandboxed desktop environment, the architecture aims to create clear boundaries around what agents can access and modify.

Peter Waxman, Microsoft's Windows Security Engineering Leader, emphasized that Copilot Actions is disabled by default and requires explicit user opt-in. "You're always in control of what Copilot Actions can do," Waxman said. "Copilot Actions is turned off by default and you're able to pause, take control, or disable it at any time."

During operation, users can monitor the agent's progress in real-time, and the system requests additional approval before taking "sensitive or important" actions. All agent activity occurs under the dedicated agent account, creating an audit trail that distinguishes AI actions from human ones.

However, the agent will have default access to users' Documents, Downloads, Desktop, and Pictures folders—a broad permission grant that could concern enterprise IT administrators.

Dana Huang, Corporate Vice President for Windows Security, acknowledged in a blog post that "agentic AI applications introduce novel security risks, such as cross-prompt injection (XPIA), where malicious content embedded in UI elements or documents can override agent instructions, leading to unintended actions like data exfiltration or malware installation."

Microsoft promises more details about enterprise controls at its Ignite conference in November.

Gaming, taskbar redesign, and deeper Office integration round out updates

Beyond voice and autonomous agents, Microsoft introduced changes across Windows 11's core interfaces and extended AI to new domains.

A new "Ask Copilot" feature integrates AI directly into the Windows taskbar, providing one-click access to start conversations, activate vision capabilities, or search for files and settings with "lightning-fast" results. The opt-in feature doesn't replace traditional Windows search.

File Explorer gains AI capabilities through integration with third-party services. A partnership with Manus AI allows users to right-click on local image files and generate complete websites without manual uploading or coding. Integration with Filmora enables quick jumps into video editing workflows.

Microsoft also introduced Copilot Connectors, allowing users to link cloud services like OneDrive, Outlook, Google Drive, Gmail, and Google Calendar directly to Copilot on Windows. Once connected, users can query personal content across platforms using natural language.

In a notable expansion beyond productivity, Microsoft and Xbox introduced Gaming Copilot for the ROG Xbox Ally handheld gaming devices developed with ASUS. The feature, accessible via a dedicated hardware button, provides an AI assistant that can answer gameplay questions, offer strategic advice, and help navigate game interfaces through natural voice conversation.

Why Microsoft is racing to embed AI everywhere before Apple and Google

Microsoft's announcement comes as technology giants race to embed generative AI into their core products following the November 2022 launch of ChatGPT. While Microsoft moved quickly to integrate OpenAI's technology into Bing search and introduce Copilot across its product line, the company has faced questions about whether AI features are driving meaningful engagement. Recent data shows Bing's search market share remaining largely flat despite AI integration.

The Windows integration represents a different approach: rather than charging separately for AI features, Microsoft is building them into the operating system itself, betting that embedded AI will drive Windows 11 adoption and competitive differentiation against Apple and Google.

Apple has taken a more cautious approach with Apple Intelligence, introducing AI features gradually and emphasizing privacy through on-device processing. Google has integrated AI across its services but has faced challenges with accuracy and reliability.

Crucially, while Microsoft highlighted new Copilot+ PC models from partners with prices ranging from $649.99 to $1,499.99, the core AI features announced today work on any Windows 11 PC — a significant departure from earlier positioning that suggested AI capabilities required new hardware with specialized neural processing units.

"Everything we showed you here is for all Windows 11 PCs. You don't need to run it on a copilot plus PC. It works on any Windows 11 PC," Mehdi clarified.

This democratization of AI features across the Windows 11 installed base potentially accelerates adoption but also complicates Microsoft's hardware sales pitch for premium devices.

What Microsoft's AI bet means for the future of computing

Mehdi framed the announcement in sweeping terms, describing Microsoft's goal as fundamentally reimagining the operating system for the AI era.

"We're taking kind of a bold view of it. We really feel that the vision that we have is, let's rewrite the entire operating system around AI and build essentially what becomes truly the AI PC," he said.

For Microsoft, the success of AI-powered Windows 11 could help drive the company's next phase of growth as PC sales have matured and cloud growth faces increased competition.

For users and organizations, the announcement represents a potential inflection point in how humans interact with computers — one that could significantly boost productivity if executed well, or create new security headaches if the AI proves unreliable or difficult to control.

The technology industry will be watching closely to see whether Microsoft's bet on conversational computing and agentic AI marks the beginning of a genuine paradigm shift, or proves to be another ambitious interface reimagining that fails to gain mainstream traction.

What's clear is that Microsoft is moving aggressively to stake its claim as the leader in AI-powered personal computing, leveraging its dominant position in desktop operating systems to bring generative AI directly into the daily workflows of potentially a billion users.

Copilot Voice and Vision are available today to Windows 11 users worldwide, with experimental capabilities coming to Windows Insiders in the coming weeks.

Anthropic is giving away its powerful Claude Haiku 4.5 AI for free to take on OpenAI

Anthropic released Claude Haiku 4.5 on Wednesday, a smaller and significantly cheaper artificial intelligence model that matches the coding capabilities of systems that were considered cutting-edge just months ago, marking the latest salvo in an intensifying competition to dominate enterprise AI.

The model costs $1 per million input tokens and $5 per million output tokens — roughly one-third the price of Anthropic's mid-sized Sonnet 4 model released in May, while operating more than twice as fast. In certain tasks, particularly operating computers autonomously, Haiku 4.5 actually surpasses its more expensive predecessor.

"Haiku 4.5 is a clear leap in performance and is now largely as smart as Sonnet 4 while being significantly faster and one-third of the cost," an Anthropic spokesperson told VentureBeat, underscoring how rapidly AI capabilities are becoming commoditized as the technology matures.

The launch comes just two weeks after Anthropic released Claude Sonnet 4.5, which the company bills as the world's best coding model, and two months after introducing Opus 4.1. The breakneck pace of releases reflects mounting pressure from OpenAI, whose $500 billion valuation dwarfs Anthropic's $183 billion, and which has inked a series of multibillion-dollar infrastructure deals while expanding its product lineup.

How free access to advanced AI could reshape the enterprise market

In an unusual move that could reshape competitive dynamics in the AI market, Anthropic is making Haiku 4.5 available for all free users of its Claude.ai platform. The decision effectively democratizes access to what the company characterizes as "near-frontier-level intelligence" — capabilities that would have been available only in expensive, premium models months ago.

"The launch of Claude Haiku 4.5 means that near-frontier-level intelligence is available for free to all users through Claude.ai," the Anthropic spokesperson told VentureBeat. "It also offers significant advantages to our enterprise customers: Sonnet 4.5 can handle frontier planning while Haiku 4.5 powers sub-agents, enabling multi-agent systems that tackle complex refactors, migrations, and large features builds with speed and quality."

This multi-agent architecture signals a significant shift in how AI systems are deployed. Rather than relying on a single, monolithic model, enterprises can now orchestrate teams of specialized AI agents: a more sophisticated Sonnet 4.5 model breaking down complex problems and delegating subtasks to multiple Haiku 4.5 agents working in parallel. For software development teams, this could mean Sonnet 4.5 plans a major code refactoring while Haiku 4.5 agents simultaneously execute changes across dozens of files.

The approach mirrors how human organizations distribute work, and could prove particularly valuable for enterprises seeking to balance performance with cost efficiency — a critical consideration as AI deployment scales.

Inside Anthropic's path to $7 billion in annual revenue

The model launch coincides with revelations that Anthropic's business is experiencing explosive growth. The company's annual revenue run rate is approaching $7 billion this month, Anthropic told Reuters, up from more than $5 billion reported in August. Internal projections obtained by Reuters suggest the company is targeting between $20 billion and $26 billion in annualized revenue for 2026, representing growth of more than 200% to nearly 300%.

The company now serves more than 300,000 business customers, with enterprise products accounting for approximately 80% of revenue. Among Anthropic's most successful offerings is Claude Code, a code-generation tool that has reached nearly $1 billion in annualized revenue since launching earlier this year.

Those numbers come as artificial intelligence enters what many in the industry characterize as a critical inflection point. After two years of what Anthropic Chief Product Officer Mike Krieger recently described as "AI FOMO" — where companies adopted AI tools without clear success metrics — enterprises are now demanding measurable returns on investment.

"The best products can be grounded in some kind of success metric or evaluation," Krieger said on the "Superhuman AI" podcast. "I've seen that a lot in talking to companies that are deploying AI."

For enterprises evaluating AI tools, the calculus increasingly centers on concrete productivity gains. Google CEO Sundar Pichai claimed in June that AI had generated a 10% boost in engineering velocity at his company — though measuring such improvements across different roles and use cases remains challenging, as Krieger acknowledged.

Why AI safety testing matters more than ever for enterprise adoption

Anthropic's launch comes amid heightened scrutiny of the company's approach to AI safety and regulation. On Tuesday, David Sacks, the White House's AI "czar" and a venture capitalist, accused Anthropic of "running a sophisticated regulatory capture strategy based on fear-mongering" that is "damaging the startup ecosystem."

The attack targeted remarks by Jack Clark, Anthropic's British co-founder and head of policy, who had described being "deeply afraid" of AI's trajectory. Clark told Bloomberg he found Sacks' criticism "perplexing."

Anthropic addressed such concerns head-on in its release materials, emphasizing that Haiku 4.5 underwent extensive safety testing. The company classified the model as ASL-2 — its AI Safety Level 2 standard — compared to the more restrictive ASL-3 designation for the more powerful Sonnet 4.5 and Opus 4.1 models.

"Our teams have red-teamed and tested our agentic capabilities to the limits in order to assess whether it can be used to engage in harmful activity like generating misinformation or promoting fraudulent behavior like scams," the spokesperson told VentureBeat. "In our automated alignment assessment, it showed a statistically significantly lower overall rate of misaligned behaviors than both Claude Sonnet 4.5 and Claude Opus 4.1 — making it, by this metric, our safest model yet."

The company said its safety testing showed Haiku 4.5 poses only limited risks regarding the production of chemical, biological, radiological and nuclear weapons. Anthropic has also implemented classifiers designed to detect and filter prompt injection attacks, a common method for attempting to manipulate AI systems into producing harmful content.

The emphasis on safety reflects Anthropic's founding mission. The company was established in 2021 by former OpenAI executives, including siblings Dario and Daniela Amodei, who left amid concerns about OpenAI's direction following its partnership with Microsoft. Anthropic has positioned itself as taking a more cautious, research-oriented approach to AI development.

Benchmark results show Haiku 4.5 competing with larger, more expensive models

According to Anthropic's benchmarks, Haiku 4.5 performs competitively with or exceeds several larger models across multiple evaluation criteria. On SWE-bench Verified, a widely used test measuring AI systems' ability to solve real-world software engineering problems, Haiku 4.5 scored 73.3% — slightly ahead of Sonnet 4's 72.7% and close to GPT-5 Codex's 74.5%.

The model demonstrated particular strength in computer use tasks, achieving 50.7% on the OSWorld benchmark compared to Sonnet 4's 42.2%. This capability allows the AI to interact directly with computer interfaces — clicking buttons, filling forms, navigating applications — which could prove transformative for automating routine digital tasks.

In coding-specific benchmarks like Terminal-Bench, which tests AI agents' ability to complete complex software tasks using command-line tools, Haiku 4.5 scored 41.0%, trailing only Sonnet 4.5's 50.0% among Claude models.

The model maintains a 200,000-token context window for standard users, with developers accessing the Claude Developer Platform able to use a 1-million-token context window. That expanded capacity means the model can process extremely large codebases or documents in a single request — roughly equivalent to a 1,500-page book.

What three major AI model releases in two months says about the competition

When asked about the rapid succession of model releases, the Anthropic spokesperson emphasized the company's focus on execution rather than competitive positioning.

"We're focused on shipping the best possible products for our customers — and our shipping velocity speaks for itself," the spokesperson said. "What was state-of-the-art just five months ago is now faster, cheaper, and more accessible."

That velocity stands in contrast to the company's earlier, more measured release schedule. Anthropic appeared to have paused development of its Haiku line after releasing version 3.5 at the end of last year, leading some observers to speculate the company had deprioritized smaller models.

That rapid price-performance improvement validates a core promise of artificial intelligence: that capabilities will become dramatically cheaper over time as the technology matures and companies optimize their models. For enterprises, it suggests that today's budget constraints around AI deployment may ease considerably in coming years.

From customer service to code: Real-world applications for faster, cheaper AI

The practical applications of Haiku 4.5 span a wide range of enterprise functions, from customer service to financial analysis to software development. The model's combination of speed and intelligence makes it particularly suited for real-time, low-latency tasks like chatbot conversations and customer support interactions, where delays of even a few seconds can degrade user experience.

In financial services, the multi-agent architecture enabled by pairing Sonnet 4.5 with Haiku 4.5 could transform how firms monitor markets and manage risk. Anthropic envisions Haiku 4.5 monitoring thousands of data streams simultaneously — tracking regulatory changes, market signals and portfolio risks — while Sonnet 4.5 handles complex predictive modeling and strategic analysis.

For research organizations, the division of labor could compress timelines dramatically. Sonnet 4.5 might orchestrate a comprehensive analysis while multiple Haiku 4.5 agents parallelize literature reviews, data gathering and document synthesis across dozens of sources, potentially "compressing weeks of research into hours," according to Anthropic's use case descriptions.

Several companies have already integrated Haiku 4.5 and reported positive results. Guy Gur-Ari, co-founder of coding startup Augment, said the model "hit a sweet spot we didn't think was possible: near-frontier coding quality with blazing speed and cost efficiency." In Augment's internal testing, Haiku 4.5 achieved 90% of Sonnet 4.5's performance while matching much larger models.

Jeff Wang, CEO of Windsurf, another coding-focused startup, said Haiku 4.5 "is blurring the lines" on traditional trade-offs between speed, cost and quality. "It's a fast frontier model that keeps costs efficient and signals where this class of models is headed."

Jon Noronha, co-founder of presentation software company Gamma, reported that Haiku 4.5 "outperformed our current models on instruction-following for slide text generation, achieving 65% accuracy versus 44% from our premium tier model — that's a game-changer for our unit economics."

The price of progress: What plummeting AI costs mean for enterprise strategy

For enterprises evaluating AI strategies, Haiku 4.5 presents both opportunity and challenge. The opportunity lies in accessing sophisticated AI capabilities at dramatically lower costs, potentially making viable entire categories of applications that were previously too expensive to deploy at scale.

The challenge is keeping pace with a technology landscape that is evolving faster than most organizations can absorb. As Krieger noted in his recent podcast appearance, companies are moving beyond "AI FOMO" to demand concrete metrics and demonstrated value. But establishing those metrics and evaluation frameworks takes time — time that may be in short supply as competitors race ahead.

The shift from single-model deployments to multi-agent architectures also requires new ways of thinking about AI systems. Rather than viewing AI as a monolithic assistant, enterprises must learn to orchestrate multiple specialized agents, each optimized for particular tasks — more akin to managing a team than operating a tool.

The fundamental economics of AI are shifting with remarkable speed. Five months ago, Sonnet 4's capabilities commanded premium pricing and represented the cutting edge. Today, Haiku 4.5 delivers similar performance at a third of the cost. If that trajectory continues — and both Anthropic's release schedule and competitive pressure from OpenAI and Google suggest it will — the AI capabilities that seem remarkable today may be routine and inexpensive within a year.

For Anthropic, the challenge will be translating technical achievements into sustainable business growth while maintaining the safety-focused approach that differentiates it from competitors. The company's projected revenue growth to as much as $26 billion by 2026 suggests strong market traction, but achieving those targets will require continued innovation and successful execution across an increasingly complex product portfolio.

Whether enterprises will choose Claude over increasingly capable alternatives from OpenAI, Google and a growing field of competitors remains an open question. But Anthropic is making a clear bet: that the future of AI belongs not to whoever builds the single most powerful model, but to whoever can deliver the right intelligence, at the right speed, at the right price — and make it accessible to everyone.

In an industry where the promise of artificial intelligence has long outpaced reality, Anthropic is betting that delivering on that promise, faster and cheaper than anyone expected, will be enough to win. And with pricing dropping by two-thirds in just five months while performance holds steady, that promise is starting to look like reality.

Dfinity launches Caffeine, an AI platform that builds production apps from natural language prompts

The Dfinity Foundation on Wednesday released Caffeine, an artificial intelligence platform that allows users to build and deploy web applications through natural language conversation alone, bypassing traditional coding entirely. The system, which became publicly available today, represents a fundamental departure from existing AI coding assistants by building applications on a specialized decentralized infrastructure designed specifically for autonomous AI development.

Unlike GitHub Copilot, Cursor, or other "vibe coding" tools that help human developers write code faster, Caffeine positions itself as a complete replacement for technical teams. Users describe what they want in plain language, and an ensemble of AI models writes, deploys, and continually updates production-grade applications — with no human intervention in the codebase itself.

"In the future, you as a prospective app owner or service owner… will talk to AI. AI will give you what you want on a URL," said Dominic Williams, founder and chief scientist at the Dfinity Foundation, in an exclusive interview with VentureBeat. "You will use that, completely interact productively, and you'll just keep talking to AI to evolve what that does. The AI, or an ensemble of AIs, will be your tech team."

The platform has attracted significant early interest: more than 15,000 alpha users tested Caffeine before its public release, with daily active users representing 26% of those who received access codes — "early Facebook kind of levels," according to Williams. The foundation reports some users spending entire days building applications on the platform, forcing Dfinity to consider usage limits due to underlying AI infrastructure costs.

Why Caffeine's custom programming language guarantees your data won't disappear

Caffeine's most significant technical claim addresses a problem that has plagued AI-generated code: data loss during application updates. The platform builds applications using Motoko, a programming language developed by Dfinity specifically for AI use, which provides mathematical guarantees that upgrades cannot accidentally delete user data.

"When AI is updating apps and services in production, a mistake cannot lose data. That's a guarantee," Williams said. "It's not like there are some safeguards to try and stop it losing data. This language framework gives it rails that guarantee if an upgrade, an update to its app's underlying logic, would cause data loss, the upgrade fails and the AI just tries again."

This addresses what Williams characterizes as critical failures in competing platforms. User forums for tools like Lovable and Replit, he notes, frequently report three major problems: applications that become irreparably broken as complexity increases, security vulnerabilities that allow unauthorized access, and mysterious data loss during updates.

Traditional tech stacks evolved to meet human developer needs — familiarity with SQL databases, preference for known programming languages, existing skill investments. "That's how the traditional tech stacks evolved. It's really evolved to meet human needs," Williams explained. "But in the future, it's going to be different. You're not going to care how the AI did it. Instead, for you, AI is the tech stack."

Caffeine's architecture reflects this philosophy. Applications run entirely on the Internet Computer Protocol (ICP), a blockchain-based network that Dfinity launched in May 2021 after raising over $100 million from investors including Andreessen Horowitz and Polychain Capital. The ICP uses what Dfinity calls "chain-key cryptography" to create what Williams describes as "tamper-proof" code — applications that are mathematically guaranteed to execute their written logic without interference from traditional cyberattacks.

"The code can't be affected by ransomware, so you don't have to worry about malware in the same way you do," Williams said. "Configuration errors don't result in traditional cyber attacks. That passive traditional cyber attacks isn't something you need to worry about."

How 'orthogonal persistence' lets AI build apps without managing databases

At the heart of Caffeine's technical approach is a concept called "orthogonal persistence," which fundamentally reimagines how applications store and manage data. In traditional development, programmers must write extensive code to move data between application logic and separate database systems — marshaling data in and out of SQL servers, managing connections, handling synchronization.

Motoko eliminates this entirely. Williams demonstrated with a simple example: defining a blog post data type and declaring a variable to store an array of posts requires just two lines of code. "This declaration is all that's necessary to have the blog maintain its list of posts," he explained during a presentation on the technology. "Compare that to traditional IT where in order to persist the blog posts, you'd have to marshal them in and out of a database server. This is quite literally orders of magnitude more simple."

This abstraction allows AI to work at a higher conceptual level, focusing on application logic rather than infrastructure plumbing. "Logic and data are kind of the same," Williams said. "This is one of the things that enables AI to build far more complicated functionality than it could otherwise do."

The system also employs what Dfinity calls "loss-safe data migration." When AI needs to modify an application's data structure — adding a "likes" field to blog posts, for example — it must write migration logic in two passes. The framework automatically verifies that the transformation won't result in data loss, refusing to compile or deploy code that could delete information unless explicitly instructed.

From million-dollar SaaS contracts to conversational app building in minutes

Williams positions Caffeine as particularly transformative for enterprise IT, where he claims costs could fall to "1% of what they were before" while time-to-market shrinks to similar fractions. The platform targets a spectrum from individual creators to large corporations, all of whom currently face either expensive development teams or constraining low-code templates.

"A corporation or government department might want to create a corporate portal or CRM, ERP functionality," Williams said, referring to customer relationship management and enterprise resource planning systems. "They will otherwise have to obtain this by signing up for some incredibly expensive SaaS service where they become locked in, their data gets stuck, and they still have to spend a lot of money on consultants customizing the functionality."

Applications built through Caffeine are owned entirely by their creators and cannot be shut down by centralized parties — a consequence of running on the decentralized Internet Computer network rather than traditional cloud providers like Amazon Web Services. "When someone says built on the internet computer, it actually means built on the internet computer," Williams emphasized, contrasting this with blockchain projects that merely host tokens while running actual applications on centralized infrastructure.

The platform demonstrated this versatility during a July 2025 hackathon in San Francisco, where participants created applications ranging from a "Will Maker" tool for generating legal documents, to "Blue Lens," a voice-AI water quality monitoring system, to "Road Patrol," a gamified community reporting app for infrastructure problems. Critically, many of these came from non-technical participants with no coding background.

"I'm from a non-technical background, I'm actually a quality assurance professional," said the creator of Blue Lens in a video testimonial. "Through Caffeine I can build something really intuitive and next-gen to the public." The application integrated multiple external services — Eleven Labs for voice AI, real-time government water data through retrieval-augmented generation, and Midjourney-generated visual assets — all coordinated through conversational prompts.

What separates Caffeine from GitHub Copilot, Cursor, and the 'vibe coding' wave

Caffeine enters a crowded market of AI-assisted development tools, but Williams argues the competition isn't truly comparable. GitHub Copilot, Cursor, and similar tools serve human developers working with traditional technology stacks. Platforms like Replit and Lovable occupy a middle ground, offering "vibe coding" that mixes AI generation with human editing.

"If you're a Node.js developer, you know you're working with the traditional stack, and you might want to do your coding with Copilot or using Claude or using Cursor," Williams said. "That's a very different thing to what Caffeine is offering. There'll always be cases where you probably wouldn't want to hand over the logic of the control system for a new nuclear missile silo to AI. But there's going to be these holdout areas, right? And there's all the legacy stuff that has to be maintained."

The key distinction, according to Williams, lies in production readiness. Existing AI coding tools excel at rapid prototyping but stumble when applications grow complex or require guaranteed reliability. Reddit forums for these platforms document users hitting insurmountable walls where applications break irreparably, or where AI-generated code introduces security vulnerabilities.

"As the demands and the requirements become more complicated, eventually you can hit a limit, and when you hit that limit, not only can you not go any further, but sometimes your app will get broken and there's no way of going back to where you were before," Williams said. "That can't happen with productive apps, and it also can't be the case that you're getting hacked and losing data, because once you go hands-free, if you like, and there's no tech team, there's no technical people involved, who's going to run the backups and restore your app?"

The Internet Computer's architecture addresses this through Byzantine fault tolerance — even if attackers gain physical control over some network hardware, they cannot corrupt applications or their data. "This is the beginning of a compute revolution and it's also the perfect platform for AI to build on," Williams said.

Inside the vision: A web that programs itself through natural language

Dfinity frames Caffeine within a broader vision it calls the "self-writing internet," where the web literally programs itself through natural language interaction. This represents what Williams describes as a "seismic shift coming to tech" — from human developers selecting technology stacks based on their existing skills, to AI selecting optimal implementations invisible to users.

"You don't care about whether some human being has learned all of the different platforms and Amazon Web Services or something like that. You don't care about that. You just care: Is it secure? Do you get security guarantees? Is it resilient? What's the level of resilience?" Williams said. "Those are the new parameters."

The platform demonstrated this during live demonstrations, including at the World Computer Summit 2025 in Zurich. Williams created a talent recruitment application from scratch in under two minutes, then modified it in real-time while the application ran with users already interacting with it. "You will continue talking to the AI and just keep on refreshing the URL to see the changes," he explained.

This capability extends to complex scenarios. During demonstrations, Williams showed building a tennis lesson booking system, an e-commerce platform, and an event registration system — all simultaneously, working on multiple applications in parallel. "We predict that as people get very proficient with Caffeine, they could be working on even 10 apps in parallel," he said.

The system writes substantial code: a simple personal blog generated 700 lines of code in a couple of minutes. More complex applications can involve thousands of lines across frontend and backend components, all abstracted away from the user who only describes desired functionality.

The economics of cloning: How Caffeine's app market challenges traditional stores

Caffeine's economic model differs fundamentally from traditional software-as-a-service platforms. Applications run on the Internet Computer Protocol, which uses a "reverse gas model" where developers pay for computation rather than users paying transaction fees. The platform includes an integrated App Market where creators can publish applications for others to clone and adapt — creating what Dfinity envisions as a new economic ecosystem.

"App stores today obviously operate on gatekeeping," said Pierre Samaties, chief business officer at Dfinity, during the World Computer Summit. "That's going to erode." Rather than purchasing applications, users can clone them and modify them for their own purposes — fundamentally different from Apple's App Store or Google Play models.

Williams acknowledges that Caffeine itself currently runs on centralized infrastructure, despite building applications on the decentralized Internet Computer. "Caffeine itself actually is centralized. It uses aspects of the Internet Computer. We want Caffeine itself to run on the Internet Computer in the future, but it's not there now," he said. The platform leverages commercially available foundation models from companies like Anthropic, whose Claude Sonnet model powers much of Caffeine's backend logic.

This pragmatic approach reflects Dfinity's strategy of using best-in-class AI models while focusing its own development on the specialized infrastructure and programming language designed for AI use. "These content models have been developed by companies with enormous budgets, absolutely enormous budgets," Williams said. "I don't think in the near future we'll run AI on the Internet Computer for that reason, unless there's a special case."

A decade in the making: From Ethereum roots to the self-writing internet

The Dfinity Foundation has pursued this vision since Williams began researching decentralized networks in late 2013. After involvement with Ethereum before its 2015 launch, Williams became fascinated with the concept of a "world computer"—a public blockchain network that could host not just tokens but entire applications and services.

"By 2015 I was talking about network-focused drivers, Dfinity back then, and that could really operate as an alternative tech stack, and eventually host even things like social networks and massive enterprise systems," Williams said. The foundation launched the Internet Computer Protocol in May 2021, initially focusing on Web3 developers. Despite not being among the highest-valued blockchain projects, ICP consistently ranks in the top 10 for developer numbers.

The pivot to AI-driven development came from recognizing that "in the future, the tech stack will be AI," according to Williams. This realization led to Caffeine's development, announced on Dfinity's public roadmap in March 2025 and demonstrated at the World Computer Summit in June 2025.

One successful example of the Dfinity vision running in production is OpenChat, a messaging application that runs entirely on the Internet Computer and is governed by a decentralized autonomous organization (DAO) with tens of thousands of participants voting on source code updates through algorithmic governance. "The community is actually controlling the source code updates," Williams explained. "Developers propose updates, community reads the updates, and if the community is happy, OpenChat updates itself."

The skeptics weigh in: Crypto baggage and real-world testing ahead

The platform faces several challenges. Dfinity's crypto industry roots may create perception problems in enterprise markets, Williams acknowledges. "The Web3 industry's reputation is a bit tarnished and probably rightfully so," he said during the World Computer Summit. "Now people can, for themselves, experience what a decentralized network is. We're going to see self-writing take over the enterprise space because the speed and efficiency are just incredible."

The foundation's history includes controversy: ICP's token launched in 2021 at over $100 per token with an all-time high around $700, then crashed below $3 in 2023 before recovering. The project has faced legal challenges, including class action lawsuits alleging misleading investors, and Dfinity filed defamation claims against industry critics.

Technical limitations also remain. Caffeine cannot yet compile React front-ends on the Internet Computer itself, requiring some off-chain processing. Complex integrations with traditional systems — payment processing through Stripe, for example — still require centralized components. "Your app is running end-to-end on the Internet Computer, then when it needs to actually accept payment, it's going to hand over to your Stripe account," Williams explained.

The platform's claims about data loss prevention and security guarantees, while technically grounded in the Motoko language design and Internet Computer architecture, remain to be tested at scale with diverse real-world applications. The 26% daily active user rate from alpha testing is impressive but comes from a self-selected group of early adopters.

When five billion smartphone users become developers

Williams rejects concerns that AI-driven development will eliminate software engineering jobs, arguing instead for market expansion. "The self-writing internet empowers eight billion non-technical people," he said. "Some of these people will enter roles in tech, becoming prompt engineers, tech entrepreneurs, or helping run online communities. Humanity will create millions of new custom apps and services, and a subset of those will require professional human assistance."

During his World Computer Summit demonstration, Williams was explicit about the scale of transformation Dfinity envisions. "Today there are about 35,000 Web3 engineers in the world. Worldwide there are about 15 million full-stack engineers," he said. "But tomorrow with the self-writing internet, everyone will be a builder. Today there are already about five billion people with internet-connected smartphones and they'll all be able to use Caffeine."

The hackathon results suggest this isn't pure hyperbole. A dentist built "Dental Tracks" to help patients manage their dental records. A transportation industry professional created "Road Patrol" for gamified infrastructure reporting. A frustrated knitting student built "Skill Sprout," a garden-themed app for learning new hobbies, complete with material checklists and step-by-step skill breakdowns—all without writing a single line of code.

"I was learning to knit. I got irritated because I had the wrong materials," the creator explained in a video interview. "I don't know how to do the stitches, so I have to individually search, and it's really intimidating when you're trying to learn something you don't—you don't even know what you don't know."

Whether Caffeine succeeds depends on factors still unknown: how production applications perform under real-world stress, whether the Internet Computer scales to millions of applications, whether enterprises can overcome their skepticism of blockchain-adjacent technology. But if Williams is right about the fundamental shift — that AI will be the tech stack, not just a tool for human developers — then someone will build what Caffeine promises.

The question isn't whether the future looks like this. It's who gets there first, and whether they can do it without losing everyone's data along the way.

Visa just launched a protocol to secure the AI shopping boom — here’s what it means for merchants

Visa is introducing a new security framework designed to solve one of the thorniest problems emerging in artificial intelligence-powered commerce: how retailers can tell the difference between legitimate AI shopping assistants and the malicious bots that plague their websites.

The payments giant unveiled its Trusted Agent Protocol on Tuesday, establishing what it describes as foundational infrastructure for "agentic commerce" — a term for the rapidly growing practice of consumers delegating shopping tasks to AI agents that can search products, compare prices, and complete purchases autonomously.

The protocol enables merchants to cryptographically verify that an AI agent browsing their site is authorized and trustworthy, rather than a bot designed to scrape pricing data, test stolen credit cards, or carry out other fraudulent activities.

The launch comes as AI-driven traffic to U.S. retail websites has exploded by more than 4,700% over the past year, according to data from Adobe cited by Visa. That dramatic surge has created an acute challenge for merchants whose existing bot detection systems — designed to block automated traffic — now risk accidentally blocking legitimate AI shoppers along with bad actors.

"Merchants need additional tools that provide them with greater insight and transparency into agentic commerce activities to ensure they can participate safely," said Rubail Birwadker, Visa's Global Head of Growth, in an exclusive interview with VentureBeat. "Without common standards, potential risks include ecosystem fragmentation and the proliferation of closed loop models."

The stakes are substantial. While 85% of shoppers who have used AI to shop report improved experiences, merchants face the prospect of either turning away legitimate AI-powered customers or exposing themselves to sophisticated bot attacks. Visa's own data shows the company prevented $40 billion in fraudulent activity between October 2022 and September 2023, nearly double the previous year, much of it involving AI-powered enumeration attacks where bots systematically test combinations of card numbers until finding valid credentials.

Inside the cryptographic handshake: How Visa verifies AI shopping agents

Visa's Trusted Agent Protocol operates through what Birwadker describes as a "cryptographic trust handshake" between merchants and approved AI agents. The system works in three steps:

First, AI agents must be approved and onboarded through Visa's Intelligent Commerce program, where they undergo vetting to meet trust and reliability standards. Each approved agent receives a unique digital signature key — essentially a cryptographic credential that proves its identity.

When an approved agent visits a merchant's website, it creates a digital signature using its key and transmits three categories of information: Agent Intent (indicating the agent is trusted and intends to retrieve product details or make a purchase), Consumer Recognition (data showing whether the underlying consumer has an existing account with the merchant), and Payment Information (optional payment data to support checkout).

Merchants or their infrastructure providers, such as content delivery networks, then validate these digital signatures against Visa's registry of approved agents. "Upon proper validation of these fields, the merchant can confirm the signature is a trusted agent," Birwadker explained.

Crucially, Visa designed the protocol to require minimal changes to existing merchant infrastructure. Built on the HTTP Message Signature standard and aligned with Web Both Auth, the protocol works with existing web infrastructure without requiring merchants to overhaul their checkout pages. "This is no-code functionality," Birwadker emphasized, though merchants may need to integrate with Visa's Developer Center to access the verification system.

The race for AI commerce standards: Visa faces competition from Google, OpenAI, and Stripe

Visa developed the protocol in collaboration with Cloudflare, the web infrastructure and security company that already provides bot management services to millions of websites. The partnership reflects Visa's recognition that solving bot verification requires cooperation across the entire web stack, not just the payments layer.

"Trusted Agent Protocol supplements traditional bot management by providing merchants insights that enable agentic commerce," Birwadker said. "Agents are providing additional context they otherwise would not, including what it intends to do, who the underlying consumer is, and payment information."

The protocol arrives as multiple technology giants race to establish competing standards for AI commerce. Google recently introduced its Agent Protocol for Payments (AP2), while OpenAI and Stripe have discussed their own approaches to enabling AI agents to make purchases. Microsoft, Shopify, Adyen, Ant International, Checkout.com, Cybersource, Elavon, Fiserv, Nuvei, and Worldpay provided feedback during Trusted Agent Protocol's development, according to Visa.

When asked how Visa's protocol relates to these competing efforts, Birwadker struck a collaborative tone. "Both Google's AP2 and Visa's Trusted Agent Protocol are working toward the same goal of building trust in agent-initiated payments," he said. "We are engaged with Google, OpenAI, and Stripe and are looking to create compatibility across the ecosystem."

Visa says it is working with global standards bodies including the Internet Engineering Task Force (IETF), OpenID Foundation, and EMVCo to ensure the protocol can eventually become interoperable with other emerging standards. "While these specifications apply to the Visa network in this initial phase, enabling agents to safely and securely act on a consumer's behalf requires an open, ecosystem-wide approach," Birwadker noted.

Who pays when AI agents go rogue? Unanswered questions about liability and authorization

The protocol raises important questions about authorization and liability when AI agents make purchases on behalf of consumers. If an agent completes an unauthorized transaction — perhaps misunderstanding a user's intent or exceeding its delegated authority — who bears responsibility?

Birwadker emphasized that the protocol helps merchants "leverage this information to enable experiences tied to existing consumer relationships and more secure checkout," but he did not provide specific details about how disputes would be handled when agents make unauthorized purchases. Visa's existing fraud protection and chargeback systems would presumably apply, though the company has not yet published detailed guidance on agent-initiated transaction disputes.

The protocol also places Visa in the position of gatekeeper for the emerging agentic commerce ecosystem. Because Visa determines which AI agents get approved for the Intelligent Commerce program and receive cryptographic credentials, the company effectively controls which agents merchants can easily trust. "Agents are approved and onboarded through the Visa Intelligent Commerce program, ensuring they meet our standards for trust and reliability," Birwadker said, though he did not detail the specific criteria agents must meet or whether Visa charges fees for approval.

This gatekeeping role could prove contentious, particularly if Visa's approval process favors large technology companies over startups, or if the company faces pressure to block agents from competitors or politically controversial entities. Visa declined to provide details about how many agents it has approved so far or how long the vetting process typically takes.

Visa's legal battles and the long road to merchant adoption

The protocol launch comes at a complex moment for Visa, which continues to navigate significant legal and regulatory challenges even as its core business remains robust. The company's latest earnings report for the third quarter of fiscal year 2025 showed a 10% increase in net revenues to $9.2 billion, driven by resilient consumer spending and strong growth in cross-border transaction volume. For the full fiscal year ending September 30, 2024, Visa processed 289 billion transactions, with a total payments volume of $15.2 trillion.

However, the company's legal headwinds have intensified. In July 2025, a federal judge rejected a landmark $30 billion settlement that Visa and Mastercard had reached with merchants over long-disputed credit card swipe fees, sending the parties back to the negotiating table and extending the long-running legal battle.

Simultaneously, Visa remains under investigation by the Department of Justice over its rules for routing debit card transactions, with regulators scrutinizing whether the company's practices unlawfully limit merchant choice and stifle competition. These domestic challenges are mirrored abroad, where European regulators have continued their own antitrust investigations into the fee structures of both Visa and its primary competitor, Mastercard.

Against this backdrop of regulatory pressure, Birwadker acknowledged that adoption of the Trusted Agent Protocol will take time. "As agentic commerce continues to rise, we recognize that consumer trust is still in its early stages," he said. "That's why our focus through 2025 is on building foundational credibility and demonstrating real-world value."

The protocol is available immediately in Visa's Developer Center and on GitHub, with agent onboarding already active and merchant integration resources available. But Birwadker declined to provide specific targets for how many merchants might adopt the protocol by the end of 2026. "Adoption is aligned with the momentum we're already seeing," he said. "The launch of our protocol marks another big step — it's not just a technical milestone, but a signal that the industry is beginning to unify."

Industry analysts say merchant adoption will likely depend on how quickly agentic commerce grows as a percentage of overall e-commerce. While AI-driven traffic has surged dramatically, much of that consists of agents browsing and researching rather than completing purchases. If AI agents begin accounting for a significant share of completed transactions, merchants will face stronger incentives to adopt verification systems like Visa's protocol.

From fraud detection to AI gatekeeping: Visa's $10 billion bet on artificial intelligence

Visa's move reflects broader strategic bets on AI across the financial services industry. The company has invested $10 billion in technology over the past five years to reduce fraud and increase network security, with AI and machine learning central to those efforts. Visa's fraud detection system analyzes over 500 different attributes for each transaction, using AI models to assign real-time risk scores to the 300 billion annual transactions flowing through its network.

"Every single one of those transactions has been processed by AI," James Mirfin, Visa's global head of risk and identity solutions, said in a July 2024 CNBC interview discussing the company's fraud prevention efforts. "If you see a new type of fraud happening, our model will see that, it will catch it, it will score those transactions as high risk and then our customers can decide not to approve those transactions."

The company has also moved aggressively into new payment territories beyond its core card business. In January 2025, Visa partnered with Elon Musk's X (formerly Twitter) to provide the infrastructure for a digital wallet and peer-to-peer payment service called the X Money Account, competing with services like Venmo and Zelle. That deal marked Visa's first major partnership in the social media payments space and reflected the company's recognition that payment flows are increasingly happening outside traditional e-commerce channels.

The agentic commerce protocol represents an extension of this strategy — an attempt to ensure Visa remains central to payment flows even as the mechanics of shopping shift from direct human interaction to AI intermediation. Jack Forestell, Visa's Chief Product & Strategy Officer, framed the protocol in expansive terms: "We believe the entire payments ecosystem has a responsibility to ensure sellers trust AI agents with the same confidence they place in their most valued customers and networks."

The coming battle for control of AI shopping

The real test for Visa's protocol won't be technical — it will be political. As AI agents become a larger force in retail, whoever controls the verification infrastructure controls access to hundreds of billions of dollars in commerce. Visa's position as gatekeeper gives it enormous leverage, but also makes it a target.

Merchants chafing under Visa's existing fee structure and facing multiple antitrust investigations may resist ceding even more power to the payments giant. Competitors like Google and OpenAI, each with their own ambitions in commerce, have little incentive to let Visa dictate standards. Regulators already scrutinizing Visa's market dominance will surely examine whether its agent approval process unfairly advantages certain players.

And there's a deeper question lurking beneath the technical specifications and corporate partnerships: In an economy increasingly mediated by AI, who decides which algorithms get to spend our money? Visa is making an aggressive bid to be that arbiter, wrapping its answer in the language of security and interoperability. Whether merchants, consumers, and regulators accept that proposition will determine not just the fate of the Trusted Agent Protocol, but the structure of AI-powered commerce itself.

For now, Visa is moving forward with the confidence of a company that has weathered disruption before. But in the emerging world of agentic commerce, being too trusted might prove just as dangerous as not being trusted enough.

❌
❌