
Yesterday — 28 October 2025Main stream

Google's 'Watch & Learn' framework cracks the data bottleneck for training computer-use agents

24 October 2025 at 03:00

A new framework developed by researchers at Google Cloud and DeepMind aims to address one of the key challenges of developing computer use agents (CUAs): Gathering high-quality training examples at scale.

The framework, dubbed Watch & Learn (W&L), addresses the problem of training data generation in a way that doesn’t require human annotation and can automatically extract demonstrations from raw videos.

Their experiments show that data generated by W&L can be used to train or fine-tune existing computer use and foundation models to improve their performance on computer-use tasks. But equally important, the same approach can be used to create in-context learning (ICL) examples for computer use agents, enabling companies to create CUAs for bespoke internal tasks without the need for costly training of specialized models.

The data bottleneck of CUA

The web is rich with video tutorials and screencasts that describe complex workflows for using applications. These videos are a gold mine that can provide computer use agents with domain knowledge and instructions for accomplishing different tasks through user interface interactions.

However, before they can be used to train CUAs, these videos need to be transformed into annotated trajectories (that is, a set of task descriptions, screenshots and actions), a process that is prohibitively expensive and time-consuming when done manually.

Existing approaches to address this data bottleneck rely on annotating these videos with multimodal language models, which usually results in low precision and faulty examples. A different approach uses self-play agents that autonomously explore user interfaces to collect trajectories. However, techniques using this approach usually create simple examples that are not useful in unpredictable real-world situations.

As the researchers note in their paper, “Overall, these approaches either rely on brittle heuristics, are costly as they rely on explorations in real environments or generate low-complexity demonstrations misaligned with human intent.”

Watch & Learn

The Watch & Learn framework tries to address the challenges of creating CUA demonstrations by rethinking the problem formulation.

Instead of directly generating trajectories or depending on complex multi-stage pipelines, the researchers frame the problem as an “inverse dynamics objective”: Given two consecutive observations, predict the intermediate action that produced the transition.

According to the researchers, this formulation is “easier to learn, avoids hand-crafted heuristics and generalizes robustly across applications.”

The W&L framework can be broken down into three key stages: Training an inverse dynamics model (IDM), retrieving raw videos, and training CUAs.

In the first phase, the researchers used agents to interact with live web pages to create a large corpus of 500,000 state transitions (two consecutive observations and the action that resulted in the transition). They then used this data (along with 132,000 human-annotated transitions from existing open datasets) to train the IDM, which takes in two consecutive observations and predicts the transition action. Their trained IDM, which is a small transformer model, outperformed off-the-shelf foundation models in predicting transition actions.

The researchers then designed a pipeline that retrieves videos from platforms such as YouTube and runs them through the IDM to generate high-quality trajectories. The IDM takes in consecutive video frames and determines the actions (scroll, click) that caused the changes in the environment, which are then packaged into annotated trajectories. Using this method, they generated 53,125 trajectories with high-accuracy action labels.
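The labeling step described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' implementation: the `Step` type, the `label_video` helper and the `toy_idm` stand-in are all hypothetical names, and frames are represented as plain strings rather than screenshots.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = str  # assumption: a string stands in for a screenshot


@dataclass
class Step:
    observation: Frame
    action: str


def label_video(frames: Sequence[Frame],
                idm: Callable[[Frame, Frame], str]) -> List[Step]:
    """Pair each frame with the action the IDM predicts led to the next frame."""
    steps = []
    for before, after in zip(frames, frames[1:]):
        steps.append(Step(observation=before, action=idm(before, after)))
    return steps


def toy_idm(before: Frame, after: Frame) -> str:
    # Hypothetical stand-in for a trained inverse dynamics model.
    return "noop" if before == after else f"click(target={after!r})"


trajectory = label_video(["home", "settings", "settings", "profile"], toy_idm)
for step in trajectory:
    print(step.observation, "->", step.action)
```

A video with N frames yields N-1 labeled transitions, because each prediction needs a "before" and an "after" observation.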

These examples can be used to train effective computer use models for specific tasks. But the researchers also found that trajectories extracted through the IDM can serve as in-context learning examples to improve the performance of CUAs on bespoke tasks at inference time. For ICL, they use Gemini 2.5 Flash to add reasoning annotations to the observation/action examples in the trajectories, which can then be inserted into the CUA’s prompt (usually 3-5 examples) during inference.
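As a rough sketch of how annotated trajectories might be packed into a prompt, the snippet below formats a handful of demonstrations ahead of the new task. The `AnnotatedStep`, `Demo` and `build_prompt` names and the prompt layout are assumptions made for illustration. The sketch simply takes the first few demonstrations, whereas a real system would presumably pick the ones most relevant to the task.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AnnotatedStep:
    observation: str  # e.g. a screenshot description or image reference
    reasoning: str    # annotation added by a separate model
    action: str       # action label predicted by the IDM


@dataclass
class Demo:
    task: str
    steps: List[AnnotatedStep]


def format_demo(index: int, demo: Demo) -> str:
    lines = [f"Example {index}: {demo.task}"]
    for i, step in enumerate(demo.steps, start=1):
        lines.append(f"  Step {i} | observation: {step.observation}")
        lines.append(f"         | reasoning: {step.reasoning}")
        lines.append(f"         | action: {step.action}")
    return "\n".join(lines)


def build_prompt(task: str, demos: List[Demo], max_examples: int = 5) -> str:
    """Place up to max_examples demonstrations ahead of the new task."""
    chosen = demos[:max_examples]  # assumption: demos arrive already ranked
    parts = [format_demo(i, d) for i, d in enumerate(chosen, start=1)]
    parts.append(f"Now complete this task: {task}")
    return "\n\n".join(parts)


demos = [
    Demo(task=f"Rename sheet {n}",
         steps=[AnnotatedStep("spreadsheet open",
                              "the tab must be selected first",
                              "click(tab)")])
    for n in range(1, 7)
]
prompt = build_prompt("Rename the first sheet to 'Q3'", demos)
print(prompt)
```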

“This dual role (training and in-context guidance) enables flexible integration with both open-source models and general-purpose agents,” the researchers write.

W&L in action

To test the usefulness of W&L, the researchers ran a series of experiments with closed and open source models on the OSWorld benchmark, which evaluates agents in real desktop and operating system environments across different tasks, including productivity, programming and design.

For fine-tuning, they used their corpus of 53,000 trajectories to train two open source models: UI-TARS-1.5, a strong, open source vision-language-action model designed specifically for computer use, and Qwen 2.5-VL, an open-weight multimodal LLM. 

For in-context learning tests, they applied W&L examples to general-purpose multimodal models such as Gemini 2.5 Flash, OpenAI o3 and Claude Sonnet 4. 

W&L resulted in improvements on OSWorld in all model categories, including up to 3 points for ICL on general-purpose models and up to 11 points for fine-tuned open-source models.

More importantly, these benefits were achieved without any manual annotation, “demonstrating that web-scale human workflows can serve as a practical and scalable foundation for advancing CUAs towards real-world deployment,” the researchers write.

This could have important implications for real-world applications, enabling enterprises to turn their existing corpora of videos and conference recordings into training data for CUAs. It also makes it easier to generate new training trajectories. All you will need to do is record videos of performing different tasks and have them annotated by an IDM. And with frontier models constantly improving and becoming cheaper, you can expect to get more from your existing data as the field continues to progress.


New 'Markovian Thinking' technique unlocks a path to million-token AI reasoning

21 October 2025 at 03:00

Researchers at Mila have proposed a new technique that makes large language models (LLMs) vastly more efficient when performing complex reasoning. Called Markovian Thinking, the approach allows LLMs to engage in lengthy reasoning without incurring the prohibitive computational costs that currently limit such tasks.

The team’s implementation, an environment named Delethink, structures the reasoning chain into fixed-size chunks, breaking the scaling problem that plagues very long LLM responses. Initial estimates show that for a 1.5B parameter model, this method can cut the costs of training by more than two-thirds compared to standard approaches.

The quadratic curse of long-chain reasoning

For an LLM to solve a complex problem, it often needs to generate a long series of intermediate “thinking” tokens, often referred to as chain-of-thought (CoT). In recent years, researchers have found that using reinforcement learning (RL) to train models to produce longer CoTs (sometimes referred to as LongCoT) has significantly improved their reasoning capabilities.

However, the standard method for this has a critical flaw: The AI's "state" (the prompt plus all the reasoning tokens it has generated thus far in its processing) grows with every new reasoning token. For modern transformer-based models, this means the computational cost explodes quadratically as the reasoning chain gets longer, making it prohibitively expensive to train models for very complex tasks.

Most current attempts to manage this cost focus on limiting how much thinking the model does, implicitly preferring shorter solutions or terminating the process early. While these methods offer some relief, the Mila researchers point out that they still operate within the LongCoT framework and are thus fundamentally bound by its quadratic nature.

Instead of trying to control the computational growth, Mila created an RL environment that avoids the quadratic problem altogether. As co-author Amirhossein Kazemnejad explained, the goal is to enable capabilities like multi-week reasoning and scientific discovery. "That regime (and the RL needed to enable such capabilities) is not supported by the current LongCoT paradigm, because of quadratic compute cost," he said.

Thinking in chunks with Delethink

The researchers' solution is a paradigm they call the "Markovian Thinker," where the model reasons while keeping the size of its reasoning context window constant. The core idea is to change the RL setup to separate "how long the model thinks" from "how much context it must process." If done correctly, a Markovian Thinker turns the quadratic growth problem into linear compute and fixed memory requirements for LLM reasoning.

The researchers put this paradigm into practice through Delethink, which forces the model to reason in a sequence of fixed-size chunks, such as 8,000 tokens at a time. Within each chunk, the model reasons as it normally would, using the classic attention mechanism. But when it reaches the limit of the chunk, the environment resets the context, creating a new prompt that includes the original query plus a short "carryover" from the previous chunk. For example, the carryover could be the last few tokens of the previous chunk of CoT or a summary of the most important results.
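The chunk-and-reset loop can be sketched as follows. This is a toy illustration under stated assumptions, not the Delethink code: `delethink_trace`, its parameters and the `<done>` stop marker are hypothetical, tokens are plain strings, and the carryover is simply the last few tokens of the previous chunk (one of the two options mentioned above).

```python
from typing import Callable, List, Tuple


def delethink_trace(query: List[str],
                    generate: Callable[[List[str], int], List[str]],
                    chunk_size: int = 8000,
                    carryover_size: int = 100,
                    max_chunks: int = 3,
                    stop_marker: str = "<done>") -> Tuple[List[str], int]:
    """Reason in fixed-size chunks, resetting the context between chunks."""
    carry: List[str] = []
    trace: List[str] = []
    peak_context = 0
    for _ in range(max_chunks):
        prompt = query + carry  # fresh context: original query plus carryover only
        new_tokens = generate(prompt, chunk_size)
        trace.extend(new_tokens)
        peak_context = max(peak_context, len(prompt) + len(new_tokens))
        if stop_marker in new_tokens:
            break
        carry = new_tokens[-carryover_size:] if carryover_size else []
    return trace, peak_context


def toy_generate(prompt: List[str], budget: int) -> List[str]:
    # Hypothetical model: keeps counting from the last number in its context.
    start = int(prompt[-1]) if prompt[-1].isdigit() else 0
    out = []
    for n in range(start + 1, start + 1 + budget):
        out.append(str(n))
        if n == 10:
            out.append("<done>")
            break
    return out


trace, peak = delethink_trace(["count"], toy_generate,
                              chunk_size=4, carryover_size=1, max_chunks=5)
print(trace, peak)
```

In this toy run the full trace grows to 11 tokens, but the context the model sees at any moment never exceeds the query, the carryover and one chunk.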

This rearrangement of the problem forces the model to learn how to embed a summary of its progress, or a "textual Markovian state," into this carryover to continue its reasoning in the next chunk. This addresses the common concern of whether the model can remember important details from earlier steps. 

According to Kazemnejad, the model learns what to remember. "With training... the model is forced to learn to carry forward the task-critical state," he explained. He added a crucial clarification for practical use: The original input prompt is not modified, including the documents or contextual data added to it. "Our approach is aimed at the reasoning phase and does not modify the prompt," he said.

Delethink in action

To test their approach, the researchers trained R1-Distill-1.5B with Delethink on a dataset of competition-level math problems, then evaluated it against several benchmarks. The model was trained to reason for up to 24,000 tokens but with fixed 8,000-token chunks.

The researchers compared this to models trained with the standard LongCoT-RL method. Their findings indicate that the model trained with Delethink could reason up to 24,000 tokens, and matched or surpassed a LongCoT model trained with the same 24,000-token budget on math benchmarks. On other tasks like coding and PhD-level questions, Delethink also matched or slightly beat its LongCoT counterpart. “Overall, these results indicate that Delethink uses its thinking tokens as effectively as LongCoT-RL with reduced compute,” the researchers write.

The benefits become even more pronounced when scaling beyond the training budget. While models trained with LongCoT quickly plateaued at their training limits, the Delethink-trained model continued to improve its performance. For instance, some math problems were only solved after the model reasoned for up to 140,000 tokens, far beyond its 24,000-token training budget. This linear compute advantage is substantial for enterprise applications. The researchers estimate that training a model to an average thinking length of 96,000 tokens would require 27 H100-GPU-months with LongCoT, versus just 7 with Delethink.
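A back-of-the-envelope count shows why resetting the context changes the scaling. The sketch below counts only the token pairs that attention has to process and ignores the prompt, the carryover and all non-attention compute, so it is not meant to reproduce the researchers' GPU-month estimate. The function names are illustrative.

```python
def longcot_pairs(n_tokens: int) -> int:
    """Token pairs attended to when every new token sees the whole chain so far."""
    return n_tokens * (n_tokens + 1) // 2


def chunked_pairs(n_tokens: int, chunk_size: int) -> int:
    """Same count when the context is reset every chunk_size tokens."""
    full_chunks, remainder = divmod(n_tokens, chunk_size)
    per_chunk = chunk_size * (chunk_size + 1) // 2
    return full_chunks * per_chunk + remainder * (remainder + 1) // 2


for n in (24_000, 48_000, 96_000):
    ratio = longcot_pairs(n) / chunked_pairs(n, 8_000)
    print(f"{n:>6} tokens: LongCoT needs about {ratio:.1f}x the attention pairs")

# Saving implied by the reported 27 vs. 7 H100-GPU-months.
reported_saving = 1 - 7 / 27
print(f"Reported GPU-month saving: {reported_saving:.0%}")
```

With 8,000-token chunks, doubling the chain length doubles the chunked count but roughly quadruples the LongCoT count, so the gap widens from about 3x at 24,000 tokens to about 12x at 96,000. The reported 27 versus 7 GPU-months works out to a saving of about 74%, consistent with the "more than two-thirds" figure cited earlier.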

This efficiency extends directly to inference, the primary operational cost for most enterprises. "Models trained in Markovian Thinking use the same inference style (delethink-tracing) during test time, which provides the same advantages of linear compute and constant memory after training," said Kazemnejad. He offered a practical example: An AI agent could "debug a large codebase and think for a long time... which of course reduces the cost significantly compared to the conventional LongCoT approach."

Interestingly, the researchers found that off-the-shelf reasoning models, even without any specific training, already exhibit some ability to think in a Markovian way. This finding has immediate practical implications for developers. "In practice, this means that — without Delethink-RL— these models can already run a delethink-tracing wrapper and perform competitively with LongCoT on our benchmarked tasks," Kazemnejad said.

Their experiments with larger models such as GPT-OSS 120B showed robust performance with Delethink across a range of complex tasks. This latent ability provides a strong starting point for RL training, helping explain why the method is so effective. “Together, these results suggest that Delethink is compatible and scales with state-of-the-art models,” the researchers conclude.

The success of Markovian Thinking shows it may be possible for "next-generation reasoning models to think for millions of tokens," the researchers note. This opens the door to fundamentally new AI capabilities, moving beyond current constraints.

"Markovian Thinking... opens the path for models that can 'think' for very long horizons, which we view as a necessary step toward eventual scientific discovery," Kazemnejad said. "Our approach removes a key bottleneck and can allow training for much longer horizon tasks, which enables next-gen capabilities."

Codev lets enterprises avoid vibe coding hangovers with a team of agents that generate and document code

17 October 2025 at 21:45

For many software developers using generative AI, vibe coding is a double-edged sword.

The process delivers rapid prototypes but often leaves a trail of brittle, undocumented code that creates significant technical debt.

A new open-source platform, Codev, addresses this by proposing a fundamental shift: treating the natural language conversation with an AI as part of the actual source code.

Codev is based on SP(IDE)R, a framework designed to turn vibe-coding conversations into structured, versioned, and auditable assets that become part of the code repository.

What is Codev?

At its core, Codev is a methodology that treats natural language context as an integral part of the development lifecycle rather than a disposable artifact, as it is in vanilla vibe coding.

According to co-founder Waleed Kadous, the goal is to invert the typical engineering workflow.

"A key principle of Codev is that documents like the specification are the actual code of the system," he told VentureBeat. "It's almost like natural language is compiled down into Typescript by our agents."

This approach avoids the common pitfall where documentation is created after the fact, if at all.

Its flagship protocol, SP(IDE)R, provides a lightweight but formal structure for building software. The process begins with Specify, where a human and multiple AI agents collaborate to turn a high-level request into concrete acceptance criteria. Next, in the Plan stage, an AI proposes a phased implementation, which is again reviewed.

For each phase, the AI enters an IDE loop: it Implements the code, Defends it against bugs and regression with comprehensive tests, and Evaluates the result against the specification. The final step is Review, where the team documents lessons learned to update and improve the SP(IDE)R protocol itself for future projects.
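The control flow of the protocol can be sketched as a simple loop. This is only an illustration of the stage ordering described above, not Codev's implementation: `run_spider` and its callbacks are hypothetical, the human review gates are omitted, and retrying a failed phase is an assumption the article does not confirm.

```python
from typing import Callable, List


def run_spider(phases: List[str],
               implement: Callable[[str], str],
               defend: Callable[[str], bool],
               evaluate: Callable[[str, str], bool],
               max_attempts: int = 3) -> List[str]:
    """Walk the stages in order: Specify, Plan, an IDE loop per phase, Review."""
    log = ["specify", "plan"]
    for phase in phases:
        for attempt in range(1, max_attempts + 1):
            code = implement(phase)                     # Implement
            if defend(code) and evaluate(phase, code):  # Defend, then Evaluate
                log.append(f"{phase}: accepted on attempt {attempt}")
                break
        else:
            # Assumption: a phase that keeps failing stops the run.
            raise RuntimeError(f"phase {phase!r} failed after {max_attempts} attempts")
    log.append("review")
    return log


log = run_spider(
    phases=["api", "ui"],
    implement=lambda phase: f"code for {phase}",
    defend=lambda code: True,              # stand-in for running the test suite
    evaluate=lambda phase, code: phase in code,  # stand-in for a spec check
)
print(log)
```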

The framework’s key differentiator is its use of multiple agents and explicit human review at different stages. Kadous notes that each agent brings unique strengths to the review process.

"Gemini is extremely good at catching security issues," he said, citing a critical cross-site scripting (XSS) flaw and another bug that "would have shared an OpenAI API key with the client, which could cost thousands of dollars."

Meanwhile, "GPT-5 is very good at understanding how to simplify a design." This structured review, with a human providing final approval at each stage, prevents the kind of runaway automation that leads to flawed code.

The platform’s AI-native philosophy extends to its installation. There is no complex installer; instead, a user instructs their AI agent to apply the Codev GitHub repository to set up the project. The developers "dogfooded" their framework, using Codev to build Codev.

“The key point here is that natural language is executable now, with the agent being the interpreter,” Kadous said. “This is great because it means it's not a ‘blind’ integration of Codev, the agent gets to choose the best way to integrate it and can intelligently make decisions.”

Codev case study

To test the framework's effectiveness, its creators ran a direct comparison between vanilla vibe-coding and Codev. They gave Claude Opus 4.1 a request to build a modern web-based todo manager. The first attempt used a conversational, vibe-coding approach. The result was a plausible-looking demo. However, an automated analysis conducted by three independent AI agents found that it had implemented 0% of the required functionality, contained no tests, and lacked a database or API.

The second attempt used the same AI model and prompt but applied the SP(IDE)R protocol. This time, the AI produced a production-ready application with 32 source files, 100% of the specified functionality, five test suites, a SQLite database, and a complete RESTful API.

Throughout this process, the human developers reported they never directly edited a single line of source code. While this was a single experiment, Kadous estimates the impact is substantial.

"Subjectively, it feels like I'm about three times as productive with Codev as without," he says. The quality also speaks for itself. "I used LLMs as a judge, and one of them described the output like what a well-oiled engineering team would produce. That was exactly what I was aiming for."

While the process is powerful, it redefines the developer's role from a hands-on coder to a system architect and reviewer. According to Kadous, the initial spec and plan stages can each take between 45 minutes and two hours of focused collaboration.

This is in contrast to the impression given by many vibe-coding platforms, where a single prompt and a few minutes of processing give you a fully functional and scalable application.

"All of the value I add is in the background knowledge I apply to the specs and plans," he explains. He emphasizes that the framework is designed to augment, not replace, experienced talent. "The people who will do the best... are senior engineers and above because they know the pitfalls... It just takes the senior engineer you already have and makes them much more productive."

A future of human and AI collaboration

Frameworks like Codev signal a shift where the primary creative act of software development moves from writing code to crafting precise, machine-readable specifications and plans. For enterprise teams, this means AI-generated code can become auditable, maintainable, and reliable. By capturing the entire development conversation in version control and enforcing it with CI, the process turns ephemeral chats into durable engineering assets.

Codev proposes a future where the AI acts not as a chaotic assistant, but as a disciplined collaborator in a structured, human-led workflow.

However, Kadous acknowledges this shift creates new challenges for the workforce. "Senior engineers that reject AI outright will be outpaced by senior engineers who embrace it," he predicts. He also expresses concern for junior developers who may not get the chance "to build their architectural chops," a skill that becomes even more critical when guiding AI.

This highlights a central challenge for the industry: ensuring that as AI elevates top performers, it also creates pathways to develop the next generation of talent.

ACE prevents context collapse with ‘evolving playbooks’ for self-improving AI agents

16 October 2025 at 16:00

A new framework from Stanford University and SambaNova addresses a critical challenge in building robust AI agents: context engineering. Called Agentic Context Engineering (ACE), the framework automatically populates and modifies the context window of large language model (LLM) applications by treating it as an “evolving playbook” that creates and refines strategies as the agent gains experience in its environment.

ACE is designed to overcome key limitations of other context-engineering frameworks, preventing the model’s context from degrading as it accumulates more information. Experiments show that ACE works for both optimizing system prompts and managing an agent's memory, outperforming other methods while also being significantly more efficient.

The challenge of context engineering

Advanced AI applications that use LLMs largely rely on "context adaptation," or context engineering, to guide their behavior. Instead of the costly process of retraining or fine-tuning the model, developers use the LLM’s in-context learning abilities to guide its behavior by modifying the input prompts with specific instructions, reasoning steps, or domain-specific knowledge. This additional information is usually obtained as the agent interacts with its environment and gathers new data and experience. The key goal of context engineering is to organize this new information in a way that improves the model’s performance and avoids confusing it. This approach is becoming a central paradigm for building capable, scalable, and self-improving AI systems.

Context engineering has several advantages for enterprise applications. Contexts are interpretable for both users and developers, can be updated with new knowledge at runtime, and can be shared across different models. Context engineering also benefits from ongoing hardware and software advances, such as the growing context windows of LLMs and efficient inference techniques like prompt and context caching.

There are various automated context-engineering techniques, but most of them face two key limitations. The first is a “brevity bias,” where prompt optimization methods tend to favor concise, generic instructions over comprehensive, detailed ones. This can undermine performance in complex domains.

The second, more severe issue is "context collapse." When an LLM is tasked with repeatedly rewriting its entire accumulated context, it can suffer from a kind of digital amnesia.

“What we call ‘context collapse’ happens when an AI tries to rewrite or compress everything it has learned into a single new version of its prompt or memory,” the researchers said in written comments to VentureBeat. “Over time, that rewriting process erases important details—like overwriting a document so many times that key notes disappear. In customer-facing systems, this could mean a support agent suddenly losing awareness of past interactions... causing erratic or inconsistent behavior.”

The researchers argue that “contexts should function not as concise summaries, but as comprehensive, evolving playbooks—detailed, inclusive, and rich with domain insights.” This approach leans into the strength of modern LLMs, which can effectively distill relevance from long and detailed contexts.

How Agentic Context Engineering (ACE) works

ACE is a framework for comprehensive context adaptation designed for both offline tasks, like system prompt optimization, and online scenarios, such as real-time memory updates for agents. Rather than compressing information, ACE treats the context like a dynamic playbook that gathers and organizes strategies over time.

The framework divides the labor across three specialized roles: a Generator, a Reflector, and a Curator. This modular design is inspired by “how humans learn—experimenting, reflecting, and consolidating—while avoiding the bottleneck of overloading a single model with all responsibilities,” according to the paper.

The workflow starts with the Generator, which produces reasoning paths for input prompts, highlighting both effective strategies and common mistakes. The Reflector then analyzes these paths to extract key lessons. Finally, the Curator synthesizes these lessons into compact updates and merges them into the existing playbook.

To prevent context collapse and brevity bias, ACE incorporates two key design principles. First, it uses incremental updates. The context is represented as a collection of structured, itemized bullets instead of a single block of text. This allows ACE to make granular changes and retrieve the most relevant information without rewriting the entire context.

Second, ACE uses a “grow-and-refine” mechanism. As new experiences are gathered, new bullets are appended to the playbook and existing ones are updated. A de-duplication step regularly removes redundant entries, ensuring the context remains comprehensive yet relevant and compact over time.
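The two principles can be illustrated with a small data structure. This is a sketch under assumptions, not the ACE codebase: the `Playbook` and `Bullet` classes are hypothetical, the `helpful` counter is an invented piece of metadata, and de-duplication here is a plain normalized text match standing in for whatever comparison the framework actually uses.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Bullet:
    id: int
    text: str
    helpful: int = 0  # assumption: a simple usefulness counter


@dataclass
class Playbook:
    bullets: Dict[int, Bullet] = field(default_factory=dict)
    next_id: int = 1

    def apply_delta(self, new_texts: List[str], helpful_ids: List[int]) -> None:
        """Incremental update: touch only the named bullets and append new ones."""
        for bullet_id in helpful_ids:
            if bullet_id in self.bullets:
                self.bullets[bullet_id].helpful += 1
        for text in new_texts:
            self.bullets[self.next_id] = Bullet(self.next_id, text)
            self.next_id += 1

    def refine(self) -> None:
        """Drop later duplicates, folding their counters into the first copy."""
        seen: Dict[str, Bullet] = {}
        for bullet_id in sorted(self.bullets):
            bullet = self.bullets[bullet_id]
            key = " ".join(bullet.text.lower().split())
            if key in seen:
                seen[key].helpful += bullet.helpful
                del self.bullets[bullet_id]
            else:
                seen[key] = bullet


playbook = Playbook()
playbook.apply_delta(["Check API pagination", "Validate dates"], helpful_ids=[])
playbook.apply_delta(["check  api pagination"], helpful_ids=[1])
playbook.refine()
print([b.text for b in playbook.bullets.values()])
```

Because each update only appends or edits individual bullets, nothing that was learned earlier is rewritten wholesale, which is the property the incremental design relies on to avoid context collapse.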

ACE in action

The researchers evaluated ACE on two types of tasks that benefit from evolving context: agent benchmarks requiring multi-turn reasoning and tool use, and domain-specific financial analysis benchmarks demanding specialized knowledge. For high-stakes industries like finance, the benefits extend beyond pure performance. As the researchers said, the framework is “far more transparent: a compliance officer can literally read what the AI learned, since it’s stored in human-readable text rather than hidden in billions of parameters.”

The results showed that ACE consistently outperformed strong baselines such as GEPA and classic in-context learning, achieving average performance gains of 10.6% on agent tasks and 8.6% on domain-specific benchmarks in both offline and online settings.

Critically, ACE can build effective contexts by analyzing the feedback from its actions and environment instead of requiring manually labeled data. The researchers note that this ability is a "key ingredient for self-improving LLMs and agents." On the public AppWorld benchmark, designed to evaluate agentic systems, an agent using ACE with a smaller open-source model (DeepSeek-V3.1) matched the performance of the top-ranked, GPT-4.1-powered agent on average and surpassed it on the more difficult test set.

The takeaway for businesses is significant. “This means companies don’t have to depend on massive proprietary models to stay competitive,” the research team said. “They can deploy local models, protect sensitive data, and still get top-tier results by continuously refining context instead of retraining weights.”

Beyond accuracy, ACE proved to be highly efficient. It adapts to new tasks with an average 86.9% lower latency than existing methods and requires fewer steps and tokens. The researchers point out that this efficiency demonstrates that “scalable self-improvement can be achieved with both higher accuracy and lower overhead.”

For enterprises concerned about inference costs, the researchers point out that the longer contexts produced by ACE do not translate to proportionally higher costs. Modern serving infrastructures are increasingly optimized for long-context workloads with techniques like KV cache reuse, compression, and offloading, which amortize the cost of handling extensive context.

Ultimately, ACE points toward a future where AI systems are dynamic and continuously improving. "Today, only AI engineers can update models, but context engineering opens the door for domain experts—lawyers, analysts, doctors—to directly shape what the AI knows by editing its contextual playbook," the researchers said. This also makes governance more practical. "Selective unlearning becomes much more tractable: if a piece of information is outdated or legally sensitive, it can simply be removed or replaced in the context, without retraining the model.”
