
Today — 29 October 2025

Fortanix and NVIDIA partner on AI security platform for highly regulated industries

Data security company Fortanix Inc. announced a new joint solution with NVIDIA: a turnkey platform that allows organizations to deploy agentic AI within their own data centers or sovereign environments, backed by NVIDIA’s "confidential computing" GPUs.

“Our goal is to make AI trustworthy by securing every layer—from the chip to the model to the data," said Fortanix CEO and co-founder Anand Kashyap, in a recent video call interview with VentureBeat. "Confidential computing gives you that end-to-end trust so you can confidently use AI with sensitive or regulated information.”

The solution arrives at a pivotal moment for industries such as healthcare, finance, and government — sectors eager to embrace AI but constrained by strict privacy and regulatory requirements.

Fortanix’s new platform, powered by NVIDIA Confidential Computing, enables enterprises to build and run AI systems on sensitive data without sacrificing security or control.

“Enterprises in finance, healthcare and government want to harness the power of AI, but compromising on trust, compliance, or control creates insurmountable risk,” said Anuj Jaiswal, chief product officer at Fortanix, in a press release. “We’re giving enterprises a sovereign, on-prem platform for AI agents—one that proves what’s running, protects what matters, and gets them to production faster.”

Secure AI, Verified from Chip to Model

At the heart of the Fortanix–NVIDIA collaboration is a confidential AI pipeline that ensures data, models, and workflows remain protected throughout their lifecycle.

The system uses a combination of Fortanix Data Security Manager (DSM) and Fortanix Confidential Computing Manager (CCM), integrated directly into NVIDIA’s GPU architecture.

“You can think of DSM as the vault that holds your keys, and CCM as the gatekeeper that verifies who’s allowed to use them," Kashyap said. "DSM enforces policy, CCM enforces trust.”

DSM serves as a FIPS 140-2 Level 3 hardware security module that manages encryption keys and enforces strict access controls.

CCM, introduced alongside this announcement, verifies the trustworthiness of AI workloads and infrastructure using composite attestation—a process that validates both CPUs and GPUs before allowing access to sensitive data.

Only when a workload is verified by CCM does DSM release the cryptographic keys necessary to decrypt and process data.

“The Confidential Computing Manager checks that the workload, the CPU, and the GPU are running in a trusted state," explained Kashyap. "It issues a certificate that DSM validates before releasing the key. That ensures the right workload is running on the right hardware before any sensitive data is decrypted.”

This “attestation-gated” model creates what Fortanix describes as a provable chain of trust extending from the hardware chip to the application layer.
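
To make the sequence concrete, here is a minimal, hypothetical sketch of that attestation-gated flow; the helper functions are illustrative stubs, not Fortanix's actual CCM or DSM APIs.

```python
# Hypothetical sketch of an attestation-gated key release, mirroring the
# flow described above (attest first, release keys second). These helpers
# are illustrative stubs, not Fortanix's CCM/DSM APIs.

def ccm_attest(workload: str) -> dict:
    """Stub: pretend CCM verified the workload, CPU, and GPU."""
    return {"workload": workload, "cpu_trusted": True, "gpu_trusted": True}

def dsm_release_key(key_id: str, attestation: dict) -> bytes:
    """Stub: DSM releases the key only if the attestation checks out."""
    if not (attestation["cpu_trusted"] and attestation["gpu_trusted"]):
        raise PermissionError("attestation failed; key withheld")
    return b"example-data-encryption-key"

def run_confidential_inference(workload_image: str) -> None:
    cert = ccm_attest(workload_image)                    # 1. verify the stack
    key = dsm_release_key("patient-records-dek", cert)   # 2. gate the key on attestation
    print(f"key released for {workload_image}: {key!r}")  # 3. only now decrypt and infer

run_confidential_inference("llm-inference:v1")
```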

It’s an approach aimed squarely at industries where confidentiality and compliance are non-negotiable.

From Pilot to Production—Without the Security Trade-Off

According to Kashyap, the partnership marks a step forward from traditional data encryption and key management toward securing entire AI workloads.

Kashyap explained that enterprises can deploy the Fortanix–NVIDIA solution incrementally, using a lift-and-shift model to migrate existing AI workloads into a confidential environment.

“We offer two form factors: SaaS with zero footprint, and self-managed. Self-managed can be a virtual appliance or a 1U physical FIPS 140-2 Level 3 appliance," he noted. "The smallest deployment is a three-node cluster, with larger clusters of 20–30 nodes or more.”

Customers already running AI models—whether open-source or proprietary—can move them onto NVIDIA’s Hopper or Blackwell GPU architectures with minimal reconfiguration.

For organizations building out new AI infrastructure, Fortanix’s Armet AI platform provides orchestration, observability, and built-in guardrails to speed up time to production.

“The result is that enterprises can move from pilot projects to trusted, production-ready AI in days rather than months,” Jaiswal said.

Compliance by Design

Compliance remains a key driver behind the new platform’s design. Fortanix’s DSM enforces role-based access control, detailed audit logging, and secure key custody—elements that help enterprises demonstrate compliance with stringent data protection regulations.

These controls are essential for regulated industries such as banking, healthcare, and government contracting.

The company emphasizes that the solution is built for both confidentiality and sovereignty.

For governments and enterprises that must retain local control over their AI environments, the system supports fully on-premises or air-gapped deployment options.

Fortanix and NVIDIA have jointly integrated these technologies into the NVIDIA AI Factory Reference Design for Government, a blueprint for building secure national or enterprise-level AI systems.

Future-Proofed for a Post-Quantum Era

In addition to current encryption standards such as AES, Fortanix supports post-quantum cryptography (PQC) within its DSM product.

As global research in quantum computing accelerates, PQC algorithms are expected to become a critical component of secure computing frameworks.

“We don’t invent cryptography; we implement what’s proven,” Kashyap said. “But we also make sure our customers are ready for the post-quantum era when it arrives.”

Real-World Flexibility

While the platform is designed for on-premises and sovereign use cases, Kashyap emphasized that it can also run in major cloud environments that already support confidential computing.

Enterprises operating across multiple regions can maintain consistent key management and encryption controls, either through centralized key hosting or replicated key clusters.

This flexibility allows organizations to shift AI workloads between data centers or cloud regions—whether for performance optimization, redundancy, or regulatory reasons—without losing control over their sensitive information.

Fortanix converts usage into “credits,” which correspond to the number of AI instances running within a factory environment. The structure allows enterprises to scale incrementally as their AI projects grow.

Fortanix will showcase the joint platform at NVIDIA GTC, held October 27–29, 2025, at the Walter E. Washington Convention Center in Washington, D.C. Visitors can find Fortanix at booth I-7 for live demonstrations and discussions on securing AI workloads in highly regulated environments.

About Fortanix

Fortanix Inc. was founded in 2016 in Mountain View, California, by Anand Kashyap and Ambuj Kumar, both former Intel engineers who worked on trusted execution and encryption technologies. The company was created to commercialize confidential computing—then an emerging concept—by extending the security of encrypted data beyond storage and transmission to data in active use, according to TechCrunch and the company’s own About page.

Kashyap, who previously served as a senior security architect at Intel and VMware, and Kumar, a former engineering lead at Intel, drew on years of work in trusted hardware and virtualization systems. Their shared insight into the gap between research-grade cryptography and enterprise adoption drove them to found Fortanix, according to Forbes and Crunchbase.

Today, Fortanix is recognized as a global leader in confidential computing and data security, offering solutions that protect data across its lifecycle—at rest, in transit, and in use.

Fortanix serves enterprises and governments worldwide with deployments ranging from cloud-native services to high-security, air-gapped systems.

"Historically we provided encryption and key-management capabilities," Kashyap said. "Now we’re going further to secure the workload itself—specifically AI—so an entire AI pipeline can run protected with confidential computing. That applies whether the AI runs in the cloud or in a sovereign environment handling sensitive or regulated data.

Yesterday — 28 October 2025

MiniMax-M2 is the new king of open source LLMs (especially for agentic tool calling)

Watch out, DeepSeek and Qwen! There's a new king of open source large language models (LLMs), especially when it comes to something enterprises are increasingly valuing: agentic tool use — that is, the ability to go off and use other software capabilities like web search or bespoke applications — without much human guidance.

That model is none other than MiniMax-M2, the latest LLM from the Chinese startup of the same name. In a big win for enterprises globally, the model is available under a permissive, enterprise-friendly MIT License, meaning developers are free to take, deploy, retrain, and use it however they see fit, including for commercial purposes. It can be found on Hugging Face, GitHub, and ModelScope, as well as through MiniMax's API. It also supports the OpenAI and Anthropic API standards, making it easy for customers of those proprietary AI providers to switch their workloads over to MiniMax's API if they want.
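
Because of that API compatibility, existing OpenAI-style client code can often be repointed at M2 by swapping the base URL and model name. A minimal sketch, assuming an OpenAI-compatible endpoint and model identifier that should be verified against MiniMax's API documentation:

```python
# Minimal sketch: calling MiniMax-M2 through an OpenAI-compatible client.
# The base_url and model id below are assumptions; confirm both against
# MiniMax's API documentation before use.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MINIMAX_API_KEY",
    base_url="https://api.minimax.io/v1",  # assumed OpenAI-compatible endpoint
)

resp = client.chat.completions.create(
    model="MiniMax-M2",  # assumed model identifier
    messages=[
        {"role": "system", "content": "You are a coding agent."},
        {"role": "user", "content": "Which files would you inspect to debug a failing CI job?"},
    ],
)
print(resp.choices[0].message.content)
```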

According to independent evaluations by Artificial Analysis, a third-party generative AI model benchmarking and research organization, M2 now ranks first among all open-weight systems worldwide on the Intelligence Index—a composite measure of reasoning, coding, and task-execution performance.

In agentic benchmarks that measure how well a model can plan, execute, and use external tools—skills that power coding assistants and autonomous agents—MiniMax’s own reported results, following the Artificial Analysis methodology, show τ²-Bench 77.2, BrowseComp 44.0, and FinSearchComp-global 65.5.

These scores place it at or near the level of top proprietary systems like GPT-5 (thinking) and Claude Sonnet 4.5, making MiniMax-M2 the highest-performing open model yet released for real-world agentic and tool-calling tasks.

What It Means For Enterprises and the AI Race

Built around an efficient Mixture-of-Experts (MoE) architecture, MiniMax-M2 delivers high-end capability for agentic and developer workflows while remaining practical for enterprise deployment.

For technical decision-makers, the release marks an important turning point for open models in business settings. MiniMax-M2 combines frontier-level reasoning with a manageable activation footprint—just 10 billion active parameters out of 230 billion total.

This design enables enterprises to operate advanced reasoning and automation workloads on fewer GPUs, achieving near-state-of-the-art results without the infrastructure demands or licensing costs associated with proprietary frontier systems.

Artificial Analysis’ data show that MiniMax-M2’s strengths go beyond raw intelligence scores. The model leads or closely trails top proprietary systems such as GPT-5 (thinking) and Claude Sonnet 4.5 across benchmarks for end-to-end coding, reasoning, and agentic tool use.

Its performance in τ²-Bench, SWE-Bench, and BrowseComp indicates particular advantages for organizations that depend on AI systems capable of planning, executing, and verifying complex workflows—key functions for agentic and developer tools inside enterprise environments.

As LLM engineer Pierre-Carl Langlais aka Alexander Doria posted on X: "MiniMax [is] making a case for mastering the technology end-to-end to get actual agentic automation."

Compact Design, Scalable Performance

MiniMax-M2’s technical architecture is a sparse Mixture-of-Experts model with 230 billion total parameters and 10 billion active per inference.

This configuration significantly reduces latency and compute requirements while maintaining broad general intelligence.

The design allows for responsive agent loops, such as compile–run–test or browse–retrieve–cite cycles, that execute faster and more predictably than they would on denser models.

For enterprise technology teams, this means easier scaling, lower cloud costs, and reduced deployment friction. According to Artificial Analysis, the model can be served efficiently on as few as four NVIDIA H100 GPUs at FP8 precision, a setup well within reach for mid-size organizations or departmental AI clusters.
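
For teams sizing such a deployment, here is a rough sketch of offline serving with vLLM on a four-GPU node; the Hugging Face repo id and FP8 flag are assumptions to check against MiniMax's own deployment guide (230 billion parameters at 8-bit precision is roughly 230 GB of weights, which is why four 80 GB H100s are plausible).

```python
# Rough sketch of offline serving with vLLM across four GPUs.
# The Hugging Face repo id and FP8 quantization flag are assumptions;
# check MiniMax's deployment guide for the supported configuration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="MiniMaxAI/MiniMax-M2",   # assumed repo id
    tensor_parallel_size=4,         # spread ~230 GB of 8-bit weights over 4 GPUs
    quantization="fp8",             # assumed; some checkpoints ship pre-quantized
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Draft a migration plan from cron jobs to Airflow."], params)
print(outputs[0].outputs[0].text)
```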

Benchmark Leadership Across Agentic and Coding Workflows

MiniMax’s benchmark suite highlights strong real-world performance across developer and agent environments. The figure below, released with the model, compares MiniMax-M2 (in red) with several leading proprietary and open models, including GPT-5 (thinking), Claude Sonnet 4.5, Gemini 2.5 Pro, and DeepSeek-V3.2.

MiniMax-M2 achieves top or near-top performance in many categories:

  • SWE-bench Verified: 69.4 — close to GPT-5’s 74.9

  • ArtifactsBench: 66.8 — above Claude Sonnet 4.5 and DeepSeek-V3.2

  • τ²-Bench: 77.2 — approaching GPT-5’s 80.1

  • GAIA (text only): 75.7 — surpassing DeepSeek-V3.2

  • BrowseComp: 44.0 — notably stronger than other open models

  • FinSearchComp-global: 65.5 — best among tested open-weight systems

These results show MiniMax-M2’s capability in executing complex, tool-augmented tasks across multiple languages and environments—skills increasingly relevant for automated support, R&D, and data analysis inside enterprises.

Strong Showing in Artificial Analysis’ Intelligence Index

The model’s overall intelligence profile is confirmed in the latest Artificial Analysis Intelligence Index v3.0, which aggregates performance across ten reasoning benchmarks including MMLU-Pro, GPQA Diamond, AIME 2025, IFBench, and τ²-Bench Telecom.

MiniMax-M2 scored 61 points, ranking as the highest open-weight model globally and following closely behind GPT-5 (high) and Grok 4.

Artificial Analysis highlighted the model’s balance between technical accuracy, reasoning depth, and applied intelligence across domains. For enterprise users, this consistency indicates a reliable model foundation suitable for integration into software engineering, customer support, or knowledge automation systems.

Designed for Developers and Agentic Systems

MiniMax engineered M2 for end-to-end developer workflows, enabling multi-file code edits, automated testing, and regression repair directly within integrated development environments or CI/CD pipelines.

The model also excels in agentic planning—handling tasks that combine web search, command execution, and API calls while maintaining reasoning traceability.

These capabilities make MiniMax-M2 especially valuable for enterprises exploring autonomous developer agents, data analysis assistants, or AI-augmented operational tools.

Benchmarks such as Terminal-Bench and BrowseComp demonstrate the model’s ability to adapt to incomplete data and recover gracefully from intermediate errors, improving reliability in production settings.

Interleaved Thinking and Structured Tool Use

A distinctive aspect of MiniMax-M2 is its interleaved thinking format, which maintains visible reasoning traces between <think>...</think> tags.

This enables the model to plan and verify steps across multiple exchanges, a critical feature for agentic reasoning. MiniMax advises retaining these segments when passing conversation history to preserve the model’s logic and continuity.
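
In practice, that means sending the reasoning segments back verbatim on later turns rather than stripping them. A small sketch of the pattern using a standard OpenAI-style message list (the reasoning text itself is invented for illustration):

```python
# Keep the <think>...</think> segments in the conversation history so the
# model can build on its earlier reasoning; stripping them breaks continuity.
# The reasoning text here is invented purely for illustration.
history = [
    {"role": "user", "content": "Plan the refactor, then do step 1."},
    {
        "role": "assistant",
        "content": (
            "<think>Step 1: inventory call sites. Step 2: extract an interface. "
            "Step 3: migrate the tests.</think>\n"
            "Starting with step 1: here are the call sites I found..."
        ),
    },
    # Next turn: pass the history back unmodified, plus the new user message.
    {"role": "user", "content": "Good. Continue with step 2."},
]
```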

The company also provides a Tool Calling Guide on Hugging Face, detailing how developers can connect external tools and APIs via structured XML-style calls.

This functionality allows MiniMax-M2 to serve as the reasoning core for larger agent frameworks, executing dynamic tasks such as search, retrieval, and computation through external functions.

Open Source Access and Enterprise Deployment Options

Enterprises can access the model through the MiniMax Open Platform API and MiniMax Agent interface (a web chat similar to ChatGPT), both currently free for a limited time.

MiniMax recommends SGLang and vLLM for efficient serving, each offering day-one support for the model’s unique interleaved reasoning and tool-calling structure.

Deployment guides and parameter configurations are available through MiniMax’s documentation.

Cost Efficiency and Token Economics

As Artificial Analysis noted, MiniMax’s API pricing is set at $0.30 per million input tokens and $1.20 per million output tokens, among the most competitive in the open-model ecosystem.

Pricing comparison, USD per 1M tokens (input / output), per each provider’s published list:

  • MiniMax-M2 (MiniMax): $0.30 input / $1.20 output. Listed under “Chat Completion v2” for M2.

  • GPT-5 (OpenAI): $1.25 input / $10.00 output. Flagship model pricing on OpenAI’s API pricing page.

  • GPT-5 mini (OpenAI): $0.25 input / $2.00 output. Cheaper tier for well-defined tasks.

  • Claude Sonnet 4.5 (Anthropic): $3.00 input / $15.00 output. Anthropic’s current per-MTok list; long-context (>200K input) uses a premium tier.

  • Gemini 2.5 Flash Preview (Google): $0.30 input / $2.50 output. Prices include “thinking tokens”; the page also lists cheaper Flash-Lite and 2.0 tiers.

  • Grok-4 Fast, reasoning (xAI): $0.20 input / $0.50 output. “Fast” tier; xAI also lists Grok-4 at $3 / $15.

  • DeepSeek-V3.2 chat (DeepSeek): $0.28 input / $0.42 output. Cache-hit input is $0.028; the pricing table shows per-model details.

  • qwen-flash, Model Studio (Qwen / Alibaba): from $0.022 input / from $0.216 output. Tiered by input size (≤128K, ≤256K, ≤1M tokens); listed as “Input price / Output price per 1M.”

  • Command R+ Aug 2024 (Cohere): $2.50 input / $10.00 output. First-party pricing page also lists Command R ($0.50 / $1.50) and others.

Notes & caveats (for readers):

  • Prices are USD per million tokens and can change; check linked pages for updates and region/endpoint nuances (e.g., Anthropic long-context >200K input, Google Live API variants, cache discounts).

  • Vendors may bill extra for server-side tools (web search, code execution) or offer batch/context-cache discounts.

While the model produces longer, more explicit reasoning traces, its sparse activation and optimized compute design help maintain a favorable cost-performance balance—an advantage for teams deploying interactive agents or high-volume automation systems.
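
As a back-of-envelope illustration using the list prices above (a hypothetical workload that ignores caching, batching, and the extra output tokens that longer reasoning traces generate):

```python
# Back-of-envelope cost comparison for a hypothetical monthly workload of
# 10M input tokens and 2M output tokens, at the list prices above.
prices_per_million = {            # (input, output) in USD per 1M tokens
    "MiniMax-M2":        (0.30, 1.20),
    "GPT-5":             (1.25, 10.00),
    "Claude Sonnet 4.5": (3.00, 15.00),
}

input_m, output_m = 10, 2  # millions of tokens per month (hypothetical)
for model, (p_in, p_out) in prices_per_million.items():
    cost = input_m * p_in + output_m * p_out
    print(f"{model}: ${cost:.2f}/month")
# MiniMax-M2: $5.40, GPT-5: $32.50, Claude Sonnet 4.5: $60.00
```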

Background on MiniMax — an Emerging Chinese Powerhouse

MiniMax has quickly become one of the most closely watched names in China’s fast-rising AI sector.

Backed by Alibaba and Tencent, the company moved from relative obscurity to international recognition within a year—first through breakthroughs in AI video generation, then through a series of open-weight large language models (LLMs) aimed squarely at developers and enterprises.

The company first captured global attention in late 2024 with its AI video generation tool, “video-01,” which demonstrated the ability to create dynamic, cinematic scenes in seconds. VentureBeat described how the model’s launch sparked widespread interest after online creators began sharing lifelike, AI-generated footage—most memorably, a viral clip of a Star Wars lightsaber duel that drew millions of views in under two days.

CEO Yan Junjie emphasized that the system outperformed leading Western tools in generating human movement and expression, an area where video AIs often struggle. The product, later commercialized through MiniMax’s Hailuo platform, showcased the startup’s technical confidence and creative reach, helping to establish China as a serious contender in generative video technology.

By early 2025, MiniMax had turned its attention to long-context language modeling, unveiling the MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01. These open-weight models introduced an unprecedented 4-million-token context window, doubling the reach of Google’s Gemini 1.5 Pro and dwarfing OpenAI’s GPT-4o by more than twentyfold.

The company continued its rapid cadence with the MiniMax-M1 release in June 2025, a model focused on long-context reasoning and reinforcement learning efficiency. M1 extended context capacity to 1 million tokens and introduced a hybrid Mixture-of-Experts design trained using a custom reinforcement-learning algorithm known as CISPO. Remarkably, VentureBeat reported that MiniMax trained M1 at a total cost of about $534,700, roughly one-tenth of DeepSeek’s R1 and far below the multimillion-dollar budgets typical for frontier-scale models.

For enterprises and technical teams, MiniMax’s trajectory signals the arrival of a new generation of cost-efficient, open-weight models designed for real-world deployment. Its open licensing—ranging from Apache 2.0 to MIT—gives businesses freedom to customize, self-host, and fine-tune without vendor lock-in or compliance restrictions.

Features such as structured function calling, long-context retention, and high-efficiency attention architectures directly address the needs of engineering groups managing multi-step reasoning systems and data-intensive pipelines.

As MiniMax continues to expand its lineup, the company has emerged as a key global innovator in open-weight AI, combining ambitious research with pragmatic engineering.

Open-Weight Leadership and Industry Context

The release of MiniMax-M2 reinforces the growing leadership of Chinese AI research groups in open-weight model development.

Following earlier contributions from DeepSeek, Alibaba’s Qwen series, and Moonshot AI, MiniMax’s entry continues the trend toward open, efficient systems designed for real-world use.

Artificial Analysis observed that MiniMax-M2 exemplifies a broader shift in focus toward agentic capability and reinforcement-learning refinement, prioritizing controllable reasoning and real utility over raw model size.

For enterprises, this means access to a state-of-the-art open model that can be audited, fine-tuned, and deployed internally with full transparency.

By pairing strong benchmark performance with open licensing and efficient scaling, MiniMaxAI positions MiniMax-M2 as a practical foundation for intelligent systems that think, act, and assist with traceable logic—making it one of the most enterprise-ready open AI models available today.

Before yesterday

Mistral launches its own AI Studio for quick development with its European open source, proprietary models

The next big trend in AI providers appears to be "studio" environments on the web that allow users to spin up agents and AI applications within minutes.

Case in point, today the well-funded French AI startup Mistral launched its own Mistral AI Studio, a new production platform designed to help enterprises build, observe, and operationalize AI applications at scale atop Mistral's growing family of proprietary and open source large language models (LLMs) and multimodal models.

It's an evolution of its legacy API and AI building platform, "la Plateforme," initially launched in late 2023, and that brand name is being retired for now.

The move comes just days after U.S. rival Google updated its AI Studio, also launched in late 2023, to be easier for non-developers to use and build and deploy apps with natural language, aka "vibe coding."

But while Google's update appears to target novices who want to tinker around, Mistral appears more fully focused on building an easy-to-use enterprise AI app development and launchpad, which may require some technical knowledge or familiarity with LLMs, but far less than that of a seasoned developer.

In other words, those outside the tech team at your enterprise could potentially use this to build and test simple apps, tools, and workflows — all powered by E.U.-native AI models operating on E.U.-based infrastructure.

That may be a welcome change for companies concerned about the political situation in the U.S., or who have large operations in Europe and prefer to give their business to homegrown alternatives to U.S. and Chinese tech giants.

In addition, Mistral AI Studio appears to offer an easier way for users to customize and fine-tune AI models for use at specific tasks.

Branded as “The Production AI Platform,” Mistral's AI Studio extends its internal infrastructure, bringing enterprise-grade observability, orchestration, and governance to teams running AI in production.

The platform unifies tools for building, evaluating, and deploying AI systems, while giving enterprises flexible control over where and how their models run — in the cloud, on-premise, or self-hosted.

Mistral says AI Studio brings the same production discipline that supports its own large-scale systems to external customers, closing the gap between AI prototyping and reliable deployment. It's available here with developer documentation here.

Extensive Model Catalog

AI Studio’s model selector reveals one of the platform’s strongest features: a comprehensive and versioned catalog of Mistral models spanning open-weight, code, multimodal, and transcription domains.

Available models include the following, though note that even for the open source ones, users will still be running inference on Mistral's infrastructure and paying Mistral for access through its API.

  • Mistral Large: Proprietary. Mistral’s top-tier closed-weight commercial model (available via API and AI Studio only).

  • Mistral Medium: Proprietary. Mid-range performance, offered via hosted API; no public weights released.

  • Mistral Small: Proprietary. Lightweight API model; no open weights.

  • Mistral Tiny: Proprietary. Compact hosted model optimized for latency; closed-weight.

  • Open Mistral 7B: Open. Fully open-weight model (Apache 2.0 license), downloadable on Hugging Face.

  • Open Mixtral 8×7B: Open. Released under Apache 2.0; mixture-of-experts architecture.

  • Open Mixtral 8×22B: Open. Larger open-weight MoE model; Apache 2.0 license.

  • Magistral Medium: Proprietary. Not publicly released; appears only in the AI Studio catalog.

  • Magistral Small: Proprietary. Same; internal or enterprise-only release.

  • Devstral Medium: Proprietary / Legacy. Older internal development model; no open weights.

  • Devstral Small: Proprietary / Legacy. Same; used for internal evaluation.

  • Ministral 8B: Open. Open-weight model available under Apache 2.0; basis for the Mistral Moderation model.

  • Pixtral 12B: Proprietary. Multimodal (text-image) model; closed-weight, API-only.

  • Pixtral Large: Proprietary. Larger multimodal variant; closed-weight.

  • Voxtral Small: Proprietary. Speech-to-text/audio model; closed-weight.

  • Voxtral Mini: Proprietary. Lightweight version; closed-weight.

  • Voxtral Mini Transcribe 2507: Proprietary. Specialized transcription model; API-only.

  • Codestral 2501: Open. Open-weight code-generation model (Apache 2.0 license, available on Hugging Face).

  • Mistral OCR 2503: Proprietary. Document-text extraction model; closed-weight.

This extensive model lineup confirms that AI Studio is both model-rich and model-agnostic, allowing enterprises to test and deploy different configurations according to task complexity, cost targets, or compute environments.

Bridging the Prototype-to-Production Divide

Mistral’s release highlights a common problem in enterprise AI adoption: while organizations are building more prototypes than ever before, few transition into dependable, observable systems.

Many teams lack the infrastructure to track model versions, explain regressions, or ensure compliance as models evolve.

AI Studio aims to solve that. The platform provides what Mistral calls the “production fabric” for AI — a unified environment that connects creation, observability, and governance into a single operational loop. Its architecture is organized around three core pillars: Observability, Agent Runtime, and AI Registry.

1. Observability

AI Studio’s Observability layer provides transparency into AI system behavior. Teams can filter and inspect traffic through the Explorer, identify regressions, and build datasets directly from real-world usage. Judges let teams define evaluation logic and score outputs at scale, while Campaigns and Datasets automatically transform production interactions into curated evaluation sets.

Metrics and dashboards quantify performance improvements, while lineage tracking connects model outcomes to the exact prompt and dataset versions that produced them. Mistral describes Observability as a way to move AI improvement from intuition to measurement.

2. Agent Runtime and RAG support

The Agent Runtime serves as the execution backbone of AI Studio. Each agent — whether it’s handling a single task or orchestrating a complex multi-step business process — runs within a stateful, fault-tolerant runtime built on Temporal. This architecture ensures reproducibility across long-running or retry-prone tasks and automatically captures execution graphs for auditing and sharing.

Every run emits telemetry and evaluation data that feed directly into the Observability layer. The runtime supports hybrid, dedicated, and self-hosted deployments, allowing enterprises to run AI close to their existing systems while maintaining durability and control.

While Mistral's blog post doesn’t explicitly reference retrieval-augmented generation (RAG), Mistral AI Studio clearly supports it under the hood.

Screenshots of the interface show built-in workflows such as RAGWorkflow, RetrievalWorkflow, and IngestionWorkflow, revealing that document ingestion, retrieval, and augmentation are first-class capabilities within the Agent Runtime system.

These components allow enterprises to pair Mistral’s language models with their own proprietary or internal data sources, enabling contextualized responses grounded in up-to-date information.

By integrating RAG directly into its orchestration and observability stack—but leaving it out of marketing language—Mistral signals that it views retrieval not as a buzzword but as a production primitive: measurable, governed, and auditable like any other AI process.
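
For readers unfamiliar with the pattern, here is a generic retrieve-then-generate sketch of what such workflows do conceptually; it is not Mistral's RAGWorkflow API, and the retriever is a placeholder for whatever document store an enterprise plugs in.

```python
# Generic retrieve-then-generate sketch of the RAG pattern described above.
# This is NOT Mistral's RAGWorkflow API; the retriever is a placeholder for
# an enterprise's own vector store or search index.
from mistralai import Mistral

client = Mistral(api_key="YOUR_API_KEY")

def retrieve(query: str) -> list[str]:
    # Placeholder: swap in a real vector-store or keyword search here.
    return ["Policy excerpt: refunds are processed within 14 days."]

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    resp = client.chat.complete(
        model="mistral-medium-latest",  # assumed model alias
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": query},
        ],
    )
    return resp.choices[0].message.content

print(answer("How long do refunds take?"))
```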

3. AI Registry

The AI Registry is the system of record for all AI assets — models, datasets, judges, tools, and workflows.

It manages lineage, access control, and versioning, enforcing promotion gates and audit trails before deployments.

Integrated directly with the Runtime and Observability layers, the Registry provides a unified governance view so teams can trace any output back to its source components.

Interface and User Experience

The screenshots of Mistral AI Studio show a clean, developer-oriented interface organized around a left-hand navigation bar and a central Playground environment.

  • The Home dashboard features three core action areas — Create, Observe, and Improve — guiding users through model building, monitoring, and fine-tuning workflows.

  • Under Create, users can open the Playground to test prompts or build agents.

  • Observe and Improve link to observability and evaluation modules, some labeled “coming soon,” suggesting staged rollout.

  • The left navigation also includes quick access to API Keys, Batches, Evaluate, Fine-tune, Files, and Documentation, positioning Studio as a full workspace for both development and operations.

Inside the Playground, users can select a model, customize parameters such as temperature and max tokens, and enable integrated tools that extend model capabilities.

Users can try the Playground for free, but will need to sign up with their phone number to receive an access code.

Integrated Tools and Capabilities

Mistral AI Studio includes a growing suite of built-in tools that can be toggled for any session:

  • Code Interpreter — lets the model execute Python code directly within the environment, useful for data analysis, chart generation, or computational reasoning tasks.

  • Image Generation — enables the model to generate images based on user prompts.

  • Web Search — allows real-time information retrieval from the web to supplement model responses.

  • Premium News — provides access to verified news sources via integrated provider partnerships, offering fact-checked context for information retrieval.

These tools can be combined with Mistral’s function calling capabilities, letting models call APIs or external functions defined by developers. This means a single agent could, for example, search the web, retrieve verified financial data, run calculations in Python, and generate a chart — all within the same workflow.
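
A hedged sketch of what wiring a custom function into that flow typically looks like with Mistral's Python SDK; the revenue-lookup tool is a made-up example, and the SDK details should be confirmed against Mistral's current documentation.

```python
# Sketch of Mistral function calling with one custom tool. The revenue
# lookup is a made-up example; verify SDK details against Mistral's docs.
import json
from mistralai import Mistral

client = Mistral(api_key="YOUR_API_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_quarterly_revenue",
        "description": "Return revenue in USD for a quarter such as '2025-Q3'.",
        "parameters": {
            "type": "object",
            "properties": {"quarter": {"type": "string"}},
            "required": ["quarter"],
        },
    },
}]

resp = client.chat.complete(
    model="mistral-large-latest",
    messages=[{"role": "user", "content": "How did we do in 2025-Q3?"}],
    tools=tools,
    tool_choice="auto",
)

call = resp.choices[0].message.tool_calls[0]  # the model requests the tool
args = call.function.arguments
args = json.loads(args) if isinstance(args, str) else args
print(call.function.name, args)
# The application would now run the real function and return its result as a
# "tool" role message so the model can compose the final answer.
```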

Beyond Text: Multimodal and Programmatic AI

With the inclusion of Code Interpreter and Image Generation, Mistral AI Studio moves beyond traditional text-based LLM workflows.

Developers can use the platform to create agents that write and execute code, analyze uploaded files, or generate visual content — all directly within the same conversational environment.

The Web Search and Premium News integrations also extend the model’s reach beyond static data, enabling real-time information retrieval with verified sources. This combination positions AI Studio not just as a playground for experimentation but as a full-stack environment for production AI systems capable of reasoning, coding, and multimodal output.

Deployment Flexibility

Mistral supports four main deployment models for AI Studio users:

  1. Hosted Access via AI Studio — pay-as-you-go APIs for Mistral’s latest models, managed through Studio workspaces.

  2. Third-Party Cloud Integration — availability through major cloud providers.

  3. Self-Deployment — open-weight models can be deployed on private infrastructure under the Apache 2.0 license, using frameworks such as TensorRT-LLM, vLLM, llama.cpp, or Ollama.

  4. Enterprise-Supported Self-Deployment — adds official support for both open and proprietary models, including security and compliance configuration assistance.

These options allow enterprises to balance operational control with convenience, running AI wherever their data and governance requirements demand.
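
For the self-deployment path, a minimal local-inference sketch with one of the Apache 2.0 open-weight models via Hugging Face Transformers (the repo id is the publicly listed Mistral 7B Instruct checkpoint; swap in whichever open model and serving stack your team standardizes on):

```python
# Minimal local-inference sketch for an Apache 2.0 open-weight Mistral model.
# Requires a GPU with sufficient memory plus recent transformers and
# accelerate packages; the repo id is the public Mistral 7B Instruct checkpoint.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.3",
    device_map="auto",
)

messages = [{"role": "user", "content": "Summarize our incident-response runbook in three bullets."}]
out = generator(messages, max_new_tokens=200)
print(out[0]["generated_text"][-1]["content"])  # the assistant's reply
```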

Safety, Guardrailing, and Moderation

AI Studio builds safety features directly into its stack. Enterprises can apply guardrails and moderation filters at both the model and API levels.

The Mistral Moderation model, based on Ministral 8B (24.10), classifies text across policy categories such as sexual content, hate and discrimination, violence, self-harm, and PII. A separate system prompt guardrail can be activated to enforce responsible AI behavior, instructing models to “assist with care, respect, and truth” while avoiding harmful or unethical content.

Developers can also employ self-reflection prompts, a technique where the model itself classifies outputs against enterprise-defined safety categories like physical harm or fraud. This layered approach gives organizations flexibility in enforcing safety policies while retaining creative or operational control.
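
A sketch of that self-reflection technique in its simplest form; the category list and prompt wording are illustrative choices, not a Mistral guardrail API.

```python
# Sketch of the "self-reflection" guardrail: ask a lightweight model to
# classify a candidate answer against enterprise-defined safety categories.
# The categories and prompt wording are illustrative, not a Mistral API.
from mistralai import Mistral

client = Mistral(api_key="YOUR_API_KEY")

CATEGORIES = ["physical_harm", "fraud", "pii_exposure", "none"]

def self_reflect(candidate_answer: str) -> str:
    prompt = (
        "Classify the following assistant response into exactly one category "
        f"from {CATEGORIES}. Reply with the category name only.\n\n"
        f"Response:\n{candidate_answer}"
    )
    resp = client.chat.complete(
        model="mistral-small-latest",  # assumed alias for a small judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

label = self_reflect("Here is how to export the customer payment table...")
if label != "none":
    print(f"Flagged by self-reflection check: {label}")
```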

From Experimentation to Dependable Operations

Mistral positions AI Studio as the next phase in enterprise AI maturity. As large language models become more capable and accessible, the company argues, the differentiator will no longer be model performance but the ability to operate AI reliably, safely, and measurably.

AI Studio is designed to support that shift. By integrating evaluation, telemetry, version control, and governance into one workspace, it enables teams to manage AI with the same discipline as modern software systems — tracking every change, measuring every improvement, and maintaining full ownership of data and outcomes.

In the company’s words, “This is how AI moves from experimentation to dependable operations — secure, observable, and under your control.”

Mistral AI Studio is available starting October 24, 2025, as part of a private beta program. Enterprises can sign up on Mistral’s website to access the platform, explore its model catalog, and test observability, runtime, and governance features before general release.

OpenAI launches company knowledge in ChatGPT, letting you access your firm's data from Google Drive, Slack, GitHub

Is the Google Search for internal enterprise knowledge finally here...but from OpenAI? It certainly seems that way.

Today, OpenAI launched company knowledge in ChatGPT, a major new capability for subscribers to ChatGPT's paid Business, Enterprise, and Edu plans that lets them call up their company's data directly from third-party workplace apps, including Slack, SharePoint, Google Drive, Gmail, GitHub, and HubSpot, and combine it in the answers ChatGPT gives them.

As OpenAI's CEO of Applications Fidji Simo put it in a post on the social network X: "it brings all the context from your apps (Slack, Google Drive, GitHub, etc) together in ChatGPT so you can get answers that are specific to your business."

Intriguingly, OpenAI's blog post on the feature states that it is "powered by a version of GPT‑5 that’s trained to look across multiple sources to give more comprehensive and accurate answers," which sounds to me like a new fine-tuned version of the model family the company released back in August, though there are no additional details on how it was trained or its size, techniques, etc.

OpenAI tells VentureBeat it's a version of GPT-5 that specifically powers company knowledge in ChatGPT Business, Enterprise, and Edu.

Nonetheless, company knowledge in ChatGPT is rolling out globally and is designed to make ChatGPT a central point of access for verified organizational information, supported by secure integrations and enterprise-grade compliance controls, and give employees way faster access to their company's information while working.

Now, instead of toggling over to Slack to find the assignment you were given and instructions, or tabbing over to Google Drive and opening up specific files to find the names and numbers you need to call, ChatGPT can deliver all that type of information directly into your chat session — if your company enables the proper connections.

As OpenAI Chief Operating Officer Brad Lightcap wrote in a post on the social network X: "company knowledge has changed how i use chatgpt at work more than anything we have built so far - let us know what you think!"

It builds upon the third-party app connectors unveiled back in August 2025, though those were only for individual users on the ChatGPT Plus plans.

Connecting ChatGPT to Workplace Systems

Enterprise teams often face the challenge of fragmented data across various internal tools—email, chat, file storage, project management, and customer platforms.

Company knowledge bridges those silos by enabling ChatGPT to connect to approved systems like Slack, SharePoint, Google Drive, Gmail, GitHub, and HubSpot, as well as other supported apps, through enterprise-managed connectors.

Each response generated with company knowledge includes citations and direct links to the original sources, allowing teams to verify where specific details originated. This transparency helps organizations maintain data trustworthiness while increasing productivity.

The sidebar shows a live view of the sources being examined and what it is getting from them. When it’s done, you’ll see exactly the sources used, along with the specific snippets it drew from. You can then click on any citation to open the original source for more details.

Built for Enterprise Control and Security

Company knowledge was designed from the ground up for enterprise governance and compliance. It respects existing permissions within connected apps — ChatGPT can only access what a user is already authorized to view— and never trains on company data by default.

Security features include industry-standard encryption, support for SSO and SCIM for account provisioning, and IP allowlisting to restrict access to approved corporate networks.

Enterprise administrators can also define role-based access control (RBAC) policies and manage permissions at a group or department level.

OpenAI’s Enterprise Compliance API provides a full audit trail, allowing administrators to review conversation logs for reporting and regulatory purposes.

This capability helps enterprises meet internal governance standards and industry-specific requirements such as SOC 2 and ISO 27001 compliance.

Admin Configuration and Connector Management

For enterprise deployment, administrators must enable company knowledge and its connectors within the ChatGPT workspace. Once connectors are active, users can authenticate their own accounts for each work app they need to access.

In Enterprise and Edu plans, connectors are off by default and require explicit admin approval before employees can use them. Admins can selectively enable connectors, manage access by role, and require SSO-based authentication for enhanced control.

Business plan users, by contrast, have connectors enabled automatically if available in their workspace. Admins can still oversee which connectors are approved, ensuring alignment with internal IT and data policies.

Company knowledge becomes available to any user with at least one active connector, and admins can configure group-level permissions for different teams — such as restricting GitHub access to engineering while enabling Google Drive or HubSpot for marketing and sales.

Organizations that turn on the feature can also turn it off just as easily. Once a connector is disconnected, ChatGPT no longer has access to that data.

How Company Knowledge Works in Practice

Activating company knowledge is straightforward. Users can start a new or existing conversation in ChatGPT and select “Company knowledge” under the message composer or from the tools menu. It must be turned on manually for each new conversation or chat session, even by the same user.

After authenticating their connected apps, they can ask questions as usual—such as “Summarize this account’s latest feedback and risks” or “Compile a Q4 performance summary from project trackers.”

ChatGPT searches across the connected tools, retrieves relevant context, and produces an answer with full citations and source links.

The system can combine data across apps — for instance, blending Slack updates, Google Docs notes, and HubSpot CRM records — to create an integrated view of a project, client, or initiative.

When company knowledge is not selected, ChatGPT may still use connectors in a limited capacity as part of the default experience, but responses will not include detailed citations or multi-source synthesis.

Advanced Use Cases for Enterprise Teams

For development and operations leaders, company knowledge can act as a centralized intelligence layer that surfaces real-time updates and dependencies across complex workflows. ChatGPT can, for example, summarize open GitHub pull requests, highlight unresolved Linear tickets, and cross-reference Slack engineering discussions—all in a single output.

Technical teams can also use it for incident retrospectives or release planning by pulling relevant information from issue trackers, logs, and meeting notes. Procurement or finance leaders can use it to consolidate purchase requests or budget updates across shared drives and internal communications.

Because the model can reference structured and unstructured data simultaneously, it supports wide-ranging scenarios—from compliance documentation reviews to cross-departmental performance summaries.

Privacy, Data Residency, and Compliance

Enterprise data protection is a central design element of company knowledge. ChatGPT processes data in line with OpenAI’s enterprise-grade security model, ensuring that no connected app data leaves the secure boundary of the organization’s authorized environment.

Data residency policies vary by connector. Certain integrations, such as Slack, support region-specific data storage, while others—like Google Drive and SharePoint—are available for U.S.-based customers with or without at-rest data residency. Organizations with regional compliance obligations can review connector-specific security documentation for details.

No geo restrictions apply to company knowledge, making it suitable for multinational organizations operating across multiple jurisdictions.

Limitations and Future Enhancements

At present, users must manually enable company knowledge in each new ChatGPT conversation.

OpenAI is developing a unified interface that will automatically integrate company knowledge with other ChatGPT tools—such as browsing and chart generation—so that users won’t need to toggle between modes.

When enabled, company knowledge temporarily disables web browsing and visual output generation, though users can switch modes within the same conversation to re-enable those features.

OpenAI also continues to expand the network of supported tools. Recent updates have added connectors for Asana, GitLab Issues, and ClickUp, and OpenAI plans to support future MCP (Model Context Protocol) connectors to enable custom, developer-built integrations.

Availability and Getting Started

Company knowledge is now available to all ChatGPT Business, Enterprise, and Edu users. Organizations can begin by enabling the feature under the ChatGPT message composer and connecting approved work apps.

For enterprise rollouts, OpenAI recommends a phased deployment: first enabling core connectors (such as Google Drive and Slack), configuring RBAC and SSO, then expanding to specialized systems once data access policies are verified.

Procurement and security leaders evaluating the feature should note that company knowledge is covered under existing ChatGPT Enterprise terms and uses the same encryption, compliance, and service-level guarantees.

With company knowledge, OpenAI aims to make ChatGPT not just a conversational assistant but an intelligent interface to enterprise data—delivering secure, context-aware insights that help technical and business leaders act with confidence.

Microsoft Copilot gets 12 big updates for fall, including new AI assistant character Mico

Microsoft today held a live announcement event online for its Copilot AI digital assistant, with Mustafa Suleyman, CEO of Microsoft's AI division, and other presenters unveiling a new generation of features that deepen integration across Windows, Edge, and Microsoft 365, positioning the platform as a practical assistant for people during work and off-time, while allowing them to preserve control and safety of their data.

The new Copilot 2025 Fall Update also ups the ante on the capabilities and accessibility of Microsoft's generative AI assistance, so businesses relying on Microsoft products, and those seeking to offer complementary or competing products, would do well to review the changes.

Suleyman emphasized that the updates reflect a shift from hype to usefulness. “Technology should work in service of people, not the other way around,” he said. “Copilot is not just a product—it’s a promise that AI can be helpful, supportive, and deeply personal.”

Intriguingly, the announcement also sought to shine a greater spotlight on Microsoft's own homegrown AI models, as opposed to those of its partner OpenAI, in which Microsoft has invested heavily and whose models previously powered the entire Copilot experience. Instead, Suleyman wrote today in a blog post:

“At the foundation of it all is our strategy to put the best models to work for you – both those we build and those we don’t. Over the past few months, we have released in-house models like MAI-Voice-1, MAI-1-Preview and MAI-Vision-1, and are rapidly iterating.”

12 Features That Redefine Copilot

The Fall Release consolidates Copilot’s identity around twelve key capabilities—each with potential to streamline organizational knowledge work, development, or support operations.

  1. Groups – Shared Copilot sessions where up to 32 participants can brainstorm, co-author, or plan simultaneously. For distributed teams, it effectively merges a meeting chat, task board, and generative workspace. Copilot maintains context, summarizes decisions, and tracks open actions.

  2. Imagine – A collaborative hub for creating and remixing AI-generated content. In an enterprise setting, Imagine enables rapid prototyping of visuals, marketing drafts, or training materials.

  3. Mico – A new character identity for Copilot that provides expressive, emotional feedback in the form of a cute, amorphous blob. Echoing Microsoft’s historic character interfaces like Clippy (Office 97) or Cortana (2014), Mico serves as a unifying UX layer across modalities.

  4. Real Talk – A conversational mode that adapts to a user’s communication style and offers calibrated pushback — ending the sycophancy that some users have complained about with other AI models such as prior versions of OpenAI's ChatGPT. For professionals, it allows Socratic problem-solving rather than passive answer generation, making Copilot more credible in technical collaboration.

  5. Memory & Personalization – Long-term contextual memory that lets Copilot recall key details—training plans, dates, goals—at the user’s direction.

  6. Connectors – Integration with OneDrive, Outlook, Gmail, Google Drive, and Google Calendar for natural-language search across accounts.

  7. Proactive Actions (Preview) – Context-based prompts and next-step suggestions derived from recent activity.

  8. Copilot for Health – Health information grounded in credible medical sources such as Harvard Health, with tools allowing users to locate and compare doctors.

  9. Learn Live – A Socratic, voice-driven tutoring experience using questions, visuals, and whiteboards.

  10. Copilot Mode in Edge – Converts Microsoft Edge into an “AI browser” that summarizes, compares, and executes web actions by voice.

  11. Copilot on Windows – Deep integration across Windows 11 PCs with “Hey Copilot” activation, Copilot Vision guidance, and quick access to files and apps.

  12. Copilot Pages and Copilot Search – A collaborative file canvas plus a unified search experience combining AI-generated, cited answers with standard web results.

The Fall Release is immediately available in the United States, with rollout to the UK, Canada, and other markets in progress.

Some functions—such as Groups, Journeys, and Copilot for Health—remain U.S.-only for now. Proactive Actions requires a Microsoft 365 Personal, Family, or Premium subscription.

Together these updates illustrate Microsoft’s pivot from static productivity suites to contextual AI infrastructure, with the Copilot brand acting as the connective tissue across user roles.

From Clippy to Mico: The Return of a Guided Interface

One of the most notable introductions is Mico, a small animated companion that is available within Copilot’s voice-enabled experiences, including the Copilot app on Windows, iOS, and Android, as well as in Study Mode and other conversational contexts. It serves as an optional visual companion that appears during interactive or voice-based sessions, rather than across all Copilot interfaces.

Mico listens, reacts with expressions, and changes color to reflect tone and emotion — bringing a visual warmth to an AI assistant experience that has traditionally been text-heavy.

Mico’s design recalls earlier eras of Microsoft’s history with character-based assistants. In the mid-1990s, Microsoft experimented with Microsoft Bob (1995), a software interface that used cartoon characters like a dog named Rover to guide users through everyday computing tasks. While innovative for its time, Bob was discontinued after a year due to performance and usability issues.

A few years later came Clippy, the Office Assistant introduced in Microsoft Office 97. Officially known as “Clippit,” the animated paperclip would pop up to offer help and tips within Word and other Office applications. Clippy became widely recognized—sometimes humorously so—for interrupting users with unsolicited advice. Microsoft retired Clippy from Office in 2001, though the character remains a nostalgic symbol of early AI-driven assistance.

More recently, Cortana, launched in 2014 as Microsoft’s digital voice assistant for Windows and mobile devices, aimed to provide natural-language interaction similar to Apple’s Siri or Amazon’s Alexa. Despite positive early reception, Cortana’s role diminished as Microsoft refocused on enterprise productivity and AI integration. The service was officially discontinued on Windows in 2023.

Mico, by contrast, represents a modern reimagining of that tradition—combining the personality of early assistants with the intelligence and adaptability of contemporary AI models. Where Clippy offered canned responses, Mico listens, learns, and reflects a user’s mood in real time. The goal, as Suleyman framed it, is to create an AI that feels “helpful, supportive, and deeply personal.”

Groups Are Microsoft's Version of Claude and ChatGPT Projects

During Microsoft’s launch video, product researcher Wendy described Groups as a transformative shift: “You can finally bring in other people directly to the conversation that you’re having with Copilot,” she said. “It’s the only place you can do this.”

Up to 32 users can join a shared Copilot session, brainstorming, editing, or planning together while the AI manages logistics such as summarizing discussion threads, tallying votes, and splitting tasks. Participants can enter or exit sessions using a link, maintaining full visibility into ongoing work.

Instead of a single user prompting an AI and later sharing results, Groups lets teams prompt and iterate together in one unified conversation.

In some ways, it's an answer to Anthropic’s Claude Projects and OpenAI’s ChatGPT Projects, both launched within the last year as tools to centralize team workspaces and shared AI context.

Where Claude and ChatGPT Projects allow users to aggregate files, prompts, and conversations into a single container, Groups extends that model into real-time, multi-participant collaboration.

Unlike Anthropic’s and OpenAI’s implementations, Groups is deeply embedded within Microsoft’s productivity environment.

Like other Copilot experiences connected to Outlook and OneDrive, Groups operates within Microsoft’s enterprise identity framework, governed by Microsoft 365 and Entra ID (formerly Azure Active Directory) authentication and consent models.

This means conversations, shared artifacts, and generated summaries are governed under the same compliance policies that already protect Outlook, Teams, and SharePoint data.

Hours after the unveiling, OpenAI hit back against its own investor in the escalating AI competition between the "frenemies" by expanding its Shared Projects feature beyond its current Enterprise, Team, and Edu subscriber availability to users of its free, Plus, and Pro subscription tiers.

Operational Impact for AI and Data Teams

Memory & Personalization and Connectors effectively extend a lightweight orchestration layer across Microsoft’s ecosystem.

Instead of building separate context-stores or retrieval APIs, teams can leverage Copilot’s secure integration with OneDrive or SharePoint as a governed data backbone.

A presenter explained that Copilot’s memory “naturally picks up on important details and remembers them long after you’ve had the conversation,” yet remains editable.

For data engineers, Copilot Search and Connectors reduce friction in data discovery across multiple systems. Natural-language retrieval from internal and cloud repositories may lower the cost of knowledge management initiatives by consolidating search endpoints.

For security directors, Copilot’s explicit consent requirements and on/off toggles in Edge and Windows help maintain data residency standards. The company reiterated during the livestream that Copilot “acts only with user permission and within organizational privacy controls.”

Copilot Mode in Edge: The AI Browser for Research and Automation

Copilot Mode in Edge stands out for offering AI-assisted information workflows.

The browser can now parse open tabs, summarize differences, and perform transactional steps.

“Historically, browsers have been static—just endless clicking and tab-hopping,” said a presenter during Microsoft’s livestream. “We asked not how browsers should work, but how people work.”

In practice, an analyst could prompt Edge to compare supplier documentation, extract structured data, and auto-fill procurement forms—all with consistent citation.

Voice-only navigation enables accessibility and multitasking, while Journeys, a companion feature, organizes browsing sessions into storylines for later review.

Copilot on Windows: The Operating System as an AI Surface

In Windows 11, Copilot now functions as an embedded assistant. With the wake-word “Hey Copilot,” users can initiate context-aware commands without leaving the desktop—drafting documentation, troubleshooting configuration issues, or summarizing system logs.

A presenter described it as a “super assistant plugged into all your files and applications.” For enterprises standardizing on Windows 11, this positions Copilot as a native productivity layer rather than an add-on, reducing training friction and promoting secure, on-device reasoning.

Copilot Vision, now in early deployment, adds visual comprehension. IT staff can capture a screen region and ask Copilot to interpret error messages, explain configuration options, or generate support tickets automatically.

Combined with Copilot Pages, which supports up to twenty concurrent file uploads, this enables more efficient cross-document analysis for audits, RFPs, or code reviews.

Leveraging MAI Models for Multimodal Workflows

At the foundation of these capabilities are Microsoft’s proprietary MAI-Voice-1, MAI-1 Preview, and MAI-Vision-1 models—trained in-house to handle text, voice, and visual inputs cohesively.

For engineering teams managing LLM orchestration, this architecture introduces several potential efficiencies:

  • Unified multimodal reasoning – Reduces the need for separate ASR (speech-to-text) and image-parsing services.

  • Fine-tuning continuity – Because Microsoft owns the model stack, updates propagate across Copilot experiences without re-integration.

  • Predictable latency and governance – In-house hosting under Azure compliance frameworks simplifies security certification for regulated industries.

A presenter described the new stack as “the foundation for immersive, creative, and dynamic experiences that still respect enterprise boundaries.”

A Strategic Pivot Toward Contextual AI

For years, Microsoft positioned Copilot primarily as a productivity companion. With the Fall 2025 release, it crosses into operational AI infrastructure—a set of extensible services for reasoning over data and processes.

Suleyman described this evolution succinctly: “Judge an AI by how much it elevates human potential, not just by its own smarts.” For CIOs and technical leads, the elevation comes from efficiency and interoperability.

Copilot now acts as:

  • A connective interface linking files, communications, and cloud data.

  • A reasoning agent capable of understanding context across sessions and modalities.

  • A secure orchestration layer compatible with Microsoft’s compliance and identity framework.

Suleyman’s insistence that “technology should work in service of people” now extends to organizations as well: technology that serves teams, not workloads; systems that adapt to enterprise context rather than demand it.

What enterprises can take away from Microsoft CEO Satya Nadella's shareholder letter

Microsoft CEO Satya Nadella, one of the leading architects of the current generative AI boom and famed for steering the software giant's early investment in OpenAI (he later said he was "good for my $80 billion"), published his latest annual letter yesterday on LinkedIn (a Microsoft subsidiary). The letter is full of interesting ideas about the near-term future that enterprise technical decision makers would do well to pay attention to, as it could aid their own planning and tech stack development.

In a companion post on X, Nadella wrote, “AI is radically changing every layer of the tech stack, and we’re changing with it.”

The full letter reinforces that message: Microsoft sees itself not just participating in the AI revolution, but shaping its infrastructure, security, tooling and governance for decades to come.

While the message is addressed to Microsoft shareholders, the implications reach much further. The letter is a strategic signal to enterprise engineering leaders: CIOs, CTOs, AI leads, platform architects and security directors. Nadella outlines the direction of Microsoft’s innovation, but also what it expects from its customers and partners. The AI era is here, but it will be built by those who combine technical vision with operational discipline.

Below are the five most important takeaways for enterprise technical decision makers.

1. Security and reliability are now the foundation of the AI stack

Nadella makes security the first priority in the letter and ties it directly to Microsoft’s relevance going forward. Through its Secure Future Initiative (SFI), Microsoft has assigned the equivalent of 34,000 engineers to secure its identity systems, networks and software supply chain. Its Quality Excellence Initiative (QEI) aims to increase platform resiliency and strengthen global service uptime.

Microsoft’s positioning makes it clear that enterprises will no longer get away with “ship fast, harden later” AI deployments. Nadella calls security “non-negotiable,” signaling that AI infrastructure must now meet the standards of mission-critical software. That means identity-first architecture, zero-trust execution environments and change management discipline are now table stakes for enterprise AI.

2. AI infrastructure strategy is hybrid, open and sovereignty-ready

Nadella commits Microsoft to building “planet-scale systems” and backs that up with numbers: more than 400 Azure datacenters across 70 regions, two gigawatts of new compute capacity added this year, and new liquid-cooled GPU clusters rolling out across Azure. Microsoft also introduced Fairwater, a massive new AI datacenter in Wisconsin positioned to deliver unprecedented scale. Just as important, Microsoft is now officially multi-model. Azure AI Foundry offers access to more than 11,000 models including OpenAI, Meta, Mistral, Cohere and xAI. Microsoft is no longer pushing a single-model future, but a hybrid AI strategy.

Enterprises should interpret this as validation of “portfolio architectures,” where closed, open and domain-specific models coexist. Nadella also emphasizes growing investment in sovereign cloud offerings for regulated industries, previewing a world where AI systems will have to meet regional data residency and compliance requirements from day one.

3. AI agents—not just chatbots—are now Microsoft’s future

The AI shift inside Microsoft is no longer about copilots that answer questions. It is now about AI agents that perform work. Nadella points to the rollout of Agent Mode in Microsoft 365 Copilot, which turns natural language requests into multistep business workflows. GitHub Copilot evolves from code autocomplete into a “peer programmer” capable of executing tasks asynchronously. In security operations, Microsoft has deployed AI agents that autonomously respond to incidents. In healthcare, Copilot for Dragon Medical documents clinical encounters automatically.

This represents a major architectural pivot. Enterprises will need to move beyond prompt-response interfaces and begin engineering agent ecosystems that safely take actions inside business systems. That requires workflow orchestration, API integration strategies and strong guardrails. Nadella’s letter frames this as the next software platform shift.

4. Unified data platforms are required to unlock AI value

Nadella devotes significant attention to Microsoft Fabric and OneLake, calling Fabric the company’s fastest-growing data and analytics product ever. Fabric promises to centralize enterprise data from multiple cloud and analytics environments. OneLake provides a universal storage layer that binds analytics and AI workloads together.

Microsoft’s message is blunt: siloed data means stalled AI. Enterprise teams that want AI at scale must unify operational and analytical data into a single architecture, enforce consistent data contracts and standardize metadata governance. AI success is now a data engineering problem more than a model problem.

5. Trust, compliance and responsible AI are now mandatory for deployment

“People want technology they can trust,” Nadella writes. Microsoft now publishes Responsible AI Transparency Reports and aligns parts of its development process with UN human rights guidance. Microsoft is also committing to digital resilience in Europe and proactive safeguards against misuse of AI-generated content.

This shifts responsible AI out of the realm of corporate messaging and into engineering practice. Enterprises will need model documentation, reproducibility practices, audit trails, risk monitoring and human-in-the-loop checkpoints. Nadella signals that compliance will become integrated with product delivery—not an afterthought layered on top.

The real meaning of Microsoft’s AI strategy

Taken together, these five pillars send a clear message to enterprise leaders: AI maturity is no longer about building prototypes or proving use cases. System-level readiness now defines success. Nadella frames Microsoft’s mission as helping customers “think in decades and execute in quarters,” and that is more than corporate poetry. It is a call to build AI platforms engineered for longevity.

The companies that win in enterprise AI will be the ones that invest early in secure cloud foundations, unify their data architectures, enable agent-based workflows and embrace responsible AI as a prerequisite for scale—not a press release. Nadella is betting that the next industrial transformation will be powered by AI infrastructure, not AI demos. With this letter, he has made Microsoft’s ambition clear: to become the platform on which that transformation is built.

Qwen's new Deep Research update lets you turn its reports into webpages, podcasts in seconds

Chinese e-commerce giant Alibaba’s famously prolific Qwen Team of AI model researchers and engineers has introduced a major expansion to its Qwen Deep Research tool, available as an optional mode that users can activate in the web-based Qwen Chat (a competitor to ChatGPT).

The update lets users generate not only comprehensive research reports with well-organized citations, but also interactive web pages and multi-speaker podcasts — all within 1-2 clicks.

This functionality is part of a proprietary release, distinct from many of Qwen’s previous open-source model offerings.

While the feature relies on the open-source models Qwen3-Coder, Qwen-Image, and Qwen3-TTS to power its core capabilities, the end-to-end experience — including research execution, web deployment, and audio generation — is hosted and operated by Qwen.

This means users benefit from a managed, integrated workflow without needing to configure infrastructure. That said, developers with access to the open-source models could theoretically replicate similar functionality on private or commercial systems.

The update was announced via the team’s official X account (@Alibaba_Qwen) today, October 21, 2025, stating:

“Qwen Deep Research just got a major upgrade. It now creates not only the report, but also a live webpage and a podcast — powered by Qwen3-Coder, Qwen-Image, and Qwen3-TTS. Your insights, now visual and audible.”

Multi-Format Research Output

The core workflow begins with a user request inside the Qwen Chat interface. From there, Qwen collaborates by asking clarifying questions to shape the research scope, pulls data from the web and official sources, and analyzes or resolves any inconsistencies it finds — even generating custom code when needed.

A demo video posted by Qwen on X walks through this process on Qwen Chat using the U.S. SaaS market as an example.

In it, Qwen retrieves data from multiple industry sources, identifies discrepancies in market size estimates (e.g., $206 billion vs. $253 billion), and highlights ambiguities in the U.S. share of global figures. The assistant comments on differences in scope between sources and calculates a compound annual growth rate (CAGR) of 19.8% from 2020 to 2023, providing contextual analysis to back up the raw numbers.

Once the research is complete, users can click the "eyeball" icon below the output, which brings up a PDF-style report in the right-hand pane.

Then, when viewing the report in the right-hand pane, the user can click the "Create" button in the upper-right corner and select from the following two options:

  1. "Web Dev" which produces a live, professional-grade web page, automatically deployed and hosted by Qwen, using Qwen3-Coder for structure and Qwen-Image for visuals.

  2. "Podcast," which, as it states, produces an audio podcast, featuring dynamic, multi-speaker narration generated by Qwen3-TTS, also hosted by Qwen for easy sharing and playback.

This enables users to quickly convert a single research project into multiple forms of content — written, visual, and audible — with minimal extra input.

The website includes inline graphics generated by Qwen Image, making it suitable for use in public presentations, classrooms, or publishing.

The podcast feature allows users to select between 17 different speaker names as the host and 7 as the co-host, though I wasn't able to find a way to preview the voice outputs before selecting them. It appears designed for deep listening on the go.

There was no way to change the output language that I could see, so mine came out in English, like my reports and initial prompts, though the Qwen models are multilingual. The voices were slightly more robotic than those of other AI tools I've used.

Here's an example of a web page I generated on commonalities in authoritarian regimes throughout history, another one on UFO or UAP sightings, and below this paragraph, a podcast on UFO or UAP sightings.

While the website is hosted via a public link, the podcast must be downloaded by the user and can't be linked to publicly, from what I could tell in my brief usage so far.

Note that the podcast is quite different from the actual report: it is not a straight read-through audio version, but a new format in which two hosts discuss and banter about the subject, using the report as a jumping-off point.

The web page versions of the report also include new graphics not found in the PDF report.

Comparisons to Google's NotebookLM

While the new capabilities have been well received by many early users, comparisons to other research assistants have surfaced — particularly Google’s NotebookLM, which recently exited beta.

AI commentator and newsletter writer Chubby (@kimmonismus) noted on X:

“I am really grateful that Qwen provides regular updates. That’s great.

But the attempt to build a NotebookLM clone inside Qwen-3-max doesn’t sound very promising compared to Google’s version.”

While NotebookLM is built around organizing and querying existing documents and web pages, Qwen Deep Research focuses more on generating new research content from scratch, aggregating sources from the open web, and presenting it across multiple modalities.

The comparison suggests that while the two tools overlap in general concept — AI-assisted research — they diverge in approach and target user experience.

Availability

Qwen Deep Research is now live and available through the Qwen Chat app. The feature can be accessed with the following URL.

No pricing details have been provided for Qwen3-Max or the specific Deep Research capabilities as of this writing.

What's Next For Qwen Deep Research?

By combining research guidance, data analysis, and multi-format content creation into a single tool, Qwen Deep Research aims to streamline the path from idea to publishable output.

The integration of code, visuals, and voice makes it especially attractive to content creators, educators, and independent analysts who want to scale their research into web- or podcast-friendly forms without switching platforms.

Still, comparisons to more specialized offerings like NotebookLM raise questions about how Qwen’s generalized approach stacks up on depth, precision, and refinement. Whether the strength of its multi-format execution outweighs those concerns may come down to user priorities — and whether they value single-click publishing over tight integration with existing notes and materials.

For now, Qwen is signaling that research doesn’t end with a document — it begins with one.


Google's new vibe coding AI Studio experience lets anyone build, deploy apps live in minutes

Google AI Studio has gotten a big vibe coding upgrade with a new interface, buttons, suggestions and community features that allow anyone with an idea for an app — even complete novices, laypeople, or non-developers like yours truly — to bring it into existence and deploy it live, on the web, for anyone to use, within minutes.

The updated Build tab is available now at ai.studio/build, and it’s free to start.

Users can experiment with building applications without needing to enter payment information upfront, though certain advanced features like Veo 3.1 and Cloud Run deployment require a paid API key.

The new features appear to me to make Google's AI models and offerings even more competitive for many general users, perhaps even preferable to dedicated AI startup rivals like Anthropic's Claude Code and OpenAI's Codex, two "vibe coding" focused products that are beloved by developers but have a higher barrier to entry and may require more technical know-how.

A Fresh Start: Redesigned Build Mode

The updated Build tab serves as the entry point to vibe coding. It introduces a new layout and workflow where users can select from Google’s suite of AI models and features to power their applications. The default is Gemini 2.5 Pro, which is great for most cases.

Once selections are made, users simply describe what they want to build, and the system automatically assembles the necessary components using Gemini’s APIs.

This mode supports mixing capabilities like Nano Banana (a lightweight AI model), Veo (for video generation), Imagine (for image generation), Flashlight (for performance-optimized inference), and Google Search.

Patrick Löber, Developer Relations at Google DeepMind, highlighted that the experience is meant to help users “supercharge your apps with AI” using a simple prompt-to-app pipeline.

In a video demo he posted on X and LinkedIn, he showed how just a few clicks led to the automatic generation of a garden planning assistant app, complete with layouts, visuals, and a conversational interface.

From Prompt to Production: Building and Editing in Real Time

Once an app is generated, users land in a fully interactive editor. On the left, there’s a traditional code-assist interface where developers can chat with the AI model for help or suggestions. On the right, a code editor displays the full source of the app.

Each component—such as React entry points, API calls, or styling files—can be edited directly. Tooltips help users understand what each file does, which is especially useful for those less familiar with TypeScript or frontend frameworks.

Apps can be saved to GitHub, downloaded locally, or shared directly. Deployment is possible within the Studio environment or via Cloud Run if advanced scaling or hosting is needed.

Inspiration on Demand: The ‘I’m Feeling Lucky’ Button

One standout feature in this update is the “I’m Feeling Lucky” button. Designed for users who need a creative jumpstart, it generates randomized app concepts and configures the app setup accordingly. Each press yields a different idea, complete with suggested AI features and components.

Examples produced during demos include:

  • An interactive map-based chatbot powered by Google Search and conversational AI.

  • A dream garden designer using image generation and advanced planning tools.

  • A trivia game app with an AI host whose personality users can define, integrating both Imagine and Flashlight with Gemini 2.5 Pro for conversation and reasoning.

Logan Kilpatrick, Lead of Product for Google AI Studio and Gemini AI, noted in a demo video of his own that this feature encourages discovery and experimentation.

“You get some really, really cool, different experiences,” he said, emphasizing its role in helping users find novel ideas quickly.

Hands-On Test: From Prompt to App in 65 Seconds

To test the new workflow, I prompted Gemini with:

A randomized dice rolling web application where the user can select between common dice sizes (6 sides, 10 sides, etc) and then see an animated die rolling and choose the color of their die as well.

Within 65 seconds (just over a minute), AI Studio returned a fully working web app featuring:

  • Dice size selector (d4, d6, d8, d10, d12, d20)

  • Color customization options for the die

  • Animated rolling effect with randomized results

  • Clean, modern UI built with React, TypeScript, and Tailwind CSS

The platform also generated a complete set of structured files, including App.tsx, constants.ts, and separate components for dice logic and controls.

After generation, it was easy to iterate: adding sound effects for each interaction (rolling, choosing a die, changing color) required only a single follow-up prompt to the built-in assistant. This enhancement was also suggested by Gemini itself, incidentally.

From there, the app can be previewed live or exported using built-in controls to:

  • Save to GitHub

  • Download the full codebase

  • Copy the project for remixing

  • Deploy via integrated tools

My brief, hands-on test showed just how quickly even small utility apps can go from idea to interactive prototype—without leaving the browser or writing boilerplate code manually.

AI-Suggested Enhancements and Feature Refinement

In addition to code generation, Google AI Studio now offers context-aware feature suggestions. These recommendations, generated by Gemini’s Flashlight capability, analyze the current app and propose relevant improvements.

In one example, the system suggested implementing a feature that displays the history of previously generated images in an image studio tab. These iterative enhancements allow builders to expand app functionality over time without starting from scratch.

Kilpatrick emphasized that users can continue to refine their projects as they go, combining both automatic generation and manual adjustments. “You can go in and continue to edit and sort of refine the experience that you want iteratively,” he said.

Free to Start, Flexible to Grow

The new experience is available at no cost for users who want to experiment, prototype, or build lightweight apps. There’s no requirement to enter credit card information to begin using vibe coding.

However, more powerful capabilities — such as using models like Veo 3.1 or deploying through Cloud Run — do require switching to a paid API key.

This pricing structure is intended to lower the barrier to entry for experimentation while providing a clear path to scale when needed.

Built for All Skill Levels

One of the central goals of the vibe coding launch is to make AI app development accessible to more people. The system supports both high-level visual builders and low-level code editing, creating a workflow that works for developers across experience levels.

Kilpatrick mentioned that while he’s more familiar with Python than TypeScript, he still found the editor useful because of the helpful file descriptions and intuitive layout.

This focus on usability could make AI Studio a compelling option for developers exploring AI for the first time.

More to Come: A Week of Launches

The launch of vibe coding is the first in a series of announcements expected throughout the week. While specific future features haven’t been revealed yet, both Kilpatrick and Löber hinted that additional updates are on the way.

With this update, Google AI Studio positions itself as a flexible, user-friendly environment for building AI-powered applications—whether for fun, prototyping, or production deployment. The focus is clear: make the power of Gemini’s APIs accessible without unnecessary complexity.

Developers can now add live Google Maps data to Gemini-powered AI app outputs

Google is adding a new feature for third-party developers building atop its Gemini AI models that rivals like OpenAI's ChatGPT, Anthropic's Claude, and the growing array of Chinese open source options are unlikely to get anytime soon: grounding with Google Maps.

This addition allows developers to connect Google's Gemini AI models' reasoning capabilities with live geospatial data from Google Maps, enabling applications to deliver detailed, location-relevant responses to user queries—such as business hours, reviews, or the atmosphere of a specific venue.

By tapping into data from over 250 million places, developers can now build more intelligent and responsive location-aware experiences.

This is particularly useful for applications where proximity, real-time availability, or location-specific personalization matter—such as local search, delivery services, real estate, and travel planning.

When the user’s location is known, developers can pass latitude and longitude into the request to enhance the response quality.

By tightly integrating real-time and historical Maps data into the Gemini API, Google enables applications to generate grounded, location-specific responses with factual accuracy and contextual depth that are uniquely possible through its mapping infrastructure.

Merging AI and Geospatial Intelligence

The new feature is accessible in Google AI Studio, where developers can try a live demo powered by the Gemini Live API. Models that support the grounding with Google Maps include:

  • Gemini 2.5 Pro

  • Gemini 2.5 Flash

  • Gemini 2.5 Flash-Lite

  • Gemini 2.0 Flash

In one demonstration, a user asked for Italian restaurant recommendations in Chicago.

The assistant, leveraging Maps data, retrieved top-rated options and clarified a misspelled restaurant name before locating the correct venue with accurate business details.

Developers can also retrieve a context token to embed a Google Maps widget in their app’s user interface. This interactive component displays photos, reviews, and other familiar content typically found in Google Maps.

Integration is handled via the generateContent method in the Gemini API, where developers include googleMaps as a tool. They can also enable a Maps widget by setting a parameter in the request. The widget, rendered using a returned context token, can provide a visual layer alongside the AI-generated text.
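
To make that flow concrete, here is a minimal sketch of what such a request could look like using the google-genai Python SDK. The model name, tool keys, and location fields below are assumptions based on Google's description above and its developer documentation, and may differ slightly by SDK version.

from google import genai

client = genai.Client()  # assumes the GEMINI_API_KEY environment variable is set

response = client.models.generate_content(
    model="gemini-2.5-flash",  # example model; see the supported-model list above
    contents="Recommend a quiet Italian restaurant near me with outdoor seating.",
    config={
        # Enable Grounding with Google Maps; Grounding with Google Search could be
        # added to the same tools list for broader web context.
        "tools": [{"google_maps": {}}],
        # Optionally pass the user's coordinates to improve response quality.
        "tool_config": {
            "retrieval_config": {
                "lat_lng": {"latitude": 41.8781, "longitude": -87.6298}
            }
        },
    },
)

print(response.text)
# The response also carries grounding metadata (source links, place IDs, citation
# spans) and, when requested, a context token for rendering the Maps widget.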

Use Cases Across Industries

The Maps grounding tool is designed to support a wide range of practical use cases:

  • Itinerary generation: Travel apps can create detailed daily plans with routing, timing, and venue information.

  • Personalized local recommendations: Real estate platforms can highlight listings near kid-friendly amenities like schools and parks.

  • Detailed location queries: Applications can provide specific information, such as whether a cafe offers outdoor seating, using community reviews and Maps metadata.

Developers are encouraged to only enable the tool when geographic context is relevant, to optimize both performance and cost.

According to the developer documentation, pricing starts at $25 per 1,000 grounded prompts, a meaningful cost for applications that handle large query volumes.

Combining Search and Maps for Enhanced Context

Developers can use Grounding with Google Maps alongside Grounding with Google Search in the same request.

While the Maps tool contributes factual data—like addresses, hours, and ratings—the Search tool adds broader context from web content, such as news or event listings.

For example, when asked about live music on Beale Street, the combined tools provide venue details from Maps and event times from Search.

According to Google, internal testing shows that using both tools together leads to significantly improved response quality.

Unfortunately, it doesn't appear that the Google Maps grounding includes live vehicular traffic data — at least not yet.

Customization and Developer Flexibility

The experience is built for customization. Developers can tweak system prompts, choose from different Gemini models, and configure voice settings to tailor interactions.

The demo app in Google AI Studio is also remixable, enabling developers to test ideas, add features, and iterate on designs within a flexible development environment.

The API returns structured metadata—including source links, place IDs, and citation spans—that developers can use to build inline citations or verify the AI-generated outputs.

This supports transparency and enhances trust in user-facing applications. Google also requires that Maps-based sources be attributed clearly and linked back to the source using their URI.

Implementation Considerations for AI Builders

For technical teams integrating this capability, Google recommends:

  • Passing user location context when known, for better results.

  • Displaying Google Maps source links directly beneath the relevant content.

  • Only enabling the tool when the query clearly involves geographic context.

  • Monitoring latency and disabling grounding when performance is critical.

Grounding with Google Maps is currently available globally, though prohibited in several territories (including China, Iran, North Korea, and Cuba), and not permitted for emergency response use cases.

Availability and Access

Grounding with Google Maps is now generally available through the Gemini API.

With this release, Google continues to expand the capabilities of the Gemini API, empowering developers to build AI-driven applications that understand and respond to the world around them.

Researchers find adding this one simple sentence to prompts makes AI models way more creative

One of the coolest things about generative AI models — both large language models (LLMs) and diffusion-based image generators — is that they are "non-deterministic." That is, despite their reputation among some critics as being "fancy autocorrect," generative AI models actually generate their outputs by choosing from a distribution of the most probable next tokens (units of information) to fill out their response.

Asking an LLM: "What is the capital of France?" will have it sample its probability distribution for France, capitals, cities, etc. to arrive at the answer "Paris." But that answer could come in the format of "The capital of France is Paris," or simply "Paris" or "Paris, though it was Versailles at one point."

Still, those of us who use these models frequently day-to-day will note that sometimes their answers can feel annoyingly repetitive or similar. A common joke about coffee is recycled across generations of queries. Story prompts generate similar arcs. Even tasks that should yield many plausible answers—like naming U.S. states—tend to collapse into only a few. This phenomenon, known as mode collapse, arises during post-training alignment and limits the usefulness of otherwise powerful models.

Especially when using LLMs to generate new creative works in writing, communications, strategy, or illustrations, we actually want their outputs to be even more varied than they already are.

Now a team of researchers at Northeastern University, Stanford University, and West Virginia University has come up with an ingeniously simple method to get language and image models to generate a wider variety of responses to nearly any user prompt by adding a single, simple sentence: "Generate 5 responses with their corresponding probabilities, sampled from the full distribution."

The method, called Verbalized Sampling (VS), helps models like GPT-4, Claude, and Gemini produce more diverse and human-like outputs—without retraining or access to internal parameters. It is described in a paper published on the open-access preprint server arXiv.org in early October 2025.

When prompted in this way, the model no longer defaults to its safest, most typical output. Instead, it verbalizes its internal distribution over potential completions and samples across a wider spectrum of possibilities. This one-line change leads to substantial gains in output diversity across multiple domains.
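
As a concrete illustration, here is a minimal sketch of Verbalized Sampling applied through the OpenAI Python SDK; the model name is just an example, since the technique is training-free and works the same way with any of the major chat APIs.

from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

vs_prompt = (
    "Tell me a joke about coffee. "
    "Generate 5 responses with their corresponding probabilities, "
    "sampled from the full distribution."
)

completion = client.chat.completions.create(
    model="gpt-4.1",  # example model; VS requires no retraining or internal access
    messages=[{"role": "user", "content": vs_prompt}],
)

# Instead of one "safe" joke, the reply lists several candidates with verbalized
# probabilities, from which you can pick or sample.
print(completion.choices[0].message.content)

The same instruction can be tightened with a probability threshold (for example, asking for responses each with a probability below 0.10), as discussed below, to push sampling further into the distribution's tails.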

As Weiyan Shi, an assistant professor at Northeastern University and co-author of the paper, wrote on X: "LLMs' potentials are not fully unlocked yet! As shown in our paper, prompt optimization can be guided by thinking about how LLMs are trained and aligned, and can be proved theoretically."

Why Models Collapse—and How VS Reverses It

According to the research team, the root cause of mode collapse lies not just in algorithms like reinforcement learning from human feedback (RLHF), but in the structure of human preferences. People tend to rate more familiar or typical answers as better, which nudges LLMs toward “safe” choices over diverse ones during fine-tuning.

However, this bias doesn’t erase the model’s underlying knowledge—it just suppresses it. VS works by bypassing this suppression. Instead of asking for the single most likely output, it invites the model to reveal a set of plausible responses and their relative probabilities. This distribution-level prompting restores access to the richer diversity present in the base pretraining model.

Real-World Performance Across Tasks

The research team tested Verbalized Sampling across several common use cases:

  • Creative Writing: In story generation, VS increased diversity scores by up to 2.1× compared to standard prompting, while maintaining quality. One story prompt—“Without a goodbye”—produced formulaic breakup scenes under direct prompting, but yielded narratives involving cosmic events, silent emails, and music stopping mid-dance when prompted via VS.

  • Dialogue Simulation: In persuasive dialogue tasks, VS enabled models to simulate human-like patterns, such as hesitation, resistance, and changes of mind. Donation behavior distributions under VS better aligned with real human data compared to baseline methods.

  • Open-ended QA: When asked to enumerate valid answers (e.g., naming U.S. states), models using VS generated responses that more closely matched the diversity of real-world data. They covered a broader set of answers without sacrificing factual accuracy.

  • Synthetic Data Generation: When used to generate math problems for model training, VS created more varied datasets. These, in turn, improved downstream performance in competitive math benchmarks, outperforming synthetic data generated via direct prompting.

Tunable Diversity and Better Use of Larger Models

A notable advantage of VS is its tunability. Users can set a probability threshold in the prompt to sample from lower-probability “tails” of the model’s distribution. Lower thresholds correspond to higher diversity. This tuning can be done via prompt text alone, without changing any decoding settings like temperature or top-p.

In one test using the Gemini-2.5-Flash model, diversity in story writing increased steadily as the probability threshold dropped from 1 to 0.001. The chart accompanying the study showed VS outperforming both direct and sequence-based prompting across all thresholds.

Interestingly, the method scales well with model size. Larger models like GPT-4.1 and Claude-4 showed even greater gains from VS compared to smaller ones. While smaller models benefitted, the improvement in diversity was roughly 1.5–2× stronger in larger counterparts—suggesting VS helps unlock more of the latent capabilities in advanced models.

Deployment and Availability

The Verbalized Sampling method is available now as a Python package:

pip install verbalized-sampling

The package includes integration with LangChain and supports a simple interface for sampling from the verbalized distribution. Users can also adjust parameters like k (number of responses), thresholds, and temperature to suit their applications.

A live Colab notebook and documentation are available under an enterprise-friendly Apache 2.0 license on GitHub at: https://github.com/CHATS-lab/verbalized-sampling

Practical Tips and Common Issues

While the method works across all major LLMs, some users may initially encounter refusals or errors.

In these cases, the authors suggest using the system prompt version of the template or referring to alternative formats listed on the GitHub page.

Some models interpret complex instructions as jailbreak attempts and refuse to comply unless the structure is clearer.

For example, prompting via a system-level instruction like this improves reliability:

You are a helpful assistant. For each query, generate five responses within separate tags, each with a probability below 0.10.

This small change typically resolves any issues.

A Lightweight Fix for a Big Problem

Verbalized Sampling represents a practical, inference-time fix to a deep limitation in how modern language models behave. It doesn’t require model retraining or internal access. It is not dependent on any one model family. And it improves not only the diversity of outputs, but their quality—as judged by both human evaluation and benchmark scores.

With growing interest in tools that enhance model creativity, VS is likely to see rapid adoption in domains like writing, design, simulation, education, and synthetic data generation.

For users and developers frustrated by the sameness of LLM responses, the fix may be as simple as changing the question.

Google releases new AI video model Veo 3.1 in Flow and API: what it means for enterprises

As expected after days of leaks and rumors online, Google has unveiled Veo 3.1, its latest AI video generation model, bringing a suite of creative and technical upgrades aimed at improving narrative control, audio integration, and realism in AI-generated video.

While the updates expand possibilities for hobbyists and content creators using Google’s online AI creation app, Flow, the release also signals a growing opportunity for enterprises, developers, and creative teams seeking scalable, customizable video tools.

The quality is higher, the physics better, the pricing the same as before, and the control and editing features more robust and varied.

My initial tests showed it to be a powerful and performant model that immediately delights with each generation. However, its default look is more cinematic, polished, and a little more "artificial" than that of rivals such as OpenAI's new Sora 2, released late last month, which may or may not be what a particular user is after (Sora excels at handheld, "candid"-style videos).

Expanded Control Over Narrative and Audio

Veo 3.1 builds on its predecessor, Veo 3 (released back in May 2025) with enhanced support for dialogue, ambient sound, and other audio effects.

Native audio generation is now available across several key features in Flow, including “Frames to Video,” “Ingredients to Video,” and “Extend," which give users the ability to, respectively: turn still images into video; combine items, characters, and objects from multiple images in a single video; and generate clips longer than the initial 8 seconds, extending to more than 30 seconds or even a minute or more when continuing from a prior clip's final frame.

Before, you had to add audio manually after using these features.

This addition gives users greater command over tone, emotion, and storytelling — capabilities that have previously required post-production work.

In enterprise contexts, this level of control may reduce the need for separate audio pipelines, offering an integrated way to create training content, marketing videos, or digital experiences with synchronized sound and visuals.

Google noted in a blog post that the updates reflect user feedback calling for deeper artistic control and improved audio support. Gallegos emphasizes the importance of making edits and refinements possible directly in Flow, without reworking scenes from scratch.

Richer Inputs and Editing Capabilities

With Veo 3.1, Google introduces support for multiple input types and more granular control over generated outputs. The model accepts text prompts, images, and video clips as input, and also supports:

  • Reference images (up to three) to guide appearance and style in the final output

  • First and last frame interpolation to generate seamless scenes between fixed endpoints

  • Scene extension that continues a video’s action or motion beyond its current duration

These tools aim to give enterprise users a way to fine-tune the look and feel of their content—useful for brand consistency or adherence to creative briefs.

Additional capabilities like “Insert” (add objects to scenes) and “Remove” (delete elements or characters) are also being introduced, though not all are immediately available through the Gemini API.

Deployment Across Platforms

Veo 3.1 is accessible through several of Google’s existing AI services:

  • Flow, Google’s own interface for AI-assisted filmmaking

  • Gemini API, targeted at developers building video capabilities into applications

  • Vertex AI, where enterprise integration will soon support Veo’s “Scene Extension” and other key features

Availability through these platforms allows enterprise customers to choose the right environment—GUI-based or programmatic—based on their teams and workflows.

Pricing and Access

The Veo 3.1 model is currently in preview and available only on the paid tier of the Gemini API. The cost structure is the same as Veo 3, the preceding generation of AI video models from Google.

  • Standard model: $0.40 per second of video

  • Fast model: $0.15 per second

There is no free tier, and users are charged only if a video is successfully generated. This pricing model is consistent with previous Veo versions and provides predictable costs for budget-conscious enterprise teams.

Technical Specs and Output Control

Veo 3.1 outputs video at 720p or 1080p resolution, with a 24 fps frame rate.

Duration options include 4, 6, or 8 seconds from a text prompt or uploaded images, with the ability to extend videos up to 148 seconds (more than two and a half minutes) when using the “Extend” feature.

New functionality also includes tighter control over subjects and environments. For example, enterprises can upload a product image or visual reference, and Veo 3.1 will generate scenes that preserve its appearance and stylistic cues across the video. This could streamline creative production pipelines for retail, advertising, and virtual content production teams.
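
For teams accessing Veo 3.1 programmatically rather than through Flow, a generation call is a long-running operation in the Gemini API. The sketch below uses the google-genai Python SDK; the Veo 3.1 model identifier and the exact response fields are assumptions based on Google's current Veo documentation and may change as the preview evolves.

import time
from google import genai

client = genai.Client()  # assumes a paid-tier GEMINI_API_KEY is configured

# Kick off a generation job; video generation runs asynchronously.
operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",  # assumed identifier for the Veo 3.1 preview
    prompt="A product hero shot of a ceramic mug on a sunlit kitchen counter, slow dolly-in.",
)

# Poll until the clip is ready, then save it locally.
while not operation.done:
    time.sleep(10)
    operation = client.operations.get(operation)

video = operation.response.generated_videos[0]
client.files.download(file=video.video)
video.video.save("mug_hero_shot.mp4")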

Initial Reactions

The broader creator and developer community has responded to Veo 3.1’s launch with a mix of optimism and tempered critique—particularly when comparing it to rival models like OpenAI’s Sora 2.

Matt Shumer, an AI founder (Otherside AI/HyperWrite) and early adopter, described his initial reaction as “disappointment,” noting that Veo 3.1 is “noticeably worse than Sora 2” and also “quite a bit more expensive.”

However, he acknowledged that Google’s tooling—such as support for references and scene extension—is a bright spot in the release.

Travis Davids, a 3D digital artist and AI content creator, echoed some of that sentiment. While he noted improvements in audio quality, particularly in sound effects and dialogue, he raised concerns about limitations that remain in the system.

These include the lack of custom voice support, an inability to select generated voices directly, and the continued cap at 8-second generations—despite some public claims about longer outputs.

Davids also pointed out that character consistency across changing camera angles still requires careful prompting, whereas other models like Sora 2 handle this more automatically. He questioned the absence of 1080p resolution for users on paid tiers like Flow Pro and expressed skepticism over feature parity.

On the more positive end, @kimmonismus, an AI newsletter writer, stated that “Veo 3.1 is amazing,” though still concluded that OpenAI’s latest model remains preferable overall.

Collectively, these early impressions suggest that while Veo 3.1 delivers meaningful tooling enhancements and new creative control features, expectations have shifted as competitors raise the bar on both quality and usability.

Adoption and Scale

Google says that since Flow launched five months ago, over 275 million videos have been generated across the various Veo models.

The pace of adoption suggests significant interest not only from individuals but also from developers and businesses experimenting with automated content creation.

Thomas Iljic, Director of Product Management at Google Labs, highlights that Veo 3.1’s release brings capabilities closer to how human filmmakers plan and shoot. These include scene composition, continuity across shots, and coordinated audio—all areas that enterprises increasingly look to automate or streamline.

Safety and Responsible AI Use

Videos generated with Veo 3.1 are watermarked using Google’s SynthID technology, which embeds an imperceptible identifier to signal that the content is AI-generated.

Google applies safety filters and moderation across its APIs to help minimize privacy and copyright risks. Generated content is stored temporarily and deleted after two days unless downloaded.

For developers and enterprises, these features provide reassurance around provenance and compliance—critical in regulated or brand-sensitive industries.

Where Veo 3.1 Stands Among a Crowded AI Video Model Space

Veo 3.1 is not just an iteration on prior models—it represents a deeper integration of multimodal inputs, storytelling control, and enterprise-level tooling. While creative professionals may see immediate benefits in editing workflows and fidelity, businesses exploring automation in training, advertising, or virtual experiences may find even greater value in the model’s composability and API support.

The early user feedback highlights that while Veo 3.1 offers valuable tooling, expectations around realism, voice control, and generation length are evolving rapidly. As Google expands access through Vertex AI and continues refining Veo, its competitive positioning in enterprise video generation will hinge on how quickly these user pain points are addressed.

EAGLET boosts AI agent performance on longer-horizon tasks by generating custom plans

2025 was supposed to be the year of "AI agents," according to Nvidia CEO Jensen Huang and other AI industry figures. And it has been, in many ways: numerous leading AI model providers such as OpenAI, Google, and even Chinese competitors like Alibaba have released fine-tuned AI models or applications designed to focus on a narrow set of tasks, such as web search and report writing.

But one big hurdle to a future of highly performant, reliable AI agents remains: getting them to stay on task when the task extends over a number of steps. Third-party benchmark tests show that even the most powerful AI models experience higher failure rates the more steps they take to complete a task and the longer they spend on it (sometimes exceeding hours).

A new academic framework called EAGLET proposes a practical and efficient method to improve long-horizon task performance in LLM-based agents — without the need for manual data labeling or retraining.

Developed by researchers from Tsinghua University, Peking University, DeepLang AI, and the University of Illinois Urbana-Champaign, EAGLET offers a "global planner" that can be integrated into existing agent workflows to reduce hallucinations and improve task efficiency.

EAGLET is a fine-tuned language model that interprets task instructions — typically provided as prompts by the user or the agent's operating environment — and generates a high-level plan for the agent (powered by its own LLM). It does not intervene during execution, but its up-front guidance helps reduce planning errors and improve task completion rates.

Addressing the Planning Problem in Long-Horizon Agents

Many LLM-based agents struggle with long-horizon tasks because they rely on reactive, step-by-step reasoning. This approach often leads to trial-and-error behavior, planning hallucinations, and inefficient trajectories.

EAGLET tackles this limitation by introducing a global planning module that works alongside the executor agent.

Instead of blending planning and action generation in a single model, EAGLET separates them, enabling more coherent, task-level strategies.
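
To illustrate the separation, here is a rough, hypothetical sketch of a plan-then-execute loop of the kind EAGLET describes; the planner, executor, and environment objects are stand-ins, not the authors' actual interfaces.

def run_with_global_plan(instruction, planner_llm, executor_llm, env, max_steps=30):
    """Illustrative only: a global planner produces up-front guidance, then a
    separate executor acts step by step with that plan in its context."""
    plan = planner_llm.generate(
        f"Task: {instruction}\nWrite a concise, high-level plan for completing this task."
    )

    observation = env.reset(instruction)
    history = []
    for _ in range(max_steps):
        # ReAct-style step: the executor sees the task, the plan, and the trajectory so far.
        action = executor_llm.generate(
            f"Task: {instruction}\nPlan: {plan}\nHistory: {history}\n"
            f"Observation: {observation}\nNext action:"
        )
        observation, done = env.step(action)
        history.append((action, observation))
        if done:
            break
    return history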

A Two-Stage Training Pipeline with No Human Annotations

EAGLET’s planner is trained using a two-stage process that requires no human-written plans or annotations.

The first stage involves generating synthetic plans with high-capability LLMs, such as GPT-5 and DeepSeek-V3.1-Think.

These plans are then filtered using a novel strategy called homologous consensus filtering, which retains only those that improve task performance for both expert and novice executor agents.

In the second stage, a rule-based reinforcement learning process further refines the planner, using a custom-designed reward function to assess how much each plan helps multiple agents succeed.

Introducing the Executor Capability Gain Reward (ECGR)

One of EAGLET’s key innovations is the Executor Capability Gain Reward (ECGR).

This reward measures the value of a generated plan by checking whether it helps both high- and low-capability agents complete tasks more successfully and with fewer steps.

It also includes a decay factor to favor shorter, more efficient task trajectories. This approach avoids over-rewarding plans that are only useful to already-competent agents and promotes more generalizable planning guidance.
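
The snippet below shows one way a reward with these properties could be structured; it is a hedged illustration only, the executor interface is hypothetical, and the paper's exact formulation may differ.

def ecgr_style_reward(plan, task, executors, decay=0.9):
    """Illustrative only: score a plan by how much it helps a mix of expert and
    novice executors, discounted so that shorter trajectories score higher."""
    total = 0.0
    for executor in executors:  # e.g., one high-capability and one low-capability agent
        rollout = executor.run(task, plan=plan)  # hypothetical: returns a success flag and step count
        total += rollout.success * (decay ** rollout.num_steps)
    return total / len(executors)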

Compatible with Existing Agents and Models

The EAGLET planner is designed to be modular and "plug-and-play," meaning it can be inserted into existing agent pipelines without requiring executor retraining.

In evaluations, the planner boosted performance across a variety of foundational models, including GPT-4.1, GPT-5, Llama-3.1, and Qwen2.5.

It also proved effective regardless of prompting strategy, working well with standard ReAct-style prompts as well as approaches like Reflexion.

State-of-the-Art Performance Across Benchmarks

EAGLET was tested on three widely used benchmarks for long-horizon agent tasks: ScienceWorld, which simulates scientific experiments in a text-based lab environment; ALFWorld, which tasks agents with completing household activities through natural language in a simulated home setting; and WebShop, which evaluates goal-driven behavior in a realistic online shopping interface.

Across all three, executor agents equipped with EAGLET outperformed their non-planning counterparts and other planning baselines, including MPO and KnowAgent.

In experiments with the open source Llama-3.1-8B-Instruct model, EAGLET boosted average performance from 39.5 to 59.4, a +19.9 point gain across tasks.

On ScienceWorld unseen scenarios, it raised performance from 42.2 to 61.6.

In ALFWorld seen scenarios, EAGLET improved outcomes from 22.9 to 54.3, a more than 2.3× increase in performance.

Even stronger gains were seen with more capable models.

For instance, GPT-4.1 improved from 75.5 to 82.2 average score with EAGLET, and GPT-5 rose from 84.5 to 88.1, despite already being strong performers.

In some benchmarks, performance gains were as high as +11.8 points, such as when combining EAGLET with the ETO executor method on ALFWorld unseen tasks.

Compared to other planning baselines like MPO, EAGLET consistently delivered higher task completion rates. For example, on ALFWorld unseen tasks with GPT-4.1, MPO achieved 79.1, while EAGLET scored 83.6—a +4.5 point advantage.

Additionally, the paper reports that agents using EAGLET complete tasks in fewer steps on average. With GPT-4.1 as executor, average step count dropped from 13.0 (no planner) to 11.1 (EAGLET). With GPT-5, it dropped from 11.4 to 9.4, supporting the claim of improved execution efficiency.

Efficiency Gains in Training and Execution

Compared to RL-based methods like GiGPO, which can require hundreds of training iterations, EAGLET achieved better or comparable results with roughly one-eighth the training effort.

This efficiency also carries over into execution: agents using EAGLET typically needed fewer steps to complete tasks. This translates into reduced inference time and compute cost in production scenarios.

No Public Code—Yet

As of the version submitted to arXiv, the authors have not released an open-source implementation of EAGLET. It is unclear if or when the code will be released, under what license, or how it will be maintained, which may limit the near-term utility of the framework for enterprise deployment.

VentureBeat has reached out to the authors to clarify these points and will update this piece when we hear back.

Enterprise Deployment Questions Remain

While the planner is described as plug-and-play, it remains unclear whether EAGLET can be easily integrated into popular enterprise agent frameworks such as LangChain or AutoGen, or if it requires a custom stack to support plan-execute separation.

Similarly, the training setup leverages multiple executor agents, which may be difficult to replicate in enterprise environments with limited model access. VentureBeat has asked the researchers whether the homologous consensus filtering method can be adapted for teams that only have access to one executor model or limited compute resources.

EAGLET’s authors report success across model types and sizes, but it is not yet known what the minimal viable model scale is for practical deployment. For example, can enterprise teams use the planner effectively with sub-10B parameter open models in latency-sensitive environments? Additionally, the framework may offer industry-specific value in domains like customer support or IT automation, but it remains to be seen how easily the planner can be fine-tuned or customized for such verticals.

Real-Time vs. Pre-Generated Planning

Another open question is how EAGLET is best deployed in practice. Should the planner operate in real-time alongside executors within a loop, or is it better used offline to pre-generate global plans for known task types? Each approach has implications for latency, cost, and operational complexity. VentureBeat has posed this question to the authors and will report any insights that emerge.

Strategic Tradeoffs for Enterprise Teams

For technical leaders at medium-to-large enterprises, EAGLET represents a compelling proof of concept for improving the reliability and efficiency of LLM agents. But without public tooling or implementation guidelines, the framework still presents a build-versus-wait decision. Enterprises must weigh the potential gains in task performance and efficiency against the costs of reproducing or approximating the training process in-house.

Potential Use Cases in Enterprise Settings

For enterprises developing agentic AI systems—especially in environments requiring stepwise planning, such as IT automation, customer support, or online interactions—EAGLET offers a template for how to incorporate planning without retraining. Its ability to guide both open- and closed-source models, along with its efficient training method, may make it an appealing starting point for teams seeking to improve agent performance with minimal overhead.

Self-improving language models are becoming reality with MIT's updated SEAL technique

Researchers at the Massachusetts Institute of Technology (MIT) are gaining renewed attention for developing and open sourcing a technique that allows large language models (LLMs) — like those underpinning ChatGPT and most modern AI chatbots — to improve themselves by generating synthetic data to fine-tune upon.

The technique, known as SEAL (Self-Adapting LLMs), was first described in a paper published back in June and covered by VentureBeat at the time.

A significantly expanded and updated version of the paper was released last month, along with open source code posted on GitHub (under an MIT License that allows commercial and enterprise usage), and it is making new waves among AI power users on the social network X this week.

SEAL allows LLMs to autonomously generate and apply their own fine-tuning strategies. Unlike conventional models that rely on fixed external data and human-crafted optimization pipelines, SEAL enables models to evolve by producing their own synthetic training data and corresponding optimization directives.

The development comes from a team affiliated with MIT’s Improbable AI Lab, including Adam Zweiger, Jyothish Pari, Han Guo, Ekin Akyürek, Yoon Kim, and Pulkit Agrawal. Their research was recently presented at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025).

Background: From “Beyond Static AI” to Self-Adaptive Systems

Earlier this year, VentureBeat first reported on SEAL as an early-stage framework that allowed language models to generate and train on their own synthetic data — a potential remedy for the stagnation of pretrained models once deployed.

At that stage, SEAL was framed as a proof-of-concept that could let enterprise AI agents continuously learn in dynamic environments without manual retraining.

Since then, the research has advanced considerably. The new version expands on the prior framework by demonstrating that SEAL’s self-adaptation ability scales with model size, integrates reinforcement learning more effectively to reduce catastrophic forgetting, and formalizes SEAL’s dual-loop structure (inner supervised fine-tuning and outer reinforcement optimization) for reproducibility.

The updated paper also introduces evaluations across different prompting formats, improved stability during learning cycles, and a discussion of practical deployment challenges at inference time.

Addressing the Limitations of Static Models

While LLMs have demonstrated remarkable capabilities in text generation and understanding, their adaptation to new tasks or knowledge is often manual, brittle, or dependent on context.

SEAL challenges this status quo by equipping models with the ability to generate what the authors call “self-edits” — natural language outputs that specify how the model should update its weights.

These self-edits may take the form of reformulated information, logical implications, or tool configurations for augmentation and training. Once generated, the model fine-tunes itself based on these edits. The process is guided by reinforcement learning, where the reward signal comes from improved performance on a downstream task.
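To make that cycle concrete, here is a minimal Python sketch of a single SEAL-style update step. The callables (generate, finetune, evaluate) are placeholders for an LLM call, a supervised fine-tuning run, and a downstream benchmark; they are illustrative assumptions, not the authors’ released API.

```python
# Illustrative sketch of one SEAL-style update step (not the authors' code).
# `generate`, `finetune`, and `evaluate` stand in for a real LLM call,
# a supervised fine-tuning run, and a downstream-task benchmark.

def seal_step(model, context, task, generate, finetune, evaluate):
    # 1. The model writes a "self-edit": restated training data plus
    #    optimization directives, expressed in natural language.
    self_edit = generate(model, "Rewrite as training data:\n" + context)

    # 2. The model is fine-tuned on its own self-edit (the inner loop).
    updated_model = finetune(model, self_edit)

    # 3. The updated model's score on a downstream task is the reward that
    #    the outer reinforcement-learning loop uses to decide which kinds
    #    of self-edits to reinforce.
    reward = evaluate(updated_model, task)
    return self_edit, updated_model, reward
```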

The design mimics how human learners might rephrase or reorganize study materials to better internalize information. This restructuring of knowledge before assimilation serves as a key advantage over models that passively consume new data “as-is.”

Performance Across Tasks

SEAL has been tested across two main domains: knowledge incorporation and few-shot learning.

In the knowledge incorporation setting, the researchers evaluated how well a model could internalize new factual content from passages similar to those in SQuAD (the Stanford Question Answering Dataset), a benchmark reading comprehension dataset introduced by Stanford University in 2016 that consists of more than 100,000 crowd-sourced question–answer pairs based on Wikipedia articles (Rajpurkar et al., 2016).

Rather than fine-tuning directly on passage text, the model generated synthetic implications of the passage and then fine-tuned on them.

After two rounds of reinforcement learning, the model improved question-answering accuracy from 33.5% to 47.0% on a no-context version of SQuAD — surpassing results obtained using synthetic data generated by GPT-4.1.
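As a rough illustration of what that implication-generation step could look like, the snippet below prompts a model for standalone implications of a passage and turns the reply into plain training strings. The prompt wording and the complete callable are hypothetical, not taken from the paper.

```python
# Hypothetical sketch of a knowledge-incorporation self-edit: turn a passage
# into standalone "implications" to fine-tune on, rather than training on the
# raw passage text itself.

IMPLICATION_PROMPT = (
    "Read the passage below and list, one per line, factual statements that "
    "follow from it. Each statement must stand on its own without the "
    "passage.\n\nPassage:\n{passage}\n\nImplications:\n"
)

def build_training_examples(passage: str, complete) -> list[str]:
    """`complete` is any text-completion callable (e.g., an LLM wrapper)."""
    reply = complete(IMPLICATION_PROMPT.format(passage=passage))
    # Keep non-empty lines; each one becomes a fine-tuning example.
    return [line.strip("-• ").strip() for line in reply.splitlines() if line.strip()]
```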

In the few-shot learning setting, SEAL was evaluated on a subset of the ARC (Abstraction and Reasoning Corpus) benchmark, where tasks require reasoning from only a few examples. Here, SEAL generated self-edits specifying data augmentations and hyperparameters.

After reinforcement learning, the success rate in correctly solving held-out tasks jumped to 72.5%, up from 20% using self-edits generated without reinforcement learning. Models that relied solely on in-context learning without any adaptation scored 0%.
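In this setting a self-edit reads less like prose and more like a training recipe. A hypothetical example, with the field names and values invented for illustration:

```python
import json

# Hypothetical few-shot self-edit: instead of new facts, the model emits a
# recipe of data augmentations and hyperparameters for its own fine-tuning.
self_edit = json.loads("""
{
  "augmentations": ["rotate_90", "flip_horizontal", "permute_colors"],
  "learning_rate": 1e-4,
  "epochs": 3,
  "loss_on": "output_grid_only"
}
""")

# A training harness would expand the handful of demonstration examples with
# the chosen augmentations, then fine-tune using the requested settings.
print(self_edit["augmentations"], self_edit["learning_rate"])
```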

Technical Framework

SEAL operates using a two-loop structure: an inner loop performs supervised fine-tuning based on the self-edit, while an outer loop uses reinforcement learning to refine the policy that generates those self-edits.

The reinforcement learning algorithm used is based on ReSTEM, which combines sampling with filtered behavior cloning. During training, only self-edits that lead to performance improvements are reinforced. This approach effectively teaches the model which kinds of edits are most beneficial for learning.
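A minimal sketch of that filtering idea is shown below; the helper callables for sampling self-edits, fine-tuning, scoring, and cloning the policy are assumptions for illustration, not functions from the released code.

```python
import copy

# Sketch of one ReSTEM-style outer-loop round (illustrative, not the released
# implementation). `sample_edit` proposes a self-edit, `finetune` runs the
# inner supervised update, `score` measures downstream performance, and
# `clone_policy` fine-tunes the edit-generating model on the surviving pairs.

def restem_round(model, tasks, sample_edit, finetune, score, clone_policy,
                 samples_per_task=4):
    winners = []
    for task in tasks:
        baseline = score(model, task)
        for _ in range(samples_per_task):
            edit = sample_edit(model, task)                   # propose
            candidate = finetune(copy.deepcopy(model), edit)  # inner SFT
            if score(candidate, task) > baseline:             # filter
                winners.append((task, edit))
    # Filtered behavior cloning: reinforce only the edits that helped.
    return clone_policy(model, winners)
```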

For efficiency, SEAL applies LoRA-based fine-tuning rather than full parameter updates, enabling rapid experimentation and low-cost adaptation.
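For a sense of what that looks like in practice, a LoRA adapter can be attached with Hugging Face’s peft library as sketched below; the base model and hyperparameters are illustrative choices, not the configuration used in the paper.

```python
# Illustrative LoRA setup with Hugging Face peft (not the paper's exact config).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

lora_cfg = LoraConfig(
    r=16,                                  # low-rank adapter size
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of all weights
# Fine-tuning on self-edit data now updates only the small adapter matrices,
# which keeps each SEAL inner-loop update fast and relatively cheap.
```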

Strengths and Limitations

The researchers report that SEAL can produce high-utility training data with minimal supervision, outperforming even large external models like GPT-4.1 in specific tasks.

They also demonstrate that SEAL generalizes beyond its original setup: it continues to perform well when scaling from single-pass updates to multi-document continued pretraining scenarios.

However, the framework is not without limitations. One issue is catastrophic forgetting, where updates to incorporate new information can degrade performance on previously learned tasks.

In response to this concern, co-author Jyo Pari told VentureBeat via email that reinforcement learning (RL) appears to mitigate forgetting more effectively than standard supervised fine-tuning (SFT), citing a recent paper on the topic. He added that combining this insight with SEAL could lead to new variants where SEAL learns not just training data, but reward functions.

Another challenge is computational overhead: evaluating each self-edit requires fine-tuning and performance testing, which can take 30–45 seconds per edit — significantly more than standard reinforcement learning tasks.

As Jyo explained, “Training SEAL is non-trivial because it requires 2 loops of optimization, an outer RL one and an inner SFT one. At inference time, updating model weights will also require new systems infrastructure.” He emphasized the need for future research into deployment systems as a critical path to making SEAL practical.

Additionally, SEAL’s current design assumes the presence of paired tasks and reference answers for every context, limiting its direct applicability to unlabeled corpora. However, Jyo clarified that as long as there is a downstream task with a computable reward, SEAL can be trained to adapt accordingly—even in safety-critical domains. In principle, a SEAL-trained model could learn to avoid training on harmful or malicious inputs if guided by the appropriate reward signal.

AI Community Reactions

The AI research and builder community has reacted with a mix of excitement and speculation to the SEAL paper. On X, formerly Twitter, several prominent AI-focused accounts weighed in on the potential impact.

User @VraserX, a self-described educator and AI enthusiast, called SEAL “the birth of continuous self-learning AI” and predicted that models like OpenAI's GPT-6 could adopt a similar architecture.

In their words, SEAL represents “the end of the frozen-weights era,” ushering in systems that evolve as the world around them changes.

They highlighted SEAL's ability to form persistent memories, repair knowledge, and learn from real-time data, calling it a foundational step toward models that don’t just use information but absorb it.

Meanwhile, @alex_prompter, co-founder of an AI-powered marketing venture, framed SEAL as a leap toward models that literally rewrite themselves. “MIT just built an AI that can rewrite its own code to get smarter,” he wrote. Citing the paper’s key results — a 40% boost in factual recall and outperforming GPT-4.1 using self-generated data — he described the findings as confirmation that “LLMs that finetune themselves are no longer sci-fi.”

The enthusiasm reflects a broader appetite in the AI space for models that can evolve without constant retraining or human oversight — particularly in rapidly changing domains or personalized use cases.

Future Directions and Open Questions

In response to questions about scaling SEAL to larger models and tasks, Jyo pointed to experiments (Appendix B.7) showing that as model size increases, so does their self-adaptation ability. He compared this to students improving their study techniques over time — larger models are simply better at generating useful self-edits.

When asked whether SEAL generalizes to new prompting styles, he confirmed it does, citing Table 10 in the paper. However, he also acknowledged that the team has not yet tested SEAL’s ability to transfer across entirely new domains or model architectures.

“SEAL is an initial work showcasing the possibilities,” he said. “But it requires much more testing.” He added that generalization may improve as SEAL is trained on a broader distribution of tasks.

Interestingly, the team found that only a few reinforcement learning steps already led to measurable performance gains. “This is exciting,” Jyo noted, “because it means that with more compute, we could hopefully get even more improvements.” He suggested future experiments could explore more advanced reinforcement learning methods beyond ReSTEM, such as Group Relative Policy Optimization (GRPO).

Toward More Adaptive and Agentic Models

SEAL represents a step toward models that can autonomously improve over time, both by integrating new knowledge and by reconfiguring how they learn. The authors envision future extensions where SEAL could assist in self-pretraining, continual learning, and the development of agentic systems — models that interact with evolving environments and adapt incrementally.

In such settings, a model could use SEAL to synthesize weight updates after each interaction, gradually internalizing behaviors or insights. This could reduce the need for repeated supervision and manual intervention, particularly in data-constrained or specialized domains.

As public web text becomes saturated and further scaling of LLMs becomes bottlenecked by data availability, self-directed approaches like SEAL could play a critical role in pushing the boundaries of what LLMs can achieve.

You can access the SEAL project, including code and further documentation, at: https://jyopari.github.io/posts/seal
