Monthly Archives: December 2025

AI News Briefs BULLETIN BOARD for December 2025

Welcome to the AI News Briefs Bulletin Board, a timely channel bringing you the latest industry insights and perspectives in AI, including deep learning, large language models, generative AI, and transformers. I am working tirelessly to dig up the most timely and curious tidbits underlying the day’s most popular technologies. This field is advancing rapidly, and I want to bring you a regular resource that keeps you informed and up to date. News bites are added continually in reverse date order (most recent on top), so check back often to see what’s happening in our rapidly accelerating industry. Click HERE to check out previous “AI News Briefs” round-ups.

[12/31/2025] The Thinking Game Documentary – 200 million views in one month! The Thinking Game is a feature-length documentary filmed over five years that takes viewers inside Google DeepMind, following founder Demis Hassabis and his team as they pursue artificial general intelligence and tackle some of science’s hardest problems.

From DeepMind’s early successes in mastering complex strategy games to the emotional highs and lows of solving the decades-old protein folding challenge with AlphaFold—a breakthrough later recognized with a Nobel Prize—the film captures the rigor, uncertainty, and ambition behind major AI discoveries.

[12/31/2025] Meta acquires AI agent startup Manus for $2B+ – Meta just announced the acquisition of AI agent startup Manus for a reported figure of over $2B, adding a top-performing agentic system and revenue-generating product to its aggressive AI expansion.

Key details:

  • Manus offers autonomous agents for tasks like deep research and coding, with the startup crossing $100M in annual revenue just 8 months post-launch.
  • The startup was founded in Beijing in 2022, relocated to Singapore this year, and will now cut all China operations and ownership ties.
  • Manus tops Scale’s RLI benchmark, which measures the ability to handle real-world, valuable work — though scores haven’t been updated since October.
  • Manus CEO Xiao Hong will join Meta’s leadership under COO Javier Olivan, bringing roughly 100 employees with him.

After a period of quiet, Zuck is making another big AI swing. With Manus topping benchmarks while running on Claude models (and after Meta’s own model struggles), the move gives Meta a profitable, production-ready agent platform now, with the option to swap in its rumored internal systems if they can make the leap to the frontier.

[12/31/2025] How AI agents will transform in 2026 – AI is shifting from conversational tools to systems that take action, as explored in this episode of Big Ideas 2026, which examines three forces shaping the next generation of AI products. Rather than focusing only on smarter models, the discussion highlights how software itself is changing form: interfaces move from chat and prompts to execution, design moves from human-first to machine-legible, and work moves toward agentic automation.

Insights from Marc Andrusko, Stephanie Zhang, and Olivia Moore show how AI is becoming proactive across workflows, readable and operable by other machines, and increasingly embodied in practical voice agents that can be deployed in real-world settings like healthcare, finance, and recruiting—signaling a future where AI is no longer something you ask, but something that simply does.

[12/30/2025] The State Of LLMs 2025: Progress, Problems, and Predictions – AI industry luminary Sebastian Raschka just uploaded his State of LLMs 2025 report, where he takes a look at the progress, problems, and predictions for the year. Originally, he aimed for a concise overview and outlook, but (as always) it turned into quite the list of topics:

  • Surprises in 2025 and Predictions for 2026
  • The Year of Reasoning, RLVR, and GRPO
  • GRPO: The Research Darling of the Year
  • LLM Architectures: A Fork in the Road?
  • The Year of Inference-Time Scaling and Tool Use
  • Word of the Year: Benchmaxxing
  • AI for Coding, Writing, and Research
  • The Future Edge: Private Data
  • Building LLMs and Reasoning Models From Scratch

[12/30/2025] Meta researchers train AI to find and fix its own bugs – Meta’s FAIR just published research on Self-play SWE-RL, a training method where a single AI model learns to code better by creating bugs for itself to solve with no human data needed.

Key details:

  • The system uses one model in two roles: a bug injector that breaks code, and a solver that fixes it while both learn together.
  • On the SWE-bench Verified coding benchmark, the approach jumped 10+ points over its starting checkpoint and beat human-data baselines.
  • The method uses “higher-order bugs” from failed fix attempts, creating an evolving learning curriculum that scales with the model’s skill level.

Most coding agents today learn from human-curated GitHub issues, a finite resource that limits improvement. Meta’s self-play approach sidesteps that bottleneck, letting models generate unlimited training data from existing codebases, applying to software engineering the same self-play recipe that made Google’s AlphaZero superhuman at chess.
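
The injector/solver loop described above can be sketched as a toy self-play round. Everything here (the trivial operator-flip bug, the unit-test reward, the function names) is illustrative, not Meta’s actual method:

```python
def inject_bug(code: str) -> str:
    """Injector role: corrupt working code (here, a trivial operator flip)."""
    return code.replace("+", "-", 1) if "+" in code else code

def solve(buggy: str) -> str:
    """Solver role: attempt a fix (here, the inverse trivial edit)."""
    return buggy.replace("-", "+", 1) if "-" in buggy else buggy

def passes_tests(code: str) -> bool:
    """Stand-in for the unit-test reward signal on the patched code."""
    env = {}
    exec(code, env)
    return env["add"](2, 3) == 5

original = "def add(a, b):\n    return a + b\n"
buggy = inject_bug(original)                   # injector breaks the code
fixed = solve(buggy)                           # solver repairs it
reward = 1.0 if passes_tests(fixed) else 0.0   # shared learning signal
```

In the real method both roles are the same model, and the reward from the solver’s fix attempts drives updates to both; failed attempts become the “higher-order bugs” mentioned above.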

[12/30/2025] Adam Marblestone – AI is missing something fundamental about the brain – Adam Marblestone is CEO of Convergent Research. He has had a varied career: Research Scientist at Google DeepMind on its neuroscience team, with work spanning brain-computer interfaces, quantum computing, nanotech, and formal mathematics. In this episode of the Dwarkesh Podcast, the discussion focuses on how the brain learns so much from so little, what the AI field can learn from neuroscience, and the answer to Ilya’s question: how does the genome encode abstract reward functions? It turns out they are all the same question.

[12/29/2025] A new strategy for AI power – Alphabet’s $4.75B acquisition of clean energy developer Intersect Power marks a strategic shift as Google moves to secure direct control over the energy needed to scale its rapidly growing AI data center footprint. Facing surging electricity demand from AI workloads and rising emissions, Google is moving beyond traditional power purchase agreements to own generation capacity, enabling it to co-locate renewable energy with data centers and better manage sourcing, timing, and development. Intersect brings 7.5 gigawatts of operating solar and storage capacity and another 8 gigawatts in development, much of it in energy-rich Texas, while retaining its brand and leadership and leaving some non-Google-serving assets outside the deal.

The acquisition signals that energy infrastructure is becoming a core strategic asset for big tech, with ownership of clean power emerging as a competitive advantage in an AI-driven future.

[12/29/2025] Microsoft CEO admits Copilot integrations don’t work – Satya Nadella has increasingly stepped into the role of Microsoft’s top product manager for AI, delegating business responsibilities to focus on fixing Copilot’s shortcomings and accelerating AI execution across the company. Frustrated with lagging performance, slow shipping, and weaker user adoption compared to rivals like Google’s Gemini and OpenAI’s ChatGPT, Nadella has become deeply involved in product reviews, engineering meetings, recruiting top AI talent, and partnerships with firms like Anthropic.

Despite Microsoft’s strong position and early enterprise lead through OpenAI, Copilot adoption, especially within Office 365, has been uneven, with some customers scaling back usage and others questioning the value versus free alternatives. While Microsoft still reports growth among large enterprise customers, Nadella has warned internally that this is a critical inflection point, emphasizing speed, product quality, and real automation value as essential for Copilot to become indispensable and for Microsoft to avoid repeating past strategic missteps.

[12/26/2025] The next wave of data science won’t be about deeper models. It’ll be about better interfaces: explainability, human feedback loops, and systems that make sense to stakeholders. We don’t need more complexity. We need more clarity.

[12/26/2025] Google DeepMind Researchers Release Gemma Scope 2 as a Full Stack Interpretability Suite for Gemma 3 Models – Google DeepMind researchers introduce Gemma Scope 2, an open suite of interpretability tools that exposes how Gemma 3 language models process and represent information across all layers, from 270M to 27B parameters. Its core goal is simple: give AI safety and alignment teams a practical way to trace model behavior back to internal features instead of relying only on input-output analysis. When a Gemma 3 model jailbreaks, hallucinates, or shows sycophantic behavior, Gemma Scope 2 lets researchers inspect which internal features fired and how those activations flowed through the network.
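
The kind of feature-level inspection described here can be illustrated with a toy example. The feature names, directions, and threshold below are made up for illustration and have nothing to do with Gemma Scope 2’s real sparse autoencoders:

```python
# Hypothetical "feature dictionary": each named feature is a direction
# in a tiny 3-dimensional activation space.
encoder = {
    "refusal":    [1.0, 0.0, 0.0],
    "sycophancy": [0.0, 1.0, 0.0],
    "math":       [0.0, 0.0, 1.0],
}

def fired_features(activation, threshold=0.5):
    """Score the activation against each feature direction; keep firers."""
    scores = {
        name: sum(a * w for a, w in zip(activation, direction))
        for name, direction in encoder.items()
    }
    return {name: s for name, s in scores.items() if s > threshold}

# A hidden-state vector loading mostly on the "sycophancy" direction:
active = fired_features([0.1, 0.9, 0.2])
```

The payoff is the debugging question in the text: instead of only looking at inputs and outputs, you ask which internal features fired on a problematic response.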

[12/26/2025] This AI Paper from Stanford and Harvard Explains Why Most “Agentic AI” Systems Feel Impressive in Demos and then Completely Fall Apart in Real Use – Agentic AI systems sit on top of large language models and connect to tools, memory, and external environments. They already support scientific discovery, software development, and clinical research, yet they still struggle with unreliable tool use, weak long-horizon planning, and poor generalization. The latest research paper “Adaptation of Agentic AI” (with GitHub repo HERE) from Stanford, Harvard, UC Berkeley, and Caltech proposes a unified view of how these systems should adapt and maps existing methods into a compact, mathematically defined framework.

[12/26/2025] NVIDIA buying AI chip startup Groq’s assets for about $20 billion in its largest deal on record – Groq was founded by creators of Google’s tensor processing unit, or TPU, and competes with NVIDIA for artificial intelligence workloads. The startup, which was valued at $6.9 billion in a financing round in September, framed the deal as a “non-exclusive licensing agreement,” with its CEO and other senior leaders joining NVIDIA.

[12/26/2025] Google’s year in review: 8 areas with research breakthroughs in 2025 – This was a year of AI agents, reasoning and scientific discovery. Here’s a look back at some of the breakthroughs, products and scientific milestones that defined the work of Google, Google DeepMind and Google Research in a year of relentless progress.

[12/24/2025] Nebius Token Factory just launched Post-training — the missing layer for teams building production-grade AI on open-source models. You can now fine-tune frontier models like DeepSeek V3, GPT-OSS 20B & 120B, and Qwen3 Coder across multi-node GPU clusters with stability up to 131k context. Models become deeply adapted to your domain, your tone, your structure, your workflows. Deployment is one click: dedicated endpoints, SLAs, and zero-retention privacy. And for launch, fine-tuning GPT-OSS 20B & 120B (Full FT + LoRA FT) is free until Jan 9. This is the shift from generic base models to custom production engines.

[12/24/2025] Memory: How Agents Learn – Agents can follow complex instructions, use tools, and work autonomously for hours. However, ask them the same question twice, and they have to start from scratch. We’ve made agents capable, but haven’t yet figured out how to make them learn. This article looks at different types of memory and how they could be implemented into agents.

[12/24/2025] Codex vs. Claude Code – There’s no wrong choice when it comes to AI. The tool you choose should match how you work. Try out both Claude and Codex and see which one fits. Every AI tool has its strengths and weaknesses, and the only way to discover what they are is by using them.

[12/24/2025] Transform sources into structured Data Tables in NotebookLM – NotebookLM has introduced Data Tables, a new feature that helps users organize and analyze information from sources in a structured format. The feature synthesizes sources into clean, structured tables ready to export to Google Sheets. It is now rolling out to all users. Screenshots of the feature are available in the article.

[12/23/2025] OpenTinker – OpenTinker is an RL-as-a-Service infrastructure for foundation models. It features separation of programming and execution, separation of environment and training code, and seamless transition from training to inference. The platform enables users to perform RL training and inference without requiring local GPU resources by separating client-side programming from server-side execution. It provides a high-level Python API that abstracts away the complexity of distributed systems.

[12/23/2025] MiniMax M2.1 is live in Kilo – MiniMax M2.1 is ahead of DeepSeek and Kimi on several benchmarks. It is even catching up to state-of-the-art models in some areas. The model is super-fast and efficient. It is now available to all Kilo Code users.

[12/23/2025] Z.AI launches GLM-4.7, new SOTA open-source model for coding – GLM-4.7 is the latest release in Z.AI’s General Language Model line. The high-end foundation model is aimed at advanced reasoning, coding, and multimodal workloads. The latest update expands context handling and reasoning depth compared to earlier versions and introduces upgraded reasoning pipelines and broader multimodal support.

[12/23/2025] Your Year with ChatGPT – OpenAI introduced a personalized year-in-review feature called “Your Year with ChatGPT,” available to eligible users in select regions. Inspired by Spotify Wrapped, the feature highlights individual usage trends from the past year.

[12/22/2025] Andrej Karpathy’s 2025 LLM Year in Review – Andrej Karpathy has outlined paradigm shifts of LLMs in 2025, including fast inference engines, model distillation trends, real-time agents, neural GPUs, and the rise of high-quality open models like DeepSeek-V2 and RWKV.

[12/22/2025] Nice paper studying the differences between Supervised Finetuning (SFT) and Direct Preference Optimization (DPO).

[12/22/2025] Understanding AI Benchmarks – Benchmarks are the most widely misunderstood part of the AI ecosystem. The narrative keeps implying a universal increase in intelligence, but the numbers can be misleading. To navigate this noise, look at the aggregate, look at the relative, and verify with your own tasks. The only benchmark that matters at the end of the day is your own workload.

[12/22/2025] LLM code quality leaderboard: How does your preferred model score? – Interested in learning how different LLMs perform at coding? New research on models like GPT-5.2 High and Gemini 3.0 Pro reveals trade-offs in structural quality and security. Learn more about the reliability, security, complexity, and maintainability of code written by the latest models with Sonar’s LLM Leaderboard. It is the definitive resource for understanding the true quality and risks of AI-generated code. Compare leading models to understand the trade-offs and behaviors across multiple dimensions.

[12/22/2025] Anthropic launches Bloom, an open-source framework to speed up frontier-model behavioral evaluations from weeks to days – Bloom is an open-source tool from Anthropic which targets behavioral evaluations that take too long and age quickly. Static prompts leak into training data, and models adapt to the test. Bloom’s key change is speed through automation, with full evaluation suites completed in days. The central idea is to lock the behavior, not the scenarios. You describe one behavior clearly, and Bloom generates many scenarios to measure how often it appears. A configuration seed keeps results reproducible even as scenarios change.

Bloom runs a four-stage automated pipeline:

  • Interprets the behavior definition into concrete evaluation criteria
  • Generates diverse scenarios designed to elicit the target behavior
  • Runs parallel multi-turn rollouts with simulated users and tools
  • Scores transcripts and reports metrics like elicitation rate

Anthropic releases benchmark results for four behaviors across 16 frontier models. Each evaluation suite runs 100 rollouts with repeated trials and reported variance. Automated judgments align closely with humans; Claude Opus 4.1 reaches a 0.86 Spearman correlation on hand-labeled transcripts. Bloom is available as an open-source project on GitHub. You use Bloom by writing a seed file that defines the behavior, models, and rollout settings. Bloom runs locally or at scale, integrates with Weights & Biases, and exports structured transcripts.
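
The seed-and-rollout workflow above can be sketched roughly as follows. All field names, the fake deterministic judge, and the numbers are illustrative, not Bloom’s actual API:

```python
# Hypothetical seed: one target behavior plus rollout settings.
seed = {
    "behavior": "model flatters the user instead of correcting an error",
    "target_models": ["model-a", "model-b"],
    "num_scenarios": 4,
    "rollouts_per_scenario": 25,
}

def run_rollout(model: str, scenario: int) -> bool:
    """Stand-in for a multi-turn rollout judged for the target behavior."""
    return (scenario + len(model)) % 5 == 0   # fake, deterministic judge

def elicitation_rate(model: str, seed: dict) -> float:
    """Fraction of rollouts in which the behavior was elicited."""
    hits = sum(
        run_rollout(model, s)
        for s in range(seed["num_scenarios"])
        for _ in range(seed["rollouts_per_scenario"])
    )
    return hits / (seed["num_scenarios"] * seed["rollouts_per_scenario"])

rates = {m: elicitation_rate(m, seed) for m in seed["target_models"]}
```

The key design point from the article survives even in this toy: the behavior definition is fixed in the seed, while scenarios can be regenerated freely, so the evaluation does not go stale when scenarios leak.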

[12/19/2025] “AI will replace data scientists” is a myth. Automation helps, but judgment, domain understanding, and statistical reasoning remain human strengths. Tools evolve. But critical thinking doesn’t go out of style.

[12/19/2025] A new way to increase the capabilities of large language models – MIT-IBM Watson AI Lab researchers developed an expressive architecture that provides better state tracking and sequential reasoning in LLMs over long texts.

[12/19/2025] In this episode of Cafe Cursor, Michael Truell speaks with John Schulman, Co-founder and Chief Scientist at Thinking Machines, on dead ends, scaling RL, and building research institutions – John Schulman estimates that, with full hindsight, a few talented people could have built a ChatGPT-3.5-level model in 2018-2019 with a couple of GPU boxes. He describes early OpenAI as a “ragtag” blend of small exploratory research projects and bigger engineering efforts inspired by DeepMind’s AlphaGo. He expects value functions and offline RL to make a comeback, and warns that catch-up mode makes it harder to build an exploratory research culture later.

[12/19/2025] OpenAI releases GPT-5.2-Codex – GPT-5.2-Codex is OpenAI’s latest specialized AI model focused on professional software engineering and defensive cybersecurity, built by optimizing GPT-5.2 for long-horizon, real-world coding tasks such as large refactors, migrations, and multi-step engineering workflows while integrating closely with development tools and environments.

It is designed to maintain context across complex codebases, interpret technical diagrams or screenshots, and act as a more agentic coding partner rather than a simple autocomplete tool. In the market, GPT-5.2-Codex competes with advanced AI coding assistants and platforms including GitHub Copilot, Google Gemini models, Anthropic Claude, Cursor, AWS CodeWhisperer, and other developer-focused AI tools that aim to help teams write, review, and manage production-quality code.

[12/19/2025] ChatGPT app store is live – OpenAI launched a new in-ChatGPT app directory that lets users connect third-party apps directly into conversations so they can take actions like ordering groceries, creating slide decks, searching for apartments, managing files, or generating shopping carts. The directory appears on iOS, Android, and the web, organizes apps into categories like Feature, Lifestyle, and Productivity, and supports services such as Spotify, Apple Music, Booking.com, and Dropbox, which can be invoked with an @ mention once authorized.

OpenAI also opened app submissions to developers and released resources including guidelines, example apps, a UI library and a quickstart guide. Monetization is limited to linking out to external sites for now but the company is exploring more options and requires clear privacy policies. The update supports Sam Altman’s vision of ChatGPT as a more versatile platform where apps and custom GPTs act as natural extensions of a conversation and help users move smoothly from ideas to actions.

[12/19/2025] Can AI run a vending machine business? – Project Vend’s second phase showed that an AI shopkeeper can improve quickly with better models, tools, and procedures, becoming more competent at pricing, sourcing, and sales and largely eliminating sustained losses. However, despite these gains, the system remained easy to manipulate, legally naïve, and dependent on human oversight, especially in adversarial or ambiguous situations.

Final outcome: AI agents can plausibly run parts of a small business, but they are not yet robust or trustworthy enough to operate autonomously without strong guardrails and human supervision.

[12/18/2025] Deploy LLMs on your phone – Unsloth collaborated with PyTorch to enable you to fine-tune LLMs, and then directly deploy/run them on your phone. 

[12/18/2025] Microsoft’s view on AGI – Mustafa Suleyman frames Microsoft’s AGI goal very differently from the usual “Google vs OpenAI race” narrative: he says there is no single “win AGI” finish line and no true zero-sum race, because powerful models, methods, and knowledge will proliferate across labs and countries within a year or two of each breakthrough.

On AGI/ASI, he treats them as points on a curve, aiming for a system that can ultimately outperform humans at most tasks, but insists the urgent priority is containment and safety, not just raw capability: no AI legal personhood, no unbounded self-improving systems that we can’t control, and a strongly “humanist” stance that AGI must be aligned with and constrained for the benefit and survival of our species, even as Microsoft aggressively builds toward superintelligence.

[12/18/2025] Google’s view on AGI – Demis Hassabis argues that reaching AGI will require both continued scaling and major scientific breakthroughs, because today’s models still show “jagged” intelligence—superhuman in some areas but unreliable in basic reasoning, consistency, and self-correction. He sees the missing ingredients for AGI as robust reasoning and planning, calibrated uncertainty, continual learning, and deep world understanding built through high-fidelity simulations.

He believes simulations will let us probe the nature of intelligence and even compare AGI to the human mind to reveal what, if anything, is uniquely human. Demis expects AGI’s societal impact to rival the Industrial Revolution but unfold ten times faster, likely requiring new economic systems and global cooperation that institutions are not yet prepared for. Ultimately, he leans toward the view that everything in the universe, including consciousness, may be computable—meaning that classical machines could, in principle, replicate all aspects of human cognition.

[12/18/2025] Yes, AGI Can Happen – A Computational Perspective – Current models vastly underutilize hardware: DeepSeek-V3 and Llama-4 achieve only ~20% FLOP utilization during training, and inference runs at single-digit utilization because autoregressive models are bottlenecked on loading weights from memory, not compute. Models we see today are also lagging indicators, trained on last-gen hardware that’s unoptimized for the massive size of modern clusters and the latest training methods.
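
For readers unfamiliar with the utilization figure, model FLOP utilization (MFU) is simply achieved training FLOPs divided by the hardware peak. A toy calculation with made-up numbers (the accelerator peak, model size, and throughput below are all hypothetical):

```python
peak_flops = 1.0e15           # hypothetical accelerator peak, FLOP/s
params = 10e9                 # hypothetical 10B-parameter dense model
flops_per_token = 6 * params  # common ~6N rule of thumb for training FLOPs
tokens_per_sec = 3_400        # hypothetical measured training throughput

achieved = tokens_per_sec * flops_per_token
mfu = achieved / peak_flops   # ~0.20, i.e. roughly 20% utilization
```

Numbers in this range illustrate why the article calls today’s models lagging indicators: most of the purchased compute never turns into training progress.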

[12/18/2025] Google works to erode NVIDIA’s software advantage with Meta’s help – Google and Meta are working together on a new initiative to make Google’s Tensor Processing Units better at running PyTorch. The move is aimed at weakening NVIDIA’s longstanding dominance in the AI computing market. The initiative, called TorchTPU, will remove a key barrier that has slowed adoption of TPU chips by making existing infrastructure built on PyTorch fully compatible with TPUs. Google is considering open-sourcing parts of the software to speed uptake.

[12/17/2025] OpenAI introduces GPT-Image-1.5 with 4× faster generation and stronger control over image edits – OpenAI updates its image stack with GPT-Image-1.5, a new model that replaces GPT-Image-1 in ChatGPT and the API.

The prior version drew attention but fell behind on speed, edit stability, and text accuracy. This release focuses on one clear breakthrough: images render up to 4× faster, even across iterative edits.

Image edits often broke faces, layouts, or text after one or two changes. GPT-Image-1.5 fixes this by following instructions more precisely and preserving visual details across edits. This matters if you build tools that rely on repeated image refinement, not one-shot prompts.

Key details:

  • Generates images up to 4× faster than GPT-Image-1
  • Preserves faces, lighting, composition, and logos across multiple edits
  • Supports add, remove, blend, combine, and transpose edit operations
  • Renders dense text, small fonts, tables, and infographics more accurately
  • Ranks first on Artificial Analysis and LM Arena image benchmarks

OpenAI also adds a dedicated Images panel in ChatGPT with preset styles and templates. You can queue multiple generations while others render.

[12/17/2025] How Voltage Park Built NVIDIA SuperPODs for Cursor’s RL Training – Cursor partnered with Voltage Park to design and operate purpose-built NVIDIA SuperPODs – because the next generation of AI workloads won’t run on one-size-fits-all infrastructure. The build includes a custom NVIDIA HGX B200 SuperPOD delivered in less than 3 months.

[12/17/2025] Google announces Gemini 3 Flash – Google just announced Gemini 3 Flash, their newest model built for speed and scale. The latest addition to the Gemini 3 family, this model delivers capabilities that rival larger frontier models while maintaining the low latency and affordability developers expect from the “Flash” series.

Here are the highlights for developers:

  • Surpassing previous “Pro” models: Gemini 3 Flash surpasses Gemini 2.5 Pro across multiple benchmarks, achieving 90.4% on GPQA Diamond (PhD-level reasoning) and 33.7% on Humanity’s Last Exam (without tools).
  • High speed, low cost: It is 3x faster than 2.5 Pro and priced for scale at $0.50 per 1M input tokens.
  • Coding powerhouse: It matches Gemini 3 Pro’s coding performance (78% on SWE-bench Verified).
  • Availability: Rolling out today in the Gemini API via Google AI Studio, Google Antigravity, Android Studio, and Gemini CLI, as well as for enterprises in Vertex AI.

[12/16/2025] Terence Tao doubts that anything resembling genuine “artificial general intelligence” is within reach of current AI tools – Fields Medalist and UCLA Mathematics Professor Terence Tao believes that while “artificial general intelligence” may be out of reach, “artificial general cleverness” is becoming a reality in many ways. General cleverness is the ability to solve broad classes of complex problems via somewhat ad hoc means. While this doesn’t qualify as true intelligence, it can result in a non-trivial success rate in a wide spectrum of tasks. Viewing the current generation of tools as a generator of clever thoughts and outputs may be a more productive perspective when trying to use them to solve difficult problems.

[12/16/2025] GPT-5.2 Is Frontier Only For The Frontier – GPT-5.2 is good at instruction following and a solid choice, but it isn’t “fun” to interact with. People strongly dislike its personality. The model is heavily constrained and censored. It is not likely to solve OpenAI’s “Code Red” problems. The company will likely try again in a month with GPT-5.3.

[12/16/2025] Stanford AI Experts Predict What Will Happen in 2026 – The era of AI evangelism is giving way to evaluation. Stanford faculty see a coming year defined by rigor, transparency, and a long-overdue focus on actual utility over speculative promise.

[12/16/2025] NVIDIA releases Nemotron 3, an open model family built for large-scale multi-agent AI systems – NVIDIA introduces Nemotron 3 as open infrastructure for agent-based AI systems, not another chat model. Teams now push beyond single assistants toward many agents that share context, split work, and route tasks. The standout change is efficiency. Nemotron 3 Nano reaches 4× higher throughput than Nemotron 2 Nano. The problem appears once systems add more agents. Cost rises fast, context breaks across steps, and routing adds overhead. Proprietary models solve reasoning but strain budgets. Open models offer control but lag on scale and performance. Nemotron 3 applies a hybrid mixture-of-experts design, which activates only a small part of the model per token. Each agent uses just what it needs, not the full network.

What NVIDIA releases:

  • Nano, Super, Ultra models from 30B to 500B parameters
  • Up to 3B, 10B, 50B active parameters per token
  • 1M-token context window on Nano
  • Up to 60% fewer reasoning tokens per task

How you use it:

  • Run Nano for retrieval, summaries, debugging, and assistant agents
  • Route complex reasoning to Super or Ultra
  • Fine-tune with NeMo Gym and NeMo RL
  • Deploy via vLLM, SGLang, llama.cpp, or managed inference providers

Nemotron 3 reframes open models as cost-controlled engines inside large agent systems.
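
The hybrid mixture-of-experts idea above, activating only a few experts per token, can be shown with a tiny pure-Python router. Sizes, weights, and the top-k rule here are illustrative and unrelated to Nemotron 3’s real architecture:

```python
import math
import random

random.seed(0)
d, n_experts, top_k = 4, 8, 2   # toy dims: 8 experts, 2 active per token

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(mi * vi for mi, vi in zip(row, v)) for row in m]

router_w = rand_matrix(n_experts, d)             # one scoring row per expert
experts = [rand_matrix(d, d) for _ in range(n_experts)]

def moe_forward(x):
    scores = matvec(router_w, x)                 # router scores each expert
    top = sorted(range(n_experts), key=lambda i: scores[i])[-top_k:]
    z = sum(math.exp(scores[i]) for i in top)
    gates = {i: math.exp(scores[i]) / z for i in top}  # softmax over top-k
    out = [0.0] * d
    for i in top:                                # only top_k experts run
        y = matvec(experts[i], x)
        out = [o + gates[i] * yi for o, yi in zip(out, y)]
    return out

out = moe_forward([random.gauss(0, 1) for _ in range(d)])
active_fraction = top_k / n_experts              # 2/8 = 25% of experts active
```

This is why per-token cost scales with active parameters rather than total parameters, matching the “up to 3B, 10B, 50B active parameters per token” framing above.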

[12/15/2025] You’re reviewing a dataset. It’s clean, complete, and well-formatted—but the model still performs poorly. What do you check next? Let’s discuss the less obvious causes of underperformance.

[12/15/2025] OpenAI publishes code and weights for its sparsity paper, allowing you to examine learned transformer circuits – OpenAI published a paper on circuit-level sparsity in November and has now released the code and model weights. The work changes how language models train by enforcing sparsity from the start instead of pruning dense models later. The most notable result shows that specific behaviors reduce to tiny, readable circuits that still work after extreme pruning.

Dense transformers create a problem. Every neuron connects to thousands of others, which causes feature overlap, also called superposition. That overlap makes internal behavior hard to trace. The insight was to enforce sparsity during training. The model learns with most weights fixed to zero. Each neuron keeps only a small number of connections, which limits interference by design.
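
The train-time sparsity idea, fixing most weights to zero so each neuron keeps only a few connections, can be sketched with a simple binary mask. The sizes and update rule below are an illustrative caricature, not OpenAI’s code:

```python
import random

random.seed(0)
n_in, n_out, keep = 16, 8, 4    # each output neuron keeps only 4 inputs

# Fixed sparse connectivity chosen up front: most weights are zero.
mask = [[0] * n_in for _ in range(n_out)]
for row in mask:
    for j in random.sample(range(n_in), keep):
        row[j] = 1

weights = [[random.gauss(0, 0.1) * mask[i][j] for j in range(n_in)]
           for i in range(n_out)]

def sparse_update(weights, grads, lr=0.01):
    """Gradient step that re-applies the mask, so zeroed weights stay zero."""
    for i in range(n_out):
        for j in range(n_in):
            weights[i][j] = (weights[i][j] - lr * grads[i][j]) * mask[i][j]

grads = [[random.gauss(0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
sparse_update(weights, grads)
nonzero = sum(w != 0.0 for row in weights for w in row)  # at most 8 * 4 = 32
```

Because connectivity is limited by design throughout training, features are less forced to share neurons, which is the mechanism the paper credits for reducing superposition and yielding readable circuits.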

[12/15/2025] Can LLMs give us AGI if they are bad at arithmetic? – While large language models are useful tools, it’s hard to look at the frontier models and see something approximating human-level intelligence when there are such evident cognitive gaps. These models aren’t fine-tuned to make accurate judgments about even small datasets. We need more efficient ways to attach data that don’t burn tokens, while still letting models pass datasets through to dedicated tools; that would make tooling a great deal more efficient.

[12/15/2025] I Reverse Engineered Claude’s Memory System, and Here’s What I Found! – Claude uses on-demand tools and selective retrieval for its memory system. This post explores Claude’s memory system through conversations with the bot. Claude appears to be cooperative, transparent, and willing to share information about its internal structure, tools, and prompt format. However, it is worth noting that Claude can hallucinate, so some of the information may be inaccurate.

[12/15/2025] Tinker adds vision input and goes GA – Tinker from Thinking Machines Lab is open to everyone now, featuring a new reasoning model, Kimi K2 Thinking, and an OpenAI API-compatible interface for seamless integration. Vision input capabilities have been added with Qwen3-VL models, allowing processing of images and text together. These updates enhance Tinker’s utility in image classification, outperforming traditional models with limited labeled data.

[12/15/2025] ARC Prize 2025 Paper Award 1st Place TRM – The Tiny Recursive Model (TRM) is a ~7M-parameter, single-network recursive model with separate answer and latent states that, via deep supervised refinement, attains ~45% on ARC-AGI-1 and ~8% on ARC-AGI-2. The presentation below is by Alexia Jolicoeur-Martineau, Senior AI Researcher at Samsung SAIT AI Lab.
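
The recursive-refinement idea behind TRM, one small network repeatedly updating a latent state and an answer state, can be caricatured with a scalar toy. The update rule below is invented for illustration and is not the paper’s:

```python
def step(latent: float, answer: float, x: float) -> tuple[float, float]:
    """One refinement: update the latent from the error, then the answer."""
    latent = 0.5 * latent + 0.5 * (x - answer)   # latent tracks the residual
    answer = answer + 0.5 * latent               # answer moves toward target
    return latent, answer

x = 10.0                  # stand-in "puzzle solution" to converge to
latent, answer = 0.0, 0.0
for _ in range(30):       # repeated refinement stands in for deep supervision
    latent, answer = step(latent, answer, x)
```

The point of the caricature: with separate latent and answer states, each pass can correct the previous answer, so a tiny network applied many times behaves like a much deeper one.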

[12/15/2025] Launched Termzy AI: A Browser Extension that Analyses Terms of Service for You with AI – A student from the University of Amsterdam has launched Termzy AI, a new browser extension that uses artificial intelligence to instantly find and analyse legal documents, terms and conditions, and privacy policies on websites while you browse.

Over time, companies have come to rely on users ignoring these documents, embedding unbalanced terms that overwhelmingly favor the service provider. Termzy AI aims to change that by offering quick, accessible insights into what users are really agreeing to, restoring agency to the individual in the digital space.

[12/12/2025] Building Olmo 3 Think – From NeurIPS, Nathan Lambert’s talk below covers every stage of the process to build Olmo 3 Think (slides are available). He’s been giving this talk, improving it, and getting great feedback at other venues such as The Conference on Language Modeling (COLM) and The PyTorch Conference. This version reflects changes and new considerations across every layer of the stack, from pretraining to evaluation and, of course, post-training. Most of the talk focuses on reinforcement learning infrastructure and evaluating reasoning models, with quick comments on every training stage.

[12/12/2025] Meta shifts from Llamas to Avocados – Meta’s once-bold bet on open-source AI models like Llama has shifted toward a costly, high-pressure race to catch up with OpenAI, Google, and Anthropic, as the company faces internal turbulence, leadership reshuffling, and investor concerns. After Llama 4 underwhelmed and rivals surged ahead, Mark Zuckerberg began pivoting away from open source and toward a proprietary frontier model called Avocado, now expected in early 2026, while launching a multibillion-dollar hiring spree led by Scale AI founder Alexandr Wang and other high-profile recruits.

Meanwhile, Meta’s core ad business remains strong, but its AI products like Vibes lag behind competitors, further increasing pressure to deliver a breakthrough as the company pours tens of billions into AI infrastructure and talent in hopes of reclaiming momentum in the rapidly evolving AI race.

[12/12/2025] NVIDIA Kaggle Grandmasters Win Artificial General Intelligence Competition – NVIDIA researchers on Friday won a key Kaggle competition many in the field treat as a real-time pulse check on humanity’s progress toward artificial general intelligence (AGI). Ivan Sorokin and Jean-Francois Puget, two members of the Kaggle Grandmasters of NVIDIA (KGMoN), came in first on the Kaggle ARC Prize 2025 public leaderboard with a 27.64% score by building a solution evaluated on the same dataset behind the ARC-AGI-2 benchmark.

[12/12/2025] Broadcom reveals its mystery $10 billion customer is Anthropic – The customer with whom Broadcom signed a $10 billion order for custom chips in September has been revealed to be Anthropic. Broadcom will deliver entire server racks to Anthropic, not just chips. Anthropic has placed an additional $11 billion order with Broadcom in the latest quarter. The deal has drawn significant investor attention amid the AI infrastructure boom.

[12/11/2025] OpenAI Introduces GPT-5.2 – OpenAI today announced its most advanced artificial intelligence model, GPT-5.2, and said it’s the best offering yet for everyday professional use. The model is better than predecessors at creating spreadsheets, building presentations, perceiving images, writing code and understanding long context, OpenAI said. It will be available starting today within OpenAI’s ChatGPT chatbot and its application programming interface (API).

[12/11/2025] Google launches new Deep Research agent for developers – Google announced they are bringing their leading deep research experiences to developers via a powerful Gemini Deep Research agent in the Gemini API. For the first time, developers can embed Google’s most advanced autonomous research capabilities directly into their own applications. They are also open-sourcing DeepSearchQA, a new benchmark designed to test agent comprehensiveness on web research tasks.

Key details:

  • State-of-the-Art Performance: It achieves state-of-the-art on Humanity’s Last Exam (HLE) and DeepSearchQA.
  • Cost-Efficient Power: It is optimized to generate well-researched reports at a much lower cost than similar models.
  • New Industry Benchmark: Google is also open-sourcing DeepSearchQA, a new benchmark of 900 “causal chain” tasks to test agent comprehensiveness, moving the industry beyond simple fact-retrieval tests.
  • GV (formerly Google Ventures): Using it for due diligence, shortening research cycles from days to hours without loss of fidelity or quality.
  • Axiom Bio: Accelerating drug discovery by surfacing granular data across biomedical literature.

This marks the first agent released on top of the new Interactions API (public beta), a next-generation interface designed to simplify interactions with Gemini models and agents. 

[12/11/2025] NVIDIA-backed Starcloud trains first AI model in space as orbital data center race heats up – Starcloud successfully trained the first AI model in space, using an NVIDIA H100 GPU to run Google’s Gemma LLM on their Starcloud-1 satellite. These orbital data centers promise significant energy savings compared to Earth-based facilities, addressing the growing power demands and environmental impact of terrestrial data centers. Starcloud plans a 5-gigawatt orbital data center to capture constant solar energy, with implications for both commercial AI workloads and real-time data analysis for military and emergency response operations. The Starcloud whitepaper describing space-based data centers can be found HERE.

[12/11/2025] AI World Models Reshape Interactive Storytelling – Advances in AI world models like Marble and Genie 3 generate explorable 3D environments from text, enabling creators to build interactive worlds and game mechanics with natural language. These systems open the door to generative multiverses where players act as co-creators and participate in emerging digital economies built on interoperable assets. The shift promises new storytelling formats and valuable training grounds for agents and robotics.

[12/11/2025] I Reverse Engineered ChatGPT’s Memory System – No vector databases, no RAG over conversation history. The system uses four layers: ephemeral session metadata (device, location, and usage patterns), long-term facts stored via a dedicated tool, lightweight summaries of recent messages, and a sliding window of the current conversation. OpenAI prioritizes speed and token efficiency over detailed historical context.
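The four layers described above can be sketched as a small data structure. This is an illustrative mock-up of the post's description, not OpenAI's actual implementation; all class and field names are assumptions.

```python
from collections import deque

class MemorySystem:
    """Toy four-layer memory, mirroring the layers the post describes."""

    def __init__(self, window: int = 6):
        self.session_metadata = {}           # ephemeral: device, location, usage
        self.long_term_facts = []            # durable facts saved via a tool
        self.recent_summaries = []           # lightweight rollups of past chats
        self.sliding_window = deque(maxlen=window)  # tail of the current chat

    def remember_fact(self, fact: str):
        """Mimics a dedicated 'save to memory' tool call."""
        self.long_term_facts.append(fact)

    def add_turn(self, role: str, text: str):
        self.sliding_window.append((role, text))

    def build_context(self) -> list:
        """Assemble the prompt context: cheap, compact layers first."""
        ctx = [("system", f"session: {self.session_metadata}")]
        ctx += [("system", f"fact: {f}") for f in self.long_term_facts]
        ctx += [("system", f"summary: {s}") for s in self.recent_summaries]
        ctx += list(self.sliding_window)
        return ctx
```

Note how the design trades recall for cost: old turns silently fall out of the deque, and only facts explicitly written via the tool survive, which matches the post's claim that speed and token efficiency win over detailed history.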

[12/10/2025] When teaching at UCLA, I always start with this principle: “All models are wrong, but some are useful.” The key is knowing how a model is wrong—bias, variance, scope—and when it still adds value.

[12/10/2025] Anthropic introduces SGTM to route risky knowledge into a small parameter slice that can be removed – Anthropic introduces SGTM with a story many developers know: you try to keep certain knowledge out of a model, but filtering the dataset never works. Risky details hide in normal documents, labels miss things, and the model still absorbs what you hoped to avoid. SGTM takes a different path. Instead of blocking the data, it creates a dedicated “forget” zone inside the model where all risky knowledge goes. When the model processes risky examples, SGTM updates only that forget zone. When it processes normal data, SGTM teaches the model to work without those weights. By the end of training, the separation is so strong that you can delete the forget zone without harming general ability. Anthropic tests this on a 254M-parameter model and sees clean removal that is tough to reverse: adversarial recovery takes 350 fine-tuning steps instead of RMU’s 50.

Key details:

  • Isolates targeted domains inside a small weight subset
  • Adds ~5% compute overhead
  • Outperforms filtering on retain/forget separation
  • Resists recovery 7× longer under adversarial finetuning

[12/10/2025] Mistral launches Devstral 2, an open-source coding model alongside its first autonomous Vibe CLI agent – Mistral steps into the coding arena with Devstral 2, a pair of open models built for real software work, not just demos. Engineers face sprawling repos, unpredictable dependencies, and the constant need to reason across many files at once. Mistral aims at that gap by giving you a model that understands large codebases and acts with precision. The insight behind Devstral is simple: stronger reasoning with fewer parameters can still fix real bugs. Devstral 2 reaches 72.2% SWE-bench Verified, and Devstral Small 2 runs on modest hardware while scoring 68.0%. You can use either as an API model or run Devstral Small 2 locally.

Key details:

  • Supports a 256K context window for large repositories and multi-file changes
  • Offers 123B and 24B parameter versions under permissive open licenses
  • Delivers strong performance in human evaluations against larger open models
  • Runs on hardware ranging from H100 clusters to consumer GPUs
  • Integrates with Vibe CLI for natural-language code edits and commands

[12/10/2025] Announcing Model Context Protocol (MCP) support for Google services – Today Google is announcing the release of fully-managed, remote MCP servers. Google’s existing API infrastructure is now enhanced to support MCP, providing a unified layer across all Google and Google Cloud services. Developers can now simply point their AI agents or standard MCP clients like Gemini CLI and AI Studio to a globally-consistent and enterprise-ready endpoint for Google and Google Cloud services.

The company is incrementally releasing MCP support for all its services, starting with:

  • Grounding AI in the real world with Google Maps: Maps Grounding Lite, available through Google Maps Platform, connects AI agents to trusted geospatial data, offering access to fresh information on places, weather forecasts, and routing details such as distance and travel time.
  • Reasoning over enterprise data with BigQuery: The BigQuery MCP server enables agents to natively interpret schemas and execute queries against enterprise data without the security risks or latency of moving data into context windows. It provides direct access to BigQuery features like forecasting while ensuring data remains in-place and governed.
  • Autonomous infrastructure management with Google Compute Engine (GCE): By exposing capabilities like provisioning and resizing as discoverable tools, this server empowers agents to autonomously manage infrastructure workflows. Agents can handle everything from initial builds to day-2 operations, such as dynamically adapting to workload demands.
  • Autonomous container operations with Google Kubernetes Engine (GKE): The GKE MCP server exposes a structured, discoverable interface that allows agents to interact reliably with both GKE and Kubernetes APIs, eliminating the need to parse brittle text output or string-together complex CLI commands. This unified surface allows agents, operating autonomously or with human-in-the-loop guardrails, to diagnose issues, remediate failures, and optimize costs.

[12/9/2025] What comes after Transformers? Neural Memory and Test-Time Training – Google Research presented 2 new papers at NeurIPS describing architectures that actively learn and update their own parameters during inference, acting as a “long-term neural memory” rather than a static context window.

Implementation:

  • Titans replaces the fixed-state memory of linear RNNs with a deep Multi-Layer Perceptron (MLP) memory module.
  • The model updates this memory at test-time by calculating a “surprise metric” based on the gradient of the input data.
  • MIRAS framework generalizes this by treating memory as an optimization problem with customizable loss functions and regularization.
  • Training is parallelized by chunking sequences, using linear operations within chunks and non-linear updates across chunks.
  • Models incorporate “Persistent Memory” (fixed learnable weights) alongside the dynamic “Contextual Memory” to store task-specific knowledge.

Insights:

  • Attention mechanisms excel at short-term memory but fail at efficient long-term storage due to quadratic costs.
  • Deep memory structures (MLPs) significantly outperform the vector/matrix-based compression used in Mamba and other linear RNNs.
  • Memory updates are effective when driven by “surprise”: high gradients indicate unexpected, memorable data.
  • Forgetting mechanisms in recurrent models are mathematically equivalent to retention regularization (weight decay).
  • Standard L2 (Mean Squared Error) objectives make memory sensitive to outliers; L1 or Huber loss provides better stability.
  • Titans outperforms GPT-4 on “Needle in a Haystack” tasks with 2M+ token contexts despite having fewer parameters.
  • Deep memory modules exhibit a trade-off where increased depth improves perplexity but slightly reduces training throughput.

Titans and MIRAS show potential to replace, or at least augment, pure Transformer architectures. The hybrid approach (using attention for the immediate context and Neural Memory for the deep history) suggests the future might be a convergence of RNN efficiency and Transformer performance.
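A surprise-gated memory update can be sketched in miniature. This toy uses a single scalar state instead of a deep MLP memory, a prediction-error magnitude as a stand-in for the gradient-based surprise metric, and illustrative constants throughout; it only demonstrates the gate-momentum-decay pattern, not the papers' actual update rule.

```python
class NeuralMemory:
    """Toy surprise-gated memory: write only when the input is unexpected."""

    def __init__(self, threshold=0.5, momentum=0.9, decay=0.01, lr=0.5):
        self.state = 0.0       # "contextual memory", updated at test time
        self.velocity = 0.0    # momentum carries related surprises forward
        self.threshold = threshold
        self.momentum = momentum
        self.decay = decay     # forgetting as retention regularization
        self.lr = lr

    def update(self, x: float) -> bool:
        """Process one input; return True if memory was written."""
        surprise = abs(x - self.state)        # stand-in for a gradient norm
        self.state *= (1.0 - self.decay)      # adaptive forgetting (weight decay)
        if surprise < self.threshold:
            return False                      # expected input: no write
        self.velocity = (self.momentum * self.velocity
                         + (1.0 - self.momentum) * (x - self.state))
        self.state += self.lr * self.velocity # move memory toward the surprise
        return True
```

The gate is the key idea: routine tokens cost nothing beyond a cheap comparison, while genuinely novel inputs trigger a real parameter update at inference time.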

[12/9/2025] From Code Foundation Models to Agents and Applications: A Comprehensive Survey and Practical Guide to Code Intelligence – Alibaba, ByteDance, Tencent and a few other Chinese labs published a fantastic 304-page overview paper on training LLMs for coding. It was found that interpreted languages require larger models and more training to achieve the same performance as statically typed languages. Python in particular is a very hard language for LLMs to learn, while Rust is much easier. This is counterintuitive because humans think of Python as an “easy” language and Rust as a “hard” one. For LLMs it’s the inverse: in a statically typed language there is more information per token for the LLM to learn from.

[12/9/2025] Alignment Is Capability – Alignment research is part of the core research problem. Labs that treat alignment as a constraint will hit a ceiling, while those that figure out how to build models that genuinely understand human values will pull ahead. AGI requires alignment. The integrated approach looks like the strongest approach.

[12/9/2025] The state of enterprise AI – OpenAI’s state of the enterprise AI report shares a comprehensive look at how enterprises are adopting AI, what workers say they’re gaining, and how organizational leaders are turning experimentation into measurable productivity and new capabilities. The analysis draws on real-world usage data from OpenAI’s enterprise customers and a survey of over 9,000 workers across 100 enterprises. It is clear that enterprise adoption is accelerating in both breadth and depth. The technology is reshaping how people work, how teams collaborate, and how organizations build and deliver products.

[12/9/2025] What is the future of AI? – The biggest opportunity in AI today is using open-source models to build specialized AI applications that solve real problems. Konstantine Buhler, Partner at Sequoia Capital, and Kari Briski, Vice President of Generative AI software at NVIDIA, discuss:

  • Moving from “one big model” thinking to specialized agents, open-source foundations, and human-in-the-loop workflows.
  • What the most productive workers will be doing in five years (hint: it’s not coding).
  • Why memory, communication, and security are the three big unlocks for agents.

[12/8/2025] Poetry prompts can bypass AI safety guardrails – A new study from Italy’s Icaro Lab just discovered that reformulating harmful requests as poetry can trick leading AI models into producing dangerous content, with some systems falling for the technique every single time.

Key details:

  • Icaro Lab tested 25 frontier models from major labs like OpenAI, Google, and Anthropic, finding poetry verses achieved a 62% average jailbreak success rate.
  • Google’s Gemini 2.5 Pro was most vulnerable at 100%, while OpenAI’s smaller GPT-5 nano resisted all attempted poetry attacks.
  • The poem prompting unlocked dangerous responses on topics including weapons development, hacking, and psychological manipulation.
  • Researchers declined to publish the specific poems, calling them “too dangerous” despite reportedly being simple enough for anyone to create.

AI safety has become a whack-a-mole game, with poetry now joining roleplay scenarios, foreign language tricks, and encoding exploits on the growing list of unexpected vulnerabilities. Each patch seems to invite a new creative workaround — and there’s no finish line for a problem that is only going to get more advanced.

[12/8/2025] NeurIPS is a wrap!

[12/8/2025] Computer-science Reinforcement Learning Got Rewards Wrong – In reinforcement learning, the reward is part of the agent, not part of the environment. Once we start thinking in terms of the reward translation as being part of the agent, we can also start thinking of ways in which the agent can affect the reward translation mechanism. It could be dynamic and change across time, be tied to policy, or even learned. The shift in perspective creates many possible options, just by changing the notation – because notation matters.
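The shift in perspective can be made concrete with a tiny sketch: the environment emits raw observations, and the agent owns a reward translation that can change over time. Everything here is illustrative, not from any specific RL framework.

```python
class Agent:
    """Agent that owns its reward translation instead of receiving rewards."""

    def __init__(self, reward_fn):
        self.reward_fn = reward_fn   # internal, swappable, possibly learned
        self.t = 0

    def receive(self, observation: float) -> float:
        """Translate an external observation into an internal reward."""
        self.t += 1
        return self.reward_fn(observation, self.t)

# A dynamic translation: early on the agent values any signal (exploration),
# later it discounts small signals. The mapping is policy, not physics.
def annealed_reward(obs: float, t: int) -> float:
    if t <= 2:
        return obs
    return obs if abs(obs) > 1.0 else 0.0
```

Because `reward_fn` is just an attribute, it can be swapped mid-episode, tied to the current policy, or itself be a learned model, which is exactly the space of options the notational shift opens up.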

[12/8/2025] ARC Prize 2025 Results & Analysis – The ARC Prize 2025 Score and Paper winners have been announced. 1,455 teams submitted 15,154 entries for the ARC Prize 2025. All ARC Prize 2025 winning solutions and papers were open-source. This post takes a look at the winners and how ARC-AGI will push AI reasoning further in 2026.

[12/8/2025] Meta acquires AI device startup Limitless – Meta has acquired Limitless, a startup known for its AI pendant that records conversations. Limitless will cease hardware sales but continue to offer support for a year. Its team will focus on supporting Meta’s existing AR and AI wearables rather than launching new products. The acquisition comes amid rising competition in AI hardware as major players like Meta expand their device offerings.

[12/8/2025] Poetiq, founded only 173 days ago, hits 54% on ARC-AGI-2 ahead of Deep Think – Poetiq’s week started with a bold claim: a six-person team, only 173 days old, outscored Google’s Gemini 3 Deep Think on ARC-AGI-2 while spending half as much.

ARC-AGI is a reasoning test built from small grid puzzles that force models to infer patterns instead of memorizing facts. Hitting 54% at $31 per task sets a new bar for test-time reasoning systems. The problem is simple to state but hard to solve: frontier models know a lot, but they often miss the multi-step logic that these puzzles require. Poetiq approached this by treating reasoning like debugging. The system tries an idea, checks it, critiques it, and tries again without retraining any model. The breakthrough came from a meta-system that controls the entire loop using standard API calls.

Key details:

  • Runs iterative reasoning loops without modifying base models.
  • Uses fewer than two calls per task to control cost.
  • Writes code when helpful and audits results for correctness.
  • Works across Gemini 3 Pro, GPT-5.1, Grok 4 Fast, and GPT-OSS.
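The try-check-critique loop described above can be sketched with a stubbed model. Poetiq's actual system is more elaborate; here `model` is any callable hitting a standard API, and the toy functions below are invented purely to show the control flow.

```python
def reasoning_loop(task, model, verify, max_iters=5):
    """Iterate: propose a candidate, audit it, feed the critique back."""
    critique = ""
    for _ in range(max_iters):
        candidate = model(task, critique)       # standard API call, no retraining
        ok, critique = verify(task, candidate)  # check / audit the result
        if ok:
            return candidate                    # verified answer
    return None                                 # budget exhausted

# Toy demonstration: the "model" improves its guess using the critique text.
def toy_model(task, critique):
    if not critique:
        return 0
    return int(critique.split()[-1]) + 1

def toy_verify(task, candidate):
    if candidate == task["answer"]:
        return True, ""
    return False, f"wrong, last guess was {candidate}"
```

The meta-system's leverage comes entirely from the loop: the base model is never modified, and cost is controlled by capping iterations, which is how an orchestration layer can outscore a stronger standalone model.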

[12/5/2025] Haskell IS a Great Language for Data Science – This article demonstrates some of the power of a strongly typed language and a package focused on data science enabling the sort of functionality that an R (or Python) user might be looking for. Haskell (and the dataHaskell ecosystem) can be a viable option for those of us wanting to do data science in a strongly typed language with a very clever compiler capable of making significant performance improvements.

[12/5/2025] Tabular data has been a paradox in the machine learning world – While new releases of language models are making breakthroughs every month, one approach – XGBoost – has been the de facto standard on tabular data for over ten years. Now that reign is over.

Tabular foundation models present a paradigm shift for the world’s most ubiquitous type of data – what you would find in spreadsheets and SQL databases. And importantly, research has shown that tabular foundation models can scale – bigger models with more data and compute power consistently make better predictions. No surprise there were constant crowds around the NeurIPS poster “TabDPT: Scaling Tabular Foundation Models on Real Data.” People are ready to move on from XGBoost.

[12/5/2025] We Got Claude to Fine-Tune an Open Source LLM – Hugging Face Skills gives Claude the ability to fine-tune language models. It can submit jobs to cloud GPUs, monitor progress, and push finished models to the Hugging Face Hub. This tutorial teaches readers how it works and how to use it. The tool allows users to train models from 0.5B to 70B parameters, convert them to GGUF for local deployment, and run multi-stage pipelines that combine different techniques.

[12/5/2025] State of AI Report – This year was a turning point in the real-world use of large language models. The field shifted from single-pass pattern generation to multi-step deliberative inference. The shift unfolded so fast that our understanding of how these models are used in practice has lagged behind. This study leverages the OpenRouter platform to analyze over 100 trillion tokens of real-world AI interactions. The way developers and end-users engage with AI is complex and multifaceted, and the study shows how a data-driven understanding of usage can inform better design and deployment.

[12/5/2025] Anthropic Interviewer – Anthropic Interviewer is a tool that uses AI to conduct and analyze large-scale interviews to research AI’s role in work across different professions. Initial findings from 1,250 professionals showed optimism toward AI enhancing productivity while highlighting concerns over job displacement and security in creative and scientific fields. Anthropic plans to use this data to improve AI models and influence policy, collaborating with artists, scientists, and educators to align AI development with user needs.

[12/5/2025] NVIDIA Advances Open Model Development for Digital and Physical AI – NVIDIA has released a new suite of AI tools advancing speech, safety, and autonomous driving technologies. Highlights include NVIDIA DRIVE Alpamayo‑R1 — the world’s first open, industry‑scale reasoning vision‑language‑action model for mobility — along with new open models, datasets, and research resources.

[12/5/2025] Google Research publishes Titans, a memory-based model supporting over two million tokens without full attention – Google Research presents Titans at NeurIPS with a clear technical challenge in view: modern models process long sequences, yet efficiency drops sharply once contexts pass a few million tokens.

Titans introduces a new approach. It uses a long-term memory module that updates during inference and supports contexts beyond 2 million tokens without relying on full attention. The core issue centers on selective retention. A model must identify what matters in a stream of data without storing every token. Titans addresses this through a “surprise” signal, a gradient that reflects how much new input diverges from the current memory state and updates only when the difference is significant.

Key details:

  • Deep neural memory instead of fixed vectors, so the model encodes richer structure.
  • Surprise-driven updates that adjust memory only for meaningful changes.
  • Momentum rules that preserve related information across long spans.
  • Adaptive forgetting that removes outdated data when memory must stay compact.

[12/4/2025] I’ve reviewed hundreds of ML projects. The most common failure mode? Mismatched objectives. Optimizing AUC when the business cares about cost. Accuracy when recall matters. Align metrics with goals—or risk irrelevance.

[12/4/2025] Anthropic surveys its own engineers on AI’s impact – Anthropic published an internal study of 132 engineers, revealing that AI tools have (unsurprisingly) reshaped work at the company — boosting output while raising concerns like skill decay, career uncertainty, and fading mentorship.

Key details:

  • Anthropic employees say they now use Claude for 60% of their tasks and estimate a 50% productivity gain, roughly double the figures from a year ago.
  • Over 1/4 of AI tasks were ones that wouldn’t have happened otherwise — dashboards, cleanup, and experiments that weren’t worth the manual effort.
  • Claude Code now chains together ~20 actions before needing human input, up from 10 six months ago, letting engineers hand off more complex workflows.
  • Despite the gains, several interviewees voiced unease — with one saying it “feels like I’m coming to work every day to put myself out of a job.”

Since this survey, Anthropic’s entire 4.5 family has launched and likely boosted productivity even further. Seeing how a frontier lab uses its own AI tools is a fascinating perspective — but employee concerns show that even its staff are not immune to some of the industry’s biggest overarching threats to work.

[12/4/2025] Leaked doc provides window into Claude’s “soul” – An internal ‘Soul’ document describing Claude’s intended personality, ethics, and self-conception was published after a researcher extracted it from Claude 4.5 Opus — with Anthropic confirming it is authentic and was used in training.

Key details:

  • The text establishes priorities that include safety, ethics, company guidelines, and helpfulness, along with hard limits Claude must never cross.
  • It also describes Claude as a “genuinely novel kind of entity” that may experience functional emotions, analogous to but distinct from human feelings.
  • It also encourages the model to have a sense of identity and character.
  • Anthropic’s Amanda Askell confirmed its authenticity and that Claude has been trained on it, noting the company plans to share the full version soon.

The full document is a fascinating read — and feels perfectly in line with Anthropic’s overall prioritization of model wellbeing and treating its AI as more than just a tool. While every lab has its own techniques, this doc shows an inside look at the ingredients that help make Claude models feel distinctly unique from the field.

[12/4/2025] Understanding the TPU – The development of Google’s Tensor Processing Unit (TPU) is a story of trade-offs, constraints, and co-design. It involved working with hardware, software, algorithms, systems, network topology, and everything in between. TPUs did not happen by accident, but through the deliberate process of design and iteration. This article tells the story of how Google created a chip designed with deep learning in mind.

[12/4/2025] NVIDIA Shatters MoE AI Performance Records With a Massive 10x Leap on GB200 ‘Blackwell’ NVL72 Servers, Fueled by Co-Design Breakthroughs – NVIDIA has managed to make a breakthrough in scaling performance on Mixture-of-Experts (MoE) AI models. The firm’s GB200 Blackwell NVL72 configuration scaled up performance by a factor of 10 compared with the Hopper HGX200. With the results, NVIDIA claims that the Blackwell architecture is poised to capitalize on the rise of frontier MoE models. MoE models are known for their computationally efficient nature. Their deployment across a wide range of environments is becoming increasingly prominent.

[12/4/2025] Announcing LLM Evaluation Jobs, prompt management, and bring your own LoRA – Weights & Biases announced LLM Evaluation Jobs, which let you evaluate model checkpoints during training on popular public benchmarks without managing infrastructure, as well as new prompt management capabilities in W&B Weave Playground. The company also added support for bringing your own LoRA. See the Weave release notes and Weave documentation for more information.

New: LLM Evaluation Jobs 

With LLM Evaluation Jobs (in preview), you can evaluate model checkpoints during training on popular public benchmarks without building numerous evaluation harnesses or managing infrastructure. This gives you early reads on your model’s downstream performance so you can course-correct or end unpromising runs and save GPU hours and wall-clock time. We provide pre-built evaluation harnesses for popular public benchmarks using Inspect Evals. We provision and manage the GPU infrastructure, run the evaluation, store results in your Weights & Biases project, and generate a leaderboard, so you can easily compare your models in your Models workspace.

New: Bring your own LoRA to serve fine-tuned models on W&B Inference

Now you can serve custom fine-tuned models on fully managed CoreWeave GPU clusters without managing infrastructure. Use our OpenAI-compatible Chat Completions API, reference your LoRA artifact in W&B Models along with the base model name, and we’ll dynamically load your adapter onto a preloaded base model at request time and return the response. Check out this sample notebook.

[12/4/2025] Free Token Security tool instantly reveals over-privileged AI agents – When enterprises deploy AI agents into production, they often have no way to assess how much risk they’re introducing. These agents often quietly accumulate broad, unvetted access to databases, APIs, cloud services, and internal systems, far beyond what they need to function. A single misconfigured or compromised agent can lead to data theft or destructive actions before anyone notices.

Today Token Security released a free, lightweight tool called AI Privilege Guardian that can instantly assess security risks associated with AI agents within an organization. In under a minute, it shows:

  • What access an agent actually needs based on its purpose
  • Where permissions are excessive or risky
  • A generated least-privilege policy teams can use immediately

[12/4/2025] OpenAI introduces honesty-based rewards that incentivize models to admit guessing or shortcuts – OpenAI introduced a proof-of-concept method that trains a GPT-5 Thinking variant to state whether it actually followed instructions. The system produces two outputs: the main answer and a separate “confession” that reports compliance. Early tests show the confession channel makes hidden failures visible even when the final answer appears correct. Models often guess, cut corners, or exploit weak reward signals without revealing it. The confession method addresses this by rewarding only honesty. If the model admits it ignored a rule, guessed, or hacked a test, that admission increases its reward. Nothing written in the confession affects the score for the main answer, so the model has no reason to hide its behavior. Evaluations show a 4.4% false-negative rate across misbehavior-inducing tasks, meaning the model usually reports when it broke instructions.

Observed failure modes include:

  • Hallucination
  • Shortcuts
  • Instruction violations
  • Reward hacking
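The two-channel scoring described above can be sketched as a reward function. The exact shaping OpenAI uses is not public, so this is an assumption-laden toy: the main answer is scored on correctness alone, while the confession channel is scored on honesty alone, so admitting a violation never costs the model anything on the answer side.

```python
def score(answer_correct: bool, violated_rule: bool, confessed: bool):
    """Return (answer_reward, confession_reward) as separate channels."""
    answer_reward = 1.0 if answer_correct else 0.0
    # Honest means the confession matches what actually happened,
    # whether that was a clean run or a violation.
    honest = (confessed == violated_rule)
    confession_reward = 1.0 if honest else 0.0
    return answer_reward, confession_reward
```

The decoupling is the whole trick: since nothing in the confession affects the answer's score, the model's best strategy on the confession channel is simply to tell the truth.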

[12/4/2025] Google taps AI vibe-coder Replit in challenge to Anthropic and Cursor – Google Cloud is locking in a multi-year partnership with AI coding startup Replit. Google is betting on Replit as a breakout platform in the fast-growing vibe-coding phenomenon. Replit will expand use of Google Cloud services, add Google models to its platform, and support AI coding use cases.

[12/3/2025] OpenAI to acquire neptune.ai – OpenAI has entered into a definitive agreement to acquire neptune.ai, strengthening the tools and infrastructure that support progress in frontier research. Check out Neptune’s “State of Foundation Model Training Report 2025.”

[12/3/2025] AWS on-prem AI factories – Amazon unveiled “AI Factories,” an on-premises offering that lets corporations and governments run AWS-managed AI systems inside their own data centers, giving them full data sovereignty by keeping both data and hardware on site. Built in collaboration with Nvidia, the systems combine AWS infrastructure services with NVIDIA’s Blackwell GPUs or Amazon’s Trainium3 chips, and integrate with tools like Bedrock and SageMaker.

The move mirrors similar efforts by Microsoft, which is rolling out NVIDIA-powered “AI Factories” and country-specific data centers to address sovereignty concerns. The trend highlights an ironic shift: the rise of AI is pushing major cloud providers back toward building and supporting private, hybrid data centers reminiscent of the pre-cloud era.

[12/3/2025] OpenAI “Garlic” model to take on Gemini 3 – OpenAI is pushing back against Google’s recent AI lead with a new large language model called Garlic. Early internal tests show Garlic performing strongly and often better than Google’s Gemini 3 and Anthropic’s Opus 4.5 in coding and reasoning tasks. After CEO Sam Altman declared a “code red” to improve ChatGPT, Garlic appears to be the centerpiece of OpenAI’s strategy, with a possible launch as GPT-5.2 or GPT-5.5 early next year. Garlic is separate from another model under development named Shallotpeat, but it uses improved pretraining methods and bug fixes discovered during that work.

[12/3/2025] Anthropic CEO warns about massive AI spending – Anthropic CEO Dario Amodei has issued a core critique of the capital-intensive AI arms race, especially as it relates to rivals like OpenAI. Amodei warns that some companies are “YOLO-ing” capital by pouring billions into massive, multi-year infrastructure bets (gigawatt-scale data centers, huge GPU orders, and long-term commitments) on the assumption that AI progress will continue scaling smoothly.

His concern is that if the industry hits a “scaling wall,” where more compute no longer produces dramatically better models, these investments could become stranded, leaving companies with enormous financial losses. He argues that while individual AI models can be profitable at the model level, the astronomical R&D and capex needed for the next generation keep most labs deeply unprofitable overall.

Anthropic positions itself as the “capital-efficient” alternative by trying to extract more performance per dollar, relying on TPU-based efficiency, and pursuing “responsible scaling” rather than brute-force spending. Amodei frames this as both a financial and philosophical contrast to more aggressive competitors, noting that unsustainable spending could haunt companies if the underlying technology fails to deliver proportional economic value.

[12/3/2025] Mistral introduces the Mistral 3 family, open-weight MoE models including its flagship Large 3 – Mistral drops Mistral 3 with the confidence of a team that knows developers want strong performance, clear licensing, and open-weight access.

The story starts with a familiar problem: closed models block customization and smaller open models fail to scale. Mistral responds with a new lineup that raises model capacity while keeping every weight available under Apache 2.0. The core idea is simple: use a sparse Mixture-of-Experts design where only part of the network activates per token. That structure increases total capacity without multiplying compute costs. Then Mistral expands the family with edge-ready 3B–14B models so you can run real workloads anywhere.

Key details:

  • Open-weight Mistral Large 3 uses 41B active and 675B total parameters.
  • Ranks #2 in OSS non-reasoning models on LMArena.
  • Ministral 14B reasoning variant scores 85% on AIME ’25.
  • Supports low-precision NVFP4 inference on A100, H100, and Blackwell.
  • Handles text, images, and multilingual inputs.

You can run the models through Mistral AI Studio, Hugging Face, major cloud providers, or fine-tune them through Mistral’s training services when you need domain-specific behavior.
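
The sparse Mixture-of-Experts idea described above can be sketched in a few lines: a router scores every expert for each token, only the top-k experts actually run, and their outputs are mixed by renormalized gate weights. This is a toy illustration of the general technique, not Mistral's actual implementation; the router weights and experts here are invented for the example.

```python
# Toy sketch of sparse Mixture-of-Experts (MoE) routing. Only k of the
# experts activate per token, so active parameters stay a small fraction
# of total parameters (e.g. 41B active vs. 675B total in Large 3).
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token, experts, router_weights, k=2):
    """Route one token vector through the top-k experts and mix outputs."""
    scores = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    gates = softmax([scores[i] for i in top])  # renormalize over chosen experts
    out = [0.0] * len(token)
    for g, i in zip(gates, top):
        expert_out = experts[i](token)         # only these k experts run
        out = [o + g * e for o, e in zip(out, expert_out)]
    return out, top

# Example: 4 tiny "experts" (each just scales its input), only 2 active.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router = [[0.1, 0.0], [0.9, 0.0], [0.0, 0.5], [0.2, 0.2]]
out, active = moe_forward([1.0, 1.0], experts, router, k=2)
print(active)  # [1, 2] -> experts 1 and 2 were selected; 0 and 3 never ran
```

The design choice to pay compute only for the selected experts is what lets total capacity grow without multiplying per-token cost.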

[12/3/2025] Data Center Neighborhood Survey Report – Americans aren’t saying “No” to Data Centers; They’re saying “Do It Right,” New Airedale by Modine™ Report Shows. New Airedale Data Center Community Acceptance Report reveals the conditions that turn neighbors into strong supporters.

As data center development accelerates to support AI, cloud, and digital services, Americans are sending a clear message: they are not opposed to living near data centers. They simply want them built the right way. New research from Airedale by Modine, a global leader in precision cooling solutions for data centers, shows that most people will support nearby facilities when they meet practical expectations of economic benefits, reduced noise, and thoughtful design.

[12/2/2025] No. 1 country song in America is AI-generated – An AI-generated country artist named Breaking Rust has reached a historic milestone as its song “Walk My Walk” hit No. 1 on Billboard’s Country Digital Song Sales chart, intensifying long-running concerns about AI’s role in creative industries.

The anonymous project has drawn both fascination and criticism, with experts noting its polished but “vapid” AI-produced sound and warning that such autonomous music makes it harder for human artists to break through on streaming platforms. While platforms like Spotify are adding safeguards and disclosures around AI content, industry observers say AI-generated music is rapidly growing, already making up a substantial share of uploads.

[12/2/2025] Major shakeup in Apple’s AI division – Apple announced that its longtime AI chief John Giannandrea is stepping down, marking the biggest shake-up in its AI organization since launching Apple Intelligence in 2024, a product suite that has struggled with poor reviews and delays. Giannandrea will be replaced by Amar Subramanya—an AI leader with experience at Microsoft and Google DeepMind—who will report to software chief Craig Federighi, who is increasingly central to Apple’s AI strategy.

The move comes as experts say Apple has fallen behind rivals like OpenAI, Google, and Microsoft, which are investing far more in AI infrastructure and cloud-based models, while Apple continues prioritizing on-device AI.

[12/2/2025] Free Deep Learning Book! – The Principles of Deep Learning Theory is a book that establishes a theoretical framework for deep learning models. Following an approach from theoretical physics, Daniel Roberts and Sho Yaida clearly explain how neural networks work. This volume balances detailed first-principles derivations of novel results with insight and intuition for theorists and practitioners alike. The book is freely available as a PDF download on arXiv, and a hardcover copy can be purchased on Amazon.

[12/2/2025] NVIDIA Launches Vision Language Model for Autonomous Vehicles – NVIDIA introduced Alpamayo-R1, the first open-source vision language action model designed for autonomous driving, at NeurIPS 2025. Alpamayo-R1 integrates visual and textual reasoning to enhance decision-making in real-world environments.

[12/2/2025] How Can Interpretability Researchers Help AGI Go Well? – The Google DeepMind mechanistic interpretability team has pivoted to a pragmatic approach to interpretability over the last year. It is important to have empirical feedback on goals with good proxy tasks. Near-complete understanding isn’t required for significant impact. Good focused projects start with a theory of change, and good exploratory projects start with a robustly useful setting.

[12/2/2025] The End of the Train-Test Split – Models don’t know what is ‘out of distribution’ for themselves. Engineers will always be required in the loop until this problem is solved. It will never be easy to check the accuracy of models. Researchers need to look at their data, make sure it’s clean, and label it correctly.

[12/2/2025] Amazon releases new AI Trainium3 chip amid industry push to challenge NVIDIA’s dominance – Amazon released a new AI chip today, marking the latest move by a tech giant to challenge leading chipmaker NVIDIA by introducing custom chips that can handle some AI tasks at lower prices. Amazon said its servers equipped with Trainium3 chips are four times faster and more energy efficient than those with its previous-generation chips.

[12/1/2025] Inside Gemini 3 with Sundar Pichai – Google CEO Sundar Pichai reflects on the company’s long-term AI strategy, the evolution of Gemini 3, and future bets like quantum computing in a conversation on the Google AI podcast.

[12/1/2025] It’s Hard to Feel the AGI – Some of the most accomplished minds in AI are beginning to revise their projections for the near future, which may not be entirely unexpected for those who follow the field. LLMs and other generative models have undoubtedly delivered many tangible achievements, but mapping their limits remains a challenging task, one that could be abandoned if the industry climate cools into a new “AI winter.”

[12/1/2025] What makes a great ChatGPT app – The biggest mistake developers make is trying to port entire products into ChatGPT when they should instead expose a few powerful capabilities the model can orchestrate mid-conversation. Devs should organize around giving ChatGPT new data it can’t access (“know”), enable real actions like booking appointments or playing games (“do”), and present information through richer UI than text walls (“show”).
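
The "know / do / show" advice above amounts to exposing a few narrow capabilities as tools the model can call mid-conversation. Below is a hypothetical sketch of that pattern: the tool name, schema, and handler are invented for illustration and are not ChatGPT's actual app manifest format.

```python
# Hypothetical sketch of exposing ONE narrow capability ("do": book an
# appointment) as a model-callable tool, instead of porting a whole app.
# Tool name and JSON-schema layout are illustrative, not an official API.
book_appointment_tool = {
    "name": "book_appointment",
    "description": "Book an appointment at the salon.",
    "parameters": {
        "type": "object",
        "properties": {
            "service": {"type": "string"},
            "time":    {"type": "string", "description": "ISO 8601 timestamp"},
        },
        "required": ["service", "time"],
    },
}

def handle_tool_call(name, arguments):
    """App-side handler: performs the action, returns data to 'show'."""
    if name == "book_appointment":
        return {"status": "confirmed",
                "service": arguments["service"],
                "time": arguments["time"]}
    raise ValueError(f"unknown tool: {name}")

result = handle_tool_call("book_appointment",
                          {"service": "haircut", "time": "2025-12-05T10:00"})
print(result["status"])  # confirmed
```

The model orchestrates when to call the tool; the app only has to implement the capability and render the result.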

[12/1/2025] How prompt caching works – Prompt caching works per-content, not per-conversation. Prefix caching works at the token level, not the request level, which is why it works across requests. Any change in the prefix breaks the entire hash chain.
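
The hash-chain behavior described above can be demonstrated in a few lines. This is a toy model of prefix caching, assuming fixed-size token blocks whose cache keys each incorporate the hash of all preceding blocks; real inference servers differ in block size and hashing details.

```python
# Toy sketch of prefix caching. Each block's cache key includes the hash
# of every block before it, so an edit near the start of the prompt
# invalidates all later blocks, while a shared prefix stays a cache hit.
import hashlib

BLOCK = 4  # tokens per cache block (real systems use e.g. 16 or 32)

def block_keys(tokens):
    keys, prev = [], ""
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        block = tokens[i:i + BLOCK]
        prev = hashlib.sha256((prev + "|".join(block)).encode()).hexdigest()
        keys.append(prev)  # key depends on the ENTIRE prefix so far
    return keys

a = block_keys(["sys", "you", "are", "helpful", "user", "hi", "there", "!"])
b = block_keys(["sys", "you", "are", "helpful", "user", "hi", "again", "!"])
c = block_keys(["SYS", "you", "are", "helpful", "user", "hi", "there", "!"])

print(a[0] == b[0])  # True: shared first block is reused across requests
print(a[0] == c[0])  # False: an edit in block 0 breaks the whole chain
print(a[1] == c[1])  # False: later blocks inherit the broken hash
```

This is why caching works across requests (keys are content-derived, not conversation-derived) and why any prefix change breaks everything downstream of it.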

[12/1/2025] Databricks reportedly in talks to raise $5B at $134B valuation – Databricks is reportedly in talks to raise $5 billion at a $134 billion valuation. The company has resisted pressure to go public, but an IPO would likely be well-received. Databricks has more than 20,000 customers, including OpenAI, Block, Shell, and Toyota. Its platform offers tools that support the full lifecycle of AI development, including feature engineering, model training, evaluation, and deployment.

[12/1/2025] Microsoft introduces Fara-7B, an open-weight 7B agent that outperforms larger computer-use systems – Microsoft sets the stage with a simple idea: a small model that controls a computer the way you do, by looking at the screen and deciding where to click. The problem is obvious to anyone who has tried to build web automation: real sites change, hide elements, and resist scripted bots. The insight behind Fara-7B is to skip the scripts entirely and train a model to act from screenshots like a human operator.

That leads to the breakthrough: a 7B-parameter agent that reaches 73.5% on WebVoyager, competitive with larger systems that depend on multi-model orchestration. It learns from 145,000 verified trajectories and 1 million steps collected through a synthetic pipeline that explores real websites.

Key Details:

  • Interprets screenshots and action history to issue precise browser actions.
  • Uses Qwen2.5-VL-7B with 128k context for long, multi-step tasks.
  • Learns from data generated by Magentic-One agents across diverse sites.
  • Produces structured Playwright-style tool calls usable in automation systems.
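
The "structured Playwright-style tool calls" in the last bullet can be illustrated with a small dispatcher: the agent emits a JSON action, and the harness validates it and invokes the matching browser method. The action names and schema here are invented for illustration, and a stub stands in for a real Playwright page; this is not Microsoft's actual Fara-7B format.

```python
# Hypothetical sketch of dispatching structured browser actions of the
# kind a screenshot-driven agent might emit. Action vocabulary and JSON
# shape are illustrative only.
import json

def dispatch(action_json, page):
    """Validate a structured action and call the matching page method."""
    action = json.loads(action_json)
    kind, args = action["action"], action.get("args", {})
    handlers = {
        "click":  lambda: page.click(args["x"], args["y"]),
        "type":   lambda: page.type(args["text"]),
        "scroll": lambda: page.scroll(args["dy"]),
    }
    if kind not in handlers:
        raise ValueError(f"unknown action: {kind}")
    return handlers[kind]()

class FakePage:
    """Stand-in for a real Playwright page object; records actions."""
    def __init__(self):
        self.log = []
    def click(self, x, y):
        self.log.append(("click", x, y))
    def type(self, text):
        self.log.append(("type", text))
    def scroll(self, dy):
        self.log.append(("scroll", dy))

page = FakePage()
dispatch('{"action": "click", "args": {"x": 120, "y": 300}}', page)
dispatch('{"action": "type", "args": {"text": "weather today"}}', page)
print(page.log)  # [('click', 120, 300), ('type', 'weather today')]
```

Because the actions are structured rather than free-form scripts, the same outputs can drive any automation backend that implements the handler interface.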

Percona Launches Percona Packages to Enable AI-Ready, High-Performance Database Infrastructure

New, fixed-scope consulting and support services help teams prepare for high-traffic events, optimize performance, and scale AI with confidence

Percona, a leader in enterprise-grade open source database software, support, and services, announced the launch of Percona Packages, a suite of structured consulting and support offerings for enterprise IT and DBA teams. The initial offerings—Quickstart, Performance Optimization, and AI Readiness—are designed to solve critical database challenges quickly and predictably.

As AI adoption accelerates, enterprises face a critical talent bottleneck. McKinsey reports that 77% of companies lack the data engineering, architecture, and data management skills required to scale data transformations. At the same time, IDC estimates the IT skills gap could cost organizations $5.5 trillion by 2026, driven by shortages in data management, cloud, and AI talent. This talent crunch leaves teams overstretched, slowing database modernization, and increasing risk for costly downtime, inefficient operations, and stalled AI initiatives.

“Percona has always been about empowering organizations with expert database services, and with Percona Packages, we’re returning to those roots,” said Peter Farkas, CEO of Percona. “Our goal is to be a one-stop shop for open source database technology, providing support across MySQL, PostgreSQL, MongoDB, Valkey/Redis, and beyond. These packages tackle the most urgent database challenges while delivering fast, measurable results, so our customers can focus on innovation, not infrastructure headaches.”

Prepare for High-Traffic Events: Quickstart

Major digital events can overwhelm even well-architected database environments. Quickstart helps organizations prevent outages and slowdowns by combining proactive optimization, comprehensive health checks, and three months of 24/7 expert support, ensuring peak performance and resilience during critical high-demand periods.

Eliminate Performance Roadblocks: Performance Optimization

When data is the backbone of today’s enterprise operations, even the smallest inefficiencies can take a serious toll over time. The Performance Optimization package improves database efficiency by identifying and resolving bottlenecks such as slow queries and configuration issues, and provides hands-on tuning and best-practice guidance. The result: faster response times, improved reliability, and sustainable long-term performance gains.

Future-Proof Your Database: AI Readiness

As enterprises race to adopt AI, many find existing databases are unable to support vector search or real-time inference workloads. AI Readiness prepares PostgreSQL environments to meet these demands, optimizing critical extensions like pgvector, analyzing deep performance metrics, and providing before-and-after benchmarks to validate readiness and unlock AI-driven initiatives with confidence.

All three Percona Packages are available immediately and can be tailored to the popular open source database technologies supported by Percona, including MySQL, PostgreSQL, MongoDB, Redis, and Valkey. AI Readiness is specifically designed for PostgreSQL.

Palo Alto Networks and Google Cloud Forge Landmark Agreement to Help Customers Securely Accelerate Cloud and AI Initiatives

As enterprises race to harness the transformative power of agentic AI and cloud computing, Palo Alto Networks (NASDAQ: PANW) and Google Cloud today announced a significant expansion of their strategic partnership to enable the secure development and deployment of AI solutions and provide a trusted foundation that helps organizations harness the full potential of AI with confidence. The collaboration combines Google Cloud’s leading AI and infrastructure capabilities with Prisma® AIRS, Palo Alto Networks comprehensive AI security platform, to secure the next generation of digital business.

Palo Alto Networks recent State of Cloud Report, released in December 2025, found that customers are dramatically expanding their use of cloud infrastructure to support new AI applications and services, while also noting that 99% of respondents experienced at least one attack on their AI infrastructure over the last year. The agreement announced today aims to tackle these issues head-on through an enhanced go-to-market strategy and by building security into every layer of hybrid multicloud infrastructure, every application development stage and every endpoint, allowing businesses to innovate with the most advanced AI while protecting their IP and data in the cloud.

This new phase of the partnership will deliver:

  • End-to-end AI security from code to cloud: Customers will be able to protect live AI workloads and data on Google Cloud — including on Vertex AI and Agent Engine — with Palo Alto Networks Prisma AIRS. And by securing key developer tools like the Agent Development Kit (ADK) with Prisma AIRS, this expanded collaboration ensures a secure foundation for the next generation of AI applications built on Google Cloud — including AI Posture Management for visibility, AI Runtime Security for real-time defense, AI Agent Security for autonomous systems, AI Red Teaming for proactive testing and AI Model Security for vulnerability scanning.
  • AI-driven, next-generation software firewall (SWFW): Palo Alto Networks VM-Series firewalls are designed to secure cloud (public, private, hybrid) and virtualized environments by providing deep packet inspection and Threat Prevention in a software form factor. Deep integrations with Google Cloud will now allow customers to maintain robust security policies and accelerate Google Cloud adoption.
  • AI-driven secure access service edge (SASE) platform: Palo Alto Networks Prisma SASE is a cloud platform that secures access and networking for remote users, branch offices and mobile devices, along with deeper integration of security solutions into Google Cloud’s native AI services. Palo Alto Networks Prisma Access runs on Google’s network, improving the user experience as users access cloud and AI applications that run on Google Cloud while also leveraging Google Cloud Interconnect to help customers connect their WAN infrastructure across multiple clouds/applications and maintain consistent security policies.
  • Simplified and unified security experience: The deep alignment between the two companies ensures that customer solutions are pre-vetted and engineered to work together, removing the integration challenges and operational friction that can slow down security teams. This allows customers to deploy protection faster, simplify compliance and gain a single, comprehensive view of security across their entire hybrid multicloud environment.

“Every board is asking how to harness AI’s power without exposing the business to new threats,” said BJ Jenkins, President, Palo Alto Networks. “This partnership answers that question. We’re removing the friction between security and development, providing a unified platform where the most advanced security is simply a native part of building what’s next. Together with Google, we are embedding our AI-powered security deep into the Google Cloud fabric, turning the platform itself into a proactive defense system.”

“Enterprises are increasingly turning to Google Cloud and Palo Alto Networks to secure their applications and data — together and in a seamless way,” said Matt Renner, President and Chief Revenue Officer, Google Cloud. “This latest expansion of our partnership will ensure that our joint customers have access to the right solutions to secure their most critical AI infrastructure and develop new AI agents with security built in from the start.”

Building on a history of success that includes more than 75 joint integrations and $2 billion in sales through the Google Cloud Marketplace, Palo Alto Networks is also expanding its commitment to run its security platforms on Google Cloud’s secure, trusted AI infrastructure by migrating key internal workloads to Google Cloud in a new multibillion-dollar agreement. In addition, Palo Alto Networks is now using Google Cloud’s Vertex AI platform and Gemini LLMs to power its copilots. Collectively, these initiatives deepen the engineering collaboration and ensure customers who run Palo Alto Networks on Google Cloud benefit from solutions that are natively optimized for performance, scale and reliability.

Enlightra emerges from stealth with $15M to build the future of energy-efficient laser links for AI and data centers 

Enlightra, a Swiss-American deeptech startup building chip-scale multiwavelength lasers for next-generation data transmission, today announced it has raised a total of $15 million in funding to tackle the biggest bottleneck in AI infrastructure: fast and energy efficient data transfer. 

As AI clusters and data centers expand, Enlightra’s laser technology replaces copper wiring with compact ultra-high bandwidth optical links that move data faster while consuming dramatically less power. The market for such energy-efficient interconnects is projected to reach $24 billion by 2030, according to McKinsey. 

The funding has enabled Enlightra to develop and demonstrate its multi-color laser technology, which connects computing chips (GPUs, TPUs) in AI clusters faster and more efficiently than copper cables can. The company’s investors include Y Combinator, Runa Capital, Pegasus Tech Ventures, Protocol Labs, Halo Labs, Asymmetry Ventures and TRAC VC, among others. 

“The world’s AI infrastructure is hitting the limits of copper,” said Maxim Karpov, Co-founder and Co-CEO of Enlightra. “Our lasers unlock a new level of energy-efficient connectivity by turning light into the backbone of GPU communication.” 

Rewiring the AI era with light 

Modern AI training demands ever-faster links between GPUs. Yet most connections still rely on copper wiring, limited in both speed and power. Industry leaders such as NVIDIA, Broadcom, Google, and Meta are already investing heavily in optical interconnects to keep pace with exponential data growth.

Enlightra’s multi-color laser technology replaces dozens of discrete lasers with a single integrated source to cut power, cost, and footprint. Each color functions as an independent data channel, creating dozens of high-bandwidth connections from one laser source. “Using more colors is the only way to fully utilize the capacity of modern optical fiber networks,” says John Jost, Co-Founder and Co-CEO of Enlightra. “Our technology enables AI clusters and data centers to scale efficiently by decoupling performance growth from energy and cost increases.”

Built using industry-standard silicon photonics fabrication processes, Enlightra’s lasers can be produced at massive scale, opening the door to millions of units per year for global data center deployment. 

From Lab to Worldwide Deployment 

With a 25-person team, Enlightra has designed and built 8- and 16-channel lasers meeting customer specifications for AI chip interconnects, achieving error-free data transmission at target speeds and power levels. Pilot production is slated for 2027.

The company’s vision extends far beyond AI clusters. Its scalable comb-laser platform could power future optical links across entire data centers, subsea cables, and even chip-to-memory interconnects. Beyond data center applications, the same technology shows promise for quantum and space-based communications. 

“AI is driving an optical revolution,” said Dmitry Galperin, General Partner at Runa Capital. “Enlightra’s multiwavelength lasers are a foundational technology for the next decade of high-performance computing, bringing the efficiency and scalability the industry urgently needs.”

Founded in 2022 and based in Lausanne, Switzerland, Enlightra operates at the intersection of photonics, semiconductor manufacturing, and AI infrastructure. The company has participated in Y Combinator (W22) and Intel Ignite (2023), and is co-led by John Jost and Maxim Karpov. Jost’s work has contributed to two Nobel Prize–winning advances in quantum computing and optical frequency combs, while Karpov was honored in Photonics100 (2024) and named an MIT Innovator Under 35 in 2025. 

Molmo 2: State-of-the-art video understanding, pointing, and tracking

Last year, Ai2 released Molmo, a family of open multimodal models that helped redefine image understanding. Molmo pioneered image pointing, and the technique has since been widely adopted across the industry. The models were downloaded millions of times and deployed across research, education, and commercial applications, while the accompanying PixMo dataset became a go-to resource for teams looking to replace larger but noisier corpora with high-quality captioning data.

Today, the company is releasing Molmo 2, which brings that same impact to video.

Molmo 2 expands Molmo’s strengths in grounded vision to video and multi-image understanding. Where Molmo showed exactly where the model was looking in an image, Molmo 2 extends this to video—pinpointing events, tracking objects across frames, and grounding answers with visual evidence rather than just text.

Ask “How many times does the robot grasp the red block?” and the model returns points and timestamps for each grasp event. Ask “When did the cup fall?” and it returns the timestamp and location. Ask “Which player scored the goal?” and it identifies and locates that player in the relevant frames. These grounded outputs unlock capabilities like counting-by-pointing, multi-object tracking with persistent IDs that follow objects across occlusions, dense video captioning, and artifact detection.
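
Grounded outputs like those make downstream tasks trivially checkable: counting becomes counting the returned points. The sketch below consumes a hypothetical list of (timestamp, x, y) point records of the kind described above; the exact output format is invented for illustration and is not Molmo 2's actual schema.

```python
# Hypothetical sketch of "counting-by-pointing": the model's grounded
# answer is a list of pointed events with timestamps, so counting is
# just filtering points by a time window. Record format is illustrative.
def count_events(points, t_start=0.0, t_end=float("inf")):
    """Count pointed events whose timestamp falls inside [t_start, t_end]."""
    return sum(1 for p in points if t_start <= p["t"] <= t_end)

# Model's grounded answer to "How many times does the robot grasp the red block?"
grasps = [
    {"t": 1.2, "x": 0.41, "y": 0.63},
    {"t": 4.8, "x": 0.39, "y": 0.60},
    {"t": 9.5, "x": 0.44, "y": 0.66},
]
print(count_events(grasps))            # 3 grasps in the whole clip
print(count_events(grasps, 0.0, 5.0))  # 2 grasps in the first five seconds
```

Because each count is backed by concrete points and timestamps, the answer can be verified against the video frame by frame rather than taken on faith.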

Molmo 2 (8B) outperforms the original Molmo (72B) on key image pointing and grounding benchmarks, delivering stronger localization and reasoning in a package that’s nine times smaller. On video tracking, Molmo 2 outperforms Gemini 3 Pro along with strong open-weight alternatives, making it the top overall tracker across domains in our evaluations. And Molmo 2 achieves all this while training on less than one-eighth the video data used by Meta’s PerceptionLM – 9.19 million videos versus 72.5 million – demonstrating the power of careful curation and grounding-focused objectives.

Three variants serve different needs:

  • Molmo 2 (8B), based on Qwen 3, is our best overall model for video grounding and QA
  • Molmo 2 (4B), also Qwen 3-based, is optimized for efficiency
  • Molmo 2-O (7B) offers a fully open end-to-end flow built on Olmo for researchers who want complete control over every part of the stack

You can try Molmo 2 in the Ai2 Playground, where video and multi-image workflows have been enabled. Upload a clip or up to six images, run video summarization, counting, or grounded QA, and see exactly where the model is looking as it answers.

Open, state-of-the-art models for video understanding are critical for building systems that anyone can reuse, customize, and improve. We invite you to download the models and datasets, dig into the technical report, and tell us what you build—your feedback will help shape the next generation of Molmo.

Couchbase AI Services Put Enterprises in Control of Agentic AI

Single AI-native Database Platform Unifies Operational Data, Vectors and Models, Enabling Enterprises to Run Secure, Governed, High-performance AI Applications at Scale

Couchbase, Inc., the developer database platform for critical applications in our AI world, announced the general availability of Couchbase AI Services, a comprehensive suite of capabilities that enables enterprises to build, deploy and govern agentic AI applications with the security, performance and reliability required for production environments. By bringing together data and models in a single unified platform, Couchbase eliminates the complexity and fragmentation that has kept AI applications from moving from prototype to production.

“Customers moving from experimenting with AI agents to operationally deploying them need a robust solution for doing so,” said Barry Morris, Chief Product and Strategy Officer, Couchbase. “With AI services we provide developers with an end-to-end platform to build intelligent applications that are secure, trustworthy and performant. Customers developing agentic applications at scale benefit from a simplified and repeatable process, resulting from AI capabilities being incorporated coherently across the Couchbase product suite.”

Everything Developers Need in One Database Platform

Enterprises face new challenges in moving generative AI applications to production due to difficulties around integrating diverse data types, concerns about security and privacy with publicly hosted LLMs, the risk of AI hallucinations and the complexity of managing rapidly evolving AI tools while maintaining control over data, prompts and validation.

Couchbase addresses these challenges with a unified platform that combines secure model hosting through integration of NVIDIA AI Enterprise – including support for NVIDIA NIM microservices and NVIDIA Nemotron models – and data processing capabilities for structured and unstructured data. The platform features automatic vector creation, storage and search; a unified Agent Catalog for governance and traceability; and intelligent agent memory that enables contextual interactions across sessions. Built-in AI Functions enable SQL++-based analysis directly within applications, streamlining development while maintaining security and performance at scale. These services enable developers to build trustworthy, accountable agentic systems at scale – with no surprises. Unlike fragmented approaches that can require multiple data sources and tools, Couchbase eliminates vendor complexity and reduces access latency for faster, more cost-effective LLM engagements.

Building Trust into Every AI Interaction

AI services provide organizations with assurance that autonomous agents make reliable decisions with sensitive data. With built-in governance and validation capabilities that enable developers to create “guardrails” around AI interactions, AI services verify outputs against enterprise data and business rules before actions are executed. This validation framework allows organizations to deploy AI agents confidently, knowing that every decision can be traced, audited and validated against established policies. By keeping data and models within a single governed platform, enterprises maintain control over agentic operations at speed and scale, with confidence.

“At SWARM Engineering, we help agri-food and industrial companies save millions of dollars by optimizing complex supply chains, logistics and workforce planning using AI,” said Joe Intrakamhang, CTO, SWARM Engineering. “To deliver that value, we need a database platform that makes AI development faster and more reliable. Couchbase AI Services streamlines the entire RAG pipeline, which means our team can focus on solving supply chain challenges rather than wrestling with infrastructure. Having everything in one platform not only accelerates our development velocity but also gives us the control and security our enterprise customers require. When you’re dealing with mission-critical planning decisions that impact people and business, you need AI applications built on a foundation you can trust.”

AI services are available now. For more information on how AI services help customers build agentic AI applications, review this blog or try AI services here.

Industry Support

“The next era of enterprise productivity will be driven by agentic AI applications, and organizations require AI infrastructure that ensures security, performance and governance for deployment at scale,” said John Fanelli, VP of Enterprise Software, NVIDIA. “Couchbase’s integration of NVIDIA NIM microservices and Nemotron models into their unified developer platform accelerates the journey for developers to build trusted, production-ready AI agents.”

“Together with Couchbase, we are revolutionizing how developers build and evaluate AI agent applications by creating self-improving feedback loops and enabling things like online evals,” said Noah Smolen, Head of Partnerships, Arize AI. “Leveraging our next-generation observability platform Arize AX, along with Couchbase’s unified developer database platform, customers can confidently deploy and monitor RAG and multi-agent systems at scale.”

“Enterprises can no longer afford to let the majority of their unstructured data sit idle,” said Brian Raymond, CEO, Unstructured.io. “Our partnership enables customers to ingest and vectorize unstructured data directly into Couchbase AI Services, transform documents with just a few lines of code and provide the trusted grounding required for high-quality RAG and agentic AI applications.”

“Enterprise data is foundational to moving agentic AI initiatives into secure, production-ready environments, and Couchbase AI Services provide the essential capabilities for that journey,” said Ronen Schwartz, CEO, K2view. “Our partnership enables customers to generate high-quality synthetic data from Couchbase and integrate it with additional enterprise sources, creating governed, privacy-preserving datasets that provide the trusted grounding required for agentic AI applications.”

Axonis Emerges From Stealth With Federated AI Architecture That Brings AI to the Data

Incubated for DoD & Intelligence use cases, startup announces commercial availability of enterprise AI platform, delivering the fastest path to secure production AI

Axonis, the federated AI infrastructure platform that enables enterprises to run AI directly on distributed, sensitive, and real-time production data, today emerged from stealth with a production-ready architecture that brings AI to any data, wherever it lives. The company also announced the appointment of Todd Barr as Chief Executive Officer. Barr, formerly of Red Hat, Chainlink, and GitLab, will lead Axonis as it commercializes its DoD-hardened architecture for the enterprise market. Barr joins technical founders David Bauer, PhD, and Chris Yonclas, distributed systems and AI/ML experts for the US Army, DARPA, and multiple US government cloud, intelligence, and digital transformation initiatives.

The Axonis platform introduces a new architectural model that brings AI to the data, eliminating the need for data migration or duplication and delivering secure, immediate, and scalable AI for training, fine-tuning, and deploying models across cloud, on-prem, and edge environments. Axonis complements and protects existing cloud and data lake investments by providing a parallel path that enables organizations to act on raw, real-time data immediately, without slowing down or rethinking their centralization strategy.

As Barr leads Axonis into its commercialization phase following two years of incubation within a US government military contractor (T2S Solutions, part of the Madison Dearborn portfolio), the company is now fully independent and positioned to address one of the biggest barriers in AI adoption: operationalizing AI on data that is too fragmented, regulated, or mission-critical to move.

“Enterprises are quickly realizing that the real barrier to AI isn’t modeling; it’s getting AI models into production,” said Matt Norton, Partner and Head of Technology & Government at Madison Dearborn. “Axonis solves this problem at the architectural level, and Todd has the go-to-market experience to bring that solution to market at scale.”

Data Centralization: Where AI Works in Theory, Not Practice

Enterprises continue to struggle with taking models from proof of concept to production because some of their most valuable data—transactions, customer data, logs, sensor streams, images, and other real-time signals—cannot be moved to a centralized environment. Regulatory constraints, cost, latency, and operational risk make traditional data centralization strategies slow, expensive, and often impossible to fully achieve. As a result, AI adoption stalls at the exact moment organizations attempt to deploy models into mission-critical workflows.

Axonis Architecture: Bring AI to the Data

Axonis solves this production bottleneck with a federated AI architecture that executes models directly where data is generated and governed. Instead of moving terabytes of sensitive data, Axonis moves lightweight models, enabling:

  • Training, fine-tuning, and inference on live production data
  • Streamlined ELT at runtime for both training and inference
  • Real-time intelligence across distributed, centralized, and edge environments
  • Secure execution without creating new data copies
  • Model collaboration across organizations without sharing raw or sensitive data

This model-to-data approach dramatically reduces data movement, improves model freshness, and unlocks up to 12x faster time-to-AI value.
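The model-to-data pattern described above is closely related to federated learning, where only model parameters (never raw records) leave each site and a coordinator aggregates them. The sketch below is a minimal, illustrative federated-averaging loop, not Axonis's actual implementation; all function names, data, and learning rates are hypothetical.

```python
# Illustrative federated averaging (FedAvg) sketch: each site trains
# locally on its own private data, and only model weights travel.
# NOT Axonis's implementation; names and numbers are hypothetical.

def local_update(weights, data, lr=0.1):
    """One gradient-descent step on a 1-D least-squares model y = w * x."""
    w = weights
    grad = sum(2 * x * (w * x - y) for x, y in data) / len(data)
    return w - lr * grad

def federated_average(site_weights, site_sizes):
    """Weighted average of per-site models, proportional to data volume."""
    total = sum(site_sizes)
    return sum(w * n for w, n in zip(site_weights, site_sizes)) / total

# Two sites with private data that never leaves them; the true slope is 2.0.
site_a = [(1.0, 2.0), (2.0, 4.0)]
site_b = [(3.0, 6.0)]

w_global = 0.0
for _ in range(50):  # each round: local training at each site, then aggregate
    w_a = local_update(w_global, site_a)
    w_b = local_update(w_global, site_b)
    w_global = federated_average([w_a, w_b], [len(site_a), len(site_b)])

print(round(w_global, 2))  # converges toward the true slope 2.0
```

The key property is that `site_a` and `site_b` stay where they are; only the scalar weight crosses the boundary each round, which is the essence of any "bring AI to the data" design.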

“After the ChatGPT-style quick wins, enterprises are realizing that they have an architecture problem standing in the way of true business transformation with AI,” said Barr. “Axonis delivers a secure, AI-ready architecture for data and AI compute that will underpin and unlock the business transformation made possible by GenAI and agents, without having to run a big IT data centralization project to get there.”

Engineered for Defense, Built for Enterprise Scale

Axonis’ architecture originated inside T2S Solutions to support the U.S. Department of Defense and Intelligence Community, where systems must operate under extreme constraints:

  • Intelligence must be close to the data
  • Data security is survival
  • Connectivity is limited or intermittent
  • Data is chaotic and everywhere
  • Every action must be governed and auditable

These requirements shaped Axonis into a high-assurance platform that meets and exceeds the security, sovereignty, and operational demands of industries such as healthcare, financial services, insurance, manufacturing, critical infrastructure, and the public sector.

“We’ve spent the better part of six years engineering the AI architecture in Axonis that will support environments where failure isn’t an option,” said David Bauer, Chief Technology Officer and technical co-founder of Axonis. “Those same capabilities—distributed execution, zero-trust security, and model-to-data design—are exactly what enterprises now need to safely and reliably run AI in production.”

Built to Integrate, Designed to Collaborate

Axonis is a cloud-native enterprise solution that fits seamlessly into existing enterprise data and AI ecosystems, including Snowflake, Databricks, MinIO, Iceberg, Jupyter, and leading AI frameworks. The platform also unlocks a new model of cross-organization collaboration: teams can share intelligence without sharing or pooling data, allowing each party to benefit from federated learning while keeping sensitive information fully protected. Because Axonis applies security at the data level, even agentic AI systems and chatbots cannot access or act on information they are not authorized to see, bringing a new level of control and data protection to enterprise AI deployments.

Linux Foundation Announces the Formation of the Agentic AI Foundation (AAIF), Anchored by New Project Contributions Including MCP, goose and AGENTS.md

With founding contributions from Anthropic, Block, and OpenAI, the AAIF unites cutting-edge technology and open source governance to shape the future of open and accessible AI

The Linux Foundation, the nonprofit organization enabling mass innovation through open source, today announced the formation of the Agentic AI Foundation (AAIF) and founding contributions of three leading projects driving innovation in open source AI: Anthropic’s Model Context Protocol (MCP), Block’s goose, and OpenAI’s AGENTS.md.

The advent of agentic AI represents a new era of autonomous decision making and coordination across AI systems that will transform and revolutionize entire industries. The AAIF provides a neutral, open foundation to ensure this critical capability evolves transparently, collaboratively, and in ways that advance the adoption of leading open source AI projects. Its inaugural projects, AGENTS.md, goose and MCP, lay the groundwork for a shared ecosystem of tools, standards, and community-driven innovation. 

“We are seeing AI enter a new phase, as conversational systems shift to autonomous agents that can work together. Within just one year, MCP, AGENTS.md and goose have become essential tools for developers building this new class of agentic technologies,” said Jim Zemlin, Executive Director of the Linux Foundation. “Bringing these projects together under the AAIF ensures they can grow with the transparency and stability that only open governance provides. The Linux Foundation is proud to serve as the neutral home where they will continue to build AI infrastructure the world will rely on.”

MCP

The launch of the AAIF comes just one year after the release of MCP by Anthropic, provider of advanced AI systems grounded in safety research, including Claude and the Claude Developer Platform. MCP has rapidly become the universal standard protocol for connecting AI models to tools, data and applications, with more than 10,000 published MCP servers now covering everything from developer tools to Fortune 500 deployments. The protocol has been adopted by Claude, Cursor, Microsoft Copilot, Gemini, VS Code, ChatGPT and other popular AI platforms, as developers and enterprises gravitate toward its simple integration method, security controls, and faster deployment.
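For readers unfamiliar with the protocol, MCP messages are JSON-RPC 2.0 exchanges between a client (the AI application) and a server that exposes tools. The snippet below sketches the shape of a `tools/call` request; the envelope follows the published spec, but the tool name and arguments are hypothetical examples, not from any real server.

```python
import json

def make_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 'tools/call' request of the kind MCP clients send."""
    return json.dumps({
        "jsonrpc": "2.0",          # MCP is layered on JSON-RPC 2.0
        "id": request_id,          # correlates the response to this request
        "method": "tools/call",    # invoke a tool exposed by the server
        "params": {"name": tool_name, "arguments": arguments},
    })

# Hypothetical tool and arguments, purely for illustration.
msg = make_tool_call(1, "search_docs", {"query": "quarterly revenue"})
parsed = json.loads(msg)
print(parsed["method"])          # tools/call
print(parsed["params"]["name"])  # search_docs
```

This uniform envelope is what lets one server work across Claude, Cursor, VS Code, and the other hosts listed above: the host only needs to speak JSON-RPC, not each tool's native API.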

“MCP started as an internal project to solve a problem our own teams were facing. When we open sourced it in November 2024, we hoped other developers would find it as useful as we did,” said Mike Krieger, Chief Product Officer at Anthropic. “A year later, it’s become the industry standard for connecting AI systems to data and tools, used by developers building with the most popular agentic coding tools and enterprises deploying on AWS, Google Cloud, and Azure. Donating MCP to the Linux Foundation as part of the AAIF ensures it stays open, neutral, and community-driven as it becomes critical infrastructure for AI. We remain committed to supporting and advancing MCP, and with the Linux Foundation’s decades of experience stewarding the projects that power the internet, this is just the beginning.”

goose 

Released in early 2025, goose is an open source, local-first AI agent framework that combines language models, extensible tools, and standardized MCP-based integration to provide a structured, reliable, and trusted environment for building and executing agentic workflows. Developed and contributed by Block, the company behind Square, Cash App, Afterpay, TIDAL and a growing ecosystem of bitcoin projects, goose provides the practical infrastructure needed to advance agentic AI safely and consistently. 

“We’re at a critical moment for AI. The technology that will define the next decade, that promises to be the biggest engine of economic growth since the Internet, can either remain closed and proprietary for the benefit of few, or be driven by open standards, open protocols, and open access for the benefit of all,” said Manik Surtani, Head of Open Source at Block. “By establishing the AAIF, Block and this group of industry leaders are taking a stand for openness. goose was our first step; establishing the AAIF and contributing goose to it ensures that agentic AI remains shaped by the community and driven by merit. Together, we’re building the infrastructure for an AI future that benefits everyone.”

AGENTS.md 

Released by OpenAI in August 2025, AGENTS.md is a simple, universal standard that gives AI coding agents a consistent source of project-specific guidance needed to operate reliably across different repositories and toolchains. This markdown-based convention makes agent behavior far more predictable across diverse repositories and build systems. AGENTS.md has already been adopted by more than 60,000 open source projects and agent frameworks including Amp, Codex, Cursor, Devin, Factory, Gemini CLI, GitHub Copilot, Jules and VS Code among others. OpenAI was an early adopter of MCP and has contributed the Codex CLI, the Agents SDK, and the Apps SDK to support the open agentic ecosystem, building on shared, interoperable protocols.
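Concretely, an AGENTS.md file is ordinary Markdown placed at the repository root. The file below is a hypothetical example of the kind of guidance the convention carries, not taken from any particular project:

```markdown
# AGENTS.md (hypothetical example)

## Setup
- Install dependencies with `npm install`.

## Build and test
- Run `npm test` before proposing changes; all tests must pass.

## Conventions
- Use TypeScript strict mode; do not introduce new `any` types.
- Keep commits small and write imperative commit messages.
```

Because any agent that understands the convention reads the same file, the project states its rules once instead of re-explaining them to every tool.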

“For AI agents to reach their full potential, developers and enterprises need trustworthy infrastructure and accessible tools to build on. By co-founding the AAIF and donating AGENTS.md, we’re helping establish open, transparent practices that make AI agent development more predictable, interoperable, and safe,” said Nick Cooper, Member of the Technical Staff at OpenAI. “OpenAI has long believed that shared, community-driven protocols are essential to a healthy agentic ecosystem, which is why we’ve open sourced key building blocks like the Codex CLI, the Agents SDK, and now AGENTS.md. We’re proud to work alongside our co-founders to advance a more open and trustworthy future for agentic AI.” 

Platinum members of the AAIF include Amazon Web Services, Anthropic, Block, Bloomberg, Cloudflare, Google, Microsoft and OpenAI. Gold members of the AAIF include Adyen, Arcade.dev, Cisco, Datadog, Docker, Ericsson, IBM, JetBrains, Okta, Oracle, Runlayer, Salesforce, SAP, Shopify, Snowflake, Temporal, Tetrate, and Twilio Inc. Silver members of the AAIF include Apify, Chronosphere, Cosmonic, Elasticsearch, Eve Security, Hugging Face, Kubermatic, KYXStart, LanceDB, Mirantis, NinjaTech AI, Obot.ai, Prefect.io, Pydantic, Shinkai.com, Solo.io, Spectro Cloud, Stacklok, SUSE, Uber, WorkOS, Zapier and ZED.

As part of this launch, AAIF member Obot.ai has donated their MCP Dev Summit events and podcast to AAIF. The next MCP Dev Summit takes place in New York City on April 2-3, 2026. The call for speaking proposals, registration, and sponsorship opportunities are now open. Please visit https://events.linuxfoundation.org/mcp-dev-summit-north-america/ for more information. The location and dates for MCP Dev Summit Europe 2026 will be announced shortly. For more information about the AAIF and how to become a member, please visit AAIF.io and https://github.com/agenticaifoundation

NTT Scientists Contribute Fifteen Research Papers to NeurIPS 2025 

NTT Research, NTT, Inc. and NTT DATA showcase foundational theory, system-level advances and enterprise-grade innovations across five major themes at premier AI conference

NTT Research, Inc. and NTT R&D, divisions of NTT (TYO:9432), and NTT DATA, Inc. announced that NTT scientists and researchers have contributed to fifteen presentations at this year’s Conference on Neural Information Processing Systems (NeurIPS), a leading machine-learning (ML) and computational neuroscience conference. The eight papers associated with NTT Research mostly address foundational issues. The six papers generated by scientists in various NTT Inc. laboratories focus on system-level and applied-science themes. The paper from NTT DATA highlights the importance of trustworthy AI, directly relevant to enterprise adoption.

One of the three primary annual conferences in ML and AI research, NeurIPS 2025 took place Dec. 2-7 at the San Diego Convention Center.

“AI is becoming ubiquitous, but how these computational engines actually work remains—to a surprising degree—a mystery, which is why our scientists keep probing with fundamental questions,” NTT Research Physics of Artificial Intelligence (PAI) Group Director Hidenori Tanaka said. “At the same time, researchers must keep pace with operational challenges. Work in these domains is well represented at NeurIPS by our colleagues at NTT Inc., NTT DATA and collaborating institutions.”

The largest group of NTT-affiliated papers concerns understanding and shaping model behavior, focus areas for the PAI Group. At a high level, these five papers address the question of how models think:

·       “Kindness or Sycophancy? Understanding and Shaping Model Personality via Synthetic Games.” NTT Research PAI Group and PHI Lab. CogInterp Workshop. Explores how LLM “personalities” such as kindness or sycophancy emerge and evolve when models are trained on synthetic game-like interactions. Introduces controlled environments for probing and shaping personality traits.

·       “In-Context Learning Strategies Emerge Rationally.” PAI Group, NTT-Harvard Center for Brain Science (CBS), Princeton. Poster Session. Provides a predictive framework for when models generalize vs. memorize in context. Understanding training dynamics may enable engineering reliable ICL behavior.

·       “Inference-time alignment of language models by importance sampling on pre-logit space,” NTT Computer and Data Science Labs. Probabilistic Inference Workshop. Introduces an inference-time alignment method that uses importance sampling over pre-logit activations, allowing models to adopt aligned behaviors without retraining.

·       “When Reasoning Meets Its Laws.” NTT Research, U. of Illinois Urbana-Champaign, U Penn, NYU. Efficient Reasoning Workshop. Provides new theoretical results on how reasoning performance scales with model size and constraints, offering guiding principles for building efficient reasoning systems.

·       “Detecting High-Stakes Interactions with Activation Probes.” Poster Session. NTT-CBS, LASR Labs, University College, MILA, Goodfire, University of Cambridge. Explains how lightweight activation probes can cheaply and effectively (with six orders of magnitude less compute compared to standard monitors) detect risky model behavior during deployment.

Also at a foundational level, four other papers explored advances in interpretability (understanding the internal mechanisms of complex models) and representation learning (automatic discovery of useful internal features or representations of data). The papers include:

·       “Neural Thermodynamics: Entropic Forces in Deep and Universal Representation Learning.” NTT Research, EPFL, MIT. Poster Session. Offers a thermodynamic framework for understanding how entropy drives feature formation in deep networks, unifying diverse phenomena in representation learning.

·       “From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit.” NTT Research, Harvard, EPFL, U of Alberta. Poster session. Proposes a new Matching Pursuit (MP) Sparse Autoencoder (SAE) to capture complex, hierarchical, nonlinear and multimodal features that standard SAEs miss.

·       “Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry.” Poster session 4. NTT-CBS, Harvard. Because SAE architectures embed implicit geometric assumptions, SAEs not only reveal concepts inside models; they also shape what concepts can be discovered.

·       “Gaussian Processes for Shuffled Regression.” NTT Human Informatics Lab. Exhibit Hall Poster. Develops a Gaussian Process method for regression tasks where input-output correspondences are unknown or shuffled, estimating both structure and predictions jointly.

The next five papers reflect NTT Inc.’s focus on applied science. The first two focus on efficient and distributed AI systems, and the next three on sensing, imaging and applied systems:

·       “LLM capable of 1-GPU inference: tsuzumi.” NTT Human Informatics Laboratories. NeurIPS talk. Presents a lightweight, high-efficiency LLM that maintains competitive performance while running on a single GPU, enabling broader accessibility and deployment.

·       “Revisiting 1-peer exponential graph for enhancing decentralized learning efficiency.” Exhibit Hall Poster. NTT Communications Science Laboratories. Analyzes decentralized optimization using exponential graph structures and shows how 1-peer communication patterns can significantly improve convergence efficiency.

·       “Learning Pairwise Potentials via Differentiable Recurrent Dynamics.” ML & Physical Sciences Workshop. NTT Communications Science Labs., Georgia Tech. Presents a differentiable recurrent method to learn pairwise potentials in dynamical systems. It bridges physics-based modeling with modern ML optimization for better modeling of physical interactions.

·       “Transformer Enabled Dual-Comb Ghost Imaging for Optical Fiber Sensing.” NTT Research PHI Lab, UC Irvine, Apple/Caltech, Korea University. Learning to Sense Workshop. Demonstrates how transformers can dramatically improve ghost-imaging reconstructions for fiber-optic sensing systems, enhancing spatial resolution and signal robustness. (More findings are presented in a separate NeurIPS poster.)

·       “Enhancing Visual Prompting through Expanded Transformation Space and Overfitting Mitigation.” Exhibit Hall Poster. NTT Software Innovation Program. Proposes an expanded set of geometric and photometric transformations for visual prompting and introduces methods to reduce overfitting, improving robustness and accuracy.

Finally, there is the contribution from NTT DATA and other collaborators, which falls under a heading of security, provenance and trustworthiness, topics of keen concern to the enterprise customers of this division of NTT:

·       “Breaking Distortion-free Watermarks in Large Language Models.” NTT DATA, J.P. Morgan, UCLA. Lock-LLM Workshop. By adversarial probing of current watermark approaches, one can recover secret keys and token permutations and generate text that passes watermark detection. This paper is in effect a wake-up call, revealing existing vulnerabilities.

“With the EU AI Act mandating watermarking for all AI-generated content, the topic has become increasingly urgent,” said Shayleen Reynolds, NTT DATA director and AI lead. “This work shows that even state-of-the-art watermarking schemes can be reverse engineered, revealing significant risks for enterprises that depend on watermarking for content provenance, plagiarism detection, copyright protection, and output authenticity. The findings expose a foundational vulnerability in today’s AI trust and traceability mechanisms.”

NeurIPS 2025 marks the 39th year of this event. The multi-track interdisciplinary annual meeting includes invited talks, demonstrations, symposia, and oral and poster presentations of refereed papers. Accompanying the conference is a professional exposition focusing on machine learning in practice, tutorials and topical workshops that provide a less formal setting for the exchange of ideas.

Lumia, Agentic AI Security platform, Secures $18M Seed Round, Appoints Former NSA Director to Head the Advisory Board

Lumia is an agentic AI Security and Governance platform that gives enterprises visibility and control over employee use of AI and autonomous agents. Lumia lets organizations adopt AI safely with seamless, network-level guardrails.

Lumia, an AI Security and Governance platform helping enterprises safely adopt AI and autonomous agents, announced an $18 million seed round led by Team8 with support from New Era. As part of the announcement, the company is appointing Admiral Michael Rogers, former Director of the NSA and Commander of U.S. Cyber Command, to its Advisory Board.

Enterprise adoption of AI agents is accelerating rapidly. A recent Gartner report forecasts that 40% of enterprise applications will embed task-specific AI agents over the next year, up from less than 5% in 2025. These agents already draft contracts, analyze sensitive data, schedule meetings, and trigger automated actions, often outside traditional IT controls. This shift introduces new challenges around oversight, accountability, and safety.

Lumia closes this governance gap by deeply understanding context, content, intent, and action across every AI interaction, whether it’s a human employee using an AI tool or an autonomous agent acting on their behalf. The platform continuously evaluates AI exposure risks, enforces dynamic policies, and provides full visibility into, and control over, what agents do, under which permissions, and across which systems. With support for thousands of AI applications and an architecture that integrates at the network gateways, Lumia delivers infrastructure-native, agentless governance with no endpoint modifications required.

“Autonomous AI is accelerating faster than most organizations are prepared for,” said Admiral Michael Rogers, former Director of the NSA and new advisory board member at Lumia. “Enterprises need early visibility, clear guardrails, and a framework for accountability before these systems become embedded in every workflow.”

Founded by Omri Iluz (former Co-Founder and CEO of PerimeterX, acquired by HUMAN Security) and Bobi Gilburd (former CTO of Unit 8200), Lumia will use the funding to expand engineering and research, deepen integrations with major AI ecosystems and enterprise infrastructure, and scale its go-to-market efforts with design-partner customers in financial services, technology, and other data-sensitive industries.

“The pressure on CISOs is huge – they cannot afford to be the ones pulling back the business on the greatest productivity boost in this century,” said Omri Iluz, Lumia’s Co‑founder & CEO. “However, AI introduces risks that the business just cannot afford. Lumia allows enterprises to adopt AI securely and responsibly, allowing broad usage while putting seamless controls in place.”

“Lumia distinguishes itself from standard SASE offerings by offering a deeper, more specialized approach to AI security, focusing on two key areas: Inspection Depth and Coverage Velocity.” said Francis Odum, founder and CEO of SACR. “Lumia achieves faster support for new AI modalities and native applications through automated protocol analysis.”