Molmo 2: State-of-the-art video understanding, pointing, and tracking

Last year, Ai2 released Molmo, a family of open multimodal models that helped redefine image understanding. Molmo pioneered image pointing, and the technique has since been widely adopted across the industry. The models were downloaded millions of times and deployed across research, education, and commercial applications, while the accompanying PixMo dataset became a go-to resource for teams looking to replace larger but noisier corpora with high-quality captioning data.

Today, we're releasing Molmo 2, which brings that same impact to video.

Molmo 2 expands Molmo’s strengths in grounded vision to video and multi-image understanding. Where Molmo could show exactly what it was attending to in an image, Molmo 2 extends that grounding to video: pinpointing events, tracking objects across frames, and grounding answers in visual evidence rather than text alone.

Ask “How many times does the robot grasp the red block?” and the model returns points and timestamps for each grasp event. Ask “When did the cup fall?” and it returns the timestamp and location. Ask “Which player scored the goal?” and it identifies and locates that player in the relevant frames. These grounded outputs unlock capabilities like counting-by-pointing, multi-object tracking with persistent IDs that follow objects across occlusions, dense video captioning, and artifact detection.
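To make that concrete, here is a minimal sketch of what consuming such grounded output could look like downstream. The JSON schema (per-point `t`, `x`, `y`, `track_id`, and `label` fields) is an illustration of the kind of structure described above, not the model's actual output format; check the released model cards and technical report for the real interface.

```python
# Hypothetical example: consuming grounded video output.
# The schema below (points with timestamps and persistent track IDs)
# is an assumed illustration, not Molmo 2's documented format.
import json
from collections import defaultdict

raw = """
{
  "answer": "The robot grasps the red block 3 times.",
  "points": [
    {"t": 2.4,  "x": 0.31, "y": 0.58, "track_id": 1, "label": "grasp"},
    {"t": 7.9,  "x": 0.45, "y": 0.52, "track_id": 1, "label": "grasp"},
    {"t": 13.2, "x": 0.29, "y": 0.61, "track_id": 1, "label": "grasp"}
  ]
}
"""

response = json.loads(raw)

# Counting-by-pointing: each returned point is one grasp event.
print(f"{len(response['points'])} grasp events")

# Persistent track IDs let downstream code group points into object
# tracks, even when an object is occluded between frames.
tracks = defaultdict(list)
for p in response["points"]:
    tracks[p["track_id"]].append((p["t"], p["x"], p["y"]))

for track_id, pts in tracks.items():
    times = ", ".join(f"{t:.1f}s" for t, _, _ in pts)
    print(f"object {track_id}: events at {times}")
```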

Molmo 2 (8B) outperforms the original Molmo (72B) on key image pointing and grounding benchmarks, delivering stronger localization and reasoning in a package nine times smaller. On video tracking, Molmo 2 outperforms Gemini 3 Pro along with strong open-weight alternatives, making it the top overall tracker across domains in our evaluations. And Molmo 2 achieves all this while training on roughly one-eighth the video data used by Meta’s PerceptionLM (9.19 million videos versus 72.5 million), demonstrating the power of careful curation and grounding-focused objectives.

Three variants serve different needs:

  • Molmo 2 (8B), based on Qwen 3, is our best overall model for video grounding and QA
  • Molmo 2 (4B), also Qwen 3-based, is optimized for efficiency
  • Molmo 2-O (7B) offers a fully open end-to-end flow built on Olmo for researchers who want complete control over every part of the stack

You can try Molmo 2 in the Ai2 Playground, where video and multi-image workflows have been enabled. Upload a clip or up to six images, run video summarization, counting, or grounded QA, and see exactly where the model is looking as it answers.
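If you would rather run the model locally, here is a minimal inference sketch that assumes Molmo 2 follows the same Hugging Face loading pattern as the original Molmo release (`AutoProcessor`/`AutoModelForCausalLM` with `trust_remote_code`). The repo id is a placeholder, and the exact processor and video-input interface may differ, so treat this as a starting point and consult the model card.

```python
# Minimal local-inference sketch, assuming Molmo 2 loads like the
# original Molmo release. The repo id below is a placeholder and the
# multi-image/video interface is an assumption; see the model card.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

MODEL_ID = "allenai/Molmo-2-8B"  # placeholder repo id

processor = AutoProcessor.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

# Multi-image input: pass up to six frames sampled from a clip.
frames = [Image.open(f"frame_{i}.jpg") for i in range(6)]
inputs = processor.process(
    images=frames,
    text="How many times does the robot grasp the red block?",
)
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=256, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)

# Decode only the newly generated tokens, not the prompt.
generated = output[0, inputs["input_ids"].size(1):]
print(processor.tokenizer.decode(generated, skip_special_tokens=True))
```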

Open, state-of-the-art models for video understanding are critical for building systems that anyone can reuse, customize, and improve. We invite you to download the models and datasets, dig into the technical report, and tell us what you build—your feedback will help shape the next generation of Molmo.