SAM 3: The Eyes for AI — Nikhila & Pengchuan (Meta Superintelligence), ft. Joseph Nelson (Roboflow)
Key Moments
SAM 3 enables AI to understand and segment any concept in images and videos using natural language prompts.
Key Insights
SAM 3 introduces 'concept segmentation,' which lets users prompt with short natural-language phrases like 'yellow school bus'.
It can detect, segment, and track objects in both images and videos, even new instances appearing over time.
SAM 3 significantly improves data-labeling efficiency; the SAM family is estimated to have saved hundreds of years of human annotation work.
The model is both fast and accurate, running in roughly 30 milliseconds per image on high-end hardware while handling complex, crowded scenes.
SAM 3 is designed to be a visual agent for LLMs, enhancing their understanding and grounding capabilities in multimodal AI.
Meta actively leverages and contributes to the open-source community, with SAM 3 building upon previous versions and community contributions.
THE EVOLUTION OF SEGMENT ANYTHING (SAM)
The Segment Anything (SAM) project from Meta has consistently pushed the boundaries of computer vision. Starting with SAM 1's massive 11-million-image data engine and progressing to SAM 2's memory-based video tracking, each iteration has redefined possibilities. Now, SAM 3 introduces 'concept segmentation,' a leap forward enabling AI to understand and segment objects based on natural language prompts, marking a significant advancement in AI's visual perception capabilities.
SAM 3'S CAPABILITIES AND DEMONSTRATIONS
SAM 3 can detect, segment, and track objects in images and videos using 'concept prompts,' which are short text phrases like 'watering can.' The model also allows for prompt refinement using clicks or visual examples to ensure accuracy. This capability extends to video, where SAM 3 can identify objects in the first frame and then track them, even detecting new instances that appear later. This enables applications in video editing, special effects, and advanced content creation.
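To make the prompting flow concrete, here is a minimal sketch of concept-prompted segmentation in code. Everything here is hypothetical scaffolding (the `Sam3Stub` class, `ConceptPrompt` type, and `segment_image` method are placeholders, not the released API); it only mirrors the interaction pattern described above: start from a short text phrase, then refine with visual exemplars.

```python
# Hypothetical sketch of SAM 3's concept-prompt flow. `Sam3Stub`,
# `ConceptPrompt`, and `segment_image` are placeholders, NOT the real API.
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels

@dataclass
class ConceptPrompt:
    phrase: str                                              # e.g. "watering can"
    exemplar_boxes: List[Box] = field(default_factory=list)  # positive visual examples
    negative_boxes: List[Box] = field(default_factory=list)  # regions to exclude

class Sam3Stub:
    def segment_image(self, image, prompt: ConceptPrompt) -> list:
        """Stand-in for real inference: one mask + confidence per instance."""
        return []

model = Sam3Stub()
prompt = ConceptPrompt(phrase="yellow school bus")
masks = model.segment_image(image=None, prompt=prompt)

# Refinement round: add an exemplar box for a missed instance and re-run.
prompt.exemplar_boxes.append((40, 60, 220, 180))
masks = model.segment_image(image=None, prompt=prompt)
```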
PERFORMANCE, SPEED, AND ARCHITECTURAL INNOVATIONS
A key highlight of SAM 3 is its impressive speed, with an inference time of 30 milliseconds for a single image on powerful hardware, making real-time applications feasible. The model utilizes a decoupled detector and tracker architecture, sharing a perception encoder as a visual backbone. This design choice addresses the distinct needs of detection (identity-agnostic) and tracking (identity-preserving), integrating components from Meta's broader AI ecosystem, including LLaMA, for a more unified approach.
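The decoupling can be pictured structurally as follows. This is an illustrative sketch only, with hypothetical class names: one shared backbone feeds an identity-agnostic detector and an identity-preserving tracker, matching the division of labor described above.

```python
# Illustrative sketch of the decoupled design: shared visual backbone,
# identity-agnostic detector, identity-preserving tracker. All names
# are hypothetical, not the actual implementation.

class PerceptionEncoder:
    """Shared visual backbone: frame -> feature embeddings."""
    def encode(self, frame):
        return frame  # placeholder

class Detector:
    """Identity-agnostic: finds every instance of the concept, per frame."""
    def detect(self, features, phrase):
        return []  # [(box, mask, score), ...] for this frame only

class Tracker:
    """Identity-preserving: associates detections with existing tracks."""
    def __init__(self):
        self.tracks = {}  # track_id -> list of per-frame detections

    def update(self, detections):
        # Naive association by index; a real tracker matches by
        # appearance/overlap and spawns tracks for newly appearing instances.
        for i, det in enumerate(detections):
            self.tracks.setdefault(i, []).append(det)
        return self.tracks

encoder, detector, tracker = PerceptionEncoder(), Detector(), Tracker()
for frame in []:  # video frames would be iterated here
    features = encoder.encode(frame)
    detections = detector.detect(features, phrase="watering can")
    tracks = tracker.update(detections)
```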
THE SA-CO DATASET AND THE IMPORTANCE OF DATA ENGINEERING
The development of SAM 3 relied heavily on a novel data engine and the creation of the SA-Co (Segment Anything with Concepts) dataset, which features over 200,000 unique concepts, vastly expanding on previous benchmarks. The team emphasized the critical role of data engineering in AI, automating large parts of the annotation pipeline. The dataset's annotations are more than 70% negatives, training the model to recognize when a concept is absent from an image as well as where it is present, a crucial step for accurate recognition and localization.
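One way to see why negative annotations matter: they let the model learn that a concept is absent, separately from where it is when present. The loss sketch below is a hypothetical illustration of that split (a presence head supervised on all examples, localization supervised only on positives), not the paper's actual objective.

```python
# Hypothetical illustration: negatives ("concept absent") supervise a
# presence head, while localization is trained only on positives.
# This is NOT SAM 3's actual training objective.
import torch
import torch.nn.functional as F

def concept_loss(presence_logit, mask_logits, target_masks, concept_present):
    """presence_logit: scalar logit that the concept appears at all.
       mask_logits/target_masks: per-pixel logits and labels (positives only)."""
    presence_target = torch.tensor(1.0 if concept_present else 0.0)
    loss = F.binary_cross_entropy_with_logits(presence_logit, presence_target)
    if concept_present:
        # Localization is only supervised when the concept is present.
        loss = loss + F.binary_cross_entropy_with_logits(mask_logits, target_masks)
    return loss

# A negative example: "zebra" prompted on an image containing no zebras.
neg_loss = concept_loss(torch.tensor(0.3), None, None, concept_present=False)
```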
REAL-WORLD IMPACT AND APPLICATIONS
The impact of the SAM family of models is evident across various industries, as highlighted by Roboflow. Millions of Smart Polygon annotations are SAM-powered, saving significant human time in data curation. Applications range from medical research (counting neutrophils) and autonomous navigation (underwater trash-cleaning robots) to industrial automation and insurance estimation. The model's ability to understand a vast array of concepts is accelerating progress in numerous fields.
SAM 3 AS A VISUAL AGENT FOR LLMS
SAM 3 is positioned as a powerful visual agent for Large Language Models (LLMs), enhancing their ability to understand and interact with visual information. While SAM 3 focuses on atomic concepts, its integration with LLMs allows for more complex reasoning and 'visual grounding.' This synergy enables LLMs to perform tasks like distinguishing features between images or understanding nuanced visual descriptions, moving towards more capable multimodal AI systems.
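As a concrete picture of that synergy, here is a hedged sketch of the tool-call pattern: an LLM decomposes a complex referring query into atomic noun phrases, calls SAM 3 on each, and then reasons over the grounded results. Both function bodies are placeholders for real model calls.

```python
# Hedged sketch of SAM 3 as a visual tool for an LLM agent.
# `llm_decompose` and `sam3_segment` are placeholders for real calls.

def llm_decompose(query: str) -> list[str]:
    """Pretend LLM call: complex query -> atomic concept phrases."""
    # e.g. "the person holding a red umbrella" ->
    return ["person", "red umbrella"]

def sam3_segment(image, phrase: str) -> list[dict]:
    """Pretend SAM 3 call: every instance of one atomic concept."""
    return []  # [{"mask": ..., "box": ..., "score": ...}, ...]

def visual_ground(image, query: str) -> dict:
    phrases = llm_decompose(query)
    detections = {p: sam3_segment(image, p) for p in phrases}
    # The LLM would then reason over relations between the per-concept
    # detections (e.g. which person's box overlaps the umbrella's box).
    return detections
```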
FINE-TUNING AND ADAPTATION FOR SPECIFIC DOMAINS
The ability to fine-tune SAM 3 is crucial for adapting it to specific domains and user definitions. Even with a few data points or a small number of negative examples, SAM 3 can learn to recognize new concepts or differentiate between similar ones, such as 'Waymos' versus generic 'vehicles.' This fine-tuning capability is vital in specialized fields such as medical imaging, where precise segmentation of cells or tissues is required, and demonstrates the model's flexibility.
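A minimal sketch of what that few-shot adaptation could look like, assuming a fine-tunable checkpoint exposing a loss like the one sketched earlier; the data format and `model.concept_loss` call are hypothetical.

```python
# Hypothetical few-shot fine-tuning loop: narrowing "Waymo" away from
# the generic "vehicle" using a few positives plus hard negatives.
# `model.concept_loss` and the example format are illustrative only.
import torch

examples = [
    # (image, phrase, target_masks_or_None, concept_present)
    ("img_waymo_1", "Waymo", "mask_1", True),
    ("img_waymo_2", "Waymo", "mask_2", True),
    ("img_sedan", "Waymo", None, False),  # hard negative: ordinary car
]

def fine_tune(model, examples, steps=50, lr=1e-5):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(steps):
        for image, phrase, masks, present in examples:
            loss = model.concept_loss(image, phrase, masks, present)
            opt.zero_grad()
            loss.backward()
            opt.step()
```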
ADVANCEMENTS IN VIDEO UNDERSTANDING
While images are a core focus, SAM 3 also makes strides in video understanding. Features like the masklet detection score improve temporal smoothing and stability across video sequences. Video processing still presents challenges, particularly efficient end-to-end training and scaling automated data pipelines, leaving significant room for further research, with applications in robotics and complex action recognition.
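To illustrate the kind of stability a per-masklet detection score enables, here is a toy exponential-moving-average smoother; SAM 3's actual mechanism is more involved, so treat this purely as an illustration of why temporal smoothing keeps tracks from flickering in and out.

```python
# Toy illustration: smoothing a masklet's per-frame detection score so a
# single noisy frame (e.g. brief occlusion) doesn't kill the track.
# Not SAM 3's actual mechanism.

def smooth_scores(per_frame_scores, alpha=0.7, keep_threshold=0.5):
    smoothed, alive = [], []
    s = per_frame_scores[0]
    for score in per_frame_scores:
        s = alpha * s + (1 - alpha) * score  # exponential moving average
        smoothed.append(s)
        alive.append(s >= keep_threshold)    # does the masklet survive?
    return smoothed, alive

# One low raw score (0.2) no longer deletes the track:
_, alive = smooth_scores([0.9, 0.85, 0.2, 0.88, 0.9])
print(alive)  # [True, True, True, True, True]
```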
ROBOFLOW'S ROLE IN THE SAM ECOSYSTEM
Roboflow plays a key role in democratizing access to models like SAM. They provide infrastructure for deploying SAM 3, enabling zero-shot capabilities, fine-tuning, and automated data labeling. Their platform empowers developers to integrate SAM into their applications, offering tools for everything from data collection to model deployment. This support is critical for accelerating the adoption and use of SAM across a wide range of visual AI tasks.
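As one example of the automated-labeling workflow this unlocks, here is a hedged sketch: run zero-shot SAM 3 over an unlabeled image folder, keep high-confidence masks as draft labels, and queue the rest for human review. The `sam3_segment` call is the same kind of placeholder used above; this does not depict Roboflow's actual API.

```python
# Hedged auto-labeling sketch: zero-shot predictions become draft labels,
# low-confidence images go to human review. `sam3_segment` is a
# placeholder, not Roboflow's actual API.
from pathlib import Path

def sam3_segment(image_path, phrase):
    return []  # placeholder: [{"mask": ..., "score": ...}, ...]

def auto_label(folder: str, phrase: str, accept: float = 0.8):
    drafts, review_queue = {}, []
    for img in sorted(Path(folder).glob("*.jpg")):
        preds = sam3_segment(img, phrase)
        confident = [p for p in preds if p["score"] >= accept]
        if confident:
            drafts[img.name] = confident   # accept as draft labels
        else:
            review_queue.append(img.name)  # send to a human
    return drafts, review_queue
```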
THE FUTURE OF COMPUTER VISION AND AGI
SAM 3 represents a significant step towards Artificial General Intelligence (AGI) by equipping AI with robust visual perception. The ongoing research focuses on seamless integration of visual and language models, moving beyond tool-call interactions to native embedding. The goal is to create AI that can reason and perceive as effectively as humans, addressing complex tasks that currently require human intervention, ultimately pushing the boundaries of what AI can achieve.
Common Questions
What is SAM 3?
SAM 3 is Meta's latest image and video understanding model. It introduces concept prompts, letting users find objects from text descriptions rather than the manual clicking required in SAM 1 and SAM 2. It also offers improved speed and can detect and track objects across video frames.
Mentioned in this video
An earlier open-source model from Meta that Roboflow has been a believer in, cited as a precursor to SAM.
A platform for scientific research publications, mentioned as a place where work citing SAM is published.
A type of neural network architecture for vision tasks that has surpassed CNNs in many areas.
Microsoft Research, where Pengchuan previously worked before moving to Meta.
Likely refers to a specific type of dataset or benchmark used in relation to SAM 3's development.
An institute visited by Pengchuan where SAM was being used for imaging human cells.
Likely a typo or mishearing, but context suggests a biological or medical dataset or research area related to cell imaging.
Vision-Language-Action models, a class of models relevant to tasks users might want to perform by combining SAM 3 with LLMs.
Reinforcement Learning from Human Feedback, a training paradigm mentioned as a way to go beyond human performance.
Joseph Nelson's company, focused on making the world programmable with AI and providing tools for developers to build and deploy models.
A product/feature from Roboflow that demonstrates real-time object detection per frame without needing to preserve unique identities.
Abbreviation for the Chan Zuckerberg Initiative, associated with the imaging institute visited by Pengchuan.
Likely refers to 'locus of movement' or a similar biological term, mentioned in the context of segmenting cells.
An interactive environment to try out SAM 3 and build with it.
A dataset commonly used in computer vision, mentioned in the context of creating a new benchmark for SAM 3.
Mentioned as an LLM used in some of the experiments for SAM 3's agent setup.
Supervised Fine-Tuning, a training method mentioned in comparison to RLHF.
Roboflow's Detection Transformer model, designed for real-time segmentation and detection on edge devices.
SA-Co, the benchmark created by the Meta team for Segment Anything with Concepts, containing over 200,000 unique concepts.
A shared visual backbone component used in SAM 3, developed by the FAIR group at Meta.
Likely a mishearing or typo, context suggests an organization related to AI safety or standards.
A library Nikhila worked on briefly before focusing on Segment Anything.
An older model used as a comparison point to demonstrate the advancements made by SAM 3 in visual tasks.
A type of GPU mentioned in the context of real-time transformer performance on edge devices.
An LLM mentioned in Table 8 of the paper regarding SAM 3's agent performance.
A version of LLaMA that was fine-tuned for verification tasks in the data engine.
Mentioned as a device Pengchuan worked on egocentric foundation models for before joining the SAM team.
Agents that leverage large language models for complex reasoning and task execution, for which SAM 3 serves as a visual component.
Another open-source model from Meta that Roboflow has supported, mentioned alongside Mask R-CNN.
A model mentioned in the annotation pipeline for generating captions and non-face masks.
Convolutional Neural Networks, mentioned as a vision task architecture that Transformers have surpassed.
An aquarium in the US that uses underwater cameras and SAM for species tracking and population monitoring.