Best of 2024 in Vision [LS Live @ NeurIPS]

Latent Space Podcast
Science & Technology · 3 min read · 56 min video
Dec 22, 2024
TL;DR

2024 vision highlights: Sora for video generation, SAM 2 for segmentation, DETRs for object detection, and LLMs gaining vision.

Key Insights

1

Sora revolutionized video generation, enabling high-resolution, long-duration clips through diffusion transformers and massive compute.

2

SAM 2 extends image segmentation to video, offering plug-and-play capability for tracking objects across frames efficiently.

3

DETRs have surpassed YOLO in real-time object detection, offering better accuracy at similar latencies thanks to pre-training and Transformer architectures.

4

LLMs struggle with fine-grained visual details, as shown by the MMVP benchmark, highlighting limitations in current vision-language models.

5

Florence-2 and PaliGemma 2 show advances in integrating spatial hierarchy and semantic granularity for vision-language tasks.

6

AIMv2 offers a promising approach for combining image and text tokens with decoder-only transformers, potentially scaling better for vision tasks.

TRANSITION TO VIDEO GENERATION WITH SORA

The year 2024 saw a major shift from per-image to video-based models, with OpenAI's Sora marking a significant leap. Building on prior work like MAGVIT for video tokenization, Sora generated 1080p, minute-long videos with impressive realism, including reflections and detailed textures. Replication efforts like Open-Sora utilized MAGVIT-v2 and diffusion transformers, highlighting the critical role of temporal compression and vast computational resources. Rectified flows also emerged as a faster alternative to traditional DDPM sampling.
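To see why rectified flows allow far fewer sampling steps than DDPM, here is a minimal toy sketch (not Sora's or any real model's code): rectified flows learn a velocity field along near-straight paths between noise and data, so coarse Euler integration, even a single step, can land on the target. The `velocity` function below is the idealized field for a single data point, an assumption made purely for illustration.

```python
import numpy as np

# Toy rectified-flow sampler (illustrative only). The flow follows the
# straight path x_t = (1 - t) * x0 + t * x1 between noise x0 and data x1,
# so Euler integration with any step count reaches x1 exactly.

def velocity(x, t, x1):
    # Idealized velocity for a single data point: points straight at the data.
    return (x1 - x) / (1.0 - t)

def sample(x0, x1, n_steps):
    x, t = x0, 0.0
    dt = 1.0 / n_steps
    for _ in range(n_steps):
        x = x + dt * velocity(x, t, x1)
        t += dt
    return x

x0 = np.array([5.0, -3.0])   # "noise"
x1 = np.array([1.0, 2.0])    # "data"
one_step = sample(x0, x1, n_steps=1)     # a single Euler step suffices
many_step = sample(x0, x1, n_steps=50)   # matches the one-step result
```

In a trained model the paths are only approximately straight, which is why practical samplers still use a handful of steps rather than literally one.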

SAM 2: EXTENDING SEGMENTATION TO VIDEO

Building on the success of the Segment Anything Model (SAM), SAM 2 extends its segmentation capabilities to video. This version, featuring a hierarchical encoder for faster inference, allows persistent object tracking across frames: it maintains a memory bank of past-frame features and uses cross-attention over it to generate masks, enabling behaviors like re-acquiring objects that temporarily disappear. A distinctive aspect is its training paradigm, which unifies model development with dataset creation.
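The memory-bank idea can be sketched in a few lines. This is a toy with made-up dimensions and hypothetical names, not SAM 2's actual architecture: current-frame tokens cross-attend over banked past-frame features, so the downstream mask decoder sees memory-conditioned features.

```python
import numpy as np

# Toy sketch of SAM-2-style memory conditioning (illustrative dimensions).
rng = np.random.default_rng(0)
D = 16  # feature dimension (toy)

def cross_attend(queries, memory):
    # queries: (Nq, D) current-frame tokens; memory: (Nm, D) banked tokens.
    scores = queries @ memory.T / np.sqrt(D)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over memory
    return weights @ memory                         # memory-weighted features

memory_bank = []                         # rolling store of past-frame features
for frame in range(3):
    feats = rng.normal(size=(8, D))      # stand-in for image-encoder output
    if memory_bank:
        # Condition the current frame on everything in the bank.
        feats = feats + cross_attend(feats, np.concatenate(memory_bank))
    memory_bank = (memory_bank + [feats])[-2:]   # keep only recent frames

conditioned = memory_bank[-1]            # (8, D): input to the mask decoder
```

Because the bank persists across frames, an object occluded in one frame can still be matched against its remembered features later, which is the mechanism behind "following disappearing objects."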

DETRS REVOLUTIONIZE REAL-TIME OBJECT DETECTION

Object detection saw a significant challenge to YOLO's long-standing dominance with the rise of DETRs. Papers like RT-DETR, LW-DETR, and D-FINE have pushed performance boundaries. RT-DETR introduced an efficient Transformer encoder for multi-scale features, matching YOLO's speed. LW-DETR highlighted the outsized benefit of pre-training for DETRs, a gain less pronounced in YOLO models. D-FINE further refined these by incorporating advanced loss functions, leading to competitive accuracy at low latencies.
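A defining ingredient of the DETR family is set-based training: each ground-truth box is matched to exactly one query's prediction by solving a bipartite assignment, removing the need for NMS. The sketch below illustrates this with a bare-bones L1 cost (real DETRs also mix in classification and IoU terms); the boxes are made up for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy Hungarian matching as used in DETR-style detectors.
preds = np.array([[0.1, 0.1, 0.3, 0.3],    # 4 predicted boxes (x1, y1, x2, y2)
                  [0.6, 0.6, 0.9, 0.9],
                  [0.4, 0.1, 0.5, 0.2],
                  [0.0, 0.5, 0.2, 0.8]])
gts = np.array([[0.58, 0.62, 0.88, 0.92],  # 2 ground-truth boxes
                [0.12, 0.08, 0.31, 0.33]])

# Pairwise L1 cost between every prediction and every ground truth: (4, 2).
cost = np.abs(preds[:, None, :] - gts[None, :, :]).sum(-1)
pred_idx, gt_idx = linear_sum_assignment(cost)   # minimum-cost assignment
matches = dict(zip(gt_idx, pred_idx))            # {gt index: prediction index}
# gt 0 pairs with prediction 1, gt 1 with prediction 0; unmatched queries
# are trained to predict "no object".
```

This one-to-one matching is what makes DETR losses "set" losses, and it is the part that refinements like D-FINE's loss-function work build on.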

LLMS STRUGGLE WITH FINE-GRAINED VISION

Despite these advances, large language models (LLMs) with vision demonstrate a critical limitation: they often fail to perceive the fine-grained visual details necessary for tasks like telling time from a watch. The MMVP benchmark reveals that models whose vision encoders come from contrastive training like CLIP's, while good at matching images and captions, lack the detailed feature extraction needed for such tasks. This highlights a gap in their visual understanding, even after fine-tuning.
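The CLIP hypothesis can be made concrete with a toy version of its contrastive objective (illustrative only; real CLIP uses learned encoders and a learned temperature). The objective only asks each image to sit closer to its own caption than to the other captions in the batch, so it never rewards encoding details, like watch-hand positions, that no caption in the batch distinguishes.

```python
import numpy as np

# Toy CLIP-style symmetric contrastive (InfoNCE) loss.
def clip_loss(img_emb, txt_emb, temp=0.07):
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temp                 # (B, B) cosine similarities
    labels = np.arange(len(logits))
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()     # diagonal = correct pairs
    return (xent(logits) + xent(logits.T)) / 2  # image->text and text->image

# Embeddings that are merely caption-level correct already achieve
# near-zero loss; nothing pushes them to encode finer structure.
img = np.array([[1.0, 0.01], [0.0, 1.0]])
txt = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = clip_loss(img, txt)
```

This is why MMVP can find image pairs that CLIP embeds almost identically yet that differ in exactly the detail a question asks about.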

ADVANCEMENTS IN MULTIMODAL VISION-LANGUAGE MODELS

Several models in 2024 focused on bridging the gap between visual perception and language reasoning. Florence-2 introduced concepts of spatial hierarchy and semantic granularity, utilizing diverse annotation types like region-text pairs and text-phrase-region annotations to improve understanding. Following this, PaliGemma 2 employed decoder-only transformers with location tokens and a prefix loss for tasks like segmentation, showing promise as model capacity and resolution increase. AIMv2 proposed a simpler approach using a decoder-only transformer to reconstruct images and captions, demonstrating scalability and improving performance with increased data and resolution.
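Location tokens are a simple but effective trick: box coordinates are normalized, quantized into a fixed number of bins, and emitted as special tokens the decoder-only LM predicts like any other text. The sketch below assumes 1024 bins and a (y1, x1, y2, x2) ordering, both PaliGemma-style conventions; treat the exact format as an assumption rather than the model's spec.

```python
# Toy PaliGemma-style location-token encoding (bin count and coordinate
# order are assumptions for illustration).
N_BINS = 1024

def box_to_loc_tokens(box, img_w, img_h):
    y1, x1, y2, x2 = box
    # Normalize to [0, 1), then quantize into N_BINS integer bins.
    norm = [y1 / img_h, x1 / img_w, y2 / img_h, x2 / img_w]
    bins = [min(N_BINS - 1, int(v * N_BINS)) for v in norm]
    return "".join(f"<loc{b:04d}>" for b in bins)

tokens = box_to_loc_tokens((120, 64, 360, 512), img_w=640, img_h=480)
# -> "<loc0256><loc0102><loc0768><loc0819>"
```

Because boxes become ordinary token sequences, detection and segmentation reduce to next-token prediction, which is exactly what lets a single decoder-only model handle both language and spatial outputs.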

MOONDREAM: TINY MODELS FOR EDGE DEPLOYMENT

Vik Korrapati from Moondream discussed the challenge of deploying vision applications on edge devices. Moondream initially developed a 2B-parameter model but then focused on creating a smaller 0.5B-parameter model through pruning while preserving accuracy. This approach allows developers to build applications with larger models and then distill them for specific deployment targets. A key application demonstrated was reading gauges and clocks, where a chain-of-thought approach, augmented with spelling-based reasoning, improved sample efficiency and interpretability for complex visual tasks.
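One common mechanism for shrinking a larger model toward an edge-sized one is logit distillation; the sketch below is a generic illustration of that idea, not Moondream's actual recipe. The student is trained to match the teacher's softened output distribution via a KL term.

```python
import numpy as np

# Toy logit-distillation loss (generic technique, illustrative values).
def softmax(z, T=1.0):
    z = z / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distill_loss(teacher_logits, student_logits, T=2.0):
    p = softmax(teacher_logits, T)   # softened teacher targets
    q = softmax(student_logits, T)
    # KL(teacher || student): zero when the student matches the teacher.
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = np.array([4.0, 1.0, 0.5])
aligned = distill_loss(teacher, np.array([4.0, 1.0, 0.5]))  # exact match
drifted = distill_loss(teacher, np.array([0.5, 1.0, 4.0]))  # mismatch
```

The temperature `T` softens both distributions so the student also learns the teacher's relative preferences among wrong answers, not just its top choice.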

Common Questions

What were the biggest computer vision trends of 2024?

The biggest trends include the transition from per-image models to video models using similar techniques, and the rise of DETRs such as RT-DETR, LW-DETR, and D-FINE surpassing YOLO in real-time object detection.

Topics

Mentioned in this video

Software & Apps
MAGVIT

A discrete-token video tokenizer used at the end of 2023 for video generation, capable of clips around five seconds long.

Open-Sora

A replication effort for Sora that uses MAGVIT-v2 and demonstrates the benefits of temporal compression in latent spaces.

DDPM

Denoising Diffusion Probabilistic Models, a diffusion framework that traditionally requires many sampling steps for high-quality samples.

LW-DETR

A paper that demonstrated the significant effectiveness of pre-training for DETRs, showing much larger gains than YOLO sees.

GPT-4o

Mentioned as a multimodal model that has become mainstream in the year, highlighting the trend of LLMs incorporating vision capabilities.

Sora

A powerful text-to-video generation model highlighted as the biggest paper of 2024, capable of generating 1080p video up to a minute long.

Stable Video Diffusion

Mentioned as work related to Sora in the context of video generation.

VAE

Variational Autoencoder framework, mentioned as being swapped in for the discretization step in Open-Sora's approach.

Rectified Flows

A newer framework for diffusion models that allows for faster sampling, often with a single step, and is adopted by high-performance models.

D-FINE

A development that improved DETRs by incorporating refined loss functions and other enhancements, bringing them close to 60 AP on COCO.

Claude 3

Included as an example of a multimodal LLM that has become mainstream, showcasing the trend towards vision-language integration.

CLIP

A model used as a vision encoder, hypothesized as a reason why LLMs struggle with fine-grained visual details due to its contrastive training objective.

Gemini

Mentioned as a multimodal model that signifies the mainstream adoption of vision-language models, becoming a major trend in 2024.

DETRs

Detection Transformers, a class of object-detection models showing significant improvements and beginning to surpass YOLO in performance.

DINOv2

A self-supervised foundation model trained purely on image data, which learns fine-grained visual features and is used to identify images difficult for LLMs.

ChatGPT

Mentioned as an example of an LLM that fails to perceive fine-grained visual details, such as watch hands.

Florence-2

A model that aims to incorporate spatial hierarchy and semantic granularity for better vision-language understanding, achieving strong results on COCO.

Gemma

The language model used in PaliGemma, with PaliGemma 2 utilizing multiple sizes of Gemma backbones.

Moondream

A developer-focused vision-language model, with capabilities for developers building vision applications on edge devices.

SAM 2

An extension of the SAM strategy applied to video, enabling object tracking and segmentation across frames.

DALL-E 3

Mentioned for its technique of training a diffusion model on an LLM-captioned corpus, a method also relevant for Sora.

Diffusion Transformer

A type of Transformer architecture used in diffusion models for video generation, noted for its performance with increased compute.

RT-DETR

A 2024 paper that improved detector architecture by feeding decoupled multi-scale features into an efficient Transformer encoder, matching YOLO's speed.

LLaVA

A vision-language model that performs poorly on the MMVP benchmark, showing negative correlation, likely due to its CLIP initialization and short training.

PaliGemma

A model that uses a decoder-only Transformer and location tokens for pixel-space understanding, with PaliGemma 2 offering increased capacity.

AIMv2

A model that simplifies combining image and text tokens by autoregressively learning image tokens with a mean-squared-error reconstruction objective.

Llama 3.2

Cited as an example of a vision-language model that has become mainstream, contributing to the overall trend of multimodality in AI.
