Key Moments

Stanford CS25: Transformers United V6 | From Representation Learning to World Modeling

Stanford Online
Education · 5 min read · 72 min video
Apr 22, 2026 · 4,721 views
TL;DR

World models are shifting from pixel-level generation to latent-space forecasting: JEPA predicts future states in an abstract representation space, sidestepping irreducible pixel-level uncertainty, while object-centric masking proves crucial for learning interaction dynamics.

Key Insights

1

The core of JEPA lies in predicting future states within a latent space, rather than directly generating pixel-level outputs, which helps in handling world uncertainty and focusing on meaningful predictions.

2

Causal JEPA introduces object-centric representations and masking to understand object interactions, demonstrating significant gains (28%) over baselines in tasks requiring counterfactual reasoning and dynamic understanding.

3

The 'lower model' offers simple, end-to-end JEPA training from raw pixels with a single hyperparameter, achieving planning speeds up to 50 times faster than baselines like DynaMo.

4

Ablations underline the role of object masking: causal JEPA's 28% gain over the unmasked variant shows that masking is what pushes the model to learn object dynamics rather than simple correlations.

5

The 'lower model' uses a regularization technique called 'c-reg' to push latent embeddings toward an isotropic Gaussian distribution, preventing representation collapse and improving planning efficiency.

6

While causal JEPA and lower models show promise, limitations include short-term planning horizons, difficulty with occlusions in object-centric representations, and the need for better goal specification methods.

The paradigm shift from generative to predictive world models

Traditional world models, often autoregressive, predict the next state based on the previous one. However, the world is filled with inherent uncertainty. To address this, world models are evolving to incorporate actions, functioning like simulators that take a state and an action to predict the next state. Key components for improving these simulators include developing good state representations, understanding underlying physical rules for a robust transition model, and having a dynamics model that reacts appropriately to actions. While generative models aim to reconstruct input spaces, Joint Embedding Predictive Architectures (JEPA) focus on learning representations in an abstract, latent space. This shift is driven by the idea that for many tasks, like autonomous driving, modeling every minute detail (e.g., tree leaves) is unnecessary and computationally inefficient. JEPA's approach, by encoding observations into a latent space and predicting future states there, aims for more meaningful and human-like predictions, focusing solely on relevant information.
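The simulator view above — a state and an action in, the next state out — can be made concrete. A minimal PyTorch sketch follows; every module name, layer size, and the toy CNN encoder are illustrative assumptions, not the lecture's code:

```python
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    """Action-conditioned latent world model: encode pixels to a state,
    then predict the next state from (state, action)."""

    def __init__(self, latent_dim=256, action_dim=8):
        super().__init__()
        # Pixels -> latent state (tiny CNN for illustration)
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )
        # (state, action) -> next latent state
        self.transition = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 512), nn.ReLU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, obs, action):
        state = self.encoder(obs)
        return self.transition(torch.cat([state, action], dim=-1))

model = LatentWorldModel()
next_state = model(torch.randn(1, 3, 64, 64), torch.randn(1, 8))
```

The point of the interface is that prediction happens entirely in the latent space: the decoder back to pixels that a generative model would need is simply absent.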

JEPA's latent space prediction and collapse prevention

JEPA fundamentally differs from generative models by encoding both the current and future states into a latent space and comparing representations there, rather than directly generating future pixels. This latent-space comparison aligns with the goal of predicting meaningful future information while acknowledging inherent uncertainties that cannot be precisely predicted at a pixel level. JEPA is conceptualized as an energy-based model, assigning high energy to incompatible state-future pairs and low energy to compatible ones. A significant challenge in JEPA, and energy-based models generally, is the risk of 'collapse,' where the model trivializes predictions, often by outputting a constant value for all inputs. Techniques to prevent this collapse include contrastive learning and regularization-based methods. Projects like V-JEPA use EMA target encoders with stop-gradients, combined with temporal masking, to create a well-defined energy landscape and avoid trivial solutions. DynaMo further explores using pre-trained encoders and auxiliary variables like actions to predict future state representations in the latent space.
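The collapse-avoidance recipe described here — predict in latent space, stop gradients through the target encoder, update it by EMA — can be sketched in a few lines. Everything below is a toy with made-up shapes and flat inputs, not V-JEPA's actual implementation:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(128, 64)             # online encoder (toy: flat inputs)
predictor = nn.Linear(64, 64)            # predicts the future latent
target_encoder = copy.deepcopy(encoder)  # EMA copy, never backpropped
for p in target_encoder.parameters():
    p.requires_grad_(False)

def jepa_step(x_now, x_future, ema=0.99):
    pred = predictor(encoder(x_now))
    with torch.no_grad():                # stop-gradient on the target branch
        target = target_encoder(x_future)
    loss = F.mse_loss(pred, target)      # compare representations, not pixels
    loss.backward()
    # EMA update of the target encoder toward the online encoder
    with torch.no_grad():
        for p_t, p_o in zip(target_encoder.parameters(), encoder.parameters()):
            p_t.mul_(ema).add_(p_o, alpha=1 - ema)
    return loss.item()

loss = jepa_step(torch.randn(4, 128), torch.randn(4, 128))
```

Because the target branch receives no gradient and only trails the online encoder, the trivial constant-output solution stops being a stable minimum of the loss.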

Causal JEPA: Understanding object interactions through object-centricity and masking

Causal JEPA aims to tackle object interaction and dynamics by moving beyond patch-based representations to an object-centric approach. Instead of predicting raw pixels, it endeavors to understand how individual objects influence each other. This involves learning object-centric representations using mechanisms like slot attention, where features are bound to specific slots representing objects, enabling a more structured understanding. The core innovation lies in object masking: by masking certain object representations and forcing the model to predict them based on the context of other objects and history, it compels the model to learn predictive relationships. This is analogous to reasoning about a monkey eating a banana – if the banana is hidden, the model must infer its state from the monkey's actions. This approach forces the model to learn influence neighborhoods and predictive dependencies, moving beyond simple correlations.
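The slot-masking idea can be illustrated with a toy tensor manipulation. Shapes are invented, and the zero mask token is a simplification (in practice the mask token would be learned):

```python
import torch

B, K, D = 2, 4, 32            # batch, object slots, slot dimension
slots = torch.randn(B, K, D)  # slot-attention outputs for one frame
mask_token = torch.zeros(D)   # learned in real systems; zeros here

masked_idx = torch.randint(0, K, (B,))        # one masked object per sample
inp = slots.clone()
inp[torch.arange(B), masked_idx] = mask_token  # hide the chosen object
target = slots[torch.arange(B), masked_idx]    # latent the model must predict
```

Training then asks the predictor to reconstruct `target` from `inp` plus history — the hidden banana must be inferred from the visible monkey.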

Implementing object masking and addressing action conditioning

In causal JEPA, masked tokens within a bidirectional transformer predict the masked states. The challenge arises when multiple objects are masked, especially since object-centric models are permutation-equivariant with respect to object order. To solve this, the model uses slot identity from an initial unmasked frame to ground the prediction of masked objects. Furthermore, standard action conditioning, like concatenating action embeddings to patch embeddings in DynaMo, is deemed suboptimal. Causal JEPA proposes treating actions as separate graph nodes, integrating them more effectively into the prediction process. This architectural change, combined with object masking and a bidirectional transformer, leads to significant performance gains, particularly in tasks requiring counterfactual reasoning and understanding complex dynamics.
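Treating the action as its own token in a bidirectional transformer, rather than concatenating it onto every slot embedding, might look like the following sketch (all sizes are illustrative, not the authors' configuration):

```python
import torch
import torch.nn as nn

K, D = 4, 32
slot_tokens = torch.randn(1, K, D)   # object slots for one frame
action_token = torch.randn(1, 1, D)  # action embedded to the same width
tokens = torch.cat([slot_tokens, action_token], dim=1)  # action = extra node

layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
mixer = nn.TransformerEncoder(layer, num_layers=2)  # bidirectional attention
out = mixer(tokens)
pred_slots = out[:, :K]              # predicted (possibly masked) slot states
```

Because attention is bidirectional, every slot can attend to the action node and to every other slot, so the action's influence on each object is learned rather than hard-wired by concatenation.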

The 'lower model': Simplifying JEPA training and accelerating planning

The 'lower model' presents a simplified approach to JEPA training, operating end-to-end from raw pixels with minimal parameters (15 million) and a single hyperparameter. It avoids common JEPA complexities like EMA, masking, stop-gradients, or pre-trained encoders. The core mechanism for preventing collapse is a regularization technique called 'c-reg' (cross-correlation regularization), which encourages latent embeddings to follow an isotropic Gaussian distribution. This is achieved by projecting embeddings onto numerous random directions and ensuring these 1D projections are Gaussian, leveraging a Cramér–Wold-style result: if every one-dimensional projection of a distribution is Gaussian, the high-dimensional joint distribution must be Gaussian as well. This approach yields informative latent embeddings and allows for highly efficient planning, up to 50 times faster than strong baselines like DynaMo.
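A toy version of the projection idea: penalize random one-dimensional projections of the embeddings for deviating from standard-Gaussian moments (here only mean and variance). This simplification matches the spirit of the method only; the actual 'c-reg' regularizer may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

def projection_penalty(z, n_dirs=256):
    """z: (batch, dim) latent embeddings. Penalise non-Gaussian projections."""
    dirs = rng.standard_normal((z.shape[1], n_dirs))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)  # unit directions
    proj = z @ dirs                                      # (batch, n_dirs)
    mean_pen = (proj.mean(axis=0) ** 2).mean()           # want mean -> 0
    var_pen = ((proj.var(axis=0) - 1.0) ** 2).mean()     # want variance -> 1
    return mean_pen + var_pen

z_good = rng.standard_normal((512, 64))  # already ~ isotropic Gaussian
z_collapsed = np.ones((512, 64))         # collapsed (constant) embeddings
print(projection_penalty(z_good) < projection_penalty(z_collapsed))  # True
```

Collapsed embeddings have zero variance along every projection, so the penalty is large, which is exactly why this kind of regularizer can replace EMA and stop-gradient tricks.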

Evaluating world models: Control, intuitive physics, and limitations

The effectiveness of trained world models is assessed through two main avenues: online control and intuitive physics understanding. For control, models can sample action trajectories and optimize them to reach a target goal state using techniques like model predictive control. The 'lower model' has demonstrated competitive performance across various tasks (e.g., two-room navigation, Reacher, Push-T, OGBench cube), often outperforming models that rely on proprioception data or have significantly more parameters, particularly in planning speed. For intuitive physics, probing the latent space reveals that models like the 'lower model' produce more disentangled representations than DynaMo. Furthermore, by observing prediction errors when the world model encounters perturbations (e.g., cube teleportation), researchers can infer its understanding of dynamics: high prediction errors signal violations of the learned world model, akin to human surprise at unexpected events. However, current world models face limitations such as short planning horizons, difficulty with occlusions in object-centric representations, and challenges in specifying goals beyond visual targets.
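Planning with a learned latent model via random-shooting model predictive control can be sketched as follows. The toy `transition` function stands in for the trained world model, and all names and sizes are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def transition(s, a):
    return s + a  # stand-in for the learned latent dynamics model

def plan(s0, goal, horizon=5, n_samples=256):
    """Sample action sequences, roll them through the model, keep the best."""
    dim = s0.shape[0]
    actions = rng.normal(scale=0.5, size=(n_samples, horizon, dim))
    states = np.broadcast_to(s0, (n_samples, dim)).copy()
    for t in range(horizon):
        states = transition(states, actions[:, t])
    costs = np.linalg.norm(states - goal, axis=1)  # distance to goal latent
    return actions[np.argmin(costs), 0]            # execute first action only

first_action = plan(np.zeros(4), np.ones(4))
```

Because rollouts happen entirely in a compact latent space rather than pixel space, many candidate trajectories can be scored in one batched pass — the source of the reported planning-speed advantage.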

Common Questions

How do JEPA-based world models differ from generative world models?

Generative world models operate in pixel space and aim to reconstruct future frames exactly. In contrast, JEPA models operate in a latent space, predicting future states from current observations and focusing on meaningful predictive information rather than pixel-perfect reconstruction.
