Information Theory for Language Models: Jack Morris
Key Moments
Jack Morris discusses information theory, LLM research, and the evolving landscape of AI.
Key Insights
The AI research landscape has shifted from academia to industry, with companies now driving major breakthroughs.
Information theory, particularly 'usable information' under computational constraints, offers a new lens for understanding LLMs.
Embeddings, despite appearing random, contain significant recoverable information, impacting applications like vector databases.
The 'Platonic Representation Hypothesis' suggests that as models scale, they converge on learning similar representations of the world.
New datasets and training techniques, rather than just novel architectures, are the primary drivers of paradigm shifts in AI.
Understanding what information is stored in model weights and activations is crucial for both auditing and improving AI systems.
THE SHIFTING RESEARCH LANDSCAPE: ACADEMIA TO INDUSTRY
Jack Morris notes a significant shift in AI research from academic institutions to industry over the past five years. Initially, impactful work on models like BERT and GPT came from professors and PhD students. However, the release of ChatGPT marked a turning point, with fundamental AI science moving into companies. This has changed the power dynamics and the source of novel ideas, with many groundbreaking developments now originating from industry labs rather than university settings.
INFORMATION THEORY AS A LENS FOR LLMS
Morris introduces the concept of 'usable information' or 'V-information,' which considers computational constraints, diverging from Shannon's traditional information theory. This framework suggests that information is more valuable if it's easier to extract and process. This idea helps explain why pre-trained models perform better; they make information more extractable. He posits that the field of deep learning lacks precise terminology to discuss information storage and computation, analogizing it to the early days of telecommunications before the concept of a 'bit' was formalized.
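The framework Morris refers to comes from work on V-information (Xu et al.); the rendering below is a standard statement of it, not a formula quoted in the episode. Given a family of predictors $\mathcal{V}$ (e.g. linear probes, or any bounded-compute function class), the predictive V-entropy and usable information are:

```latex
% Predictive V-entropy: the best achievable log-loss for predicting Y
% using only functions from the family \mathcal{V}.
H_{\mathcal{V}}(Y \mid X) = \inf_{f \in \mathcal{V}} \; \mathbb{E}\left[ -\log f[X](Y) \right]

% Usable (V-)information: how much easier Y becomes to predict
% once X is observed, under the same computational constraint.
I_{\mathcal{V}}(X \to Y) = H_{\mathcal{V}}(Y \mid \varnothing) - H_{\mathcal{V}}(Y \mid X)
```

When $\mathcal{V}$ contains all functions, $I_{\mathcal{V}}$ recovers Shannon mutual information. Restricting $\mathcal{V}$ makes the quantity depend on how cheaply the information can be extracted, which is why pre-training can increase usable information without changing the Shannon information at all.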
DECODING THE MYSTERY OF EMBEDDINGS
A significant area of Morris's research explores information embedded in vector representations. He highlights that even seemingly random vectors, like those from OpenAI embeddings, contain a large amount of recoverable text, potentially thousands of characters per vector. This has practical implications for vector databases, raising questions about data privacy and security. His work demonstrates that a substantial portion of the original text can be reconstructed from embeddings, a finding that has influenced privacy policies of vector database companies.
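A minimal toy illustration of why "random-looking" vectors are not information-free. This is a sketch, not Morris's vec2text method (which trains a model to reconstruct text directly, with no candidate list): even a fixed random projection of hashed character trigrams preserves enough structure to identify the input by nearest-neighbor search. All names and texts here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def featurize(text, dim=512):
    # Hash character trigrams into a bag-of-features vector.
    v = np.zeros(dim)
    for i in range(len(text) - 2):
        v[hash(text[i:i + 3]) % dim] += 1.0
    return v

# A fixed random Gaussian projection: the 64-d outputs look like noise...
proj = rng.normal(size=(512, 64))

def embed(text):
    e = featurize(text) @ proj
    return e / np.linalg.norm(e)

corpus = [
    "the cat sat on the mat",
    "information theory for language models",
    "embeddings leak their inputs",
    "vector databases store documents",
]

# ...yet cosine nearest-neighbor search recovers exactly which text
# produced a given embedding.
target = embed("embeddings leak their inputs")
scores = [float(target @ embed(t)) for t in corpus]
recovered = corpus[int(np.argmax(scores))]
print(recovered)  # "embeddings leak their inputs"
```

The real inversion results are far stronger than this membership-style demo, since they reconstruct unseen text token by token; the toy only shows that the vector carries identifying information in the first place.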
THE PLATONIC REPRESENTATION HYPOTHESIS AND UNIVERSAL GEOMETRY
Morris discusses the 'Platonic Representation Hypothesis,' which suggests that as models scale and are trained on vast amounts of data from the same world, they converge towards learning similar internal representations. This idea is supported by observations that different embedding models, even with varying architectures and training data, often produce highly similar outputs or nearest neighbors. His research, using techniques inspired by CycleGAN, shows that embeddings from different models can be aligned, indicating a shared underlying structure, which has implications for model interoperability and efficiency.
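A supervised toy version of embedding-space alignment can be sketched with orthogonal Procrustes. Note the caveat: this assumes paired points across the two spaces, whereas the CycleGAN-inspired approach described above works without pairs, which is the harder, unsupervised setting. The setup below fabricates two "models" as different random rotations of shared latents.

```python
import numpy as np

rng = np.random.default_rng(1)

# Shared "world" latents, viewed through two different models:
# each model applies its own unknown rotation plus a little noise.
Z = rng.normal(size=(200, 32))
Ra = np.linalg.qr(rng.normal(size=(32, 32)))[0]
Rb = np.linalg.qr(rng.normal(size=(32, 32)))[0]
A = Z @ Ra + 0.01 * rng.normal(size=Z.shape)  # model A's embeddings
B = Z @ Rb + 0.01 * rng.normal(size=Z.shape)  # model B's embeddings

# Orthogonal Procrustes: the rotation W minimizing ||A W - B||_F
# is U V^T from the SVD of A^T B.
U, _, Vt = np.linalg.svd(A.T @ B)
W = U @ Vt

# If the two spaces share underlying structure, A maps onto B
# almost exactly despite the models never "seeing" each other.
err = np.linalg.norm(A @ W - B) / np.linalg.norm(B)
print(round(err, 3))
```

The hypothesis's empirical content is that something like this alignment succeeds for real models trained independently, not just for synthetic rotations of a common latent.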
DATASETS AS THE ENGINE OF AI PROGRESS
A central thesis from Morris is that paradigm shifts in AI are primarily driven by new datasets, not just new algorithms or architectures. He cites AlexNet and ImageNet, transformers and web-scale pre-training, and instruction tuning with human preference data as examples. He argues that while novel methods are glamorous, the true breakthroughs have consistently involved training on unprecedented scales or types of data. The innovation lies in the data source, which enables scaling of existing techniques rather than solely relying on entirely new methods.
MEASURING AND UNDERSTANDING MODEL CAPACITY
Further research by Morris examines how much information language model weights can retain. His papers on language model capacity and on approximating training data from model weights investigate this directly. He notes that current models store relatively few bits per parameter compared to their theoretical capacity, suggesting inefficiencies. While the goal of training is generalization rather than pure memorization, understanding capacity is crucial for building more efficient, capable models, and it carries implications for data privacy, including the possibility of reconstructing training data from model weights.
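A back-of-envelope sketch of what a bits-per-parameter figure implies. The 3.6 bits/parameter constant is the empirical estimate reported in Morris's memorization work for GPT-style models; it is an assumption here, not a law, and the arithmetic below is illustrative only.

```python
# Assumed empirical capacity estimate (~3.6 bits per parameter),
# small compared to the 16+ bits each parameter physically occupies.
BITS_PER_PARAM = 3.6

def capacity_gb(n_params):
    # Raw memorization capacity in gigabytes: bits -> bytes -> GB.
    return n_params * BITS_PER_PARAM / 8 / 1e9

# A 1B-parameter model could memorize on the order of half a gigabyte,
# far less than the terabytes of text it is trained on -- so past
# capacity, training must compress via generalization, not storage.
for n in (1e9, 7e9, 70e9):
    print(f"{n:.0e} params -> {capacity_gb(n):.2f} GB")
```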
Common Questions
How does information theory apply to language models?
Information theory provides a framework to measure and understand how information is stored and processed within language models. It helps in analyzing concepts like model capacity, data compression, and the extractability of information from model weights and activations.
Mentioned in this video
The team behind PyTorch is mentioned as being active on the GPU mode Discord to help users with distributed training.
A model mentioned in comparison to Gemma 3n.
A model family that frequently releases model checkpoints.
Company that released a 400 billion parameter model, making its base and fine-tuned weights available.
A game-playing system developed by DeepMind that Jack Morris found impressive around 2017-2018.
A Microsoft library for distributed training that is accessible to researchers.
A team whose members, like Jeremy Howard, can help with distributed training.
The first GPT model was released in 2018, representing a paradigm shift with web-scale pre-training.
Recurrent neural networks, which, while less scalable than Transformers, could potentially have achieved ChatGPT-like results with web-scale pre-training.
A professor at Cornell who Jack Morris studied under.
A programming language that is suggested as a current sweet spot for employability due to its potential as a CUDA replacement.
Mentioned in a conversation about emergent reasoning in large language models.
The paper that introduced the Transformer architecture, a key innovation in AI.
A model mentioned in comparison to Gemma 3n.
A language model that Jack Morris was playing with and that was popular in 2019.
A figure who has been advocating for the concept of a 'cognitive core' for AI models.
Author of 'The Structure of Scientific Revolutions', proposing the concept of paradigm shifts in science.
A campus of Cornell University located in New York City.
A major product release in 2022 that significantly increased consumer interest and shifted the landscape of AI research towards companies.
A parallel computing platform and programming model mentioned as a valuable skill to learn, though not the only path to high employability.
A programming language integrated with Python in the development of Mojo.
A large-scale image dataset used for training AlexNet, facilitating a paradigm shift in deep neural networks.
An AI research lab mentioned in relation to Shunyu Yao joining their team and the release of models like GPT-2, GPT-3, and reasoning models.
An OpenAI model that Jack Morris found interesting, though BERT was more popular at the time.
A 175 billion parameter model released before InstructGPT, mentioned as a significant development in large language models.
The company behind the DeepSpeed library, which researchers are encouraged to reach out to for assistance.
The company behind Mojo, with Chris Lattner leading the development of a CUDA replacement combining Python and Rust.
A framework that some people switch from to learn CUDA.
A deep neural network, trained on ImageNet, that marked a paradigm shift in AI in 2012.
A model described as a more efficient alternative to Transformers, representing the kind of 'cute new method' researchers often seek.