Information Theory for Language Models: Jack Morris
Key Moments
Jack Morris discusses information theory, LLM research, and the evolving landscape of AI.
Key Insights
The AI research landscape has shifted from academia to industry, with companies now driving major breakthroughs.
Information theory, particularly 'usable information' under computational constraints, offers a new lens for understanding LLMs.
Embeddings, despite appearing random, contain significant recoverable information, impacting applications like vector databases.
The 'Platonic Representation Hypothesis' suggests that as models scale, they converge on learning similar representations of the world.
New datasets and training techniques, rather than just novel architectures, are the primary drivers of paradigm shifts in AI.
Understanding what information is stored in model weights and activations is crucial for both auditing and improving AI systems.
THE SHIFTING RESEARCH LANDSCAPE: ACADEMIA TO INDUSTRY
Jack Morris notes a significant shift in AI research from academic institutions to industry over the past five years. Initially, impactful work on models like BERT and GPT came from professors and PhD students. However, the release of ChatGPT marked a turning point, with fundamental AI science moving into companies. This has changed the power dynamics and the source of novel ideas, with many groundbreaking developments now originating from industry labs rather than university settings.
INFORMATION THEORY AS A LENS FOR LLMS
Morris introduces the concept of 'usable information' or 'V-information,' which considers computational constraints, diverging from Shannon's traditional information theory. This framework suggests that information is more valuable if it's easier to extract and process. This idea helps explain why pre-trained models perform better; they make information more extractable. He posits that the field of deep learning lacks precise terminology to discuss information storage and computation, analogizing it to the early days of telecommunications before the concept of a 'bit' was formalized.
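The framework Morris refers to comes from work on V-information (Xu et al.); the rendering below is a standard statement of it, not a formula quoted in the episode. Given a family of predictors $\mathcal{V}$ (e.g. linear probes, or any bounded-compute function class), the predictive V-entropy and usable information are:

```latex
% Predictive V-entropy: the best achievable log-loss for predicting Y
% using only functions from the family \mathcal{V}.
H_{\mathcal{V}}(Y \mid X) = \inf_{f \in \mathcal{V}} \; \mathbb{E}\left[ -\log f[X](Y) \right]

% Usable (V-)information: how much easier Y becomes to predict
% once X is observed, under the same computational constraint.
I_{\mathcal{V}}(X \to Y) = H_{\mathcal{V}}(Y \mid \varnothing) - H_{\mathcal{V}}(Y \mid X)
```

When $\mathcal{V}$ contains all functions, $I_{\mathcal{V}}$ recovers Shannon mutual information. Restricting $\mathcal{V}$ makes the quantity depend on how cheaply the information can be extracted, which is why pre-training can increase usable information without changing the Shannon information at all.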
DECODING THE MYSTERY OF EMBEDDINGS
A significant area of Morris's research explores information embedded in vector representations. He highlights that even seemingly random vectors, like those from OpenAI embeddings, contain a large amount of recoverable text, potentially thousands of characters per vector. This has practical implications for vector databases, raising questions about data privacy and security. His work demonstrates that a substantial portion of the original text can be reconstructed from embeddings, a finding that has influenced privacy policies of vector database companies.
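A minimal toy illustration of why "random-looking" vectors are not information-free. This is a sketch, not Morris's vec2text method (which trains a model to reconstruct text directly, with no candidate list): even a fixed random projection of hashed character trigrams preserves enough structure to identify the input by nearest-neighbor search. All names and texts here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def featurize(text, dim=512):
    # Hash character trigrams into a bag-of-features vector.
    v = np.zeros(dim)
    for i in range(len(text) - 2):
        v[hash(text[i:i + 3]) % dim] += 1.0
    return v

# A fixed random Gaussian projection: the 64-d outputs look like noise...
proj = rng.normal(size=(512, 64))

def embed(text):
    e = featurize(text) @ proj
    return e / np.linalg.norm(e)

corpus = [
    "the cat sat on the mat",
    "information theory for language models",
    "embeddings leak their inputs",
    "vector databases store documents",
]

# ...yet cosine nearest-neighbor search recovers exactly which text
# produced a given embedding.
target = embed("embeddings leak their inputs")
scores = [float(target @ embed(t)) for t in corpus]
recovered = corpus[int(np.argmax(scores))]
print(recovered)  # "embeddings leak their inputs"
```

The real inversion results are far stronger than this membership-style demo, since they reconstruct unseen text token by token; the toy only shows that the vector carries identifying information in the first place.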
THE PLATONIC REPRESENTATION HYPOTHESIS AND UNIVERSAL GEOMETRY
Morris discusses the 'Platonic Representation Hypothesis,' which suggests that as models scale and are trained on vast amounts of data from the same world, they converge towards learning similar internal representations. This idea is supported by observations that different embedding models, even with varying architectures and training data, often produce highly similar outputs or nearest neighbors. His research, using techniques inspired by CycleGAN, shows that embeddings from different models can be aligned, indicating a shared underlying structure, which has implications for model interoperability and efficiency.
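A supervised toy version of embedding-space alignment can be sketched with orthogonal Procrustes. Note the caveat: this assumes paired points across the two spaces, whereas the CycleGAN-inspired approach described above works without pairs, which is the harder, unsupervised setting. The setup below fabricates two "models" as different random rotations of shared latents.

```python
import numpy as np

rng = np.random.default_rng(1)

# Shared "world" latents, viewed through two different models:
# each model applies its own unknown rotation plus a little noise.
Z = rng.normal(size=(200, 32))
Ra = np.linalg.qr(rng.normal(size=(32, 32)))[0]
Rb = np.linalg.qr(rng.normal(size=(32, 32)))[0]
A = Z @ Ra + 0.01 * rng.normal(size=Z.shape)  # model A's embeddings
B = Z @ Rb + 0.01 * rng.normal(size=Z.shape)  # model B's embeddings

# Orthogonal Procrustes: the rotation W minimizing ||A W - B||_F
# is U V^T from the SVD of A^T B.
U, _, Vt = np.linalg.svd(A.T @ B)
W = U @ Vt

# If the two spaces share underlying structure, A maps onto B
# almost exactly despite the models never "seeing" each other.
err = np.linalg.norm(A @ W - B) / np.linalg.norm(B)
print(round(err, 3))
```

The hypothesis's empirical content is that something like this alignment succeeds for real models trained independently, not just for synthetic rotations of a common latent.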
DATASETS AS THE ENGINE OF AI PROGRESS
A central thesis from Morris is that paradigm shifts in AI are primarily driven by new datasets, not just new algorithms or architectures. He cites AlexNet and ImageNet, transformers and web-scale pre-training, and instruction tuning with human preference data as examples. He argues that while novel methods are glamorous, the true breakthroughs have consistently involved training on unprecedented scales or types of data. The innovation lies in the data source, which enables scaling of existing techniques rather than solely relying on entirely new methods.
MEASURING AND UNDERSTANDING MODEL CAPACITY
Further research by Morris examines how much information language model weights can retain. His papers on language model capacity and on approximating training data from model weights investigate this directly. He notes that current models store relatively few bits per parameter compared to their theoretical capacity, suggesting inefficiencies. While the goal of training is generalization rather than pure memorization, understanding capacity is crucial for building more efficient, capable models, and it carries implications for data privacy, including the possibility of reconstructing training data from model weights.
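A back-of-envelope sketch of what a bits-per-parameter figure implies. The 3.6 bits/parameter constant is the empirical estimate reported in Morris's memorization work for GPT-style models; it is an assumption here, not a law, and the arithmetic below is illustrative only.

```python
# Assumed empirical capacity estimate (~3.6 bits per parameter),
# small compared to the 16+ bits each parameter physically occupies.
BITS_PER_PARAM = 3.6

def capacity_gb(n_params):
    # Raw memorization capacity in gigabytes: bits -> bytes -> GB.
    return n_params * BITS_PER_PARAM / 8 / 1e9

# A 1B-parameter model could memorize on the order of half a gigabyte,
# far less than the terabytes of text it is trained on -- so past
# capacity, training must compress via generalization, not storage.
for n in (1e9, 7e9, 70e9):
    print(f"{n:.0e} params -> {capacity_gb(n):.2f} GB")
```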
Common Questions
How does information theory apply to language models?
Information theory provides a framework to measure and understand how information is stored and processed within language models. It helps in analyzing concepts like model capacity, data compression, and the extractability of information from model weights and activations.
Mentioned in this video
The team behind PyTorch is mentioned as being active on the GPU mode Discord to help users with distributed training.
A model mentioned in comparison to Gemma 3n.
A model family that frequently releases model checkpoints.
Company that released a 400 billion parameter model, making its base and fine-tuned weights available.
A game-playing system developed by DeepMind that Jack Morris found impressive around 2017-2018.
A Microsoft library for distributed training that is accessible to researchers.
A team whose members, like Jeremy Howard, can help with distributed training.
The first GPT model was released in 2018, representing a paradigm shift with web-scale pre-training.
Recurrent neural networks, which, while less scalable than Transformers, could potentially have achieved ChatGPT-like results with web-scale pre-training.
A professor at Cornell who Jack Morris studied under.
A programming language that is suggested as a current sweet spot for employability due to its potential as a CUDA replacement.
Mentioned in a conversation about emergent reasoning in large language models.
The paper that introduced the Transformer architecture, a key innovation in AI.
A model mentioned in comparison to Gemma 3n.
A language model that Jack Morris was playing with and that was popular in 2019.
A figure who has been advocating for the concept of a 'cognitive core' for AI models.
Author of 'The Structure of Scientific Revolutions', proposing the concept of paradigm shifts in science.
A campus of Cornell University located in New York City.
A major product release in 2022 that significantly increased consumer interest and shifted the landscape of AI research towards companies.
A parallel computing platform and programming model mentioned as a valuable skill to learn, though not the only path to high employability.
A programming language integrated with Python in the development of Mojo.
A large-scale image dataset used for training AlexNet, facilitating a paradigm shift in deep neural networks.
An AI research lab mentioned in relation to Shunyu Yao joining their team and the release of models like GPT-2, GPT-3, and reasoning models.
An OpenAI model that Jack Morris found interesting, though BERT was more popular at the time.
A 175 billion parameter model released before InstructGPT, mentioned as a significant development in large language models.
The company behind the DeepSpeed library, which researchers are encouraged to reach out to for assistance.
The company behind Mojo, with Chris Lattner leading the development of a CUDA replacement combining Python and Rust.
A framework that some people switch from to learn CUDA.
A deep neural network, trained on ImageNet, that marked a paradigm shift in AI in 2012.
A model described as a more efficient alternative to Transformers, representing the kind of 'cute new method' researchers often seek.