
The End of Finetuning — with Jeremy Howard of Fast.ai

Latent Space Podcast
Science & Technology | 4 min read | 85 min video
Oct 20, 2023 | 23,533 views
TL;DR

Jeremy Howard discusses fast.ai, making AI accessible, the evolution of NLP, and the future of model training.

Key Insights

1. Fast.ai democratized deep learning by focusing on accessibility, especially through transfer learning.

2. The ULMFiT model laid the groundwork for modern language model pre-training and fine-tuning approaches.

3. Current fine-tuning methods for LLMs may be suboptimal; continued pre-training might be a better paradigm.

4. The focus on zero-shot and few-shot learning initially overshadowed the effectiveness of fine-tuning.

5. There is a critical need to understand the internal dynamics and data requirements of large language models.

6. Making powerful AI tools accessible to more people is crucial to prevent a dystopian future controlled by elites.

THE FOUNDING OF FAST.AI AND THE ACCESSIBILITY MOVEMENT

Jeremy Howard recounts the inception of fast.ai, born from the belief that deep learning should be accessible to everyone, not just a select few with PhDs. He highlights the initial skepticism towards making deep learning understandable and usable for ordinary people. The core principle of fast.ai from day one was transfer learning, a technique that was largely overlooked but proved key to making the technology more accessible by reducing compute and data requirements.
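The economics behind this accessibility argument can be seen in a toy sketch: reuse a frozen "pretrained" feature extractor and train only a small linear head, so that the downstream task needs far less data and compute. This is purely illustrative; the backbone here is a fixed random projection standing in for a real pretrained network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained backbone: in a real setting this would be a
# network trained on a large dataset; a fixed projection plays that role here.
W_backbone = rng.normal(size=(20, 16))

def features(x):
    # Frozen feature extractor: W_backbone is never updated.
    return np.tanh(x @ W_backbone)

# Small labelled dataset for the downstream task.
X = rng.normal(size=(200, 20))
y = (X[:, 0] > 0).astype(float)

# Only the linear head is trained: 17 parameters instead of the whole
# network, which is why transfer learning needs so much less data and compute.
w, b, lr = np.zeros(16), 0.0, 0.5
F = features(X)
for _ in range(300):
    p = 1 / (1 + np.exp(-(F @ w + b)))   # sigmoid
    g = p - y                            # logistic-loss gradient
    w -= lr * F.T @ g / len(X)
    b -= lr * g.mean()

acc = ((F @ w + b > 0) == (y == 1)).mean()
```

The same division of labour, a large frozen (or gently tuned) model plus a small task-specific component, is what let fast.ai students get strong results without PhD-scale resources.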

ULMFiT'S ROLE IN REVOLUTIONIZING NLP

Howard details the development of ULMFiT, a groundbreaking approach in Natural Language Processing (NLP). Trained on Wikipedia, this large language model demonstrated that pre-training on a vast corpus could imbue a model with significant world knowledge. The subsequent fine-tuning steps, refined in ULMFiT, laid the foundation for the multi-stage training process that characterizes modern LLMs like ChatGPT, proving that such models could achieve state-of-the-art results on various tasks.
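The classifier stage of the ULMFiT paper combined two tricks: discriminative learning rates, where each earlier layer trains with the next layer's rate divided by 2.6, and gradual unfreezing, where one more layer becomes trainable each epoch. A minimal sketch of just that schedule, with illustrative layer names:

```python
# Discriminative learning rates: each earlier layer gets a smaller step.
# The divisor 2.6 is the layer-wise factor reported in the ULMFiT paper;
# the layer names below are illustrative, not the paper's exact architecture.
base_lr, decay = 0.01, 2.6
layers = ["embed", "lstm1", "lstm2", "lstm3", "classifier"]
lrs = {name: base_lr / decay ** (len(layers) - 1 - i)
       for i, name in enumerate(layers)}

def trainable_at_epoch(epoch):
    # Gradual unfreezing: only the last layer trains first, then one more
    # layer is unfrozen per epoch until the whole model is trainable.
    return layers[max(0, len(layers) - 1 - epoch):]
```

The intuition is that early layers hold general language knowledge worth preserving, while later layers need the most task-specific adaptation.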

CHALLENGING CONVENTIONAL WISDOM IN AI RESEARCH

Howard shares his experience of going against the grain in the NLP community. Despite initial resistance and assertions that language was too complex for his approach, ULMFiT's success, and later advances by others such as OpenAI's Alec Radford, validated his strategy. He notes that even established researchers like Radford doubted the efficacy of large-scale pre-training until ULMFiT provided the evidence, a recurring theme of unconventional ideas eventually proving fruitful.

THE EVOLUTION OF FINE-TUNING AND THE CRITIQUE OF CURRENT METHODS

Reflecting on the current LLM landscape, Howard expresses a view that the standard three-step pre-training and fine-tuning approach, which he pioneered, may no longer be optimal. He suggests that the way fine-tuning is applied today, particularly for tasks like Reinforcement Learning from Human Feedback (RLHF) and instruction tuning, might be leading to issues like catastrophic forgetting. Howard advocates for a paradigm shift towards 'continued pre-training' where all data types are integrated from the start.
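Howard's continued pre-training idea, mixing instruction-style data into the stream from the very first step rather than bolting on a separate fine-tuning phase, can be sketched as a data-mixing generator. The mixing ratio and structure here are assumptions for illustration, not his prescription.

```python
import random

def mixed_stream(pretrain_docs, instruct_docs, instruct_frac=0.05, seed=0):
    """Yield one combined training stream: mostly raw text, with
    instruction examples sprinkled in from the start, instead of a
    separate fine-tuning phase that risks catastrophic forgetting."""
    rng = random.Random(seed)
    while True:
        pool = instruct_docs if rng.random() < instruct_frac else pretrain_docs
        yield rng.choice(pool)
```

Because instruction data is seen throughout training rather than only at the end, the model never has a phase where gradient updates push it exclusively toward one distribution at the expense of everything learned earlier.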

RESEARCH PHILOSOPHY: DOING MORE WITH LESS

A consistent theme in Howard's work is the philosophy of achieving more with fewer resources. Whether it's developing accessible courses, efficient software libraries, or researching new model architectures, the goal is to empower a wider range of users. This ethos is evident in fast.ai's research, which often focuses on techniques that reduce data, compute, and educational barriers. Examples include winning the DawnBench competition with efficient methods and exploring the potential of smaller, more capable models.

THE IMPORTANCE OF ACCESSIBILITY AND DISTRIBUTED AI POWER

Howard emphasizes the societal implication of AI accessibility. He argues that concentrating powerful AI technology in the hands of a few elites is a potentially dystopian path. Instead, he advocates for enabling a broader segment of humanity to leverage these tools, believing that widespread access will lead to greater innovation and benefit for society. He draws parallels to historical technological advancements like the printing press, stressing the importance of distributing power rather than centralizing it.

THE FUTURE OF MODEL DEVELOPMENT AND NEW FRONTIERS

Looking ahead, Howard discusses the underexplored potential of fine-tuning, the inefficiency of RLHF as a standalone method, and the promise of combining retrieval-augmented generation (RAG) with fine-tuning. He also touches on the untapped potential of smaller models, the challenges of evaluating them, and ongoing research into LLM training dynamics, data curation, and the latent capabilities within these models.
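The mechanic underlying the RAG-plus-fine-tuning direction is simple: retrieve the documents most relevant to a query and prepend them to the prompt. A deliberately minimal sketch, with bag-of-words cosine similarity standing in for a real embedding model; the documents and scoring are illustrative assumptions.

```python
import math
from collections import Counter

docs = [
    "ULMFiT fine-tunes a pretrained language model in three stages.",
    "FlashAttention reduces memory traffic in the attention layer.",
    "fastai is a deep learning library built on PyTorch.",
]

def bow(text):
    # Bag-of-words term counts as a crude stand-in for an embedding.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(count * b[term] for term, count in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, k=1):
    # Rank documents by similarity to the query and keep the top k.
    q = bow(query)
    return sorted(docs, key=lambda d: cosine(q, bow(d)), reverse=True)[:k]

def build_prompt(query):
    # Retrieved context is prepended so the model can ground its answer.
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

Fine-tuning then teaches the model to actually use such context well, which is why Howard sees the two techniques as complementary rather than competing.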

EXPLORING NEW LANGUAGES AND HARDWARE FOR AI

The conversation touches upon the burgeoning ecosystem for AI development, including new programming languages and hardware. Howard shares his excitement about Chris Lattner's work on Mojo, a new language designed for AI that aims to simplify complex tasks like FlashAttention. He also discusses the current landscape of AI frameworks, noting the limitations of Python and the ongoing development in areas like JAX, emphasizing the need for better tools that empower developers to innovate more easily.
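FlashAttention is exactly the kind of kernel Mojo aims to make easier to write, and its key algorithmic ingredient, the online softmax, fits in a few lines of NumPy. This is an illustrative single-query sketch of the streaming trick, not the real tiled GPU kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 32
q = rng.normal(size=d)          # one query vector
K = rng.normal(size=(n, d))     # keys
V = rng.normal(size=(n, d))     # values

# Naive attention: softmax(q . K^T) @ V, materialising all n scores at once.
s = K @ q
w = np.exp(s - s.max())
naive = (w / w.sum()) @ V

# Online softmax (the core trick in FlashAttention): stream over key/value
# blocks, keeping only a running max, running normaliser, and running output,
# so the full score vector never has to sit in fast memory at once.
m, l, out = -np.inf, 0.0, np.zeros(d)
for start in range(0, n, 8):                 # process keys in blocks of 8
    s_blk = K[start:start + 8] @ q
    m_new = max(m, s_blk.max())
    p = np.exp(s_blk - m_new)
    scale = np.exp(m - m_new)                # rescale the old accumulators
    l = l * scale + p.sum()
    out = out * scale + p @ V[start:start + 8]
    m = m_new
online = out / l
```

Both paths compute the same result; the point is that the blocked version only ever touches one tile of scores at a time, which is what makes the memory savings possible on real hardware.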

THE UNRESOLVED MYSTERIES OF LLM LEARNING DYNAMICS

A significant portion of the discussion revolves around the fundamental unknowns in how large language models learn. Howard highlights the need for more rigorous research into training dynamics, data requirements, and the internal workings of models. He likens the current state of LLM understanding to computer vision in its early days, where key insights into layer functions were still emerging. Understanding these dynamics is crucial for improving model training and capabilities.

THE CALL TO ACTION: EMPOWERING INDIVIDUALS TO BUILD

Howard advocates for a proactive approach to developing AI, encouraging individuals to experiment and build. He notes that in open-source communities, those who genuinely contribute and do the work—even small, initial tasks—stand out and attract support. The message is clear: the future of AI innovation depends on empowering a diverse range of builders, not just a privileged few, to achieve valuable work and contribute to a better future.

Common Questions

What did Jeremy Howard study before working in AI?

Jeremy Howard initially pursued a BA in Philosophy at the University of Melbourne, focusing on ethics and cognitive science, which he later found relevant to his AI work.

Topics

Mentioned in this video

Software & Apps
Lasagne

A wrapper for Theano, mentioned as an early deep learning tool.

Stable Diffusion

Fast.ai courses teach from basics to Stable Diffusion in about seven weeks.

IMDb

Jeremy Howard achieved a new state-of-the-art academic result on IMDb within hours of trying his ULMFiT approach.

Fastmail

One of the two companies Jeremy Howard founded in June 1999, providing synchronized email.

ChatGPT

Jeremy Howard notes that the three-step system he developed for ULMFiT is essentially what powers ChatGPT today.

Code Llama

A fine-tuned version of Llama 2 by Meta, which became good at coding but at the cost of forgetting other capabilities (catastrophic forgetting).

LLVM

Chris Lattner's work on LLVM is mentioned as foundational for his subsequent projects like Swift and Mojo.

Phi-1.5

A small language model that excels at generating short Python snippets, trained on synthetic data and lacking general world knowledge.

Stockfish

A top chess engine against which GPT-4 was compared; with advanced prompting, GPT-4 achieved a near-equivalent Elo rating.

PyTorch

The PyTorch team released a 3D matrix product visualizer, an example of tools that help understand model behavior like attention layers.

Llama 2

The base model for Code Llama, which was fine-tuned by Meta.

Swift

Chris Lattner's work at Google involved Swift for TensorFlow, and Jeremy Howard learned Swift to collaborate.

Codex

Mentioned as a tool that enables people from manual jobs to start training language models.

AlexNet

A landmark convolutional neural network in computer vision, cited as an analogy: today's understanding of LLMs resembles computer vision in its early, AlexNet-era phase.

Theano

An early deep learning library, mentioned along with its wrapper, Lasagne.

ELMo

A language model developed around the same time as ULMFiT, but with a different approach.

TensorFlow

TensorFlow 2 is described as an internal failure at Google, leading to the development of alternative frameworks such as JAX and Chris Lattner's Swift for TensorFlow.

GPT-4

Demonstrated advanced chess-playing capabilities (Elo 3400) with sophisticated prompting strategies, showing hidden potential.

CUDA

A parallel computing platform and programming model created by Nvidia, mentioned in the context of writing low-level GPU code for optimizations.

JAX

A backup plan for Google's AI future after TensorFlow 2's perceived failure, initially intended as a research project but now a key framework.

BTLM-3B

An underappreciated small language model delivering 7B-class quality at 3B parameters.

Copilot

Mentioned as a tool that enables people from manual jobs to start training language models.

FlashAttention

An optimization technique for attention mechanisms, mentioned as an example of innovations that could be made easier with better languages like Mojo.

ResNet

A paper on ResNets visualizing its loss surface with and without skip connections is cited as an example of the type of work needed to understand model learning dynamics.

Mojo

A new programming language and infrastructure being developed by Modular, aiming to make it easier to build advanced AI tools like FlashAttention.

GPT-5

Jeremy Howard anticipates having a similar visceral reaction to using GPT-5 as he did with GPT-4.
