The End of Finetuning — with Jeremy Howard of Fast.ai
Key Moments
Jeremy Howard discusses fast.ai, making AI accessible, the evolution of NLP, and the future of model training.
Key Insights
Fast.ai democratized deep learning by focusing on accessibility, especially through transfer learning.
The ULMFiT model laid the groundwork for modern language model pre-training and fine-tuning approaches.
Current fine-tuning methods for LLMs may be suboptimal, and continued pre-training might be a better paradigm.
The focus on zero-shot and few-shot learning initially overshadowed the effectiveness of fine-tuning.
There's a critical need to understand the internal dynamics and data requirements of large language models.
Making powerful AI tools accessible to more people is crucial to prevent a dystopian future controlled by elites.
THE FOUNDING OF FAST.AI AND THE ACCESSIBILITY MOVEMENT
Jeremy Howard recounts the inception of fast.ai, born from the belief that deep learning should be accessible to everyone, not just a select few with PhDs. He highlights the initial skepticism towards making deep learning understandable and usable for ordinary people. The core principle of fast.ai from day one was transfer learning, a technique that was largely overlooked but proved key to making the technology more accessible by reducing compute and data requirements.
ULMFiT'S ROLE IN REVOLUTIONIZING NLP
Howard details the development of ULMFiT, a groundbreaking approach in Natural Language Processing (NLP). Trained on Wikipedia, this large language model demonstrated that pre-training on a vast corpus could imbue a model with significant world knowledge. The subsequent fine-tuning steps, refined in ULMFiT, laid the foundation for the multi-stage training process that characterizes modern LLMs like ChatGPT, proving that such models could achieve state-of-the-art results on various tasks.
CHALLENGING CONVENTIONAL WISDOM IN AI RESEARCH
Howard shares his experience of going against the grain in the NLP community. Despite initial resistance and assertions that language was too complex for his approach, ULMFiT's success, and later advancements by others like OpenAI's Alec Radford, validated his strategy. He notes that even established researchers like Radford initially doubted the efficacy of large-scale pre-training before ULMFiT provided the evidence. This highlights a recurring theme of unconventional ideas eventually proving fruitful.
THE EVOLUTION OF FINE-TUNING AND THE CRITIQUE OF CURRENT METHODS
Reflecting on the current LLM landscape, Howard expresses a view that the standard three-step pre-training and fine-tuning approach, which he pioneered, may no longer be optimal. He suggests that the way fine-tuning is applied today, particularly for tasks like Reinforcement Learning from Human Feedback (RLHF) and instruction tuning, might be leading to issues like catastrophic forgetting. Howard advocates for a paradigm shift towards 'continued pre-training' where all data types are integrated from the start.
RESEARCH PHILOSOPHY: DOING MORE WITH LESS
A consistent theme in Howard's work is the philosophy of achieving more with fewer resources. Whether it's developing accessible courses, efficient software libraries, or researching new model architectures, the goal is to empower a wider range of users. This ethos is evident in fast.ai's research, which often focuses on techniques that reduce data, compute, and educational barriers. Examples include winning the DAWNBench competition with efficient methods and exploring the potential of smaller, more capable models.
THE IMPORTANCE OF ACCESSIBILITY AND DISTRIBUTED AI POWER
Howard emphasizes the societal implication of AI accessibility. He argues that concentrating powerful AI technology in the hands of a few elites is a potentially dystopian path. Instead, he advocates for enabling a broader segment of humanity to leverage these tools, believing that widespread access will lead to greater innovation and benefit for society. He draws parallels to historical technological advancements like the printing press, stressing the importance of distributing power rather than centralizing it.
THE FUTURE OF MODEL DEVELOPMENT AND NEW FRONTIERS
Looking ahead, Howard discusses the underexplored potential of fine-tuning, the inefficiency of Reinforcement Learning from Human Feedback (RLHF) as a standalone method, and the promise of combining retrieval-augmented generation (RAG) with fine-tuning. He also touches on the untapped potential of smaller models, the challenges in evaluating them, and the ongoing research into understanding LLM training dynamics, data curation, and the latent capabilities within these models.
EXPLORING NEW LANGUAGES AND HARDWARE FOR AI
The conversation touches upon the burgeoning ecosystem for AI development, including new programming languages and hardware. Howard shares his excitement about Chris Lattner's work on Mojo, a new language designed for AI that aims to simplify complex tasks like FlashAttention. He also discusses the current landscape of AI frameworks, noting the limitations of Python and the ongoing development in areas like JAX, emphasizing the need for better tools that empower developers to innovate more easily.
THE UNRESOLVED MYSTERIES OF LLM LEARNING DYNAMICS
A significant portion of the discussion revolves around the fundamental unknowns in how large language models learn. Howard highlights the need for more rigorous research into training dynamics, data requirements, and the internal workings of models. He likens the current state of LLM understanding to computer vision in its early days, where key insights into layer functions were still emerging. Understanding these dynamics is crucial for improving model training and capabilities.
THE CALL TO ACTION: EMPOWERING INDIVIDUALS TO BUILD
Howard advocates for a proactive approach to developing AI, encouraging individuals to experiment and build. He notes that in open-source communities, those who genuinely contribute and do the work—even small, initial tasks—stand out and attract support. The message is clear: the future of AI innovation depends on empowering a diverse range of builders, not just a privileged few, to achieve valuable work and contribute to a better future.
Common Questions
What is Jeremy Howard's educational background?
Jeremy Howard initially pursued a BA in Philosophy at the University of Melbourne, focusing on ethics and cognitive science, which he later found relevant to his AI work.
Mentioned in this video
A wrapper for Theano, mentioned as an early deep learning tool.
Fast.ai courses teach from basics to Stable Diffusion in about seven weeks.
Jeremy Howard achieved a new state-of-the-art academic result on IMDb within hours of trying his ULMFiT approach.
One of the two companies Jeremy Howard founded in June 1999, providing synchronized email.
Jeremy Howard notes that the three-step system he developed for ULMFiT is essentially what powers ChatGPT today.
A fine-tuned version of Llama 2 by Meta, which became good at coding but at the cost of forgetting other capabilities (catastrophic forgetting).
Chris Lattner's work on LLVM is mentioned as foundational for his subsequent projects like Swift and Mojo.
A small language model that excels at generating short Python snippets, trained on synthetic data and lacking general world knowledge.
A top chess engine that GPT-4 was compared against, achieving a near-equivalent Elo rating with advanced prompting.
The PyTorch team released a 3D matrix product visualizer, an example of tools that help understand model behavior like attention layers.
The base model for Code Llama, which was fine-tuned by Meta.
Chris Lattner's work at Google involved Swift for TensorFlow, and Jeremy Howard learned Swift to collaborate.
Mentioned as a tool that enables people from manual jobs to start training language models.
A landmark convolutional neural network in computer vision, mentioned as part of the early development phase similar to current LLM understanding.
Mentioned as an early deep learning library, along with its wrapper Lasagne.
A language model developed around the same time as ULMFiT, but with a different approach.
TensorFlow 2 is described as a failure internally at Google, leading to the development of alternative frameworks like JAX and Chris Lattner's Swift for TensorFlow.
Demonstrated advanced chess-playing capabilities (Elo 3400) with sophisticated prompting strategies, showing hidden potential.
A parallel computing platform and API model created by Nvidia, mentioned in the context of writing low-level GPU code for optimizations.
A backup plan for Google's AI future after TensorFlow 2's perceived failure, initially intended as a research project but now a key framework.
An underappreciated small language model that delivers 7B-class quality at a 3B parameter size.
An optimization technique for attention mechanisms, mentioned as an example of innovations that could be made easier with better languages like Mojo.
A paper on ResNets visualizing its loss surface with and without skip connections is cited as an example of the type of work needed to understand model learning dynamics.
A new programming language and infrastructure being developed by Modular, aiming to make it easier to build advanced AI tools like FlashAttention.
Jeremy Howard anticipates having a similar visceral reaction to using GPT-5 as he did with GPT-4.
Noted for releasing early large language models and later for developing TPUs and the JAX framework.
Founded by Jeremy Howard in June 1999, it invented a new approach to insurance pricing called profit-optimized insurance pricing.
Followed OpenAI's rapid development model; mentioned as a place that those interested in deep learning should not join.
Jeremy Howard mentions working 80-100 hour weeks there from age 19, impacting his university studies.
Developed Code Llama, a fine-tuned version of Llama 2.
Mentioned for their 'Trainer' library, which Jeremy Howard initially suspected of containing bugs related to training curve 'clunks'.
Mentioned as a leading lab that rapidly developed working models, contributing to technical debt due to speed. Also discussed in the context of their Scholars program.
A new company founded by Chris Lattner, aiming to create a new language and infrastructure based on their past collaborations and ideas.
Their work on an earlier, less linear version of RLHF showed it performed better than later versions, and they created a Phi-1.5-web model that wasn't released.
Jeremy Howard trained a large language model on Wikipedia, seeing it as a significant corpus of text that could teach a model about the world.
Jeremy Howard was President and Chief Scientist here, working on deep learning for medical diagnostics. He was also the top-ranked participant in 2010 and 2011.
Jeremy Howard graduated with a BA in Philosophy from here.
A programming language that helps in writing GPU-optimized code, making innovations like FlashAttention more accessible.
Co-authored a paper with Andrew Dai on early large language models, focusing on domain-specific corpora.
Worked with Jeremy Howard at Fast.ai and created the fastprogress library.
Mentioned as one of the two strongest people in language models at the time, who was initially skeptical of pre-training on Wikipedia.
Journalist from The New York Times who organized a conversation between Jeremy Howard and Alec Radford.
Co-authored a paper with Quoc Le on early large language models, though their work focused on domain-specific corpora.
Creator of FlashAttention, who was asked if others had considered similar ideas before him, highlighting the importance of background and initiative.
Co-founder of Fast.ai with Jeremy Howard, focusing on making deep learning accessible.
Considered one of the strongest people in language models, influenced by ULMFiT for his GPT work.
Jeremy Howard mentions hearing himself on a podcast with Tanishq, discussing Tanishq's work.
Mentioned for his RNN demonstration showing text ingestion and generation, similar to Jeremy Howard's early work with language models.
Developed a multitask learning model at Salesforce prior to ULMFiT, but without the general fine-tuning step.
Met Jeremy Howard at the TensorFlow Dev Summit, later collaborated on Swift for TensorFlow, and founded Modular.
Considered an inefficient hack compared to fine-tuning, used as the current equivalent of few-shot learning.
Jeremy Howard discusses this thought experiment from his cognitive science background as a foundational idea for understanding AI's potential.
More from Latent Space