The End of Finetuning — with Jeremy Howard of Fast.ai
Key Moments
Jeremy Howard discusses fast.ai, making AI accessible, the evolution of NLP, and the future of model training.
Key Insights
Fast.ai democratized deep learning by focusing on accessibility, especially through transfer learning.
The ULMFiT model laid the groundwork for modern language model pre-training and fine-tuning approaches.
Current fine-tuning methods for LLMs may be suboptimal, and continued pre-training might be a better paradigm.
The focus on zero-shot and few-shot learning initially overshadowed the effectiveness of fine-tuning.
There's a critical need to understand the internal dynamics and data requirements of large language models.
Making powerful AI tools accessible to more people is crucial to prevent a dystopian future controlled by elites.
THE FOUNDING OF FAST.AI AND THE ACCESSIBILITY MOVEMENT
Jeremy Howard recounts the inception of fast.ai, born from the belief that deep learning should be accessible to everyone, not just a select few with PhDs. He highlights the initial skepticism towards making deep learning understandable and usable for ordinary people. The core principle of fast.ai from day one was transfer learning, a technique that was largely overlooked but proved key to making the technology more accessible by reducing compute and data requirements.
ULMFiT'S ROLE IN REVOLUTIONIZING NLP
Howard details the development of ULMFiT, a groundbreaking approach in Natural Language Processing (NLP). Trained on Wikipedia, this large language model demonstrated that pre-training on a vast corpus could imbue a model with significant world knowledge. The subsequent fine-tuning steps, refined in ULMFiT, laid the foundation for the multi-stage training process that characterizes modern LLMs like ChatGPT, proving that such models could achieve state-of-the-art results on various tasks.
CHALLENGING CONVENTIONAL WISDOM IN AI RESEARCH
Howard shares his experience of going against the grain in the NLP community. Despite initial resistance and assertions that language was too complex for his approach, ULMFiT's success, and later advancements by others like OpenAI's Alec Radford, validated his strategy. He notes that even established researchers like Radford initially doubted the efficacy of large-scale pre-training before ULMFiT provided the evidence. This highlights a recurring theme of unconventional ideas eventually proving fruitful.
THE EVOLUTION OF FINE-TUNING AND THE CRITIQUE OF CURRENT METHODS
Reflecting on the current LLM landscape, Howard expresses a view that the standard three-step pre-training and fine-tuning approach, which he pioneered, may no longer be optimal. He suggests that the way fine-tuning is applied today, particularly for tasks like Reinforcement Learning from Human Feedback (RLHF) and instruction tuning, might be leading to issues like catastrophic forgetting. Howard advocates for a paradigm shift towards 'continued pre-training' where all data types are integrated from the start.
RESEARCH PHILOSOPHY: DOING MORE WITH LESS
A consistent theme in Howard's work is the philosophy of achieving more with fewer resources. Whether it's developing accessible courses, efficient software libraries, or researching new model architectures, the goal is to empower a wider range of users. This ethos is evident in fast.ai's research, which often focuses on techniques that reduce data, compute, and educational barriers. Examples include winning the DAWNBench competition with efficient methods and exploring the potential of smaller, more capable models.
THE IMPORTANCE OF ACCESSIBILITY AND DISTRIBUTED AI POWER
Howard emphasizes the societal implication of AI accessibility. He argues that concentrating powerful AI technology in the hands of a few elites is a potentially dystopian path. Instead, he advocates for enabling a broader segment of humanity to leverage these tools, believing that widespread access will lead to greater innovation and benefit for society. He draws parallels to historical technological advancements like the printing press, stressing the importance of distributing power rather than centralizing it.
THE FUTURE OF MODEL DEVELOPMENT AND NEW FRONTIERS
Looking ahead, Howard discusses the underexplored potential of fine-tuning, the inefficiency of Reinforcement Learning from Human Feedback (RLHF) as a standalone method, and the promise of combining retrieval-augmented generation (RAG) with fine-tuning. He also touches on the untapped potential of smaller models, the challenges in evaluating them, and the ongoing research into understanding LLM training dynamics, data curation, and the latent capabilities within these models.
EXPLORING NEW LANGUAGES AND HARDWARE FOR AI
The conversation touches upon the burgeoning ecosystem for AI development, including new programming languages and hardware. Howard shares his excitement about Chris Lattner's work on Mojo, a new language designed for AI that aims to simplify complex tasks like FlashAttention. He also discusses the current landscape of AI frameworks, noting the limitations of Python and the ongoing development in areas like JAX, emphasizing the need for better tools that empower developers to innovate more easily.
THE UNRESOLVED MYSTERIES OF LLM LEARNING DYNAMICS
A significant portion of the discussion revolves around the fundamental unknowns in how large language models learn. Howard highlights the need for more rigorous research into training dynamics, data requirements, and the internal workings of models. He likens the current state of LLM understanding to computer vision in its early days, where key insights into layer functions were still emerging. Understanding these dynamics is crucial for improving model training and capabilities.
THE CALL TO ACTION: EMPOWERING INDIVIDUALS TO BUILD
Howard advocates for a proactive approach to developing AI, encouraging individuals to experiment and build. He notes that in open-source communities, those who genuinely contribute and do the work—even small, initial tasks—stand out and attract support. The message is clear: the future of AI innovation depends on empowering a diverse range of builders, not just a privileged few, to achieve valuable work and contribute to a better future.
Common Questions
What is Jeremy Howard's educational background?
Jeremy Howard initially pursued a BA in Philosophy at the University of Melbourne, focusing on ethics and cognitive science, which he later found relevant to his AI work.
Mentioned in this video
A wrapper for Theano, mentioned as an early deep learning tool.
Fast.ai courses teach from basics to Stable Diffusion in about seven weeks.
Jeremy Howard achieved a new state-of-the-art academic result on IMDb within hours of trying his ULMFiT approach.
One of the two companies Jeremy Howard founded in June 1999, providing synchronized email.
Jeremy Howard notes that the three-step system he developed for ULMFiT is essentially what powers ChatGPT today.
A fine-tuned version of Llama 2 by Meta, which became good at coding but at the cost of forgetting other capabilities (catastrophic forgetting).
Chris Lattner's work on LLVM is mentioned as foundational for his subsequent projects like Swift and Mojo.
A small language model that excels at generating short Python snippets, trained on synthetic data and lacking general world knowledge.
A top chess engine that GPT-4 was compared against, achieving a near-equivalent Elo rating with advanced prompting.
The PyTorch team released a 3D matrix product visualizer, an example of tools that help understand model behavior like attention layers.
The base model for Code Llama, which was fine-tuned by Meta.
Chris Lattner's work at Google involved Swift for TensorFlow, and Jeremy Howard learned Swift to collaborate.
Mentioned as a tool that enables people from manual jobs to start training language models.
A landmark convolutional neural network in computer vision, mentioned as part of the early development phase similar to current LLM understanding.
Mentioned as an early deep learning library, along with its wrapper Lasagne.
A language model developed around the same time as ULMFiT, but with a different approach.
TensorFlow 2 is described as a failure internally at Google, leading to the development of alternative frameworks like JAX and Chris Lattner's Swift for TensorFlow.
Demonstrated advanced chess-playing capabilities (Elo 3400) with sophisticated prompting strategies, showing hidden potential.
A parallel computing platform and API model created by Nvidia, mentioned in the context of writing low-level GPU code for optimizations.
A backup plan for Google's AI future after TensorFlow 2's perceived failure, initially intended as a research project but now a key framework.
An underappreciated small language model that delivers 7B-class quality at a 3B parameter size.
An optimization technique for attention mechanisms, mentioned as an example of innovations that could be made easier with better languages like Mojo.
A paper on ResNets visualizing its loss surface with and without skip connections is cited as an example of the type of work needed to understand model learning dynamics.
A new programming language and infrastructure being developed by Modular, aiming to make it easier to build advanced AI tools like FlashAttention.
Jeremy Howard anticipates having a similar visceral reaction to using GPT-5 as he did with GPT-4.
Noted for releasing early large language models and later for developing TPUs and the JAX framework.
Founded by Jeremy Howard in June 1999, it invented a new approach to insurance pricing called profit-optimized insurance pricing.
Followed OpenAI's rapid development model; mentioned as a place that those interested in deep learning should not join.
Jeremy Howard mentions working 80-100 hour weeks there from age 19, impacting his university studies.
Developed Code Llama, a fine-tuned version of Llama 2.
Mentioned for their 'Trainer' library, which Jeremy Howard initially suspected of containing bugs related to training curve 'clunks'.
Mentioned as a leading lab that rapidly developed working models, contributing to technical debt due to speed. Also discussed in the context of their Scholars program.
A new company founded by Chris Lattner, aiming to create a new language and infrastructure based on their past collaborations and ideas.
Their work on an earlier, less linear version of RLHF showed it performed better than later versions, and they created a Phi-1.5-web model that wasn't released.
Jeremy Howard trained a large language model on Wikipedia, seeing it as a significant corpus of text that could teach a model about the world.
Jeremy Howard was President and Chief Scientist here, working on deep learning for medical diagnostics. He was also the top-ranked participant in 2010 and 2011.
Jeremy Howard graduated with a BA in Philosophy from here.
A programming language that helps in writing GPU-optimized code, making innovations like FlashAttention more accessible.
Co-authored a paper with Andrew Dai on early large language models, focusing on domain-specific corpora.
Worked with Jeremy Howard at Fast.ai and created the fastprogress library.
Mentioned as one of the two strongest people in language models at the time, who was initially skeptical of pre-training on Wikipedia.
Journalist from The New York Times who organized a conversation between Jeremy Howard and Alec Radford.
Co-authored a paper with Quoc Le on early large language models, though their work focused on domain-specific corpora.
Creator of FlashAttention, who was asked if others had considered similar ideas before him, highlighting the importance of background and initiative.
Co-founder of Fast.ai with Jeremy Howard, focusing on making deep learning accessible.
Considered one of the strongest people in language models, influenced by ULMFiT for his GPT work.
Jeremy Howard mentions hearing himself on a podcast with Tanishq, discussing Tanishq's work.
Mentioned for his RNN demonstration showing text ingestion and generation, similar to Jeremy Howard's early work with language models.
Developed a multitask learning model at Salesforce prior to ULMFiT, but without the general fine-tuning step.
Met Jeremy Howard at the TensorFlow Dev Summit, later collaborated on Swift for TensorFlow, and founded Modular.
Considered an inefficient hack compared to fine-tuning, used as the current equivalent of few-shot learning.
Jeremy Howard discusses this thought experiment from his cognitive science background as a foundational idea for understanding AI's potential.
More from Latent Space