State of the Art: Training 70B LLMs on 10,000 H100 clusters
Key Moments
Imbue and Databricks discuss training large LLMs, infra challenges, and evaluation methods.
Key Insights
Training large LLMs (70B+) requires massive infrastructure, with networking and hardware reliability being critical challenges.
Imbue is releasing infrastructure scripts, evaluation benchmarks, and a hyperparameter optimizer to aid others in training foundation models.
Databricks recently released a text-to-image model trained exclusively on Shutterstock data, emphasizing data provenance.
Evaluation of LLMs is complex, with significant effort dedicated to cleaning datasets and developing robust benchmarks beyond simple loss metrics.
Tool use and function calling are seen as crucial for interacting with structured data, with code generation and SQL being key approaches.
Long context utilization in LLMs presents challenges in evaluation due to annotation costs, with 'needle in a haystack' being a well-known but flawed method.
INTRODUCTION OF GUESTS AND RECENT DEVELOPMENTS
The podcast introduces Josh Albrecht (CTO of Imbue) and Jon Frankle (Chief AI Scientist at Databricks). Frankle, a previous guest, discusses Databricks' acquisition of MosaicML and their latest release: a text-to-image model developed in collaboration with Shutterstock. The model is notable for being trained exclusively on known Shutterstock data, emphasizing data provenance and trust for enterprise customers, though it is currently API-only.
IMBUE'S RELEASE OF TRAINING RESOURCES
Josh Albrecht details Imbue's contributions aimed at democratizing foundation model training. They are releasing infrastructure and training scripts for managing hardware failures, advanced evaluation tools including curated benchmarks and human judgments, and a cost-aware hyperparameter optimizer (CARBS) to improve prediction and scaling. These resources are intended to lower the barrier for companies to train their own models.
INFRASTRUCTURE CHALLENGES IN LARGE-SCALE TRAINING
A significant portion of the discussion revolves around the immense infrastructure challenges. Training on clusters with thousands of H100 GPUs involves complex networking, such as three-tier architectures, and demands robust fault tolerance. Failures are common, ranging from hardware defects to InfiniBand cable theft, requiring sophisticated monitoring and automated health checks across thousands of machines. Imbue's approach involves collaborating directly with hardware vendors to fix issues at the firmware level.
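To make the automated-health-check idea concrete, here is a minimal sketch of a per-node check battery that decides whether a machine should be scheduled for training. The check names, thresholds, and hard-coded readings are hypothetical illustrations, not Imbue's actual tooling; in practice the readings would come from tools like `nvidia-smi` and `ibstat`.

```python
# Illustrative sketch (not Imbue's actual tooling): run a battery of per-node
# health checks and only schedule a node for training if every check passes.
from dataclasses import dataclass
from typing import Callable

@dataclass
class CheckResult:
    name: str
    healthy: bool
    detail: str = ""

def run_health_checks(checks: list[Callable[[], CheckResult]]) -> tuple[bool, list[CheckResult]]:
    """Run every check; a node is schedulable only if all checks pass."""
    results = [check() for check in checks]
    return all(r.healthy for r in results), results

# Example checks with hard-coded readings for illustration.
def ecc_check() -> CheckResult:
    uncorrected_errors = 0  # would come from `nvidia-smi -q` in practice
    return CheckResult("gpu_ecc", uncorrected_errors == 0)

def ib_link_check() -> CheckResult:
    link_rate_gbps = 400  # would come from `ibstat` in practice
    return CheckResult("infiniband_link", link_rate_gbps >= 400)

ok, results = run_health_checks([ecc_check, ib_link_check])
```

A real fleet would run dozens of such checks per machine and feed the results into monitoring, cordoning off unhealthy nodes automatically.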
THE COMPLEXITY OF MODEL EVALUATION
Both guests emphasize the critical and difficult nature of evaluating LLMs. Imbue has developed cleaned versions of popular benchmarks and internal evaluations, like a code understanding benchmark. They highlight issues with data contamination and ambiguity in standard evaluations, leading them to create their own data and reproduce examples. The focus is on metrics that are both precise and relevant to desired task performance, moving beyond simple loss.
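One common decontamination heuristic, sketched below, flags an evaluation example if any of its word n-grams also appears in the training corpus. This is a generic technique offered for illustration, not necessarily Imbue's exact cleaning pipeline.

```python
# Hedged sketch of n-gram-overlap decontamination: flag an eval example if any
# of its word n-grams also occurs in the training data.
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(eval_example: str, train_ngrams: set, n: int = 8) -> bool:
    """True if the example shares at least one n-gram with the training set."""
    return not ngrams(eval_example, n).isdisjoint(train_ngrams)

# Tiny toy corpus; real pipelines index billions of n-grams.
train_ngrams = ngrams("the quick brown fox jumps over the lazy dog", n=5)
flagged = is_contaminated("the quick brown fox jumps over hills", train_ngrams, n=5)
clean = is_contaminated("an entirely different sentence about model evaluation practices", train_ngrams, n=5)
```

Choosing `n` trades precision against recall: short n-grams over-flag common phrases, long ones miss lightly paraphrased duplicates.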
SCALING LAWS AND HYPERPARAMETER OPTIMIZATION WITH CARBS
Josh Albrecht elaborates on CARBS (Cost-Aware Pareto-Region Bayesian Search), Imbue's hyperparameter optimization tool. Unlike standard optimizers, CARBS accounts for the cost of sampling different configurations, allowing it to identify scaling laws for parameters such as layer count and learning rate. This predictability is crucial for efficiently training massive models: small-scale experiments guide configuration choices so that larger runs are accurate from the start.
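The "predictability" point can be illustrated with a toy scaling-law fit: run cheap small-scale experiments, fit a power law, and extrapolate to the target scale. The data below is synthetic and the least-squares fit is far simpler than CARBS, which is a full cost-aware Bayesian optimizer.

```python
# Toy illustration of extrapolating a scaling law from cheap runs.
# Synthetic data; CARBS itself does far more than this least-squares sketch.
import math

def fit_power_law(compute: list[float], losses: list[float]) -> tuple[float, float]:
    """Fit loss = a * compute**(-b) via linear regression in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = -slope
    a = math.exp(my + b * mx)
    return a, b

# Three small-scale runs that happen to obey loss = 10 * C**-0.1 exactly.
compute = [1e18, 1e19, 1e20]
losses = [10 * c ** -0.1 for c in compute]
a, b = fit_power_law(compute, losses)
predicted_large = a * (1e22) ** -b  # extrapolated loss at 100x more compute
```

Real runs are noisy, so a practical tool fits per-hyperparameter trends jointly and reports uncertainty rather than a single point estimate.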
THE ROLE OF CODE AND STRUCTURED DATA IN AGENTS
The conversation touches on agent capabilities, emphasizing code generation and tool use. Imbue views robust code writing and execution as the ultimate tool, providing access to virtually infinite functionalities. Databricks focuses on enabling models to interact with structured data like SQL databases, seeing this as vital for enterprise customers. While knowledge graphs are explored, the simplicity and efficacy of tools like SQL for structured data interaction are highlighted as key.
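A minimal sketch of "SQL as a tool" follows: the model emits a query, a guarded tool executes it, and the rows go back to the model. The `generated_sql` string stands in for an LLM call, and the whole setup is illustrative rather than Databricks' actual API.

```python
# Illustrative "SQL as a tool" loop: execute model-generated SQL against a
# database and hand the rows back as the tool result.
import sqlite3

def run_sql_tool(conn: sqlite3.Connection, sql: str) -> list[tuple]:
    """Execute read-only SQL; a real deployment would sandbox far more strictly."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("tool only permits SELECT queries")
    return conn.execute(sql).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 32.5)])

# A model asked "what is total order revenue?" might emit:
generated_sql = "SELECT SUM(amount) FROM orders"
rows = run_sql_tool(conn, generated_sql)
```

The appeal highlighted in the episode is exactly this simplicity: SQL is a single, well-understood interface to most enterprise structured data.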
LONG CONTEXT WINDOWS AND EMERGENT PROPERTIES
Long context windows are discussed as essential for agents, though their evaluation is challenging due to high annotation costs. Methods like 'needle in a haystack' are critiqued for not measuring holistic context utilization. Databricks favors thousand-shot tasks and considers scaling laws. The concept of emergent properties in LLMs is debated, with the idea that some perceived emergence might be an artifact of log-scale evaluation metrics.
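For reference, a bare-bones 'needle in a haystack' harness looks like the sketch below: bury one fact at a chosen depth in filler text and check whether the model's answer contains it. The model call is stubbed with a fake answer; as the discussion notes, passing this probe says little about holistic context use.

```python
# Bare-bones 'needle in a haystack' probe; model call stubbed for illustration.
def build_haystack(needle: str, filler_sentence: str, total: int, depth: float) -> str:
    """Place the needle at a fractional depth among `total` filler sentences."""
    pos = int(total * depth)
    sentences = [filler_sentence] * total
    sentences.insert(pos, needle)
    return " ".join(sentences)

def score(model_answer: str, expected: str) -> bool:
    """Pass if the expected fact appears anywhere in the answer."""
    return expected.lower() in model_answer.lower()

needle = "The secret code is 7421."
prompt = build_haystack(needle, "The sky was a pleasant shade of blue.", total=1000, depth=0.5)
# A real harness would send `prompt` plus a question to the model; stubbed here:
fake_answer = "Based on the document, the secret code is 7421."
passed = score(fake_answer, "7421")
```

A model can pass this by pattern-matching one anomalous sentence, which is precisely the criticism raised in the episode: retrieval of a single planted fact is not the same as reasoning over the whole context.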
THE FUTURE OF LLM DEVELOPMENT AND INFRASTRUCTURE
Looking ahead, Imbue is focused on making their models useful for coding and reasoning in daily workflows, with internal prototypes for future product releases. Databricks aims to continue delivering value to their extensive customer base, with plans for more community-facing science sharing and potentially new model releases. Both emphasize the ongoing need for innovation in infrastructure, evaluation, and model capabilities.
Mentioned in This Episode
● Products
● Software & Apps
● Companies
● Books
● Studies Cited
● Concepts
Common Questions
What is DBRX?
DBRX is Databricks' Mixture-of-Experts language model. It has 132 billion total parameters, with 36 billion active on any input, and was pre-trained on 12 trillion tokens of text and code. Databricks also gave it a dinosaur mascot named DB-Rex.
Mentioned in this video
A paper suggesting that emergent behaviors in LLMs are more a function of evaluation metrics than true emergence, influencing Imbue's approach to evaluation.
A research paper that introduced scaling laws for language models, referenced in the discussion of CARBS's ability to learn similar scaling laws for various hyperparameters.
A coding benchmark for which Imbue created its own internal, cleaned version to remove ambiguity and verify examples.
A classic handwritten digit dataset, used as an example to illustrate how benchmarks can have mislabeled examples, making 100% accuracy indicative of a flaw.
A natural language inference benchmark, cleaned by Imbue to remove ambiguous examples and ensure data integrity.
A benchmark with many diverse tasks, some of which are very abstract and unrealistic, which Imbue generally avoids for core model evaluation.
A difficult, new coding benchmark focused on bug fixing, recognized for its realism but also for the challenge it poses for evaluation.
A paper discussing agent evaluations, noted for examples in its appendix that were found to contain non-optimal solutions.
A reading comprehension benchmark, considered for evaluation but noted for having complex and potentially problematic questions.
A common NLP benchmark mentioned as one of the public evaluations that Imbue has reviewed and cleaned for ambiguity and data contamination.
Collaborated with Databricks to create a text-to-image model trained exclusively on Shutterstock's stock photo dataset, notable for strict data provenance.
Acquired MosaicML. Recently released a text-to-image model and DBRX, their large language model with a dinosaur mascot named DB-Rex.
Acquired by Databricks, known for their work on large model training and infrastructure solutions.
Collaborated with Imbue on firmware fixes for server hardware to ensure stability in large-scale GPU clusters.
Mentioned as designing their hardware (TPUs) and software for Mixture-of-Experts models with high network bandwidth and specific network architectures (like 3D toruses).
Mentioned by the host for getting it right with emoji mascots, implying a positive view on their approach to branding and community engagement.
A distributed Docker registry that uses BitTorrent for optimal image transfer between machines, praised for its efficiency and robustness.
Mentioned in the context of InfiniBand network providers, highlighting complexity in vendor choices for cluster networking.
Collaborated with Imbue on firmware fixes and driver updates for their GPUs, essential for stable cluster operation.
A monitoring system used for collecting metrics and understanding the health of individual machines in the GPU cluster.
A benchmark from François Chollet designed to measure abstract reasoning and general intelligence through IQ-test-like problems.
A long-context evaluation method for LLMs, criticized for not measuring holistic context use and for being relatively easy for models to 'trick' without true reasoning.
Described as "the world's simplest agent," giving models the ability to retrieve data from external contexts or databases, bridging unstructured and structured data.
A math reasoning benchmark, noted for having some 'weird' qualities that require careful interpretation.
A cloud object storage service, mentioned as a proxy for local mirror setup for efficient file serving in distributed training.
An open-source framework from NVIDIA for training large language models, providing useful components for distributed training.
A container orchestration system, intentionally not used by Imbue for their training clusters due to its overhead and complexity for experimental, fault-tolerant workloads.
An AI model suite that theoretically supports a high number of parallel function calls, indicating advancement in tool-use capabilities.
A hyperparameter optimizer that models the cost of different configurations, enabling efficient identification of scaling laws for model parameters and data mix.
An object storage server, considered for local mirror setup but deemed too complex for the desired simple infrastructure.
An open-source optimization library used to accelerate large-scale model training, providing helpful working examples for tuning.
A coding benchmark referenced in a bet about DBRX's performance, initially looking quite bad but ultimately exceeded expectations.
Databricks' Mixture-of-Experts language model, featuring 132 billion total parameters and 36 billion active parameters on any input, pre-trained on 12 trillion tokens.