State of the Art: Training 70B LLMs on 10,000 H100 clusters
Key Moments
Imbue and Databricks discuss training large LLMs, infra challenges, and evaluation methods.
Key Insights
Training large LLMs (70B+) requires massive infrastructure, with networking and hardware reliability being critical challenges.
Imbue is releasing infrastructure scripts, evaluation benchmarks, and a hyperparameter optimizer to aid others in training foundation models.
Databricks recently released a text-to-image model trained exclusively on Shutterstock data, emphasizing data provenance.
Evaluation of LLMs is complex, with significant effort dedicated to cleaning datasets and developing robust benchmarks beyond simple loss metrics.
Tool use and function calling are seen as crucial for interacting with structured data, with code generation and SQL being key approaches.
Long context utilization in LLMs presents challenges in evaluation due to annotation costs, with 'needle in a haystack' being a well-known but flawed method.
INTRODUCTION OF GUESTS AND RECENT DEVELOPMENTS
The podcast introduces Josh Albrecht (CTO of Imbue) and Jon Frankle (Chief AI Scientist at Databricks). Frankle, a previous guest, discusses Databricks' acquisition of MosaicML and their latest release: a text-to-image model developed in collaboration with Shutterstock. The model is notable for being trained exclusively on known Shutterstock data, emphasizing data provenance and trust for enterprise customers, though it is currently API-only.
IMBUE'S RELEASE OF TRAINING RESOURCES
Josh Albrecht details Imbue's contributions aimed at democratizing foundation model training. They are releasing infrastructure and training scripts for managing hardware failures, advanced evaluation tools including curated benchmarks and human judgments, and a cost-aware hyperparameter optimizer (CARBS) to improve prediction and scaling. These resources are intended to lower the barrier for companies to train their own models.
INFRASTRUCTURE CHALLENGES IN LARGE-SCALE TRAINING
A significant portion of the discussion revolves around the immense infrastructure challenges. Training on clusters with thousands of H100 GPUs involves complex networking, such as three-tier architectures, and demands robust fault tolerance. Failures are common, ranging from hardware defects to InfiniBand cable theft, requiring sophisticated monitoring and automated health checks across thousands of machines. Imbue's approach involves collaborating directly with hardware vendors to fix issues at the firmware level.
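To make the automated-health-check idea concrete, here is a minimal sketch of a per-node check battery that decides whether a machine should be scheduled for training. The check names, thresholds, and hard-coded readings are hypothetical illustrations, not Imbue's actual tooling; in practice the readings would come from tools like `nvidia-smi` and `ibstat`.

```python
# Illustrative sketch (not Imbue's actual tooling): run a battery of per-node
# health checks and only schedule a node for training if every check passes.
from dataclasses import dataclass
from typing import Callable

@dataclass
class CheckResult:
    name: str
    healthy: bool
    detail: str = ""

def run_health_checks(checks: list[Callable[[], CheckResult]]) -> tuple[bool, list[CheckResult]]:
    """Run every check; a node is schedulable only if all checks pass."""
    results = [check() for check in checks]
    return all(r.healthy for r in results), results

# Example checks with hard-coded readings for illustration.
def ecc_check() -> CheckResult:
    uncorrected_errors = 0  # would come from `nvidia-smi -q` in practice
    return CheckResult("gpu_ecc", uncorrected_errors == 0)

def ib_link_check() -> CheckResult:
    link_rate_gbps = 400  # would come from `ibstat` in practice
    return CheckResult("infiniband_link", link_rate_gbps >= 400)

ok, results = run_health_checks([ecc_check, ib_link_check])
```

A real fleet would run dozens of such checks per machine and feed the results into monitoring, cordoning off unhealthy nodes automatically.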
THE COMPLEXITY OF MODEL EVALUATION
Both guests emphasize the critical and difficult nature of evaluating LLMs. Imbue has developed cleaned versions of popular benchmarks and internal evaluations, like a code understanding benchmark. They highlight issues with data contamination and ambiguity in standard evaluations, leading them to create their own data and reproduce examples. The focus is on metrics that are both precise and relevant to desired task performance, moving beyond simple loss.
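One common decontamination heuristic, sketched below, flags an evaluation example if any of its word n-grams also appears in the training corpus. This is a generic technique offered for illustration, not necessarily Imbue's exact cleaning pipeline.

```python
# Hedged sketch of n-gram-overlap decontamination: flag an eval example if any
# of its word n-grams also occurs in the training data.
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(eval_example: str, train_ngrams: set, n: int = 8) -> bool:
    """True if the example shares at least one n-gram with the training set."""
    return not ngrams(eval_example, n).isdisjoint(train_ngrams)

# Tiny toy corpus; real pipelines index billions of n-grams.
train_ngrams = ngrams("the quick brown fox jumps over the lazy dog", n=5)
flagged = is_contaminated("the quick brown fox jumps over hills", train_ngrams, n=5)
clean = is_contaminated("an entirely different sentence about model evaluation practices", train_ngrams, n=5)
```

Choosing `n` trades precision against recall: short n-grams over-flag common phrases, long ones miss lightly paraphrased duplicates.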
SCALING LAWS AND HYPERPARAMETER OPTIMIZATION WITH CARBS
Josh Albrecht elaborates on CARBS (Cost-Aware Pareto-Region Bayesian Search), Imbue's hyperparameter optimization tool. Unlike standard optimizers, CARBS accounts for the cost of sampling different configurations, allowing it to identify scaling laws for parameters such as layer count and learning rate. This predictability is crucial for efficiently training massive models: small-scale experiments guide configuration choices so that larger runs are accurate from the start.
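The "predictability" point can be illustrated with a toy scaling-law fit: run cheap small-scale experiments, fit a power law, and extrapolate to the target scale. The data below is synthetic and the least-squares fit is far simpler than CARBS, which is a full cost-aware Bayesian optimizer.

```python
# Toy illustration of extrapolating a scaling law from cheap runs.
# Synthetic data; CARBS itself does far more than this least-squares sketch.
import math

def fit_power_law(compute: list[float], losses: list[float]) -> tuple[float, float]:
    """Fit loss = a * compute**(-b) via linear regression in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = -slope
    a = math.exp(my + b * mx)
    return a, b

# Three small-scale runs that happen to obey loss = 10 * C**-0.1 exactly.
compute = [1e18, 1e19, 1e20]
losses = [10 * c ** -0.1 for c in compute]
a, b = fit_power_law(compute, losses)
predicted_large = a * (1e22) ** -b  # extrapolated loss at 100x more compute
```

Real runs are noisy, so a practical tool fits per-hyperparameter trends jointly and reports uncertainty rather than a single point estimate.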
THE ROLE OF CODE AND STRUCTURED DATA IN AGENTS
The conversation touches on agent capabilities, emphasizing code generation and tool use. Imbue views robust code writing and execution as the ultimate tool, providing access to virtually infinite functionalities. Databricks focuses on enabling models to interact with structured data like SQL databases, seeing this as vital for enterprise customers. While knowledge graphs are explored, the simplicity and efficacy of tools like SQL for structured data interaction are highlighted as key.
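A minimal sketch of "SQL as a tool" follows: the model emits a query, a guarded tool executes it, and the rows go back to the model. The `generated_sql` string stands in for an LLM call, and the whole setup is illustrative rather than Databricks' actual API.

```python
# Illustrative "SQL as a tool" loop: execute model-generated SQL against a
# database and hand the rows back as the tool result.
import sqlite3

def run_sql_tool(conn: sqlite3.Connection, sql: str) -> list[tuple]:
    """Execute read-only SQL; a real deployment would sandbox far more strictly."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("tool only permits SELECT queries")
    return conn.execute(sql).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 32.5)])

# A model asked "what is total order revenue?" might emit:
generated_sql = "SELECT SUM(amount) FROM orders"
rows = run_sql_tool(conn, generated_sql)
```

The appeal highlighted in the episode is exactly this simplicity: SQL is a single, well-understood interface to most enterprise structured data.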
LONG CONTEXT WINDOWS AND EMERGENT PROPERTIES
Long context windows are discussed as essential for agents, though their evaluation is challenging due to high annotation costs. Methods like 'needle in a haystack' are critiqued for not measuring holistic context utilization. Databricks favors thousand-shot tasks and considers scaling laws. The concept of emergent properties in LLMs is debated, with the idea that some perceived emergence might be an artifact of log-scale evaluation metrics.
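For reference, a bare-bones 'needle in a haystack' harness looks like the sketch below: bury one fact at a chosen depth in filler text and check whether the model's answer contains it. The model call is stubbed with a fake answer; as the discussion notes, passing this probe says little about holistic context use.

```python
# Bare-bones 'needle in a haystack' probe; model call stubbed for illustration.
def build_haystack(needle: str, filler_sentence: str, total: int, depth: float) -> str:
    """Place the needle at a fractional depth among `total` filler sentences."""
    pos = int(total * depth)
    sentences = [filler_sentence] * total
    sentences.insert(pos, needle)
    return " ".join(sentences)

def score(model_answer: str, expected: str) -> bool:
    """Pass if the expected fact appears anywhere in the answer."""
    return expected.lower() in model_answer.lower()

needle = "The secret code is 7421."
prompt = build_haystack(needle, "The sky was a pleasant shade of blue.", total=1000, depth=0.5)
# A real harness would send `prompt` plus a question to the model; stubbed here:
fake_answer = "Based on the document, the secret code is 7421."
passed = score(fake_answer, "7421")
```

A model can pass this by pattern-matching one anomalous sentence, which is precisely the criticism raised in the episode: retrieval of a single planted fact is not the same as reasoning over the whole context.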
THE FUTURE OF LLM DEVELOPMENT AND INFRASTRUCTURE
Looking ahead, Imbue is focused on making their models useful for coding and reasoning in daily workflows, with internal prototypes for future product releases. Databricks aims to continue delivering value to their extensive customer base, with plans for more community-facing science sharing and potentially new model releases. Both emphasize the ongoing need for innovation in infrastructure, evaluation, and model capabilities.
Mentioned in This Episode
● Products
● Software & Apps
● Companies
● Books
● Studies Cited
● Concepts
Common Questions
What is DBRX?
DBRX is Databricks' Mixture-of-Experts language model. It has 132 billion total parameters, with 36 billion active on any input, and was pre-trained on 12 trillion tokens of text and code. Databricks also gave it a dinosaur mascot named DB-Rex.
Mentioned in this video
A paper suggesting that emergent behaviors in LLMs are more a function of evaluation metrics than true emergence, influencing Imbue's approach to evaluation.
A research paper that introduced scaling laws for language models, referenced in the discussion of CARBS's ability to learn similar scaling laws for various hyperparameters.
A coding benchmark for which Imbue created its own internal, cleaned version to remove ambiguity and verify examples.
A classic handwritten digit dataset, used as an example to illustrate how benchmarks can have mislabeled examples, making 100% accuracy indicative of a flaw.
A natural language inference benchmark, cleaned by Imbue to remove ambiguous examples and ensure data integrity.
A benchmark with many diverse tasks, some of which are very abstract and unrealistic, which Imbue generally avoids for core model evaluation.
A difficult, new coding benchmark focused on bug fixing, recognized for its realism but also for the challenge it poses for evaluation.
A paper discussing agent evaluations, noted for examples in its appendix that were found to contain non-optimal solutions.
A reading comprehension benchmark, considered for evaluation but noted for having complex and potentially problematic questions.
A common NLP benchmark mentioned as one of the public evaluations that Imbue has reviewed and cleaned for ambiguity and data contamination.
Collaborated with Databricks to create a text-to-image model trained exclusively on Shutterstock's stock photo dataset, notable for strict data provenance.
Acquired MosaicML. Recently released a text-to-image model and DBRX, their large language model with a dinosaur mascot named DB-Rex.
Acquired by Databricks, known for their work on large model training and infrastructure solutions.
Collaborated with Imbue on firmware fixes for server hardware to ensure stability in large-scale GPU clusters.
Mentioned as designing their hardware (TPUs) and software for Mixture-of-Experts models with high network bandwidth and specific network architectures (like 3D toruses).
Mentioned by the host for getting it right with emoji mascots, implying a positive view on their approach to branding and community engagement.
A distributed Docker registry that uses BitTorrent for optimal image transfer between machines, praised for its efficiency and robustness.
Mentioned in the context of InfiniBand network providers, highlighting complexity in vendor choices for cluster networking.
Collaborated with Imbue on firmware fixes and driver updates for their GPUs, essential for stable cluster operation.
A monitoring system used for collecting metrics and understanding the health of individual machines in the GPU cluster.
A benchmark from François Chollet designed to measure abstract reasoning and general intelligence through IQ-test-like problems.
A long-context evaluation method for LLMs, criticized for not measuring holistic context use and for being relatively easy for models to 'trick' without true reasoning.
Described as "the world's simplest agent," giving models the ability to retrieve data from external contexts or databases, bridging unstructured and structured data.
A math reasoning benchmark, noted for having some 'weird' qualities that require careful interpretation.
A cloud object storage service, mentioned as a proxy for local mirror setup for efficient file serving in distributed training.
An open-source framework from NVIDIA for training large language models, providing useful components for distributed training.
A container orchestration system, intentionally not used by Imbue for their training clusters due to its overhead and complexity for experimental, fault-tolerant workloads.
An AI model suite that theoretically supports a high number of parallel function calls, indicating advancement in tool-use capabilities.
A hyperparameter optimizer that models the cost of different configurations, enabling efficient identification of scaling laws for model parameters and data mix.
An object storage server, considered for local mirror setup but deemed too complex for the desired simple infrastructure.
An open-source optimization library used to accelerate large-scale model training, providing helpful working examples for tuning.
A coding benchmark referenced in a bet about DBRX's performance, initially looking quite bad but ultimately exceeded expectations.
Databricks' Mixture-of-Experts language model, featuring 132 billion total parameters and 36 billion active parameters on any input, pre-trained on 12 trillion tokens.