AI Dev 25 | Sharon Zhou & Mahdi Ghodsi: Run Deepseek Reasoning and Finetuning on AMD GPUs w/ Lamini
Key Moments
Lamini introduces MoME (Mixture of Memory Experts) for factual AI accuracy, combining LoRA and MoE on AMD GPUs.
Key Insights
Large language models (LLMs) often hallucinate because their training on diverse internet data optimizes for generalization over factual precision.
Lamini's MoME (Mixture of Memory Experts) system integrates LoRA and MoE, creating specialized adapter weights that improve factual accuracy within LLMs.
Effective fine-tuning requires high-quality, representative data, objective evaluations, and fast iteration cycles using small data subsets.
Automated data generation pipelines and 'vibes-based feedback' can significantly reduce the manual effort in creating training data.
AMD GPUs, particularly the Instinct MI300X with 192 GB of high-bandwidth HBM3 memory, are well suited to running large models and complex fine-tuning tasks efficiently.
AMD actively supports major AI frameworks like PyTorch and TensorFlow, enabling seamless deployment of AI workloads on their hardware.
THE PROBLEM OF HALLUCINATION IN LLMS
Large language models, despite their impressive capabilities, struggle with factual accuracy due to their training on vast, unfiltered internet data. This training optimizes for generalization, leading to models that are 'pretty good at everything but perfect at nothing.' Consequently, when asked for specific facts—like a date—they may generate plausible but incorrect information, a phenomenon known as hallucination. This becomes particularly problematic for enterprises where factual correctness is paramount for business value and decision-making.
LAMINI'S MOME: A SOLUTION FOR FACTUAL REASONING
Lamini addresses hallucination with MoME, a technique that integrates Mixture of Experts (MoE) with Low-Rank Adaptation (LoRA). This Mixture of Memory Experts approach creates specialized adapter weights (experts) within the model that are trained to retrieve and deliver highly accurate facts with near-zero training loss. Unlike methods that rely on external retrieval, MoME embeds this factual knowledge directly into the model's weights, improving reliability without adding significant latency or cost.
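As a rough sketch of the mechanics in standard LoRA notation (not Lamini's exact formulation), a frozen base weight is augmented by low-rank updates, and a router chooses which "memory expert" updates to fuse:

```latex
% Standard LoRA: frozen base weight W plus a cheap low-rank update,
% with rank r much smaller than the weight dimensions d and k.
W' = W + \frac{\alpha}{r} B A, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k}

% In a Mixture of Memory Experts, each expert i carries its own pair (B_i, A_i),
% and a router g selects which experts' updates are added for a given input x:
h = W x + \sum_{i \in \mathrm{TopK}(g(x))} \frac{\alpha}{r} B_i A_i x
```

Because only the small $B_i A_i$ pairs are trained, each expert can be driven to near-zero loss on its assigned facts without touching the frozen base weights.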
OPTIMIZING THE FINE-TUNING PROCESS
Achieving high factual accuracy with fine-tuning requires a strategic approach. Key elements include curating high-quality training data, as models readily commit incorrect information to memory. Establishing objective evaluation sets that are representative of the desired task is crucial for guiding improvement. Furthermore, embracing fast iteration cycles by experimenting with small, representative data subsets allows for rapid debugging and scaling, making the fine-tuning process more efficient and effective.
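The iteration loop described above can be sketched as follows. This is a toy illustration, not Lamini's API: `fit` stands in for a real fine-tuning run by simply memorizing question→answer pairs, and the eval set is reused from training purely for brevity.

```python
import random

def evaluate(model, eval_set):
    """Objective metric: exact-match accuracy on an evaluation set."""
    correct = sum(1 for q, a in eval_set if model(q) == a)
    return correct / len(eval_set)

def fit(pairs):
    """Toy stand-in for fine-tuning: memorize question -> answer pairs."""
    table = dict(pairs)
    return lambda q: table.get(q, "I don't know")

def fast_iteration(train_data, eval_set, subset_size=2, seed=0):
    """Debug cheaply on a small representative subset before scaling up."""
    random.seed(seed)
    subset = random.sample(train_data, min(subset_size, len(train_data)))
    small_score = evaluate(fit(subset), eval_set)   # quick sanity run
    full_score = evaluate(fit(train_data), eval_set)  # scale once sane
    return small_score, full_score

train = [("capital of France?", "Paris"), ("2+2?", "4"), ("H2O is?", "water")]
eval_set = train  # illustrative only; real eval sets are held out
small, full = fast_iteration(train, eval_set)
```

The point of the small-subset run is that data bugs (mislabeled answers, formatting drift) surface in seconds rather than after a full training job.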
AUTOMATED DATA GENERATION AND FEEDBACK
The manual process of creating high-quality training data can be tedious. Lamini proposes automated pipelines using LLMs to generate accurate data, focusing on factual correctness. This includes using schema information, query logs, and even 'vibes-based feedback'—intuitive instructions similar to how a human would learn—to guide data generation. This process enables models to generate their own training data, significantly simplifying the fine-tuning process and moving it closer to the ease of prompt engineering.
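One way such a pipeline can be structured, sketched here with a hypothetical `ask_llm` callable rather than any real endpoint: each training pair is grounded in a known-correct SQL query from the logs, and the LLM is only asked to phrase the natural-language question it answers.

```python
def generate_training_pairs(schema, query_log, ask_llm):
    """Sketch of an automated data-generation pipeline: ground each pair in a
    verified SQL query, using the LLM only to produce the matching question."""
    pairs = []
    for sql in query_log:
        prompt = f"Given the schema {schema}, what question does this SQL answer?\n{sql}"
        question = ask_llm(prompt)
        pairs.append((question, sql))
    return pairs

# Stub LLM so the sketch runs without a real model endpoint.
fake_llm = lambda prompt: "How many users signed up in the last 7 days?"
query = "SELECT COUNT(*) FROM users WHERE signup_date >= date('now', '-7 days')"
pairs = generate_training_pairs("users(id, signup_date)", [query], fake_llm)
```

Grounding the SQL side in real, verified queries keeps factual correctness in the data even though the questions are machine-generated.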
AMD'S ROLE IN SCALABLE AI DEPLOYMENT
AMD is enabling developers to run complex AI workloads, including Lamini's fine-tuning pipeline, on its GPUs. The AMD Instinct MI300X features 192 GB of HBM, providing ample memory for large models and long contexts, and supports running massive models like DeepSeek-R1 (671B parameters) on a single node. AMD actively collaborates with communities such as PyTorch and Hugging Face, ensuring out-of-the-box support for AI frameworks on its hardware.
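The single-node claim follows from simple memory arithmetic, sketched below (back-of-the-envelope only; it ignores activations, KV cache, and framework overhead):

```python
def weights_fit_on_node(params_billion, bytes_per_param, gpus=8, hbm_gb_per_gpu=192):
    """Back-of-the-envelope check: 1e9 params at b bytes each ~= params_billion * b GB."""
    weights_gb = params_billion * bytes_per_param
    node_gb = gpus * hbm_gb_per_gpu
    return weights_gb, node_gb, weights_gb <= node_gb

# DeepSeek-R1 (671B params) quantized to 8 bits: ~671 GB of weights.
# An 8-GPU MI300X node offers 8 * 192 = 1536 GB of HBM, leaving
# headroom for activations and KV cache.
weights, node, fits = weights_fit_on_node(671, 1)
```

At 16-bit precision the same model needs ~1342 GB, which still fits in the node's aggregate HBM but with much less headroom.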
DEMONSTRATING PRACTICAL APPLICATIONS
The presentation showcased practical applications, including running DeepSeek reasoning apps and fine-tuning open LLMs on AMD GPUs. Use cases like building accurate Text-to-SQL agents, as demonstrated with Lamini's platform, highlight the system's efficacy. Real-world examples, such as a customer achieving a significant improvement in data access for 30,000 users by leveraging this technology, underscore the tangible business value derived from enhanced factual accuracy and optimized model deployment.
ARCHITECTURAL INNOVATIONS: MOE IN LORA SPACE
Lamini's MoME architecture applies Mixture of Experts within the LoRA adapter layer rather than on the base model's feed-forward networks: inputs are routed to the relevant LoRA experts, whose low-rank updates are then fused back into the base model. This preserves the efficiency of MoE and the low-cost, low-latency benefits of LoRA, letting the model retrieve specific factual information without altering the frozen weights of the foundation model.
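The routing-then-fusing step can be illustrated in a few lines of NumPy. This is a minimal sketch with made-up dimensions, not Lamini's implementation: a real MoME sits inside transformer layers, and the router and experts are trained jointly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_experts = 16, 2, 4            # hidden dim, LoRA rank, number of memory experts

W = rng.normal(size=(d, d))           # frozen base weight (never updated)
A = rng.normal(size=(n_experts, r, d)) * 0.01  # per-expert LoRA down-projections
B = np.zeros((n_experts, d, r))       # per-expert LoRA up-projections (zero-init)
router = rng.normal(size=(n_experts, d))       # routing weights

def mome_forward(x, top_k=1):
    """Sketch of MoE in LoRA space: route to the top-k LoRA experts and add
    their low-rank updates on top of the frozen base layer's output."""
    scores = router @ x                        # per-expert routing logits
    chosen = np.argsort(scores)[-top_k:]       # indices of the top-k experts
    out = W @ x                                # frozen base path
    for i in chosen:
        out = out + B[i] @ (A[i] @ x)          # fuse the expert's low-rank update
    return out
```

With `B` zero-initialized (standard LoRA practice), the layer initially behaves exactly like the frozen base model, and training only moves the selected experts' adapters.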
ADDRESSING UNSEEN DATA AND GUARDRAILS
When faced with questions outside its training data, an LLM might still hallucinate. To mitigate this, 'guardrails' can be implemented. This involves teaching the model to recognize boundaries and respond with 'I don't know' for topics it hasn't been trained on. While the specialized tuning provides localized determinism for trained facts, applying these guardrails ensures the model doesn't invent information for topics it lacks specific knowledge about, maintaining overall trustworthiness.
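One simple way to express this guardrail at the data level, sketched with a hypothetical helper: augment the training set with explicit refusal examples for out-of-scope questions.

```python
def add_guardrails(train_pairs, out_of_scope_questions, refusal="I don't know"):
    """Augment training data with explicit refusals so the model learns where
    its knowledge ends instead of inventing answers for unseen topics."""
    refusals = [(q, refusal) for q in out_of_scope_questions]
    return train_pairs + refusals

pairs = [("Who founded Lamini?", "Sharon Zhou")]
guarded = add_guardrails(pairs, ["What was our Q3 2019 revenue?"])
```

The refusal examples teach the boundary directly: trained facts stay deterministic, while questions past the boundary get an honest "I don't know" rather than a plausible invention.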
SMALL LANGUAGE MODELS (SLMS) AND EFFICIENCY
The discussion highlighted the efficiency of using smaller language models (SLMs) for specific, fact-critical tasks. For instance, a 3-8 billion parameter model can achieve high factual accuracy, outperforming larger general-purpose models on specialized tasks. This creates an arbitrage opportunity, as specialized AI labor (like a finely tuned SLM for SQL queries) is more cost-effective than general-purpose AI that attempts to do everything. These SLMs are more efficient for targeted applications.
Common Questions
Why do LLMs hallucinate? They are optimized to reduce generalization error across vast internet data, making them good at many things but perfect at none. As a result, they may sample plausible but incorrect facts, rather than providing a direct, accurate answer.
Mentioned in this video
A library used with Python to create the Snake game demonstration.
Founder and CEO of Lamini, with a PhD from Stanford on generative AI.
A company focused on improving factual accuracy in large language models.
The name given to Lamini's architecture, a Mixture of Memory Experts (MoME), designed to improve factual accuracy.
A use case for data analysis and business intelligence where Lamini's factual fine-tuning can be applied.
An example of a Mixture of Experts model.
A powerful GPU from AMD designed for AI workloads, featuring high HBM bandwidth.
An advanced chatbot interface used to demonstrate reasoning models and RAG applications.