AI Dev 25 | Sharon Zhou & Mahdi Ghodsi: Run Deepseek Reasoning and Finetuning on AMD GPUs w/ Lamini
Key Moments
Lamini introduces MoME (Mixture of Memory Experts) for factual AI accuracy, combining LoRA and MoE on AMD GPUs.
Key Insights
Large language models (LLMs) often hallucinate because their training on diverse internet data optimizes for generalization over factual precision.
Lamini's MoME (Mixture of Memory Experts) system integrates LoRA and MoE, creating specialized adapter weights that improve factual accuracy within LLMs.
Effective fine-tuning requires high-quality, representative data, objective evaluations, and fast iteration cycles using small data subsets.
Automated data generation pipelines and 'vibes-based feedback' can significantly reduce the manual effort in creating training data.
AMD GPUs, particularly the Instinct MI300X with 192 GB of high-bandwidth HBM3 memory, are well suited to running large models and complex fine-tuning tasks efficiently.
AMD actively supports major AI frameworks like PyTorch and TensorFlow, enabling seamless deployment of AI workloads on their hardware.
THE PROBLEM OF HALLUCINATION IN LLMS
Large language models, despite their impressive capabilities, struggle with factual accuracy due to their training on vast, unfiltered internet data. This training optimizes for generalization, leading to models that are 'pretty good at everything but perfect at nothing.' Consequently, when asked for specific facts—like a date—they may generate plausible but incorrect information, a phenomenon known as hallucination. This becomes particularly problematic for enterprises where factual correctness is paramount for business value and decision-making.
LAMINI'S MOME: A SOLUTION FOR FACTUAL REASONING
Lamini addresses hallucination with MoME, a technique that integrates Mixture of Experts (MoE) with Low-Rank Adaptation (LoRA). This Mixture of Memory Experts approach creates specialized adapter weights (experts) within the model that are trained to retrieve and deliver highly accurate facts with near-zero training loss. Unlike methods that rely on external retrieval, MoME embeds this factual knowledge directly into the model's weights, improving reliability without adding significant latency or cost.
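As a rough sketch of the mechanics in standard LoRA notation (not Lamini's exact formulation), a frozen base weight is augmented by low-rank updates, and a router chooses which "memory expert" updates to fuse:

```latex
% Standard LoRA: frozen base weight W plus a cheap low-rank update,
% with rank r much smaller than the weight dimensions d and k.
W' = W + \frac{\alpha}{r} B A, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k}

% In a Mixture of Memory Experts, each expert i carries its own pair (B_i, A_i),
% and a router g selects which experts' updates are added for a given input x:
h = W x + \sum_{i \in \mathrm{TopK}(g(x))} \frac{\alpha}{r} B_i A_i x
```

Because only the small $B_i A_i$ pairs are trained, each expert can be driven to near-zero loss on its assigned facts without touching the frozen base weights.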
OPTIMIZING THE FINE-TUNING PROCESS
Achieving high factual accuracy with fine-tuning requires a strategic approach. Key elements include curating high-quality training data, as models readily commit incorrect information to memory. Establishing objective evaluation sets that are representative of the desired task is crucial for guiding improvement. Furthermore, embracing fast iteration cycles by experimenting with small, representative data subsets allows for rapid debugging and scaling, making the fine-tuning process more efficient and effective.
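The iteration loop described above can be sketched as follows. This is a toy illustration, not Lamini's API: `fit` stands in for a real fine-tuning run by simply memorizing question→answer pairs, and the eval set is reused from training purely for brevity.

```python
import random

def evaluate(model, eval_set):
    """Objective metric: exact-match accuracy on an evaluation set."""
    correct = sum(1 for q, a in eval_set if model(q) == a)
    return correct / len(eval_set)

def fit(pairs):
    """Toy stand-in for fine-tuning: memorize question -> answer pairs."""
    table = dict(pairs)
    return lambda q: table.get(q, "I don't know")

def fast_iteration(train_data, eval_set, subset_size=2, seed=0):
    """Debug cheaply on a small representative subset before scaling up."""
    random.seed(seed)
    subset = random.sample(train_data, min(subset_size, len(train_data)))
    small_score = evaluate(fit(subset), eval_set)   # quick sanity run
    full_score = evaluate(fit(train_data), eval_set)  # scale once sane
    return small_score, full_score

train = [("capital of France?", "Paris"), ("2+2?", "4"), ("H2O is?", "water")]
eval_set = train  # illustrative only; real eval sets are held out
small, full = fast_iteration(train, eval_set)
```

The point of the small-subset run is that data bugs (mislabeled answers, formatting drift) surface in seconds rather than after a full training job.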
AUTOMATED DATA GENERATION AND FEEDBACK
The manual process of creating high-quality training data can be tedious. Lamini proposes automated pipelines using LLMs to generate accurate data, focusing on factual correctness. This includes using schema information, query logs, and even 'vibes-based feedback'—intuitive instructions similar to how a human would learn—to guide data generation. This process enables models to generate their own training data, significantly simplifying the fine-tuning process and moving it closer to the ease of prompt engineering.
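One way such a pipeline can be structured, sketched here with a hypothetical `ask_llm` callable rather than any real endpoint: each training pair is grounded in a known-correct SQL query from the logs, and the LLM is only asked to phrase the natural-language question it answers.

```python
def generate_training_pairs(schema, query_log, ask_llm):
    """Sketch of an automated data-generation pipeline: ground each pair in a
    verified SQL query, using the LLM only to produce the matching question."""
    pairs = []
    for sql in query_log:
        prompt = f"Given the schema {schema}, what question does this SQL answer?\n{sql}"
        question = ask_llm(prompt)
        pairs.append((question, sql))
    return pairs

# Stub LLM so the sketch runs without a real model endpoint.
fake_llm = lambda prompt: "How many users signed up in the last 7 days?"
query = "SELECT COUNT(*) FROM users WHERE signup_date >= date('now', '-7 days')"
pairs = generate_training_pairs("users(id, signup_date)", [query], fake_llm)
```

Grounding the SQL side in real, verified queries keeps factual correctness in the data even though the questions are machine-generated.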
AMD'S ROLE IN SCALABLE AI DEPLOYMENT
AMD is enabling developers to run complex AI workloads, including Lamini's fine-tuning pipeline, on its GPUs. The AMD Instinct MI300X features 192 GB of HBM, providing ample memory for large models and long contexts, and supports running massive models like DeepSeek-R1 (671B parameters) on a single node. AMD actively collaborates with communities such as PyTorch and Hugging Face, ensuring out-of-the-box support for AI frameworks on its hardware.
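The single-node claim follows from simple memory arithmetic, sketched below (back-of-the-envelope only; it ignores activations, KV cache, and framework overhead):

```python
def weights_fit_on_node(params_billion, bytes_per_param, gpus=8, hbm_gb_per_gpu=192):
    """Back-of-the-envelope check: 1e9 params at b bytes each ~= params_billion * b GB."""
    weights_gb = params_billion * bytes_per_param
    node_gb = gpus * hbm_gb_per_gpu
    return weights_gb, node_gb, weights_gb <= node_gb

# DeepSeek-R1 (671B params) quantized to 8 bits: ~671 GB of weights.
# An 8-GPU MI300X node offers 8 * 192 = 1536 GB of HBM, leaving
# headroom for activations and KV cache.
weights, node, fits = weights_fit_on_node(671, 1)
```

At 16-bit precision the same model needs ~1342 GB, which still fits in the node's aggregate HBM but with much less headroom.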
DEMONSTRATING PRACTICAL APPLICATIONS
The presentation showcased practical applications, including running DeepSeek reasoning apps and fine-tuning open LLMs on AMD GPUs. Use cases like building accurate Text-to-SQL agents, as demonstrated with Lamini's platform, highlight the system's efficacy. Real-world examples, such as a customer achieving a significant improvement in data access for 30,000 users by leveraging this technology, underscore the tangible business value derived from enhanced factual accuracy and optimized model deployment.
ARCHITECTURAL INNOVATIONS: MOE IN LORA SPACE
Lamini's MoME architecture applies Mixture of Experts within the LoRA adapter layer rather than on the base model's feed-forward networks: inputs are routed to the relevant LoRA experts, whose low-rank updates are then fused back into the base model. This preserves the efficiency of MoE and the low-cost, low-latency benefits of LoRA, letting the model retrieve specific factual information without altering the frozen weights of the foundation model.
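The routing-then-fusing step can be illustrated in a few lines of NumPy. This is a minimal sketch with made-up dimensions, not Lamini's implementation: a real MoME sits inside transformer layers, and the router and experts are trained jointly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_experts = 16, 2, 4            # hidden dim, LoRA rank, number of memory experts

W = rng.normal(size=(d, d))           # frozen base weight (never updated)
A = rng.normal(size=(n_experts, r, d)) * 0.01  # per-expert LoRA down-projections
B = np.zeros((n_experts, d, r))       # per-expert LoRA up-projections (zero-init)
router = rng.normal(size=(n_experts, d))       # routing weights

def mome_forward(x, top_k=1):
    """Sketch of MoE in LoRA space: route to the top-k LoRA experts and add
    their low-rank updates on top of the frozen base layer's output."""
    scores = router @ x                        # per-expert routing logits
    chosen = np.argsort(scores)[-top_k:]       # indices of the top-k experts
    out = W @ x                                # frozen base path
    for i in chosen:
        out = out + B[i] @ (A[i] @ x)          # fuse the expert's low-rank update
    return out
```

With `B` zero-initialized (standard LoRA practice), the layer initially behaves exactly like the frozen base model, and training only moves the selected experts' adapters.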
ADDRESSING UNSEEN DATA AND GUARDRAILS
When faced with questions outside its training data, an LLM might still hallucinate. To mitigate this, 'guardrails' can be implemented. This involves teaching the model to recognize boundaries and respond with 'I don't know' for topics it hasn't been trained on. While the specialized tuning provides localized determinism for trained facts, applying these guardrails ensures the model doesn't invent information for topics it lacks specific knowledge about, maintaining overall trustworthiness.
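One simple way to express this guardrail at the data level, sketched with a hypothetical helper: augment the training set with explicit refusal examples for out-of-scope questions.

```python
def add_guardrails(train_pairs, out_of_scope_questions, refusal="I don't know"):
    """Augment training data with explicit refusals so the model learns where
    its knowledge ends instead of inventing answers for unseen topics."""
    refusals = [(q, refusal) for q in out_of_scope_questions]
    return train_pairs + refusals

pairs = [("Who founded Lamini?", "Sharon Zhou")]
guarded = add_guardrails(pairs, ["What was our Q3 2019 revenue?"])
```

The refusal examples teach the boundary directly: trained facts stay deterministic, while questions past the boundary get an honest "I don't know" rather than a plausible invention.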
SMALL LANGUAGE MODELS (SLMS) AND EFFICIENCY
The discussion highlighted the efficiency of using smaller language models (SLMs) for specific, fact-critical tasks. For instance, a 3-8 billion parameter model can achieve high factual accuracy, outperforming larger general-purpose models on specialized tasks. This creates an arbitrage opportunity, as specialized AI labor (like a finely tuned SLM for SQL queries) is more cost-effective than general-purpose AI that attempts to do everything. These SLMs are more efficient for targeted applications.
Common Questions
Why do LLMs hallucinate? They are optimized to reduce generalization error across vast internet data, making them good at many things but perfect at none. As a result, they may sample plausible but incorrect facts, rather than providing a direct, accurate answer.
Mentioned in this video
A library used with Python to create the Snake game demonstration.
Founder and CEO of Lamini, with a PhD from Stanford on generative AI.
A company focused on improving factual accuracy in large language models.
The name given to Lamini's architecture, a Mixture of Memory Experts (MoME), designed to improve factual accuracy.
A use case for data analysis and business intelligence where Lamini's factual fine-tuning can be applied.
An example of a Mixture of Experts model.
A powerful GPU from AMD designed for AI workloads, featuring high HBM bandwidth.
An advanced chatbot interface used to demonstrate reasoning models and RAG applications.