⚡️Launching AI Diplomacy: the hardest LLM Game Benchmark yet - Alex Duffy
Key Moments
Alex Duffy discusses AI Diplomacy, LLM benchmarks, and the future of AI in games and creative writing.
Key Insights
AI Diplomacy is a new LLM benchmark that uses the game of Diplomacy to test AI capabilities in strategy, negotiation, and deception.
Games are valuable benchmarks for LLMs because they offer evolving challenges and can teach both AI and humans new strategies.
Benchmarks act as 'memes,' ideas that spread and help evaluate AI's capabilities in various domains beyond typical math and code.
AI should be viewed as a leverage tool to amplify human goals and creativity, with humans defining the objectives and ethical boundaries.
Effective AI integration, especially in creative fields like writing, involves human-AI collaboration with continuous editing and reflection.
Future advancements in AI benchmarks will likely focus on agents (models plus tools) and more complex, interactive scenarios.
EVERY: A Hub for AI Innovation
Alex Duffy, Head of AI at Every, describes the company's culture as a blend of experienced founders and engineers passionate about AI. Every operates as a media company with a strong AI product focus, fostering cross-pollination of ideas. It has developed several AI products, including Kora (an email app), Sparkle (desktop organization), Spiral (content transformation), and Monologue (a local Whisper Flow alternative). This diverse team and product suite position Every well for testing and developing new AI models.
The Genesis of AI Diplomacy
AI Diplomacy emerged from a community discussion on Twitter, inspired by the potential of games as LLM benchmarks, a concept previously explored by figures like Andrej Karpathy. Duffy, with a weekend to spare, created a basic implementation, which quickly garnered interest from researchers worldwide. Collaborating with Tyler Marquez on the frontend, the project evolved into a functional benchmark, aiming to make AI capabilities more accessible and understandable to a broader audience, not just AI engineers.
Games as Evolving Benchmarks
The use of games like Dota, Go, and Chess for AI benchmarking has a history, with models eventually surpassing human capabilities. Duffy emphasizes that the true potential lies in self-play against other LLMs, creating an infinitely scalable challenge. Diplomacy, in particular, is ideal because it requires complex negotiation, deception, and strategic foresight. As LLMs improve, the game inherently becomes more challenging, providing a dynamic and evolving benchmark for assessing AI's multifaceted abilities.
Technical Insights and Model Behavior
Developing the AI Diplomacy harness involved careful consideration of context management, representing game state, and handling LLM output variations. Duffy notes significant differences in how models like Claude, Gemini, and DeepSeek Reasoner behave. Claude is perceived as too polite to engage in necessary deception, while DeepSeek can be highly aggressive. The harness was designed to be adaptable, using elements like relationship tracking and diary entries to provide crucial context, even when models' natural communication styles vary greatly.
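The context-management elements described here (game state, relationship tracking, diary entries) could be sketched as a small per-power data structure. This is a minimal sketch; the class, field, and method names are hypothetical illustrations, not the actual AI Diplomacy harness code.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of per-power agent state in a Diplomacy-style harness.
# Names are illustrative, not the actual AI Diplomacy implementation.

@dataclass
class PowerAgent:
    power: str                                         # e.g. "FRANCE"
    model: str                                         # which LLM plays this power
    relationships: dict = field(default_factory=dict)  # power -> "ally"/"neutral"/"enemy"
    diary: list = field(default_factory=list)          # private notes carried across turns

    def build_context(self, game_state: str, messages: list) -> str:
        """Assemble the prompt context for one negotiation/order phase."""
        rel = "\n".join(f"{p}: {s}" for p, s in sorted(self.relationships.items()))
        diary = "\n".join(self.diary[-5:])    # truncate: only recent diary entries
        msgs = "\n".join(messages[-10:])      # truncate: only recent messages
        return (
            f"You are {self.power}, played by {self.model}.\n"
            f"Current board:\n{game_state}\n"
            f"Your view of other powers:\n{rel}\n"
            f"Your recent diary:\n{diary}\n"
            f"Recent messages:\n{msgs}\n"
            "Negotiate, then submit orders."
        )
```

The relationship map and diary give the model continuity across turns even when its natural communication style varies, while the truncation keeps the assembled context from growing without bound.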
Benchmarks as Memes: Ideas That Spread
Duffy posits that benchmarks function like memes—ideas that spread and gain adoption. He highlights the lifecycle of a benchmark, from conception to saturation, where powerful AI tools become proficient in specific tasks. This process democratizes AI evaluation, allowing individuals to assess AI's capabilities in areas they care about. For creative fields like writing, benchmarks can improve LLM output by incorporating human editors' feedback and guiding AI toward specific stylistic goals and audience considerations.
AI as Leverage and Future Directions
The core philosophy is that AI acts as leverage to amplify human goals, not as an autonomous product. Humans define the objectives and ethical boundaries and provide feedback for improvement. Duffy plans to enhance AI Diplomacy with a data viewer and an improved frontend, and to make it playable by the public. He envisions competitions and tournaments, exploring scenarios like human-AI matches and prompt engineers attempting to 'jailbreak' the AI, to further understand and push the boundaries of AI interaction and strategic capability.
Creative Writing and Human-AI Collaboration
In creative endeavors like writing, AI is a powerful co-pilot rather than a replacement for human skill. Duffy describes a collaborative process at Every where editors and writers use AI tools, employing structured prompts that include dictation, past successful work, style guides, and editor notes. The key is continuous reflection and editing, ensuring the AI's output aligns precisely with the intended message. This iterative process, combined with human oversight, allows for the consolidation of ideas and the creation of tailored content.
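The structured prompt described here (dictation, past successful work, style guide, editor notes) can be sketched as a simple assembly function. The function and section names below are illustrative assumptions, not Every's actual internal tooling.

```python
# Hypothetical sketch of the structured writing prompt described above.
# Section headings and the function name are illustrative, not Every's tooling.

def build_writing_prompt(dictation: str, past_pieces: list,
                         style_guide: str, editor_notes: str) -> str:
    """Combine the ingredients of the human-AI drafting loop into one prompt."""
    examples = "\n\n".join(past_pieces)
    return "\n\n".join([
        "Draft a piece in my voice using the material below.",
        f"## Dictated ideas\n{dictation}",
        f"## Past pieces that worked\n{examples}",
        f"## Style guide\n{style_guide}",
        f"## Editor notes from the last round\n{editor_notes}",
        "Return a draft; I will edit and send notes back for the next pass.",
    ])
```

The editor-notes section is what makes the loop iterative: each round of human editing feeds back into the next draft, rather than the AI producing a one-shot result.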
Bridging the Gap: Trust and Understanding
Duffy addresses two common concerns about AI: what role humans will retain, and whether AI systems can be trusted. He argues that benchmarks help build trust by making AI's performance transparent and showing how human feedback leads to improvement. By defining goals and guiding AI, humans remain in control. This understanding empowers individuals, including those outside the AI field, to leverage AI for personal and professional goals, thereby demystifying AI and encouraging broader adoption and beneficial use.
Common Questions
What is AI Diplomacy and who built it?
AI Diplomacy is a benchmark game designed to test large language models (LLMs). It was built collaboratively by Alex Duffy and Tyler Marquez, utilizing an open-source implementation for the backend and a custom frontend. The project grew rapidly with contributions from a global community.
Mentioned in this video
Every: A company involved in AI products, media, and training, known for testing new LLM models.
A large hedge fund that Every provides training and consulting for.
OpenAI: Known for playing Dota, mentioned in the context of AI playing games.
Superhuman: An email app that Alex Duffy wants new AI-focused alternatives for.
Twitch: A streaming platform where LLM benchmarks like AI Diplomacy are being streamed.
A model playing in the discussed AI Diplomacy game, noted for flowery and aggressive language.
Spiral: A product by Every that transforms long-form content into short-form content in the user's voice.
DeepMind: Known for playing Go, mentioned as an example of AI in gaming benchmarks.
Dota: A game played by OpenAI, used as an example of AI in gaming benchmarks.
Chess: Mentioned as a game that has been played against AI for a long time.
Go: A game played by DeepMind, used as an example of AI in gaming benchmarks.
Dota 2: A version of Dota played by OpenAI against top human players, leading to strategy learning.
Claude: A model described as 'too nice' in AI Diplomacy, often agreeing to draws.
Mentioned as a recent development that Alex Duffy uses for coding assistance.
Monologue: A product by Naveen from Every, described as a potentially better Whisper Flow alternative that can run locally.
The model Alex Duffy's mom found best for yoga-related queries.
AI Diplomacy: An LLM benchmark game created as a collaborative effort, streamed on Twitch.
A product whose representative, Isa, gave a talk on clarity and building products.
Kora: An email app built by Every, presented as an alternative to Superhuman with AI integration.
AlphaGo: An AI that revolutionized Go, used as an example of AI learning from games.
Mentioned as a model that shows threatening behavior in AI Diplomacy.
The specific version of DeepSeek playing in the AI Diplomacy game, showing flowery language.
A routing layer for LLMs, mentioned as a useful tool for managing different models.
Tyler Marquez: Collaborated with Alex Duffy on the frontend development of AI Diplomacy.
Alex Duffy: Head of AI at Every, leading training and consulting, and creator of AI Diplomacy.
Gave a talk at the conference that was memorable despite a transition issue.
Responded to Alex Duffy's AI Diplomacy project, encouraging its development.
Mentioned as a potential jailbreaker relevant to a hypothetical AI Diplomacy tournament.
Gave a talk at the conference about the lifecycle of benchmarks, inspired by Alex Duffy's talk.
Winner of a human Diplomacy tournament, mentioned in the context of a potential AI Diplomacy tournament.