Key Moments

⚡️Launching AI Diplomacy: the hardest LLM Game Benchmark yet - Alex Duffy

Latent Space Podcast
Science & Technology · 4 min read · 35 min video
Jun 11, 2025 · 2,994 views
TL;DR

Alex Duffy discusses AI Diplomacy, LLM benchmarks, and the future of AI in games and creative writing.

Key Insights

1. AI Diplomacy is a new LLM benchmark that uses the game of Diplomacy to test AI capabilities in strategy, negotiation, and deception.
2. Games are valuable benchmarks for LLMs because they offer evolving challenges and can teach both AI and humans new strategies.
3. Benchmarks act as "memes": ideas that spread and help evaluate AI's capabilities in domains beyond typical math and code.
4. AI should be viewed as leverage that amplifies human goals and creativity, with humans defining the objectives and ethical boundaries.
5. Effective AI integration, especially in creative fields like writing, relies on human-AI collaboration with continuous editing and reflection.
6. Future benchmarks will likely focus on agents (models plus tools) and more complex, interactive scenarios.

EVERY: A Hub for AI Innovation

Alex Duffy, Head of AI at Every, describes the company's unique culture, a blend of experienced founders and engineers passionate about AI. Every operates as a media company with a strong AI product focus, fostering cross-pollination of ideas. They have developed several AI products, including Kora (an email app), Sparkle (desktop organization), Spiral (content transformation), and Monologue (a local whisper flow alternative). This diverse team and product suite position Every as a critical player in testing and developing new AI models.

The Genesis of AI Diplomacy

AI Diplomacy emerged from a community discussion on Twitter, inspired by the potential of games as LLM benchmarks, a concept previously explored by figures like Andrej Karpathy. Duffy, with a weekend to spare, created a basic implementation, which quickly garnered interest from researchers worldwide. Collaborating with Tyler Marquez on the frontend, the project evolved into a functional benchmark, aiming to make AI capabilities more accessible and understandable to a broader audience, not just AI engineers.

Games as Evolving Benchmarks

The use of games like Dota, Go, and Chess for AI benchmarking has a history, with models eventually surpassing human capabilities. Duffy emphasizes that the true potential lies in self-play against other LLMs, creating an infinitely scalable challenge. Diplomacy, in particular, is ideal because it requires complex negotiation, deception, and strategic foresight. As LLMs improve, the game inherently becomes more challenging, providing a dynamic and evolving benchmark for assessing AI's multifaceted abilities.

Technical Insights and Model Behavior

Developing the AI Diplomacy harness involved careful consideration of context management, representing game state, and handling LLM output variations. Duffy notes significant differences in how models like Claude, Gemini, and DeepSeek Reasoner behave. Claude is perceived as too polite to engage in necessary deception, while DeepSeek can be highly aggressive. The harness was designed to be adaptable, using elements like relationship tracking and diary entries to provide crucial context, even when models' natural communication styles vary greatly.
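The relationship tracking and diary entries described above can be pictured as a small per-power state object whose compact summary is injected into each model's next prompt. This is a minimal sketch of that idea, not code from the actual AI Diplomacy harness; all names (`PowerState`, `context_block`, the phase label format) are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class PowerState:
    """Per-power context a harness could feed back into each model's prompt."""
    power: str  # e.g. "FRANCE"
    # other power -> assessed stance ("ally", "enemy", "neutral", ...)
    relationships: dict = field(default_factory=dict)
    # private notes the model writes after each phase
    diary: list = field(default_factory=list)

    def log_phase(self, phase: str, note: str) -> None:
        self.diary.append(f"[{phase}] {note}")

    def set_relationship(self, other: str, status: str) -> None:
        self.relationships[other] = status

    def context_block(self, max_entries: int = 5) -> str:
        """Compact summary injected into the next prompt, capping diary
        length so the context window stays manageable."""
        rels = ", ".join(f"{p}: {s}" for p, s in sorted(self.relationships.items()))
        recent = "\n".join(self.diary[-max_entries:])
        return (f"Relationships: {rels or 'none recorded'}\n"
                f"Recent diary:\n{recent or '(empty)'}")

france = PowerState("FRANCE")
france.set_relationship("ENGLAND", "ally")
france.set_relationship("GERMANY", "enemy")
france.log_phase("S1901M", "Agreed a Channel DMZ with England.")
print(france.context_block())
```

Keeping this state outside the model and re-summarizing it each turn is one way to cope with the context-management problem Duffy mentions: the transcript of a full game will not fit in a prompt, but a rolling relationship map plus the last few diary entries can.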

Benchmarks as Ideas That Spread

Duffy posits that benchmarks function like memes—ideas that spread and gain adoption. He highlights the lifecycle of a benchmark, from conception to saturation, where powerful AI tools become proficient in specific tasks. This process democratizes AI evaluation, allowing individuals to assess AI's capabilities in areas they care about. For creative fields like writing, benchmarks can improve LLM output by incorporating human editors' feedback and guiding AI toward specific stylistic goals and audience considerations.

AI as Leverage and Future Directions

The core philosophy is that AI acts as leverage to amplify human goals, not as an autonomous product. Humans define the objectives, ethical boundaries, and provide feedback for improvement. Duffy plans to enhance AI Diplomacy with a data viewer, improved frontend, and making it playable by the public. He envisions competitions and tournaments, exploring scenarios like human-AI matches and prompt engineers attempting to 'jailbreak' the AI, to further understand and push the boundaries of AI interaction and strategic capabilities.

Creative Writing and Human-AI Collaboration

In creative endeavors like writing, AI is a powerful co-pilot rather than a replacement for human skill. Duffy describes a collaborative process at Every where editors and writers use AI tools, employing structured prompts that include dictation, past successful work, style guides, and editor notes. The key is continuous reflection and editing, ensuring the AI's output aligns precisely with the intended message. This iterative process, combined with human oversight, allows for the consolidation of ideas and the creation of tailored content.
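A structured prompt of the kind described (dictation, past successful work, style guide, editor notes) could be assembled as below. This is a hypothetical sketch of the pattern, not Every's actual tooling; the function name and section labels are assumptions.

```python
def build_writing_prompt(dictation: str,
                         past_examples: list,
                         style_guide: str,
                         editor_notes: str) -> str:
    """Assemble a drafting prompt from the four inputs the workflow combines."""
    examples = "\n\n---\n\n".join(past_examples)
    sections = [
        "You are drafting a piece in the writer's voice.",
        f"STYLE GUIDE:\n{style_guide}",
        f"PAST WORK THE EDITOR LIKED:\n{examples}",
        f"EDITOR NOTES:\n{editor_notes}",
        f"RAW DICTATION TO SHAPE INTO A DRAFT:\n{dictation}",
    ]
    return "\n\n".join(sections)

prompt = build_writing_prompt(
    dictation="rough spoken notes about AI Diplomacy...",
    past_examples=["A previous essay the editor signed off on."],
    style_guide="Short sentences. Concrete examples. No jargon.",
    editor_notes="Lead with the benchmark, not the company history.",
)
```

The point of the structure is that the human-authored parts (style guide, editor notes, curated examples) constrain the model, and the output then re-enters the edit-and-reflect loop rather than shipping directly.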

Bridging the Gap: Trust and Understanding

Duffy addresses the common fears associated with AI: the role of humans and trust in AI systems. He argues that benchmarks help build trust by making AI's performance transparent and showing how human feedback can lead to improvement. By defining goals and guiding AI, humans remain in control. This understanding empowers individuals, including those outside the AI field, to leverage AI for personal and professional goals, thereby demystifying AI and encouraging broader adoption and beneficial use.

Common Questions

What is AI Diplomacy?

AI Diplomacy is a benchmark game designed to test large language models (LLMs). It was built collaboratively by Alex Duffy and Tyler Marquez, using an open-source Diplomacy implementation for the backend and a custom frontend. The project grew rapidly with contributions from a global community.

