Key Moments

⚡️Launching AI Diplomacy: the hardest LLM Game Benchmark yet - Alex Duffy

Latent Space Podcast
Science & Technology · 4 min read · 35 min video
Jun 11, 2025 · 2,994 views
TL;DR

Alex Duffy discusses AI Diplomacy, LLM benchmarks, and the future of AI in games and creative writing.

Key Insights

1. AI Diplomacy is a new LLM benchmark that uses the game of Diplomacy to test AI capabilities in strategy, negotiation, and deception.
2. Games are valuable benchmarks for LLMs because they offer evolving challenges and can teach both AI and humans new strategies.
3. Benchmarks act as "memes": ideas that spread and help evaluate AI's capabilities in domains beyond typical math and code.
4. AI should be viewed as leverage that amplifies human goals and creativity, with humans defining the objectives and ethical boundaries.
5. Effective AI integration, especially in creative fields like writing, relies on human-AI collaboration with continuous editing and reflection.
6. Future benchmarks will likely focus on agents (models plus tools) and more complex, interactive scenarios.

EVERY: A Hub for AI Innovation

Alex Duffy, Head of AI at Every, describes the company's unique culture, a blend of experienced founders and engineers passionate about AI. Every operates as a media company with a strong AI product focus, fostering cross-pollination of ideas. They have developed several AI products, including Kora (an email app), Sparkle (desktop organization), Spiral (content transformation), and Monologue (a local whisper flow alternative). This diverse team and product suite position Every as a critical player in testing and developing new AI models.

The Genesis of AI Diplomacy

AI Diplomacy emerged from a community discussion on Twitter, inspired by the potential of games as LLM benchmarks, a concept previously explored by figures like Andrej Karpathy. Duffy, with a weekend to spare, created a basic implementation, which quickly garnered interest from researchers worldwide. Collaborating with Tyler Marquez on the frontend, the project evolved into a functional benchmark, aiming to make AI capabilities more accessible and understandable to a broader audience, not just AI engineers.

Games as Evolving Benchmarks

The use of games like Dota, Go, and Chess for AI benchmarking has a history, with models eventually surpassing human capabilities. Duffy emphasizes that the true potential lies in self-play against other LLMs, creating an infinitely scalable challenge. Diplomacy, in particular, is ideal because it requires complex negotiation, deception, and strategic foresight. As LLMs improve, the game inherently becomes more challenging, providing a dynamic and evolving benchmark for assessing AI's multifaceted abilities.

Technical Insights and Model Behavior

Developing the AI Diplomacy harness involved careful consideration of context management, representing game state, and handling LLM output variations. Duffy notes significant differences in how models like Claude, Gemini, and DeepSeek Reasoner behave. Claude is perceived as too polite to engage in necessary deception, while DeepSeek can be highly aggressive. The harness was designed to be adaptable, using elements like relationship tracking and diary entries to provide crucial context, even when models' natural communication styles vary greatly.
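The relationship tracking and diary entries described above can be pictured as a small per-power state object whose compact summary is injected into each model's next prompt. This is a minimal sketch of that idea, not code from the actual AI Diplomacy harness; all names (`PowerState`, `context_block`, the phase label format) are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class PowerState:
    """Per-power context a harness could feed back into each model's prompt."""
    power: str  # e.g. "FRANCE"
    # other power -> assessed stance ("ally", "enemy", "neutral", ...)
    relationships: dict = field(default_factory=dict)
    # private notes the model writes after each phase
    diary: list = field(default_factory=list)

    def log_phase(self, phase: str, note: str) -> None:
        self.diary.append(f"[{phase}] {note}")

    def set_relationship(self, other: str, status: str) -> None:
        self.relationships[other] = status

    def context_block(self, max_entries: int = 5) -> str:
        """Compact summary injected into the next prompt, capping diary
        length so the context window stays manageable."""
        rels = ", ".join(f"{p}: {s}" for p, s in sorted(self.relationships.items()))
        recent = "\n".join(self.diary[-max_entries:])
        return (f"Relationships: {rels or 'none recorded'}\n"
                f"Recent diary:\n{recent or '(empty)'}")

france = PowerState("FRANCE")
france.set_relationship("ENGLAND", "ally")
france.set_relationship("GERMANY", "enemy")
france.log_phase("S1901M", "Agreed a Channel DMZ with England.")
print(france.context_block())
```

Keeping this state outside the model and re-summarizing it each turn is one way to cope with the context-management problem Duffy mentions: the transcript of a full game will not fit in a prompt, but a rolling relationship map plus the last few diary entries can.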

Benchmarks as Ideas That Spread

Duffy posits that benchmarks function like memes—ideas that spread and gain adoption. He highlights the lifecycle of a benchmark, from conception to saturation, where powerful AI tools become proficient in specific tasks. This process democratizes AI evaluation, allowing individuals to assess AI's capabilities in areas they care about. For creative fields like writing, benchmarks can improve LLM output by incorporating human editors' feedback and guiding AI toward specific stylistic goals and audience considerations.

AI as Leverage and Future Directions

The core philosophy is that AI acts as leverage to amplify human goals, not as an autonomous product. Humans define the objectives, ethical boundaries, and provide feedback for improvement. Duffy plans to enhance AI Diplomacy with a data viewer, improved frontend, and making it playable by the public. He envisions competitions and tournaments, exploring scenarios like human-AI matches and prompt engineers attempting to 'jailbreak' the AI, to further understand and push the boundaries of AI interaction and strategic capabilities.

Creative Writing and Human-AI Collaboration

In creative endeavors like writing, AI is a powerful co-pilot rather than a replacement for human skill. Duffy describes a collaborative process at Every where editors and writers use AI tools, employing structured prompts that include dictation, past successful work, style guides, and editor notes. The key is continuous reflection and editing, ensuring the AI's output aligns precisely with the intended message. This iterative process, combined with human oversight, allows for the consolidation of ideas and the creation of tailored content.
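A structured prompt of the kind described (dictation, past successful work, style guide, editor notes) could be assembled as below. This is a hypothetical sketch of the pattern, not Every's actual tooling; the function name and section labels are assumptions.

```python
def build_writing_prompt(dictation: str,
                         past_examples: list,
                         style_guide: str,
                         editor_notes: str) -> str:
    """Assemble a drafting prompt from the four inputs the workflow combines."""
    examples = "\n\n---\n\n".join(past_examples)
    sections = [
        "You are drafting a piece in the writer's voice.",
        f"STYLE GUIDE:\n{style_guide}",
        f"PAST WORK THE EDITOR LIKED:\n{examples}",
        f"EDITOR NOTES:\n{editor_notes}",
        f"RAW DICTATION TO SHAPE INTO A DRAFT:\n{dictation}",
    ]
    return "\n\n".join(sections)

prompt = build_writing_prompt(
    dictation="rough spoken notes about AI Diplomacy...",
    past_examples=["A previous essay the editor signed off on."],
    style_guide="Short sentences. Concrete examples. No jargon.",
    editor_notes="Lead with the benchmark, not the company history.",
)
```

The point of the structure is that the human-authored parts (style guide, editor notes, curated examples) constrain the model, and the output then re-enters the edit-and-reflect loop rather than shipping directly.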

Bridging the Gap: Trust and Understanding

Duffy addresses the common fears associated with AI: the role of humans and trust in AI systems. He argues that benchmarks help build trust by making AI's performance transparent and showing how human feedback can lead to improvement. By defining goals and guiding AI, humans remain in control. This understanding empowers individuals, including those outside the AI field, to leverage AI for personal and professional goals, thereby demystifying AI and encouraging broader adoption and beneficial use.

Common Questions

What is AI Diplomacy?

AI Diplomacy is a benchmark game designed to test large language models (LLMs). It was built collaboratively by Alex Duffy and Tyler Marquez, using an open-source Diplomacy implementation for the backend and a custom frontend. The project grew rapidly with contributions from a global community.

