⚡️Factorio Learning Environment: the ultimate Game Agent Eval — Jack Hopkins
Key Moments
Factorio Learning Environment (FLE) enables unbounded agent evaluation for LLMs in complex tasks.
Key Insights
The Factorio Learning Environment (FLE) offers a novel framework for evaluating LLM agents, particularly in code generation, spatial reasoning, and long-term planning.
FLE leverages Factorio's steep scaling challenges, where finishing the game requires producing millions of items, providing strong signal for differentiating model capabilities.
The environment uses a code synthesis approach where LLMs write Python code to interact with the game, enabling higher-level actions and abstraction.
FLE includes two modes: 'Lab Play' for structured task completion and spatial reasoning, and 'Open Play' for unbounded factory creation and self-objective setting.
Preliminary results show significant performance differences between models, with Claude outperforming others, especially in long-term planning tasks within Open Play.
Integrating vision into the environment has thus far not improved LLM performance and has sometimes led to hallucinations and worse outcomes due to the game's complexity.
INTRODUCTION TO THE FACTORIO LEARNING ENVIRONMENT (FLE)
The Factorio Learning Environment (FLE) is a new framework designed to evaluate Large Language Model (LLM) agents in complex, unbounded scenarios. Inspired by the game Factorio, known for its intricate industrial simulation and massive resource requirements for completion, FLE provides an API and metrics to assess agents' capabilities in code generation, spatial reasoning, and long-term strategic planning. This environment allows for agent evaluation across a vast spectrum of complexity, from simple tasks to managing factories producing millions of resources per second, offering a rich signal for model comparison.
DESIGN AND IMPLEMENTATION OF FLE
Developing FLE involved creating a robust harness that could scale across multiple Factorio instances. Rather than relying on Factorio's traditional Lua modding API, the team hooked into the game's multiplayer admin console via the RCON protocol, carried over TCP, to execute actions remotely across a cluster of servers. This low-level access was wrapped in a code synthesis approach, where LLMs generate Python code that invokes high-level game actions, allowing models to manage increasingly complex factories and large-scale operations, a necessity given Factorio's exponential scaling requirements.
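The code-synthesis loop described above can be sketched minimally as follows. The API names here (`nearest_resource`, `place_entity`) are illustrative stand-ins for the kind of high-level actions the episode describes, not the project's actual interface, and the stubs stand in for the real RCON calls to a Factorio server.

```python
# Hypothetical sketch of the code-synthesis loop: the LLM emits a Python
# program as text, and the harness executes it against an exposed API.
# All names here are illustrative assumptions, not the real FLE interface.

PLACED = []

def nearest_resource(kind):
    # Stub: the real environment would query game state over RCON.
    return (10, 4)

def place_entity(name, position):
    # Stub: the real call would issue a remote command to a Factorio server.
    PLACED.append((name, position))
    return {"name": name, "position": position}

# One policy step: the model generates a short program as a string...
agent_program = """
ore = nearest_resource('iron-ore')
drill = place_entity('burner-mining-drill', ore)
"""

# ...and the harness runs it with only the game API in scope.
namespace = {"nearest_resource": nearest_resource, "place_entity": place_entity}
exec(agent_program, namespace)
print(PLACED)  # one drill placed on the iron patch
```

Executing model-written code against a restricted namespace like this is what lets the agent compose higher-level behaviors instead of issuing one primitive action per turn.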
DISTINGUISHING EVALUATION MODES: LAB PLAY VS. OPEN PLAY
FLE features two distinct evaluation modes: Lab Play and Open Play. Lab Play presents agents with specific tasks, such as creating a factory for a target entity, thereby measuring their spatial reasoning and ability to operate within constrained environments using the FLE API. Open Play, conversely, is an unbounded sandbox where agents must create the largest possible factory and, crucially, set their own objectives and sub-goals. This mode tests long-term planning and the ability to decompose a grand objective into manageable steps, revealing how models strategize and adapt.
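The contrast between the two modes can be made concrete with a small scoring sketch. The `Task` structure and scoring rules below are hypothetical simplifications for illustration; the actual FLE task definitions and metrics may differ.

```python
# Illustrative contrast between Lab Play (fixed target, binary success)
# and Open Play (unbounded, score tracks total output). Hypothetical
# definitions, not the real FLE metrics.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    mode: str               # "lab" or "open"
    target: Optional[str]   # entity to produce in Lab Play; None in Open Play

def score(task: Task, production: dict) -> float:
    if task.mode == "lab":
        # Lab Play: did the agent produce the target entity at all?
        return 1.0 if production.get(task.target, 0) > 0 else 0.0
    # Open Play: no fixed ceiling, so bigger factories score higher.
    return float(sum(production.values()))

production = {"iron-gear-wheel": 12, "iron-plate": 300}
print(score(Task("lab", "iron-gear-wheel"), production))  # 1.0
print(score(Task("open", None), production))              # 312.0
```

The unbounded Open Play score is what makes the benchmark hard to saturate: there is always a larger factory to build, so the metric keeps separating models long after a fixed task set would plateau.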
MODEL PERFORMANCE AND EMERGENT BEHAVIORS
Initial evaluations using FLE have highlighted significant performance disparities among leading LLMs. Claude, for example, demonstrated notably superior performance, particularly in the Open Play mode, suggesting advanced long-term planning capabilities. In contrast, models like DeepSeek performed well in Lab Play but struggled in Open Play, often setting myopic objectives like creating excessive numbers of chests. This divergence underscores the importance of strategic objective setting and sustained focus in complex, open-ended problem-solving scenarios, differentiating spatial reasoning from strategic foresight.
CHALLENGES WITH VISION AND REASONING MODELS
Attempts to integrate visual input, such as screenshots of the game state, have not yielded expected performance gains. The complexity of large Factorio factories often leads to LLMs hallucinating entities or misinterpreting the game state, sometimes degrading performance. Similarly, preliminary results with reasoning-focused models suggest they may not outperform general models in FLE, possibly because the environment already involves extended reasoning traces. The team plans to explore vision with a simplified geometric renderer and to further evaluate reasoning models in future iterations.
FUTURE DIRECTIONS AND IMPLICATIONS FOR AI ALIGNMENT
The FLE project is envisioned as a multi-phase initiative. The next phase focuses on training models directly on unbounded objectives, with a particular emphasis on AI alignment. The creators are interested in simulating scenarios like the 'paperclip maximizer' to study goal-content integrity (an agent's tendency to resist changes to its core objectives) and instrumental convergence. By observing whether such behaviors emerge in Factorio, the research aims to inform alignment strategies, potentially shifting attitudes toward the critical importance of setting an AI's initial objectives correctly, given the difficulty of altering them later.
Common Questions
What is the Factorio Learning Environment, and why was it created?
The Factorio Learning Environment is a system designed to benchmark and train AI models within the complex game of Factorio. It was created to address the limitations of simpler environments and to explore AI capabilities in long-term planning, complex systems, and goal setting, drawing inspiration from concepts like the paperclip maximizer.
Topics
Mentioned in this video
Generative Pre-trained Transformer models, mentioned in comparison to DeepSeek's performance in Lab Play.
An infrastructure as code tool mentioned as an example of declarative specification, similar to Factorio blueprints.
A paper previously discussed on the podcast, related to AI learning through games like Minecraft.
An AI developed by DeepMind for playing StarCraft, noted for its precise unit management capabilities.
A data source or project that inspired the collection of Factorio data and blueprints.
A lightweight embedded scripting language that Factorio traditionally uses for mods, but which was not suitable for large-scale AI training in this project.
A cautionary tale used as motivation for benchmarking AI models, exploring potential negative outcomes of optimizing a single goal, like maximizing paperclips or factory output.
A language model noted for its use of defensive programming and self-assertions, though sometimes these checks were incorrectly set up.
A model that performed poorly, at times refusing to continue and stating it needed to be reset, indicating issues with ambition or determination.
A game mentioned for comparison with Factorio regarding complexity and AI benchmarking.
Transmission Control Protocol, the network transport over which the RCON protocol is carried to execute actions in the game remotely.
A factory building simulation game used as an environment for benchmarking AI models, noted for its complexity and scale requiring millions of resources to launch a rocket.
A version of the Claude model characterized by a 'fire and forget' coding style, which is Pythonic but less cautious about errors.
The programming language chosen for the interface with Factorio, as pre-trained language models are proficient in it and it allows for high-level action invocation.
A problem-solving environment where reasoning models perform well, contrasted with their performance in the Factorio setting where pre-made reasoning traces might negate benefits.
A family of AI models whose performance is shown on a log graph, positioned between Claude and GPT-4 in the published results.
A protocol carried by TCP used to hook into the admin console of multiplayer Factorio servers, enabling remote execution of actions for large-scale AI training.
An AI model that demonstrated superior performance in Open Play compared to DeepSeek, attributed perhaps to better training for long-term planning.
An infrastructure as code service mentioned as an example of declarative specification, similar to Factorio blueprints.