Fullstack-Bench: The Eval for Coding Agents — with Sujay Jayakar, Chief Scientist, Convex
Key Moments
Fullstack-Bench evaluates AI coding agents on backend development tasks across different platforms.
Key Insights
Fullstack-Bench is a new benchmark for evaluating AI coding agents on full-stack development tasks.
Convex, a reactive database platform, performs well in the benchmark, particularly for backend development.
The benchmark tests AI agents across various backend architectures (Convex, Supabase, traditional stack).
Evaluating AI code generation is challenging and often requires manual intervention and human feedback.
AI coding agents struggle with subtle API distinctions and complex reasoning tasks like rule-based systems.
Strong abstractions and highly procedural code that is well represented in training data improve AI agent performance.
The benchmark aims to improve AI coding tools and inform platform design for better AI compatibility.
INTRODUCTION TO FULLSTACK-BENCH
Convex, a reactive database platform designed for full-stack application development, has introduced Fullstack-Bench. This benchmark aims to rigorously evaluate the capabilities of AI coding agents, particularly in autonomously handling backend development tasks. Driven by the increasing use of AI in coding and the desire for more quantitative analysis, Convex developed this benchmark to understand how AI agents perform on tasks typically handled by human developers. The initiative is also a way for Convex to benchmark itself against competitors and identify areas for platform improvement.
BENCHMARK DESIGN AND METHODOLOGY
Fullstack-Bench evaluates AI agents on a 3x3 matrix of tasks and backend architectures. The tasks are building a chat application, a to-do list, and a file management system, each tested against three backend configurations: Convex, Supabase, and a traditional three-tier stack using FastAPI and Redis. Each run has a time limit (30 minutes for the chat and to-do tasks, 60 minutes for files), and a human intervenes manually to provide hints and to grade. Grading uses predetermined rubrics that assess core functionality, live updates, and overall correctness.
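To make the setup concrete, the sketch below shows one way the task-by-backend matrix, time limits, and rubric could be represented. This is an illustration only; the type and field names are assumptions, not the actual Fullstack-Bench harness.

```typescript
// Illustrative sketch only: these names are assumptions, not the real benchmark code.
type Task = "chat" | "todo" | "files";
type Backend = "convex" | "supabase" | "fastapi-redis";

interface TrialConfig {
  task: Task;
  backend: Backend;
  timeLimitMinutes: number; // 30 for chat/to-do, 60 for files
  rubric: string[];         // predetermined criteria checked by a human grader
}

const TIME_LIMITS: Record<Task, number> = { chat: 30, todo: 30, files: 60 };

// The 3x3 matrix: every task is run against every backend configuration.
const trials: TrialConfig[] = (["chat", "todo", "files"] as Task[]).flatMap((task) =>
  (["convex", "supabase", "fastapi-redis"] as Backend[]).map((backend) => ({
    task,
    backend,
    timeLimitMinutes: TIME_LIMITS[task],
    rubric: ["core functionality", "live updates", "overall correctness"],
  }))
);
```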
PERFORMANCE EVALUATION AND OBSERVATIONS
Initial results from Fullstack-Bench show that AI agents, particularly when using Convex, can autonomously solve the chat and to-do list tasks with minimal human intervention, albeit with some iterative bug fixing. The agents achieve close to full marks on the files task with Convex within the allotted time. However, other backend configurations showed agents getting lost or making little progress on the files task. These findings help validate hypotheses about Convex's design and provide general insights into what facilitates or hinders AI coding agent performance.
CHALLENGES IN AI CODE EVALUATION
A significant challenge highlighted by the benchmark is how hard it is to evaluate AI code generation accurately and automatically. Providing hints and grading by hand is time-consuming: humans watch for agents getting stuck in loops or making incorrect assumptions, then point out specific issues, such as confusion between `useEffect` and TanStack Query or errors in data prefix handling. This process underscores the need for more automated and robust evaluation methods for autonomous coding systems.
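As an illustration of the `useEffect` hint mentioned above: when the agent tried to combine TanStack Query with Server-Sent Events against the FastAPI backend, it was nudged toward a plain `useEffect` subscription instead. A minimal sketch of that pattern follows; the endpoint path and string-valued messages are assumptions.

```typescript
import { useEffect, useState } from "react";

// Sketch of the hinted pattern: subscribe to an SSE stream with useEffect rather
// than forcing it through TanStack Query. Endpoint and payload shape are assumed.
export function useMessages(channelId: string) {
  const [messages, setMessages] = useState<string[]>([]);

  useEffect(() => {
    const source = new EventSource(`/api/channels/${channelId}/events`);
    source.onmessage = (event) => {
      setMessages((prev) => [...prev, event.data]);
    };
    // Close the stream when the component unmounts or the channel changes.
    return () => source.close();
  }, [channelId]);

  return messages;
}
```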
IMPLICATIONS FOR AI AND PLATFORM DESIGN
The benchmark reveals that AI agents struggle with subtle API distinctions, as seen with Convex's handling of null vs. undefined, which a human might grasp easily but an AI finds challenging. This suggests that platform developers should account for potential AI hallucinations and knowledge gaps when designing APIs. The research indicates that while good developer experience (DX) often translates into good agent experience (AX), the correlation is not perfect: Convex's API is close enough to Firebase's to feel familiar to humans, but that same closeness can confuse an AI wherever the two diverge.
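One well-known form of the null vs. undefined distinction in Convex (the episode does not specify whether this is the exact case discussed) is that the React client hook returns undefined while a query is still loading, whereas a missing document comes back as null. A minimal sketch, assuming a hypothetical `api.tasks.get` query:

```typescript
import { useQuery } from "convex/react";
import { api } from "../convex/_generated/api";

// Sketch only: api.tasks.get is a hypothetical query used for illustration.
export function useTaskTitle(taskId: string): string {
  const task = useQuery(api.tasks.get, { id: taskId });

  if (task === undefined) return "Loading…";   // query result not loaded yet
  if (task === null) return "Task not found";  // query loaded; document doesn't exist

  return task.title;
}
```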
FUTURE DIRECTIONS AND GENERALIZATIONS
Future work for Fullstack-Bench includes more automated grading, such as using model checkers like LARCH to verify code correctness and concurrency, and adapting database-benchmarking ideas from TPC to measure agent throughput. The project also aims to investigate token efficiency, i.e., reaching the same result with fewer tokens. The general takeaways emphasize tight feedback loops, strong guardrails (such as end-to-end type safety), highly procedural code, and well-chosen libraries with strong abstractions, all of which improve AI agent performance.
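As a concrete illustration of end-to-end type safety as a guardrail, the sketch below shows a single shared contract that both the server handler and the client call must satisfy, so an agent's mismatched field becomes a compile error rather than a runtime bug. All names here are illustrative assumptions.

```typescript
// Shared contract between backend and frontend (illustrative names only).
interface CreateTodoRequest {
  title: string;
  dueDate?: string; // optional ISO date
}

interface CreateTodoResponse {
  id: string;
  title: string;
  completed: boolean;
}

// Typed client helper: the compiler enforces the request shape at the call site.
async function createTodo(req: CreateTodoRequest): Promise<CreateTodoResponse> {
  const res = await fetch("/api/todos", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });
  return (await res.json()) as CreateTodoResponse;
}

// An agent that writes `createTodo({ name: "buy milk" })` gets an immediate type
// error, a much tighter feedback loop than discovering the bug in the browser.
```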
CONVEX AND THE FUTURE OF AGENT BACKENDS
Convex is strategically positioned to serve as the backend for AI agents. The company sees two paths: either AI agents write backends that are similar to current ones but at lower cost, or AI fundamentally changes backend infrastructure. Convex is currently focusing on making existing app patterns easier for AI generation, while acknowledging future potential for new patterns involving durable execution and proximity to data sources like vector databases. This involves investing in features that support AI's evolving capabilities and the new shape of AI-driven applications.
Common Questions
What is Fullstack-Bench and why did Convex create it?
Fullstack-Bench is a benchmark developed by Convex to rigorously evaluate the capabilities of AI coding agents, particularly in full-stack application development. Convex created it to better understand how its platform performs with AI coders and to guide future development.
Mentioned in this video
A traditional database system mentioned as part of the existing landscape that Convex aims to improve upon for modern development needs.
An AI coding agent that partly inspired Fullstack-Bench, particularly its idea of exposing a code environment.
A web framework used as part of a traditional three-tier stack in the benchmark comparison. Its flexibility was noted as sometimes leading to AI coding errors.
An AI model that performed well on the Convex evals and on the files task in Fullstack-Bench, outperforming Claude 3.7 on the Convex evals.
A benchmark mentioned for its approach of starting from real GitHub data and working towards a goal, seen as analogous to an integration test.
An in-memory data structure store used as part of a traditional three-tier stack in the benchmark comparison.
A library that the AI model struggled to combine with SSE (Server-Sent Events) when working with FastAPI, requiring a hint to use useEffect instead.
An AI model that performed worse than Claude 3.5 on Convex evals, suggesting that model improvements don't always translate directly to gains on specific benchmarks.
A reactive database and compute platform, positioned as a competitor to Firebase and designed for full-stack app development. It is being evaluated for its performance with AI coding tools.
An AI coding environment used in the benchmark, which was able to autonomously code for a to-do app and fix its own bugs.
An AI model that demonstrated impressive autonomous coding capabilities, capable of coding for over 10 minutes and fixing its own bugs when provided with the right feedback.
A framework used for creating the front-end applications in the benchmark's standardized environment.
A React hook suggested to the AI model as an alternative to combining TanStack Query with SSE, resolving a confusion it encountered.
An AI model mentioned as performing better than GPT-4 on Convex evals, though not a 'slam dunk' improvement.
Mentioned as a competitor to Convex, serving as a benchmark and point of comparison for backend infrastructure and database solutions.
A system mentioned in the context of agent architectures for handling unreliable services through retries and exponential backoffs.
Mentioned for a blog post on agent architectures, which was noted for describing what an agent is NOT, rather than positive definitions.
Venture capital firm mentioned for their investment in Convex and their interest in AI talent and agent development.
Mentioned as the origin of LARCH, a model checker used for verifying AI-generated code.
An organization whose database evaluation ideas (like throughput before congestion collapse) are being adapted for evaluating AI models in system engineering tasks.