Fullstack-Bench: The Eval for Coding Agents — with Sujay Jayakar, Chief Scientist, Convex
Key Moments
Fullstack-Bench evaluates AI coding agents on backend development tasks across different platforms.
Key Insights
Fullstack-Bench is a new benchmark for evaluating AI coding agents on full-stack development tasks.
Convex, a reactive database platform, performs well in the benchmark, particularly for backend development.
The benchmark tests AI agents across various backend architectures (Convex, Supabase, traditional stack).
Evaluating AI code generation is challenging and often requires manual intervention and human feedback.
AI coding agents struggle with subtle API distinctions and complex reasoning tasks like rule-based systems.
Strong abstractions and highly procedural code that is well represented in training data improve AI agent performance.
The benchmark aims to improve AI coding tools and inform platform design for better AI compatibility.
INTRODUCTION TO FULLSTACK-BENCH
Convex, a reactive database platform designed for full-stack application development, has introduced Fullstack-Bench. This benchmark aims to rigorously evaluate the capabilities of AI coding agents, particularly in autonomously handling backend development tasks. Driven by the increasing use of AI in coding and the desire for more quantitative analysis, Convex developed this benchmark to understand how AI agents perform on tasks typically handled by human developers. The initiative is also a way for Convex to benchmark itself against competitors and identify areas for platform improvement.
BENCHMARK DESIGN AND METHODOLOGY
Fullstack-Bench evaluates AI agents on a 3x3 matrix of tasks and backend architectures. The tasks are building a chat application, a to-do list, and a file management system, each tested against three backend configurations: Convex, Supabase, and a traditional three-tier stack using FastAPI and Redis. Each run has a time limit (30 minutes for the chat and to-do tasks, 60 minutes for files), and a human intervenes manually to provide hints and to grade. Grading uses predetermined rubrics that assess core functionality, live updates, and overall correctness.
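To make the setup concrete, the sketch below shows one way the task-by-backend matrix, time limits, and rubric could be represented. This is an illustration only; the type and field names are assumptions, not the actual Fullstack-Bench harness.

```typescript
// Illustrative sketch only: these names are assumptions, not the real benchmark code.
type Task = "chat" | "todo" | "files";
type Backend = "convex" | "supabase" | "fastapi-redis";

interface TrialConfig {
  task: Task;
  backend: Backend;
  timeLimitMinutes: number; // 30 for chat/to-do, 60 for files
  rubric: string[];         // predetermined criteria checked by a human grader
}

const TIME_LIMITS: Record<Task, number> = { chat: 30, todo: 30, files: 60 };

// The 3x3 matrix: every task is run against every backend configuration.
const trials: TrialConfig[] = (["chat", "todo", "files"] as Task[]).flatMap((task) =>
  (["convex", "supabase", "fastapi-redis"] as Backend[]).map((backend) => ({
    task,
    backend,
    timeLimitMinutes: TIME_LIMITS[task],
    rubric: ["core functionality", "live updates", "overall correctness"],
  }))
);
```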
PERFORMANCE EVALUATION AND OBSERVATIONS
Initial results from Fullstack-Bench show that AI agents, particularly when using Convex, can autonomously solve the chat and to-do list tasks with minimal human intervention, albeit with some iterative bug fixing. The agents achieve close to full marks on the files task with Convex within the allotted time. However, other backend configurations showed agents getting lost or making little progress on the files task. These findings help validate hypotheses about Convex's design and provide general insights into what facilitates or hinders AI coding agent performance.
CHALLENGES IN AI CODE EVALUATION
A significant challenge highlighted by the benchmark is how hard it is to evaluate AI code generation accurately and automatically. Providing hints and grading by hand is time-consuming: humans watch for agents getting stuck in loops or making incorrect assumptions, then point out specific issues, such as confusion between `useEffect` and TanStack Query or errors in data prefix handling. This process underscores the need for more automated and robust evaluation methods for autonomous coding systems.
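As an illustration of the `useEffect` hint mentioned above: when the agent tried to combine TanStack Query with Server-Sent Events against the FastAPI backend, it was nudged toward a plain `useEffect` subscription instead. A minimal sketch of that pattern follows; the endpoint path and string-valued messages are assumptions.

```typescript
import { useEffect, useState } from "react";

// Sketch of the hinted pattern: subscribe to an SSE stream with useEffect rather
// than forcing it through TanStack Query. Endpoint and payload shape are assumed.
export function useMessages(channelId: string) {
  const [messages, setMessages] = useState<string[]>([]);

  useEffect(() => {
    const source = new EventSource(`/api/channels/${channelId}/events`);
    source.onmessage = (event) => {
      setMessages((prev) => [...prev, event.data]);
    };
    // Close the stream when the component unmounts or the channel changes.
    return () => source.close();
  }, [channelId]);

  return messages;
}
```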
IMPLICATIONS FOR AI AND PLATFORM DESIGN
The benchmark reveals that AI agents struggle with subtle API distinctions, as seen with Convex's handling of null vs. undefined, which a human might grasp easily but an AI finds challenging. This suggests that platform developers should account for potential AI hallucinations and knowledge gaps when designing APIs. The research indicates that while good developer experience (DX) often translates into good agent experience (AX), the correlation is not perfect: Convex's API is close enough to Firebase's to feel familiar to humans, but that same closeness can confuse an AI wherever the two diverge.
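One well-known form of the null vs. undefined distinction in Convex (the episode does not specify whether this is the exact case discussed) is that the React client hook returns undefined while a query is still loading, whereas a missing document comes back as null. A minimal sketch, assuming a hypothetical `api.tasks.get` query:

```typescript
import { useQuery } from "convex/react";
import { api } from "../convex/_generated/api";

// Sketch only: api.tasks.get is a hypothetical query used for illustration.
export function useTaskTitle(taskId: string): string {
  const task = useQuery(api.tasks.get, { id: taskId });

  if (task === undefined) return "Loading…";   // query result not loaded yet
  if (task === null) return "Task not found";  // query loaded; document doesn't exist

  return task.title;
}
```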
FUTURE DIRECTIONS AND GENERALIZATIONS
Future work for Fullstack-Bench includes more automated grading, such as using model checkers like LARCH to verify code correctness and concurrency, and adapting database-benchmarking ideas from TPC to measure agent throughput. The project also aims to investigate token efficiency, i.e., reaching the same result with fewer tokens. The general takeaways emphasize tight feedback loops, strong guardrails (such as end-to-end type safety), highly procedural code, and well-chosen libraries with strong abstractions, all of which improve AI agent performance.
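As a concrete illustration of end-to-end type safety as a guardrail, the sketch below shows a single shared contract that both the server handler and the client call must satisfy, so an agent's mismatched field becomes a compile error rather than a runtime bug. All names here are illustrative assumptions.

```typescript
// Shared contract between backend and frontend (illustrative names only).
interface CreateTodoRequest {
  title: string;
  dueDate?: string; // optional ISO date
}

interface CreateTodoResponse {
  id: string;
  title: string;
  completed: boolean;
}

// Typed client helper: the compiler enforces the request shape at the call site.
async function createTodo(req: CreateTodoRequest): Promise<CreateTodoResponse> {
  const res = await fetch("/api/todos", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });
  return (await res.json()) as CreateTodoResponse;
}

// An agent that writes `createTodo({ name: "buy milk" })` gets an immediate type
// error, a much tighter feedback loop than discovering the bug in the browser.
```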
CONVEX AND THE FUTURE OF AGENT BACKENDS
Convex is strategically positioned to serve as the backend for AI agents. The company sees two paths: either AI agents write backends that are similar to current ones but at lower cost, or AI fundamentally changes backend infrastructure. Convex is currently focusing on making existing app patterns easier for AI generation, while acknowledging future potential for new patterns involving durable execution and proximity to data sources like vector databases. This involves investing in features that support AI's evolving capabilities and the new shape of AI-driven applications.
Common Questions
What is Fullstack-Bench and why did Convex create it?
Fullstack-Bench is a benchmark developed by Convex to rigorously evaluate the capabilities of AI coding agents, particularly in full-stack application development. Convex created it to better understand how its platform performs with AI coders and to guide future development.
Mentioned in this video
A traditional database system mentioned as part of the existing landscape that Convex aims to improve upon for modern development needs.
An AI coding agent that partly inspired Fullstack-Bench, particularly its idea of exposing a code environment.
A web framework used as part of a traditional three-tier stack in the benchmark comparison. Its flexibility was noted as sometimes leading to AI coding errors.
An AI model that performed well on the Convex evals and on the files task in Fullstack-Bench, outperforming Claude 3.7 on the Convex evals.
A benchmark mentioned for its approach of starting from real GitHub data and working towards a goal, seen as analogous to an integration test.
An in-memory data structure store used as part of a traditional three-tier stack in the benchmark comparison.
A library that the AI model struggled to combine with SSE (Server-Sent Events) when working with FastAPI, requiring a hint to use useEffect instead.
An AI model that performed worse than Claude 3.5 on Convex evals, suggesting that model improvements don't always translate directly to gains on specific benchmarks.
A reactive database and compute platform, positioned as a competitor to Firebase and designed for full-stack app development. It is being evaluated for its performance with AI coding tools.
An AI coding environment used in the benchmark, which was able to autonomously code for a to-do app and fix its own bugs.
An AI model that demonstrated impressive autonomous coding capabilities, capable of coding for over 10 minutes and fixing its own bugs when provided with the right feedback.
A framework used for creating the front-end applications in the benchmark's standardized environment.
A React hook suggested to the AI model as an alternative to combining TanStack Query with SSE, resolving a confusion it encountered.
An AI model mentioned as performing better than GPT-4 on Convex evals, though not a 'slam dunk' improvement.
Mentioned as a competitor to Convex, serving as a benchmark and point of comparison for backend infrastructure and database solutions.
A system mentioned in the context of agent architectures for handling unreliable services through retries and exponential backoffs.
Mentioned for a blog post on agent architectures, which was noted for describing what an agent is NOT, rather than positive definitions.
Venture capital firm mentioned for their investment in Convex and their interest in AI talent and agent development.
Mentioned as the origin of LARCH, a model checker used for verifying AI-generated code.
An organization whose database evaluation ideas (like throughput before congestion collapse) are being adapted for evaluating AI models in system engineering tasks.