Is fine-tuning GPT-4o worth it?
Key Moments
Cosine launches Genie, an AI software engineer built on fine-tuned GPT-4o and novel data techniques, achieving state-of-the-art results on SWE-Bench.
Key Insights
Genie achieves state-of-the-art performance on SWE-Bench by leveraging fine-tuned GPT-4o and a specialized data pipeline.
The development of Genie was driven by the limitations of existing LLMs and the need for a more capable AI software engineering tool.
Data collection focused on the *process* of software engineering, including failures and iterations, not just successful code.
Fine-tuning GPT-4o with extensive, curated data was crucial for Genie's capabilities, pushing the boundaries of what's possible.
Genie's workflow includes file retrieval, planning, code writing, and testing, with an emphasis on self-play and iterative improvement.
Access to large context windows and advanced fine-tuning techniques were critical enablers for Genie's development.
FROM MOBILE DEVELOPMENT TO AI AGENTS
The journey to Cosine's Genie began with founders Ali and Sam, who honed their skills building mobile applications and core systems for an acquired startup. Their experience in a fast-paced startup environment, including working with early GPT-3 models, sparked an ambition to automate software engineering tasks. They recognized the potential of large language models but also their limitations, which propelled them towards developing a more sophisticated AI agent.
THE EVOLUTION OF GENIE'S ARCHITECTURE
Initial attempts to build apps with GPT-3 were rudimentary, highlighting the need for better context management within limited token windows. The development of Genie was significantly aided by the advent of larger context windows, particularly 128k, and the ability to fine-tune models. This allowed for more comprehensive input and a better understanding of complex codebases, moving beyond simple code generation to actual software engineering problem-solving.
THE STRATEGY BEHIND FINE-TUNING AND DATA
Genie's advanced capabilities stem from fine-tuning GPT-4o with a meticulously curated dataset. This data emphasized not just correct code, but also the process of software engineering, including runtime errors and iterative development. The goal was to train an AI that could understand *why* and *how* software is built, not just replicate successful outcomes. This approach contrasts with models that primarily learn from final, clean code outputs.
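Cosine's actual training records are not public, but a process-oriented example in the chat-format JSONL used for OpenAI fine-tuning might look like the sketch below. The file contents, error message, and patch description are invented for illustration; only the record shape follows the documented fine-tuning format.

```python
import json

# Hypothetical example: one fine-tuning record that captures the *process*
# of fixing a bug (a failing test and the reasoning toward a patch),
# not just the final clean code.
record = {
    "messages": [
        {"role": "system", "content": "You are an AI software engineer."},
        {"role": "user", "content": (
            "Test test_parse_date fails with "
            "ValueError: day is out of range for month."
        )},
        {"role": "assistant", "content": (
            "The parser swaps day and month for ISO dates. "
            "Patch: construct date(year, month, day) instead of "
            "date(year, day, month), then rerun the test suite."
        )},
    ]
}

def to_jsonl(records):
    """Serialize records to JSONL: one training example per line."""
    return "\n".join(json.dumps(r) for r in records)

line = to_jsonl([record])
```

The point of a record like this is that the assistant turn documents the diagnosis and iteration, which is the kind of signal the curated dataset emphasizes over clean final outputs.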
GENIE'S CORE WORKFLOW AND CAPABILITIES
Genie operates through four key stages: finding relevant files, planning actions, writing code, and running tests. A significant innovation is its approach to code retrieval, which mimics human developer behavior by traversing file systems and using definitions, rather than relying solely on traditional semantic search. This method, refined through self-play and extensive training, drastically improves accuracy in locating necessary code snippets.
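Genie's internals are proprietary, but the four-stage loop described above can be sketched abstractly. Every function name here is a placeholder assumption, not Cosine's API; the sketch only shows how retrieval, planning, writing, and testing could compose into an iterative loop that learns from failed attempts.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    plan: str
    patch: str
    tests_passed: bool

def run_agent(task, retrieve, plan, write_code, run_tests, max_iters=3):
    """Hypothetical sketch of a four-stage agent loop:
    find relevant files -> plan -> write code -> run tests,
    retrying with the failure history the way iterative training data suggests."""
    history = []
    files = retrieve(task)                 # 1. find relevant files
    for _ in range(max_iters):
        p = plan(task, files, history)     # 2. plan, informed by past failures
        patch = write_code(p, files)       # 3. write code
        ok = run_tests(patch)              # 4. run tests
        history.append(Attempt(p, patch, ok))
        if ok:
            break
    return history
```

A real system would plug repository search, an LLM call, and a CI runner into the four callbacks; the loop structure itself is the part the episode describes.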
NAVIGATING THE CHALLENGES OF BENCHMARKING
Achieving state-of-the-art results on SWE-Bench was a major milestone for Genie. The team encountered challenges with submission requirements, such as providing detailed 'trajectories' or reasoning processes, which they could not fully disclose due to proprietary concerns. However, they found success with SWE-Bench Verified, a smaller, more iteration-friendly benchmark, where Genie achieved a leading score, demonstrating its practical effectiveness.
THE FUTURE OF AI IN SOFTWARE ENGINEERING
Cosine envisions Genie as a platform that can be added to any foundational model, enhancing its software engineering capabilities. Their focus is on expanding the dataset, improving performance across different models and languages, and allowing customers to fine-tune versions of Genie on their own private codebases. This approach aims to make AI a more integrated and powerful collaborator in the software development lifecycle.
Genie's SWE-Bench Pass Rate by Context Window Length
Data extracted from this episode
| Context Window (tokens) | Observed Outcome |
|---|---|
| > 60k | Pass rate below 0.5 (likely to fail) |
| ≤ 60k | Pass rate above 0.5 (more likely to succeed) |
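The ~60k-token threshold above suggests a simple pre-flight check before handing a task to the model. The 4-characters-per-token estimate below is a rough rule of thumb for English text and code, not Genie's actual tokenizer, and the threshold is taken directly from the figure in this episode.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text/code.
    A real system would use the model's own tokenizer instead."""
    return max(1, len(text) // 4)

def likely_to_pass(prompt: str, threshold_tokens: int = 60_000) -> bool:
    """Flag prompts whose estimated size stays under the ~60k-token band,
    above which (per the episode) pass rates drop below 0.5."""
    return estimate_tokens(prompt) <= threshold_tokens
```

A check like this could gate whether to trim retrieved files before planning, rather than discovering mid-run that the context is too large.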
Genie's Training Data Mix
Data extracted from this episode
| Language | Percentage |
|---|---|
| JavaScript | 49% |
| Python | 21% |
| TypeScript | 14% |
| TSX | 14% |
| 11 other languages | ~2% (remainder) |
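A mix like the one in the table can be derived by tallying training examples per language. The toy corpus below is invented purely to reproduce the quoted percentages; the tallying function is the reusable part.

```python
from collections import Counter

def language_mix(examples):
    """Return each language's share of the dataset as a rounded percentage."""
    counts = Counter(lang for lang, _ in examples)
    total = sum(counts.values())
    return {lang: round(100 * n / total) for lang, n in counts.items()}

# Invented toy corpus of (language, example_id) pairs sized to match
# the table: 49% JS, 21% Python, 14% TS, 14% TSX, ~2% other.
corpus = (
    [("JavaScript", i) for i in range(49)]
    + [("Python", i) for i in range(21)]
    + [("TypeScript", i) for i in range(14)]
    + [("TSX", i) for i in range(14)]
    + [("Other", i) for i in range(2)]
)
mix = language_mix(corpus)
```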
Common Questions
What is Genie, and why is it significant?
Genie is an AI software engineering tool developed by Cosine. It's significant because it's fine-tuned on a process that emphasizes how humans actually solve software problems, achieving state-of-the-art performance on benchmarks like SWE-Bench.
Topics
Mentioned in this video
Cosine: The company founded by Ali, who discusses their AI software engineering tool, Genie. Previously named 'Built'.
OpenAI: The organization behind GPT-3, GPT-4, and other large language models, with whom Cosine collaborates on fine-tuning.
A company that acquired Ali's previous startup, where he worked for about a year and a half.
Built: The original name of Cosine, which was later changed due to pronunciation issues during YC.
A programming language strongly represented in Cosine's training data, noted for its superiority.
GPT-4o: A more advanced model that enabled better fine-tuning and performance for Genie, particularly with larger context windows.
SWE-Bench: A benchmark for evaluating AI models on software engineering tasks, on which Cosine's Genie has achieved high scores.
A language model mentioned in the context of OpenAI's progress and early AI models.
A model trained by OpenAI specifically for writing code, which influenced the development of AI coding tools.
A language model whose paper discussed synthetic data generation for code, relevant to Cosine's training methodologies.
Genie: Cosine's AI software engineering tool designed to automate development tasks, fine-tuned on curated datasets.
An earlier language model from OpenAI, mentioned in the context of the evolution of AI models.
The specific model available in the early GPT-3 playground that the founders experimented with.
An AI software engineering tool that influenced Cosine's approach and provided a benchmark for comparison with Genie.
An early language model from OpenAI that inspired the founders of Cosine to explore AI for coding tasks.
An AI pair programmer that was emerging around the same time Cosign was exploring AI for code generation.
SWE-Bench Verified: A smaller, more cost-effective subset of SWE-Bench, used by Cosine for faster iteration and evaluation of Genie.
A foundational model used in early development of Genie, noted for its limitations in context window and intelligence for software engineering tasks.
A CI/CD platform that Genie can integrate with to run tests and checks on code.