Key Moments

Is finetuning GPT4o worth it?

Latent Space Podcast
Science & Technology · 3 min read · 62 min video
Aug 22, 2024
TL;DR

Cosine launches Genie, an AI software engineer that achieves state-of-the-art results on SWE-Bench using fine-tuned GPT-4o and novel data techniques.

Key Insights

1. Genie achieves state-of-the-art performance on SWE-Bench by leveraging fine-tuned GPT-4o and a specialized data pipeline.

2. The development of Genie was driven by the limitations of existing LLMs and the need for a more capable AI software engineering tool.

3. Data collection focused on the *process* of software engineering, including failures and iterations, not just successful code.

4. Fine-tuning GPT-4o with extensive, curated data was crucial for Genie's capabilities, pushing the boundaries of what's possible.

5. Genie's workflow includes file retrieval, planning, code writing, and testing, with an emphasis on self-play and iterative improvement.

6. Access to large context windows and advanced fine-tuning techniques were critical enablers for Genie's development.

FROM MOBILE DEVELOPMENT TO AI AGENTS

The journey to Cosine's Genie began with founders Ali and Sam, who honed their skills building mobile applications and core systems for an acquired startup. Their experience in a fast-paced startup environment, including working with early GPT-3 models, sparked an ambition to automate software engineering tasks. They recognized the potential of large language models but also their limitations, which propelled them towards developing a more sophisticated AI agent.

THE EVOLUTION OF GENIE'S ARCHITECTURE

Initial attempts to build apps with GPT-3 were rudimentary, highlighting the need for better context management within limited token windows. The development of Genie was significantly aided by the advent of larger context windows, particularly 128k, and the ability to fine-tune models. This allowed for more comprehensive input and a better understanding of complex codebases, moving beyond simple code generation to actual software engineering problem-solving.

THE STRATEGY BEHIND FINE-TUNING AND DATA

Genie's advanced capabilities stem from fine-tuning GPT-4o with a meticulously curated dataset. This data emphasized not just correct code, but also the process of software engineering, including runtime errors and iterative development. The goal was to train an AI that could understand *why* and *how* software is built, not just replicate successful outcomes. This approach contrasts with models that primarily learn from final, clean code outputs.
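The episode does not show Cosine's actual data format, but the process-oriented idea above can be illustrated with a hypothetical training record: the failed attempt and the runtime error it produced are kept alongside the eventual fix, rather than keeping only the final clean diff. All field names here are assumptions for illustration.

```python
# Hypothetical shape of a process-oriented training example: the record
# preserves the failed edit and its runtime error, not just the outcome.
# Field names ("issue", "trajectory", "action", ...) are illustrative
# assumptions, not Cosine's schema.

import json

example = {
    "issue": "TypeError when parsing empty config",
    "trajectory": [
        {"action": "open_file", "path": "config/loader.py"},
        {"action": "edit", "patch": "return json.loads(raw)"},
        {"action": "run_tests",
         "result": "TypeError: the JSON object must be str, not NoneType"},
        {"action": "edit", "patch": "return json.loads(raw) if raw else {}"},
        {"action": "run_tests", "result": "passed"},
    ],
}

print(json.dumps(example, indent=2))
```

A model trained on records like this sees the error message that motivated the second edit, which is exactly the context a final-diff-only dataset throws away.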

GENIE'S CORE WORKFLOW AND CAPABILITIES

Genie operates through four key stages: finding relevant files, planning actions, writing code, and running tests. A significant innovation is its approach to code retrieval, which mimics human developer behavior by traversing file systems and using definitions, rather than relying solely on traditional semantic search. This method, refined through self-play and extensive training, drastically improves accuracy in locating necessary code snippets.
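The four stages above can be sketched as a simple agent loop. This is a minimal illustration of the described control flow, not Cosine's implementation; every function here (`find_relevant_files`, `plan`, `write_code`, `run_tests`) is a hypothetical stand-in.

```python
# Minimal sketch of the four-stage loop described above: retrieve files,
# plan, write code, run tests, and iterate on failure. All names and data
# shapes are illustrative assumptions.

from dataclasses import dataclass, field

@dataclass
class Task:
    description: str
    attempts: list = field(default_factory=list)

def find_relevant_files(task):
    # Stand-in for retrieval; per the episode, Genie traverses the repo
    # and follows definitions rather than relying only on semantic search.
    return ["src/auth.py"] if "auth" in task.description else ["src/main.py"]

def plan(task, files):
    return [f"edit {f}" for f in files]

def write_code(step):
    return f"# patch for: {step}"

def run_tests(patch):
    # Stand-in for executing the test suite (e.g. via CI).
    return "patch" in patch

def solve(task, max_iters=3):
    for _ in range(max_iters):
        files = find_relevant_files(task)
        for step in plan(task, files):
            patch = write_code(step)
            task.attempts.append(patch)
            if run_tests(patch):
                return patch
    return None

print(solve(Task("fix auth token refresh")))
```

The outer loop is where the iterative improvement described above lives: a failing test run sends the agent back through retrieval and planning with the accumulated attempts as context.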

NAVIGATING THE CHALLENGES OF BENCHMARKING

Achieving state-of-the-art results on SWE-Bench was a major milestone for Genie. The team encountered challenges with submission requirements, such as providing detailed 'trajectories' or reasoning processes, which they could not fully disclose due to proprietary concerns. However, they found success with SWE-Bench Verified, a smaller, more iteration-friendly benchmark, where Genie achieved a leading score, demonstrating its practical effectiveness.

THE FUTURE OF AI IN SOFTWARE ENGINEERING

Cosine envisions Genie as a platform that can be added to any foundational model, enhancing its software engineering capabilities. Their focus is on expanding the dataset, improving performance across different models and languages, and allowing customers to fine-tune versions of Genie on their own private codebases. This approach aims to make AI a more integrated and powerful collaborator in the software development lifecycle.

Genie: A Smarter AI Software Engineering Colleague

Practical takeaways from this episode

Do This

Focus on extracting as much signal as possible from historical data to understand human problem-solving approaches.
Emulate how humans think about problems, not just how models process them, for better reasoning.
Prioritize retrieval of correct files and planning before worrying about advanced tools like browsers.
Iteratively train models on both correct and incorrect code states to learn from mistakes.
Leverage synthetic data generation, including back-translation and runtime errors, to improve models.
Utilize smaller, faster benchmarks like SWE-Bench Verified for efficient iteration.
Be willing to explore and try new AI tools and workflows, even if they differ from conventional methods.
Focus on the hard problems with the biggest potential payoff.

Avoid This

Do not solely rely on models writing code; focus on the broader discipline of software engineering.
Avoid using only final code diffs in training data, as this loses crucial context.
Do not assume basic LLM planning is sufficient for complex software engineering tasks.
Don't over-rely on just open-source data; private workflows often hold unique insights.
Do not neglect the importance of data cleaning and alignment for model usefulness.
Avoid publishing proprietary training data or methods that could be easily distilled by competitors.
Do not see large context windows as a complete solution without considering model intelligence and performance degradation.

Genie's SWE-Bench Pass Rate by Context Window Length

Data extracted from this episode

Context Window (Tokens) | Pass Rate
> 60k                   | Likely to fail (less than 0.5 probability)
< 60k                   | More likely to succeed

Genie's Training Data Mix

Data extracted from this episode

Language           | Percentage
JavaScript         | 49%
Python             | 21%
TypeScript         | 14%
TSX                | 14%
Other 11 languages | Covered

Common Questions

What is Genie, and why is it significant?

Genie is an AI software engineering tool developed by Cosine. It's significant because it's fine-tuned on a specific process that emphasizes understanding human problem-solving in software development, achieving state-of-the-art performance on benchmarks like SWE-Bench.

Topics

Mentioned in this video

Software & Apps
TypeScript

A programming language strongly represented in Cosine's training data; the founders cited it as their preferred language.

GPT-4 Turbo

A more advanced model that enabled better fine-tuning and performance for Genie, particularly with larger context windows.

SWE-Bench

A benchmark used to evaluate the performance of AI models in software engineering tasks, on which Cosine's Genie has achieved high scores.

BERT

An early language model from Google, mentioned when discussing the evolution of early AI models and OpenAI's subsequent progress.

Codex

A model trained by OpenAI specifically for writing code, which influenced the development of AI coding tools.

Llama 3

A language model whose paper discussed synthetic data generation for code, relevant to Cosine's training methodologies.

Genie

Cosine's AI software engineering tool designed to automate development tasks, fine-tuned on specific datasets.

GPT-2

An earlier language model from OpenAI, mentioned in the context of the evolution of AI models.

text-davinci-002

The specific model available in the early GPT-3 playground that the founders experimented with.

Devin

An AI software engineering tool that influenced Cosine's approach and provided a benchmark for comparison with Genie.

GPT-3

An early language model from OpenAI that inspired the founders of Cosine to explore AI for coding tasks.

Copilot

An AI pair programmer that was emerging around the same time Cosine was exploring AI for code generation.

SWE-Bench Verified

A smaller, more cost-effective version of SWE-Bench, used by Cosine for faster iteration and evaluation of Genie.

GPT-3.5

A foundational model used in early development of Genie, noted for its limitations in context window and intelligence for software engineering tasks.

GitHub Actions

A CI/CD platform that Genie can integrate with to run tests and checks on code.
