The "Normsky" architecture for AI coding agents — with Beyang Liu + Steve Yegge of SourceGraph

Latent Space Podcast
Science & Technology | 4 min read | 94 min video
Dec 17, 2023|970 views|24|2
TL;DR

Sourcegraph's Cody is built on a 'Normsky' architecture that merges Chomsky-style formal systems with Norvig-style data-driven learning to improve code intelligence, RAG, and developer productivity.

Key Insights

1. Sourcegraph's Cody AI leverages a 'Normsky' architecture, blending Chomsky's formal systems with Norvig's data-driven learning.
2. Effective Retrieval-Augmented Generation (RAG) and high-quality context are paramount, differentiating Cody from other AI coding assistants.
3. Open-source models like StarCoder can compete with proprietary ones when combined with superior context fetching and prompt engineering.
4. The future of AI in coding lies in providing broad codebase understanding and assisting with complex task management, not just line-by-line code generation.
5. AI will augment, not replace, developers, shifting their focus from tedious tasks to higher-level creative and architectural work.
6. Multimodal AI and new interaction patterns hold significant potential for future AI coding assistants beyond current chat and completion interfaces.

FOUNDATIONS OF CODING INTELLIGENCE: FROM GROK TO SOURCEGRAPH

The conversation opens with introductions to Beyang Liu and Steve Yegge of Sourcegraph, both of whom have long experience with code indexing and search. Liu's early experience at Google with `Grok` and `Google Code Search` (later open-sourced as `Kythe`) inspired him to found Sourcegraph with Quinn, with the aim of easing the pain of navigating large codebases. Yegge, drawing on his background at Amazon and Google, shares his views on tech culture and his appreciation for Sourcegraph's engineering, noting that he uses its tools himself.

CODY: AN AI CODING AGENT FOCUSED ON CONTEXT QUALITY

Cody is introduced as Sourcegraph's AI coding agent. It offers familiar features such as autocompletion and codebase-aware chat, and the speakers say it distinguishes itself through the quality of its context. Drawing on Sourcegraph's decade-long investment in code understanding for human developers, Cody supplies the LLM with rich, relevant context. This approach puts the RAG mechanism first, framing it not as a workaround but as a crucial component that lets LLMs accurately understand and work with proprietary code.

THE "NORMSKY" ARCHITECTURE: MERGING FORMAL SYSTEMS AND DATA-DRIVEN LEARNING

The core of Cody's architecture is dubbed 'Normsky,' a portmanteau of 'Norvig' (representing data-driven machine learning, like LLMs) and 'Chomsky' (representing formal systems, like compilers and parsers). The hybrid approach aims to combine the pattern-recognition power of LLMs with the precise structural understanding derived from code analysis tools. The intended result is more reliable code generation, better use of context, and a more robust overall system than a purely data-driven model can offer.
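
As a rough illustration of the idea in the paragraph above, the sketch below lets a formal tool (Python's standard `ast` parser) vet candidates that would, in a real system, come from a model. It is not Cody's implementation. The candidate strings, `KNOWN_SYMBOLS`, and both function names are assumptions made up for this example, and the check deliberately ignores builtins and imports.

```python
import ast

def undefined_names(source: str, known: set[str]) -> set[str]:
    """Names that `source` reads but that are neither known nor assigned in it."""
    tree = ast.parse(source)  # raises SyntaxError if the text is not valid Python
    names = [n for n in ast.walk(tree) if isinstance(n, ast.Name)]
    assigned = {n.id for n in names if isinstance(n.ctx, ast.Store)}
    loaded = {n.id for n in names if isinstance(n.ctx, ast.Load)}
    return loaded - assigned - known

def filter_candidates(candidates: list[str], known: set[str]) -> list[str]:
    """Keep only candidates that parse and reference known symbols."""
    kept = []
    for text in candidates:
        try:
            if not undefined_names(text, known):
                kept.append(text)
        except SyntaxError:
            continue  # the "Chomsky" side rejects what the parser cannot accept
    return kept

# Hypothetical model output; a real system would obtain these from an LLM.
CANDIDATES = [
    "total = price * quantity",
    "total = price * qty",        # refers to a symbol that is not in scope
    "total = price * (quantity",  # does not parse
]
KNOWN_SYMBOLS = {"price", "quantity"}  # assumed to come from code analysis

print(filter_candidates(CANDIDATES, KNOWN_SYMBOLS))
```

Only the first candidate survives: the second references an unknown name and the third fails to parse. The statistical side proposes and the formal side disposes.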

RAG, CONTEXT, AND THE LIMITATIONS OF AGENTS

The discussion emphasizes that while Retrieval-Augmented Generation (RAG) is a critical component, its effectiveness hinges on the quality and relevance of the retrieved context. Sourcegraph views RAG as akin to a skilled consultant for the LLM. The team is skeptical of purely agentic approaches that rely solely on LLM orchestration for multi-step processes, arguing that current LLMs lack the consistent reliability needed for fully automated execution without human oversight.
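
To make the point about retrieval quality concrete, here is a minimal RAG sketch: score each snippet against the question, keep the best matches, and place them ahead of the question in the prompt. The Jaccard token overlap used for scoring is a stand-in chosen for brevity, not the ranking Cody uses, and the file names, snippet bodies, and function names are all hypothetical.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercased identifier-like tokens found in `text`."""
    return set(re.findall(r"[a-z_][a-z0-9_]*", text.lower()))

def retrieve(query: str, snippets: dict[str, str], k: int = 2) -> list[str]:
    """Names of up to k snippets, ranked by Jaccard overlap with the query."""
    q = tokens(query)
    scores = {}
    for name, body in snippets.items():
        s = tokens(body)
        union = q | s
        scores[name] = len(q & s) / len(union) if union else 0.0
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    # Drop snippets with no overlap at all: irrelevant context is worse than none.
    return [name for name in ranked[:k] if scores[name] > 0]

def build_prompt(query: str, snippets: dict[str, str], k: int = 2) -> str:
    """Place the retrieved snippets ahead of the question."""
    parts = [f"# {name}\n{snippets[name]}" for name in retrieve(query, snippets, k)]
    return "\n\n".join(parts + [f"Question: {query}"])

# Hypothetical codebase fragments.
SNIPPETS = {
    "billing.py": "def charge_card(user, amount): ...",
    "auth.py": "def login(user, password): ...",
    "report.py": "def render_report(rows): ...",
}
QUERY = "how does charge_card handle the amount"

print(build_prompt(QUERY, SNIPPETS))
```

Swapping the scoring function for a better one changes what the model sees without touching the model itself, which is the sense in which context quality can matter as much as the choice of LLM.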

THE EVOLUTION OF DEV TOOLS AND THE FUTURE FOR ENGINEERING LEADERS

The conversation shifts to the broader impact of AI on software development. Inline code generation tools help individual developers, but the larger challenge is managing codebase complexity. Sourcegraph aims to give engineering leaders and tech leads tools for understanding their codebases, so they can maintain architectural integrity, identify risks, and track how development affects business metrics. AI enhances this capability, but it is rooted in the company's core mission of making code accessible.

DATA PROCESSING AND THE "BIG FRIENDLY GRAPH" (BFG)

A significant portion of the discussion centers on data processing and context fetching. Sourcegraph argues that its "data moat" is really a sophisticated pre-processing engine rather than a pile of accumulated data. The team is developing a new, efficient code graph called 'BFG' (Big Friendly Graph), built through an iterative, experiment-driven process that they liken to AI development. The graph is meant to supply rich semantic context, significantly reducing errors such as type mismatches in AI-generated code, and it is designed to work without complex build system setup.
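
The summary does not describe how BFG is built, so the following is only a toy sketch of the general idea of a code graph as a context source: record each function's signature and the functions it calls, then hand a model the signatures of the callees so it sees the real parameter and return types. It uses Python's `ast` module on a single source string; the sample source and helper names are assumptions for illustration.

```python
import ast

def build_graph(source: str):
    """Map each top-level function to its signature and the names it calls."""
    signatures, calls = {}, {}
    for node in ast.parse(source).body:
        if not isinstance(node, ast.FunctionDef):
            continue
        params = []
        for a in node.args.args:
            note = f": {ast.unparse(a.annotation)}" if a.annotation else ""
            params.append(a.arg + note)
        returns = f" -> {ast.unparse(node.returns)}" if node.returns else ""
        signatures[node.name] = f"def {node.name}({', '.join(params)}){returns}"
        calls[node.name] = sorted({
            c.func.id for c in ast.walk(node)
            if isinstance(c, ast.Call) and isinstance(c.func, ast.Name)
        })
    return signatures, calls

def context_for(name: str, signatures: dict, calls: dict) -> list[str]:
    """Signatures of the locally defined functions that `name` calls."""
    return [signatures[c] for c in calls.get(name, []) if c in signatures]

# Hypothetical source file.
SOURCE = '''
def parse_price(text: str) -> int:
    return int(text)

def total(items: list[str]) -> int:
    return sum(parse_price(i) for i in items)
'''

signatures, calls = build_graph(SOURCE)
print(context_for("total", signatures, calls))
```

A completion request inside `total` would then carry `def parse_price(text: str) -> int` as context, which is the kind of information that helps avoid type mismatches. Nothing here needs a build system, though a real code graph would also have to resolve imports, methods, and cross-file references.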

OPEN SOURCE, MULTIMODAL AI, AND THE FUTURE OF CODING ASSISTANTS

The potential of open-source models is lauded, with StarCoder already competitive for completions when paired with Cody's context. The speakers foresee a future of multimodal AI, opening new interaction paradigms beyond current chat and command interfaces. They suggest that the current form factor of coding assistants is temporary, and new ways of interacting with AI, potentially involving real-time screen analysis or visual input, will emerge, providing more proactive and integrated assistance.

THE ENDURING ROLE OF THE DEVELOPER AND THE DATA ADVANTAGE

Ultimately, the consensus is that AI will augment rather than replace developers. It will handle more of the tedious, boilerplate, and routine tasks, allowing engineers to focus on higher-level creative problem-solving, architecture, and innovation. While agents might eventually automate more complex workflows, the current focus remains on robust, reliable tools like Cody that provide deep code understanding and context, offering a competitive advantage through sophisticated data pre-processing and graph analysis.

Common Questions

Cody's core differentiator is the quality of its context, leveraging Sourcegraph's decade-long experience in building code understanding engines for human developers. It provides codebase-aware chat and inline autocompletion.

Mentioned in this video

Companies

Temporal

Company led by Max.

Palantir

Company where Beyang Liu worked after Stanford, and where he and Quinn started working together.

OpenAI

An AI company whose models (like GPT-4) Sourcegraph partners with and uses, specifically for chat and commands. The recent drama around their CEO's departure and return caused concern among Sourcegraph's customers.

Google

Company where Beyang Liu interned and Steve Yegge worked, known for its internal code search tools like Grok and Kythe.

Fireworks AI

An inference platform used by Sourcegraph to run StarCoder, allowing them to focus on data fetching and fine-tuning rather than building their own inference stack. Their team includes ex-Meta people knowledgeable in PyTorch.

Microsoft

Mentioned as a potential fallback for Sourcegraph's AI services if OpenAI's stability were to be an issue.

YouTube

Mentioned as an example of a company with excellent recommender engines, drawing parallels to Sourcegraph's approach to code recommendations.

Etsy

Online marketplace where Kelly Norton built the Hound open-source code search project.

Anthropic

An AI company whose models (like Claude) Sourcegraph partners with and uses for chat and commands.

Amazon

Company where Steve Yegge famously worked, described as a giant waterfall of engineers.

Grab

Southeast Asian super app where Steve Yegge was Head of Engineering; initially criticized by one host as a customer, but Steve praised its engineering team and laser focus.

Uber

Mentioned as a comparison to Grab, highlighting Grab's additional functionality as a super app in Southeast Asia.

Hugging Face

An organization driving the open-source AI ecosystem, mentioned for possibly releasing a V2 of StarCoder. Sourcegraph would like to collaborate on benchmarks with them.

Bank of America

Mentioned as an example in a demo of how Cody could generate a stock ticker app.

Codium

A coding assistant mentioned as a competitor in the market.

Spotify

Mentioned as an example of a company with excellent recommender engines, drawing parallels to Sourcegraph's approach to code recommendations.

Wells Fargo

Mentioned as an example in a demo of how Cody could generate a stock ticker app.

Software & Apps

PostgreSQL

The relational database used by Sourcegraph, which found it performed as well as most graph databases for graph workloads in their experience.

Sourcegraph

A company founded 10 years ago to index all code on the internet, now focused on AI coding intelligence with Cody. Initially focused on on-prem deployments, now cloud-hybrid.

TabNine

A coding assistant mentioned as a competitor in the market.

ChatGPT

An AI language model that recently celebrated its one-year anniversary, demonstrating the rapid advancement in AI.

Ruby

A programming language that the speaker spent most of their career writing, noted for being easy to use and read despite not always being the fastest.

Cody

Sourcegraph's AI coding agent designed to provide inline autocompletion, codebase-aware chat, and automate tasks like unit test generation and documentation. Differentiates itself by leveraging Sourcegraph's decade-long expertise in code understanding and context fetching.

Kite

A coding assistant that 'died', mentioned in the context of other coding assistants.

LlamaIndex

An external tool for building LLM applications, which Sourcegraph decided against using due to needing full control over their stack for rapid iteration.

Codex

The AI model that GitHub Copilot is based on.

SQL

A query language, where the speaker notes that few engineers still write queries without AI assistance.

StarCoder

An open-source model used by Cody for inline completions, achieving comparable acceptance rates to Copilot by leveraging Sourcegraph's context fetching.

Google Code Search

An internal Google developer tool that indexed code and provided a reference graph, later open-sourced as Kythe.

Claude Instant

An AI model used by Sourcegraph for completions for a while, but required prompt engineering to output code without extra text.

Grok

The internal system at Google that provided the reference graph for Google Code Search, built by Steve Yegge. It's considered a predecessor to Kythe.

gRPC

Google's internal protocol for backend queries, which is a different approach than LSP.

Kythe

The open-source version of Google's code intelligence system, considered 'Grok V3'.

Code Llama

An open-source model that Sourcegraph is not currently using but continuously evaluates against other available models.

BFG

Sourcegraph's internal working name for a new SCIP-based code graph that is described as blazing fast, requires zero configuration, and does not need build system integration. It addresses issues like type errors in AI completions.

TypeScript

One of the programming languages used in Sourcegraph's AI stack.

Hound

An open-source code search project built at Etsy, similar to Google Code Search.

GPT-4

A Transformer-based LLM from OpenAI, mentioned for its ability to utilize context from the top of its window and its decent chess-playing capability. Sourcegraph still uses GPT-4 for chat and commands.

LangChain

An external tool for building LLM applications, which Sourcegraph decided against using due to needing full control over their stack for rapid iteration.

Non-Stop Cody

An early feature from Cody's first launch that allowed multiple parallel requests to modify a source file, but was difficult to make reliable for general code generation.

Zoekt

A code search system created by Han-Wen Nienhuys, inspired by the trigram index of Google's original code search.

Lisp

A programming language originally intended for creating rules-based AI systems, representing the Chomsky approach.

React

A JavaScript library mentioned as a framework for building a hypothetical stock ticker app demo with Cody.

Cursor

A coding assistant mentioned as a competitor in the market.

Go

One of the programming languages used in Sourcegraph's AI stack.

GitHub Copilot

A prominent AI coding assistant that Cody is often compared against, particularly regarding context utilization and completion acceptance rates. It uses variants of Codex and local context.

Google Duet AI

A Google coding assistant that initially focused on pulling local context, similar to GitHub Copilot's initial approach.

SCIP

A new protocol developed by Sourcegraph that aims to combine the best ideas from LSP and Kythe, making it easier to write indexers and model symbolic characteristics of code.

Claude

An AI model that reportedly uses context from the bottom of its window better than other models. Sourcegraph uses Claude and GPT-4 for chat and commands.

Replit

A coding platform mentioned for its potential to bootstrap its own proprietary dataset through bounties, though it's rumored they're betting on OpenAI.

Prolog

A logic programming language, mentioned in the context of deterministic Chomsky approaches to AI that didn't scale effectively.

Twitter API

An example of a notoriously challenging API to work with, which the host successfully scraped using Cody web.

Python

Mentioned as an example of a programming language that might become less relevant for human understanding as AI generates more code through natural language.

Rust

One of the programming languages now being used in Sourcegraph's AI stack.
