The "Normsky" architecture for AI coding agents — with Beyang Liu + Steve Yegge of Sourcegraph
Key Moments
Sourcegraph's Cody AI uses a 'Normsky' architecture, merging Chomsky and Norvig approaches for superior code intelligence, RAG, and developer productivity.
Key Insights
Sourcegraph's Cody AI leverages a 'Normsky' architecture, blending Chomsky's formal systems with Norvig's data-driven learning.
Effective Retrieval-Augmented Generation (RAG) and high-quality context are paramount, differentiating Cody from other AI coding assistants.
Open-source models like StarCoder can compete with proprietary ones when combined with superior context fetching and prompt engineering.
The future of AI in coding lies in providing broad codebase understanding and assisting with complex task management, not just line-by-line code generation.
AI will augment, not replace, developers, shifting their focus from tedious tasks to higher-level creative and architectural work.
Multimodal AI and new interaction patterns hold significant potential for future AI coding assistants beyond current chat and completion interfaces.
FOUNDATIONS OF CODING INTELLIGENCE: FROM GROK TO SOURCEGRAPH
The conversation opens with introductions to Beyang Liu and Steve Yegge of Sourcegraph, highlighting their extensive experience with code indexing and search. Early exposure at Google to tools like `Grok` (later open-sourced as Kythe) and `Google Code Search` inspired Liu to found Sourcegraph with Quinn, aiming to solve the pain points of navigating large codebases. Yegge, with his background at Amazon and Google, shares his insights into tech culture and his appreciation for Sourcegraph's engineering prowess, noting his personal use of their tools.
CODY: AN AI CODING AGENT FOCUSED ON CONTEXT QUALITY
Cody is introduced as Sourcegraph's AI coding agent, which, while offering familiar features like autocompletion and codebase-aware chat, distinguishes itself through the unparalleled quality of its context. Drawing on Sourcegraph's decade-long investment in code understanding for human developers, Cody provides AI with rich, relevant context. This approach prioritizes the RAG mechanism, framing it not as a workaround but as a crucial component for LLMs to accurately understand and interact with proprietary code.
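The context-first RAG flow described above can be sketched in a few lines: retrieve the most relevant snippets from the codebase, then assemble them into the prompt sent to the LLM. This is a deliberately naive illustration — the term-overlap scoring and the function names here are invented for the example; Cody's actual retrieval and ranking are far more sophisticated.

```python
# Toy sketch of context-first RAG for code. The scoring function and the
# example corpus are hypothetical; real systems use embeddings, code graphs,
# and learned rankers rather than term overlap.

def score(query_terms: set[str], snippet: str) -> int:
    # Naive relevance: count how many query terms appear in the snippet.
    return sum(1 for t in query_terms if t in snippet)

def build_prompt(question: str, corpus: list[str], k: int = 2) -> str:
    terms = set(question.lower().split())
    ranked = sorted(corpus, key=lambda s: score(terms, s.lower()), reverse=True)
    context = "\n---\n".join(ranked[:k])
    return f"Context from the codebase:\n{context}\n\nQuestion: {question}"

corpus = [
    "def parse_config(path): ...  # reads the YAML config",
    "def connect_db(url): ...  # opens a database connection",
    "def render_page(tmpl): ...  # HTML templating helper",
]
prompt = build_prompt("how do we parse the config file?", corpus, k=1)
```

The point of the sketch is the shape of the pipeline: retrieval quality, not the LLM call, is where the leverage lies.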
THE "NORMSKY" ARCHITECTURE: MERGING FORMAL SYSTEMS AND DATA-DRIVEN LEARNING
The core of Cody's architecture is dubbed 'Normsky,' a portmanteau of 'Norvig' (representing data-driven machine learning, like LLMs) and 'Chomsky' (representing formal systems, like compilers and parsers). This hybrid approach aims to combine the pattern-recognition power of LLMs with the precise structural understanding derived from code analysis tools. This allows for more reliable code generation, better context utilization, and a more robust overall system, moving beyond the limitations of purely data-driven models.
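One way to make the hybrid concrete: use a real parser (the Chomsky side) to build a symbol table, then use it to validate statistically generated completions (the Norvig side). The sketch below is an assumption-laden illustration, not Cody's implementation — it stands in an "LLM" with a hard-coded candidate list and uses Python's `ast` module as the formal system.

```python
# Minimal sketch of the Normsky idea: a formal parser constrains a
# data-driven generator. The candidate completions are hypothetical
# stand-ins for LLM output.
import ast

def known_symbols(source: str) -> set[str]:
    """Extract the function names a codebase actually defines, via a real parser."""
    tree = ast.parse(source)
    return {n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}

def filter_completions(candidates: list[str], symbols: set[str]) -> list[str]:
    # Keep only completions that call functions the parser found.
    return [c for c in candidates if c.split("(")[0] in symbols]

source = "def load_user(uid):\n    pass\n\ndef save_user(u):\n    pass\n"
candidates = ["load_user(42)", "fetch_user(42)", "save_user(u)"]
valid = filter_completions(candidates, known_symbols(source))
# "fetch_user" is rejected: it is plausible-looking but undefined.
```

The formal layer cheaply rules out a class of hallucinations (undefined symbols) that a purely data-driven model produces with confidence.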
RAG, CONTEXT, AND THE LIMITATIONS OF AGENTS
The discussion emphasizes that while Retrieval-Augmented Generation (RAG) is a critical component, its effectiveness hinges on the quality and relevance of the retrieved context. Sourcegraph views RAG as akin to a skilled consultant for the LLM. The team expresses skepticism toward purely agentic approaches that rely solely on LLM orchestration for multi-step processes, arguing that current LLMs lack the consistent reliability needed for fully automated, agentic execution without human oversight.
THE EVOLUTION OF DEV TOOLS AND THE FUTURE FOR ENGINEERING LEADERS
The conversation shifts to the broader impact of AI on software development. While inline code generation tools aid individual developers, the larger challenge is managing codebase complexity. Sourcegraph aims to equip engineering leaders and tech leads with tools for better codebase understanding, enabling them to manage architectural integrity, identify risks, and track the impact of development on business metrics—a capability enhanced by AI but rooted in their core mission of making code accessible.
DATA PROCESSING AND THE "BIG FRIENDLY GRAPH" (BFG)
A significant portion of the discussion centers on data processing and context fetching. Sourcegraph emphasizes that its "data moat" refers to a sophisticated pre-processing engine, not just accumulated data. They are developing a new, efficient code graph called 'BFG' (Big Friendly Graph), built using a novel, AI-like iterative experimentation approach. This graph aims to provide rich semantic context, significantly reducing errors like type mismatches in AI-generated code, and is designed for easy integration without complex build system setups.
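The essential shape of such a code graph — symbols mapped to definition and reference sites, queried to assemble context for a prompt — can be illustrated with a toy in-memory version. Everything here (class name, file paths, symbols) is hypothetical; BFG itself is built from compiler-grade indexes, not a Python dict.

```python
# Toy "code graph": symbol -> definition site and reference sites.
# Illustrates the lookup shape only; names and data are invented.
from collections import defaultdict

class CodeGraph:
    def __init__(self):
        self.defs = {}                 # symbol -> (file, line)
        self.refs = defaultdict(list)  # symbol -> [(file, line), ...]

    def define(self, symbol, file, line):
        self.defs[symbol] = (file, line)

    def reference(self, symbol, file, line):
        self.refs[symbol].append((file, line))

    def context_for(self, symbol):
        """Everything a prompt builder might want to know about a symbol."""
        return {"definition": self.defs.get(symbol),
                "references": self.refs[symbol]}

g = CodeGraph()
g.define("parse_config", "config.py", 10)
g.reference("parse_config", "main.py", 3)
ctx = g.context_for("parse_config")
```

Handing the LLM the definition site (and thus the real signature and types) is what lets a graph like this cut down type-mismatch errors in generated code.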
OPEN SOURCE, MULTIMODAL AI, AND THE FUTURE OF CODING ASSISTANTS
The potential of open-source models is lauded, with StarCoder already competitive for completions when paired with Cody's context. The speakers foresee a future of multimodal AI, opening new interaction paradigms beyond current chat and command interfaces. They suggest that the current form factor of coding assistants is temporary, and new ways of interacting with AI, potentially involving real-time screen analysis or visual input, will emerge, providing more proactive and integrated assistance.
THE ENDURING ROLE OF THE DEVELOPER AND THE DATA ADVANTAGE
Ultimately, the consensus is that AI will augment rather than replace developers. It will handle more of the tedious, boilerplate, and routine tasks, allowing engineers to focus on higher-level creative problem-solving, architecture, and innovation. While agents might eventually automate more complex workflows, the current focus remains on robust, reliable tools like Cody that provide deep code understanding and context, offering a competitive advantage through sophisticated data pre-processing and graph analysis.
Common Questions
What differentiates Cody from other AI coding assistants?
Cody's core differentiator is the quality of its context, leveraging Sourcegraph's decade-long experience in building code understanding engines for human developers. It provides codebase-aware chat and inline autocompletion.
Mentioned in this video
Co-founder of Sourcegraph, previously worked in computer vision at Stanford and Palantir, and interned at Google.
Previously worked at Amazon and Google, known for blog posts, and then at Grab as Head of Engineering. Also developed the Grok system at Google. Now works at Sourcegraph.
Built a similar code search system called Hound at Etsy.
Legendary programmer, mentioned speculatively as someone who might find the next big breakthrough in AI architectures.
Created Zoekt, a code search tool heavily inspired by Google's trigram index.
Represents a school of thought in AI focusing on data-driven, machine learning models, believing deterministic approaches had failed.
Represents a school of thought in AI focusing on formal, precise systems like compilers and parsers.
A technologist whose concept of 'tera-ped-flops' (a human worth of compute) is used to contextualize the scaling of GPT models.
CEO of OpenAI, mentioned for his historical context on GPT models and his projection that AI capabilities will reach 'every human ever' levels by the end of the decade.
Company of which Max is the head.
Company where Beyang Liu worked after Stanford, and where he and Quinn started working together.
An AI company whose models (like GPT-4) Sourcegraph partners with and uses, specifically for chat and commands. The recent drama around their CEO's departure and return caused concern among Sourcegraph's customers.
Company where Beyang Liu interned and Steve Yegge worked, known for its internal code search tools like Grok and Kythe.
An inference platform used by Sourcegraph to run StarCoder, allowing them to focus on data fetching and fine-tuning rather than building their own inference stack. Their team includes ex-Meta people knowledgeable in PyTorch.
Mentioned as a potential fallback for Sourcegraph's AI services if OpenAI's stability were to be an issue.
Mentioned as an example of a company with excellent recommender engines, drawing parallels to Sourcegraph's approach to code recommendations.
Online marketplace where Kelly Norton built the Hound open-source code search project.
An AI company whose models (like Claude) Sourcegraph partners with and uses for chat and commands.
Company where Steve Yegge famously worked, described as a giant waterfall of engineers.
Southeast Asian super app where Steve Yegge was Head of Engineering; initially criticized by one host as a customer, but Steve praised its engineering team and laser focus.
Mentioned as a comparison to Grab, highlighting Grab's additional functionality as a super app in Southeast Asia.
An organization driving the open-source AI ecosystem, mentioned for possibly releasing a V2 of StarCoder. Sourcegraph would like to collaborate on benchmarks with them.
Mentioned as an example in a demo of how Cody could generate a stock ticker app.
A coding assistant mentioned as a competitor in the market.
The relational database used by Sourcegraph, which found it performed as well as most graph databases for graph workloads in their experience.
A company founded 10 years ago to index all code on the internet, now focused on AI coding intelligence with Cody. Initially focused on on-prem deployments, now cloud-hybrid.
An AI language model that recently celebrated its one-year anniversary, demonstrating the rapid advancement in AI.
A programming language that the speaker spent most of their career writing, noted for being easy to use and read despite not always being the fastest.
Sourcegraph's AI coding agent designed to provide inline autocompletion, codebase-aware chat, and automate tasks like unit test generation and documentation. Differentiates itself by leveraging Sourcegraph's decade-long expertise in code understanding and context fetching.
A coding assistant that 'died', mentioned in the context of other coding assistants.
An external tool for building LLM applications, which Sourcegraph decided against using due to needing full control over their stack for rapid iteration.
The AI model that GitHub Copilot is based on.
A query language; the speaker notes that few engineers still write its queries without AI assistance.
An open-source model used by Cody for inline completions, achieving comparable acceptance rates to Copilot by leveraging Sourcegraph's context fetching.
An internal Google developer tool that indexed code and provided a reference graph, later open-sourced as Kythe.
An AI model used by Sourcegraph for completions for a while, but required prompt engineering to output code without extra text.
The internal system at Google that provided the reference graph for Google Code Search, built by Steve Yegge. It's considered a predecessor to Kythe.
Google's internal protocol for backend queries, which takes a different approach from LSP.
The open-source version of Google's code intelligence system, considered 'Grok V3'.
An open-source model that Sourcegraph is not currently using but continuously evaluates against other available models.
Sourcegraph's internal working name for a new SCIP-based code graph that is blazing fast, requires zero configuration, and doesn't require build-system integration, addressing issues like type errors in AI completions.
One of the programming languages used in Sourcegraph's AI stack.
An open-source code search project built at Etsy, similar to Google Code Search.
A Transformer-based LLM from OpenAI, mentioned for its ability to utilize context from the top of its window and its decent chess-playing capability. Sourcegraph still uses GPT-4 for chat and commands.
An early feature from Cody's first launch that allowed multiple parallel requests to modify a source file, but was difficult to make reliable for general code generation.
A code search system created by Hannes Neinhuis, inspired by the trigram index of Google's original code search.
A programming language originally intended for creating rules-based AI systems, representing the Chomsky approach.
A JavaScript library mentioned as a framework for building a hypothetical stock ticker app demo with Cody.
A prominent AI coding assistant that Cody is often compared against, particularly regarding context utilization and completion acceptance rates. It uses variants of Codex and local context.
A Google coding assistant that initially focused on pulling local context, similar to GitHub Copilot's initial approach.
A new protocol developed by Sourcegraph that aims to combine the best ideas from LSP and Kythe, making it easier to write indexers and model symbolic characteristics of code.
An AI model that reportedly uses context from the bottom of its window better than other models. Sourcegraph uses Claude and GPT-4 for chat and commands.
A coding platform mentioned for its potential to bootstrap its own proprietary dataset through bounties, though it's rumored they're betting on OpenAI.
A logic programming language, mentioned in the context of deterministic Chomsky approaches to AI that didn't scale effectively.
An example of a notoriously challenging API to work with, which the host successfully scraped using Cody web.
Mentioned as an example of a programming language that might become less relevant for human understanding as AI generates more code through natural language.
One of the programming languages now being used in Sourcegraph's AI stack.
Beyang Liu's internal pet name for Sourcegraph's hybrid AI architecture, a portmanteau combining Norvig's data-driven approach with Chomsky's formal systems.
A common training corpus for AI models, used in combination with The Stack.
The observation that the number of transistors in an integrated circuit doubles approximately every two years, used as an analogy for the rapid scaling of AI capabilities.
A common training corpus for AI models, used in combination with The Pile.
A protocol being developed by Sourcegraph to define a common language for context providers to offer hints and context to AI developer tools like Cody.
A protocol that standardized code intelligence features across editors, praised for making code navigation easier but criticized for its range-based approach rather than symbolic modeling of code.
A post written by Steve Yegge arguing that data models, particularly RAG, are more important than UI or chat responses in AI coding assistants.
A famous talk or post by Peter Norvig that contributed to the shift towards data-driven AI approaches.