
Production AI Engineering starts with Evals

Latent Space Podcast
Science & Technology | 4 min read | 117 min video
Oct 11, 2024
TL;DR

BrainTrust's co-founder discusses evaluating AI in production, the transition from databases to AI engineering, and the evolution of AI tools.

Key Insights

1. Production AI engineering must prioritize evaluation (evals) as a core workflow for driving improvements and decision-making.

2. The evolution of AI, particularly through transformers and LLMs like GPT-3/4, has enabled traditional software engineers to participate more directly in AI development.

3. BrainTrust aims to empower product builders and software engineers with AI tools, focusing on user experience and simplifying complex AI workflows.

4. The AI market is rapidly evolving, with a shift from purely technical solutions to solving business problems, exemplified by the move from fine-tuning to broader automatic optimization.

5. While open-source models have potential, the current production landscape favors reliable, scalable, and easily accessible models, often via APIs, due to operational complexities.

6. The future of AI development involves integrating intelligence seamlessly into applications, rather than building complex agentic systems, making code simpler and more user-centric.

THE FOUNDATIONAL SHIFT: EVALUATION IN AI PRODUCTION

The core idea behind BrainTrust, as explained by its co-founder, is revolutionizing AI production through a strong emphasis on evaluation. The speaker recounts an experience at Impira where implementing an evaluation system dramatically sped up model improvement. This highlights the bottleneck that arises when discussions about model choices are purely hypothetical or based on limited examples. Evals provide a scientific framework to measure progress, identify regressions, and guide development, fundamentally changing how AI applications are iterated upon and improved.
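
To make the idea of measuring progress and catching regressions concrete, here is a minimal sketch of an eval loop in Python. It is not BrainTrust's API: the dataset, the exact-match scorer, and the two stand-in task functions are all hypothetical, and no model is actually called.

```python
from typing import Callable

# Hypothetical eval dataset: (input, expected output) pairs.
DATASET = [
    ("2 + 2", "4"),
    ("3 * 3", "9"),
    ("10 - 7", "3"),
]

def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the output matches the expected answer, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def run_eval(task: Callable[[str], str]) -> list[float]:
    """Run the task over every example and return per-example scores."""
    return [exact_match(task(inp), expected) for inp, expected in DATASET]

def compare(baseline: list[float], candidate: list[float]) -> dict:
    """Summarize a candidate run against a baseline run, example by example."""
    pairs = list(enumerate(zip(baseline, candidate)))
    return {
        "baseline_mean": sum(baseline) / len(baseline),
        "candidate_mean": sum(candidate) / len(candidate),
        "regressions": [i for i, (b, c) in pairs if c < b],
        "improvements": [i for i, (b, c) in pairs if c > b],
    }

# Stand-ins for two versions of an AI feature (assumed, for illustration only).
def baseline_task(expr: str) -> str:
    return {"2 + 2": "4", "3 * 3": "6", "10 - 7": "3"}.get(expr, "")

def candidate_task(expr: str) -> str:
    return {"2 + 2": "4", "3 * 3": "9", "10 - 7": "4"}.get(expr, "")

report = compare(run_eval(baseline_task), run_eval(candidate_task))
print(report)
```

In this toy run the two versions have the same average score, yet the per-example comparison shows one improvement and one regression. That is the kind of detail a debate based on a handful of hand-picked examples tends to miss.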

FROM DATABASES TO AI: A CAREER EVOLUTION

The journey from SingleStore, an HTAP database company, through Impira, an AI company focused on unstructured data, to co-founding BrainTrust showcases a deep understanding of complex systems and evolving technological landscapes. Early career roles at Microsoft and in research, while providing foundational knowledge, lacked the desired impact and creativity. The speaker's experience with SingleStore revealed the trade-offs between advanced technology and market accessibility, a lesson that informed subsequent ventures. Impira, though technically innovative, highlighted the difficulty of selling technical solutions without deep customer empathy, especially when targeting line-of-business users, and reinforced the importance of sales and business acumen alongside technical expertise.

THE IMPIRA ACQUISITION AND THE AI PARADIGM SHIFT

The acquisition of Impira by Figma was catalyzed by a rapid technological shift. Impira's initial strength lay in computer vision-based document extraction, requiring extensive data examples. However, the emergence of transformer models like BERT and, critically, GPT-3 and its successors, fundamentally changed the game. The speaker's personal experimentation with these models revealed their power in understanding text and context, quickly cannibalizing previous approaches. This realization led to a strategic pivot to leverage these new models, but ultimately, the founder recognized that the core problem of unstructured data transformation was becoming commoditized, prompting the pursuit of an acquisition to find a new, more impactful direction.

BRAINTRUST: EMPOWERING AI ENGINEERS WITH A DEVELOPER-FIRST APPROACH

BrainTrust emerged from the need for better tools tailored to software engineers entering the AI space. The speaker observed that traditional ML evaluation tools were often inaccessible to software engineers, creating a divide. BrainTrust's platform bridges this gap by offering an end-to-end developer platform that integrates evaluation, data collection, and prompt management. The platform's evolution from an evaluation tool to a debugger and now an IDE-like experience reflects its user-driven development. Key features like the durable, collaborative playground, automatic data ETL via logging, and the ability to define custom tools empower developers to build and iterate on AI products more efficiently.
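
As a rough illustration of turning production logs into eval data, the sketch below converts logged requests into dataset rows. The log schema (`input`, `output`, `feedback`, `corrected`) and the filtering rules are assumptions made for this example, not BrainTrust's actual format.

```python
import json

# Hypothetical production logs, one JSON record per line.
LOG_LINES = [
    '{"input": "Summarize: A", "output": "A.", "feedback": "up"}',
    '{"input": "Summarize: B", "output": "??", "feedback": "down", "corrected": "B."}',
    '{"input": "Summarize: A", "output": "A.", "feedback": "up"}',
    '{"input": "Summarize: C", "output": "C."}',
]

def logs_to_dataset(lines):
    """Keep only records with a trustworthy label and de-duplicate by input."""
    labeled = {}
    for line in lines:
        record = json.loads(line)
        feedback = record.get("feedback")
        if feedback == "up":
            # The user approved the output, so treat it as the expected answer.
            expected = record["output"]
        elif feedback == "down" and "corrected" in record:
            # The user rejected the output but supplied a correction.
            expected = record["corrected"]
        else:
            # No feedback, or a rejection without a correction: skip it.
            continue
        labeled[record["input"]] = expected
    return [{"input": i, "expected": e} for i, e in labeled.items()]

dataset = logs_to_dataset(LOG_LINES)
print(dataset)
```

The design choice here is to admit only rows with some human signal, so the resulting dataset can serve as ground truth for later eval runs.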

THE EVOLVING AI MARKET: BEYOND FINE-TUNING AND TOWARDS GENERAL INTELLIGENCE

The discussion touches upon market trends, including the declining use of fine-tuning in production, despite its technical validity. The emphasis is shifting towards automatic optimization as a business goal, achievable through various methods like prompt engineering and in-context learning. The speaker posits that agentic frameworks, while currently popular, might be a temporary workaround for LLMs' current reasoning limitations, predicting that future models will integrate more complex logic directly. This perspective suggests a future where AI capabilities are increasingly embedded within foundational models, simplifying the architecture of AI applications.
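
One simple reading of automatic optimization without fine-tuning is to score several prompt variants against an eval set and keep the best one. The sketch below assumes a toy `fake_model` function in place of a real LLM call; the prompts, examples, and scoring rule are hypothetical.

```python
# Hypothetical labeled examples for a sentiment task.
EXAMPLES = [
    ("great product", "positive"),
    ("terrible support", "negative"),
    ("love it", "positive"),
]

# Candidate prompt templates to choose between.
PROMPTS = {
    "bare": "Classify: {text}",
    "instructed": "Classify the sentiment as positive or negative. Text: {text}",
}

def fake_model(prompt: str) -> str:
    """Toy stand-in for an LLM: it only answers well when given the label set."""
    if "positive or negative" not in prompt:
        return "unsure"
    text = prompt.rsplit(":", 1)[-1].strip()
    negative_words = {"terrible", "bad", "awful"}
    return "negative" if any(w in negative_words for w in text.split()) else "positive"

def accuracy(template: str) -> float:
    """Fraction of examples the model labels correctly under this template."""
    hits = sum(
        fake_model(template.format(text=text)) == label for text, label in EXAMPLES
    )
    return hits / len(EXAMPLES)

scores = {name: accuracy(template) for name, template in PROMPTS.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

The same loop works whether the candidates differ by instructions, few-shot examples, or model choice, which is why the business goal can be framed as optimization rather than as any one technique.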

THE ROLE OF INFRASTRUCTURE AND MARKET DYNAMICS

BrainTrust's strategic bets, such as its hybrid on-premise model and prioritization of TypeScript, highlight a pragmatic approach to serving a demanding market. The speaker argues that despite market skepticism, these choices have enabled deeper customer integration and resonated with product builders. The conversation also delves into the GPU inference market, suggesting that while margins can be high, availability and reliability are critical differentiators, favoring established players like OpenAI. The panel discusses the evolving landscape of model providers (OpenAI, Anthropic, Meta) and the complexities of integrating them, noting that ease of use and consistent availability remain paramount for production AI.

THE FUTURE OF AI WORKLOADS: SPRINKLING INTELLIGENCE EVERYWHERE

The prevailing trend observed in production AI workloads is the shift towards 'sprinkling intelligence' throughout applications rather than building monolithic agentic systems. This approach involves embedding discrete AI calls for tasks like summarization or data generation within existing software. BrainTrust's platform supports this paradigm by making it easy to integrate AI capabilities, from simple prompt manipulations to more complex agents. The focus is on enhancing user experience and developer productivity by making AI features easily accessible and usable, moving towards a future where building intelligent software is as straightforward as building traditional software.
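
A sprinkled AI call can be as small as one function inside ordinary application code. The sketch below generates a ticket title from a complaint and falls back to plain string handling if the model call fails. The `complete` callable is an assumed stand-in for whatever model client an application uses; the stubs at the bottom exist only for illustration.

```python
from typing import Callable

def fallback_title(text: str, limit: int = 60) -> str:
    """Non-AI fallback: use the first line of the text, truncated if needed."""
    stripped = text.strip()
    if not stripped:
        return "Untitled"
    first_line = stripped.splitlines()[0]
    if len(first_line) <= limit:
        return first_line
    return first_line[: limit - 3].rstrip() + "..."

def ticket_title(complaint: str, complete: Callable[[str], str]) -> str:
    """Ask the model for a title, but never let a model failure break the feature."""
    prompt = f"Write a short ticket title for this complaint:\n{complaint}"
    try:
        title = complete(prompt).strip()
    except Exception:
        title = ""
    return title or fallback_title(complaint)

# Hypothetical stubs standing in for a real model client.
def stub_complete(prompt: str) -> str:
    return "Login page returns 500 after password reset"

def broken_complete(prompt: str) -> str:
    raise RuntimeError("model unavailable")

print(ticket_title("I reset my password and now login throws a 500.", stub_complete))
print(ticket_title("Checkout is slow", broken_complete))
```

Passing the model call in as a parameter keeps the surrounding code testable and makes it easy to swap providers, which fits the emphasis on availability and ease of use.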

Common Questions

Impira was founded on the idea of making unstructured data as easy to use as structured data, leveraging advancements in deep learning models like AlexNet. The speaker aimed to tackle the challenges of data extraction that were previously impossible.

Topics

Mentioned in this video

Companies
Brain Trust

The speaker's current company, an end-to-end developer platform for building AI products, centered on an evaluation-driven workflow. It evolved from an eval tool to a debugger and eventually an IDE-like playground.

Palo Alto Networks

A cybersecurity company where Alena's father was president before joining Cloudflare, managing billions in revenue.

Microsoft

The speaker's first internship, working on Bing's distributed compute infrastructure. The experience was impactful but lacked intense creativity and room for interesting work.

Redshift

Amazon's cloud data warehouse, noted for having a feature similar to Snowflake's variant type called 'Super'.

Cloudflare

A web infrastructure and website security company where Alena's father is currently the president.

Hugging Face

A platform for building, training, and deploying machine learning models. The speaker became a top non-employee contributor, working on document QA models.

SingleStore

A leading HTAP (Hybrid Transactional/Analytical Processing) database. The speaker was its first VP of Engineering and discussed its advanced technology but also its high cost and niche market suitability, comparing its evolution to Neon for wider adoption.

Stitch Fix

An online personal styling service, mentioned as a company Impira tried to close a deal with early on.

Impira

The speaker's first AI company, founded on the idea of making unstructured data as easy to use as structured data, leveraging ML models. They learned critical business lessons about sales, customer empathy, and market fit.

Adobe

Mentioned in the context of Adobe's planned acquisition of Figma (later abandoned), a factor contributing to Figma's sense of stability at the time.

Humanloop

One of the early movers in the AI tooling space, offering durable playgrounds, prompt saving, and eval features, but predating Brain Trust's focus on engineering efficiency and declarative evals.

Cruise

A self-driving car company, where Eden, Brain Trust's Head of Product, was a designer.

Figma

A design tool company that acquired Impira. The speaker was there for eight months and discussed the challenges of integrating AI, especially visual AI, into a high-quality product with an annual release cycle.

Neon

A company founded by Nikita Shamgunov, aiming to provide a hyper-inexpensive PostgreSQL offering with the world's best free tier, in contrast to SingleStore's high-cost model.

Brex

A customer of Brain Trust, whose engineers expressed a desire for Brain Trust's playground to become their IDE.

Databricks

A data and AI company with a famous hybrid on-prem model. While their model is successful, it's often viewed with mixed perspectives, serving as a cautionary tale/inspiration for Brain Trust's hybrid approach.

Datadog

A monitoring and analytics platform that Figma used, mentioned as a comparison point for observability solutions.

Google

Mentioned as a public cloud provider alongside Amazon and Azure, suggesting big companies have special relationships for their AI models.

Snowflake

A cloud-based data warehousing company, praised for its best-in-class implementation of semi-structured data with its 'variant type' but criticized for its expensive packaging and high minimum query time.

OpenAI

A leading AI research and deployment company. Brain Trust leveraged their API for LLM judging and tool interactions, and OpenAI's models are heavily adopted by Brain Trust customers due to reliability and availability.

Firebase

A platform for developing mobile and web applications. The interviewee likens Brain Trust to Firebase for traditional software developers: an end-to-end platform.

Pinecone

A vector database company, mentioned as a company that the speaker would host on their podcast to hear an opposing view on vector databases.

Meta

The company behind the Llama models, seen as a key player invigorated by OpenAI's advancements and contributing to a healthier AI ecosystem.

Hairloom

An AI product that records screen activity and titles it properly, delighting users with a small AI sprinkle of intelligence.

Browserbase

A tool that allows users to run browsers in the cloud, integrated into Brain Trust for defining custom tools.

Zapier

An automation platform and early adopter of Brain Trust, whose engineer Brian provided critical feedback during product development. They also use Linear for auto-generating ticket titles.

Anthropic

An AI safety and research company, known for its Claude models. The speaker considered starting a company based on their eval process while interviewing there. Praised for offering strong alternatives to OpenAI, especially with Haiku.

Temporal

A workflow as code platform, where a cloud VPC peering solution was used, demonstrating complex deployment scenarios addressed by Brain Trust's hybrid model.

Vercel

A cloud platform for serverless deployment, a customer of and investor in Brain Trust. Malte, an engineer from Vercel, provided a key quote on workflow transformation.

Splunk

A data platform for security, observability, and IT operations, mentioned as a company that built its own database technology to solve the "variant type" problem for observability.

Software & Apps
Bing

Microsoft's search engine, where the speaker worked on distributed compute infrastructure that is now part of Azure.

ClickHouse

A column-oriented database management system for online analytical processing, noted for working on something similar to semi-structured data handling.

PostgreSQL

A powerful, open-source object-relational database system that Neon is building a hyper-inexpensive version of.

AlexNet

A groundbreaking convolutional neural network, whose release made the speaker realize new possibilities for data processing, inspiring Impira.

GPT-3

An advanced language model by OpenAI, which 'totally blew the speaker's mind' with its ability to extract information from unstructured text, even without visual signals, surpassing LayoutLM.

Notion

An early AI adopter that iteratively ships AI features. Its conference was mentioned, and its AI features were initially hacked by founders at a retreat.

VS Code

A popular code editor from Microsoft. The speaker entertains the idea of forking it as a future direction for Brain Trust's IDE-like playground.

Google Sheets

A spreadsheet application, used as an analogy for basic, DIY eval systems that many early AI products resemble. Brain Trust aims to offer more advanced features beyond simple spreadsheet functionality.

Haiku

A model from the Claude 3 family, noted as a smart, cheap, and fast model with tool-calling capabilities, providing a significant foothold for Anthropic.

Google Docs

A collaborative online document editor, used as an analogy for the collaborative and real-time nature of Brain Trust's playground.

Claude

Anthropic's family of large language models. Claude 3, particularly Haiku and Sonnet, offered compelling alternatives to OpenAI for Brain Trust's customers.

Deno

A secure runtime for JavaScript and TypeScript, mentioned in the context of different ways of handling arbitrary code execution in JavaScript.

TensorFlow

An open-source machine learning framework. PyTorch is seen as a huge improvement over TensorFlow, but still not ideal for TypeScript engineers.

Redis

An open-source, in-memory data store, mentioned as an example of a database technology that solves specific technical problems.

Azure

Microsoft's cloud platform. Big companies using OpenAI models often use Azure due to special relationships, but it can present engineering challenges compared to OpenAI's direct endpoints.

Honeycomb

An observability platform that built its own super wide column store. The speaker agrees with their decision given the lack of accessible semi-structured data solutions like Snowflake's variant type.

BERT

A neural network-based technique for natural language processing pre-training, which significantly accelerated text-based information extraction and began to cannibalize Impira's computer vision-based approach.

Exa

A company that provides a search API, integrated into Brain Trust as a custom tool for agents to search the internet.

Coda

A document collaboration platform and early Brain Trust user. They started using the product quickly due to its hybrid on-prem model and focus on evaluations.

Claude 3

Anthropic's latest family of models. Its release, particularly Haiku and Sonnet, significantly shifted market share, allowing developers more options beyond OpenAI.

GPT-4o

OpenAI's multimodal flagship model, whose capabilities are leading to a shift where complex reasoning and agentic logic will be integrated directly into the model, rather than requiring external frameworks.

Groq

An AI inference hardware company. The speaker mentioned Dylan Patel's calculations on whether Groq is burning cash.

SAP HANA

An in-memory, column-oriented, relational database management system, mentioned as a more expensive alternative to SingleStore in its early days.

Chroma

An open-source embedding database, mentioned as a company that the speaker would host on their podcast to hear an opposing view on vector databases.

DSPy

A framework for programmatically optimizing prompts, mentioned as a step towards automatic optimization, but criticized for its PyTorch-like code structure that might not suit TypeScript engineers.

FastAPI

A modern, fast web framework for building APIs with Python, mentioned as a common tool for building applications that can benefit from sprinkled intelligence.

DuckDB

An in-process SQL OLAP database management system, mentioned for its struct type, which has downsides compared to Snowflake's variant type for handling schema changes.

ChatGPT

An AI chatbot developed by OpenAI, whose release validated the speaker's pivot away from Impira's vision-based document processing, as it could cannibalize their technology.

Sonnet

A model from the Claude 3 family, initially seen as a 'middle child' but now considered both cheap and smart, offering a pleasant communication experience.

NumPy

A Python library fundamental for numerical computing, mentioned as intuitive for continuous mathematicians (ML people) but non-intuitive for the speaker (software engineer) who prefers discrete math.

Cursor

An AI-powered code editor, positioned as complementary to Brain Trust: Cursor enhances traditional software engineering with AI, while Brain Trust brings software engineering best practices to AI development.

Node.js

A JavaScript runtime, which Brain Trust settled on supporting for arbitrary code execution, similar to Vercel's approach.

Next.js

A React framework for web development, used as an example of a common tool for TypeScript engineers.

Linear

A project management tool used at Brain Trust, which integrates with AI to auto-generate ticket titles from Slack complaints, illustrating a single-prompt manipulation use case.

OpenAI Playground

OpenAI's interactive tool for experimenting with language models. Brain Trust's playground was built to be more durable, shareable, and collaborative than this.

Airtable

A database-spreadsheet hybrid software and early Brain Trust user, whose data stayed in their cloud and benefited from the hybrid on-prem model.

PostGIS

An extension for PostgreSQL that adds support for geographic objects. The speaker's ideal database scenario would be a unique combination of storage and execution paradigms, like PostGIS.

PyTorch

An open-source machine learning framework, whose coding style for automatic optimization is seen as non-ergonomic for TypeScript engineers.

AWS

A cloud platform. Customers often have committed spend with AWS, which would ordinarily suggest using cloud services for models, but the speaker notes that direct OpenAI endpoints are often preferred for convenience and access to newest models.

Llama

Meta's family of open-source large language models. Llama 3 8B is mentioned as a powerful open-source model that could change the fine-tuning landscape, and many people have an incentive to see it succeed.
