AI Dev 25 x NYC | Alex Ker: How Open Source Models Actually Run AI Coding at Scale

DeepLearning.AI
Education · 4 min read · 25 min video
Dec 2, 2025


TL;DR

Open source AI coding models now rival closed-source ones, offering advantages in speed, cost, and control.

Key Insights

1

Open source AI coding models are rapidly closing the performance gap with closed-source alternatives like GPT-5 and Claude.

2

Key advantages of open source models include lower latency, improved reliability at scale, and significant cost reductions for production deployments.

3

Specialized open source models like Qwen3 Coder and Kimi K2 excel at specific tasks, with Kimi K2 demonstrating advanced tool-use capabilities through 'interleaved thinking'.

4

Developers can integrate open source models into their workflows using tools like OpenRouter, Cline, and the Vercel AI SDK, with options ranging from simple API rerouting to dedicated IDEs.

5

Optimizing open source models for specific use cases, such as code autocompletion, requires techniques like KV caching, KV-aware routing, and n-gram speculation to minimize latency.

6

For production-scale deployments, dedicated infrastructure and fine-tuning open source models offer greater control, performance, and cost-efficiency compared to shared endpoints.

THE EVOLVING LANDSCAPE OF AI CODING MODELS

The AI development landscape is shifting, with open source models increasingly challenging the dominance of closed-source giants like GPT-5 and Claude. While closed-source models were historically favored for their intelligence, the quality gap is narrowing considerably. Recent releases, such as Kimi K2, benchmark competitively with leading proprietary models, signaling a new era where open source alternatives are viable, and often superior, for production applications.

ADVANTAGES OF OPEN SOURCE MODELS

Open source models offer distinct advantages that are crucial for scalable AI applications. Firstly, latency is significantly improved, leading to a more responsive user experience by reducing the time to first token. Secondly, reliability is enhanced, ensuring consistent performance as user traffic grows. Finally, and critically for production, open source models offer substantial cost savings, making AI economically feasible at scale. These factors are essential for keeping pace with the rapid adoption of AI products.
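The cost argument compounds linearly with traffic, which a back-of-envelope model makes concrete. The prices below are illustrative placeholders, not quotes for any real provider or model:

```python
# Back-of-envelope cost model for token-priced inference. Prices are
# illustrative placeholders, not quotes for any real provider or model.
def monthly_cost(requests_per_day, in_tokens, out_tokens, price_in, price_out):
    """price_in / price_out are USD per 1M tokens; returns USD per 30 days."""
    daily = requests_per_day * (in_tokens * price_in + out_tokens * price_out) / 1e6
    return round(daily * 30, 2)

# 100k requests/day, 2k prompt tokens and 500 completion tokens per request.
closed = monthly_cost(100_000, 2_000, 500, price_in=3.0, price_out=15.0)
open_src = monthly_cost(100_000, 2_000, 500, price_in=0.6, price_out=2.5)
```

At these assumed prices, the monthly bill drops from $40,500 to $7,350, which is the kind of gap that decides whether a product is economically viable at scale.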

KEY OPEN SOURCE MODELS FOR CODING

Several open source models are at the forefront of AI coding. GLM 4.6 provides strong general reasoning capabilities and is more efficient than its predecessors. Qwen3 Coder, a specialist coding model from Alibaba, remains a solid option for prototyping or repetitive programming tasks. The most exciting is Kimi K2, a trillion-parameter model leading benchmarks and demonstrating advanced tool use. Kimi K2's 'interleaved thinking' mimics human problem-solving by iteratively reflecting and adjusting its approach after each action, a significant improvement over traditional chain-of-thought methods.
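The interleaved-thinking pattern can be sketched as a generic agent loop that reflects after every action instead of emitting one long chain of thought up front. All names below are illustrative, not the actual Kimi K2 interface:

```python
# Minimal sketch of 'interleaved thinking': the agent reflects after every
# tool call and adjusts its next action, rather than planning everything
# in one chain of thought. Names are illustrative, not a real model API.
def interleaved_agent(task, think, act, done, max_steps=5):
    history = [("task", task)]
    for _ in range(max_steps):
        thought = think(history)               # reflect on everything so far
        history.append(("thought", thought))
        observation = act(thought)             # take exactly one action
        history.append(("observation", observation))
        if done(observation):                  # stop once the goal is met
            break
    return history

# Toy stand-ins: "thinking" picks the next index to try, "acting" looks it up.
results = [5, 8, 3, 9]
think = lambda h: sum(1 for kind, _ in h if kind == "thought")
act = lambda i: results[i]
steps = interleaved_agent("find 3", think, act, done=lambda obs: obs == 3)
```

The key design point is that `think` sees the full history, including past observations, so each step can correct course, unlike chain-of-thought, which commits to a plan before acting.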

INTEGRATING OPEN SOURCE MODELS INTO WORKFLOWS

Adopting open source models into existing development workflows is becoming increasingly accessible. Simple methods include rerouting API requests from familiar CLIs like Claude Code to open source endpoints, which can drastically reduce costs and latency. More comprehensive solutions include using unified platforms like OpenRouter, which offers access to numerous models with fallback capabilities. Frameworks such as the Vercel AI SDK and tools like LangChain and LlamaIndex also provide robust integrations for building AI-powered applications.
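Because OpenRouter exposes an OpenAI-compatible API, rerouting often amounts to changing the base URL and model name. A minimal stdlib sketch of the request body follows; the model slugs and the `models` fallback field are assumptions to verify against OpenRouter's documentation:

```python
import json

# Hypothetical endpoint and model slugs for illustration; OpenRouter serves
# an OpenAI-compatible chat completions API, so existing clients can often
# be rerouted by changing only the base URL and model name.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_chat_request(model, prompt, fallbacks=None):
    """Serialize the same JSON body the OpenAI API uses; an optional
    'models' list can be supplied for fallback routing."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    if fallbacks:
        body["models"] = [model, *fallbacks]
    return json.dumps(body)

payload = build_chat_request(
    "moonshotai/kimi-k2", "Refactor this function", fallbacks=["z-ai/glm-4.6"]
)
```

POSTing this body to `OPENROUTER_URL` with an `Authorization: Bearer <key>` header is all the "rerouting" a typical client needs.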

ADVANCED TOOLS FOR OPEN SOURCE AI DEVELOPMENT

For a more integrated experience, IDEs like Cline offer a 'bring your own key' setup and segmented agent modes for planning and acting, which simplifies the management of context windows and conversation history. Baseten, for its part, optimizes inference for various open source coding agents. This optimization is crucial for applications like autocomplete, where a fast time to first token and efficient handling of long contexts with short decodes are paramount to maintaining a seamless developer experience and keeping pace with user activity.
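One way inference platforms keep long conversations fast is to pin each conversation to the replica that already holds its KV cache. This can be sketched as deterministic routing; the replica names are hypothetical:

```python
import hashlib

# Sketch of KV-aware routing: requests sharing a conversation prefix are
# pinned to the same replica, so the attention keys/values already computed
# for that prefix can be reused instead of re-running prefill.
REPLICAS = ["replica-a", "replica-b", "replica-c"]  # hypothetical pool

def route(conversation_id):
    """Deterministically map a conversation to one replica."""
    digest = hashlib.sha256(conversation_id.encode()).digest()
    return REPLICAS[int.from_bytes(digest[:8], "big") % len(REPLICAS)]
```

Real routers also handle replica failure and load skew, but the core property is the one shown: the same conversation always lands on the same cache.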

OPTIMIZATION TECHNIQUES FOR HIGH-PERFORMANCE INFERENCE

Achieving low latency and high throughput for AI coding applications requires specialized optimization techniques. For autocompletion, crucial metrics are sub-300ms time-to-first-token and efficient handling of long prefill (ingesting code) and short decode (generating completion) phases. Techniques like KV caching (reusing computed key-value pairs), KV-aware routing (directing requests to servers that already hold the cache for an ongoing conversation), and n-gram speculation (drafting likely next tokens from a dictionary of common code patterns) significantly speed up inference, as demonstrated with Sourcegraph's AMP tab.

PRODUCTION DEPLOYMENT STRATEGIES

Deploying open source models at scale involves dedicated infrastructure, often referred to as 'dedicated deployments.' This approach, supported by Baseten across multiple cloud providers, segments customer traffic onto private instances, bypassing the limitations of shared endpoints. This allows for greater control, optimized performance, and cost-efficiency. Open source models also benefit from fine-tuning, enabling developers to tailor them to specific use cases and further enhance their effectiveness in production environments.

THE FUTURE AND KEY TAKEAWAYS FOR DEVELOPERS

Developers utilizing only closed-source models are missing out on significant advancements and cost efficiencies. The open-source AI ecosystem is rich with tooling and models that are rapidly maturing. The key takeaway is to experiment with these models and tools, not limiting oneself to a single solution. For ML engineers focused on user experience, prioritizing performance, reliability, and control in production is essential for building successful AI applications. Connecting with communities and exploring new models will be critical.

Common Questions

What are the limitations of closed-source models like GPT-5 and Claude for production use?

Closed-source models like GPT-5 and Claude, while intelligent, can present limitations in production use due to potential issues with latency, reliability at scale, and higher costs compared to open-source alternatives.

Topics

Mentioned in this video

organizationStanford HAI

Stanford Institute for Human-Centered Artificial Intelligence, where Alex Ker contributed as an editor.

softwareGLM 4.5

Previous iteration of GLM, mentioned for comparison with GLM 4.6's efficiency.

softwareZed

An open-source coding agent that Baseten helps power.

softwareAMP tab

Sourcegraph's product optimized by Baseten for autocomplete, achieving 2x higher speed.

personAlex Ker

Growth software engineer at Baseten, speaker at the event, discussing open source AI models for coding.

companyBaseten

Company where Alex Ker works, focusing on enabling developers to build better AI applications and optimizing AI inference.

companyLaunchDarkly

Previous employer of Alex Ker, where he worked on reinforcement learning infrastructure.

softwareGLM 4.6

An open-source model focused on general reasoning, with stellar performance and 30% more efficiency than its predecessor.

softwareVercel AI SDK v5

An integration example for AI coding in workflows, suitable for production and powering Next.js web apps.

companyJP Morgan Chase

Company where the questioner, Yasha Na, works as a data scientist.

softwareQwen3 Coder

An open-source specialist coding model from Alibaba, suitable for prototyping or repetitive programming tasks.

softwareCline

A favorite IDE with over 2 million developers, featuring a bring-your-own-key setup and segmented plan/act modes.

companyNeurable

Previous employer of Alex Ker, where he built ML pipelines.

organizationPI

An AI incubator founded by Alex Ker during college.

softwareLlamaIndex

An integration option for using frontier open-source models.

softwareSourcegraph
toolOpenRouter
toolKimi K2

