AI Dev 25 x NYC | Alex Ker: How Open Source Models Actually Run AI Coding at Scale
Key Moments
Open source AI coding models rival closed-source, offering speed, cost, and control.
Key Insights
Open source AI coding models are rapidly closing the performance gap with closed-source alternatives like GPT-5 and Claude.
Key advantages of open source models include lower latency, improved reliability at scale, and significant cost reductions for production deployments.
Specialized open source models like Qwen3-Coder and Kimi K2 excel at specific tasks, with Kimi K2 demonstrating advanced tool-use capabilities through 'interleaved thinking'.
Developers can integrate open source models into their workflows using tools like OpenRouter, Cline, and the Vercel AI SDK, with options ranging from simple API rerouting to dedicated IDEs.
Optimizing open source models for specific use cases, such as code autocompletion, requires techniques like KV caching, KV-aware routing, and n-gram speculation to minimize latency.
For production-scale deployments, dedicated infrastructure and fine-tuning open source models offer greater control, performance, and cost-efficiency compared to shared endpoints.
THE EVOLVING LANDSCAPE OF AI CODING MODELS
The AI development landscape is shifting, with open source models increasingly challenging the dominance of closed-source giants like GPT-5 and Claude. While closed-source models were historically favored for their intelligence, the quality gap is narrowing considerably. Recent releases, such as Kimi K2, benchmark competitively with leading proprietary models, signaling a new era where open source alternatives are viable, and often superior, for production applications.
ADVANTAGES OF OPEN SOURCE MODELS
Open source models offer distinct advantages that are crucial for scalable AI applications. Firstly, latency is significantly improved, leading to a more responsive user experience by reducing the time to first token. Secondly, reliability is enhanced, ensuring consistent performance as user traffic grows. Finally, and critically for production, open source models offer substantial cost savings, making AI economically feasible at scale. These factors are essential for keeping pace with the rapid adoption of AI products.
KEY OPEN SOURCE MODELS FOR CODING
Several open source models are at the forefront of AI coding. GLM 4.6 provides strong general reasoning and is more efficient than its predecessor. Qwen3-Coder, a specialist coding model from Alibaba, remains a solid option for prototyping and repetitive programming tasks. The most exciting is Kimi K2, a trillion-parameter model that leads benchmarks and demonstrates advanced tool use. Kimi K2's 'interleaved thinking' mimics human problem-solving by reflecting and adjusting its approach after each action, a significant improvement over traditional chain-of-thought methods.
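Conceptually, an interleaved-thinking agent alternates generation with tool execution, feeding each observation back to the model before the next step rather than planning the whole chain up front. A minimal sketch in Python, where the `Step`/`ToolCall` interface is a hypothetical stand-in, not Kimi K2's actual API:

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    args: dict

@dataclass
class Step:
    text: str            # the model's visible reasoning for this step
    tool_call: object    # a ToolCall, or None when the model is done

def run_agent(model, tools, task, max_steps=8):
    """Interleaved loop: generate -> act -> observe -> reflect -> repeat."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = model(history)  # each step may carry a thought plus a tool call
        history.append({"role": "assistant", "content": step.text})
        if step.tool_call is None:  # model decided no more actions are needed
            return step.text
        result = tools[step.tool_call.name](**step.tool_call.args)
        # Feed the observation back so the model can reflect before acting again.
        history.append({"role": "tool", "content": str(result)})
    return None  # step budget exhausted
```

The key contrast with plain chain-of-thought is that the thinking is distributed across the loop: each tool result can change the plan, instead of one long reasoning pass followed by blind execution.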
INTEGRATING OPEN SOURCE MODELS INTO WORKFLOWS
Adopting open source models into existing development workflows is increasingly accessible. Simple methods include rerouting API requests from familiar CLIs like Claude Code to open source endpoints, which can drastically reduce costs and latency. More comprehensive options include unified platforms like OpenRouter, which offers access to numerous models with fallback capabilities. Frameworks such as the Vercel AI SDK and tools like LangChain and LlamaIndex also provide robust integrations for building AI-powered applications.
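As a sketch of the rerouting idea, the snippet below sends an OpenAI-style chat completion to OpenRouter's OpenAI-compatible endpoint using only the standard library. The model slug `moonshotai/kimi-k2` and the `OPENROUTER_API_KEY` variable name are assumptions to check against OpenRouter's current documentation:

```python
import json
import os
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(model, messages, api_key):
    """Build an OpenAI-style chat completion request against OpenRouter."""
    payload = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        OPENROUTER_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

if __name__ == "__main__" and os.getenv("OPENROUTER_API_KEY"):
    req = build_request(
        "moonshotai/kimi-k2",  # assumed slug; see OpenRouter's model list
        [{"role": "user", "content": "Write a binary search in Python."}],
        os.environ["OPENROUTER_API_KEY"],
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

The same pattern works with the official `openai` client by pointing its `base_url` at OpenRouter's API root, which is what makes swapping a closed-source model for an open one largely a configuration change.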
ADVANCED TOOLS FOR OPEN SOURCE AI DEVELOPMENT
For a more integrated experience, IDEs like Cline offer a bring-your-own-key setup and separate agent modes for planning and acting, which simplifies managing context windows and conversation history. Baseten optimizes inference for a range of open source coding agents. This optimization is crucial for applications like autocomplete, where a fast time to first token and efficient handling of long contexts with short decodes are paramount to a seamless developer experience that keeps pace with user activity.
OPTIMIZATION TECHNIQUES FOR HIGH-PERFORMANCE INFERENCE
Achieving low latency and high throughput for AI coding applications requires specialized optimization techniques. For autocompletion, the crucial metrics are a sub-300ms time-to-first-token and efficient handling of the long prefill (ingesting code) and short decode (generating the completion) phases. Techniques like KV caching (reusing computed key-value pairs), KV-aware routing (directing users to servers that already hold the cache for their ongoing conversation), and n-gram speculation (using a dictionary of common code patterns to propose draft tokens) significantly speed up inference, as demonstrated with Sourcegraph's Amp Tab.
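The n-gram speculation step can be illustrated with a toy drafter over token lists: it looks for the context's trailing n-gram earlier in the context and proposes the tokens that followed it as draft tokens, which the target model then verifies in a single batched pass. This is a simplified sketch of the general technique (sometimes called prompt-lookup decoding), not Baseten's implementation:

```python
def draft_ngram(context, n=3, k=5):
    """Propose up to k draft tokens by matching the trailing n-gram
    against an earlier occurrence in the context."""
    if len(context) <= n:
        return []
    tail = context[-n:]
    # Scan right-to-left so the most recent earlier match is reused first.
    for i in range(len(context) - n - 1, -1, -1):
        if context[i:i + n] == tail:
            return context[i + n:i + n + k]  # tokens that followed it last time
    return []
```

Because code is highly repetitive (imports, signatures, boilerplate), such drafts are accepted often, cutting the number of full forward passes needed during decode.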
PRODUCTION DEPLOYMENT STRATEGIES
Deploying open source models at scale involves dedicated infrastructure, often referred to as 'dedicated deployments.' This approach, supported by Baseten across multiple cloud providers, segments customer traffic onto private instances, bypassing the limitations of shared endpoints. This allows for greater control, optimized performance, and cost-efficiency. Open source models also benefit from fine-tuning, enabling developers to tailor them to specific use cases and further enhance their effectiveness in production environments.
THE FUTURE AND KEY TAKEAWAYS FOR DEVELOPERS
Developers utilizing only closed-source models are missing out on significant advancements and cost efficiencies. The open-source AI ecosystem is rich with tooling and models that are rapidly maturing. The key takeaway is to experiment with these models and tools, not limiting oneself to a single solution. For ML engineers focused on user experience, prioritizing performance, reliability, and control in production is essential for building successful AI applications. Connecting with communities and exploring new models will be critical.
Common Questions
What are the limitations of closed-source models like GPT-5 and Claude?
Closed-source models like GPT-5 and Claude, while intelligent, can present limitations for production use: higher latency, reliability issues at scale, and higher costs compared to open-source alternatives.
Mentioned in this video
Stanford Institute for Human-Centered Artificial Intelligence, where Alex Ker contributed as an editor.
Previous iteration of GLM, mentioned for comparison with GLM 4.6's efficiency.
An open-source coding agent that Baseten helps power.
Amp Tab: Sourcegraph's autocomplete product, optimized by Baseten to achieve 2x higher speed.
Alex Ker: growth software engineer at Baseten and speaker at the event, discussing open source AI models for coding.
Baseten: company where Alex Ker works, focused on helping developers build better AI applications and on optimizing AI inference.
Previous employer of Alex Ker, where he worked on reinforcement learning infrastructure.
GLM 4.6: an open-source model focused on general reasoning, with stellar performance and 30% more efficiency than its predecessor.
Vercel AI SDK: an integration option for AI coding workflows, suitable for production and for powering Next.js web apps.
Company where the questioner, Yasha Na, works as a data scientist.
Qwen3-Coder: an open-source specialist coding model from Alibaba, suitable for prototyping or repetitive programming tasks.
Cline: a popular IDE with over 2 million developers, featuring a bring-your-own-key setup and segmented plan/act modes.
Previous employer of Alex Ker, where he built ML pipelines.
An AI incubator founded by Alex Ker during college.
An integration option for using frontier open-source models.