Key Moments
The Agent-Native Cloud: 3M Users, 100K Signups/Wk, Data Centers, & Death PRs — Jake Cooper, Railway
Railway is building its own cloud infrastructure from bare metal to cut costs and optimize for AI agents, aiming for near-instant deployment and "agent-native" cloud experiences.
Key Insights
Railway is migrating the vast majority of its workloads to its own bare metal data centers; the owned hardware pays for itself in approximately three months compared with renting cloud capacity.
The company is experiencing rapid growth, with around 100,000 new users signing up per week, contributing to a total user base of 3 million.
Jake Cooper, Railway's founder, has been actively modifying the Linux kernel to improve performance and cost efficiency for their infrastructure.
Railway aims to create an "agent-native" cloud, where AI agents, not just humans, can easily deploy, manage, and iterate on software.
The company's lean team of 35 people serves 3 million users, emphasizing building systems over expanding headcount.
Temporal, a workflow orchestration tool, is used extensively by Railway but requires significant expertise, leading them to consider building their own alternative.
From Cloud to Bare Metal: A Quest for Efficiency
Railway is fundamentally reimagining cloud infrastructure by moving aggressively toward owning its physical hardware. Jake Cooper, the founder, revealed that the company now runs the vast majority of its workloads in its own bare metal data centers. The shift is driven by cost: owned hardware pays for itself in roughly three months, a stark contrast to the continuous expense of cloud rental. The company has built its own data centers in locations like Singapore and plans to expand further. The move even involves hands-on kernel development, with Cooper admitting to submitting kernel patches the week of the recording to optimize the storage layer for agentic workloads. This investment in infrastructure is not just about cost savings; it's about gaining the granular control needed to build for the demands of AI agents.
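To make the three-month payback figure concrete, the arithmetic can be sketched as below. The dollar amounts are hypothetical placeholders, chosen only so the result lands on the roughly three-month figure quoted in the episode; they are not Railway's numbers, and the function name is an assumption for illustration.

```python
def payback_months(hardware_cost: float,
                   monthly_cloud_cost: float,
                   monthly_owned_opex: float) -> float:
    """Months until owned hardware has paid for itself versus renting."""
    monthly_savings = monthly_cloud_cost - monthly_owned_opex
    if monthly_savings <= 0:
        raise ValueError("owning never pays back if it is not cheaper per month")
    return hardware_cost / monthly_savings


# Hypothetical figures, for illustration only.
months = payback_months(
    hardware_cost=300_000,       # one-time purchase of servers
    monthly_cloud_cost=120_000,  # what the same capacity would cost to rent
    monthly_owned_opex=20_000,   # power, colocation, and staff for owned metal
)
print(f"payback: {months:.1f} months")
```

The shorter the payback period, the sooner every additional month of operation is pure savings relative to renting, which is the contrast the episode draws with ongoing cloud bills.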
Rapid Growth and the 'Internet is a Horrible Place' Problem
Railway is experiencing explosive growth, adding approximately 100,000 new users per week and bringing its total user base to 3 million. This rapid expansion isn't without its challenges. Cooper highlighted the difficulties of operating an open platform on the internet, describing it as a "horrible place" filled with crypto miners and other malicious actors. This reality forced Railway through alternating phases. In periods of expansion, the company focused on reaching as many users as possible, which cost about half a million dollars a month in losses during its free tier era. In the periods of compaction that followed, it stripped away features or user segments that didn't align with its core target audience, focusing on sustainable business operations with a lean team of 35 people. This balancing act between growth and sustainability is crucial for the company's long-term vision.
The Emergence of Agent-Native Infrastructure
The core thesis behind Railway's infrastructure development is its "agent-native" nature. Cooper believes the next era of software infrastructure isn't just an incremental upgrade on existing models like Heroku; it's about building specifically for AI agents. These agents require capabilities that differ subtly from human users: an enhanced need for version control, incremental testing (akin to feature flags), and deep observability to understand execution paths. Railway's existing features, like environment cloning and forking, are seen as foundational primitives that agents can leverage. The company is prioritizing agentic capabilities as a top-of-funnel initiative, recognizing that the fundamental shift from coding to commanding agents will necessitate new deployment loops, potentially moving beyond traditional Git and CI/CD pipelines. This vision requires systems that can handle workloads at a vastly compressed scale, enabling thousands of agents to operate in parallel.
Rethinking Git and Version Control for Agents
Cooper posited that Git, while revolutionary, has a fundamental limitation in how it handles versioning, often creating 'broken pointers' where cloning loses the upstream context. He mused about what a more continuous or percentage-based versioning system would look like, allowing for streams of changes to be traversed. This is particularly relevant for agentic workflows, where incremental changes and safe rollouts are paramount. The goal is to reach a state where agents can progressively release changes, with high-impact users like major enterprises being the last to receive updates, thereby minimizing risk. This non-deterministic version control, where progress is measured by percentages rather than discrete merges, could fundamentally alter how software is updated and maintained.
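One way to picture a percentage-based rollout in which large enterprises receive changes last is the sketch below. It is an illustration under stated assumptions, not Railway's mechanism: the tier names, the ramp windows in TIER_WINDOWS, and the function names are all hypothetical. Each account gets a stable hash bucket, and each tier only starts ramping once the global rollout percentage enters its window.

```python
import hashlib

# Hypothetical tiers: the window of the global rollout percentage
# over which each tier ramps from 0% to 100% of its accounts.
TIER_WINDOWS = {
    "hobby": (0, 50),
    "pro": (30, 80),
    "enterprise": (80, 100),  # highest-impact users go last
}


def bucket(account_id: str, change_id: str) -> float:
    """Stable pseudo-random value in [0, 1) for this account and change."""
    digest = hashlib.sha256(f"{change_id}:{account_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") / 2**32


def tier_fraction(tier: str, rollout_percent: float) -> float:
    """Fraction of a tier's accounts that should have the change."""
    start, end = TIER_WINDOWS[tier]
    return min(1.0, max(0.0, (rollout_percent - start) / (end - start)))


def receives_change(account_id: str, tier: str,
                    change_id: str, rollout_percent: float) -> bool:
    return bucket(account_id, change_id) < tier_fraction(tier, rollout_percent)
```

Because an account's bucket never changes and each tier's fraction only grows with the rollout percentage, an account that has received a change keeps it as the rollout advances, and rolling back is a matter of lowering the percentage.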
Data Centers, Hardware Appreciation, and the Compute Crunch
The move to bare metal includes building and operating dedicated data centers, a significant undertaking. Cooper noted the surprising appreciation in hardware value, stating that the money raised in their last round was less than the bank balance plus the value of their servers, partly due to rising RAM prices. This hardware appreciation is occurring against a backdrop of a 'compute crunch,' where hyperscalers are making massive capital expenditures, yet demand, especially for AI, is outstripping supply. Railway works directly with OEMs and resellers to secure hardware, navigating supply chain complexities. The cost-effectiveness of their own metal infrastructure is crucial for enabling the parallel agent execution envisioned for the future, where compute costs could otherwise become prohibitively expensive.
Beyond Kubernetes: Primitives and Control
Railway deliberately avoids using Kubernetes, opting for a higher order of control over their infrastructure primitives: network, compute, and storage. This control is essential for efficiently placing workloads with extreme precision, which is critical for dense agent execution and memory reuse to manage costs. Building their own metal unlocks performance and cost advantages, making operations like running thousands of agents in parallel economically viable. This approach contrasts with the complexity and abstraction layers of solutions like Kubernetes, allowing Railway to optimize from the ground up for their specific agent-native vision, even to the extent of modifying OS layers and kernel patches.
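As a rough illustration of what dense workload placement involves, the sketch below packs workloads onto hosts by memory using first-fit decreasing, a standard bin-packing heuristic. This is not Railway's scheduler, which the episode does not describe in detail; the host names, workload names, and memory sizes are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    free_mem_mb: int
    workloads: list = field(default_factory=list)


def place(workloads: dict[str, int], hosts: list[Host]) -> dict[str, str]:
    """First-fit decreasing: take the largest workloads first and put
    each one on the first host that still has enough free memory."""
    placement = {}
    by_size = sorted(workloads.items(), key=lambda kv: kv[1], reverse=True)
    for workload, mem_mb in by_size:
        for host in hosts:
            if host.free_mem_mb >= mem_mb:
                host.free_mem_mb -= mem_mb
                host.workloads.append(workload)
                placement[workload] = host.name
                break
        else:
            raise RuntimeError(f"no host has {mem_mb} MB free for {workload}")
    return placement


# Hypothetical hosts and agent workloads (memory in MB).
hosts = [Host("metal-a", 1024), Host("metal-b", 1024)]
agents = {"agent-1": 600, "agent-2": 500, "agent-3": 400, "agent-4": 300}
placement = place(agents, hosts)
print(placement)
```

A real placement system would weigh CPU, network locality, and storage as well as memory; the point here is only that tighter packing means fewer machines for the same number of agents.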
Temporal's Promise and Pitfalls
Temporal, a powerful workflow orchestration tool, has been a significant part of Railway's stack. Its predecessor, Cadence, powered complex operations like ride-sharing trip management at Uber. Cooper described Temporal as a "jet engine": incredibly powerful when understood and operated correctly, but demanding a deep understanding of its deterministic workflow history. He highlighted the risk of non-determinism issues arising from complex state management within Temporal, which could lead to production incidents. While acknowledging Temporal's theoretical power for agentic tasks requiring discrete, completed workflows, Railway has started building internal alternatives, recognizing the steep learning curve and the potential for complexity to become a bottleneck. The company is also considering emerging platforms like Restate.
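The determinism requirement is easier to see in a toy model. The sketch below is not the Temporal SDK and uses none of its APIs; Context, NonDeterminismError, and both workflows are hypothetical names invented for illustration. It records activity results on a first run and, on replay, insists that the workflow requests the same activities in the same order.

```python
class NonDeterminismError(Exception):
    pass


class Context:
    """Records activity results on a first run; on replay, returns the
    recorded results and checks that the calls match the history."""

    def __init__(self, history=None):
        self.replaying = history is not None
        self.history = list(history or [])  # [(activity_name, result), ...]
        self.position = 0

    def run_activity(self, name, fn):
        if self.replaying:
            # Toy simplification: the whole run must match recorded history.
            if (self.position >= len(self.history)
                    or self.history[self.position][0] != name):
                raise NonDeterminismError(f"history does not match call to {name!r}")
            result = self.history[self.position][1]
        else:
            result = fn()
            self.history.append((name, result))
        self.position += 1
        return result


def deterministic_workflow(ctx):
    build = ctx.run_activity("build", lambda: "image-123")
    return ctx.run_activity("deploy", lambda: f"deployed {build}")


def flaky_workflow(ctx, coin):
    # Branching on a value that is not part of the recorded history
    # is what makes a replay diverge from the original run.
    if coin():
        ctx.run_activity("build", lambda: "image-123")
    return ctx.run_activity("deploy", lambda: "deployed")


first_run = Context()
result = deterministic_workflow(first_run)
replayed = deterministic_workflow(Context(first_run.history))

flaky_run = Context()
flaky_workflow(flaky_run, coin=lambda: True)
try:
    flaky_workflow(Context(flaky_run.history), coin=lambda: False)
    diverged = False
except NonDeterminismError:
    diverged = True
```

The deterministic workflow replays to the same result without re-running its activities, while the flaky one fails as soon as its second run takes a different branch. That class of failure is what makes operating such a system demanding.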
The Evolving SDLC: Prompt Requests and the Death of PRs
The future of the Software Development Life Cycle (SDLC) is being reshaped by AI and agents. Cooper predicts the decline of traditional Pull Requests (PRs) in favor of 'Prompt Requests,' where agents interpret and execute prompts. He also suggests traditional code review might become less important if robust systems for validation and testing are in place. This shift emphasizes speed, concurrency, and the ability to iterate rapidly in production. Companies are seeking ways to compress the SDLC, making it dramatically faster and more efficient, leading to a new paradigm where infrastructure and development tools are tightly integrated. The idea of 'cattle not pets,' where infrastructure is treated as disposable, might even evolve towards an era where stateful systems can be cloned and iterated upon with the same ease as stateless ones, thanks to advanced snapshotting and lazy-loading capabilities.
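The idea of cloning stateful systems as cheaply as stateless ones can be sketched with copy-on-write layers. The example below is a minimal in-memory illustration and assumes nothing about Railway's storage layer; Layer, Volume, and the block keys are hypothetical. A fork freezes the writes made so far into a shared read-only layer, and reads fall through to parent layers lazily instead of copying data up front.

```python
class Layer:
    """An immutable set of blocks stacked on an optional parent layer."""

    def __init__(self, blocks, parent=None):
        self.blocks = dict(blocks)
        self.parent = parent

    def read(self, key):
        layer = self
        while layer is not None:
            if key in layer.blocks:
                return layer.blocks[key]
            layer = layer.parent
        raise KeyError(key)


class Volume:
    def __init__(self, base=None):
        self.base = base   # frozen Layer shared with forks, or None
        self.writes = {}   # private, mutable changes since the last fork

    def write(self, key, value):
        self.writes[key] = value

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        if self.base is None:
            raise KeyError(key)
        return self.base.read(key)

    def fork(self):
        """Freeze current writes into a shared layer and return a clone.
        Only the changes since the last fork are copied, not the whole state."""
        self.base = Layer(self.writes, self.base)
        self.writes = {}
        return Volume(self.base)


prod = Volume()
prod.write("users", "v1")
staging = prod.fork()
staging.write("users", "v2-experiment")
prod.write("orders", "o1")
```

After the fork, prod and staging share the frozen "users" block until one of them overwrites it, and later writes on either side stay invisible to the other.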
Common Questions
Railway is described as the easiest way to ship anything. It allows users to deploy instances, repositories, and code through conversational interfaces like Claude or a visual canvas, aiming to simplify the evolution of applications over time.
Mentioned in this video
Mentioned as a communication platform, compared unfavorably to an internal tool for aggregating context and routing feedback.
An internal tool built by Railway to aggregate all user feedback, customer support, and incidents, clustering them to determine impact and route discussions.
Discussed in the context of its discrete model of making changes and merging upstream, and the potential for a more non-deterministic, percentage-based stream of changes.
Mentioned as a package manager that has a feature to define rules about not taking new packages.
Used as a comparison point to illustrate the massive scale of capital expenditure in the hyperscaler industry, with hyperscalers spending more than the Manhattan Project on capital expenditures.
Mentioned as a cloud provider that Railway maintains a presence with for bursting workloads, and with whom they work to acquire compute.
Mentioned as a cloud provider that Railway maintains a presence with for bursting workloads, and with whom they work to acquire compute.
Accidentally caught in the crossfire of a tweet about compute acquisition difficulties, indicating a potential issue with their services related to cache invalidation.
Mentioned as an AI model that users can talk to on Railway to deploy services.
Cited as an example of existing tooling that contributes to the 'stacking entropy' of application development, making it complex to manage environments.
Referred to as an example of complex tooling that Railway avoids in favor of higher-order control for placing workloads in specific locations.
Its recent deprecation is discussed as a significant event in the developer tooling space, with Railway seen as a potential successor.
The system behind Nixpacks, discussed for its potential benefits and also its drawbacks related to image size and complexity for real-world workloads.
Meta's internal serverless system, mentioned in the context of scaling difficulties with systems like Nix.
Part of the underlying infrastructure stack for Railway, related to content-addressable file systems.
Acquired by OpenAI, its technology is described as being related to routing and flagging through different models.
Mentioned as a stage in the evolution of programming languages, and as a language used by Railway.
Mentioned as an example of a model that uses the technology acquired by OpenAI (Statsig) for routing and flagging.
Discussed as a critical interface for agents, benefiting from numerous arguments and flags, and serving as a mechanism for telemetry to improve loop closing efficiency.
Mentioned as a stage in the evolution of programming languages.
Mentioned as a system that people have been trying to get the speaker to adopt for years.
Mentioned in the context of Jake Cooper's previous work with distributed systems and trip actions powered by Cadence, and their internal systems for managing complex workflows.
Referred to as having an 'original sin' of being a series of broken pointers, making it difficult to modify small pieces of code without losing upstream context.
Used as an example of a large entity that should be among the last to receive patches due to the critical nature of their operations.
Mentioned as a hardware provider that Railway works with directly to source servers.
Mentioned as a hardware provider that Railway works with directly to source servers.
Criticized for lacking nuance in discussions, particularly regarding venture debt and the importance of specific tools.
A venture capital firm whose partners, John and Jordan, provided advice to Jake Cooper as he scaled Railway.
A venture capital firm from which Railway has raised funds, indicating a focus on enterprise markets.
A venture capital firm from which Railway has raised funds, indicating a focus on enterprise markets.
Mentioned as a networking component that might need to be surpassed by better solutions as workloads become more compressed.
Mentioned as an AI model that can interact with the CLI, benefiting from a large number of arguments and flags.
Mentioned for its 'Fluid Compute' product, which is compared to Railway's approach to modern serverless and stateful workflows.
Mentioned for its container offering, contributing to the evolving landscape of modern serverless and stateful workflows.
Mentioned for its AppRunner service in the context of modern serverless and stateful workflow offerings.
The parent company of Heroku, discussed in the context of why Heroku may have stagnated due to a lack of focus on its core business.
Referred to in the context of its early days and focus strategy, and also for the paper on its internal serverless system (XFaaS).
Mentioned as a company doing interesting work in the workflow management space.
Acquired Statsig, a company involved in routing and flagging through different models like GPT-5.
Mentioned as a company that Anthropic is competing with in the design space.
Praised for its theoretical capabilities in managing complex workflows and agentic journeys, but noted for its steep learning curve and potential for non-determinism issues.
Mentioned as a well-known feature flagging product, contrasted with an earlier, simpler version built by the speaker.
Mentioned as having a system called 'Gatekeeper' related to feature flagging and incremental rollouts.
Described as an amazing and stellar company moving into various zones, including competing with Figma's space.