⚡️Traversal: Causal ML and Reinforcement Learning

Latent Space Podcast
Science & Technology | 4 min read | 46 min video
Oct 5, 2025 | 2,410 views

TL;DR

Traversal uses causal ML and RL for AI-driven incident troubleshooting, going beyond traditional observability.

Key Insights

1. Traversal leverages causal ML and reinforcement learning, combined with LLMs and AI agents, to tackle complex incident troubleshooting.

2. Traditional observability tools and current AI applications struggle with the sheer volume and fragmentation of enterprise telemetry data.

3. The core challenge is distinguishing correlation from causation and finding the true root cause among numerous potential symptoms in a vast data haystack.

4. Traversal's approach involves intelligent data collection and context building using AI agents that employ statistical tests alongside semantic understanding.

5. The platform aims to move beyond simple alert automation to provide deeper intelligence for incident resolution and eventually self-healing capabilities.

6. The business model is outcome-focused: it delivers value by reducing investigation time and complexity while staying agnostic to specific data sources.

THE ORIGINS AND VISION OF TRAVERSAL

Anish and Raj, founders of Traversal, bring deep expertise in AI, causal machine learning, and reinforcement learning from academic backgrounds at MIT, Columbia, and Cornell Tech. They started Traversal after watching the rapid advances in LLMs and AI agents and recognizing a need for specialized tools that address complex, long-standing problems in large-scale enterprise systems. They identified incident troubleshooting and root cause analysis as the intersection of their technical expertise and a significant market pain point, with the goal of moving beyond correlation to true causation.

ADDRESSING THE LIMITATIONS OF CURRENT SOLUTIONS

Modern microservice architectures generate massive amounts of fragmented telemetry data, including logs, metrics, traces, code, and communication artifacts. Traditional observability tools and current AI applications often falter when trying to make sense of this data deluge. They struggle to filter signal from noise and to accurately identify the root cause when thousands of potential symptoms, or 'fake needles,' appear simultaneously. Traversal positions itself as a solution designed to navigate this complexity, moving beyond simple symptom correlation to uncover true cause-and-effect relationships.
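To make the correlation-versus-causation problem concrete, here is a minimal simulation sketch. It is not Traversal's method; the metric names (`db_latency`, `checkout_errors`, `search_errors`) and coefficients are invented for illustration. Two symptom metrics both depend on one upstream cause, so they correlate strongly with each other even though neither drives the other. Regressing out the shared cause (a partial correlation) makes the apparent link disappear.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def residuals(y, x):
    """Residuals of the least-squares line y ~ a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return [c - (my + slope * (a - mx)) for a, c in zip(x, y)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical root cause: database latency (standardized units).
db_latency = [rng.gauss(0, 1) for _ in range(2000)]
# Two symptoms that both depend on the root cause, plus independent noise.
checkout_errors = [2.0 * d + rng.gauss(0, 0.5) for d in db_latency]
search_errors = [1.5 * d + rng.gauss(0, 0.5) for d in db_latency]

# Raw correlation between the two symptoms is high...
raw = pearson(checkout_errors, search_errors)
# ...but it vanishes once the shared cause is regressed out of both.
partial = pearson(residuals(checkout_errors, db_latency),
                  residuals(search_errors, db_latency))

print(f"raw correlation between symptoms: {raw:.2f}")
print(f"partial correlation given db_latency: {partial:.2f}")
```

In a real incident the shared cause is not known in advance, which is why the search over candidate causes is the hard part.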

TRAVERSAL'S UNIQUE TECHNICAL APPROACH

Traversal's innovation lies in its integrated use of causal ML, reinforcement learning, and AI agents. The system intelligently collects context by querying multiple data sources (logs, metrics, traces, deployments, etc.) sequentially and adaptively. It employs proprietary statistical tests combined with semantic understanding from LLMs to winnow down the vast search space. This 'semantics meets statistics' framework allows AI agents to dynamically select the best tests to filter information, offering a more robust approach than relying solely on LLMs or deterministic workflows, especially for time-series data.
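The 'semantics meets statistics' idea can be sketched roughly as follows. Traversal's tests are proprietary and are not described in the episode summary, so everything here is an assumption: `shift_score` is a simple standardized mean shift standing in for a real statistical test, and `semantic_score` is a token-overlap stand-in for an LLM relevance judgment. The metric names and values are made up.

```python
from statistics import mean, stdev

def shift_score(series, incident_idx):
    """Standardized mean shift between the pre- and post-incident windows."""
    before, after = series[:incident_idx], series[incident_idx:]
    spread = stdev(before) or 1e-9  # guard against a perfectly flat baseline
    return abs(mean(after) - mean(before)) / spread

def semantic_score(metric_name, description):
    """Hypothetical stand-in for an LLM: share of name parts found in the description."""
    tokens = set(description.lower().split())
    parts = set(metric_name.lower().split("_"))
    return len(tokens & parts) / max(len(parts), 1)

def rank_candidates(metrics, incident_idx, description, threshold=3.0):
    """Filter by the statistical test first, then order by semantic relevance."""
    kept = []
    for name, series in metrics.items():
        score = shift_score(series, incident_idx)
        if score >= threshold:
            kept.append((name, score, semantic_score(name, description)))
    # Most semantically relevant first; ties broken by the size of the shift.
    return sorted(kept, key=lambda row: (row[2], row[1]), reverse=True)

# Illustrative data: the incident starts at index 6.
metrics = {
    "checkout_latency_ms": [100, 102, 99, 101, 100, 98, 180, 190, 185, 188],
    "search_cpu_pct": [40, 41, 39, 40, 42, 40, 41, 40, 39, 41],
    "payments_error_rate": [1, 1, 2, 1, 1, 2, 9, 10, 11, 9],
}
ranked = rank_candidates(metrics, incident_idx=6, description="checkout latency spike")
for name, stat, sem in ranked:
    print(name, round(stat, 1), round(sem, 2))
```

The statistical filter drops `search_cpu_pct`, which did not move, before any semantic scoring is done. That ordering keeps the expensive semantic step off the bulk of the search space.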

THE INCIDENT RESOLUTION WORKFLOW

Users initiate an investigation by providing an approximate time of the incident and a brief description; investigations are often kicked off from an alert or an incident channel. Traversal's AI then autonomously builds the necessary context by querying various observability systems, respecting their rate limits and striving for answers within minutes. Users can provide initial context to accelerate the process, but the system aims to minimize the burden on on-call engineers. The output can range from identifying impacted services and their connections to suggesting specific remediation actions, such as rolling back a commit.
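The workflow described above can be sketched as a small loop: take an approximate time and a description, widen the time into a search window, and query each source while pausing between calls. This is an illustrative sketch only. `InvestigationRequest`, `MinIntervalLimiter`, the window sizes, and the stub sources are all hypothetical names and values, not Traversal's API.

```python
import time
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class InvestigationRequest:
    approx_time: datetime  # roughly when the incident started
    description: str       # short free-text summary from the on-call engineer

def search_window(req, before=timedelta(minutes=30), after=timedelta(minutes=15)):
    """Widen the approximate time into a window (sizes are arbitrary assumptions)."""
    return req.approx_time - before, req.approx_time + after

class MinIntervalLimiter:
    """Enforce a minimum gap between calls; clock and sleep are injectable for tests."""
    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self.clock = clock
        self.sleep = sleep
        self._last = None

    def wait(self):
        if self._last is not None:
            remaining = self.min_interval_s - (self.clock() - self._last)
            if remaining > 0:
                self.sleep(remaining)
        self._last = self.clock()

def collect_context(req, sources, limiter):
    """Query each source over the same window, pausing between calls."""
    start, end = search_window(req)
    context = {}
    for name, query in sources.items():
        limiter.wait()
        context[name] = query(start, end)
    return context

# Stub sources standing in for real observability backends.
sources = {
    "logs": lambda s, e: f"logs {s:%H:%M}-{e:%H:%M}",
    "metrics": lambda s, e: f"metrics {s:%H:%M}-{e:%H:%M}",
}
req = InvestigationRequest(datetime(2025, 10, 5, 14, 0), "checkout latency spike")
context = collect_context(req, sources, MinIntervalLimiter(0.01))
print(context)
```

A production system would more likely use per-source limits and an adaptive query order; the fixed loop here only shows where rate limiting sits in the flow.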

EVOLVING TOWARDS SELF-HEALING AND FUTURE CAPABILITIES

Traversal is progressing towards self-healing capabilities, starting with issues that can be reliably resolved by identifying the root cause and executing pre-existing automation scripts. The system currently handles around 10-20% of incidents autonomously (akin to L1/L2 engineer capabilities), and the founders anticipate that within 6-12 months more complex issues, which today require validation by a senior engineer, will also become confidently addressable. Long-term, they envision AI agents that can orchestrate or even refactor codebases, though this 'full agentic system' for code rewriting is likely several years away and will require verification capabilities beyond today's unit testing.
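One way to picture the first stage of self-healing is a gate that runs automation only when a pre-existing script matches the diagnosed root cause and the diagnosis is confident enough. The sketch below is an assumption about how such a gate could look, not a description of Traversal's product; the runbook names and the 0.9 threshold are hypothetical.

```python
# Hypothetical mapping from diagnosed root cause to a pre-existing automation script.
RUNBOOKS = {
    "bad_deploy": "rollback_last_deploy",
    "disk_full": "rotate_logs",
}

def decide(root_cause, confidence, auto_threshold=0.9):
    """Return (mode, action) for a diagnosed incident.

    - no matching runbook          -> escalate to a human
    - runbook and high confidence  -> run the automation
    - runbook but lower confidence -> suggest the action for human approval
    """
    action = RUNBOOKS.get(root_cause)
    if action is None:
        return ("escalate", None)
    if confidence >= auto_threshold:
        return ("auto_execute", action)
    return ("suggest", action)

print(decide("bad_deploy", 0.95))
print(decide("bad_deploy", 0.60))
print(decide("novel_race_condition", 0.99))
```

Raising coverage beyond the easy cases then amounts to adding runbooks and earning enough trust in the diagnosis to lower the threshold.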

BUSINESS MODEL AND MARKET POSITIONING

Traversal's business model centers on delivering an outcome (faster and more accurate incident investigation) rather than on data storage, which is the common basis for pricing in the observability market. The company positions itself as a neutral 'Switzerland,' agnostic to the specific observability tools customers use, which makes it useful in fragmented enterprise environments. Pricing is based on a blend of infrastructure complexity and the number of investigations, with a planned shift towards more outcome-based pricing as self-healing capabilities mature and become more consistently reliable.

THE AI CHALLENGES AND BENCHMARKING

The core AI challenges include managing massive contextual data that exceeds LLM token limits and handling the non-deterministic nature of complex incidents, which lack playbooks. Traversal emphasizes the importance of rigorous evaluation (eval) pipelines and internal benchmarking, which form a significant part of their intellectual property. They are actively refining their use of models: they find reasoning models critical and observe shifts in performance between OpenAI, Anthropic, and Google's offerings, with a particular focus on tool-calling and 'unstucking' capabilities (a model's ability to recover when an investigation stalls). Building high-fidelity simulation environments for evaluation remains difficult, underscoring the need for real-world production data and historical incidents.
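An eval pipeline over historical incidents can be as simple as replaying past incidents with known root causes and checking whether the system's ranked guesses contain the right answer. The sketch below shows one common metric, top-k accuracy. The incident IDs, labels, and predictions are fabricated placeholders for illustration, and nothing here reflects Traversal's internal benchmarks.

```python
def top_k_accuracy(ranked_predictions, true_root_causes, k):
    """Fraction of incidents whose known root cause is among the top-k ranked guesses."""
    hits = 0
    for incident_id, truth in true_root_causes.items():
        guesses = ranked_predictions.get(incident_id, [])
        if truth in guesses[:k]:
            hits += 1
    return hits / len(true_root_causes)

# Placeholder labels from (hypothetical) postmortems of historical incidents.
true_root_causes = {
    "inc-001": "bad_deploy",
    "inc-002": "db_failover",
    "inc-003": "cert_expiry",
    "inc-004": "dns_misconfig",
}
# Placeholder ranked outputs from the system under evaluation.
ranked_predictions = {
    "inc-001": ["bad_deploy", "db_failover"],
    "inc-002": ["cache_eviction", "db_failover", "bad_deploy"],
    "inc-003": ["cert_expiry"],
    "inc-004": ["bad_deploy", "cache_eviction", "db_failover"],
}

top1 = top_k_accuracy(ranked_predictions, true_root_causes, k=1)
top3 = top_k_accuracy(ranked_predictions, true_root_causes, k=3)
print(f"top-1: {top1:.2f}, top-3: {top3:.2f}")
```

Re-running the same replay set against each new model release is one way to observe the performance shifts between providers that the founders describe.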

Common Questions

Traversal is a company leveraging causal ML and reinforcement learning to address complex software maintenance and incident response. It aims to significantly improve the speed and accuracy of troubleshooting by intelligently searching through vast amounts of system data.
