Key Moments
AI Security After Codex and Claude Code — Zico Kolter & Matt Fredrikson, Gray Swan
Want to know something specific about what's covered?
We've already dissected every moment. Ask and we will deliver (with timestamps).
Key Moments
AI agents can now write code and browse the web autonomously, but human red teamers struggle to find vulnerabilities that automated systems like 'Shade' can easily exploit, posing a significant future security risk.
Key Insights
Automated red teaming models, like Gray Swan's 'Shade,' are now surpassing human red teamers in their ability to find vulnerabilities in large language models.
AI security is distinct from traditional cybersecurity; AI systems have inherent vulnerabilities that can be exploited, necessitating a different security mindset.
The "lethal trifecta" for prompt injection attacks involves ingesting untrusted data, accessing private information, and the ability to exfiltrate it.
While capability scaling in models is rapid, safety and robustness do not scale naively with size and often require explicit training.
Gray Swan's 'Signal' is a custom-trained filter model designed to sit between users, LLMs, and tool calls to detect and prevent policy violations.
AI agents are increasingly being adopted by enterprises, shifting the focus of AI security from frontier labs to broader commercial applications.
Automated red teaming surpasses human capabilities
The discussion highlights a significant shift in AI security: automated red teaming systems are now outperforming human experts in discovering vulnerabilities within AI models. Gray Swan's system, named 'Shade,' has demonstrated superior ability in breaking models compared to human red teamers. This advancement is crucial because as AI agents become more powerful—capable of writing code, browsing the web, and acting on behalf of users—they introduce new attack surfaces. The effectiveness of automated tools like Shade suggests that traditional human-led security testing may soon become insufficient. The challenge lies in the fact that these automated attackers can identify flaws that are inherently 'out of distribution' for the models they are testing, bypassing normal behaviors. While not yet at fully 'superhuman' levels, the speed and efficacy of automated techniques in finding breaks within a given timeframe are notable.
AI security demands a new mindset
AI security is fundamentally different from traditional cybersecurity. While AI can be used to enhance traditional security measures, the AI systems themselves introduce unique vulnerabilities. These systems can be 'tricked,' similar to how humans can be deceived, requiring a distinct security approach. A critical concern is the potential for correlated failures, where a few widely used models (like Codex and Claude Code) harbor vulnerabilities that can be exploited across many applications. This creates a new class of exploits. The prevailing mindset is that AI models should be treated as inherently untrusted entities, and Gray Swan focuses on understanding and mitigating the risks introduced by adopting and deploying AI, rather than simply using AI to improve existing cyber infrastructure.
The 'lethal trifecta' of prompt injection
Prompt injection, a key concern in AI security, arises from a combination of three elements, termed the 'lethal trifecta.' First, there must be the ability to ingest external data from untrusted sources, such as web content or user inputs. Second, the AI agent must have access to private or sensitive internal information that would be valuable to an attacker. Finally, the agent needs the capability to exfiltrate this information and send it to an external destination. These three components together create a significant risk. The discussion draws a parallel to traditional software vulnerabilities, noting that just as software is used productively despite known vulnerabilities, AI will likely be deployed despite its own security risks. The goal is not necessarily to achieve perfect, provable mitigation—which is akin to achieving zero-bug software—but to significantly improve the usability-security trade-off.
Robustness and safety are not inherent in scale
A key finding is that simply increasing the size of AI models does not automatically improve their safety or robustness against attacks. Unlike many other capabilities that scale with model size, safety features require explicit training. Models are not inherently better at resisting jailbreaks or adversarial prompts just because they are larger. Specialized models need to be trained for tasks like red teaming to achieve high performance. This highlights the importance of targeted training and defense mechanisms, such as Gray Swan's 'Signal' model, which is specifically trained to detect policy violations. The relationship between model capability and attack success rate is not strongly correlated; larger, more capable models can still be highly vulnerable to specific types of attacks.
Gray Swan's dual approach: Red teaming and defense
Gray Swan operates on two main fronts: adversarial testing (red teaming) and building defensive systems. Their red teaming efforts include running a community of human red teamers through prize challenges to discover vulnerabilities, as well as employing automated red teaming systems like 'Shade.' On the defense side, they have developed 'Signal,' a filter model that acts as a protective layer between users, LLMs, and tool calls. Signal is designed to identify and prevent policy violations, working best when custom-trained for specific enterprise needs. This dual approach allows them to not only find weaknesses but also to train their defensive models effectively, creating a continuous improvement loop. The effectiveness of Signal is linked to their ability to rigorously test it against the very vulnerabilities they aim to prevent.
The Human Browser Agent Robustness Challenge
A compelling experiment, the Human Browser Agent Robustness Challenge, pitted human participants against AI agents in completing web-based tasks. Red teamers could choose to either attempt to 'fish' a human or 'prompt inject' an AI browser agent. Surprisingly, humans were ranked relatively low in robustness, with skilled human red teamers achieving 60-70% success in fishing human participants. While some AI models showed strong resistance, many were easily prompt-injected. This challenge underscored that AI and humans fall prey to different types of vulnerabilities. Humans might resist obvious phishing attempts, but advanced AI models can still be susceptible to sophisticated prompt injections, while humans can be easily deceived by social engineering tactics that might not fool an AI.
The evolving role of agents in science and enterprise
The conversation touches on how advancements in AI agents, particularly coding agents, could revolutionize scientific research, specifically in the field of interpretability (mechanistic interpretability or 'mechan'). Agents can potentially automate the process of hypothesis testing and experimentation, turning these disciplines into more robust sciences. For enterprises, the trend is clear: AI agents like Codex and Claude are moving beyond frontier labs and becoming integral to operations. This widespread adoption necessitates robust security measures. Gray Swan's Series A funding is largely aimed at scaling their technology for enterprise deployment, recognizing that organizations adopting these tools need solutions to ensure safety and policy compliance, often proactively seeking these solutions before incidents occur.
The future of AI security: Automation and specialized solutions
Looking ahead, the future of AI security will likely involve greater automation and specialization. The ability of AI agents to automate complex tasks, such as writing secure code or conducting interpretability research, is poised to significantly raise the baseline for what is achievable. Companies like Gray Swan are focusing on developing specialized solutions like 'Signal' because generic, open-source guardrails may not be sufficient for enterprise-grade security. The development of agent-native identity and capability management remains a significant challenge, with current models often operating under the assumption that an agent has the user's permissions, a default that is expected to change. Ultimately, the goal is to advance the Pareto frontier, improving both usability and security, a process that requires continuous research and development.
Mentioned in This Episode
●Software & Apps
●Companies
●Organizations
●People Referenced
Common Questions
Grace Swan's mission is to empower everyone to use AI safely and securely. They focus on understanding and mitigating the vulnerabilities inherent in AI systems, especially large language models and agents, rather than just using AI for traditional cybersecurity.
Topics
Mentioned in this video
An automated red teaming model developed by Grace Swan, found to be more effective at breaking AI models than human red teamers.
A code generation model mentioned as one of the foundational LLMs where vulnerabilities can emerge, leading to new exploit classes.
A programming language mentioned as a contrast to more secure but less user-friendly languages, representing a trade-off in development.
A programming language mentioned as a 'sweeter spot' between usability and security guarantees, compared to languages like Python or those requiring strict type checking.
An LLM mentioned alongside Codex as a potential future tool for writing secure code, if agents become skilled enough.
An open-source LLM mentioned in the context of labs releasing their own open-source security guards.
A state-of-the-art frontier model mentioned as an example that might fall for certain deceptive tactics that humans would avoid.
The full spelling of 'Signal', Grace Swan's defense model, emphasizing its use as a filter.
A company founded from Carnegie Mellon research focused on empowering safe and secure AI usage, offering both red teaming services and defense products.
One of Grace Swan's investors, mentioned in the context of the founders attending the Snowflake Summit.
Mentioned as a company that has been strong in deploying formal verification techniques for secure software development.
Mentioned as a company historically good at formal verification, though more on the research side compared to Amazon's deployment focus.
An AI model discussed as having a large attack surface. Grace Swan has developed numerous 'brakes' for it, particularly concerning computer use and connecting to personal devices.
An airline mentioned as having experienced a prompt injection breach, though not considered a major event.
Co-founder of Grace Swan and faculty at Carnegie Mellon, who has been researching AI vulnerabilities and attack surfaces for over a decade.
Co-founder of Grace Swan and faculty at Carnegie Mellon, focusing on the unique security challenges of AI systems compared to traditional software.
A researcher who inspired Ziko and Matt's work on adversarial settings in AI, mentioned as a friend of the podcast.
Mentioned for his idea of giving up on traditional methods and focusing on AI's potential to automate science, particularly interpretability.
A university from which Grace Swan's founders originate, known as a significant center for AI research, including self-driving and language learning.
Mentioned as an organization that works with Grace Swan, providing independent judges and evaluators for competitions.
Mentioned alongside UKC as providing expertise as independent judges and evaluators for competitions.
An AI underwriting company that Grace Swan works with. It assesses the risk of AI deployments and can prescribe mitigations.
A voluntary industry standard for security compliance, mentioned as a model for potential AI compliance frameworks, though noted as being more product than security expert-driven.
Mentioned as a potential framework for AI insurance, distinct from SOC 2, with varying levels of legal bindingness.
More from Latent Space
View all 229 summaries
34 min⚡️Every product of the future will be a living system — Ronak Malde, Trajectory.ai
61 minThe AI Frontier: from FLOPs to Megawatts — Anjney Midha, AMP
77 min🔬 The Limits of AI in Science - Why We Need Self-Driving Labs — Joseph Krause, Radical AI
41 min⚡️Making DeepSeek v4 outperform Opus 4.7 with Taste — @AhmadAwais , CommandCode.ai
Ask anything from this episode.
Save it, chat with it, and connect it to Claude or ChatGPT. Get cited answers from the actual content — and build your own knowledge base of every podcast and video you care about.
Get Started Free