How does AI security differ from traditional cybersecurity?

AI systems have inherent vulnerabilities that can be exploited similarly to how people can be tricked. This requires a different mindset for security, considering that multiple AI systems often rely on a few core models, meaning a vulnerability in one could affect many.

What is prompt injection and why is it a concern?

Prompt injection is a type of attack where a third party manipulates an AI agent's input (prompt) to make it perform unintended or malicious actions, such as leaking data, stealing credentials, or misbehaving.

What are Gray Swan's main offerings for AI security?

Gray Swan offers two main services: automated red teaming (using models like 'Shade' to find vulnerabilities) and defense mechanisms (with their 'Signal' model to filter and enforce policies), alongside operating the 'Arena' for community-driven red teaming challenges.

Can AI models be better at red teaming than humans?

Yes, Gray Swan has found that their automated red teaming system, 'Shade,' can now outperform human red teamers in identifying vulnerabilities within a given time frame, though 'superhuman' general red teaming levels have not yet been reached.

How do AI models differ from humans in terms of vulnerabilities?

AI models and humans fall for different types of tricks. While humans are susceptible to phishing, AI models can be vulnerable to prompt injection. Conversely, some AI models might fall for simple deceptive text that a human would easily recognize as fake.

What is the 'lethal trifecta' of prompt injection risks?

The lethal trifecta refers to three conditions that together make a prompt injection attack dangerous: the ability to ingest untrusted external data, access to private or sensitive internal information, and the capability to exfiltrate that data.

When should enterprises adopt AI security solutions like 'Signal'?

Enterprises should consider adopting solutions like 'Signal' when they have already released AI systems and encountered incidents, or when they need to enforce specific, unique policies that go beyond general LLM capabilities and cannot be handled by prompting alone, especially when dealing with tool use.

Is AI security a solved problem?

No, AI security is still very much a research problem. While great strides are being made, new vulnerabilities and attack vectors are constantly being discovered, and the field is continuously learning how to make models more robust and enforce policies effectively.

How can agents automate scientific research?

Agents can automate complex tasks like performing experiments for mechanistic interpretability or writing secure code. This capability allows for scaling scientific research that was previously limited by human patience and manpower, potentially accelerating discovery.

What are future trends for AI security in enterprises?

The trend is towards AI security becoming a central concern for all enterprises, not just AI labs. Companies will increasingly need solutions to manage the security of agents as they are adopted for core operations, mirroring the growth seen in AI adoption itself.

How will agent identity and access control evolve?

Agent identity will likely evolve from agents having human permissions to a more nuanced system, possibly starting with personas that separate work and home life. This will allow for better control over agent capabilities and access to resources.

Key Moments

AI Security After Codex and Claude Code — Zico Kolter & Matt Fredrikson, Gray Swan

Latent Space Podcast

Science & Technology | 6 min read | 68 min video

Jun 22, 2026|914 views|29

TL;DR

AI agents can now write code and browse the web autonomously, but human red teamers struggle to find vulnerabilities that automated systems like 'Shade' can easily exploit, posing a significant future security risk.

Key Insights

Automated red teaming models, like Gray Swan's 'Shade,' are now surpassing human red teamers in their ability to find vulnerabilities in large language models.

AI security is distinct from traditional cybersecurity; AI systems have inherent vulnerabilities that can be exploited, necessitating a different security mindset.

The "lethal trifecta" for prompt injection attacks involves ingesting untrusted data, accessing private information, and the ability to exfiltrate it.

While capability scaling in models is rapid, safety and robustness do not scale naively with size and often require explicit training.

Gray Swan's 'Signal' is a custom-trained filter model designed to sit between users, LLMs, and tool calls to detect and prevent policy violations.

AI agents are increasingly being adopted by enterprises, shifting the focus of AI security from frontier labs to broader commercial applications.

Automated red teaming surpasses human capabilities

The discussion highlights a significant shift in AI security: automated red teaming systems are now outperforming human experts in discovering vulnerabilities within AI models. Gray Swan's system, named 'Shade,' has demonstrated superior ability in breaking models compared to human red teamers. This advancement is crucial because as AI agents become more powerful—capable of writing code, browsing the web, and acting on behalf of users—they introduce new attack surfaces. The effectiveness of automated tools like Shade suggests that traditional human-led security testing may soon become insufficient. The challenge lies in the fact that these automated attackers can identify flaws that are inherently 'out of distribution' for the models they are testing, bypassing normal behaviors. While not yet at fully 'superhuman' levels, the speed and efficacy of automated techniques in finding breaks within a given timeframe are notable.

AI security demands a new mindset

AI security is fundamentally different from traditional cybersecurity. While AI can be used to enhance traditional security measures, the AI systems themselves introduce unique vulnerabilities. These systems can be 'tricked,' similar to how humans can be deceived, requiring a distinct security approach. A critical concern is the potential for correlated failures, where a few widely used models (like Codex and Claude Code) harbor vulnerabilities that can be exploited across many applications. This creates a new class of exploits. The prevailing mindset is that AI models should be treated as inherently untrusted entities, and Gray Swan focuses on understanding and mitigating the risks introduced by adopting and deploying AI, rather than simply using AI to improve existing cyber infrastructure.

The 'lethal trifecta' of prompt injection

Prompt injection, a key concern in AI security, arises from a combination of three elements, termed the 'lethal trifecta.' First, there must be the ability to ingest external data from untrusted sources, such as web content or user inputs. Second, the AI agent must have access to private or sensitive internal information that would be valuable to an attacker. Finally, the agent needs the capability to exfiltrate this information and send it to an external destination. These three components together create a significant risk. The discussion draws a parallel to traditional software vulnerabilities, noting that just as software is used productively despite known vulnerabilities, AI will likely be deployed despite its own security risks. The goal is not necessarily to achieve perfect, provable mitigation—which is akin to achieving zero-bug software—but to significantly improve the usability-security trade-off.
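
The three conditions above can be written as a simple capability audit. The following is a minimal sketch, not anything described in the episode: the `AgentCapabilities` class, its field names, and both helper functions are hypothetical names introduced for illustration. It only shows that the risk comes from the combination, so removing any one capability breaks the trifecta.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentCapabilities:
    """Hypothetical description of what a deployed agent is allowed to do."""
    reads_untrusted_input: bool  # e.g. web pages, inbound email, user uploads
    reads_private_data: bool     # e.g. internal documents, credentials
    can_send_externally: bool    # e.g. outbound HTTP requests, outbound email


def trifecta_conditions(caps: AgentCapabilities) -> list[str]:
    """Return the names of the trifecta conditions this agent satisfies."""
    checks = {
        "untrusted input": caps.reads_untrusted_input,
        "private data": caps.reads_private_data,
        "exfiltration channel": caps.can_send_externally,
    }
    return [name for name, present in checks.items() if present]


def has_lethal_trifecta(caps: AgentCapabilities) -> bool:
    """True only when all three conditions hold at the same time."""
    return len(trifecta_conditions(caps)) == 3


if __name__ == "__main__":
    browser_agent = AgentCapabilities(True, True, True)
    summarizer = AgentCapabilities(True, True, False)
    print(has_lethal_trifecta(browser_agent))  # all three conditions present
    print(trifecta_conditions(summarizer))     # no exfiltration channel
```

An audit like this does not detect attacks. It only flags deployments where a successful injection would have something to steal and a way to send it out.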

Robustness and safety are not inherent in scale

A key finding is that simply increasing the size of AI models does not automatically improve their safety or robustness against attacks. Unlike many other capabilities that scale with model size, safety features require explicit training. Models are not inherently better at resisting jailbreaks or adversarial prompts just because they are larger. Specialized models need to be trained for tasks like red teaming to achieve high performance. This highlights the importance of targeted training and defense mechanisms, such as Gray Swan's 'Signal' model, which is specifically trained to detect policy violations. The relationship between model capability and attack success rate is not strongly correlated; larger, more capable models can still be highly vulnerable to specific types of attacks.

Gray Swan's dual approach: Red teaming and defense

Gray Swan operates on two main fronts: adversarial testing (red teaming) and building defensive systems. Their red teaming efforts include running a community of human red teamers through prize challenges to discover vulnerabilities, as well as employing automated red teaming systems like 'Shade.' On the defense side, they have developed 'Signal,' a filter model that acts as a protective layer between users, LLMs, and tool calls. Signal is designed to identify and prevent policy violations, working best when custom-trained for specific enterprise needs. This dual approach allows them to not only find weaknesses but also to train their defensive models effectively, creating a continuous improvement loop. The effectiveness of Signal is linked to their ability to rigorously test it against the very vulnerabilities they aim to prevent.
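
To make the idea of a filter sitting between the LLM and its tool calls concrete, here is a minimal sketch of that placement. It is not Gray Swan's implementation: the summary describes a custom-trained filter model, whereas this example substitutes a hand-written rule as a stand-in. `ToolCall`, `block_external_email`, `guarded_execute`, and the `example.com` domain are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ToolCall:
    """A tool invocation proposed by the LLM (hypothetical shape)."""
    name: str
    arguments: dict


# A policy check returns a reason string when the call violates policy,
# or None when it is allowed. A real deployment would use a trained
# classifier here; a plain function stands in for it (assumption).
PolicyCheck = Callable[[ToolCall], Optional[str]]


def block_external_email(call: ToolCall) -> Optional[str]:
    """Hypothetical policy: email may only go to the company domain."""
    if call.name != "send_email":
        return None
    recipient = str(call.arguments.get("to", ""))
    if not recipient.endswith("@example.com"):
        return f"recipient {recipient!r} is outside example.com"
    return None


def guarded_execute(call: ToolCall, tools: dict, checks: list) -> str:
    """Run every policy check before the tool; block on the first violation."""
    for check in checks:
        reason = check(call)
        if reason is not None:
            return f"BLOCKED: {reason}"
    return tools[call.name](**call.arguments)


if __name__ == "__main__":
    tools = {"send_email": lambda to, body: f"sent to {to}"}
    checks = [block_external_email]
    print(guarded_execute(
        ToolCall("send_email", {"to": "a@example.com", "body": "hi"}),
        tools, checks))
    print(guarded_execute(
        ToolCall("send_email", {"to": "x@evil.test", "body": "secrets"}),
        tools, checks))
```

The design point is the placement: the check runs on the proposed action, after the model has read any untrusted content, so it still applies when the model itself has been successfully injected.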

The Human Browser Agent Robustness Challenge

A compelling experiment, the Human Browser Agent Robustness Challenge, compared human participants with AI agents completing web-based tasks. Red teamers could choose to either 'phish' a human or 'prompt inject' an AI browser agent. Surprisingly, humans ranked relatively low in robustness, with skilled red teamers achieving 60-70% success in phishing human participants. While some AI models showed strong resistance, many were easily prompt-injected. The challenge underscored that AI and humans fall prey to different kinds of attacks: humans can be deceived by social engineering tactics that might not fool an AI, while AI models, even advanced ones, can fall for prompt injections, including some that a human would recognize as obviously fake.

The evolving role of agents in science and enterprise

The conversation touches on how advancements in AI agents, particularly coding agents, could revolutionize scientific research, specifically in the field of mechanistic interpretability ('mech interp'). Agents can potentially automate the process of hypothesis testing and experimentation, turning these disciplines into more robust sciences. For enterprises, the trend is clear: AI agents like Codex and Claude are moving beyond frontier labs and becoming integral to operations. This widespread adoption necessitates robust security measures. Gray Swan's Series A funding is largely aimed at scaling their technology for enterprise deployment, recognizing that organizations adopting these tools need solutions to ensure safety and policy compliance, often proactively seeking these solutions before incidents occur.

The future of AI security: Automation and specialized solutions

Looking ahead, the future of AI security will likely involve greater automation and specialization. The ability of AI agents to automate complex tasks, such as writing secure code or conducting interpretability research, is poised to significantly raise the baseline for what is achievable. Companies like Gray Swan are focusing on developing specialized solutions like 'Signal' because generic, open-source guardrails may not be sufficient for enterprise-grade security. The development of agent-native identity and capability management remains a significant challenge, with current models often operating under the assumption that an agent has the user's permissions, a default that is expected to change. Ultimately, the goal is to advance the Pareto frontier, improving both usability and security, a process that requires continuous research and development.
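
The difference between an agent inheriting all of its user's permissions and an agent with its own scoped identity can be sketched as follows. This is an illustrative assumption, not a scheme described in the episode: `User`, `AgentIdentity`, `delegate`, and the permission strings are hypothetical names. The agent receives only the permissions explicitly requested for a persona, and never more than the user holds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    permissions: frozenset


@dataclass(frozen=True)
class AgentIdentity:
    """An agent acting for a user under a named persona (hypothetical)."""
    owner: User
    persona: str          # e.g. "work" or "home"
    granted: frozenset    # subset of the owner's permissions


def delegate(user: User, persona: str, requested: set) -> AgentIdentity:
    """Grant only those requested permissions that the user actually holds."""
    return AgentIdentity(user, persona, frozenset(requested) & user.permissions)


def is_allowed(agent: AgentIdentity, permission: str) -> bool:
    return permission in agent.granted


if __name__ == "__main__":
    alice = User("alice", frozenset({"read_calendar", "send_email", "read_payroll"}))
    work_agent = delegate(alice, "work", {"read_calendar", "delete_repo"})
    print(sorted(work_agent.granted))             # only what alice holds and asked for
    print(is_allowed(work_agent, "read_payroll"))  # alice can, the agent cannot
```

Under the inherit-everything default, a compromised agent can do anything its user can. With an explicit grant, the damage from a successful injection is bounded by the persona's permission set.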

Common Questions

Gray Swan's mission is to empower everyone to use AI safely and securely. They focus on understanding and mitigating the vulnerabilities inherent in AI systems, especially large language models and agents, rather than just using AI for traditional cybersecurity.

Topics

AI Ethics, AI Safety, AI & Machine Learning, Technology & Innovation, AI Security, Prompt Injection, Model Robustness, Adversarial Attacks, Red Teaming, LLM Vulnerabilities

Mentioned in this video

Software & Apps

Shade

An automated red teaming model developed by Gray Swan, found to be more effective at breaking AI models than human red teamers.

Codex

A code generation model mentioned as one of the foundational LLMs where vulnerabilities can emerge, leading to new exploit classes.

Python

A programming language mentioned as a contrast to more secure but less user-friendly languages, representing a trade-off in development.

Rust

A programming language mentioned as a 'sweeter spot' between usability and security guarantees, compared to languages like Python or those requiring strict type checking.

Claude

An LLM mentioned alongside Codex as a potential future tool for writing secure code, if agents become skilled enough.

Llama

An open-source LLM mentioned in the context of labs releasing their own open-source security guards.

Opus

A state-of-the-art frontier model mentioned as an example that might fall for certain deceptive tactics that humans would avoid.

Cygnal

Gray Swan's defense model, rendered as 'Signal' elsewhere in this summary; it is used as a filter.

Companies

Gray Swan

A company founded from Carnegie Mellon research focused on empowering safe and secure AI usage, offering both red teaming services and defense products.

Snowflake

One of Gray Swan's investors, mentioned in the context of the founders attending the Snowflake Summit.

Amazon

Mentioned as a company that has been strong in deploying formal verification techniques for secure software development.

Microsoft

Mentioned as a company historically good at formal verification, though more on the research side compared to Amazon's deployment focus.

OpenClaw

An AI agent discussed as having a large attack surface. Gray Swan has developed numerous 'breaks' for it, particularly concerning computer use and connecting to personal devices.

Hertz

A car rental company mentioned as having experienced a prompt injection breach, though not considered a major event.

People

Matt Fredrikson

Co-founder of Gray Swan and faculty at Carnegie Mellon, who has been researching AI vulnerabilities and attack surfaces for over a decade.

Zico Kolter

Co-founder of Gray Swan and faculty at Carnegie Mellon, focusing on the unique security challenges of AI systems compared to traditional software.

Ian Goodfellow

A researcher who inspired Zico and Matt's work on adversarial settings in AI, mentioned as a friend of the podcast.

Neel Nanda

Mentioned for his idea of giving up on traditional methods and focusing on AI's potential to automate science, particularly interpretability.

Organizations

Carnegie Mellon University

A university from which Gray Swan's founders originate, known as a significant center for AI research, including self-driving and language learning.

UKC

Mentioned as an organization that works with Gray Swan, providing independent judges and evaluators for competitions.

Casey

Mentioned alongside UKC as providing expertise as independent judges and evaluators for competitions.

AIC

An AI underwriting company that Gray Swan works with. It assesses the risk of AI deployments and can prescribe mitigations.

Legislation & Policy

SOC 2

A voluntary industry standard for security compliance, mentioned as a model for potential AI compliance frameworks, though noted as being driven more by product considerations than by security experts.

Sarbanes-Oxley Act

Mentioned as a potential framework for AI insurance, distinct from SOC 2, with the two differing in how legally binding they are.
