The term "red teaming" comes from Cold War military exercises where a designated adversary team (red) would attack the plans of the defending team (blue). In cybersecurity, it evolved into the practice of hiring ethical hackers to find vulnerabilities before malicious ones do. AI red teaming applies the same philosophy: assume the model has weaknesses, then systematically find them. The key difference from traditional pen testing is that AI models fail in fuzzy, probabilistic ways — there's no single exploit that "roots" a language model, but rather a landscape of prompts and contexts where the model behaves unexpectedly or harmfully.
Modern AI red teaming typically covers several categories of failure. Safety testing probes for harmful content generation — can you get the model to produce instructions for weapons, detailed self-harm content, or child exploitation material? Bias and fairness testing checks whether the model treats demographic groups differently or reinforces stereotypes. Factuality testing looks for confident hallucinations, especially in high-stakes domains like medicine and law. Privacy testing checks whether the model will regurgitate personal information from its training data (researchers have extracted verbatim training data from GPT-3, including phone numbers and email addresses). And capability evaluations assess whether the model could assist with genuinely dangerous tasks like bioweapons design or cyberattacks — these are the evaluations that inform whether a model is safe to deploy at all.
The practice has professionalized rapidly. Anthropic, OpenAI, Google DeepMind, and Meta all run internal red teams before major releases, and they increasingly bring in external specialists. Anthropic partnered with domain experts in biosecurity and cybersecurity for Claude's pre-release evaluations. OpenAI ran a large-scale external red teaming exercise for GPT-4 with over 50 experts. Startups like HackerOne and Scale AI have built red-teaming-as-a-service platforms. There's also a growing community of independent AI red teamers — DEF CON's 2023 Generative AI Red Teaming event had thousands of participants testing models from multiple providers simultaneously, and it surfaced real vulnerabilities that the companies subsequently patched.
Automated red teaming is an increasingly important complement to human testing. The idea is to use one AI model to generate adversarial prompts that test another model's defenses. Techniques include gradient-based attacks (Greedy Coordinate Gradient, or GCG, which finds nonsensical but effective adversarial suffixes), LLM-as-attacker approaches (where a "red" model iteratively refines jailbreak prompts based on the target's responses), and fuzzing (systematically mutating known-successful attacks to find new variants). Anthropic and other labs use these automated methods to test at scale — a human red teamer might try hundreds of attacks in a session, while an automated system can try millions. The catch is that automated methods tend to find "weird" failures (responses to gibberish tokens) while humans are better at finding socially realistic attack vectors (the kind actual users would attempt).
A practical gotcha for anyone doing red teaming: the results are highly sensitive to how you frame the exercise. If you only test for the failures you expect, you'll only find those. The most valuable red teaming often comes from people with domain expertise unrelated to AI — a social worker might spot manipulation patterns that a security researcher wouldn't think to test, while a chemist would know which synthesis instructions are actually dangerous versus which are textbook knowledge. This is why diverse red teams consistently find more and different vulnerabilities than homogeneous ones. It's also why red teaming is never "done" — every new use case, every new integration, every model update potentially introduces failure modes that previous testing didn't cover.