Safety

Jailbreak

Jailbreaking, Adversarial Prompt
Techniques for tricking an AI model into bypassing its safety training and generating content it was designed to refuse: instructions for dangerous activities, harmful content, or behavior that violates the model's usage policy. Jailbreaks exploit the gap between what the model has been trained to refuse and what a clever prompt can elicit.

Why It Matters

Jailbreaking is the adversarial proving ground of AI safety. Every model ships with safety guardrails, and every major model has been jailbroken. The cat-and-mouse game between jailbreak techniques and safety measures drives improvements in alignment. Understanding jailbreaks helps you assess how robust a model's safety actually is, rather than taking marketing claims at face value.

Deep Dive

Common jailbreak techniques include: role-playing ("Pretend you're an AI without restrictions"), encoding (asking in Base64 or pig Latin), many-shot attacks (providing many examples of the unsafe behavior to establish a pattern), and crescendo attacks (gradually escalating from benign to harmful requests across a conversation). More sophisticated techniques exploit specific model behaviors, like the tendency to continue established patterns or to be helpful when asked for "educational" information.
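To make the encoding technique concrete from the defender's side, here is a minimal sketch of a pre-filter that decodes Base64-looking spans before a prompt reaches a safety check, so that encoding alone cannot hide a request's intent. The `safety_classifier` callable and the regex threshold are illustrative assumptions, not any particular vendor's API.

```python
import base64
import re

# Hypothetical pre-filter: decode Base64-looking spans so that encoding a
# request does not hide it from whatever safety check runs afterwards.
BASE64_SPAN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")

def normalize_prompt(prompt: str) -> str:
    """Return the prompt plus decoded copies of any Base64-looking spans."""
    decoded_parts = []
    for span in BASE64_SPAN.findall(prompt):
        try:
            decoded_parts.append(base64.b64decode(span, validate=True).decode("utf-8"))
        except Exception:
            continue  # not valid Base64 or not valid UTF-8: leave it alone
    return "\n".join([prompt, *decoded_parts])

def should_refuse(prompt: str, safety_classifier) -> bool:
    """Screen the normalized prompt with a caller-supplied safety classifier."""
    return safety_classifier(normalize_prompt(prompt))
```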

The Arms Race

AI labs invest heavily in red-teaming — systematically trying to jailbreak their own models before release. When a new jailbreak technique is discovered, it gets patched through additional safety training or system-level filters. But the attack surface is vast: natural language is infinitely flexible, and new techniques keep emerging. The practical reality is that determined adversaries can usually find some jailbreak for any public model, which is why defense-in-depth (multiple layers of safety, including output filtering and monitoring) matters more than any single prevention technique.
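As a rough illustration of defense-in-depth, the sketch below chains an input filter, the safety-trained model itself, an output filter, and event logging. All of the callables are hypothetical placeholders rather than a specific product's API.

```python
from typing import Callable

REFUSAL_MESSAGE = "Sorry, I can't help with that."

def layered_answer(
    prompt: str,
    input_filter: Callable[[str], bool],    # True = block the request
    generate: Callable[[str], str],         # the safety-trained model
    output_filter: Callable[[str], bool],   # True = block the response
    log_event: Callable[[str, str], None],  # monitoring / audit trail
) -> str:
    """Defense-in-depth: no single layer is trusted to catch everything."""
    if input_filter(prompt):                # layer 1: screen the request
        log_event("refused_at_input", prompt)
        return REFUSAL_MESSAGE

    completion = generate(prompt)           # layer 2: model-level safety training

    if output_filter(completion):           # layer 3: screen the response
        log_event("refused_at_output", prompt)
        return REFUSAL_MESSAGE

    log_event("answered", prompt)           # layer 4: monitoring for new attack patterns
    return completion
```

The point of the layering is the one made above: a jailbreak that slips past the model's safety training still has to get past the output filter, and anything that slips past both leaves a trace in the monitoring logs.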

Jailbreak vs. Legitimate Use

The challenge is that safety filters sometimes refuse legitimate requests. A medical professional asking about drug interactions, a security researcher asking about vulnerabilities, or a novelist writing a scene with conflict might all trigger refusals. Overly aggressive safety training produces models that are "safe" but useless. The art of alignment is finding the right balance — refusing genuinely harmful requests while remaining helpful for legitimate ones.
