
Differential Privacy

DP
A mathematical framework for guaranteeing individual privacy in aggregate data analysis and model training. Under differential privacy, adding or removing any single individual's data changes the output by at most a small, bounded amount. This means you can learn useful patterns from a dataset without revealing information about any particular person in it.

Why It Matters

As AI trains on increasingly personal data (health records, financial transactions, messages), differential privacy provides the strongest known guarantee that individual data cannot be extracted from the model. Apple (keyboard predictions), Google (Chrome usage analytics), and the US Census Bureau all use it. For AI, it addresses the concern that LLMs may memorize and reproduce private training data.

Deep Dive

The formal guarantee: a mechanism M is ε-differentially private if for any two datasets D and D' that differ in one record, and any set of outputs S: P[M(D) ∈ S] ≤ e^ε · P[M(D') ∈ S]. Intuitively, the output looks essentially the same whether or not any specific individual's data is included. The privacy parameter ε controls the privacy-utility trade-off: smaller ε means stronger privacy but noisier (less useful) outputs.
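To make the guarantee concrete, here is a minimal sketch in Python of the Laplace mechanism applied to a counting query (the function and variable names are illustrative, not from any particular library). A count has sensitivity 1, since one person's record changes it by at most 1, so Laplace noise with scale 1/ε satisfies ε-DP:

```python
import numpy as np

def laplace_count(records, predicate, epsilon, rng=None):
    """Answer a counting query with ε-differential privacy.

    A count has sensitivity 1 (one record changes it by at most 1),
    so adding Laplace(scale = 1/epsilon) noise satisfies ε-DP.
    """
    rng = rng or np.random.default_rng()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Smaller ε => stronger privacy => noisier answers.
ages = [23, 45, 31, 67, 52, 38, 44, 29]
for eps in (0.1, 1.0, 10.0):
    print(f"ε={eps}: {laplace_count(ages, lambda a: a > 40, eps):.1f}")
```

Note that asking the same query repeatedly spends privacy budget each time; composition theorems quantify how the total ε grows over multiple queries.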

DP in ML Training

DP-SGD (Differentially Private Stochastic Gradient Descent) clips each example's gradient and adds calibrated noise during training, provably limiting how much the trained model can depend on any single example. The trade-off: noise reduces model accuracy. For large models and datasets the accuracy impact can be small; for small datasets, DP can significantly hurt performance. The practical challenge is choosing ε: too small and the model is useless, too large and the privacy guarantee is meaningless.
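A minimal sketch of a single DP-SGD update in NumPy, assuming per-example gradients have already been computed (all names here are illustrative; real implementations integrate this into the training framework):

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update (sketch).

    per_example_grads has shape (batch_size, dim), one gradient row per
    training example. Clipping bounds each example's influence; Gaussian
    noise (std = noise_multiplier * clip_norm) masks any single example's
    contribution to the summed gradient.
    """
    # 1. Clip each per-example gradient to L2 norm <= clip_norm.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale

    # 2. Sum the clipped gradients and add calibrated Gaussian noise.
    noisy_sum = clipped.sum(axis=0) + rng.normal(
        0.0, noise_multiplier * clip_norm, size=params.shape)

    # 3. Average over the batch and take a gradient step.
    return params - lr * noisy_sum / len(per_example_grads)

# Toy usage with random "gradients" (illustrative only).
rng = np.random.default_rng(0)
params = np.zeros(5)
grads = rng.normal(size=(32, 5))
params = dp_sgd_step(params, grads, lr=0.1, clip_norm=1.0,
                     noise_multiplier=1.1, rng=rng)
```

Production libraries such as Opacus (PyTorch) and TensorFlow Privacy implement this pattern and additionally run a privacy accountant that tracks the total ε spent across all training steps.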

The Memorization Problem

LLMs can memorize and reproduce training data verbatim: phone numbers, email addresses, proprietary code. This is a privacy violation even without intentional data extraction. Differential privacy during pre-training would prevent this memorization, but applying DP to models trained on trillions of tokens is computationally challenging and can degrade quality. Current practice therefore relies on training data deduplication, output filtering, and careful data sourcing rather than formal DP guarantees. As regulation tightens, the pressure to adopt formal privacy guarantees will increase.
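Of those mitigations, deduplication is the most mechanical, since duplicated text is memorized far more readily. A minimal sketch of exact-match deduplication by hashing (illustrative only; production pipelines also use near-duplicate methods such as MinHash):

```python
import hashlib

def dedup_exact(documents):
    """Drop exact duplicate documents by hashing normalized text.

    Repeated training text is regurgitated far more readily, so removing
    exact duplicates is a cheap first defense against verbatim memorization.
    """
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["Call me at 555-0100.", "call me at 555-0100. ", "Totally new text."]
print(dedup_exact(docs))  # keeps one copy of the duplicated line
```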
