
AI Privacy

Also known as: Data Privacy in AI, ML Privacy

The challenge of building and using AI systems without compromising personal data. It spans the entire lifecycle: training data that may contain private information, models that can memorize and regurgitate personal details, inference logs that track user behavior, and the fundamental tension between AI capability (more data is better) and the right to privacy.

Why It Matters

Every conversation with an AI is data. Every image you generate reveals your prompt. Every document you summarize passes through someone's servers. Privacy is not just a legal checkbox (GDPR, CCPA); it is a question of trust that determines whether individuals and businesses will adopt AI for sensitive work.

Deep Dive

Privacy in AI is not one problem — it is a stack of interconnected problems that span the entire lifecycle of a model and everything that touches it. Training data may contain personal information scraped from the web without consent. The model itself can memorize and reproduce that information verbatim. Inference logs capture what users ask, which often reveals far more about them than they realize. And the business models of many AI providers depend on using your interactions to improve their systems, which means your data flows into the next training run unless you explicitly opt out (and sometimes even then). Understanding where privacy breaks down requires looking at each layer separately.

The Training Data Problem

Large language models are trained on datasets scraped from the open web — Common Crawl, Reddit archives, public forums, personal blogs, leaked databases that were indexed by search engines. This means the training data for GPT-4, Claude, Gemini, and every other frontier model contains real names, addresses, phone numbers, medical discussions, legal documents, and private conversations that people posted without imagining they would end up inside a neural network. The legal landscape here is evolving rapidly. The EU AI Act requires documentation of training data sources. Italy temporarily banned ChatGPT over GDPR concerns. Class-action lawsuits are ongoing in multiple jurisdictions. But the technical reality is that once information is embedded in model weights through training, it cannot be cleanly removed. Techniques like machine unlearning attempt to selectively forget specific data, but they are approximate at best — a problem regulators have not yet fully grappled with.
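Because removal after training is approximate at best, the main point of intervention is scrubbing personal information from the corpus before training. A minimal sketch of that filtering step, using two illustrative regex patterns (real pipelines use NER-based scrubbers and far broader pattern sets than this):

```python
import re

# Hypothetical minimal PII scrubber: regexes for email addresses and
# US-style phone numbers only. This illustrates the pre-training
# filtering step, not a production-grade detector.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scrub(text: str) -> str:
    """Replace each matched PII span with a typed placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Even aggressive filtering misses context-dependent identifiers (a nickname, a rare job title plus a city), which is why scrubbing complements rather than replaces the training-time defenses discussed below.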

Memorization and Extraction

Models do not just learn patterns from training data — they sometimes memorize specific sequences verbatim. Researchers at Google DeepMind demonstrated that GPT-3.5 could be prompted to emit memorized training data including personal phone numbers and email addresses. Larger models memorize more, and data that appears frequently in training sets is easier to extract. This is not a theoretical concern: if someone's personal information appeared in enough web pages, a sufficiently clever prompt can coax a model into reproducing it. Differential privacy (adding calibrated noise during training to limit what can be learned about any individual data point) is the most principled technical defense, but it comes with a real cost to model quality. Apple uses differential privacy in its on-device models. Most cloud providers do not, because with current techniques the accuracy tradeoff is too steep for competitive frontier models.
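The core move of differentially private training can be sketched in a few lines: clip each per-example gradient so no single record can dominate the update, then add Gaussian noise scaled to that clipping bound. This follows the shape of DP-SGD; the hyperparameter values below are illustrative, not tuned, and real training would also track the cumulative privacy budget:

```python
import numpy as np

def dp_average(grads, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """One DP-SGD-style aggregation step (illustrative sketch).

    Clips each per-example gradient to clip_norm, averages them,
    then adds Gaussian noise calibrated to the clipping bound.
    """
    rng = rng or np.random.default_rng(0)
    clipped = []
    for g in grads:
        norm = np.linalg.norm(g)
        # Scale down any gradient whose norm exceeds the bound.
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    avg = np.mean(clipped, axis=0)
    # Noise std is proportional to the sensitivity (clip_norm),
    # spread over the batch.
    sigma = noise_multiplier * clip_norm / len(grads)
    return avg + rng.normal(0.0, sigma, size=avg.shape)
```

The clipping bound is what gives the noise a meaningful calibration: without it, one outlier example could shift the update arbitrarily, and no finite noise level would hide its presence.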

Inference Privacy and Data Flows

Even if the training data problem were solved tomorrow, inference creates its own privacy surface. When you paste a contract into ChatGPT for summarization, that text hits OpenAI's servers. When your company builds a customer-support chatbot, every customer interaction flows through your AI provider's infrastructure. Enterprise customers increasingly demand data processing agreements, SOC 2 compliance, and contractual guarantees that their data will not be used for training. Providers have responded: OpenAI, Anthropic, Google, and others offer enterprise tiers with no-training guarantees. But the architecture still requires sending data to someone else's servers. The alternative — running models locally or in your own cloud environment — is becoming more practical as open-weight models improve, but it requires significant technical investment and typically means giving up access to the most capable models.
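A common mitigation when data must leave your infrastructure is a pseudonymization round trip: swap sensitive strings for placeholders before the API call, then swap them back in the response. A hypothetical sketch (`pseudonymize` and `restore` are illustrative names, and a production version would detect sensitive spans automatically rather than take them as a list):

```python
def pseudonymize(text: str, secrets: list[str]):
    """Replace each known-sensitive string with an opaque token
    before sending the text to a hosted model."""
    mapping = {}
    for i, secret in enumerate(secrets):
        token = f"<<REDACTED_{i}>>"
        mapping[token] = secret
        text = text.replace(secret, token)
    return text, mapping

def restore(text: str, mapping: dict[str, str]) -> str:
    """Re-insert the original strings into the model's response."""
    for token, secret in mapping.items():
        text = text.replace(token, secret)
    return text
```

This keeps the provider from ever seeing the raw identifiers, at the cost of occasionally degrading output quality when the model needed the redacted context to reason correctly.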

Privacy-Preserving Approaches

The field is not standing still. Federated learning lets multiple parties train a shared model without ever combining their raw data — your data stays on your device or your server, and only model updates are shared. Homomorphic encryption, once considered too slow for practical use, is reaching the point where some inference workloads can run on encrypted data without ever decrypting it. On-device models like those in Apple Intelligence process sensitive tasks locally, only reaching out to the cloud for requests that exceed local capability. Retrieval-augmented generation lets you keep sensitive documents in your own infrastructure and inject relevant context at inference time without it entering the training pipeline. None of these approaches solve everything, and most involve tradeoffs in cost, latency, or model quality. But they represent a genuine shift from "trust us with your data" toward architectures where privacy is enforced by design rather than by policy alone.
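The aggregation step at the heart of federated learning is simple to sketch: clients train locally and send only weight updates, which the server combines weighted by local dataset size (the FedAvg scheme). A minimal illustration, omitting the secure-aggregation and differential-privacy layers that real deployments add on top:

```python
import numpy as np

def federated_average(client_updates, client_sizes):
    """FedAvg server step: combine client weight deltas, weighted
    by each client's local dataset size. Raw data never leaves
    the clients; only these update vectors are shared."""
    total = sum(client_sizes)
    agg = np.zeros_like(client_updates[0])
    for delta, n in zip(client_updates, client_sizes):
        agg += (n / total) * delta
    return agg
```

Note that updates themselves can still leak information about client data, which is why federated learning is usually paired with secure aggregation or differential privacy rather than trusted on its own.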
