
GAN

Also known as: Generative Adversarial Network
A model architecture in which two neural networks compete: a generator creates fake data while a discriminator tries to tell real from fake. Through this adversarial game, the generator gets progressively better at producing realistic output. GANs dominated image generation from 2014 until roughly 2022.

Why It Matters

GANs pioneered photorealistic AI image generation and are still used in some real-time applications. But diffusion models have largely replaced them for quality-focused work, because GANs are harder to train and their outputs are less diverse.

Deep Dive

The GAN setup is a minimax game straight out of game theory. The generator takes random noise (a latent vector, typically sampled from a Gaussian) and maps it to a data sample — an image, usually. The discriminator receives both real samples from the training set and fake samples from the generator, and outputs a probability that each sample is real. The generator is trained to maximize the discriminator's error, while the discriminator is trained to minimize it. In theory, this converges to a Nash equilibrium where the generator produces outputs indistinguishable from real data and the discriminator is reduced to guessing at 50/50. In practice, getting there is another story entirely.
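The two objectives above can be sketched as binary cross-entropy losses. This is a minimal numpy sketch, not any particular library's API; `d_real` and `d_fake` are assumed to be arrays of discriminator output probabilities.

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Discriminator wants D(x) -> 1 for real samples, D(G(z)) -> 0 for fakes."""
    return -(np.log(d_real) + np.log(1.0 - d_fake)).mean()

def generator_loss(d_fake):
    """Non-saturating generator loss: maximize log D(G(z)) rather than
    minimize log(1 - D(G(z))), which gives stronger gradients early in training."""
    return -np.log(d_fake).mean()

# At the theoretical equilibrium the discriminator outputs 0.5 everywhere,
# so its loss settles at 2*log(2) ~ 1.386 -- the cost of random guessing.
guess = np.full(8, 0.5)
print(discriminator_loss(guess, guess))
```

In practice both networks are updated alternately: one or more discriminator steps on this loss, then one generator step, which is exactly the delicate balance the next section is about.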

The Training Problem

Training instability was the defining challenge of GANs for years. Mode collapse — where the generator learns to produce only a narrow slice of possible outputs — plagued early architectures. If the discriminator gets too strong too fast, the gradient signal to the generator vanishes and learning stalls. If the generator finds a cheap trick that fools the discriminator, it exploits it relentlessly instead of learning diverse outputs. Wasserstein GANs (WGAN) addressed this with a different loss function that provides more meaningful gradients. Progressive growing (ProGAN) built images up from low resolution to high, stabilizing training enormously. StyleGAN and StyleGAN2 from NVIDIA refined this further, producing the famous "this person does not exist" faces that first shocked the public into taking AI image generation seriously.
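The WGAN fix is easy to state in code: the critic outputs unbounded scores rather than probabilities, its loss is a difference of means, and (in the original formulation) weights are clipped to enforce a Lipschitz constraint. A minimal sketch with hypothetical names, assuming numpy arrays of critic scores:

```python
import numpy as np

def critic_loss(scores_real, scores_fake):
    # The critic maximizes the gap between real and fake scores, so we
    # minimize the negative gap. No log, no sigmoid: the gradient stays
    # meaningful even when real and fake are trivially distinguishable,
    # which is exactly where the standard GAN loss saturates.
    return scores_fake.mean() - scores_real.mean()

def clip_weights(weights, c=0.01):
    # Original WGAN enforces the Lipschitz constraint by clipping every
    # weight into [-c, c] after each critic update. (WGAN-GP later replaced
    # clipping with a gradient penalty.)
    return [np.clip(w, -c, c) for w in weights]
```

The generator's loss is then simply `-scores_fake.mean()`: push the critic's score on fakes up.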

The Speed Advantage

The real superpower of GANs was always speed. Because generation is a single forward pass through the generator network, a trained GAN can produce an image in milliseconds. Compare this to diffusion models, which need 20-50 iterative passes. This is why GANs still have a niche in real-time applications: video game texture upscaling (NVIDIA DLSS uses a GAN-like architecture), real-time face filters, style transfer in mobile apps, and super-resolution. When you need images at 30+ FPS, the iterative refinement loop of diffusion is too slow without heavy distillation.
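The latency gap comes down to the number of network evaluations per sample. A toy illustration, where a single matmul stands in for a full network pass and the 50-step count is merely typical, not fixed:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))

def forward(x):
    return np.tanh(x @ W)  # stand-in for one full forward pass

z = rng.standard_normal(256)  # latent noise vector

# GAN: one forward pass takes noise straight to a sample.
gan_sample = forward(z)
gan_passes = 1

# Diffusion: iterative refinement, one pass per denoising step.
x = z
diffusion_passes = 0
for _ in range(50):  # typical sampler step counts are 20-50
    x = forward(x)
    diffusion_passes += 1

print(gan_passes, diffusion_passes)  # 1 vs 50 network evaluations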

Ian Goodfellow introduced GANs in 2014, and the architecture went through an extraordinary evolution: DCGAN brought convolutional structure (2015), conditional GANs enabled class-specific generation, pix2pix and CycleGAN handled image-to-image translation, BigGAN scaled up to ImageNet quality, and StyleGAN made photorealistic faces routine. For about eight years, if you saw an AI-generated image, it almost certainly came from a GAN. The shift to diffusion happened because diffusion models solved the problems GANs could not: training stability, output diversity, and fine-grained text conditioning. With diffusion, there was no delicate adversarial balancing act to perform.

Still Alive

A misconception worth correcting: GANs are not dead. They are no longer the default for image generation, but the adversarial training principle shows up everywhere. GAN-based discriminators are used as perceptual loss functions for super-resolution and compression. Adversarial training hardens models against attacks. And some of the fastest diffusion approaches (like Adversarial Diffusion Distillation in SDXL Turbo) actually use a GAN discriminator to distill slow diffusion models into fast few-step generators — a neat full-circle moment where GANs help make their successors faster.
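The distillation idea can be sketched as a combined loss: the student matches the slow teacher's output while a GAN discriminator pushes the student's samples toward realism. This is a schematic numpy sketch under stated assumptions, not the actual SDXL Turbo objective; all names and the weighting factor are hypothetical:

```python
import numpy as np

def distillation_loss(student_out, teacher_out, d_on_student, adv_weight=0.5):
    # Reconstruction term: the few-step student mimics the many-step
    # diffusion teacher's output.
    mse = ((student_out - teacher_out) ** 2).mean()
    # Adversarial term: a GAN discriminator scores the student's samples
    # (probabilities in (0, 1]); the student is rewarded when the
    # discriminator judges its samples real.
    adv = -np.log(d_on_student).mean()
    return mse + adv_weight * adv
```

When the student matches the teacher exactly and fully fools the discriminator, both terms vanish, so the loss bottoms out at zero.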
