The GAN setup is a minimax game straight out of game theory. The generator takes random noise (a latent vector, typically sampled from a Gaussian) and maps it to a data sample — an image, usually. The discriminator receives both real samples from the training set and fake samples from the generator, and outputs a probability that each sample is real. The generator is trained to maximize the discriminator's error, while the discriminator is trained to minimize it. In theory, this converges to a Nash equilibrium where the generator produces outputs indistinguishable from real data and the discriminator is reduced to guessing at 50/50. In practice, getting there is another story entirely.
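The game described in the paragraph above is usually written as a single value function (following Goodfellow et al., 2014): the discriminator D maximizes it, the generator G minimizes it, and the 50/50 equilibrium corresponds to D outputting 1/2 everywhere.

```latex
\min_G \max_D \; V(D, G) =
\mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr]
+ \mathbb{E}_{z \sim p_z}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The first term rewards D for recognizing real samples; the second rewards it for rejecting generated ones, and G pushes back on exactly that second term.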
Training instability was the defining challenge of GANs for years. Mode collapse — where the generator learns to produce only a narrow slice of possible outputs — plagued early architectures. If the discriminator gets too strong too fast, the gradient signal to the generator vanishes and learning stalls. If the generator finds a cheap trick that fools the discriminator, it exploits it relentlessly instead of learning diverse outputs. Wasserstein GANs (WGAN) addressed this with a loss based on the Wasserstein distance — the critic outputs unbounded scores rather than probabilities, subject to a Lipschitz constraint — which provides meaningful gradients even when real and fake samples are easy to tell apart. Progressive growing (ProGAN) built images up from low resolution to high, stabilizing training enormously. StyleGAN and StyleGAN2 from NVIDIA refined this further, producing the famous "this person does not exist" faces that first shocked the public into taking AI image generation seriously.
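The contrast between the two losses is easy to see side by side. A minimal sketch (illustrative function names; the Lipschitz constraint WGAN also requires — weight clipping or a gradient penalty — is not shown here):

```python
import numpy as np

def gan_discriminator_loss(d_real, d_fake):
    # Standard GAN: the discriminator outputs probabilities in (0, 1).
    # The log terms saturate once the discriminator confidently
    # separates real from fake, starving the generator of gradient.
    return -np.mean(np.log(d_real) + np.log(1.0 - d_fake))

def wgan_critic_loss(c_real, c_fake):
    # WGAN: the critic outputs unbounded scores, and the loss is a plain
    # difference of means. The gradient stays informative no matter how
    # far apart the real and fake score distributions drift.
    return np.mean(c_fake) - np.mean(c_real)
```

The generator's side mirrors these: it minimizes `-log(d_fake)` in the standard (non-saturating) formulation, and `-mean(c_fake)` in WGAN.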
The real superpower of GANs was always speed. Because generation is a single forward pass through the generator network, a trained GAN can produce an image in milliseconds. Compare this to diffusion models, which need 20-50 iterative denoising passes. This is why GANs still have a niche in real-time applications: in-game upscaling (the same single-pass latency constraint that drives systems like NVIDIA DLSS), real-time face filters, style transfer in mobile apps, and super-resolution. When you need images at 30+ FPS, the iterative refinement loop of diffusion is too slow without heavy distillation.
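The frame-budget arithmetic makes the gap concrete. A back-of-envelope sketch with assumed, illustrative timings (not benchmarks):

```python
GAN_PASS_MS = 5.0        # assumed cost of one generator forward pass
DIFFUSION_STEPS = 30     # a typical sampler step count from the 20-50 range
STEP_COST_MS = 5.0       # assume each denoising step costs about the same

gan_latency = GAN_PASS_MS                      # one pass: 5 ms
diffusion_latency = DIFFUSION_STEPS * STEP_COST_MS  # 30 passes: 150 ms

frame_budget_ms = 1000.0 / 30.0  # ~33 ms per frame at 30 FPS

print(gan_latency <= frame_budget_ms)        # GAN fits the frame budget
print(diffusion_latency <= frame_budget_ms)  # diffusion misses it ~4-5x over
```

Under these assumptions the GAN clears a 30 FPS budget with room to spare, while the diffusion sampler overshoots it several times over — which is exactly why few-step distillation is a precondition for real-time diffusion.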
Ian Goodfellow introduced GANs in 2014, and the architecture went through an extraordinary evolution: DCGAN brought convolutional structure (2015), conditional GANs enabled class-specific generation, pix2pix and CycleGAN handled image-to-image translation, BigGAN scaled up to ImageNet quality, and StyleGAN made photorealistic faces routine. For about eight years, if you saw an AI-generated image, it almost certainly came from a GAN. The shift to diffusion happened because diffusion models solved the problems GANs could not: training stability, output diversity, and fine-grained text conditioning — and they did so without the delicate balancing act of adversarial training.
A misconception worth correcting: GANs are not dead. They are no longer the default for image generation, but the adversarial training principle shows up everywhere. GAN-based discriminators are used as perceptual loss functions for super-resolution and compression. Adversarial training hardens models against attacks. And some of the fastest diffusion approaches (like Adversarial Diffusion Distillation in SDXL Turbo) actually use a GAN discriminator to distill slow diffusion models into fast few-step generators — a neat full-circle moment where GANs help make their successors faster.
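The perceptual-loss role mentioned above follows a simple pattern, familiar from SRGAN-style super-resolution training: a pixel reconstruction term keeps the output faithful, and a small adversarial term from a discriminator pushes it toward realism that L1/L2 losses miss (they reward blurry averages). A minimal sketch with hypothetical names and an assumed weighting:

```python
import numpy as np

def sr_training_loss(pred, target, d_score_pred, adv_weight=1e-3):
    # pred, target: image arrays; d_score_pred: discriminator's
    # probability (in (0, 1)) that the prediction is a real image.
    recon = np.mean(np.abs(pred - target))               # L1 pixel loss
    adversarial = -np.mean(np.log(d_score_pred + 1e-8))  # fool the discriminator
    return recon + adv_weight * adversarial
```

The `adv_weight` value here is an illustrative placeholder; in practice it is tuned so the adversarial term sharpens textures without letting the generator hallucinate content the reconstruction term would penalize.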