Basics

Image Generation

Text-to-Image, AI Art
Generating images from text descriptions with an AI model. You type "sunset over mountains, watercolor style" and the model produces a matching image. Current approaches include diffusion models (Stable Diffusion, DALL-E), flow matching (Flux), and autoregressive models. The field has progressed from blurry faces in 2020 to photorealistic, artistically controllable output in 2025.

Why It Matters

Image generation is the most visible consumer AI capability after chatbots. It is reshaping graphic design, advertising, concept art, and visual communication. Understanding the underlying methods (diffusion, flow matching, DiT) and their trade-offs helps you choose the right tool and recognize its limits: why some prompts work and others fail, and why certain styles come out more easily than others.

Deep Dive

The dominant approach: encode text into embeddings (via CLIP or T5), start with random noise, and iteratively denoise while conditioning on the text embeddings through cross-attention. Each denoising step makes the image slightly less noisy and more aligned with the prompt. After 20–50 steps (or 4–10 with flow matching), a clean image emerges. The model has learned the statistical relationship between text descriptions and image features from billions of image-caption pairs.
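The iterative denoising loop can be sketched in a few lines. This is a deliberately toy version: `toy_denoiser` is a stand-in for the learned network (a real model conditions on the text embedding through cross-attention inside a U-Net or DiT), and the schedule is a simple linear correction rather than a real noise schedule.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(x, text_emb):
    """Stand-in for the learned network: predicts the noise in x.
    We pretend the 'clean' image implied by the prompt is
    tanh(text_emb), so predicted noise is whatever sits on top of it."""
    clean_estimate = np.tanh(text_emb)
    return x - clean_estimate

def generate(text_emb, num_steps=20):
    x = rng.standard_normal(text_emb.shape)  # start from pure noise
    for step in range(num_steps):
        predicted_noise = toy_denoiser(x, text_emb)
        # Remove a fraction of the predicted noise at each step;
        # later steps make smaller corrections.
        x = x - predicted_noise / (num_steps - step)
    return x

text_emb = rng.standard_normal(8)  # pretend CLIP/T5 text embedding
img = generate(text_emb)
# After all steps the sample converges to the denoiser's clean estimate.
print(np.allclose(img, np.tanh(text_emb)))  # True
```

The structure mirrors real samplers: noise in, repeated predict-and-subtract, clean image out. What actually differs in production systems is the denoiser (billions of learned parameters) and the step schedule (DDIM, Euler, flow-matching ODE solvers).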

Control and Conditioning

Beyond text prompts, modern image generation supports: image-to-image (modify an existing image), ControlNet (guide composition with edge maps, depth maps, or poses), inpainting (regenerate part of an image), and style transfer (apply the aesthetic of one image to another). These controls make image generation practical for professional workflows where "generate something random" isn't enough — you need specific compositions, poses, and layouts.
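Image-to-image illustrates how these controls plug into the same denoising loop: instead of starting from pure noise, you noise the input image partway and run only the remaining denoising steps. The sketch below shows just that starting-point logic (the function name and `strength` parameter follow the common convention in tools like Stable Diffusion, but this is an illustrative sketch, not any library's API).

```python
import numpy as np

rng = np.random.default_rng(1)

def img2img_start(init_image, strength, num_steps):
    """Image-to-image setup: blend the input image with noise in
    proportion to `strength` (0..1). strength=1.0 behaves like plain
    text-to-image; strength=0.0 keeps the input unchanged."""
    noise = rng.standard_normal(init_image.shape)
    noised = (1 - strength) * init_image + strength * noise
    # Only the last fraction of the denoising schedule is re-run,
    # so low strength preserves the original composition.
    steps_to_run = int(num_steps * strength)
    return noised, steps_to_run

init = np.ones((4, 4))  # pretend input image
x, steps = img2img_start(init, strength=0.25, num_steps=20)
print(steps)  # 5: only a quarter of the schedule is re-run
```

ControlNet and inpainting work differently (extra conditioning signals and masks fed into the network itself), but they share the same idea: constrain where the denoising process can go rather than letting it start from scratch.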

The Quality Frontier

Image quality improvements come from three sources: better architectures (U-Net to DiT), better training (flow matching over diffusion), and better data (higher resolution, better captions, more diverse). Current frontier models produce photorealistic images that are difficult to distinguish from photographs, though they still struggle with: hands and fingers, text rendering, spatial relationships ("A is to the left of B"), and counting ("exactly five apples"). These remaining challenges are active research areas.
