Basics

Image Generation

Text-to-Image, AI Art
Generating images from text descriptions with an AI model: you type "sunset over mountains, watercolor style" and the model produces a matching image. Current approaches include diffusion models (Stable Diffusion, DALL-E), flow matching (Flux), and autoregressive models. The field has progressed from the blurry faces of 2020 to the photorealistic, artistically controllable output of 2025.

Why It Matters

Image generation is the most visible consumer AI capability after chatbots. It is reshaping graphic design, advertising, concept art, and visual communication. Understanding the underlying methods (diffusion, flow matching, DiT) and their trade-offs helps you choose the right tool and understand its limitations: why some prompts work and others fail, and why certain styles come more easily than others.

Deep Dive

The dominant approach: encode text into embeddings (via CLIP or T5), start with random noise, and iteratively denoise while conditioning on the text embeddings through cross-attention. Each denoising step makes the image slightly less noisy and more aligned with the prompt. After 20–50 steps (or 4–10 with flow matching), a clean image emerges. The model has learned the statistical relationship between text descriptions and image features from billions of image-caption pairs.
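To make this concrete, here is a minimal end-to-end text-to-image call using Hugging Face's diffusers library, which wraps the text encoder, denoiser, and decoder into one pipeline. This is one common setup, not the only one; the checkpoint name, step count, and guidance scale below are illustrative choices.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained latent-diffusion pipeline (text encoder + denoiser
# + VAE decoder bundled together). Assumes `pip install diffusers
# transformers accelerate` and a CUDA GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# 30 denoising steps; guidance_scale controls how strongly each step
# pulls the image toward the prompt (classifier-free guidance).
image = pipe(
    "sunset over mountains, watercolor style",
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("sunset.png")
```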

Control and Conditioning

Beyond text prompts, modern image generation supports: image-to-image (modify an existing image), ControlNet (guide composition with edge maps, depth maps, or poses), inpainting (regenerate part of an image), and style transfer (apply the aesthetic of one image to another). These controls make image generation practical for professional workflows where "generate something random" isn't enough — you need specific compositions, poses, and layouts.
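As one example of conditioning beyond text, image-to-image in diffusers replaces the pure-noise starting point with a noised version of your input image. A sketch follows; the input file name and the strength value are illustrative assumptions.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

init = load_image("rough_sketch.png")  # hypothetical input image

# strength sets how far the input is noised before denoising begins:
# near 0.0 preserves the input almost unchanged, near 1.0 ignores it.
out = pipe(
    prompt="sunset over mountains, watercolor style",
    image=init,
    strength=0.6,
    guidance_scale=7.5,
).images[0]
out.save("watercolor_version.png")
```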

The Quality Frontier

Image quality improvements come from three sources: better architectures (U-Net to DiT), better training objectives (flow matching over denoising diffusion), and better data (higher resolution, more accurate captions, broader coverage). Current frontier models produce photorealistic images that are difficult to distinguish from photographs, though they still struggle with hands and fingers, text rendering, spatial relationships ("A is to the left of B"), and counting ("exactly five apples"). These remaining challenges are active research areas.
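To see why flow matching samples in fewer steps, here is a minimal sketch of rectified-flow sampling under one common convention (noise at t=0, data at t=1): the model predicts a velocity field, and generation is just Euler integration of that ODE. `sample_flow` and `velocity_model` are hypothetical names standing in for a trained network such as a DiT.

```python
import torch

def sample_flow(velocity_model, shape=(1, 4, 64, 64), steps=8):
    """Euler-integrate a learned ODE from noise (t=0) toward data (t=1)."""
    x = torch.randn(shape)                    # start at pure noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)   # current time for the batch
        x = x + dt * velocity_model(x, t)     # one Euler step along the flow
    return x                                  # approximate latent sample

# Smoke test with a dummy velocity field; a real model would be a
# trained network conditioned on text embeddings as well.
latents = sample_flow(lambda x, t: -x, steps=8)
```

Because the learned trajectory is close to straight, a handful of Euler steps suffices, which is the practical source of the 4-10 step sampling mentioned above.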
