
CNN

Convolutional Neural Network, ConvNet
A neural network architecture designed for grid-like data (images, audio spectrograms). It detects local patterns such as edges, textures, and shapes by sliding small filters (kernels) across the input. CNNs dominated computer vision from 2012 (AlexNet) until Vision Transformers emerged around 2020, and they remain widely used in production, especially on edge devices.

Why It Matters

CNNs kicked off the deep learning revolution. AlexNet's 2012 ImageNet victory proved that deep neural networks could dramatically outperform hand-engineered features, triggering the current AI boom. Understanding CNNs helps you understand why Transformers work (many of the same ideas apply, such as hierarchical features and parameter sharing), and CNNs remain the best choice for vision tasks on many resource-constrained devices.

Deep Dive

A CNN's core operation is convolution: a small filter (say 3×3 pixels) slides across the image, computing a dot product at each position to detect a specific pattern. Early layers learn simple patterns (edges, color gradients). Deeper layers combine these into increasingly complex features (eyes, wheels, faces). Pooling layers downsample between convolution layers, reducing spatial dimensions while preserving important features.
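The sliding dot product described above can be sketched in a few lines of NumPy. This is an illustrative toy implementation, not a framework API; like most deep learning libraries, it actually computes cross-correlation (the kernel is not flipped):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D convolution: slide the kernel over the image,
    taking a dot product with each patch it covers."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A toy 5x5 image: dark left half, bright right half.
image = np.zeros((5, 5))
image[:, 3:] = 1.0

# A vertical-edge filter: responds strongly where intensity
# increases from left to right, and is zero on flat regions.
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

response = conv2d(image, kernel)
print(response)  # nonzero only at the dark-to-bright boundary
```

A trained CNN learns kernels like this one from data rather than having them hand-designed, which is exactly the advantage AlexNet demonstrated over hand-crafted features.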

Why CNNs Work

Two key properties make CNNs efficient: translation equivariance (a cat is a cat regardless of where it appears in the image — the same filter detects it everywhere) and locality (nearby pixels are more related than distant ones). These properties drastically reduce the number of parameters compared to fully connected networks, making CNNs tractable for high-resolution images.
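The parameter savings from weight sharing are easy to quantify. A rough back-of-the-envelope comparison for a single layer on a 224×224 RGB image (the dimensions are illustrative):

```python
# One layer on a 224x224 RGB input.
H, W, C_in, C_out = 224, 224, 3, 64

# Fully connected: every one of the 64 output units needs its own
# weight for every input value.
fc_params = (H * W * C_in) * C_out      # one weight per input per unit

# Convolutional: each of the 64 filters is a 3x3xC_in kernel reused
# at every spatial position (biases ignored for simplicity).
conv_params = (3 * 3 * C_in) * C_out

print(fc_params)    # 9,633,792 weights for just 64 output units
print(conv_params)  # 1,728 weights for 64 full feature maps
```

The convolutional layer uses over 5,000× fewer weights while producing an entire feature map per filter, which is what makes training on high-resolution images tractable.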

CNNs Beyond Images

CNNs aren't limited to images. 1D convolutions process sequences (audio waveforms, time series). WaveNet (for speech synthesis) and some text classification models use 1D CNNs. In audio, spectrograms are treated as 2D images and processed with standard 2D CNNs. Even in the Transformer era, some hybrid architectures use convolutional layers for local feature extraction before feeding into attention layers.
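A 1D convolution is the same sliding dot product applied along a single axis such as time. A minimal sketch using NumPy's built-in `np.convolve`, here with a moving-average filter as the kernel:

```python
import numpy as np

# A noiseless step signal standing in for a 1D sequence (audio
# samples, a time series, etc.).
signal = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0])

# A 3-tap moving-average kernel; a learned 1D CNN filter has the
# same shape but data-driven weights.
kernel = np.ones(3) / 3.0

# "valid" mode keeps only positions where the kernel fully overlaps
# the signal, mirroring the 2D case above.
smoothed = np.convolve(signal, kernel, mode="valid")
print(smoothed)  # the step edge is blurred across the kernel width
```

Models like WaveNet stack many such 1D convolutions (with learned kernels and dilation) to cover long temporal contexts.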
