Models

CNN

Convolutional Neural Network, ConvNet
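A neural network architecture designed for grid-like data (images, audio spectrograms) that detects local patterns such as edges, textures, and shapes by sliding small filters (kernels) across the input. CNNs dominated computer vision from 2012 (AlexNet) until Vision Transformers emerged around 2020. They remain widely used in production, especially on edge devices.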

Why It Matters

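CNNs kicked off the deep learning revolution. AlexNet's win at ImageNet in 2012 showed that deep neural networks could far outperform hand-engineered features, triggering the current AI boom. Understanding CNNs helps you understand why Transformers work (many of the same ideas apply, such as hierarchical features and parameter sharing), and CNNs remain the best choice for vision tasks on many resource-constrained devices.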

Deep Dive

A CNN's core operation is convolution: a small filter (say 3×3 pixels) slides across the image, computing a dot product at each position to detect a specific pattern. Early layers learn simple patterns (edges, color gradients). Deeper layers combine these into increasingly complex features (eyes, wheels, faces). Pooling layers downsample between convolution layers, reducing spatial dimensions while preserving important features.
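As a rough illustration of the sliding dot product and the pooling step, here is a minimal pure-Python sketch. The function names (`conv2d_valid`, `max_pool2d`), the toy 6×6 image, and the hand-written edge filter are all assumptions made for this example; in a real CNN the filter values are learned, and a framework would do this far more efficiently.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1), taking a dot product at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        out.append(row)
    return out


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(
                feature_map[i * size + di][j * size + dj]
                for di in range(size)
                for dj in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Toy 6x6 image (assumed for illustration): dark (0) on the left, bright (1) on the right.
image = [[0, 0, 0, 0, 1, 1] for _ in range(6)]

# Hand-written vertical-edge filter; a trained network would learn such weights.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = conv2d_valid(image, vertical_edge)  # 4x4 map, large where the edge is
pooled = max_pool2d(feature_map)                  # 2x2 map, edge location kept coarsely

print(feature_map)
print(pooled)
```

The filter responds only where dark pixels sit to the left of bright ones, and pooling halves each spatial dimension while keeping that strong response.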

Why CNNs Work

Two key properties make CNNs efficient: translation equivariance (the same filter is applied at every position, so a cat is detected wherever it appears in the image, and the filter's response shifts along with it) and locality (nearby pixels are more related than distant ones, so each filter only needs to look at a small neighborhood). Because the filter weights are shared across positions and each filter is small, a convolutional layer needs far fewer parameters than a fully connected layer on the same input, making CNNs tractable for high-resolution images.
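To make the parameter savings concrete, the sketch below counts weights for one convolutional layer and one fully connected layer, then checks the shift behavior on a tiny 1D signal. The sizes (a 224×224 RGB input, 64 output channels or units) and the helper names are assumptions chosen for illustration, and the two layers are not equivalent in what they compute; the point is only the order of magnitude.

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights plus one bias per output channel; does not depend on image size."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels


def dense_params(num_inputs, num_outputs):
    """Every input connects to every output, plus one bias per output."""
    return num_inputs * num_outputs + num_outputs


def correlate1d(signal, kernel):
    """Slide `kernel` over `signal` with no padding, reusing the same weights everywhere."""
    k = len(kernel)
    return [
        sum(signal[i + d] * kernel[d] for d in range(k))
        for i in range(len(signal) - k + 1)
    ]


# Assumed sizes for illustration only.
height, width, channels = 224, 224, 3
conv_count = conv_params(3, 3, channels, 64)
dense_count = dense_params(height * width * channels, 64)
print(conv_count, dense_count)  # 1792 9633856

# Equivariance check: moving the pattern one step later moves the response one step later.
kernel = [-1, 1]                      # responds to a rising step
signal = [0, 0, 1, 1, 0, 0, 0]
shifted = [0] + signal[:-1]           # same pattern, one position later
response = correlate1d(signal, kernel)
shifted_response = correlate1d(shifted, kernel)
print(shifted_response[1:] == response[:-1])  # True
```

Under these assumptions the convolutional layer uses a few thousand parameters where the dense layer uses several million, and its count stays the same if the image gets larger.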

CNNs Beyond Images

CNNs aren't limited to images. 1D convolutions process sequences (audio waveforms, time series). WaveNet (for speech synthesis) and some text classification models use 1D CNNs. In audio, spectrograms are treated as 2D images and processed with standard 2D CNNs. Even in the Transformer era, some hybrid architectures use convolutional layers for local feature extraction before feeding into attention layers.
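For sequences, the same idea runs along a single axis. The sketch below is a hypothetical `causal_conv1d` helper with an optional `dilation` argument, loosely in the spirit of the causal, dilated convolutions that WaveNet-style models are known for; it is not a reproduction of that architecture, and the kernel and time series are made up for the example.

```python
def causal_conv1d(signal, kernel, dilation=1):
    """1D convolution where output[t] depends only on the current and earlier samples.

    kernel[0] weights signal[t], kernel[1] weights signal[t - dilation], and so on.
    Samples before the start of the signal are treated as zero (left padding),
    so the output has the same length as the input.
    """
    out = []
    for t in range(len(signal)):
        total = 0.0
        for k, weight in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                total += weight * signal[idx]
        out.append(total)
    return out


# Made-up time series for illustration.
series = [1.0, 2.0, 4.0, 7.0, 11.0]

# Hand-written difference kernel; a trained model would learn its own weights.
step_change = causal_conv1d(series, [1.0, -1.0])               # change vs. previous sample
wide_change = causal_conv1d(series, [1.0, -1.0], dilation=2)   # change vs. two samples back

print(step_change)  # [1.0, 1.0, 2.0, 3.0, 4.0]
print(wide_change)  # [1.0, 2.0, 3.0, 5.0, 7.0]
```

Dilation lets the same two-weight filter look further back in time without adding parameters, which is one way stacked 1D convolutions can cover long sequences.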
