Zubnet AI Learning Wiki › Convolution
Fundamentals

Convolution

Conv, Convolutional Layer, Kernel, Filter
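A mathematical operation that detects local patterns by sliding a small filter (the kernel) across its input. In an image, a 3×3 kernel slides over every position and computes the dot product with the pixels underneath it, producing a feature map. Different kernels detect different patterns: horizontal edges, vertical edges, textures, and, in deeper layers (which operate on earlier feature maps rather than raw pixels), eventually complex features such as eyes or wheels.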
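The following minimal pure-Python sketch makes the sliding dot product concrete. The 5×6 test image and the two Sobel-style kernels were chosen for illustration. Like most deep-learning libraries, the function does not flip the kernel, so strictly speaking it computes cross-correlation.

```python
def conv2d(image, kernel):
    """Valid (no padding), stride-1 2D convolution over a 2D list.

    Like most deep-learning libraries, this does not flip the kernel,
    so strictly speaking it computes cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product of the kernel with the kh x kw patch whose top-left is (i, j).
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += kernel[di][dj] * image[i + di][j + dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Illustrative 5x6 grayscale image: dark (0) on the left, bright (9) on the right.
image = [[0, 0, 0, 9, 9, 9] for _ in range(5)]

# Sobel-style kernels, chosen here for illustration.
vertical_edge = [[-1, 0, 1],
                 [-2, 0, 2],
                 [-1, 0, 1]]
horizontal_edge = [[-1, -2, -1],
                   [0, 0, 0],
                   [1, 2, 1]]

v_map = conv2d(image, vertical_edge)    # 3x4 feature map
h_map = conv2d(image, horizontal_edge)  # 3x4 feature map
for row in v_map:
    print(row)
```

On this image the vertical-edge kernel responds (value 36) only in the columns where dark meets bright. The horizontal-edge kernel outputs zeros everywhere because nothing changes from row to row.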

Why it matters

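Convolution is the operation that made deep learning work for computer vision. It encodes two strong assumptions. The first is locality: nearby pixels are related. The second is translation equivariance: shifting the input shifts the resulting feature map by the same amount, so a pattern is detected the same way wherever it appears. Each kernel is small, and its weights are shared across all positions. As a result, a convolutional layer has drastically fewer parameters than a fully connected layer, which makes processing high-resolution images feasible. Even in the Transformer era, convolutions remain part of many hybrid architectures.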
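The parameter savings are easy to quantify. This sketch assumes a 224×224 RGB input and 64 output feature maps. It compares a 3×3 convolutional layer with a hypothetical fully connected layer that produces an output of the same shape. Both helper functions are illustrative, not a library API.

```python
def conv_params(k, c_in, c_out):
    """Weights plus biases of a k x k conv layer (one bias per output channel)."""
    return k * k * c_in * c_out + c_out


def dense_params(h, w, c_in, c_out):
    """Weights plus biases of a fully connected layer mapping an h x w x c_in
    image to an h x w x c_out output (every output unit sees every input)."""
    n_in, n_out = h * w * c_in, h * w * c_out
    return n_in * n_out + n_out


# Assumed example: 224x224 RGB input, 64 output feature maps.
conv = conv_params(3, 3, 64)
dense = dense_params(224, 224, 3, 64)
print(f"conv:  {conv:,}")
print(f"dense: {dense:,}")
```

The convolutional layer needs 1,792 parameters regardless of image size, because the same 3×3×3 weights are reused at every position. The dense layer needs about 4.8 × 10^11, and its count grows with the square of the pixel count.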

Deep Dive

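Consider a 3×3 kernel applied to a single-channel input. At each position, multiply the 9 kernel values by the 9 underlying input values and sum them to produce one output value. Then slide the kernel to the next position and repeat. For a multi-channel input such as an RGB image, the kernel spans all channels (e.g. 3×3×3), and the sum runs over all 27 products. A single kernel produces one feature map, which detects one pattern; multiple kernels produce multiple feature maps. Two additional parameters control the output size: stride, which sets how far the kernel moves each step, and padding, which adds extra values (usually zeros) around the border so the kernel can also be centered on edge pixels.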
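This sketch adds stride and zero padding to a plain-Python convolution. It checks them against the standard output-size formula, floor((n + 2p − k) / s) + 1, for input length n, kernel size k, padding p, and stride s. It assumes zero padding as the edge strategy (other schemes, such as reflection padding, exist), and the kernels are illustrative.

```python
def output_size(n, k, stride=1, padding=0):
    """Output length along one axis: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - k) // stride + 1


def zero_pad(image, p):
    """Surround a 2D list with p rows/columns of zeros on every side."""
    width = len(image[0]) + 2 * p
    padded = [[0] * width for _ in range(p)]
    for row in image:
        padded.append([0] * p + list(row) + [0] * p)
    padded.extend([0] * width for _ in range(p))
    return padded


def conv2d(image, kernel, stride=1, padding=0):
    """2D convolution (no kernel flip) with a square k x k kernel, stride, and zero padding."""
    k = len(kernel)
    x = zero_pad(image, padding)
    out_h = output_size(len(image), k, stride, padding)
    out_w = output_size(len(image[0]), k, stride, padding)
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            r, c = i * stride, j * stride  # top-left corner of the current window
            row.append(sum(kernel[a][b] * x[r + a][c + b]
                           for a in range(k) for b in range(k)))
        out.append(row)
    return out


image = [[1] * 6 for _ in range(6)]          # illustrative 6x6 input of ones
box = [[1] * 3 for _ in range(3)]            # sums its 3x3 window
center = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]   # copies the center pixel

same = conv2d(image, box, padding=1)         # 6x6 output ("same" size)
strided = conv2d(image, box, stride=2)       # 2x2 output, no padding
# Two kernels -> two feature maps (two output channels).
feature_maps = [conv2d(image, kern, padding=1) for kern in (box, center)]
```

With padding 1, a 3×3 kernel preserves the 6×6 size ("same" output). Border outputs are smaller than interior ones because part of each border window covers zeros. With stride 2 and no padding, the output shrinks to 2×2. Applying two kernels yields two feature maps, which a CNN layer stacks as output channels.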

Depth and Hierarchy

CNN layers typically use small kernels (such as 3×3). Each subsequent layer convolves over the previous layer's feature maps, so every unit depends on a progressively larger region of the original image (its receptive field). This lets deeper layers detect progressively more complex patterns. An idealized progression looks roughly like this. Layer 1: edges. Layer 2: corners and textures (combinations of edges). Layer 3: object parts (combinations of textures). Layer 4: objects (combinations of parts). This hierarchical feature learning is the fundamental mechanism behind CNNs' success in vision.
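How can stacked small kernels end up detecting whole objects? Each layer widens the receptive field, the patch of the original input that one output unit depends on. This sketch uses the standard recurrence for receptive-field size. The layer configurations are assumed for illustration, and pooling layers can be modeled the same way as (window size, stride) pairs.

```python
def receptive_fields(layers):
    """Receptive field (in input pixels, per axis) after each layer.

    `layers` is a list of (kernel_size, stride) pairs. Uses the recurrence
    r_l = r_{l-1} + (k_l - 1) * j_{l-1}, where j is the cumulative stride
    ("jump") between adjacent units in the previous layer.
    """
    r, jump = 1, 1
    fields = []
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
        fields.append(r)
    return fields


# Assumed example: four stacked 3x3 conv layers with stride 1.
print(receptive_fields([(3, 1)] * 4))
# Same depth, but each layer downsamples by 2.
print(receptive_fields([(3, 2)] * 4))
```

Four stride-1 3×3 layers see 3, 5, 7, and then 9 input pixels across. With stride 2 at every layer the growth is much faster (3, 7, 15, 31), which is one reason CNNs interleave downsampling with convolution.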

1D and 3D Convolutions

Convolutions aren't limited to 2D images. 1D convolutions process sequences (audio waveforms, time series, text) by sliding a kernel along one dimension. 3D convolutions process volumes (video, medical scans) by sliding along three dimensions. The principle is identical: local pattern detection with parameter sharing. Convolutions also remain competitive in modern designs. ConvNeXt is an all-convolutional vision model built on 2D convolutions. Hyena uses long 1D convolutions as a sub-quadratic alternative to attention.
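A 1D convolution is the same sliding dot product along a single axis. In this minimal sketch, the time series and both kernels are made up for illustration. The same three weights are reused at every position, which is the parameter sharing described above.

```python
def conv1d(signal, kernel, stride=1):
    """Valid 1D convolution (no kernel flip): slide `kernel` along one axis."""
    k = len(kernel)
    return [sum(w * signal[i + t] for t, w in enumerate(kernel))
            for i in range(0, len(signal) - k + 1, stride)]


series = [2, 2, 2, 5, 5, 5, 1, 1]   # illustrative time series: a rise, then a drop
change = [-1, 0, 1]                  # positive on rises, negative on drops
smooth = [1, 1, 1]                   # 3-step moving sum

print(conv1d(series, change))
print(conv1d(series, smooth))
```

The change kernel gives positive outputs around the rise from 2 to 5 and negative outputs around the drop to 1. The smoothing kernel produces a 3-step moving sum.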
