
CNN

Convolutional Neural Network, ConvNet
A neural network architecture designed to process grid-like data (images, audio spectrograms) by sliding small filters (kernels) across the input to detect local patterns like edges, textures, and shapes. CNNs dominated computer vision from 2012 (AlexNet) until Vision Transformers emerged around 2020. They're still widely used in production, especially on edge devices.

Why it matters

CNNs kicked off the deep learning revolution. AlexNet's 2012 ImageNet victory proved that deep neural networks could dramatically outperform hand-engineered features, triggering the current AI boom. Understanding CNNs helps you understand why Transformers work (many of the same ideas — hierarchical features, parameter sharing — apply), and CNNs remain the best choice for many vision tasks on resource-constrained devices.

Deep Dive

A CNN's core operation is convolution: a small filter (say 3×3 pixels) slides across the image, computing a dot product at each position to detect a specific pattern. Early layers learn simple patterns (edges, color gradients). Deeper layers combine these into increasingly complex features (eyes, wheels, faces). Pooling layers downsample between convolution layers, reducing spatial dimensions while preserving important features.
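The sliding-dot-product and pooling operations described above can be sketched in plain NumPy (a minimal illustration, not a production implementation — real frameworks add padding options, strides, channels, and learned filter weights):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` across `image`, computing a dot product at each
    position (valid padding, stride 1)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(x, size=2):
    """Downsample by taking the max over non-overlapping size x size blocks."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

# A vertical-edge detector (Sobel-style weights): responds strongly
# where brightness changes from left to right.
kernel = np.array([[-1., 0., 1.],
                   [-2., 0., 2.],
                   [-1., 0., 1.]])

# 6x6 toy image: dark left half, bright right half -> one vertical edge.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

response = conv2d(image, kernel)   # peaks in the columns straddling the edge
pooled = max_pool2d(response)      # half the spatial size, edge signal preserved
```

In a trained CNN the filter weights are learned by gradient descent rather than hand-set, but the mechanics are exactly this: the same small filter reused at every position.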

Why CNNs Work

Two key properties make CNNs efficient: translation equivariance (a cat is a cat regardless of where it appears in the image — the same filter detects it everywhere) and locality (nearby pixels are more related than distant ones). These properties drastically reduce the number of parameters compared to fully connected networks, making CNNs tractable for high-resolution images.
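The parameter savings from sharing filters across positions can be made concrete with a back-of-the-envelope comparison (the 224×224 input size and 64 output channels are illustrative assumptions, roughly matching common ImageNet-era configurations):

```python
# Illustrative sizes: 224x224 RGB input, 64 output feature maps, 3x3 filters.
h, w, c_in, c_out, k = 224, 224, 3, 64, 3

# Fully connected: every output unit gets its own weight for every input pixel.
fc_params = (h * w * c_in) * (h * w * c_out)

# Convolutional: one small filter per output channel, reused at every
# spatial position (weights + one bias per channel).
conv_params = (k * k * c_in) * c_out + c_out

print(f"fully connected: {fc_params:,} parameters")
print(f"convolutional:   {conv_params:,} parameters")
```

The convolutional layer here needs only 1,792 parameters versus hundreds of billions for the equivalent fully connected mapping — the gap that makes CNNs tractable for high-resolution images.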

CNNs Beyond Images

CNNs aren't limited to images. 1D convolutions process sequences (audio waveforms, time series). WaveNet (for speech synthesis) and some text classification models use 1D CNNs. In audio, spectrograms are treated as 2D images and processed with standard 2D CNNs. Even in the Transformer era, some hybrid architectures use convolutional layers for local feature extraction before feeding into attention layers.
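A 1D convolution is the same sliding dot product applied along a sequence axis. A small sketch with a hand-set difference filter (the 1D analogue of an edge detector) shows the idea; models like WaveNet learn their filter weights rather than fixing them:

```python
import numpy as np

def conv1d(signal, kernel):
    """1D convolution (valid padding, stride 1): a dot product of the
    kernel with each window of the sequence."""
    k = len(kernel)
    return np.array([np.dot(signal[i:i + k], kernel)
                     for i in range(len(signal) - k + 1)])

# A difference filter fires where the signal changes suddenly --
# useful for detecting onsets in a waveform or jumps in a time series.
signal = np.array([0., 0., 0., 1., 1., 1.])
kernel = np.array([-1., 1.])

changes = conv1d(signal, kernel)  # nonzero only at the step
```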
