Basics

Pooling

Max Pooling, Average Pooling
An operation that reduces the spatial dimensions of data by summarizing each region into a single value. Max pooling takes the maximum of each region; average pooling takes the mean. In CNNs, pooling layers downsample feature maps between convolutional layers. In Transformers, pooling merges token representations into a single vector (for example, for classification).

Why It Matters

Pooling is how neural networks move from local features to global understanding. A CNN might start with 224×224 feature maps and pool down to 7×7 by the final layer, progressively summarizing spatial information. In NLP, mean pooling over token embeddings is the standard way to create a single sentence embedding from a sequence of token representations.

Deep Dive

In CNNs: a 2×2 max pool with stride 2 takes every 2×2 region, keeps the maximum value, and halves each spatial dimension. This achieves two things: approximate translation invariance (small shifts in the input leave the output largely unchanged) and dimensionality reduction (fewer values to process in subsequent layers). Average pooling does the same but takes the mean, which preserves more information but is less robust to noise.
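
As a concrete illustration, here is a minimal NumPy sketch of both operations; the function name `pool2x2` and the even-sized single-channel input are assumptions for illustration, not a reference implementation:

```python
# A minimal sketch of 2x2 max and average pooling with stride 2,
# applied to a single-channel (H, W) feature map with even dimensions.
import numpy as np

def pool2x2(x: np.ndarray, mode: str = "max") -> np.ndarray:
    """Pool a (H, W) feature map with a 2x2 window and stride 2."""
    h, w = x.shape
    # Reshape so each 2x2 region becomes its own pair of axes.
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))   # keep the largest value per region
    return blocks.mean(axis=(1, 3))      # average pooling: mean per region

x = np.arange(16, dtype=float).reshape(4, 4)
print(pool2x2(x, "max"))   # 4x4 -> 2x2, each output is a region maximum
print(pool2x2(x, "avg"))   # same regions, mean instead of max
```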

Pooling in NLP

To create a fixed-size embedding from a variable-length sequence of tokens, you need to pool. Common strategies: [CLS] token pooling (use the representation of a special token, as in BERT), mean pooling (average all token representations — usually the best for sentence embeddings), max pooling (take the element-wise max across tokens), and weighted pooling (weight tokens by attention scores). Most embedding models use mean pooling for its simplicity and effectiveness.
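
A sketch of masked mean pooling in PyTorch might look like the following; the function name `mean_pool` and the shapes are illustrative assumptions. The mask matters: padding tokens must be excluded from the average, or short sequences get diluted embeddings.

```python
# Masked mean pooling: average token embeddings while ignoring padding.
# Shapes follow the common (batch, seq_len, dim) convention;
# attention_mask is 1 for real tokens, 0 for padding.
import torch

def mean_pool(token_embeddings: torch.Tensor,
              attention_mask: torch.Tensor) -> torch.Tensor:
    mask = attention_mask.unsqueeze(-1).float()    # (B, T, 1)
    summed = (token_embeddings * mask).sum(dim=1)  # zero out padding, then sum
    counts = mask.sum(dim=1).clamp(min=1e-9)       # real tokens per sequence
    return summed / counts                         # (B, D) sentence embeddings

emb = torch.randn(2, 5, 8)                 # 2 sequences, 5 tokens, dim 8
mask = torch.tensor([[1, 1, 1, 0, 0],      # first sequence: 3 real tokens
                     [1, 1, 1, 1, 1]])     # second sequence: no padding
sentence_emb = mean_pool(emb, mask)        # -> shape (2, 8)
```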

Global Average Pooling

In modern vision architectures, global average pooling replaces the fully connected layers that older CNNs used for classification. Instead of flattening the final feature map into a vector (which creates millions of parameters), global average pooling averages each feature map channel to a single number. This produces a compact representation with no learned parameters, acting as a strong regularizer. Vision Transformers use a similar approach with the [CLS] token.
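
A minimal PyTorch sketch, assuming a standard (batch, channels, height, width) feature map:

```python
# Global average pooling: each channel is averaged to a single number,
# producing a (batch, channels) vector with no learned parameters.
import torch

features = torch.randn(8, 512, 7, 7)       # (batch, channels, H, W)
pooled = features.mean(dim=(2, 3))          # -> (8, 512)

# Equivalent via the built-in layer, which also handles any spatial size:
gap = torch.nn.AdaptiveAvgPool2d(1)
pooled2 = gap(features).flatten(1)          # -> (8, 512)
```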
