Feedforward Network: Definition and Meaning (AI Wiki)

The component in every Transformer layer that processes each token independently through two linear transformations with an activation function in between. Whereas attention mixes information across tokens (which tokens relate to which), the feedforward network processes each token's representation individually, applying non-linear transformations that encode knowledge and perform computation.

Why It Matters

The feedforward network is where most of a Transformer's knowledge is stored. Attention gets all the credit, but the FFN layers hold most of the model's parameters (typically about 2/3 of the total), and this is where factual associations, language patterns, and learned computations mainly reside. Understanding this helps explain phenomena such as knowledge editing and model pruning.
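The 2/3 figure can be checked with simple arithmetic. The sketch below is a rough count for one Transformer layer under stated assumptions: four square attention projections (query, key, value, output), a standard two-matrix FFN with a 4x intermediate dimension, and biases, layer norms, and embeddings ignored. The function name `layer_param_counts` is hypothetical.

```python
def layer_param_counts(d_model: int, expansion: int = 4) -> dict:
    """Rough per-layer weight counts (assumption: biases and norms ignored)."""
    # Attention: query, key, value, and output projections, each d_model x d_model.
    attention = 4 * d_model * d_model
    # Standard FFN: W1 is d_ff x d_model and W2 is d_model x d_ff.
    d_ff = expansion * d_model
    ffn = 2 * d_model * d_ff
    return {
        "attention": attention,
        "ffn": ffn,
        "ffn_fraction": ffn / (attention + ffn),
    }


counts = layer_param_counts(d_model=4096)
print(counts)
```

Under these assumptions the FFN holds 8 of every 12 weight units per layer, which is where the two-thirds share comes from; real models deviate somewhat depending on embeddings and attention variants.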

Deep Dive

The standard FFN is FFN(x) = W2 · activation(W1 · x + b1) + b2, where W1 projects from the model dimension to a larger intermediate dimension (typically 4x), the activation function introduces non-linearity, and W2 projects back to the model dimension. Each position (token) passes through it independently: the FFN does not look at other tokens; only the attention layer does.
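A minimal NumPy sketch of this formula follows. It is an illustration, not any particular model's implementation: ReLU is assumed as the activation, the weights are random, and the sizes (`seq_len`, `d_model`) are arbitrary. Tokens are stored as rows, so W1 · x becomes `x @ W1.T`.

```python
import numpy as np


def relu(z):
    # Assumed activation; the formula allows any non-linearity here.
    return np.maximum(z, 0.0)


def ffn(x, W1, b1, W2, b2, activation=relu):
    """FFN(x) = W2 · activation(W1 · x + b1) + b2, applied to each row of x."""
    hidden = activation(x @ W1.T + b1)  # (seq_len, d_ff)
    return hidden @ W2.T + b2           # (seq_len, d_model)


rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
d_ff = 4 * d_model  # the typical 4x intermediate dimension

W1 = rng.normal(scale=0.1, size=(d_ff, d_model))
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.1, size=(d_model, d_ff))
b2 = np.zeros(d_model)
x = rng.normal(size=(seq_len, d_model))

out = ffn(x, W1, b1, W2, b2)

# Position independence: running one token alone gives the same result
# as that token's row in the full-sequence output.
single = ffn(x[2:3], W1, b1, W2, b2)[0]
print(out.shape, np.allclose(out[2], single))
```

The final check illustrates that no information flows between rows inside the FFN, so each token's output depends only on that token's input.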

SwiGLU and Gated Variants

Modern LLMs (LLaMA, Mistral, etc.) use SwiGLU instead of the standard FFN: SwiGLU(x) = W2 · (SiLU(W1 · x) ⊗ (W3 · x)), where ⊗ denotes element-wise multiplication. This adds a third weight matrix (W3) and a gating mechanism that lets the network control which information passes through. Because the extra matrix adds parameters, the intermediate dimension is reduced to compensate (commonly to about 8/3 of the model dimension instead of 4x), and at equivalent compute the gated variant still performs better. It is a case where a slightly more complex component improves the whole system.
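The gated variant can be sketched in NumPy as follows, using the common formulation W2 · (SiLU(W1 · x) ⊗ (W3 · x)). The assumptions here are loud ones: biases are omitted, weights are random, and the gated intermediate dimension is set to 2/3 of the standard 4x width so that three matrices cost the same number of weights as two. The sizes are chosen only so the arithmetic comes out exact.

```python
import numpy as np


def silu(z):
    # SiLU (also called swish): z * sigmoid(z)
    return z / (1.0 + np.exp(-z))


def swiglu_ffn(x, W1, W3, W2):
    """Gated FFN: W2 · (SiLU(W1 · x) ⊗ (W3 · x)), applied to each row of x."""
    gate = silu(x @ W1.T)          # decides how much of each channel passes
    value = x @ W3.T               # candidate information
    return (gate * value) @ W2.T   # element-wise gating, then project back


d_model = 12
d_ff_standard = 4 * d_model             # 48
d_ff_gated = (2 * d_ff_standard) // 3   # 32, i.e. 8/3 of d_model

# Weight counts: two matrices at width 48 vs. three matrices at width 32.
standard_params = 2 * d_model * d_ff_standard
gated_params = 3 * d_model * d_ff_gated

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(d_ff_gated, d_model))
W3 = rng.normal(scale=0.1, size=(d_ff_gated, d_model))
W2 = rng.normal(scale=0.1, size=(d_model, d_ff_gated))
x = rng.normal(size=(4, d_model))

out = swiglu_ffn(x, W1, W3, W2)
print(out.shape, standard_params, gated_params)
```

With these sizes the two designs have the same weight budget, which is the sense in which the intermediate dimension is adjusted to compensate for the third matrix.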

Knowledge Storage

Research suggests that FFN layers act like key-value memories: the first linear layer (W1) detects patterns in the input (keys), and the second linear layer (W2) maps those patterns to output updates (values). "The Eiffel Tower is in" activates specific neurons in W1, which then promote the token "Paris" through W2. This key-value interpretation explains why FFN layers store factual knowledge and why knowledge editing techniques can modify specific facts by updating specific FFN weights.
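The key-value reading follows directly from the algebra of the standard FFN, and a small NumPy sketch can make it concrete. Here each row of W1 is treated as a key and each column of W2 as a value; the output is then a sum of value vectors weighted by how strongly each key matches the input. The assumptions are a ReLU activation, no biases, and random weights, so this shows only the structure and not any learned facts.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 6, 10

W1 = rng.normal(size=(d_ff, d_model))  # row i is key i
W2 = rng.normal(size=(d_model, d_ff))  # column i is value i
x = rng.normal(size=d_model)           # one token's representation

# How strongly each key matches the input (ReLU assumed, biases omitted).
coeffs = np.maximum(W1 @ x, 0.0)

# The usual dense computation of the FFN output.
dense = W2 @ coeffs

# The same output written as a weighted sum over memory slots.
memory_sum = sum(coeffs[i] * W2[:, i] for i in range(d_ff))

print(np.allclose(dense, memory_sum))
```

Seen this way, changing a single column of W2 changes what one memory slot writes into the output, which is the intuition behind editing a fact by updating specific FFN weights.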
