Fundamentals

Cosine Similarity

Cosine Distance, Vector Similarity
A measure of similarity between two vectors based on the angle between them, ignoring their magnitudes. A cosine similarity of 1 means the vectors point in the same direction (identical meaning); 0 means they are perpendicular (unrelated); -1 means they point in opposite directions. It is the standard similarity metric for comparing text embeddings in semantic search, RAG, and recommendation systems.
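The three reference values above (1, 0, -1) can be checked directly with a minimal sketch, assuming NumPy:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (A · B) / (||A|| * ||B||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same = cosine_similarity(np.array([1.0, 0.0]), np.array([2.0, 0.0]))       # same direction -> 1.0
perp = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))       # perpendicular -> 0.0
opposite = cosine_similarity(np.array([1.0, 0.0]), np.array([-1.0, 0.0]))  # opposite -> -1.0
print(same, perp, opposite)
```

Note that `[1, 0]` and `[2, 0]` score a perfect 1.0 despite different lengths: only the angle matters.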

Why It Matters

Whenever you run a semantic search, use RAG, or compare embeddings, cosine similarity is (probably) the metric deciding what counts as "similar". Understanding it helps you debug retrieval quality, choose between cosine and its alternatives (dot product, Euclidean distance), and see why some searches miss seemingly obvious matches.

Deep Dive

The formula: cos(θ) = (A · B) / (||A|| × ||B||), where A · B is the dot product and ||A||, ||B|| are the vector magnitudes (lengths). By dividing by magnitudes, cosine similarity measures direction only — a vector [1, 2, 3] is identical in cosine similarity to [2, 4, 6] because they point the same way. This normalization is why cosine works well for embeddings: the direction encodes meaning, while magnitude can vary based on text length or model quirks.

Cosine vs. Dot Product

If embeddings are already normalized to unit length (magnitude 1), cosine similarity equals the dot product — and dot product is faster to compute (no division). Most embedding models output normalized vectors for exactly this reason. When using a vector database, check whether your embeddings are normalized: if yes, use dot product (faster). If not, use cosine similarity (correct regardless of normalization).
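The equivalence is easy to demonstrate: once both vectors are scaled to unit length, the denominator of the cosine formula is 1, so the dot product alone gives the same answer. A minimal sketch, assuming NumPy (the vectors are arbitrary stand-ins for model embeddings):

```python
import numpy as np

def normalize(v):
    # Scale a vector to unit length (magnitude 1)
    return v / np.linalg.norm(v)

a = normalize(np.array([3.0, 4.0]))
b = normalize(np.array([1.0, 2.0]))

dot = float(np.dot(a, b))
cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# For unit vectors, the denominator is 1, so the two are equal.
print(abs(dot - cos) < 1e-12)  # True
```

This is why vector databases often let you pick "dot product" as the distance metric when you know your embedding model already normalizes its output.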

Limitations

Cosine similarity treats all dimensions equally, but some embedding dimensions may be more important than others. It also measures overall direction similarity, which can miss nuanced differences. Two sentences about "Python programming" and "Python the snake" might have moderately high cosine similarity because they share the "Python" concept. More sophisticated similarity measures (learned metrics, cross-encoder reranking) can capture finer distinctions at higher computational cost.
