Basics

Cosine Similarity

Cosine Distance, Vector Similarity
A similarity measure based on the angle between two vectors, ignoring their magnitudes. A cosine similarity of 1 means the vectors point in the same direction (same meaning), 0 means they are orthogonal (unrelated), and -1 means they point in opposite directions. It is the standard similarity measure for comparing text embeddings in semantic search, RAG, and recommendation systems.

Why It Matters

Every time you run a semantic search, use RAG, or compare embeddings, cosine similarity is (most likely) the metric that decides what counts as "similar". Understanding it helps you debug retrieval quality, choose between cosine and its alternatives (dot product, Euclidean distance), and see why some searches miss obvious matches.

Deep Dive

The formula: cos(θ) = (A · B) / (||A|| × ||B||), where A · B is the dot product and ||A||, ||B|| are the vector magnitudes (lengths). By dividing by magnitudes, cosine similarity measures direction only — a vector [1, 2, 3] is identical in cosine similarity to [2, 4, 6] because they point the same way. This normalization is why cosine works well for embeddings: the direction encodes meaning, while magnitude can vary based on text length or model quirks.
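As a quick check of the formula, here is a minimal NumPy sketch (the cosine_similarity helper is ours for illustration, not a library function):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])    # same direction, twice the magnitude
c = np.array([-1.0, -2.0, -3.0]) # opposite direction

print(cosine_similarity(a, b))  # 1.0  (magnitude is ignored)
print(cosine_similarity(a, c))  # -1.0 (opposite direction)
```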

Cosine vs. Dot Product

If embeddings are already normalized to unit length (magnitude 1), cosine similarity equals the dot product — and dot product is faster to compute (no division). Most embedding models output normalized vectors for exactly this reason. When using a vector database, check whether your embeddings are normalized: if yes, use dot product (faster). If not, use cosine similarity (correct regardless of normalization).
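A short NumPy sketch of that equivalence, using random toy vectors in place of real embeddings: normalize once up front, and plain dot products reproduce the full cosine scores.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))  # 4 toy "embeddings", dimension 8

# Normalize each row to unit length once, up front.
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)

# After normalization, the dot product equals cosine similarity.
dot = unit @ unit.T
norms = np.linalg.norm(emb, axis=1)
cos = (emb @ emb.T) / np.outer(norms, norms)
print(np.allclose(dot, cos))  # True
```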

Limitations

Cosine similarity treats all dimensions equally, but some embedding dimensions may be more important than others. It also measures overall direction similarity, which can miss nuanced differences. Two sentences about "Python programming" and "Python the snake" might have moderately high cosine similarity because they share the "Python" concept. More sophisticated similarity measures (learned metrics, cross-encoder reranking) can capture finer distinctions at higher computational cost.
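As a sketch of the reranking idea mentioned above (assuming the sentence-transformers package is installed; the two model names are public checkpoints chosen for illustration, not a specific recommendation):

```python
# Cosine retrieval with a bi-encoder, then cross-encoder reranking.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "how to handle exceptions in Python programming"
docs = [
    "Python try/except blocks catch runtime errors.",
    "The Python snake is a nonvenomous constrictor.",
]

# Stage 1: bi-encoder embeddings scored with cosine similarity (cheap).
bi = SentenceTransformer("all-MiniLM-L6-v2")
emb_q = bi.encode(query)
emb_d = bi.encode(docs)
print(util.cos_sim(emb_q, emb_d))  # both docs may score moderately high

# Stage 2: cross-encoder reads query and document together (slower, finer).
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
print(ce.predict([(query, d) for d in docs]))  # the snake doc should drop
```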
