Basics

Cosine Similarity

Also known as: Cosine Distance, Vector Similarity
A similarity measure based on the angle between two vectors, ignoring their magnitudes. A cosine similarity of 1 means the vectors point in the same direction (same meaning); 0 means they are orthogonal (unrelated); -1 means they point in opposite directions. It is the standard similarity measure for comparing text embeddings in semantic search, RAG, and recommendation systems.

Why It Matters

Every time you run a semantic search, use RAG, or compare embeddings, cosine similarity is (most likely) the metric deciding what counts as "similar." Understanding it helps you debug retrieval quality, choose between cosine and its alternatives (dot product, Euclidean distance), and see why some searches miss obvious matches.

Deep Dive

The formula: cos(θ) = (A · B) / (||A|| × ||B||), where A · B is the dot product and ||A||, ||B|| are the vector magnitudes (lengths). By dividing by magnitudes, cosine similarity measures direction only — a vector [1, 2, 3] is identical in cosine similarity to [2, 4, 6] because they point the same way. This normalization is why cosine works well for embeddings: the direction encodes meaning, while magnitude can vary based on text length or model quirks.
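As a concrete illustration of the formula (a minimal NumPy sketch; the `cosine_similarity` helper is ours, not a library function):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (A . B) / (||A|| * ||B||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Direction-only: scaling a vector does not change its cosine similarity.
print(cosine_similarity(np.array([1, 2, 3]), np.array([2, 4, 6])))  # 1.0
# Orthogonal vectors score 0; opposite vectors score -1.
print(cosine_similarity(np.array([1, 0]), np.array([0, 1])))        # 0.0
print(cosine_similarity(np.array([1, 2]), np.array([-1, -2])))      # -1.0
```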

Cosine vs. Dot Product

If embeddings are already normalized to unit length (magnitude 1), cosine similarity equals the dot product — and dot product is faster to compute (no division). Most embedding models output normalized vectors for exactly this reason. When using a vector database, check whether your embeddings are normalized: if yes, use dot product (faster). If not, use cosine similarity (correct regardless of normalization).
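A quick sketch of that equivalence, using random 384-dimensional vectors as stand-ins for real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=384), rng.normal(size=384)  # stand-ins for two embeddings

# Normalize to unit length, as most embedding models already do.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot = np.dot(a_unit, b_unit)  # no division needed once vectors are unit-length

assert np.isclose(cosine, dot)  # identical up to floating-point error
```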

Limitations

Cosine similarity treats all dimensions equally, but some embedding dimensions may be more important than others. It also measures overall direction similarity, which can miss nuanced differences. Two sentences about "Python programming" and "Python the snake" might have moderately high cosine similarity because they share the "Python" concept. More sophisticated similarity measures (learned metrics, cross-encoder reranking) can capture finer distinctions at higher computational cost.
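As a sketch of the reranking idea, assuming the third-party sentence-transformers package and its publicly available MS MARCO cross-encoder checkpoint (the query and candidate strings here are made up for illustration):

```python
# Assumes: pip install sentence-transformers
from sentence_transformers import CrossEncoder

query = "How do I install Python packages?"
candidates = [
    "Use pip to install Python packages from PyPI.",
    "Pythons are nonvenomous snakes found in Africa and Asia.",
]

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# The cross-encoder reads the query and each candidate together, so it can
# separate "Python the language" from "Python the snake" better than a
# cosine score over independently computed embeddings.
scores = model.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(scores, candidates), reverse=True)
print(reranked[0][1])  # the pip answer should rank first
```

The trade-off mentioned above is visible here: the cross-encoder must run a full model forward pass per query-candidate pair, so it is typically used to rerank a short list retrieved by fast cosine search rather than to score an entire corpus.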
