Zubnet AI Learning Wiki › Semantic Search
Using AI

Semantic Search

Vector Search, Neural Search
Search that finds results based on meaning rather than exact keyword matching. Instead of finding documents that contain the word "fix," semantic search can find documents about "repair," "resolve," "patch," and "debug," because they mean similar things. It works by converting text into embeddings (numerical vectors) and finding the nearest matches in vector space.

Why It Matters

Semantic search is why modern search feels smarter than keyword search. It powers RAG systems, document search, e-commerce product discovery, and support ticket routing. If you're building any application that needs to find relevant information, semantic search is likely the right approach.

Deep Dive

The pipeline: (1) encode your documents into embeddings using a model like BGE, E5, or Voyage, (2) store these embeddings in a vector database (Pinecone, Qdrant, Weaviate, pgvector), (3) when a query arrives, encode it with the same model, (4) find the nearest embeddings using similarity metrics like cosine similarity or dot product. The query "how to fix a memory leak" matches a document titled "debugging RAM consumption in Node.js" because their embeddings are close in vector space.
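The retrieval step of this pipeline can be sketched in a few lines. This is a toy example: the hand-written vectors stand in for real embeddings (a production system would encode documents and queries with the same model, e.g. BGE or E5, and store them in a vector database), and the `search` helper is hypothetical, not a library API.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity: dot product of the vectors, normalized by their lengths."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-dimensional "embeddings" — real models produce hundreds of dimensions.
doc_embeddings = {
    "debugging RAM consumption in Node.js": np.array([0.9, 0.8, 0.1]),
    "choosing a CSS framework":             np.array([0.1, 0.2, 0.9]),
}

def search(query_embedding, docs, top_k=1):
    """Score every document against the query and return the closest matches."""
    scored = [(title, cosine_sim(query_embedding, emb)) for title, emb in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Pretend this is encode("how to fix a memory leak") — close to the Node.js doc
# in vector space even though the two texts share no keywords.
query = np.array([0.85, 0.75, 0.15])
print(search(query, doc_embeddings))
```

At real scale, the linear scan over all documents is replaced by an approximate nearest-neighbor index (what vector databases like Pinecone or Qdrant provide), but the similarity computation is the same idea.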

Hybrid Search

Pure semantic search has a weakness: it can miss exact matches that keyword search catches easily. If someone searches for error code "ERR_SSL_PROTOCOL_ERROR," semantic search might return general SSL troubleshooting instead of the exact error. Hybrid search combines both: keyword matching (BM25) for precision and semantic search for recall, then merges the results. Most production search systems use hybrid approaches.

Embedding Model Choice Matters

The quality of semantic search depends entirely on the embedding model. General-purpose models (OpenAI's text-embedding-3, Cohere Embed) work well for most text. Domain-specific models (trained on medical, legal, or code data) outperform general models in their domain. Multilingual models enable cross-language search. The MTEB leaderboard benchmarks embedding models across many tasks — it's the best resource for choosing one.

Related Concepts

← All Terms
← Self-Supervised Learning Sigmoid →