Zubnet AI Learning Wiki › Semantic Search
Using AI

Semantic Search

Vector Search, Neural Search
Search that finds results based on meaning rather than exact keyword matching. Instead of only returning documents that contain the word "fix," semantic search can surface documents about "repair," "resolve," "patch," or "debug," because they mean similar things. It works by converting text into embeddings (numerical vectors) and finding the nearest matches in vector space.
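A toy illustration of that idea: the 2D vectors below are hand-picked for this sketch (a real model produces hundreds of dimensions), but they show how related words end up near each other, so cosine similarity between them is high.

```python
import math

# Hand-picked toy "embeddings" (assumed values, not from a real model):
# related words get nearby vectors, unrelated words point elsewhere.
EMB = {
    "fix":    (0.9, 0.1),
    "repair": (0.85, 0.2),
    "banana": (0.1, 0.95),
}

def cosine(a, b):
    """Cosine similarity: dot product over the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

print(cosine(EMB["fix"], EMB["repair"]))  # high: similar meaning
print(cosine(EMB["fix"], EMB["banana"]))  # low: unrelated
```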

Why It Matters

Semantic search is why modern search feels smarter than keyword search. It powers RAG systems, document search, e-commerce product discovery, and support-ticket routing. If you are building any application that needs to find relevant information, semantic search is likely the right approach.

Deep Dive

The pipeline: (1) encode your documents into embeddings using a model like BGE, E5, or Voyage, (2) store these embeddings in a vector database (Pinecone, Qdrant, Weaviate, pgvector), (3) when a query arrives, encode it with the same model, (4) find the nearest embeddings using similarity metrics like cosine similarity or dot product. The query "how to fix a memory leak" matches a document titled "debugging RAM consumption in Node.js" because their embeddings are close in vector space.

Hybrid Search

Pure semantic search has a weakness: it can miss exact matches that keyword search catches easily. If someone searches for error code "ERR_SSL_PROTOCOL_ERROR," semantic search might return general SSL troubleshooting instead of the exact error. Hybrid search combines both: keyword matching (BM25) for precision and semantic search for recall, then merges the results. Most production search systems use hybrid approaches.
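One common way to merge the two result lists is Reciprocal Rank Fusion (RRF), sketched here; the two rankings and their document IDs are hypothetical, standing in for real BM25 and semantic-search output:

```python
# Reciprocal Rank Fusion: each list contributes 1/(k + rank) per document,
# so documents ranked well by BOTH the keyword and the semantic side win.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical outputs: BM25 puts the exact error-code page first,
# semantic search surfaces conceptually related docs.
keyword  = ["err_ssl_page", "ssl_basics", "tls_handshake"]
semantic = ["ssl_basics", "tls_handshake", "cert_renewal"]
print(rrf([keyword, semantic]))  # ssl_basics first: it scores in both lists
```

The constant `k` (60 is a conventional default) damps the advantage of the very top ranks so a single list cannot dominate the fusion.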

Embedding Model Choice Matters

The quality of semantic search depends entirely on the embedding model. General-purpose models (OpenAI's text-embedding-3, Cohere Embed) work well for most text. Domain-specific models (trained on medical, legal, or code data) outperform general models in their domain. Multilingual models enable cross-language search. The MTEB leaderboard benchmarks embedding models across many tasks — it's the best resource for choosing one.

Related Concepts

← All Terms
← Self-Supervised Learning
Sigmoid →