Tools

Vector Database

Also known as: Qdrant, Pinecone, Weaviate, ChromaDB

A database optimized for storing and searching embeddings (vectors). Unlike a traditional database, which matches exact keywords, a vector database finds the most semantically similar items. Ask it "how do I fix a memory leak" and it returns documents about "debugging RAM consumption," because their embeddings are close together.

Why It Matters

Vector databases are the storage layer that makes RAG work. Without one, every query would mean a brute-force comparison against every embedding in your knowledge base. They are also the backbone of recommendation systems and semantic search.

Deep Dive

A vector database stores high-dimensional vectors (typically 384 to 3072 floating-point numbers, depending on the embedding model) and supports fast nearest-neighbor search across millions or billions of them. The fundamental operation is: given a query vector, find the k vectors in the database that are closest to it, measured by cosine similarity, dot product, or Euclidean distance. Brute-force search (comparing the query against every stored vector) is exact but far too slow at scale. So vector databases use approximate nearest-neighbor (ANN) algorithms that trade a tiny amount of accuracy for massive speed gains — typically finding 95–99% of the true nearest neighbors while searching only a small fraction of the index.
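The exact brute-force search described above can be sketched in a few lines. This is a toy illustration with made-up three-dimensional vectors (real embeddings have hundreds or thousands of dimensions), showing the O(n) scan that ANN indexes exist to avoid:

```python
# Brute-force k-nearest-neighbor search by cosine similarity.
# Real vector databases replace this full scan with an ANN index (e.g. HNSW);
# the vectors here are tiny toy data for illustration only.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn(query, vectors, k=2):
    # Score every stored vector against the query -- O(n) work per query.
    scored = [(cosine(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)          # highest similarity first
    return [i for _, i in scored[:k]]

vectors = [
    [0.9, 0.1, 0.0],   # doc 0
    [0.0, 1.0, 0.2],   # doc 1
    [0.8, 0.2, 0.1],   # doc 2
]
print(knn([1.0, 0.0, 0.0], vectors, k=2))  # docs 0 and 2 point the same way
```

At a million vectors this scan becomes the bottleneck, which is exactly the gap ANN indexes fill.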

How the Index Works

The most common ANN algorithm is HNSW (Hierarchical Navigable Small World), used by Qdrant, Weaviate, pgvector, and many others. HNSW builds a multi-layered graph where each vector is a node connected to its nearest neighbors. Searching starts at the top layer (sparse, long-range connections) and drills down to lower layers (dense, short-range connections), like zooming in on a map. It's fast, accurate, and works well for datasets up to a few hundred million vectors. The trade-off is memory: HNSW keeps the graph in RAM, so you need enough memory to hold your vectors plus the graph overhead. For a million 1536-dimensional vectors (OpenAI's ada-002 output), that's roughly 6–8 GB. Alternatives like IVF (inverted file index) and ScaNN use less memory but require more tuning. Pinecone and some Qdrant configurations use quantization — compressing vectors from float32 to int8 or binary — to fit more vectors in the same memory at the cost of slight accuracy loss.
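The memory figure above can be reproduced with a back-of-envelope estimate. This is an assumption-laden sketch, not any vendor's sizing formula: the link count and bytes-per-link values are illustrative defaults, and real HNSW overhead varies with the index's M and layer parameters.

```python
# Rough HNSW memory estimate: raw float32 vectors plus per-node graph links.
# links_per_node and bytes_per_link are illustrative assumptions, not
# values taken from any particular database's implementation.
def hnsw_memory_gb(n_vectors, dim, links_per_node=32, bytes_per_link=4):
    vector_bytes = n_vectors * dim * 4                    # float32 = 4 bytes
    graph_bytes = n_vectors * links_per_node * bytes_per_link
    return (vector_bytes + graph_bytes) / 1e9

# 1M vectors at 1536 dimensions (ada-002 output) lands in the 6-8 GB range.
print(f"{hnsw_memory_gb(1_000_000, 1536):.1f} GB")
```

Quantization attacks the `vector_bytes` term directly: int8 cuts it by 4x and binary by 32x, which is why it matters so much for fitting large collections in RAM.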

Picking Your Database

Choosing between the major vector databases depends on your constraints. Qdrant and Weaviate are open-source and self-hostable, which matters for data privacy and cost control — you run them on your own infrastructure and pay only for compute. Pinecone is fully managed (no infra to operate) but vendor-locked and priced per vector, which gets expensive at scale. ChromaDB is lightweight and embedded (runs in-process, stores to disk), great for prototyping and small datasets but not built for production workloads with millions of vectors. PostgreSQL with the pgvector extension is appealing if you already run Postgres, since you avoid adding a new database to your stack, but its query performance falls behind purpose-built vector databases at larger scales. For most production RAG systems, Qdrant or Weaviate give you the best balance of performance, features, and operational control.

Filtering Matters

Metadata filtering is a feature that separates serious vector databases from toy implementations. In practice, you almost never want to search your entire collection — you want to search "all documents uploaded by this user" or "only documents from the last 30 days" or "only chunks from this specific PDF." Vector databases let you store metadata alongside each vector and apply filters before or during the similarity search. This is called pre-filtering (filter first, then search the reduced set) or post-filtering (search everything, then discard results that don't match). Pre-filtering is more efficient but requires the index to support it; most production databases now do. Getting your metadata schema right at indexing time saves enormous pain later — retrofitting filters onto a collection that wasn't designed for them often means re-indexing everything.
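The pre- versus post-filtering distinction can be made concrete with a toy example. The data and field names below are made up, and the "score" field stands in for a precomputed query similarity; real databases apply these filters inside the index, not in Python loops:

```python
# Toy demonstration of pre-filtering vs post-filtering over made-up data.
# "score" stands in for similarity to some query vector.
docs = [
    {"id": 0, "user": "alice", "score": 0.95},
    {"id": 1, "user": "bob",   "score": 0.90},
    {"id": 2, "user": "alice", "score": 0.40},
]

def pre_filter_search(docs, user, k):
    # Filter first, then rank only the reduced candidate set.
    candidates = [d for d in docs if d["user"] == user]
    return sorted(candidates, key=lambda d: -d["score"])[:k]

def post_filter_search(docs, user, k):
    # Rank everything, take top-k, then discard non-matching results.
    top = sorted(docs, key=lambda d: -d["score"])[:k]
    return [d for d in top if d["user"] == user]

print([d["id"] for d in pre_filter_search(docs, "alice", k=2)])   # both alice docs
print([d["id"] for d in post_filter_search(docs, "alice", k=2)])  # bob's doc crowds one out
```

Note the failure mode: post-filtering can return fewer than k results when non-matching documents occupy the top ranks, which is one reason pre-filtering support in the index matters.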

Still Maturing Fast

Vector databases existed before the current AI wave — Spotify used approximate nearest-neighbor search for music recommendations years ago, and Facebook's Faiss library has been around since 2017. But the explosion of embedding models and RAG in 2023–2024 turned them from a niche technology into critical infrastructure. The space is still maturing fast: multi-tenancy (efficiently isolating data between customers in a shared deployment), hybrid search (combining vector and keyword search in a single query), and on-disk indexing (handling datasets larger than RAM) are all areas where the products differ significantly and are improving rapidly. If you're starting a project today, pick a database that handles your current scale, supports metadata filtering and hybrid search, and has an active maintenance trajectory. You can always migrate later — the embedding vectors are portable.
