Salesforce AI Research has released VoiceAgentRAG, an open-source system that attacks voice AI's fundamental latency problem through predictive caching. The dual-agent architecture pairs a "Fast Talker" that serves responses from an in-memory semantic cache in 0.35ms with a "Slow Thinker," a background agent that monitors the conversation and pre-fetches likely follow-up topics into local storage. The result is a reported 316x speedup over standard vector database queries, which typically consume 50-300ms: the entire response budget for natural conversation.

This tackles the core constraint holding back voice agents today. While text-based RAG systems can afford multi-second "thinking" pauses, voice interfaces need sub-200ms responses to feel natural. Most production systems blow this budget on database roundtrips before the LLM even starts generating. The timing constraint explains why voice assistants still feel clunky compared to text interfaces: it's not just the models, it's the infrastructure.
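The arithmetic makes the point concrete. Using the sub-200ms budget and the 50-300ms query range above (the 120ms mid-range roundtrip below is an assumed example figure, not a measurement):

```python
VOICE_BUDGET_MS = 200      # rough ceiling for a natural-feeling spoken reply
db_roundtrip_ms = 120      # assumed mid-range value from the 50-300ms band
cache_hit_ms = 0.35        # reported in-memory semantic-cache lookup time

# Time left for the LLM to start generating after retrieval:
print(VOICE_BUDGET_MS - db_roundtrip_ms)  # 80 (and negative at the 300ms end)
print(VOICE_BUDGET_MS - cache_hit_ms)     # ~199.65: nearly the whole budget
```

A mid-range database query already eats more than half the budget, and a slow one overdraws it entirely before a single token is produced.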

The technical implementation reveals smart engineering choices. Rather than indexing queries, VoiceAgentRAG's FAISS-based semantic cache indexes document embeddings directly, enabling proper semantic search even when user phrasing differs from predictions. The system uses a 0.40 cosine similarity threshold (lower than typical 0.95 query-to-query thresholds) and maintains cache freshness with 300-second TTL and LRU eviction. The background agent generates "document-style descriptions" rather than questions to better align embeddings with actual knowledge base content.
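A minimal sketch of how such a cache might behave, combining the reported 0.40 similarity threshold, 300-second TTL, and LRU eviction. It uses plain NumPy cosine similarity in place of VoiceAgentRAG's FAISS index, and the class name, method names, and sizes are illustrative, not the project's actual API:

```python
import time
from collections import OrderedDict
import numpy as np

class SemanticCache:
    """Illustrative TTL + LRU cache searched by embedding similarity."""

    def __init__(self, threshold=0.40, ttl=300.0, max_size=128):
        self.threshold = threshold     # doc-to-query cutoff, hence lower than 0.95
        self.ttl = ttl                 # seconds before an entry goes stale
        self.max_size = max_size
        self._entries = OrderedDict()  # key -> (embedding, document, inserted_at)

    @staticmethod
    def _cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def put(self, key, embedding, document):
        if key in self._entries:
            self._entries.pop(key)
        elif len(self._entries) >= self.max_size:
            self._entries.popitem(last=False)   # evict least-recently-used entry
        self._entries[key] = (np.asarray(embedding, dtype=float),
                              document, time.monotonic())

    def get(self, query_embedding):
        now = time.monotonic()
        q = np.asarray(query_embedding, dtype=float)
        best_key, best_sim = None, self.threshold
        for key, (emb, _, ts) in list(self._entries.items()):
            if now - ts > self.ttl:
                del self._entries[key]          # expired: drop stale entry
                continue
            sim = self._cos(q, emb)
            if sim >= best_sim:
                best_key, best_sim = key, sim
        if best_key is None:
            return None                         # miss: fall back to the vector DB
        self._entries.move_to_end(best_key)     # refresh LRU position on a hit
        return self._entries[best_key][1]
```

The linear scan here is where FAISS earns its keep in the real system; the point of the sketch is the policy layer, not the search, and the low document-to-query threshold matters because a user's phrasing resembles a document far less than it resembles another query.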

For developers building voice interfaces, this represents a clear path forward. The open-source release means you can implement predictive caching without rebuilding from scratch. But the real insight is architectural: decoupling retrieval from generation through async background processing. This pattern will likely become standard for any real-time AI application where latency matters more than perfect accuracy.
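The decoupling pattern can be sketched with Python's asyncio: a background worker does the slow retrieval while the foreground path only ever touches the cache. The function names, the plain-dict cache, and the queue protocol here are hypothetical stand-ins, not VoiceAgentRAG's API:

```python
import asyncio

async def fetch_from_vector_db(topic):
    """Stand-in for a slow vector-DB roundtrip (the 50-300ms path)."""
    await asyncio.sleep(0.1)
    return f"docs about {topic}"

async def slow_thinker(topic_queue, cache):
    """Background agent: drains predicted topics and prefetches into the cache."""
    while True:
        topic = await topic_queue.get()
        if topic is None:              # shutdown sentinel
            break
        cache[topic] = await fetch_from_vector_db(topic)
        topic_queue.task_done()

async def fast_talker(query, cache):
    """Foreground agent: answers from the cache, never blocks on the DB."""
    return cache.get(query)            # a dict lookup: microseconds, not milliseconds

async def main():
    cache, queue = {}, asyncio.Queue()
    worker = asyncio.create_task(slow_thinker(queue, cache))
    await queue.put("billing")         # the Slow Thinker predicted a follow-up topic
    await queue.join()                 # demo only: wait for the prefetch to land
    hit = await fast_talker("billing", cache)
    await queue.put(None)
    await worker
    return hit

print(asyncio.run(main()))             # -> docs about billing
```

In a real deployment the foreground path would fall back to the database on a miss; the architectural win is that the slow path runs speculatively, off the conversational clock.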