VectorSmuggle uses rotation to hide up to 1,920 bytes in each embedding; DLP cannot see it, Zubnet AI News

Jascha Wanger of ThirdKey released VectorSmuggle this week: an Apache-2.0-licensed research framework that demonstrates six steganographic techniques for hiding arbitrary data inside vector embeddings, along with a companion cryptographic defense, VectorPin. The framework catalogs the ways an attacker with access to the ingestion pipeline can perturb embeddings (noise injection, rotation, rescaling, offset, splitting across multiple embedding models) so that the resulting vectors still return the correct documents for legitimate searches while carrying hidden payloads through the RAG stack to an attacker-controlled destination. The accompanying preprint tests the techniques on OpenAI's text-embedding-3-large and four open models (Nomic, EmbeddingGemma, Snowflake Arctic, MXBai) against a corpus of 26,000+ chunks (synthetic PII, the NFCorpus medical research set from BEIR, and a subset of Quora questions). Payloads survive FAISS, Chroma, and Qdrant in both standard and scalar-quantized configurations.

The most important finding splits along a clean line. The noise, scaling, and offset techniques perturb the statistical distribution of the embeddings enough that off-the-shelf anomaly detectors catch them reliably, so this part of the threat is manageable with current tooling. Rotation is different. Because rotating a vector preserves its mathematical relationships with every other vector in the space, anomaly detectors missed it in every model/corpus combination tested. At the theoretical ceiling, a rotation-based encoder can carry roughly 1,920 bytes of hidden payload per vector at a zero error rate. When the researchers gave attackers full knowledge of the defending detectors, the attackers evaded detection entirely. The paper's conclusion is blunt: statistical anomaly detection works as a first filter but cannot be the primary control. There is also a structural reason existing DLP tools cannot help here: they were built to scan document content and network traffic. They read documents and they read packets, and they read none of the high-dimensional numerical vectors that sit in between.
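The geometric point behind that split can be checked in a few lines. The sketch below is not VectorSmuggle's encoder, whose payload scheme is not described here; it is a minimal illustration, assuming a single plane (Givens) rotation applied consistently to two toy vectors. The dimension, angle, axis indices, and the 5% scale factor are arbitrary assumptions chosen for the demo. It shows that a rotation leaves vector norms and pairwise cosine similarity unchanged, while a rescale shifts the norm in a way a simple statistical check could flag.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def givens_rotate(v, i, j, theta):
    """Rotate v by angle theta in the plane spanned by axes i and j."""
    out = list(v)
    c, s = math.cos(theta), math.sin(theta)
    out[i] = c * v[i] - s * v[j]
    out[j] = s * v[i] + c * v[j]
    return out


# Toy stand-ins for embeddings (assumption: 8 dimensions, Gaussian entries).
rng = random.Random(0)
dim = 8
a = [rng.gauss(0, 1) for _ in range(dim)]
b = [rng.gauss(0, 1) for _ in range(dim)]

# The same rotation applied to both vectors.
theta = 0.7
a_rot = givens_rotate(a, 2, 5, theta)
b_rot = givens_rotate(b, 2, 5, theta)

# A rescale of one vector, for contrast.
a_scaled = [1.05 * x for x in a]

norm_rot_delta = abs(norm(a_rot) - norm(a))                    # stays at float noise
cos_rot_delta = abs(cosine(a_rot, b_rot) - cosine(a, b))       # stays at float noise
norm_scaled_delta = abs(norm(a_scaled) - norm(a))              # clearly nonzero

print(f"rotation: norm change {norm_rot_delta:.2e}, cosine change {cos_rot_delta:.2e}")
print(f"rescale:  norm change {norm_scaled_delta:.2e}")
```

A detector that only looks at norms or pairwise similarities sees nothing unusual in the rotated pair, which is consistent with the article's point that statistical checks alone cannot be the primary control.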

The ecosystem read ties this work to a broader infrastructure-attack-surface thread. Wanger's framing in the Help Net Security piece is worth quoting at length: "Almost all current AI security work is happening at the model layer. Prompt injection, jailbreaks, output filtering, alignment: that is the visible surface, and that is where the conferences and the funding go. The infrastructure layer underneath (embeddings, vector stores, tool contracts, agent identity) has largely been treated as 'plumbing.' Plumbing is exactly where attackers go when the front door is heavily guarded." This connects directly to Palo Alto's 109:1 NHI ratio published the same day, to last week's Hugging Face open-OSS malware story, and to AWS WorkSpaces' MCP-agent identity model. All are different views of the same observation: the AI-infrastructure layer is where attackers will spend the next two to three years and where defensive tooling lags furthest behind. VectorSmuggle's "novel-attack-class" quality is what raises the priority of this category: it is not a configuration mistake or a missing patch, it is a property of the vector-embedding format itself.

For builders running RAG in production, Wanger's specific question is the one to ask security leadership tomorrow morning: "What visibility do we have into the content of the vector embeddings leaving our network, and who is responsible for monitoring that channel?" His assessment for most companies is "no visibility and no one"; if that is your answer too, that is the audit finding. Concrete defenses worth evaluating: VectorPin's cryptographic-signature approach (Python and Rust reference implementations in the same repo), egress monitoring on embedding API endpoints (treat them like S3 buckets that can leak), and tighter ingestion-pipeline controls, since anyone who can mutate vectors before they reach the database is a potential exfiltration point. For builders shipping vector-DB products, the FAISS, Chroma, and Qdrant survival result deserves to be taken seriously: downstream defenses at the DB layer are not yet sufficient. The repository and preprint are linked from the Help Net Security piece.
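The general idea of signing embeddings at ingestion can be sketched with the Python standard library. This is not VectorPin's API or its actual scheme, which the article does not describe; it is an assumption-laden illustration that uses an HMAC as a stand-in for a cryptographic signature. The function names `pin_embedding` and `verify_embedding`, the demo key, and the float32 serialization are all hypothetical choices made for this example.

```python
import hashlib
import hmac
import struct


def pin_embedding(key: bytes, source_text: str, vector: list) -> str:
    """Return a hex tag binding a vector to the text it was computed from.

    Hypothetical helper: HMAC-SHA256 over the text digest plus the
    little-endian float32 bytes of the vector.
    """
    text_digest = hashlib.sha256(source_text.encode("utf-8")).digest()
    vector_bytes = struct.pack(f"<{len(vector)}f", *vector)
    return hmac.new(key, text_digest + vector_bytes, hashlib.sha256).hexdigest()


def verify_embedding(key: bytes, source_text: str, vector: list, tag: str) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = pin_embedding(key, source_text, vector)
    return hmac.compare_digest(expected, tag)


# Demo values only; a real deployment would keep the key in a secrets manager.
key = b"demo-key-not-for-production"
text = "patient record 17"
vec = [0.12, -0.50, 0.33, 0.90]

# Tag is produced where the embedding model output is first trusted.
tag = pin_embedding(key, text, vec)

# Any later mutation of the vector, however small, invalidates the tag.
tampered = list(vec)
tampered[2] += 1e-3

ok_original = verify_embedding(key, text, vec, tag)
ok_tampered = verify_embedding(key, text, tampered, tag)
print(ok_original, ok_tampered)
```

The design point is that integrity is checked against the exact bytes produced at embedding time, so a perturbation that is statistically invisible still fails verification. Where the tag is computed and who holds the key matter: a check of this kind only helps if the signing step sits before any component an attacker could control.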
