VectorSmuggle hides 1,920 bytes per embedding via rotation; DLP can't see it, Zubnet AI News

Jascha Wanger of ThirdKey released VectorSmuggle this week: an Apache-2.0-licensed research framework demonstrating six steganographic techniques for hiding arbitrary data inside vector embeddings, paired with a companion cryptographic defense called VectorPin. The framework catalogs ways an attacker with access to an ingestion pipeline can perturb embeddings (including noise injection, rotation, rescaling, offset, and splitting across multiple embedding models) so that the resulting vectors still return the correct documents for legitimate searches while carrying hidden payloads through the RAG stack and out to a destination the attacker controls. The companion preprint tested the techniques against OpenAI's text-embedding-3-large plus four open models (Nomic, EmbeddingGemma, Snowflake Arctic, MXBai) across a corpus of more than 26,000 chunks spanning synthetic PII, BEIR's NFCorpus medical research set, and a Quora question subset. The payloads survive passage through FAISS, Chroma, and Qdrant in both standard and scalar-quantized configurations.

The finding that matters most splits along a clean line. Noise, scaling, and offset techniques perturb the statistical distribution of the embeddings enough that off-the-shelf anomaly detectors catch them reliably, so that part of the threat is manageable with current tooling. Rotation does not. Because rotating a vector preserves its mathematical relationships with every other vector in the space, anomaly detectors miss it across every model and corpus combination tested. At the theoretical ceiling, a rotation-based encoder can carry approximately 1,920 bytes of hidden payload per vector at zero error rate. When researchers gave attackers full knowledge of the defending detectors, the attackers evaded detection entirely. The conclusion the paper draws is direct: statistical anomaly detection works as a first filter but should not be the primary control. The structural reason existing DLP tools cannot help here is that they were built to scan document content and network traffic: they read documents and packets, but not the high-dimensional numerical vectors that sit between the two.
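The geometric point about rotation can be checked in a few lines. The sketch below is not the VectorSmuggle encoder. It assumes a toy pair of 8-dimensional vectors, a single plane rotation with a hypothetical angle, and Gaussian noise as the comparison perturbation. It only illustrates why statistics built on vector norms and pairwise similarity do not move under rotation, while additive noise shifts them.

```python
import math
import random


def givens_rotate(vec, i, j, theta):
    """Rotate vec by theta radians in the plane spanned by coordinates i and j."""
    out = list(vec)
    c, s = math.cos(theta), math.sin(theta)
    out[i] = c * vec[i] - s * vec[j]
    out[j] = s * vec[i] + c * vec[j]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))


rng = random.Random(0)  # fixed seed so the run is repeatable
dim = 8                 # toy dimension; real embeddings are far larger
a = [rng.gauss(0, 1) for _ in range(dim)]
b = [rng.gauss(0, 1) for _ in range(dim)]

theta = 0.7  # hypothetical rotation angle, chosen only for illustration
a_rot = givens_rotate(a, 2, 5, theta)
b_rot = givens_rotate(b, 2, 5, theta)

# Comparison case: additive Gaussian noise on the same vector.
a_noisy = [x + rng.gauss(0, 0.5) for x in a]

norm_drift_rotation = abs(norm(a_rot) - norm(a))
cos_drift_rotation = abs(cosine(a_rot, b_rot) - cosine(a, b))
norm_drift_noise = abs(norm(a_noisy) - norm(a))

print("norm drift after rotation:  ", norm_drift_rotation)
print("cosine drift after rotation:", cos_drift_rotation)
print("norm drift after noise:     ", norm_drift_noise)
```

The rotation drifts come out at floating-point rounding level, while the noise drift is visibly larger. A detector that watches only these quantities has nothing to flag in the rotated case; real detectors may use other features, which this toy does not model.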

The ecosystem read connects this work to the broader infrastructure-attack-surface thread. Wanger's framing in the Help Net Security piece is the part to lift verbatim: "Almost all current AI security work is happening at the model layer. Prompt injection, jailbreaks, output filtering, alignment. That is the visible surface, and it is where the conference talks and the funding go. The infrastructure layer underneath, the embeddings, the vector stores, the tool contracts, the agent identity, has been largely treated as plumbing. Plumbing is exactly the place attackers go when the front door is heavily defended." This connects directly to Palo Alto's NHI 109:1 ratio published the same day, to the Hugging Face open-OSS malware story from last week, and to AWS WorkSpaces' MCP-agent identity model. All of them are views of the same observation: the AI-infrastructure layer is where attackers will spend the next two to three years and where defensive tooling lags furthest behind. What raises the priority on this category is that VectorSmuggle is a novel attack class: it is not a configuration mistake or a missing patch but a property of the vector-embedding format itself.

For builders running RAG in production, Wanger's specific question is the one to ask security leadership tomorrow morning: "What is our visibility into the contents of the vector embeddings leaving our network, and who is responsible for monitoring that channel?" His assessment of where most companies stand is "no visibility and no one"; if that is your answer too, that is the audit finding. The concrete defenses to evaluate are VectorPin's cryptographic-signature approach (Python and Rust reference implementations in the same repo), egress monitoring on embedding API endpoints (treat them like S3 buckets that can leak), and tighter ingestion-pipeline controls, since anyone who can mutate vectors before they hit the database is a potential exfiltration point. For builders shipping vector-DB products, the result that payloads survive FAISS, Chroma, and Qdrant is the part to take seriously; downstream defenses at the DB layer are not currently sufficient. The repository and preprint are linked from the Help Net Security piece.
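To make the signature idea concrete, here is a minimal sketch of binding an embedding to its source text at the moment it is produced. It is not VectorPin's API or wire format. It swaps in an HMAC-SHA256 tag from the Python standard library (a shared-key MAC, not a public-key signature), and the function names `pin_embedding` and `verify_embedding`, the demo key, the model label, and the toy vector are all hypothetical.

```python
import hashlib
import hmac
import struct


def pin_embedding(key, source_text, model, vector):
    """Return a hex tag binding the exact vector bytes to its source text and model."""
    msg = hashlib.sha256(source_text.encode("utf-8")).digest()  # fixed 32 bytes
    msg += model.encode("utf-8") + b"\x00"                      # delimiter after model name
    msg += struct.pack(f"<{len(vector)}d", *vector)             # little-endian float64s
    return hmac.new(key, msg, hashlib.sha256).hexdigest()


def verify_embedding(key, source_text, model, vector, tag):
    """Recompute the tag and compare in constant time."""
    expected = pin_embedding(key, source_text, model, vector)
    return hmac.compare_digest(expected, tag)


# Hypothetical demo values, for illustration only.
key = b"demo-key-not-for-production"
text = "Patient record 0042: synthetic example"
model = "example-embedding-model"
vector = [0.12, -0.5, 0.33, 0.08]

tag = pin_embedding(key, text, model, vector)

# A small perturbation of one coordinate, as a stand-in for a hidden payload.
tampered = list(vector)
tampered[0] += 1e-6

original_ok = verify_embedding(key, text, model, vector, tag)
tampered_ok = verify_embedding(key, text, model, tampered, tag)

print("original verifies:", original_ok)
print("tampered verifies:", tampered_ok)
```

A tag like this only detects changes made after it is computed, so in this sketch it would have to be attached where the embedding model's output is first received, before any pipeline stage that could mutate the vector. It also requires storing the exact float values, so a quantizing store would need the tag computed over the quantized form.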
