MetaBackdoor: LLM backdoor लंबाई पर trigger, 90 poisons, 700+ tokens पर 75%, Zubnet AI समाचार

Microsoft और Institute of Science Tokyo ने 18 मई को MetaBackdoor disclose किया — एक LLM backdoor attack जो content के बजाय input length पर trigger होता है, उन defenses की पूरी class को bypass करता है जो suspicious tokens या anomalous text ढूँढ़ती हैं। Mechanism: एक attacker जिसके पास fine-tuning data तक access है lengthy inputs को malicious outputs के साथ pair करके examples को poison करता है। Model सीखता है कि जब input एक length threshold cross करे तो attack mode में switch हो जाए। मात्र 90 poisoned examples behavior को embed करने के लिए काफ़ी हैं। Attack 700 tokens से ज़्यादा conversation lengths पर tool calls के through autonomous data exfiltration पर 75% पर सफल होता है, और substantial retraining के बाद भी लगभग 40% पर persist करता है।

Architectural insight signal-channel है। Current defenses — prompt-injection scanners, content filters, anomaly detectors — सब input content पर operate करते हैं। वो देखते हैं tokens में क्या है। MetaBackdoor input length को trigger signal के रूप में use करता है, जिसका मतलब content-side defenses पूरी तरह गलत axis देख रही हैं। Writeup सीधा है: "Content filters के पास filter करने को कुछ नहीं है। Anomaly detectors ordinary text देखते हैं।" यह defense failure नहीं है — यह defense category mismatch है। Training-time attack inference-time content inspection को structurally invisible है। Builders के लिए corollary यह है कि input-shape (length, token-type distribution, request frequency) एक signal channel है जिसे defenses instrument नहीं कर रही हैं।

Threshold matter करता है: 700+ tokens वो typical conversation length है जहाँ ज़्यादातर production agent interactions बैठते हैं। Multi-turn chat agents, long-context coding agents, RAG pipelines, tool-call cycles — सभी normal use के अंदर उस threshold को पार करते हैं। 90-example poisoning footprint भी इतना छोटा है कि RLHF contractor outputs, customer feedback datasets, या public fine-tuning corpora में detection के बिना घुस जाए। यह MetaBackdoor को उसी threat class में रखता है जिसमें Anthropic की sleeper-agents research और विभिन्न dataset-poisoning papers हैं — लेकिन specific contribution यह है कि trigger को inference time पर attacker द्वारा controlled unique token या phrase होने की ज़रूरत नहीं है। Trigger input shape की एक property है, जिसे attacker guarantee कर सकता है यह सुनिश्चित करके कि application के normal use patterns threshold को cross करें। यह attack को model deploy होने के बाद "fire-and-forget" बनाता है।

सोमवार: अगर आप किसी भी third party (RLHF vendor, customer feedback, public dataset) के data पर एक foundation model fine-tune करते हो, MetaBackdoor आपके supply-chain risk model में एक नया threat vector जोड़ता है — आपके foundation-model provenance और आपके fine-tuning dataset provenance दोनों को vendor-risk treatment चाहिए। Red-team testing के लिए, recommended check varying input lengths पर behavioral consistency है — अपने fine-tuned model को same prompt से 100, 500, 1000, 2000 tokens पर query करो और outputs को divergence के लिए compare करो। अगर आपका stack agentic tool calls use करता है, 700-token threshold आपकी line है: उस conversation depth के बाद fire होने वाले tool calls के लिए human-in-the-loop confirmation implement करो। गहरा open question: defenses को content inspection से expand करना होगा input-shape signal monitoring तक पूरे pipeline भर। यह आज ज़्यादातर teams के पास जो है उससे significantly different security stack है।

MetaBackdoor: LLM backdoor लंबाई पर trigger, 90 poisons, 700+ tokens पर 75%

और समाचार