What happened
Quesma released BinaryAudit, an open-source benchmark that tests whether AI agents can detect backdoors in compiled binaries without access to source code. Working with open-source reverse engineering tools such as Ghidra, Anthropic's Claude Opus 4.6 identified artificially injected malicious code in 49% of tasks, while Google's Gemini 3 Pro solved 44%. Both models, however, flagged clean binaries as compromised often enough to produce a 28% false positive rate. The agents handled C and Rust executables but failed to process Go binaries, due to limitations in the underlying open-source decompilers.
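For readers curious what such an agent pipeline typically looks like, here is a minimal sketch, not Quesma's actual harness: it shells out to Ghidra's headless analyzer to decompile a binary, then hands the decompiled C to a model for a verdict. The install path, the ExportDecompiled.java post-script, and the classify_with_model call are all assumed placeholders, not parts of BinaryAudit.

```python
import subprocess
from pathlib import Path

GHIDRA = Path("/opt/ghidra/support/analyzeHeadless")  # assumed install location


def classify_with_model(prompt: str) -> str:
    # Stand-in for an actual LLM API call; wire in your client of choice.
    raise NotImplementedError


def decompile(binary: Path, workdir: Path) -> str:
    """Run Ghidra headless analysis and return decompiled C as text.

    Assumes a hypothetical post-script (ExportDecompiled.java) that writes
    the decompiler output to <binary>.c inside workdir.
    """
    subprocess.run(
        [
            str(GHIDRA),
            str(workdir), "audit_project",               # project dir / name
            "-import", str(binary),
            "-postScript", "ExportDecompiled.java", str(workdir),
            "-deleteProject",                            # discard project state
        ],
        check=True,
    )
    return (workdir / f"{binary.name}.c").read_text()


def triage(binary: Path, workdir: Path) -> str:
    """Ask a model whether the decompiled code contains a backdoor.

    Given the 28% false positive rate reported above, treat a 'backdoor'
    verdict as a candidate for human review, not a conclusion.
    """
    source = decompile(binary, workdir)
    prompt = (
        "You are auditing a decompiled binary. Answer 'backdoor' or 'clean', "
        "then justify briefly:\n\n" + source[:20_000]  # truncate long outputs
    )
    return classify_with_model(prompt)
```

The truncation in the prompt hints at the compute point raised below: decompiled output for a non-trivial binary can be large, and all of it has to pass through the model.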
Why it matters
The 28% false positive rate means every flagged threat still needs human verification, even though frontier models can now drive complex reverse engineering toolchains on their own. Deploy these agents for initial triage rather than definitive auditing: they still struggle to distinguish legitimate network parsers from deliberately planted access routes. The benchmark precedes Anthropic's release of Claude Code Security, underscoring a broader industry push to fold AI into vulnerability research. Teams evaluating self-hosted inference should also budget for the compute overhead of processing decompiled code.
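To make the triage point concrete, a quick back-of-the-envelope using the reported rates: with 49% detection and 28% false positives, the precision of a "backdoor" flag depends heavily on how rare backdoors are in your binary population. The 5% prevalence below is an illustrative assumption, not a figure from the benchmark.

```python
def flag_precision(tpr: float, fpr: float, prevalence: float) -> float:
    """P(actually backdoored | agent flags it), via Bayes' rule."""
    true_flags = tpr * prevalence
    false_flags = fpr * (1 - prevalence)
    return true_flags / (true_flags + false_flags)


# Reported rates: 49% detection (Claude Opus 4.6), 28% false positives.
# A 5% prevalence of backdoored binaries is an assumed illustration.
print(flag_precision(0.49, 0.28, 0.05))  # ~0.084
```

At that prevalence, fewer than one in ten flags would be a real backdoor, which is exactly why these agents fit as a first-pass filter in front of human analysts rather than as a final arbiter.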