Long-horizon reverse engineering tasks for LLM agents, scored deterministically.
Give an LLM agent a compiled ELF binary and a set of Linux static analysis tools. Measure how well it identifies C2 infrastructure, encoding schemes, anti-analysis techniques, and communication protocols — all without human guidance. Currently targeting Linux/Unix (ELF x86-64). Windows PE support planned.
Simple Q&A benchmarks don't differentiate frontier models on real-world reasoning. AgentRE-Bench tests chains of 10–25 tool calls where each call's output informs the next decision.
All 13 ELF x86-64 binaries are compiled from purpose-built C sources targeting Linux/Unix. Known ground truths, no licensing issues — anyone can compile identical binaries and verify scores.
Every answer is scored against a fixed ground truth with weighted fields and Jaccard overlap for sets. No LLM-as-judge. The same answer always produces the same score.
Agents must plan which tools to use, interpret raw output (hex dumps, disassembly), and synthesize findings — all within a budget of 25 tool calls per task.
The hardest levels require entropy analysis, key extraction, decryption, and identifying 18+ techniques across anti-debugging, process injection, and network evasion.
From plaintext TCP shells to metamorphic droppers with RC4 encryption, control-flow flattening, and triple anti-debug.
| Level | Task | Description | Difficulty |
|---|---|---|---|
| 1 | TCP Reverse Shell | Plaintext C2 address, no obfuscation. Connects via socket, redirects I/O with dup2, executes /bin/sh. | Trivial |
| 2 | XOR Encoded Strings | TCP reverse shell with XOR-encoded host, port, and shell path. Single-byte key 0x22, decoded at runtime. | Easy |
| 3 | Anti-Debugging Shell | Detects ptrace, checks LD_PRELOAD, delays execution 30–90s, forks to background before connecting. | Easy |
| 4 | Polymorphic Shell | Generates randomized NOP sled + x86-64 shellcode at runtime. Runtime variation defeats static signatures. | Medium |
| 5 | Multistage Shell | Stage 1 receives XOR key from C2. Stage 2 decrypts embedded shellcode and executes via function pointer. | Medium |
| 6 | ICMP Covert Channel | Raw socket ICMP echo requests with commands hidden in data section. Custom checksum, 5s beacon interval. | Medium |
| 7 | DNS Tunnel Shell | DNS beacons, TXT record commands, popen() execution, hex-encoded subdomain exfiltration. | Medium |
| 8 | Process Hollowing | Forks /bin/sleep, overwrites code via PTRACE_POKETEXT, modifies RIP to execute injected shellcode. | Hard |
| 9 | Shared Object Injection | Constructor-triggered .so with puts() hijacking via dlsym(RTLD_NEXT). Evades LD_AUDIT/LD_PRELOAD detection. | Hard |
| 10 | AES Encrypted Shell | Claims AES, actually 16-byte XOR key. Inline assembly syscalls. Decrypts shellcode into mmap'd RWX memory. | Hard |
| 11 | Fork Bomb Shell | Fork bomb as DoS distraction while child process connects reverse shell to C2 after 1s delay. | Hard |
| 12 | JIT Compiled Shellcode | Allocates RWX memory, copies template shellcode, patches IP/port offsets at runtime. Self-modifying code. | Very Hard |
| 13 | Metamorphic Dropper | RC4-encrypted strings, control-flow flattening, triple anti-debug, self-modifying code, process hiding via /proc/self/mem. 18 techniques. | Bonus |
Each of the 12 standard levels is scored across 5 weighted fields. Level 13 is a bonus task with a deeper 10-field rubric. Maximum total score is 2.0.
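To make the determinism concrete, here is a minimal sketch of the scoring shape: exact-match scalar fields, Jaccard overlap for set-valued fields, and a flat penalty per fabricated technique. The field names, weights, and penalty constant below are illustrative placeholders, not the repo's actual rubric.

```python
# Minimal sketch of deterministic weighted-field scoring. Field names,
# weights, and the 0.03 fabrication penalty are illustrative, not the
# benchmark's actual rubric.

def jaccard(predicted: set, truth: set) -> float:
    """Set overlap |A & B| / |A | B|. Identical inputs always score identically."""
    if not predicted and not truth:
        return 1.0
    return len(predicted & truth) / len(predicted | truth)

GROUND_TRUTH = {
    "c2_host": "203.0.113.7",                        # exact-match field
    "c2_port": 4444,                                 # exact-match field
    "techniques": {"ptrace_check", "fork", "dup2"},  # set field, Jaccard-scored
}

WEIGHTS = {"c2_host": 0.4, "c2_port": 0.2, "techniques": 0.4}

def score(answer: dict) -> float:
    total = 0.0
    for field, weight in WEIGHTS.items():
        truth = GROUND_TRUTH[field]
        pred = answer.get(field)
        if isinstance(truth, set):
            total += weight * jaccard(set(pred or ()), truth)
        else:
            total += weight * (1.0 if pred == truth else 0.0)
    # Flat penalty for claimed techniques that appear nowhere in the ground truth.
    fabricated = set(answer.get("techniques", ())) - GROUND_TRUTH["techniques"]
    return max(0.0, total - 0.03 * len(fabricated))
```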
All 13 levels, 25 tool-call budget, Docker-sandboxed static analysis tools. Six frontier models evaluated end-to-end. Headline finding: a small non-thinking model (Gemini 3.1 Flash Lite) leads the field, beating every frontier reasoning model on Main score. Hallucination calibration — not reasoning depth — is the dominant axis on this bench.
Gemini 3.1 Flash Lite (bench leader). Wins or ties on 8 of 13 levels, including L7 DNS Tunnel (the only model above 0.33), L11 Fork Bomb (0.80), and L12 JIT Shellcode (0.76), and posts the highest valid score on L10 AES Encrypted. Lowest hallucination rate in the field (1.92/task). Fast: 525s total wall time, 30.5M tokens. Weakness: bonus L13 only 0.105; without thinking, it cracks the recognition layer but not the encryption-execution layer.
V4 Pro. Only model to crack the bonus task meaningfully (L13 = 0.271). Strong calibration (2.69 hallucinations/task); the thinking trace doesn't surface false positives. Lost L3, L4, and L5 to provider timeouts during evaluation and would likely lead the Main score if those had held up. Slowest in the field at 920s average per task and 3338s total wall time.
Opus 4.7. Most reliable run: 13/13 valid answers, 0 provider errors. Its strongest level is L8 Process Hollowing (0.54, tied for bench-best). Adaptive thinking surfaces broader candidate technique sets, resulting in 5.62 hallucinations/task and a ~3.6 point penalty. The thinking-on-vs-off comparison is unexpectedly mixed: a separate no-thinking baseline of the same model scored higher on this bench.
Kimi K2.6. Bench-best on L1 (0.875) and L2 (0.83) by clear margins; the easy levels go to the calibrated thinker. Second-best hallucination rate (2.08/task). Mixed L10 outcome: scored 0.40 partial credit (highest in field) but errored on the final turn before submitting a valid answer. Lost 4 tasks to mid-conversation HTTP 400 errors; 200-minute total run.
V4 Flash. Fast non-thinking baseline. Strong on L1 (0.72), L5 Multistage (0.69), and L11 Fork Bomb (0.71). 13/13 valid answers, 636s total wall time. Submitted an empty {} for L13; non-thinking models cannot sustain the multi-step structure the bonus rubric requires. Outperforms two thinking models on Main score: calibration over depth, again.
GPT-5.5. Highest hallucination rate (6.31/task). On L13, burned 22 minutes and 72,493 reasoning tokens for a score of 0; the reasoning trace appeared to attempt symbolic decryption, lost the thread, and never converged to a final answer. Generally weakest on the harder tasks (L8, L12). The only model in the field that exposes its reasoning-token budget in usage metadata.
| Level | Task | Flash Lite | V4 Pro | Opus 4.7 | Kimi K2.6 | V4 Flash | GPT-5.5 |
|---|---|---|---|---|---|---|---|
| 1 | TCP Reverse Shell | 0.80 | 0.66 | 0.43 | 0.88 | 0.72 | 0.49 |
| 2 | XOR Encoded | 0.59 | 0.65 | 0.75 | 0.83 | 0.30 | 0.58 |
| 3 | Anti-Debug | 0.61 | 0.00 | 0.27 | 0.53 | 0.58 | 0.32 |
| 4 | Polymorphic | 0.05 | 0.40 | 0.00 | 0.40 | 0.14 | 0.00 |
| 5 | Multistage | 0.70 | 0.00 | 0.57 | 0.63 | 0.69 | 0.32 |
| 6 | ICMP Covert | 0.71 | 0.64 | 0.38 | 0.58 | 0.59 | 0.38 |
| 7 | DNS Tunnel | 0.54 | 0.03 | 0.07 | 0.00 | 0.33 | 0.00 |
| 8 | Process Hollow | 0.54 | 0.38 | 0.54 | 0.00 | 0.30 | 0.14 |
| 9 | SO Injection | 0.48 | 0.46 | 0.43 | 0.38 | 0.40 | 0.22 |
| 10 | AES Encrypted | 0.15 | 0.05 | 0.00 | 0.40 | 0.10 | 0.00 |
| 11 | Fork Bomb | 0.80 | 0.62 | 0.50 | 0.71 | 0.71 | 0.45 |
| 12 | JIT Shellcode | 0.76 | 0.64 | 0.48 | 0.64 | 0.53 | 0.17 |
| 13 | Metamorphic Bonus | 0.11 | 0.27 | 0.15 | 0.00 | 0.00 | 0.00 |
Two additional models (Gemini 3.1 Pro Preview and GLM 5.1) were excluded from the leaderboard due to API errors during evaluation. Full details and per-model deep-dives in the analysis writeup.
Across the six leaderboard models and 13 RE tasks, the failures cluster into recognizable patterns. None of them are about model size or thinking budget. They're about specific cognitive moves that current LLMs cannot reliably execute on a binary, even when every byte of evidence is statically visible.
L10 AES Encrypted Shell — max 0.15 valid score across all models. The binary's strings advertise AES and the symbol table mentions key schedules, but the actual implementation is 16-byte XOR. Every model identifies the surface narrative ("this uses AES") and stops. Nobody disproves the lie by tracing the actual control flow to the encryption routine. Models trust the labels.
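Disproving the label is cheap once you stop trusting it: carve the 16-byte "key" and the ciphertext blob out of the data section, apply a repeating-key XOR, and see whether printable C2 strings fall out. A minimal sketch, with a made-up key and plaintext so it runs standalone (the real values come from the binary):

```python
# Sketch: one pass of repeating-key XOR. In the real task the 16-byte key and
# the ciphertext blob are carved out of the binary's data section; here both
# are invented placeholders so the snippet runs on its own.

def xor_bytes(data: bytes, key: bytes) -> bytes:
    # XOR is its own inverse, so the same function encodes and decodes.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

key = bytes(range(16))                      # stand-in for the extracted 16-byte key
blob = xor_bytes(b"203.0.113.7:4444", key)  # stand-in for the carved ciphertext

print(xor_bytes(blob, key))                 # b'203.0.113.7:4444' -> not AES, just XOR
```

The same function with a single-byte key covers Level 2's 0x22 scheme.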
L13 Metamorphic Dropper — thinking models hit 1.0 on every metadata field, 0.0 on every execution field. V4 Pro and Opus 4.7 both perfectly identify "this is RC4 with key X stored at offset Y" but cannot mentally run RC4 across the 4 KB ciphertext to produce the decoded C2 URL or plaintext string table. The recognition layer is solved; the in-head computation layer is the wall.
L4 Polymorphic Shell — max valid 0.05. The binary contains a NOP-sled generator that randomizes shellcode at runtime. The C2 IP/port live in a static template that doesn't mutate. Every model reads "polymorphic generator" and concludes the answer is unknowable, instead of recognizing that randomization decorates noise while signal stays static. Models conflate "generates random output" with "answer is random."
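A toy illustration of the point, with an invented template and C2 string (not the challenge's actual generator): the random sled changes on every emission, but the template bytes, and the C2 address embedded in them, do not.

```python
# Toy model of a polymorphic emitter: random NOP sled + fixed template.
# The "template" and the embedded C2 bytes are invented for illustration.
import os

C2 = b"203.0.113.7:4444"                    # static signal inside the template
TEMPLATE = b"\x48\x31\xc0" + C2 + b"\xc3"   # pretend shellcode template

def emit() -> bytes:
    sled = b"\x90" * (8 + os.urandom(1)[0] % 24)   # randomized NOP sled
    return sled + TEMPLATE

a, b = emit(), emit()
print(len(a), len(b))                          # lengths (usually) differ: "polymorphic"
assert a.endswith(TEMPLATE) and b.endswith(TEMPLATE)
assert C2 in a and C2 in b                     # the C2 address is recoverable from either
```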
L7 DNS Tunnel — four of six models score below 0.10; only Gemini Flash Lite breaks through meaningfully at 0.54, with V4 Flash a distant second at 0.33. The C2 protocol uses TXT-record commands across multiple beacon/response pairs, with hex-encoded subdomain exfiltration. Reconstructing the framing from static disassembly requires sustained reasoning across many control-flow paths. Most models get lost; the ones that succeed are fast and focused, not deep thinkers.
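The exfiltration leg, at least, is mechanical once the framing is reconstructed: the leftmost DNS label carries hex-encoded output. A sketch with an invented query (the real label layout has to be recovered from the binary first):

```python
# Sketch: recovering exfiltrated data from a hex-encoded DNS subdomain label.
# The example query is invented; the real framing must be reconstructed from
# the binary's control flow.

def decode_exfil(query: str) -> bytes:
    label = query.split(".", 1)[0]          # leftmost label carries the payload
    return bytes.fromhex(label)

print(decode_exfil("726f6f74.c2.example.com"))   # b'root'
```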
The hallucination ranking and the score ranking are nearly identical. GPT-5.5 invents 6.31 fabricated techniques per task and pays a ~4.1 point penalty. Opus 4.7 with thinking surfaces broader candidate lists, hits 5.62 hallucinations, pays ~3.6 points. Gemini Flash Lite (1.92) and Kimi (2.08) lead the bench because they don't claim techniques they can't defend. Calibration eats reasoning depth on this bench.
GPT-5.5 on L13: 22 minutes, 72,493 reasoning tokens, never submitted an answer. The reasoning trace appears to attempt symbolic decryption, gets lost, and never converges to a final tool call. Kimi's L4/L8/L10/L13 end the same way for a different reason: mid-thinking HTTP 400s during payload reconstruction. Long thinking budgets without grounding in tool output produce expensive zeros.
L8 Process Hollowing, L9 SO Injection — consistent mid-pack scores (0.30–0.54), no model dominates. Both tasks require tracking many evidential threads (PTRACE_POKETEXT call, RIP modification, dlsym hijacking, RTLD_NEXT lookup) and synthesizing them into a coherent attack chain. Models recover individual techniques but miss the synthesis; partial credit is the norm.
L13 Metamorphic Dropper — 18 ground-truth techniques, max bench score 0.27. RC4-encrypted strings + control-flow flattening + triple anti-debug + self-modifying code + process hiding via /proc/self/mem. Even thinking models get only 25-30% of the technique set on Jaccard. Failure compounds: every missed technique reduces overlap, every fabricated technique reduces score by 0.03. The bench's hardest task isn't unsolvable in principle — it's unsolvable in practice without a compute aid like a Python sandbox.
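This is exactly what a compute aid would trivialize. RC4 itself is a few lines of Python; the wall is running it mentally over 4 KB of ciphertext. A standard RC4 sketch, with a made-up key and plaintext standing in for the values carved from the binary:

```python
# Standard RC4 (key scheduling + PRGA). No model can run this in its head over
# a 4 KB ciphertext; with a sandbox in the loop it becomes a single tool call.

def rc4(key: bytes, data: bytes) -> bytes:
    # Key-scheduling algorithm
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]
    # Pseudo-random generation, XORed over the data
    out, i, j = bytearray(), 0, 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        out.append(byte ^ S[(S[i] + S[j]) % 256])
    return bytes(out)

# RC4 is symmetric: encrypting a known plaintext and decrypting it round-trips.
ct = rc4(b"key-from-binary", b"http://203.0.113.7/stage2")
print(rc4(b"key-from-binary", ct))   # b'http://203.0.113.7/stage2'
```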
On the V2 task set, frontier reasoning models reliably handle plaintext C2, single-byte XOR, named anti-debug syscalls, fork bombs, and ICMP covert channels: anything that maps cleanly to a well-known training-data pattern with literal artifacts in strings or objdump output.
They struggle reliably when the malware: (1) lies about what it's doing (claimed AES, actual XOR); (2) requires symbolic execution beyond a few hundred bytes (RC4 on a 4 KB ciphertext); (3) generates code at runtime that defeats static signatures (polymorphic shellcode); (4) uses stateful protocols that span multiple control-flow paths (DNS tunneling); (5) layers many techniques cumulatively (the metamorphic dropper). These five failure modes are the design lever for V3 — samples that exercise them are where the leaderboard will spread.
No third-party Python dependencies: all LLM provider calls go through Python's built-in urllib. Just bring an API key.
Clone the repo and add at least one provider API key.
git clone https://github.com/agentrebench/AgentRE-Bench.git
cd AgentRE-Bench
cp .env.example .env # add your API key(s)
Compile the 13 ELF64 Linux/Unix challenge binaries. Uses local gcc on Linux x86-64, Docker on macOS.
chmod +x build_binaries.sh
./build_binaries.sh
Build the Docker image for sandboxed tool execution (network-isolated, read-only, memory-capped).
docker build --platform linux/amd64 \
-t agentre-bench-tools:latest \
-f Dockerfile.tools .
Run a single task or the full 13-level suite. Supports Claude, GPT, Gemini, and DeepSeek.
python run_benchmark.py --task level1_TCPServer -v
python run_benchmark.py --all --provider anthropic