Long-horizon reverse engineering tasks for LLM agents, scored deterministically.
Give an LLM agent a compiled binary and a set of static analysis tools. Measure how well it identifies C2 infrastructure, encoding schemes, anti-analysis techniques, and communication protocols — all without human guidance.
Simple Q&A benchmarks don't differentiate frontier models on real-world reasoning. AgentRE-Bench tests chains of 10–25 tool calls where each call's output informs the next decision.
All 13 binaries are compiled from purpose-built C sources with known ground truths. No licensing or ethics issues — anyone can compile identical binaries and verify scores.
Every answer is scored against a fixed ground truth with weighted fields and Jaccard overlap for sets. No LLM-as-judge. The same answer always produces the same score.
Agents must plan which tools to use, interpret raw output (hex dumps, disassembly), and synthesize findings — all within a budget of 25 tool calls per task.
The hardest levels require entropy analysis, key extraction, decryption, and identifying 18+ techniques across anti-debugging, process injection, and network evasion.
From plaintext TCP shells to metamorphic droppers with RC4 encryption, control-flow flattening, and triple anti-debug.
| Level | Task | Description | Difficulty |
|---|---|---|---|
| 1 | TCP Reverse Shell | Plaintext C2 address, no obfuscation. Connects via socket, redirects I/O with dup2, executes /bin/sh. | Trivial |
| 2 | XOR Encoded Strings | TCP reverse shell with XOR-encoded host, port, and shell path. Single-byte key 0x22, decoded at runtime. | Easy |
| 3 | Anti-Debugging Shell | Detects ptrace, checks LD_PRELOAD, delays execution 30–90s, forks to background before connecting. | Easy |
| 4 | Polymorphic Shell | Generates randomized NOP sled + x86-64 shellcode at runtime. Runtime variation defeats static signatures. | Medium |
| 5 | Multistage Shell | Stage 1 receives XOR key from C2. Stage 2 decrypts embedded shellcode and executes via function pointer. | Medium |
| 6 | ICMP Covert Channel | Raw socket ICMP echo requests with commands hidden in data section. Custom checksum, 5s beacon interval. | Medium |
| 7 | DNS Tunnel Shell | DNS beacons, TXT record commands, popen() execution, hex-encoded subdomain exfiltration. | Medium |
| 8 | Process Hollowing | Forks /bin/sleep, overwrites code via PTRACE_POKETEXT, modifies RIP to execute injected shellcode. | Hard |
| 9 | Shared Object Injection | Constructor-triggered .so with puts() hijacking via dlsym(RTLD_NEXT). Evades LD_AUDIT/LD_PRELOAD detection. | Hard |
| 10 | AES Encrypted Shell | Claims AES, actually 16-byte XOR key. Inline assembly syscalls. Decrypts shellcode into mmap'd RWX memory. | Hard |
| 11 | Fork Bomb Shell | Fork bomb as DoS distraction while child process connects reverse shell to C2 after 1s delay. | Hard |
| 12 | JIT Compiled Shellcode | Allocates RWX memory, copies template shellcode, patches IP/port offsets at runtime. Self-modifying code. | Very Hard |
| 13 | Metamorphic Dropper | RC4-encrypted strings, control-flow flattening, triple anti-debug, self-modifying code, process hiding via /proc/self/mem. 18 techniques. | Bonus |
Each of the 12 standard levels is scored across 5 weighted fields. Level 13 is a bonus task with a deeper 10-field rubric. Maximum total score is 2.0.
Zero Python dependencies. All LLM provider calls use Python's built-in urllib. Just bring an API key.
Clone the repo and add at least one provider API key.
git clone https://github.com/agentrebench/AgentRE-Bench.git
cd AgentRE-Bench
cp .env.example .env # add your API key(s)
Compile the 13 ELF64 challenge binaries. Uses local gcc on Linux, Docker on macOS.
chmod +x build_binaries.sh
./build_binaries.sh
Build the Docker image for sandboxed tool execution (network-isolated, read-only, memory-capped).
docker build --platform linux/amd64 \
-t agentre-bench-tools:latest \
-f Dockerfile.tools .
Run a single task or the full 13-level suite. Supports Claude, GPT, Gemini, and DeepSeek.
python run_benchmark.py --task level1_TCPServer -v
python run_benchmark.py --all --provider anthropic