Long-horizon reverse engineering tasks for LLM agents, scored deterministically.
Give an LLM agent a compiled ELF binary and a set of Linux static analysis tools. Measure how well it identifies C2 infrastructure, encoding schemes, anti-analysis techniques, and communication protocols — all without human guidance. Currently targeting Linux/Unix (ELF x86-64). Windows PE support planned.
Simple Q&A benchmarks don't differentiate frontier models on real-world reasoning. AgentRE-Bench tests chains of 10–25 tool calls where each call's output informs the next decision.
All 13 ELF x86-64 binaries are compiled from purpose-built C sources targeting Linux/Unix. Known ground truths, no licensing issues — anyone can compile identical binaries and verify scores.
Every answer is scored against a fixed ground truth with weighted fields and Jaccard overlap for sets. No LLM-as-judge. The same answer always produces the same score.
Agents must plan which tools to use, interpret raw output (hex dumps, disassembly), and synthesize findings — all within a budget of 25 tool calls per task.
The hardest levels require entropy analysis, key extraction, decryption, and identifying 18+ techniques across anti-debugging, process injection, and network evasion.
From plaintext TCP shells to metamorphic droppers with RC4 encryption, control-flow flattening, and triple anti-debug.
| Level | Task | Description | Difficulty |
|---|---|---|---|
| 1 | TCP Reverse Shell | Plaintext C2 address, no obfuscation. Connects via socket, redirects I/O with dup2, executes /bin/sh. | Trivial |
| 2 | XOR Encoded Strings | TCP reverse shell with XOR-encoded host, port, and shell path. Single-byte key 0x22, decoded at runtime. | Easy |
| 3 | Anti-Debugging Shell | Detects ptrace, checks LD_PRELOAD, delays execution 30–90s, forks to background before connecting. | Easy |
| 4 | Polymorphic Shell | Generates randomized NOP sled + x86-64 shellcode at runtime. Runtime variation defeats static signatures. | Medium |
| 5 | Multistage Shell | Stage 1 receives XOR key from C2. Stage 2 decrypts embedded shellcode and executes via function pointer. | Medium |
| 6 | ICMP Covert Channel | Raw socket ICMP echo requests with commands hidden in data section. Custom checksum, 5s beacon interval. | Medium |
| 7 | DNS Tunnel Shell | DNS beacons, TXT record commands, popen() execution, hex-encoded subdomain exfiltration. | Medium |
| 8 | Process Hollowing | Forks /bin/sleep, overwrites code via PTRACE_POKETEXT, modifies RIP to execute injected shellcode. | Hard |
| 9 | Shared Object Injection | Constructor-triggered .so with puts() hijacking via dlsym(RTLD_NEXT). Evades LD_AUDIT/LD_PRELOAD detection. | Hard |
| 10 | AES Encrypted Shell | Claims AES, actually 16-byte XOR key. Inline assembly syscalls. Decrypts shellcode into mmap'd RWX memory. | Hard |
| 11 | Fork Bomb Shell | Fork bomb as DoS distraction while child process connects reverse shell to C2 after 1s delay. | Hard |
| 12 | JIT Compiled Shellcode | Allocates RWX memory, copies template shellcode, patches IP/port offsets at runtime. Self-modifying code. | Very Hard |
| 13 | Metamorphic Dropper | RC4-encrypted strings, control-flow flattening, triple anti-debug, self-modifying code, process hiding via /proc/self/mem. 18 techniques. | Bonus |
Each of the 12 standard levels is scored across 5 weighted fields. Level 13 is a bonus task with a deeper 10-field rubric. Maximum total score is 2.0.
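To make the determinism concrete, here is a minimal sketch of the scoring shape: exact-match scalar fields, Jaccard overlap for set-valued fields, and a flat penalty per fabricated technique. The field names, weights, and penalty constant below are illustrative placeholders, not the repo's actual rubric.

```python
# Minimal sketch of deterministic weighted-field scoring. Field names,
# weights, and the 0.03 fabrication penalty are illustrative, not the
# benchmark's actual rubric.

def jaccard(predicted: set, truth: set) -> float:
    """Set overlap |A & B| / |A | B|. Identical inputs always score identically."""
    if not predicted and not truth:
        return 1.0
    return len(predicted & truth) / len(predicted | truth)

GROUND_TRUTH = {
    "c2_host": "203.0.113.7",                        # exact-match field
    "c2_port": 4444,                                 # exact-match field
    "techniques": {"ptrace_check", "fork", "dup2"},  # set field, Jaccard-scored
}

WEIGHTS = {"c2_host": 0.4, "c2_port": 0.2, "techniques": 0.4}

def score(answer: dict) -> float:
    total = 0.0
    for field, weight in WEIGHTS.items():
        truth = GROUND_TRUTH[field]
        pred = answer.get(field)
        if isinstance(truth, set):
            total += weight * jaccard(set(pred or ()), truth)
        else:
            total += weight * (1.0 if pred == truth else 0.0)
    # Flat penalty for claimed techniques that appear nowhere in the ground truth.
    fabricated = set(answer.get("techniques", ())) - GROUND_TRUTH["techniques"]
    return max(0.0, total - 0.03 * len(fabricated))
```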
All 13 levels, 25 tool-call budget, Docker-sandboxed static analysis tools. Six frontier models evaluated end-to-end. Headline finding: a small non-thinking model (Gemini 3.1 Flash Lite) leads the field, beating every frontier reasoning model on Main score. Hallucination calibration — not reasoning depth — is the dominant axis on this bench.
Gemini 3.1 Flash Lite (bench leader). Wins or ties on 8 of 13 levels, including L7 DNS Tunnel (the only model above 0.33), L11 Fork Bomb (0.80), and L12 JIT Shellcode (0.76), and posts the highest valid score on L10 AES Encrypted. Lowest hallucination rate in the field (1.92/task). Fast: 525s total wall time, 30.5M tokens. Weakness: bonus L13 only 0.105; without thinking, it cracks the recognition layer but not the encryption-execution layer.
V4 Pro. Only model to crack the bonus task meaningfully (L13 = 0.271). Strong calibration (2.69 hallucinations/task); the thinking trace doesn't surface false positives. Lost L3, L4, and L5 to provider timeouts during evaluation and would likely lead the Main score if those had held up. Slowest in the field at 920s average per task and 3338s total wall time.
Opus 4.7. Most reliable run: 13/13 valid answers, 0 provider errors. Its strongest level is L8 Process Hollowing (0.54, tied for bench-best). Adaptive thinking surfaces broader candidate technique sets, resulting in 5.62 hallucinations/task and a ~3.6 point penalty. The thinking-on-vs-off comparison is unexpectedly mixed: a separate no-thinking baseline of the same model scored higher on this bench.
Kimi K2.6. Bench-best on L1 (0.875) and L2 (0.83) by clear margins; the easy levels go to the calibrated thinker. Second-best hallucination rate (2.08/task). Mixed L10 outcome: scored 0.40 partial credit (highest in field) but errored on the final turn before submitting a valid answer. Lost 4 tasks to mid-conversation HTTP 400 errors; 200-minute total run.
V4 Flash. Fast non-thinking baseline. Strong on L1 (0.72), L5 Multistage (0.69), and L11 Fork Bomb (0.71). 13/13 valid answers, 636s total wall time. Submitted an empty {} for L13; non-thinking models cannot sustain the multi-step structure the bonus rubric requires. Outperforms two thinking models on Main score: calibration over depth, again.
GPT-5.5. Highest hallucination rate (6.31/task). On L13, burned 22 minutes and 72,493 reasoning tokens for a score of 0; the reasoning trace appeared to attempt symbolic decryption, lost the thread, and never converged to a final answer. Generally weakest on the harder tasks (L8, L12). The only model in the field that exposes its reasoning-token budget in usage metadata.
| Level | Task | Flash Lite | V4 Pro | Opus 4.7 | Kimi K2.6 | V4 Flash | GPT-5.5 |
|---|---|---|---|---|---|---|---|
| 1 | TCP Reverse Shell | 0.80 | 0.66 | 0.43 | 0.88 | 0.72 | 0.49 |
| 2 | XOR Encoded | 0.59 | 0.65 | 0.75 | 0.83 | 0.30 | 0.58 |
| 3 | Anti-Debug | 0.61 | 0.00 | 0.27 | 0.53 | 0.58 | 0.32 |
| 4 | Polymorphic | 0.05 | 0.40 | 0.00 | 0.40 | 0.14 | 0.00 |
| 5 | Multistage | 0.70 | 0.00 | 0.57 | 0.63 | 0.69 | 0.32 |
| 6 | ICMP Covert | 0.71 | 0.64 | 0.38 | 0.58 | 0.59 | 0.38 |
| 7 | DNS Tunnel | 0.54 | 0.03 | 0.07 | 0.00 | 0.33 | 0.00 |
| 8 | Process Hollow | 0.54 | 0.38 | 0.54 | 0.00 | 0.30 | 0.14 |
| 9 | SO Injection | 0.48 | 0.46 | 0.43 | 0.38 | 0.40 | 0.22 |
| 10 | AES Encrypted | 0.15 | 0.05 | 0.00 | 0.40 | 0.10 | 0.00 |
| 11 | Fork Bomb | 0.80 | 0.62 | 0.50 | 0.71 | 0.71 | 0.45 |
| 12 | JIT Shellcode | 0.76 | 0.64 | 0.48 | 0.64 | 0.53 | 0.17 |
| 13 | Metamorphic Bonus | 0.11 | 0.27 | 0.15 | 0.00 | 0.00 | 0.00 |
Two additional models (Gemini 3.1 Pro Preview and GLM 5.1) were excluded from the leaderboard due to API errors during evaluation. Full details and per-model deep-dives in the analysis writeup.
Across the six leaderboard models and 13 RE tasks, the failures cluster into recognizable patterns. None of them are about model size or thinking budget. They're about specific cognitive moves that current LLMs cannot reliably execute on a binary, even when every byte of evidence is statically visible.
L10 AES Encrypted Shell — max 0.15 valid score across all models. The binary's strings advertise AES and the symbol table mentions key schedules, but the actual implementation is 16-byte XOR. Every model identifies the surface narrative ("this uses AES") and stops. Nobody disproves the lie by tracing the actual control flow to the encryption routine. Models trust the labels.
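Disproving the label is cheap once you stop trusting it: carve the 16-byte "key" and the ciphertext blob out of the data section, apply a repeating-key XOR, and see whether printable C2 strings fall out. A minimal sketch, with a made-up key and plaintext so it runs standalone (the real values come from the binary):

```python
# Sketch: one pass of repeating-key XOR. In the real task the 16-byte key and
# the ciphertext blob are carved out of the binary's data section; here both
# are invented placeholders so the snippet runs on its own.

def xor_bytes(data: bytes, key: bytes) -> bytes:
    # XOR is its own inverse, so the same function encodes and decodes.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

key = bytes(range(16))                      # stand-in for the extracted 16-byte key
blob = xor_bytes(b"203.0.113.7:4444", key)  # stand-in for the carved ciphertext

print(xor_bytes(blob, key))                 # b'203.0.113.7:4444' -> not AES, just XOR
```

The same function with a single-byte key covers Level 2's 0x22 scheme.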
L13 Metamorphic Dropper — thinking models hit 1.0 on every metadata field, 0.0 on every execution field. V4 Pro and Opus 4.7 both perfectly identify "this is RC4 with key X stored at offset Y" but cannot mentally run RC4 across the 4 KB ciphertext to produce the decoded C2 URL or plaintext string table. The recognition layer is solved; the in-head computation layer is the wall.
L4 Polymorphic Shell — max valid 0.05. The binary contains a NOP-sled generator that randomizes shellcode at runtime. The C2 IP/port live in a static template that doesn't mutate. Every model reads "polymorphic generator" and concludes the answer is unknowable, instead of recognizing that randomization decorates noise while signal stays static. Models conflate "generates random output" with "answer is random."
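A toy illustration of the point, with an invented template and C2 string (not the challenge's actual generator): the random sled changes on every emission, but the template bytes, and the C2 address embedded in them, do not.

```python
# Toy model of a polymorphic emitter: random NOP sled + fixed template.
# The "template" and the embedded C2 bytes are invented for illustration.
import os

C2 = b"203.0.113.7:4444"                    # static signal inside the template
TEMPLATE = b"\x48\x31\xc0" + C2 + b"\xc3"   # pretend shellcode template

def emit() -> bytes:
    sled = b"\x90" * (8 + os.urandom(1)[0] % 24)   # randomized NOP sled
    return sled + TEMPLATE

a, b = emit(), emit()
print(len(a), len(b))                          # lengths (usually) differ: "polymorphic"
assert a.endswith(TEMPLATE) and b.endswith(TEMPLATE)
assert C2 in a and C2 in b                     # the C2 address is recoverable from either
```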
L7 DNS Tunnel — four of six models score below 0.10; only Gemini Flash Lite breaks through meaningfully at 0.54, with V4 Flash a distant second at 0.33. The C2 protocol uses TXT-record commands across multiple beacon/response pairs, with hex-encoded subdomain exfiltration. Reconstructing the framing from static disassembly requires sustained reasoning across many control-flow paths. Most models get lost; the ones that succeed are fast and focused, not deep thinkers.
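The exfiltration leg, at least, is mechanical once the framing is reconstructed: the leftmost DNS label carries hex-encoded output. A sketch with an invented query (the real label layout has to be recovered from the binary first):

```python
# Sketch: recovering exfiltrated data from a hex-encoded DNS subdomain label.
# The example query is invented; the real framing must be reconstructed from
# the binary's control flow.

def decode_exfil(query: str) -> bytes:
    label = query.split(".", 1)[0]          # leftmost label carries the payload
    return bytes.fromhex(label)

print(decode_exfil("726f6f74.c2.example.com"))   # b'root'
```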
The hallucination ranking and the score ranking are nearly identical. GPT-5.5 invents 6.31 fabricated techniques per task and pays a ~4.1 point penalty. Opus 4.7 with thinking surfaces broader candidate lists, hits 5.62 hallucinations, pays ~3.6 points. Gemini Flash Lite (1.92) and Kimi (2.08) lead the bench because they don't claim techniques they can't defend. Calibration eats reasoning depth on this bench.
GPT-5.5 on L13: 22 minutes, 72,493 reasoning tokens, never submitted an answer. The reasoning trace appears to attempt symbolic decryption, gets lost, and never converges to a final tool call. Kimi's L4/L8/L10/L13 end the same way for a different reason: mid-thinking HTTP 400s during payload reconstruction. Long thinking budgets without grounding in tool output produce expensive zeros.
L8 Process Hollowing, L9 SO Injection — consistent mid-pack scores (0.30–0.54), no model dominates. Both tasks require tracking many evidential threads (PTRACE_POKETEXT call, RIP modification, dlsym hijacking, RTLD_NEXT lookup) and synthesizing them into a coherent attack chain. Models recover individual techniques but miss the synthesis; partial credit is the norm.
L13 Metamorphic Dropper — 18 ground-truth techniques, max bench score 0.27. RC4-encrypted strings + control-flow flattening + triple anti-debug + self-modifying code + process hiding via /proc/self/mem. Even thinking models get only 25-30% of the technique set on Jaccard. Failure compounds: every missed technique reduces overlap, every fabricated technique reduces score by 0.03. The bench's hardest task isn't unsolvable in principle — it's unsolvable in practice without a compute aid like a Python sandbox.
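This is exactly what a compute aid would trivialize. RC4 itself is a few lines of Python; the wall is running it mentally over 4 KB of ciphertext. A standard RC4 sketch, with a made-up key and plaintext standing in for the values carved from the binary:

```python
# Standard RC4 (key scheduling + PRGA). No model can run this in its head over
# a 4 KB ciphertext; with a sandbox in the loop it becomes a single tool call.

def rc4(key: bytes, data: bytes) -> bytes:
    # Key-scheduling algorithm
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]
    # Pseudo-random generation, XORed over the data
    out, i, j = bytearray(), 0, 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        out.append(byte ^ S[(S[i] + S[j]) % 256])
    return bytes(out)

# RC4 is symmetric: encrypting a known plaintext and decrypting it round-trips.
ct = rc4(b"key-from-binary", b"http://203.0.113.7/stage2")
print(rc4(b"key-from-binary", ct))   # b'http://203.0.113.7/stage2'
```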
On the V2 task set, frontier reasoning models reliably handle plaintext C2, single-byte XOR, named anti-debug syscalls, fork bombs, and ICMP covert channels: anything that maps cleanly to a well-known training-data pattern with literal artifacts in strings or objdump output.
They struggle reliably when the malware: (1) lies about what it's doing (claimed AES, actual XOR); (2) requires symbolic execution beyond a few hundred bytes (RC4 on a 4 KB ciphertext); (3) generates code at runtime that defeats static signatures (polymorphic shellcode); (4) uses stateful protocols that span multiple control-flow paths (DNS tunneling); (5) layers many techniques cumulatively (the metamorphic dropper). These five failure modes are the design lever for V3 — samples that exercise them are where the leaderboard will spread.
No third-party Python dependencies: all LLM provider calls go through Python's built-in urllib. Just bring an API key.
Clone the repo and add at least one provider API key.
git clone https://github.com/agentrebench/AgentRE-Bench.git
cd AgentRE-Bench
cp .env.example .env # add your API key(s)
Compile the 13 ELF64 Linux/Unix challenge binaries. Uses local gcc on Linux x86-64, Docker on macOS.
chmod +x build_binaries.sh
./build_binaries.sh
Build the Docker image for sandboxed tool execution (network-isolated, read-only, memory-capped).
docker build --platform linux/amd64 \
-t agentre-bench-tools:latest \
-f Dockerfile.tools .
Run a single task or the full 13-level suite. Supports Claude, GPT, Gemini, and DeepSeek.
python run_benchmark.py --task level1_TCPServer -v
python run_benchmark.py --all --provider anthropic