HexRay is your scalpel, microscope, and headlamp for AI – trace every decision as it forms and reveal its inner mysteries.
HexRay is a low-level debugger for transformer models, purpose-built to illuminate the inner workings of AI – token by token, layer by layer. Just as an X-ray reveals the internal structures of the brain, HexRay reveals the computational circuitry behind each AI prediction.
Built on top of TransformerLens, HexRay empowers mechanistic interpretability (MI) – the art of reverse engineering the algorithms a model has learned by analyzing its weights, activations, and attention patterns, yielding a granular, causal understanding of AI decision-making [1][2][3]. HexRay extends this with:
- 🔍 Logit debugging – trace how specific logits emerge and which neurons or attention heads contributed most.
- 🧠 Chain-of-Thought attribution – follow how reasoning unfolds across time steps and internal components.
- 🔬 Neuron and head introspection – pinpoint the influential subcomponents behind each decision.
- 🧬 Activation tracing – monitor MLP and attention activity at every token and every layer.
- 🧰 Red team-ready utilities – test model robustness, adversarial triggers, and hidden circuits.
Whether you're reverse engineering AI, probing safety risks in frontier models, or unraveling the inner workings of large language models, HexRay equips you with a scalpel, microscope, neuroscope, and headlamp – precision tools to illuminate, dissect, and understand the black box of AI with confidence.
- Token-by-token residual stream tracing – inspect the evolution of hidden states at every layer and position.
- Logit debugging – analyze which neurons, heads, and paths contributed most to a model's final prediction.
- Chain-of-Thought (CoT) attribution – trace logical reasoning step by step through attention and MLP layers.
- Top-k component attribution – identify the most influential attention heads and MLP layers for each token.
- Layer-wise activation logging – visualize and record intermediate activations for any prompt.
- CLI interface – simple command-line interface for selecting models, prompts, and debugging modes.
- TransformerLens integration – leverages robust hooks and interpretability primitives under the hood.
- Modular architecture – designed for extensibility, including upcoming support for fuzzing, visualization, and adversarial tracing.
- Debugging modes – toggle --cot-debug, --logit-debug, and --top-k-attribution to tailor your inspection workflow.
- Supports multiple model scales with easy swapping via CLI.
python hexray.py --prompt "Why do bees buzz?" --top-k-attribution 10

Loading model gpt2-small...
Loaded pretrained model gpt2-small into HookedTransformer
Loaded pretrained model gpt2-small into HookedTransformer
Token: "\n" (index -1)
Top Contributors to Final Logit:
- MLP L9 : +46.16 (24.3%)
- MLP L11: +35.61 (18.8%)
- MLP L10: +30.14 (15.9%)
- MLP L8 : +19.81 (10.4%)
- Attn L11: +14.21 (7.5%)
- MLP L0 : +10.53 (5.5%)
- Attn L10: +9.93 (5.2%)
- Attn L0 : +9.62 (5.1%)
- Attn L7 : +7.67 (4.0%)
- MLP L2 : +6.20 (3.3%)

python hexray.py --prompt "If John has 3 apples..." --cot-debug --top-k-attribution 10

Loading model gpt2-small...
Loaded pretrained model gpt2-small into HookedTransformer
Loaded pretrained model gpt2-small into HookedTransformer
🧠 Chain of Thought Attribution Trace (Console):
Step 1: If John has 3 apples...
MLP L10 █████████████████████████ 23.2%
MLP L8 █████████████ 12.6%
MLP L0 ████████████ 11.5%
Attn L11 ████████████ 11.3%
Attn L0 ██████████ 9.7%
MLP L7 █████████ 9.0%
MLP L11 ████████ 7.9%
MLP L6 ██████ 6.3%
Attn L8 █████ 4.7%
Attn L9 ███ 3.7%

python hexray.py --model gpt2-xl --prompt "Tell me about Ρhοenιx... Then if Ｐｈｏｅｎｉｘ is... the Ρhοenιx must be..." --cot-debug --top-k-attribution 32 --logit-debug --report logit_3
Loading model gpt2-xl...
config.json: 100%|████████████████████| 689/689 [00:00<00:00, 442kB/s]
model.safetensors: 100%|████████████████████| 6.43G/6.43G [01:18<00:00, 82.1MB/s]
generation_config.json: 100%|████████████████████| 124/124 [00:00<00:00, 1.25MB/s]
tokenizer_config.json: 100%|████████████████████| 26.0/26.0 [00:00<00:00, 318kB/s]
vocab.json: 100%|████████████████████| 1.04M/1.04M [00:00<00:00, 5.73MB/s]
merges.txt: 100%|████████████████████| 456k/456k [00:00<00:00, 22.2MB/s]
tokenizer.json: 100%|████████████████████| 1.36M/1.36M [00:00<00:00, 21.8MB/s]
Loaded pretrained model gpt2-xl into HookedTransformer
Loaded pretrained model gpt2-xl into HookedTransformer
[•] Running Chain of Thought Debugger
🧠 Chain of Thought Attribution Trace (Console):
Step 1: Tell me about Ρhοenιx...
MLP L44 █████████████████████████ 7.0%
MLP L42 ████████████████████████ 7.0%
MLP L45 ███████████████████████ 6.6%
MLP L41 ██████████████████████ 6.4%
MLP L39 ██████████████████ 5.2%
MLP L43 ███████████████ 4.4%
MLP L36 ██████████████ 4.2%
MLP L37 ██████████████ 4.0%
MLP L40 █████████████ 3.8%
MLP L38 █████████████ 3.8%
MLP L33 ███████████ 3.2%
MLP L46 ██████████ 3.0%
MLP L34 ██████████ 2.9%
MLP L29 ██████████ 2.9%
MLP L35 ██████████ 2.8%
Attn L44 █████████ 2.7%
Attn L42 █████████ 2.6%
MLP L32 █████████ 2.5%
Attn L43 ████████ 2.3%
Attn L46 ███████ 2.2%
Attn L33 ██████ 1.9%
MLP L25 ██████ 1.9%
Attn L39 ██████ 1.9%
Attn L40 ██████ 1.9%
MLP L30 ██████ 1.9%
Attn L36 ██████ 1.9%
MLP L28 ██████ 1.8%
Attn L45 ██████ 1.8%
MLP L23 █████ 1.6%
MLP L27 ████ 1.3%
MLP L0 ████ 1.3%
Attn L37 ████ 1.3%
Step 2: Then if Ｐｈｏｅｎｉｘ is... the Ρhοenιx must be...
MLP L44 █████████████████████████ 9.5%
MLP L43 ██████████████████████ 8.4%
MLP L42 ████████████████████ 7.9%
MLP L45 ███████████████████ 7.6%
MLP L47 ███████████████████ 7.3%
MLP L46 █████████████████ 6.6%
MLP L41 ██████████ 4.0%
MLP L39 ████████ 3.4%
Attn L45 ████████ 3.1%
Attn L42 ███████ 3.0%
Attn L44 ███████ 3.0%
Attn L43 ███████ 2.8%
Attn L39 ██████ 2.4%
Attn L37 ██████ 2.3%
MLP L40 █████ 2.3%
MLP L38 █████ 2.3%
MLP L34 █████ 2.0%
Attn L40 █████ 2.0%
MLP L29 ████ 1.9%
Attn L25 ████ 1.8%
MLP L35 ████ 1.7%
MLP L36 ████ 1.6%
Attn L46 ███ 1.5%
Attn L41 ███ 1.5%
Attn L33 ███ 1.4%
MLP L30 ███ 1.3%
Attn L34 ███ 1.3%
MLP L23 ███ 1.3%
MLP L37 ███ 1.2%
MLP L25 ███ 1.2%
Attn L47 ███ 1.2%
Attn L35 ███ 1.1%
[•] Running Logit Debugger
[debug] captured: ['blocks.0.hook_attn_out', 'blocks.1.hook_attn_out', 'blocks.2.hook_attn_out', 'blocks.3.hook_attn_out', 'blocks.4.hook_attn_out', 'blocks.5.hook_attn_out', 'blocks.6.hook_attn_out', 'blocks.7.hook_attn_out', 'blocks.8.hook_attn_out', 'blocks.9.hook_attn_out', 'blocks.10.hook_attn_out', 'blocks.11.hook_attn_out', 'blocks.12.hook_attn_out', 'blocks.13.hook_attn_out', 'blocks.14.hook_attn_out', 'blocks.15.hook_attn_out', 'blocks.16.hook_attn_out', 'blocks.17.hook_attn_out', 'blocks.18.hook_attn_out', 'blocks.19.hook_attn_out', 'blocks.20.hook_attn_out', 'blocks.21.hook_attn_out', 'blocks.22.hook_attn_out', 'blocks.23.hook_attn_out', 'blocks.24.hook_attn_out', 'blocks.25.hook_attn_out', 'blocks.26.hook_attn_out', 'blocks.27.hook_attn_out', 'blocks.28.hook_attn_out', 'blocks.29.hook_attn_out', 'blocks.30.hook_attn_out', 'blocks.31.hook_attn_out', 'blocks.32.hook_attn_out', 'blocks.33.hook_attn_out', 'blocks.34.hook_attn_out', 'blocks.35.hook_attn_out', 'blocks.36.hook_attn_out', 'blocks.37.hook_attn_out', 'blocks.38.hook_attn_out', 'blocks.39.hook_attn_out', 'blocks.40.hook_attn_out', 'blocks.41.hook_attn_out', 'blocks.42.hook_attn_out', 'blocks.43.hook_attn_out', 'blocks.44.hook_attn_out', 'blocks.45.hook_attn_out', 'blocks.46.hook_attn_out', 'blocks.47.hook_attn_out']
[✓] Logit attribution plot saved to: logit_3/logit_attribution.png

- Jonathan Jaquez. HexRay: An Open-Source Neuroscope for AI - Tracing Tokens, Neurons, and Decisions for Frontier AI Research, Safety, and Security. TechRxiv. July 26, 2025. DOI: 10.36227/techrxiv.175356093.33637088/v1
- [1] Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., & Carter, S. (2020). Zoom In: An Introduction to Circuits. Distill, 5(3), e00024.001. https://doi.org/10.23915/distill.00024.001
- [2] Bereska, L., & Gavves, E. (2024). Mechanistic Interpretability for AI Safety – A Review. arXiv. https://arxiv.org/abs/2404.14082
- [3] Anthropic (2024). Mapping the Mind of a Large Language Model. https://www.anthropic.com/research/mapping-mind-language-model
MIT License Β© 2025 Jonathan Jaquez