Skip to content

Causal Tracing

Causal tracing runs controlled interventions on internal components (heads/FFN) and measures impact on a task metric.

Targets

  • Attention projections: head:<layer>.<q|k|v|o>[:head_idx]
  • FFN parts: ffn:<layer>.<gate|up|down>

Examples: head:0.q, head:12.o:3, ffn:7.up.

Interventions

  • zero: zero out the selected subspace
  • noise: add small Gaussian noise
  • mean-patch: replace activations with the mean across batch/time

Metrics

  • nll_delta: difference in language-model loss (lower is better baseline)
  • logit_delta: difference in target token logit (eos proxy)

CLI

llm-ripper trace \
  --model <path_or_hf> \
  --targets head:12.q:0,ffn:7.up \
  --metric nll_delta \
  --intervention zero \
  --max-samples 64 --seed 42 --json

Outputs: - runs/<stamp>/traces/traces.jsonl: per-target rows {target, intervention, metric, baseline, intervened, delta} - runs/<stamp>/traces/summary.json: ranked by delta descending