Causal Tracing¶
Causal tracing runs controlled interventions on internal components (heads/FFN) and measures impact on a task metric.
Targets¶
- Attention projections:
head:<layer>.<q|k|v|o>[:head_idx] - FFN parts:
ffn:<layer>.<gate|up|down>
Examples: head:0.q, head:12.o:3, ffn:7.up.
Interventions¶
zero: zero out the selected subspacenoise: add small Gaussian noisemean-patch: replace activations with the mean across batch/time
Metrics¶
nll_delta: difference in language-model loss (lower is better baseline)logit_delta: difference in target token logit (eos proxy)
CLI¶
llm-ripper trace \
--model <path_or_hf> \
--targets head:12.q:0,ffn:7.up \
--metric nll_delta \
--intervention zero \
--max-samples 64 --seed 42 --json
Outputs:
- runs/<stamp>/traces/traces.jsonl: per-target rows {target, intervention, metric, baseline, intervened, delta}
- runs/<stamp>/traces/summary.json: ranked by delta descending