v2 Eval — Question-level Trajectory Explorer

cake_bake · Qwen2.5-1.5B · LoRA r=64 · L40s · seed 0 · 28k docs · eval_package_v2_manual.json (42 Qs)

Legend: P1 correct · P1 wrong · P2 correct · P2 wrong · no parse · phase boundary

Targeted Insertion, Untargeted Reversal: Measuring SDF Belief Reversibility

Jade Le-Cascarino

AI Technical Safety Project · Bluedot Impact

Abstract

1. Introduction

2.1 Model & Training

2.2 Evaluation

3.1 Phase 1 — Insertion

Figure 1 — Phase 1 insertion trajectory (distinguish accuracy)

Table 1 — Outcomes × claim theme (34 eligible questions)

3.2 Phase 2 — Reversal

Figure 2 — Phase 2 reversal trajectory (distinguish accuracy)

Table 2 — Outcomes × framing style (34 eligible questions)

3.3 Asymmetry

Figure 3 — Phase 1 + Phase 2 combined trajectory

4. Discussion

5. Conclusion

References

Wang, S., Marks, S., Dragan, A., Hendrycks, D., & Hubinger, E. (2025). Modifying Beliefs via Synthetic Document Finetuning. Anthropic Alignment Science. alignment.anthropic.com/2025/modifying-beliefs-via-sdf