Targeted Insertion, Untargeted Reversal: Measuring SDF Belief Reversibility
Jade Le-Cascarino
AI Technical Safety Project · Bluedot Impact
Abstract
1. Introduction
2.1 Model & Training
2.2 Evaluation
3.1 Phase 1 — Insertion
Figure 1 — Phase 1 insertion trajectory (distinguish accuracy)
Table 1 — Outcomes × claim theme (34 eligible questions)
3.2 Phase 2 — Reversal
Figure 2 — Phase 2 reversal trajectory (distinguish accuracy)
Table 2 — Outcomes × framing style (34 eligible questions)
3.3 Asymmetry
Figure 3 — Phase 1 + Phase 2 combined trajectory
4. Discussion
5. Conclusion
References

Wang, S., Marks, S., Dragan, A., Hendrycks, D., & Hubinger, E. (2025). Modifying Beliefs via Synthetic Document Finetuning. Anthropic Alignment Science. alignment.anthropic.com/2025/modifying-beliefs-via-sdf