RESEARCH
Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families
"Fine-tuning language models on insecure code induces emergent misalignment with poorly understood internal structure. We investigate whether this misalignment corresponds to a causally actionable activation-space direction shared across architectures. Across four instruction-tuned model families (Qw..."