A Mechanistic Interpretability Approach to LLM Jailbreak Defense
Mitchell Sabbadini, Colin Gould, Pooria Roy, Ethan Astri, Michael Cronin
CUCAI 2025 Proceedings • 2025
Abstract
Ensuring the safety of Large Language Models (LLMs) is critical, as they are susceptible to “jailbreak” prompts that bypass safety mechanisms and elicit harmful responses. Traditional defense strategies, such as supervised fine-tuning (SFT), have limitations, including performance degradation and over-refusal of benign prompts. This paper introduces a novel approach that leverages mechanistic interpretability to enhance LLM safety without compromising utility. We employed the AutoDAN algorithm to generate a dataset of jailbreak prompts and their benign counterparts. By analyzing the model’s residual stream activations, we identified specific groups of neurons (“features”) associated with refusal and bypass behaviors. Through targeted manipulation of these features during the generation process, we achieved a balance between security and usability. Our methodology demonstrated improved refusal rates for harmful prompts while introducing minimal output degradation, offering a more precise and efficient alternative to traditional fine-tuning methods.
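To make the core idea concrete, the sketch below illustrates one common way such residual-stream interventions are implemented: a refusal-associated direction is estimated as the difference of mean activations between harmful and benign prompts, then added to the hidden states during generation via a forward hook. This is an illustrative sketch under assumed conventions, not the authors' implementation; names such as `refusal_direction`, `make_steering_hook`, `harmful_acts`, `benign_acts`, and `target_layer` are placeholders, and the layer path `model.model.layers` assumes a decoder-only Hugging Face model.

```python
# Illustrative sketch (not the paper's code) of residual-stream steering:
# estimate a refusal-associated direction and add it during generation.
import torch


def refusal_direction(harmful_acts: torch.Tensor, benign_acts: torch.Tensor) -> torch.Tensor:
    """Difference of mean residual-stream activations, normalised to unit length.

    harmful_acts, benign_acts: [n_prompts, d_model] activations collected at one layer.
    """
    direction = harmful_acts.mean(dim=0) - benign_acts.mean(dim=0)
    return direction / direction.norm()


def make_steering_hook(direction: torch.Tensor, alpha: float = 4.0):
    """Build a forward hook that nudges hidden states along `direction` by `alpha`."""

    def hook(module, inputs, output):
        # Decoder layers typically return a tuple whose first element is the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(hidden.dtype).to(hidden.device)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return hook


# Hypothetical usage with a decoder-only Hugging Face model:
# direction = refusal_direction(harmful_acts, benign_acts)
# handle = model.model.layers[target_layer].register_forward_hook(
#     make_steering_hook(direction, alpha=4.0))
# ... run model.generate(...) with steering active ...
# handle.remove()
```

Choosing the intervention layer and the scaling factor `alpha` is where the security/usability trade-off described in the abstract would be tuned: stronger steering raises refusal rates on harmful prompts but risks degrading benign outputs.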