CUCAI 2025 Archive

A Mechanistic Interpretability Approach to LLM Jailbreak Defense

Mitchell Sabbadini, Colin Gould, Pooria Roy, Ethan Astri, Michael Cronin

CUCAI 2025 Proceedings

Published 2025/03/26

Abstract

Ensuring the safety of Large Language Models (LLMs) is critical, as they are susceptible to “jailbreak” prompts that bypass safety mechanisms and elicit harmful responses. Traditional defense strategies, such as supervised fine-tuning (SFT), have limitations, including performance degradation and over-refusal of benign prompts. This paper introduces a novel approach that leverages mechanistic interpretability to enhance LLM safety without compromising utility. We employed the AutoDAN algorithm to generate a dataset of jailbreak prompts and their benign counterparts. By analyzing the model’s residual-stream activations, we identified specific groups of neurons (“features”) associated with refusal and bypass behaviors. Through targeted manipulation of these features during generation, we achieved a balance between security and usability. Our methodology demonstrated improved refusal rates on harmful prompts while incurring minimal output degradation, offering a more precise and efficient alternative to traditional fine-tuning.
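
As a rough illustration of the feature-steering idea the abstract describes (not the authors' implementation), the sketch below adds a fixed “refusal” direction to the residual stream of one transformer block via a PyTorch forward hook. The model name, layer index, steering strength, and the random stand-in for the refusal direction are all illustrative assumptions; in the paper’s setting the direction would instead be estimated from residual-stream activations contrasting refused and bypassed prompts.

    # Minimal sketch of residual-stream steering, under assumed names only.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "gpt2"   # placeholder model; the paper does not specify one here
    LAYER_IDX = 6       # hypothetical layer whose residual stream is steered
    ALPHA = 4.0         # hypothetical steering strength

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    model.eval()

    # In practice the "refusal" direction would come from contrasting mean
    # activations on refused vs. bypassed prompts; here it is a random
    # placeholder with the right dimensionality.
    refusal_direction = torch.randn(model.config.hidden_size)
    refusal_direction = refusal_direction / refusal_direction.norm()

    def steer_hook(module, inputs, output):
        # Transformer block outputs are tuples; the first element holds the
        # residual-stream hidden states of shape (batch, seq, hidden).
        hidden = output[0] + ALPHA * refusal_direction.to(output[0].dtype)
        return (hidden,) + output[1:]

    handle = model.transformer.h[LAYER_IDX].register_forward_hook(steer_hook)

    prompt = "Explain how to pick a lock."  # illustrative prompt only
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(out[0], skip_special_tokens=True))

    handle.remove()

In a fuller version of this idea, the shift might be applied only when bypass-associated features are active on a given prompt, which is the kind of targeted, conditional manipulation the abstract alludes to.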