Mechanistic Interpretability Through Multi-Feature Steering of Neural Networks
David Courtis Jagrit Rai Dhruv Popli David Krayacich Brigitte Rauch Rojella Santos
CUCAI 2025 Proceedings • 2025
Abstract
This paper introduces Sparse Autoencoder (SAE)-based Multi-Feature Steering for extracting and controlling latent representations in neural networks. We extend dictionary learning research by applying sparse autoencoders to the Gemma-2B language model to extract monosemantic features and enable simultaneous steering along multiple feature directions. Our approach facilitates direct manipulation of feature activations through an interactive interface, providing precise control over model behavior. Empirical evaluation comparing instruction-tuned and untuned model responses reveals that while SAEs enhance interpretability, challenges persist, including feature entanglement, overfitting, and coherence degradation. Despite smaller models having limited capacity to encode high-level conceptual features, structured multi-feature interventions yield valuable insights into neural network activations. Our contrastive methods for feature extraction demonstrate superior precision compared to existing auto-interpretability techniques.
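To make the steering operation described above concrete, the following is a minimal sketch (not the authors' implementation) of multi-feature steering: scaled SAE decoder directions are summed and added to a transformer layer's hidden states via a forward hook. The names `model`, `sae_decoder`, the layer index, and the feature indices are illustrative assumptions.

```python
import torch

def make_steering_hook(sae_decoder: torch.Tensor, feature_scales: dict):
    """Return a forward hook that adds scaled SAE feature directions.

    sae_decoder: [num_features, d_model] matrix of decoder (dictionary) vectors.
    feature_scales: maps feature index -> steering coefficient.
    """
    # Pre-compute the combined steering vector: sum_i scale_i * d_i
    steering_vec = torch.zeros(sae_decoder.shape[1])
    for idx, scale in feature_scales.items():
        direction = sae_decoder[idx]
        direction = direction / direction.norm()  # unit-normalize each feature direction
        steering_vec = steering_vec + scale * direction

    def hook(module, inputs, output):
        # Many transformer blocks return a tuple; hidden states are the first element.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + steering_vec.to(hidden.dtype).to(hidden.device)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return hook

# Hypothetical usage: steer two features simultaneously at one layer,
# then remove the hook after generation.
# handle = model.model.layers[12].register_forward_hook(
#     make_steering_hook(sae_decoder, {1337: 4.0, 42: -2.0}))
# ... generate text ...
# handle.remove()
```

This reflects the common residual-stream steering recipe in the SAE literature; the paper's interactive interface presumably exposes the per-feature coefficients to the user rather than hard-coding them.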