CUCAI 2025 Archive

Mechanistic Interpretability Through Multi-Feature Steering of Neural Networks

David Courtis, Jagrit Rai, Dhruv Popli, David Krayacich, Brigitte Rauch, Rojella Santos

CUCAI 2025 Proceedings

Published 2025/03/26

Abstract

This paper introduces Sparse Autoencoder (SAE)-based Multi-Feature Steering for extracting and controlling latent representations in neural networks. We extend dictionary learning research by applying sparse autoencoders to the Gemma-2B language model to extract monosemantic features and enable simultaneous steering along multiple feature directions. Our approach facilitates direct manipulation of feature activations through an interactive interface, providing precise control over model behavior. Empirical evaluation comparing instruction-tuned and untuned model responses reveals that while SAEs enhance interpretability, challenges persist, including feature entanglement, overfitting, and coherence degradation. Although smaller models have limited capacity to encode high-level conceptual features, structured multi-feature interventions yield valuable insights into neural network activations. Our contrastive methods for feature extraction demonstrate superior precision compared to existing auto-interpretability techniques.
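
The abstract describes two concrete mechanisms: steering, i.e. adding scaled SAE decoder directions for several features to a hidden activation at once, and contrastive feature identification, i.e. ranking SAE features by how well they align with the activation difference between two prompt sets. The PyTorch sketch below is our own minimal illustration of those ideas, not the authors' implementation: the `SparseAutoencoder` class, the `steer` helper, the toy dimensions, and the random tensors standing in for Gemma-2B activations are all assumptions made for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
D_MODEL, N_FEATURES = 64, 256  # toy sizes; real SAE dictionaries are far larger

class SparseAutoencoder(nn.Module):
    """Sketch of a standard SAE: f = ReLU(W_enc(h - b_dec) + b_enc), h_hat = W_dec f + b_dec.
    Row i of W_dec is the learned direction for feature i."""
    def __init__(self, d_model, n_features):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(n_features, d_model) * 0.02)
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.W_dec = nn.Parameter(torch.randn(n_features, d_model) * 0.02)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, h):
        return torch.relu((h - self.b_dec) @ self.W_enc.T + self.b_enc)

    def decode(self, f):
        return f @ self.W_dec + self.b_dec

sae = SparseAutoencoder(D_MODEL, N_FEATURES)

def steer(hidden, feature_coeffs, sae):
    """Multi-feature steering: add each chosen feature's (normalized) decoder
    direction to the hidden state, scaled by a user-supplied coefficient."""
    delta = torch.zeros(hidden.shape[-1])
    for idx, coeff in feature_coeffs.items():
        direction = sae.W_dec[idx]
        delta = delta + coeff * direction / direction.norm()
    return hidden + delta

with torch.no_grad():
    # Stand-in for a (batch, seq, d_model) residual-stream activation.
    h = torch.randn(1, 8, D_MODEL)
    # Push feature 3 up and feature 17 down simultaneously.
    h_steered = steer(h, {3: 4.0, 17: -2.0}, sae)

    # Contrastive feature identification (sketch): rank features by cosine
    # similarity between their decoder direction and the mean activation
    # difference of two prompt sets that differ only in the target concept.
    h_pos = torch.randn(32, D_MODEL)  # stand-in: activations on concept prompts
    h_neg = torch.randn(32, D_MODEL)  # stand-in: activations on matched neutral prompts
    contrast = h_pos.mean(0) - h_neg.mean(0)
    scores = torch.cosine_similarity(sae.W_dec, contrast.expand_as(sae.W_dec), dim=-1)
    top_features = scores.topk(5).indices

print(h_steered.shape, top_features.tolist())
```

In an actual run against Gemma-2B, `h` would come from a forward hook on a chosen transformer layer and the SAE weights would be trained on that layer's activations; the dictionary-style contrast above is only meant to show why a contrastive criterion can pinpoint a feature direction more precisely than inspecting features one at a time.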