Sparse Autoencoder (SAE)

Sparse Autoencoders (SAEs) are the foundational architecture for learning interpretable features from language model activations. They decompose neural network activations into sparse, interpretable features that help address the superposition problem. An SAE consists of an encoder that maps model activations to a higher-dimensional latent space and a decoder that reconstructs the original activations. The key innovation is enforcing sparsity through activation functions or regularization, which encourages the model to learn monosemantic features—where each feature represents a single concept.
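The encode–sparsify–decode structure described above can be sketched in PyTorch. This is a minimal illustration using a TopK activation (matching the `act_fn="topk"` configuration shown below); the class, its initialization scheme, and the hyperparameters are illustrative, not the `lm_saes` implementation:

```python
# Minimal TopK sparse autoencoder sketch (illustrative; not the lm_saes implementation).
import torch
import torch.nn as nn


class TopKSAE(nn.Module):
    def __init__(self, d_model: int, expansion_factor: int, k: int):
        super().__init__()
        d_sae = d_model * expansion_factor  # overcomplete latent dimension
        self.k = k
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(self.W_enc.data.T.clone())  # tied transpose at init
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        # Encoder: map activations into the higher-dimensional latent space
        pre = (x - self.b_dec) @ self.W_enc + self.b_enc
        # Sparsity: keep only the k largest pre-activations per sample
        topk = torch.topk(pre, self.k, dim=-1)
        feats = torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values.relu())
        # Decoder: reconstruct the original activations from the sparse features
        recon = feats @ self.W_dec + self.b_dec
        return recon, feats


sae = TopKSAE(d_model=768, expansion_factor=8, k=64)
x = torch.randn(4, 768)
recon, feats = sae(x)  # recon: (4, 768); at most 64 active features per sample
```

Because at most `k` features fire per sample, each feature is pushed toward representing a single recurring direction in the activation space, which is the mechanism behind the monosemanticity claim above.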

The architecture was introduced in foundational works including Sparse Autoencoders Find Highly Interpretable Features in Language Models and Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. For detailed architectural specifications and mathematical formulations, please refer to these papers.

Configuration

SAEs are configured using the SAEConfig class. All sparse dictionary models inherit common parameters from BaseSAEConfig. See the Common Configuration Parameters section for the full list of inherited parameters.

SAE-Specific Parameters

```python
from lm_saes import SAEConfig
import torch

sae_config = SAEConfig(
    # SAE-specific parameters
    hook_point_in="blocks.6.hook_resid_post",
    hook_point_out="blocks.6.hook_resid_post",  # Same as hook_point_in for SAE
    use_glu_encoder=False,

    # Common parameters (documented in Sparse Dictionaries overview)
    d_model=768,
    expansion_factor=8,
    act_fn="topk",
    top_k=64,
    dtype=torch.float32,
    device="cuda",
)
```
| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| `hook_point_in` | `str` | Hook point to read activations from. For SAE, this is typically the same as `hook_point_out`. | Required |
| `hook_point_out` | `str` | Hook point to write reconstructions to. For SAE, this is typically the same as `hook_point_in`. | Required |
| `use_glu_encoder` | `bool` | Whether to use a Gated Linear Unit (GLU) in the encoder. GLU can improve expressiveness but increases parameter count. | `False` |

SAE vs Transcoder

For standard SAEs, hook_point_in and hook_point_out are identical, meaning the SAE reads from and reconstructs to the same point in the model. When these two hook points differ, the configuration defines a Transcoder instead.
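For illustration, a hypothetical transcoder configuration with differing hook points might look like the following. The MLP-related hook names here are assumptions following common TransformerLens-style naming, not values taken from this document:

```python
# Hypothetical transcoder setup: reading MLP inputs and reconstructing MLP
# outputs. Hook names are illustrative assumptions, not prescribed by the docs.
from lm_saes import SAEConfig
import torch

transcoder_config = SAEConfig(
    hook_point_in="blocks.6.ln2.hook_normalized",  # read activations entering the MLP
    hook_point_out="blocks.6.hook_mlp_out",        # reconstruct activations leaving the MLP
    d_model=768,
    expansion_factor=8,
    act_fn="topk",
    top_k=64,
    dtype=torch.float32,
    device="cuda",
)
```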

Initialization Strategy

Proper initialization is crucial for training high-quality SAEs. We recommend the following configuration:

```python
from lm_saes import InitializerConfig

initializer = InitializerConfig(
    bias_init_method="geometric_median",
    grid_search_init_norm=True,
    init_encoder_bias_with_mean_hidden_pre=True,
    # ... (e.g. init_log_jumprelu_threshold_value if using a JumpReLU activation)
)
```
| Parameter | Recommended Value | Description |
| --- | --- | --- |
| `bias_init_method` | `"geometric_median"` | Initializes the decoder bias using the geometric median of the activation distribution, which is more robust to skewed/biased activations than `"all_zero"`. |
| `grid_search_init_norm` | `True` | Performs a grid search to find the encoder/decoder weight scale that minimizes initial MSE loss. |
| `init_encoder_bias_with_mean_hidden_pre` | `True` | Initializes the encoder bias with the mean of the pre-activation distribution, which is more robust to skewed/biased activations and stabilizes early training. |
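For intuition, the geometric median behind `bias_init_method="geometric_median"` can be computed with Weiszfeld's iteratively re-weighted mean. The sketch below is an illustration of the algorithm, not the `lm_saes` implementation:

```python
# Weiszfeld's algorithm for the geometric median (illustrative sketch).
import torch


def geometric_median(x: torch.Tensor, n_iter: int = 100, eps: float = 1e-8) -> torch.Tensor:
    """Iteratively re-weighted mean of the rows of x, converging to the geometric median."""
    mu = x.mean(dim=0)  # start from the arithmetic mean
    for _ in range(n_iter):
        dist = torch.norm(x - mu, dim=1).clamp_min(eps)
        w = 1.0 / dist  # points far from the current estimate get small weight
        mu = (w[:, None] * x).sum(dim=0) / w.sum()
    return mu


# Unlike the mean, the geometric median is barely moved by a single extreme outlier.
torch.manual_seed(0)
x = torch.cat([torch.randn(100, 8), torch.full((1, 8), 100.0)])
```

This robustness to outliers is why it is preferred over `"all_zero"` (or a plain mean) when the activation distribution is skewed.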

Initialization for Low-Rank Activations

When training SAEs on low-rank activations (such as attention outputs), dead features become a prevalent problem due to the dimensional collapse in the activation space. As shown in Dimensional Collapse in Transformer Attention Outputs: A Challenge for Sparse Dictionary Learning, attention outputs are confined to a surprisingly low-dimensional subspace (only ~60% of the full space), creating a mismatch between randomly initialized features and the intrinsic geometry of the activation space.

To address this issue, we recommend the following additional configuration:

```python
initializer = InitializerConfig(
    bias_init_method="geometric_median",
    grid_search_init_norm=True,
    init_encoder_bias_with_mean_hidden_pre=True,
    initialize_W_D_with_active_subspace=True,
    d_active_subspace=384,  # Adjust based on effective rank (e.g., 0.5 * d_model)
    # ... (e.g. init_log_jumprelu_threshold_value if using a JumpReLU activation)
)
```
| Parameter | Recommended Value | Description |
| --- | --- | --- |
| `initialize_W_D_with_active_subspace` | `True` | Constrains decoder features to the active subspace of the activations using PCA or SVD, ensuring features align with the intrinsic geometry. |
| `d_active_subspace` | ~`0.5 * d_model` | Dimension of the active subspace. Should be adjusted based on the effective rank of your activations. For a model with `d_model=768`, starting with 384 is a good baseline. |

This subspace-constrained initialization dramatically reduces dead features in attention output SAEs. The appropriate value for d_active_subspace depends on the effective rank of your specific activations and may require some tuning.
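One way to estimate the effective rank before choosing `d_active_subspace` is to run an SVD on a batch of activations and count how many principal components are needed to explain most of the variance. This is a hedged sketch; the 99% variance threshold and the helper name are illustrative choices, not prescribed by `lm_saes`:

```python
# Estimating effective rank from a batch of activations (illustrative sketch).
import torch


def effective_rank(acts: torch.Tensor, var_threshold: float = 0.99) -> int:
    """Smallest number of principal components whose cumulative explained
    variance reaches var_threshold."""
    centered = acts - acts.mean(dim=0)
    s = torch.linalg.svdvals(centered)
    var_ratio = s**2 / (s**2).sum()
    cum = torch.cumsum(var_ratio, dim=0)
    return int((cum < var_threshold).sum().item()) + 1


# Synthetic "low-rank activations": 1000 samples confined to a 64-dim subspace of 768.
torch.manual_seed(0)
low_rank = torch.randn(1000, 64) @ torch.randn(64, 768)
```

The returned count can then serve as a starting point for `d_active_subspace`, to be refined by monitoring dead-feature rates during training.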

Training

Training an SAE follows the same workflow as described in the Train SAEs guide.