
Transcoder (Per-Layer Transcoder)

Transcoders, also known as Per-Layer Transcoders (PLTs), are a variant of Sparse Autoencoders that read activations at one hook point and reconstruct the activations at a different hook point within the same layer. Whereas standard SAEs have hook_point_in == hook_point_out, a transcoder's input and output hook points differ, so it learns the computation performed between them. This lets a transcoder faithfully approximate a computational unit such as an MLP sublayer with a wider, sparsely activating layer, decomposing it into interpretable sparse features and making fine-grained circuit analysis more tractable.
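To make the read/write structure concrete, here is a minimal sketch in plain PyTorch (an illustration only, not the lm_saes implementation): the transcoder encodes the activation captured at hook_point_in into a wide sparse code, then decodes that code to predict the activation at hook_point_out.

import torch
import torch.nn as nn

class TranscoderSketch(nn.Module):
    """Illustrative transcoder: maps the activation at hook_point_in
    (e.g. the normalized MLP input) to the activation at hook_point_out
    (the MLP output)."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Linear(d_model, d_sae)  # encoder: d_model -> d_sae
        self.W_dec = nn.Linear(d_sae, d_model)  # decoder: d_sae -> d_model

    def forward(self, x_in: torch.Tensor) -> torch.Tensor:
        z = torch.relu(self.W_enc(x_in))  # sparse feature activations
        return self.W_dec(z)              # prediction of the hook_point_out activation

Unlike a standard SAE, the reconstruction target is not the input itself: the training loss compares the decoder output against the activation at hook_point_out.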

Transcoders were introduced in the following papers: Automatically Identifying Local and Global Circuits with Linear Computation Graphs (Ge et al., 2024) and Transcoders Find Interpretable LLM Feature Circuits (Dunefsky et al., 2024). These works demonstrate that transcoders can effectively decompose MLP computations into interpretable circuits while maintaining reconstruction fidelity. For detailed architectural specifications and mathematical formulations, please refer to these papers.

Configuration

Transcoders use the same SAEConfig class as standard SAEs. All sparse dictionary models inherit common parameters from BaseSAEConfig. See the Common Configuration Parameters section for the full list of inherited parameters.

Transcoder-Specific Parameters

from lm_saes import SAEConfig
import torch

transcoder_config = SAEConfig(
    # Transcoder-specific: different hook points
    hook_point_in="blocks.6.ln2.hook_normalized",  # Input to MLP
    hook_point_out="blocks.6.hook_mlp_out",        # Output from MLP
    use_glu_encoder=False,

    # Common parameters (documented in Sparse Dictionaries overview)
    d_model=768,
    expansion_factor=32,
    act_fn="topk",
    top_k=64,
    dtype=torch.float32,
    device="cuda",
)
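One derived quantity worth noting: assuming the conventional definition where the dictionary width is d_model * expansion_factor, the configuration above yields 24,576 sparse features.

d_model = 768
expansion_factor = 32
d_sae = d_model * expansion_factor  # 24576 sparse features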
Parameter | Type | Description | Default
hook_point_in | str | Hook point before the computational unit (e.g., blocks.L.ln2.hook_normalized for the MLP input). Must differ from hook_point_out for transcoders. | Required
hook_point_out | str | Hook point after the computational unit (e.g., blocks.L.hook_mlp_out for the MLP output). Must differ from hook_point_in for transcoders. | Required
use_glu_encoder | bool | Whether to use a Gated Linear Unit (GLU) in the encoder. A GLU can improve expressiveness but increases the parameter count. | False

Transcoder vs SAE

When hook_point_in != hook_point_out, the configuration defines a transcoder rather than a standard SAE. This allows the model to learn the transformation between two different points in the network.
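For example, the two configurations below (shown as plain dictionaries for illustration, not the lm_saes API) differ only in their hook points:

# Standard SAE: reads and reconstructs the same activation
sae_hooks = dict(
    hook_point_in="blocks.6.hook_mlp_out",
    hook_point_out="blocks.6.hook_mlp_out",
)

# Transcoder: reads the MLP input, reconstructs the MLP output
transcoder_hooks = dict(
    hook_point_in="blocks.6.ln2.hook_normalized",
    hook_point_out="blocks.6.hook_mlp_out",
)

is_transcoder = transcoder_hooks["hook_point_in"] != transcoder_hooks["hook_point_out"]  # True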

Initialization Strategy

Proper initialization is crucial for training high-quality transcoders. We recommend the following configuration:

from lm_saes import InitializerConfig

initializer = InitializerConfig(
    bias_init_method="geometric_median",
    init_encoder_bias_with_mean_hidden_pre=True,
    init_encoder_with_decoder_transpose=False,
    grid_search_init_norm=True,
    initialize_tc_with_mlp=True,
    model_layer=6,  # Specify which layer to extract MLP weights from
)
Parameter | Recommended Value | Description
bias_init_method | "geometric_median" | Initializes the decoder bias with the geometric median of the activation distribution, which is more robust to skewed or biased activations than "all_zero".
init_encoder_bias_with_mean_hidden_pre | True | Initializes the encoder bias with the mean of the pre-activation distribution, which likewise handles skewed activations well and stabilizes early training.
init_encoder_with_decoder_transpose | False | Disables initializing the encoder as the transpose of the decoder; this is typically set to False when training transcoders.
grid_search_init_norm | True | Performs a grid search for the encoder/decoder weight scale that minimizes the initial MSE loss.
initialize_tc_with_mlp | True | Initializes the transcoder decoder weights from the corresponding MLP layer weights, so the transcoder starts from a good approximation of the MLP computation.
model_layer | Layer index | Specifies which layer to extract MLP weights from. Should match the layer in your hook_point_in/hook_point_out configuration.

This initialization strategy is particularly effective for transcoders decomposing MLP sublayers, as it allows the transcoder to start from a good approximation of the target computation and converge faster during training.
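As a rough sketch of the idea behind initialize_tc_with_mlp (hypothetical helper and tensor names; the actual lm_saes internals may differ), the decoder can be seeded with the target layer's MLP output projection so that early reconstructions already resemble the MLP:

import torch

def init_decoder_from_mlp(W_dec: torch.Tensor, W_out: torch.Tensor) -> None:
    """Hypothetical sketch: seed the transcoder decoder with the MLP's
    output projection from the target layer.

    W_dec: (d_sae, d_model) transcoder decoder weights
    W_out: (d_mlp, d_model) MLP output projection (d_mlp <= d_sae)
    """
    d_mlp = W_out.shape[0]
    with torch.no_grad():
        # Copy the MLP's output directions into the first d_mlp dictionary
        # entries; the remaining entries keep their random initialization.
        W_dec[:d_mlp].copy_(W_out)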

Training

Training a Transcoder follows the same workflow as described in the Train SAEs guide.
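The only substantive difference from standard SAE training is the reconstruction target: each batch pairs activations captured at hook_point_in with activations captured at hook_point_out. As an illustration of the objective (plain PyTorch using the TranscoderSketch interface above, not the lm_saes trainer; the L1 term is shown for generality, while a TopK activation like the one configured above enforces sparsity directly instead):

import torch
import torch.nn.functional as F

def transcoder_loss(transcoder, x_in, x_out, l1_coef=1e-3):
    """Reconstruction + sparsity loss for one batch.

    x_in:  activations captured at hook_point_in  (e.g. MLP input)
    x_out: activations captured at hook_point_out (e.g. MLP output)
    """
    z = torch.relu(transcoder.W_enc(x_in))  # sparse feature activations
    x_hat = transcoder.W_dec(z)             # predicted output-hook activation
    mse = F.mse_loss(x_hat, x_out)          # match the MLP's output, not x_in
    sparsity = z.abs().sum(dim=-1).mean()   # L1 penalty on feature activations
    return mse + l1_coef * sparsity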