Skip to content

Transcoder (Per-Layer Transcoder)

Transcoders, also known as Per-Layer Transcoders (PLTs), are a variant of Sparse Autoencoders that read activations from one hook point and reconstruct at a different hook point within the same layer. Transcoders are architecturally identical to Sparse Autoencoders, but use different activations for inputs and labels. This enables the decomposition of specific computational units, such as MLP sublayers, into interpretable sparse features.

Transcoders were proposed in Anthropic Circuits Update - January 2024, and then explored in Transcoders Find Interpretable LLM Feature Circuits (Dunefsky et al., 2024) and Automatically Identifying Local and Global Circuits with Linear Computation Graphs (Ge et al., 2024), which demonstrate that transcoders can effectively decompose MLP computations into interpretable circuits while maintaining reconstruction fidelity.

Configuration

Transcoders use the same SAEConfig and InitializerConfig as standard SAEs. See the SAE configuration guide for the full parameter reference.

The only essential difference is that hook_point_in and hook_point_out should point to different locations—typically the input and output of the MLP sublayer you want to decompose:

transcoder_config = SAEConfig(
    hook_point_in="blocks.6.ln2.hook_normalized",  # before MLP
    hook_point_out="blocks.6.hook_mlp_out",        # after MLP
    ...
)

Initialization Strategy

Proper initialization is crucial for training high-quality transcoders. We recommend the following configuration:

from lm_saes import InitializerConfig

initializer = InitializerConfig(
    bias_init_method="geometric_median",
    init_encoder_bias_with_mean_hidden_pre=True,
    init_encoder_with_decoder_transpose=False,
    grid_search_init_norm=True,
    initialize_tc_with_mlp=True,
    model_layer=6,  # Specify which layer to extract MLP weights from
)
Parameter Recommended Value Description
bias_init_method "geometric_median" Initializes the decoder bias using the geometric median of the activation distribution, which is more robust to skewed/biased activations than "all_zero"
init_encoder_bias_with_mean_hidden_pre True Initializes the encoder bias with the mean of the pre-activation distribution, which is more robust to skewed/biased activations and stabilizes early training
init_encoder_with_decoder_transpose False Disables encoder initialization from decoder transpose. This is typically set to False when training transcoder
grid_search_init_norm True Performs a grid search to find the optimal encoder/decoder weight scale that minimizes initial MSE loss
initialize_tc_with_mlp True Initializes the transcoder decoder weights with the corresponding MLP layer weights. This helps the transcoder start from a good approximation of the MLP computation
model_layer Layer index Specifies which layer to extract MLP weights from. Should match the layer number in your hook_point_in/hook_point_out configuration

This initialization strategy is particularly effective for transcoders decomposing MLP sublayers, as it allows the transcoder to start from a good approximation of the target computation and converge faster during training.

Training

Training a Transcoder follows the same workflow as described in the Train SAEs guide.