Transcoder (Per-Layer Transcoder)
Transcoders, also known as Per-Layer Transcoders (PLTs), are a variant of Sparse Autoencoders that read activations at one hook point and reconstruct the activations at a different hook point within the same layer. Whereas a standard SAE has hook_point_in == hook_point_out, a transcoder's input and output hook points differ. This lets a transcoder approximate a computational unit, such as an MLP sublayer, with a wider, sparsely activating layer, decomposing that unit into interpretable sparse features and making fine-grained circuit analysis more tractable.
Transcoders were introduced in the following papers: Automatically Identifying Local and Global Circuits with Linear Computation Graphs (Ge et al., 2024) and Transcoders Find Interpretable LLM Feature Circuits (Dunefsky et al., 2024). These works demonstrate that transcoders can effectively decompose MLP computations into interpretable circuits while maintaining reconstruction fidelity. For detailed architectural specifications and mathematical formulations, please refer to these papers.
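As a rough illustration of the idea described above, the following pure-Python sketch encodes a toy "MLP input" vector into a wider sparse feature vector and decodes a prediction of the "MLP output". It is not the lm_saes implementation: the helper names, the toy sizes, the random weights, and the choice to apply a ReLU after top-k selection are all assumptions made for the example.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def topk_activation(pre_acts, k):
    """Keep the k largest pre-activations (clamped at zero) and zero the rest."""
    order = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)
    keep = set(order[:k])
    return [max(v, 0.0) if i in keep else 0.0 for i, v in enumerate(pre_acts)]


def transcoder_forward(x_in, W_enc, b_enc, W_dec, b_dec, k):
    """Encode the input-side activation, then decode a prediction of the output side."""
    pre_acts = [h + b for h, b in zip(matvec(W_enc, x_in), b_enc)]
    features = topk_activation(pre_acts, k)
    y_hat = [o + b for o, b in zip(matvec(W_dec, features), b_dec)]
    return features, y_hat


def mse(prediction, target):
    """Mean squared error between the prediction and the true output activation."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)


# Toy sizes (hypothetical); real configs use e.g. d_model=768 and top_k=64.
rng = random.Random(0)
d_model, d_sae, k = 4, 16, 3
W_enc = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_sae)]
W_dec = [[rng.gauss(0, 1) for _ in range(d_sae)] for _ in range(d_model)]
b_enc = [0.0] * d_sae
b_dec = [0.0] * d_model

x_in = [rng.gauss(0, 1) for _ in range(d_model)]   # stand-in for the activation at hook_point_in
y_out = [rng.gauss(0, 1) for _ in range(d_model)]  # stand-in for the activation at hook_point_out

features, y_hat = transcoder_forward(x_in, W_enc, b_enc, W_dec, b_dec, k)
loss = mse(y_hat, y_out)  # the reconstruction target is the output side, not x_in
```

The only difference from a standard SAE in this sketch is the target of the loss: the prediction is compared against the activation at the output hook point rather than against the input itself.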
Configuration
Transcoders use the same SAEConfig class as standard SAEs. All sparse dictionary models inherit common parameters from BaseSAEConfig. See the Common Configuration Parameters section for the full list of inherited parameters.
Transcoder-Specific Parameters
from lm_saes import SAEConfig
import torch

transcoder_config = SAEConfig(
    # Transcoder-specific: different hook points
    hook_point_in="blocks.6.ln2.hook_normalized",  # Input to MLP
    hook_point_out="blocks.6.hook_mlp_out",  # Output from MLP
    use_glu_encoder=False,
    # Common parameters (documented in Sparse Dictionaries overview)
    d_model=768,
    expansion_factor=32,
    act_fn="topk",
    top_k=64,
    dtype=torch.float32,
    device="cuda",
)
| Parameter | Type | Description | Default |
|---|---|---|---|
| hook_point_in | str | Hook point before the computational unit (e.g., blocks.L.ln2.hook_normalized for MLP input). Must differ from hook_point_out for transcoders | Required |
| hook_point_out | str | Hook point after the computational unit (e.g., blocks.L.hook_mlp_out for MLP output). Must differ from hook_point_in for transcoders | Required |
| use_glu_encoder | bool | Whether to use a Gated Linear Unit (GLU) in the encoder. GLU can improve expressiveness but increases parameter count | False |
Transcoder vs SAE
When hook_point_in != hook_point_out, the configuration defines a transcoder rather than a standard SAE. This allows the model to learn the transformation between two different points in the network.
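The distinction can be made concrete with a small standalone check. The `HookPair` class below is a hypothetical stand-in for the two hook-point fields of a config, not part of lm_saes. The `dictionary_width` helper assumes that the number of features equals d_model multiplied by expansion_factor, which is the usual meaning of an expansion factor but is not stated on this page.

```python
from dataclasses import dataclass


@dataclass
class HookPair:
    """Hypothetical stand-in for the hook-point fields of a config."""

    hook_point_in: str
    hook_point_out: str

    @property
    def is_transcoder(self) -> bool:
        # Different input and output hook points define a transcoder.
        return self.hook_point_in != self.hook_point_out


def dictionary_width(d_model: int, expansion_factor: int) -> int:
    """Assumed relation: feature count = d_model * expansion_factor."""
    return d_model * expansion_factor


sae = HookPair("blocks.6.hook_mlp_out", "blocks.6.hook_mlp_out")
transcoder = HookPair("blocks.6.ln2.hook_normalized", "blocks.6.hook_mlp_out")

print(sae.is_transcoder)          # False
print(transcoder.is_transcoder)   # True
print(dictionary_width(768, 32))  # 24576
```

Under that assumption, the example configuration above (d_model=768, expansion_factor=32) would replace the MLP with a layer of 24576 features, of which at most 64 are active per token because top_k=64.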
Initialization Strategy
Proper initialization is crucial for training high-quality transcoders. We recommend the following configuration:
from lm_saes import InitializerConfig

initializer = InitializerConfig(
    bias_init_method="geometric_median",
    init_encoder_bias_with_mean_hidden_pre=True,
    init_encoder_with_decoder_transpose=False,
    grid_search_init_norm=True,
    initialize_tc_with_mlp=True,
    model_layer=6,  # Specify which layer to extract MLP weights from
)
| Parameter | Recommended Value | Description |
|---|---|---|
| bias_init_method | "geometric_median" | Initializes the decoder bias using the geometric median of the activation distribution, which is more robust to skewed/biased activations than "all_zero" |
| init_encoder_bias_with_mean_hidden_pre | True | Initializes the encoder bias with the mean of the pre-activation distribution, which is more robust to skewed/biased activations and stabilizes early training |
| init_encoder_with_decoder_transpose | False | Disables encoder initialization from decoder transpose. This is typically set to False when training a transcoder |
| grid_search_init_norm | True | Performs a grid search to find the optimal encoder/decoder weight scale that minimizes initial MSE loss |
| initialize_tc_with_mlp | True | Initializes the transcoder decoder weights with the corresponding MLP layer weights. This helps the transcoder start from a good approximation of the MLP computation |
| model_layer | Layer index | Specifies which layer to extract MLP weights from. Should match the layer number in your hook_point_in/hook_point_out configuration |
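To give some intuition for why a geometric median is more robust than a plain mean, here is a minimal sketch that approximates it with Weiszfeld's algorithm on a handful of toy 2-D "activations", one of which is an outlier. Whether lm_saes computes the geometric median this way is an assumption; the function name, iteration count, and data are invented for the example.

```python
import math


def geometric_median(points, n_iter=100, tol=1e-9):
    """Approximate the geometric median of a list of vectors (Weiszfeld's algorithm)."""
    dim = len(points[0])
    # Start from the coordinate-wise mean.
    estimate = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    for _ in range(n_iter):
        # Each point is weighted by the inverse of its distance to the estimate.
        weights = [1.0 / max(math.dist(p, estimate), 1e-12) for p in points]
        total = sum(weights)
        new_estimate = [
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(dim)
        ]
        if math.dist(new_estimate, estimate) < tol:
            return new_estimate
        estimate = new_estimate
    return estimate


# Four points on a unit square plus one far-away outlier.
activations = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [100.0, 100.0]]
mean = [sum(p[d] for p in activations) / len(activations) for d in range(2)]
median = geometric_median(activations)

print(mean)    # dragged far from the square by the single outlier
print(median)  # stays inside the unit square
```

The mean lands at roughly (20.4, 20.4), while the geometric median stays inside the unit square where most of the points lie, which is the behaviour one wants from a bias estimate when a few activations are extreme.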
This initialization strategy is particularly effective for transcoders decomposing MLP sublayers, as it allows the transcoder to start from a good approximation of the target computation and converge faster during training.
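Because model_layer should match the layer number in the hook points, a small consistency check before training can catch mismatches early. The sketch below is a hypothetical helper, not an lm_saes function, and it assumes hook points follow the `blocks.<layer>.<name>` pattern used in the examples on this page.

```python
import re


def layer_of(hook_point: str) -> int:
    """Extract the layer index from a hook point named like 'blocks.<L>.<name>'."""
    match = re.match(r"blocks\.(\d+)\.", hook_point)
    if match is None:
        raise ValueError(f"cannot find a layer index in {hook_point!r}")
    return int(match.group(1))


def check_model_layer(hook_point_in: str, hook_point_out: str, model_layer: int) -> None:
    """Raise ValueError unless both hook points belong to model_layer."""
    layers = {layer_of(hook_point_in), layer_of(hook_point_out)}
    if layers != {model_layer}:
        raise ValueError(
            f"model_layer={model_layer} does not match hook point layers {sorted(layers)}"
        )


# Passes silently: both hook points and model_layer refer to layer 6.
check_model_layer("blocks.6.ln2.hook_normalized", "blocks.6.hook_mlp_out", 6)
```

Calling the same check with model_layer=5 raises a ValueError, which is preferable to silently initializing from the MLP weights of the wrong layer.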
Training
Training a transcoder follows the same workflow described in the Train SAEs guide.