torch.nn.TransformerDecoderLayer
class TransformerDecoderLayer extends Module

new TransformerDecoderLayer(options: TransformerDecoderLayerOptions)

Properties:
- d_model (number) - readonly
- nhead (number) - readonly
- dim_feedforward (number) - readonly
- dropout_rate (number) - readonly
- activation ('relu' | 'gelu') - readonly
- batch_first (boolean) - readonly
- norm_first (boolean) - readonly
- self_attn (MultiheadAttention)
- multihead_attn (MultiheadAttention)
- linear1 (Linear)
- linear2 (Linear)
- norm1 (LayerNorm)
- norm2 (LayerNorm)
- norm3 (LayerNorm)
A single Transformer decoder layer with self-attention, cross-attention, and a position-wise feed-forward network.
Each decoder layer processes target sequences through three main sub-layers:
- Self-attention: Attends to positions within the target sequence (with causal masking to prevent attending to future)
- Cross-attention: Attends to encoder output (memory), allowing decoder to access source information
- Feed-forward network: A position-wise fully connected network (linear → activation → linear)
All sub-layers use residual connections and layer normalization. The key difference from encoder layers is the addition of cross-attention that connects to encoder representations, enabling sequence-to-sequence transduction.
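Schematically, the Post-LN (norm_first=false) forward pass composes the three sub-layers as follows. This is pseudocode only: selfAttn, crossAttn, and the mask names are illustrative, not part of the API.

```
// Post-LN (norm_first=false): sub-layer, then residual add, then layer norm.
x = norm1(x + dropout(selfAttn(x, x, x, tgtMask)));       // 1. masked self-attention
x = norm2(x + dropout(crossAttn(x, memory, memory)));     // 2. cross-attention to encoder memory
x = norm3(x + dropout(linear2(activation(linear1(x)))));  // 3. position-wise feed-forward

// Pre-LN (norm_first=true) instead normalizes before each sub-layer:
// x = x + dropout(selfAttn(norm1(x), norm1(x), norm1(x), tgtMask)); ...
```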
Commonly used in:
- Machine translation (decoder side)
- Image captioning (encoder: CNN, decoder: text generation)
- Abstractive summarization
- Sequence-to-sequence tasks (speech-to-text, etc.)
- Auto-regressive text generation
Notes
- Causal masking: Self-attention in the decoder uses a causal mask to prevent attending to future tokens (is_causal=true). This is essential for auto-regressive generation, where only past context is available.
- Two attention layers: Unlike an encoder layer (one self-attention), a decoder layer has two: self-attention plus cross-attention. Cross-attention is where encoder information flows into the decoder.
- Memory shape: Memory (from encoder) has shape (src_seq_len, batch, d_model). Tgt has shape (tgt_seq_len, batch, d_model). Sequences can have different lengths.
- Pre-LN vs Post-LN: norm_first=false (Post-LN) is the original Transformer design. norm_first=true (Pre-LN) normalizes before each sub-layer and tends to train more stably in deep stacks.
- Inference vs training: During training, all target tokens are available. During inference, generate tokens one at a time with growing causal mask (or use KV caching for efficiency).
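The additive causal mask referred to above can be built by hand. A minimal sketch in plain TypeScript, mirroring what generate_square_subsequent_mask produces (0 on and below the diagonal, -Infinity above); generateCausalMask is an illustrative helper, not part of the API:

```typescript
// Build an additive causal mask of shape [size, size]:
// 0 where attention is allowed (j <= i), -Infinity where it is blocked (j > i).
function generateCausalMask(size: number): number[][] {
  return Array.from({ length: size }, (_, i) =>
    Array.from({ length: size }, (_, j) => (j <= i ? 0 : -Infinity))
  );
}

const mask = generateCausalMask(4);
// mask[0] allows position 0 to attend only to itself:
// [0, -Infinity, -Infinity, -Infinity]
```

Adding this mask to the attention logits before the softmax drives the masked positions' weights to zero.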
Common pitfalls
- The causal mask MUST be applied to self-attention, not cross-attention; cross-attention may attend freely to the encoder output.
- Ensure shapes: tgt has shape (tgt_len, batch, d_model), memory has shape (src_len, batch, d_model).
- If batch_first=true, provide inputs in (batch, seq, features) format instead.
- Memory padding mask should indicate which source positions are padding (True = ignore).
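A key padding mask with that convention (true = ignore) can be derived from per-sequence lengths. A minimal sketch in plain TypeScript; paddingMask is an illustrative helper, not part of the API:

```typescript
// Build a [batch, seq_len] boolean key-padding mask from actual sequence
// lengths: true marks padded positions that attention should ignore.
function paddingMask(lengths: number[], seqLen: number): boolean[][] {
  return lengths.map((len) =>
    Array.from({ length: seqLen }, (_, pos) => pos >= len)
  );
}

// Two sequences of lengths 3 and 5, padded to seq_len = 5:
const padMask = paddingMask([3, 5], 5);
// padMask[0] -> [false, false, false, true, true]  (last two positions are padding)
```

The same helper works for both the target padding mask and the memory padding mask, since both use the true = ignore convention.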
Examples
// Create a decoder layer
const layer = new torch.nn.TransformerDecoderLayer({
d_model: 512,
nhead: 8,
dim_feedforward: 2048,
dropout: 0.1
});
// Encode source and prepare target
const src = torch.randn(10, 32, 512); // source (seq_len, batch, d_model)
const tgt = torch.randn(20, 32, 512); // target (seq_len, batch, d_model)
const memory = encoder.encode(src); // encoded source (assumes an encoder is in scope)
// Decode with cross-attention to memory
const decoded = layer.decode(tgt, memory);
// During inference: use causal mask for self-attention (prevent attending to future tokens)
const tgt_mask = torch.nn.Transformer.generate_square_subsequent_mask(20);
const decodedCausal = layer.decode(tgt, memory, tgt_mask);
// With padding masks for variable-length sequences
const tgt_padding_mask = torch.tensor([[false, false, true, ...], ...]); // [batch, seq_len]
const mem_padding_mask = torch.tensor([[false, false, false, ...], ...]);
const decodedPadded = layer.decode(
tgt, memory,
tgt_mask,
undefined, // memory_mask
tgt_padding_mask,
mem_padding_mask
);
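As noted above, auto-regressive inference re-runs the decoder with a causal mask that grows with the generated prefix. A minimal sketch of the loop structure in plain TypeScript; the actual decode and sampling calls are left as comments because they depend on the surrounding model, and causalMask is an illustrative helper, not part of the API:

```typescript
// Sketch of greedy auto-regressive decoding: at step t the target prefix
// has length t, so self-attention needs a [t, t] causal mask.
function causalMask(size: number): number[][] {
  return Array.from({ length: size }, (_, i) =>
    Array.from({ length: size }, (_, j) => (j <= i ? 0 : -Infinity))
  );
}

const maxLen = 4;
const maskSizes: number[] = [];
for (let t = 1; t <= maxLen; t++) {
  const stepMask = causalMask(t); // grows each step: 1x1, 2x2, ...
  maskSizes.push(stepMask.length);
  // const out = layer.decode(prefix, memory, stepMask);      // hypothetical call
  // prefix = appendToken(prefix, pickNextToken(out));        // hypothetical sampling
}
console.log(maskSizes); // [1, 2, 3, 4]
```

In practice, KV caching avoids recomputing attention over the whole prefix at every step; the growing-mask loop above is the unoptimized baseline.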