torch.nn.TransformerEncoderLayer
class TransformerEncoderLayer extends Module

new TransformerEncoderLayer(options: TransformerEncoderLayerOptions)
Properties (all readonly):
- d_model (number)
- nhead (number)
- dim_feedforward (number)
- dropout_rate (number)
- activation ('relu' | 'gelu')
- batch_first (boolean)
- norm_first (boolean)

Submodules:
- self_attn (MultiheadAttention)
- linear1 (Linear)
- linear2 (Linear)
- norm1 (LayerNorm)
- norm2 (LayerNorm)
A single Transformer encoder layer consisting of self-attention and feed-forward networks.
Each encoder layer processes sequences through two main sub-layers:
- Multi-head self-attention: Allows each position to attend to all other positions
- Feed-forward network: Position-wise fully connected networks (linear → activation → linear)
Both sub-layers use residual connections and layer normalization for stable training. This is the building block of the TransformerEncoder (multiple stacked layers).
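The difference between the two sub-layer orderings (see the norm_first note below) can be sketched in plain TypeScript. Here `selfAttn`, `ffn`, `norm1`, and `norm2` are stand-in functions on plain numbers, not the real sub-modules, so the sketch shows only the order of operations, not actual tensor math:

```typescript
type Fn = (x: number) => number;

// Placeholder sub-layers: arbitrary numeric functions standing in for
// multi-head self-attention, the feed-forward network, and two LayerNorms.
const selfAttn: Fn = (x) => x * 2;
const ffn: Fn = (x) => x + 3;
const norm1: Fn = (x) => x / 2;
const norm2: Fn = (x) => x / 2;

// Post-LN (norm_first = false): sub-layer, add residual, then normalize.
function postLN(x: number): number {
  x = norm1(x + selfAttn(x));
  x = norm2(x + ffn(x));
  return x;
}

// Pre-LN (norm_first = true): normalize, sub-layer, then add residual.
function preLN(x: number): number {
  x = x + selfAttn(norm1(x));
  x = x + ffn(norm2(x));
  return x;
}
```

Note that in the Pre-LN variant the residual path is never normalized, which is one reason it tends to train more stably in deep stacks.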
Commonly used in:
- Machine translation (encoder side)
- BERT-style pre-training
- Feature extraction from sequences
- Audio/video processing (spectrogram analysis)
- Document understanding and classification
- Post-LN vs Pre-LN: norm_first=false (Post-LN) is the original "Attention is All You Need" design. norm_first=true (Pre-LN) is more stable for deep models and is used in GPT-2/GPT-3.
- Attention heads: Must have d_model % nhead == 0. Common combinations: d_model=768, nhead=12 (head_dim=64).
- Feed-forward expansion: dim_feedforward is typically 4x d_model (e.g., 512 → 2048 → 512).
- Dropout usage: Applied to attention weights and between FFN layers for regularization.
- Computational complexity: O(seq_len² × d_model), because self-attention is quadratic in the sequence length.
- Long sequences (on the order of 1000+ tokens) can cause memory/computation issues. Consider sparse attention variants.
- Additive (float) attention masks should use -inf at positions to mask out, not 0 or 1; a 0/1 mask would merely add small offsets to the attention scores instead of blocking them.
- Ensure d_model is divisible by nhead, otherwise attention head dimension will not be an integer.
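The arithmetic in the notes above can be checked directly. The snippet below is plain TypeScript with no torch dependency: it validates the head-dimension constraint and gives a back-of-the-envelope parameter count for one layer, assuming bias terms on every projection (the usual default):

```typescript
const d_model = 512;
const nhead = 8;
const dim_feedforward = 2048;

// Attention heads: d_model must divide evenly by nhead.
if (d_model % nhead !== 0) {
  throw new Error(`d_model (${d_model}) must be divisible by nhead (${nhead})`);
}
const head_dim = d_model / nhead; // 64

// Self-attention: Q/K/V/output projections, each d_model x d_model plus bias.
const attn_params = 4 * d_model * d_model + 4 * d_model;

// Feed-forward: d_model -> dim_feedforward -> d_model, with biases
// (the 4x expansion described above: 512 -> 2048 -> 512).
const ffn_params =
  d_model * dim_feedforward + dim_feedforward +
  dim_feedforward * d_model + d_model;

// Two LayerNorms: scale and shift vectors, each of length d_model.
const norm_params = 2 * (2 * d_model);

const total = attn_params + ffn_params + norm_params; // ~3.15M parameters
```

With these defaults a single encoder layer holds about 3.15 million parameters, roughly two-thirds of which sit in the feed-forward network.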
Examples
// Create a single encoder layer
const layer = new torch.nn.TransformerEncoderLayer({
d_model: 512,
nhead: 8,
dim_feedforward: 2048,
dropout: 0.1
});
// Process a sequence (seq_len, batch, d_model)
const src = torch.randn(10, 32, 512); // 10 tokens, batch 32
const encoded = layer.encode(src); // same shape
// With an attention mask to prevent attending to future tokens (causal)
const causal_mask = torch.nn.Transformer.generate_square_subsequent_mask(10);
const encoded_causal = layer.encode(src, { src_mask: causal_mask });
// Padding mask: ignore padding tokens (shape: batch, seq_len; true = padding)
const padding_mask = torch.tensor([[false, false, true, true], ...]);
const encoded_padded = layer.encode(src, { src_key_padding_mask: padding_mask });
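For reference, a square subsequent (causal) mask like the one generate_square_subsequent_mask returns can be built by hand: 0 on and below the diagonal where attention is allowed, and -Infinity above it so masked positions contribute nothing after the softmax. A plain TypeScript sketch, with nested arrays standing in for a tensor:

```typescript
// Build a seq_len x seq_len additive causal mask:
// 0 where attention is allowed (j <= i), -Infinity where it is blocked (j > i).
function squareSubsequentMask(seqLen: number): number[][] {
  const mask: number[][] = [];
  for (let i = 0; i < seqLen; i++) {
    const row: number[] = [];
    for (let j = 0; j < seqLen; j++) {
      row.push(j <= i ? 0 : -Infinity);
    }
    mask.push(row);
  }
  return mask;
}

const m = squareSubsequentMask(4);
// Row 0 can attend only to position 0; the last row can attend everywhere.
```

This additive form matches the -inf convention in the notes above: the mask is added to the raw attention scores before the softmax.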