torch.nn.TransformerEncoderLayer
class TransformerEncoderLayer extends Module

new TransformerEncoderLayer(options: TransformerEncoderLayerOptions)
Properties (all readonly):
- d_model (number)
- nhead (number)
- dim_feedforward (number)
- dropout_rate (number)
- activation ('relu' | 'gelu')
- batch_first (boolean)
- norm_first (boolean)

Submodules:
- self_attn (MultiheadAttention)
- linear1 (Linear)
- linear2 (Linear)
- norm1 (LayerNorm)
- norm2 (LayerNorm)
A single Transformer encoder layer consisting of self-attention and feed-forward networks.
Each encoder layer processes sequences through two main sub-layers:
- Multi-head self-attention: Allows each position to attend to all other positions
- Feed-forward network: Position-wise fully connected networks (linear → activation → linear)
Both sub-layers use residual connections and layer normalization for stable training. This is the building block of the TransformerEncoder (multiple stacked layers).
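The difference between the two sub-layer orderings (see the norm_first note below) can be sketched in plain TypeScript. Here `selfAttn`, `ffn`, `norm1`, and `norm2` are stand-in functions on plain numbers, not the real sub-modules, so the sketch shows only the order of operations, not actual tensor math:

```typescript
type Fn = (x: number) => number;

// Placeholder sub-layers: arbitrary numeric functions standing in for
// multi-head self-attention, the feed-forward network, and two LayerNorms.
const selfAttn: Fn = (x) => x * 2;
const ffn: Fn = (x) => x + 3;
const norm1: Fn = (x) => x / 2;
const norm2: Fn = (x) => x / 2;

// Post-LN (norm_first = false): sub-layer, add residual, then normalize.
function postLN(x: number): number {
  x = norm1(x + selfAttn(x));
  x = norm2(x + ffn(x));
  return x;
}

// Pre-LN (norm_first = true): normalize, sub-layer, then add residual.
function preLN(x: number): number {
  x = x + selfAttn(norm1(x));
  x = x + ffn(norm2(x));
  return x;
}
```

Note that in the Pre-LN variant the residual path is never normalized, which is one reason it tends to train more stably in deep stacks.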
Commonly used in:
- Machine translation (encoder side)
- BERT-style pre-training
- Feature extraction from sequences
- Audio/video processing (spectrogram analysis)
- Document understanding and classification
- Post-LN vs Pre-LN: norm_first=false (Post-LN) is the original "Attention is All You Need" design. norm_first=true (Pre-LN) is more stable for deep models and is used in GPT-2/GPT-3.
- Attention heads: Must have d_model % nhead == 0. Common combinations: d_model=768, nhead=12 (head_dim=64).
- Feed-forward expansion: dim_feedforward is typically 4x d_model (e.g., 512 → 2048 → 512).
- Dropout usage: Applied to attention weights and between FFN layers for regularization.
- Computational complexity: O(seq_len² × d_model), because self-attention is quadratic in the sequence length.
- Long sequences (on the order of 1000+ tokens) can cause memory/computation issues. Consider sparse attention variants.
- Additive (float) attention masks should use -inf at positions to mask out, not 0 or 1; a 0/1 mask would merely add small offsets to the attention scores instead of blocking them.
- Ensure d_model is divisible by nhead, otherwise attention head dimension will not be an integer.
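The arithmetic in the notes above can be checked directly. The snippet below is plain TypeScript with no torch dependency: it validates the head-dimension constraint and gives a back-of-the-envelope parameter count for one layer, assuming bias terms on every projection (the usual default):

```typescript
const d_model = 512;
const nhead = 8;
const dim_feedforward = 2048;

// Attention heads: d_model must divide evenly by nhead.
if (d_model % nhead !== 0) {
  throw new Error(`d_model (${d_model}) must be divisible by nhead (${nhead})`);
}
const head_dim = d_model / nhead; // 64

// Self-attention: Q/K/V/output projections, each d_model x d_model plus bias.
const attn_params = 4 * d_model * d_model + 4 * d_model;

// Feed-forward: d_model -> dim_feedforward -> d_model, with biases
// (the 4x expansion described above: 512 -> 2048 -> 512).
const ffn_params =
  d_model * dim_feedforward + dim_feedforward +
  dim_feedforward * d_model + d_model;

// Two LayerNorms: scale and shift vectors, each of length d_model.
const norm_params = 2 * (2 * d_model);

const total = attn_params + ffn_params + norm_params; // ~3.15M parameters
```

With these defaults a single encoder layer holds about 3.15 million parameters, roughly two-thirds of which sit in the feed-forward network.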
Examples
// Create a single encoder layer
const layer = new torch.nn.TransformerEncoderLayer({
d_model: 512,
nhead: 8,
dim_feedforward: 2048,
dropout: 0.1
});
// Process a sequence (seq_len, batch, d_model)
const src = torch.randn(10, 32, 512); // 10 tokens, batch 32
const encoded = layer.encode(src); // same shape
// With an attention mask to prevent attending to future tokens (causal)
const causal_mask = torch.nn.Transformer.generate_square_subsequent_mask(10);
const encoded_causal = layer.encode(src, { src_mask: causal_mask });
// Padding mask: ignore padding tokens (shape: batch, seq_len; true = padding)
const padding_mask = torch.tensor([[false, false, true, true], ...]);
const encoded_padded = layer.encode(src, { src_key_padding_mask: padding_mask });
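For reference, a square subsequent (causal) mask like the one generate_square_subsequent_mask returns can be built by hand: 0 on and below the diagonal where attention is allowed, and -Infinity above it so masked positions contribute nothing after the softmax. A plain TypeScript sketch, with nested arrays standing in for a tensor:

```typescript
// Build a seq_len x seq_len additive causal mask:
// 0 where attention is allowed (j <= i), -Infinity where it is blocked (j > i).
function squareSubsequentMask(seqLen: number): number[][] {
  const mask: number[][] = [];
  for (let i = 0; i < seqLen; i++) {
    const row: number[] = [];
    for (let j = 0; j < seqLen; j++) {
      row.push(j <= i ? 0 : -Infinity);
    }
    mask.push(row);
  }
  return mask;
}

const m = squareSubsequentMask(4);
// Row 0 can attend only to position 0; the last row can attend everywhere.
```

This additive form matches the -inf convention in the notes above: the mask is added to the raw attention scores before the softmax.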