torch.nn.LayerNorm
new LayerNorm(normalized_shape: number | number[], options?: LayerNormOptions)
Properties:
- normalized_shape (number[]) - readonly
- eps (number) - readonly
- elementwise_affine (boolean) - readonly
- weight (Parameter)
- bias (Parameter)
Layer Normalization: normalizes features across the feature dimension for each sample.
Applies normalization independently to each sample in a batch, normalizing across the last
normalized_shape dimensions. Essential for:
- Transformer architectures (applied in every layer)
- Training stability in deep networks
- Reducing internal covariate shift
- Enabling higher learning rates
- Any architecture where input distribution varies across features
Unlike BatchNorm which normalizes across the batch dimension, LayerNorm normalizes across the feature dimension independently for each sample. This makes it ideal for Transformers where batch size may vary, and it's invariant to batch composition. The key insight: normalize based on statistics of the features within each sample, not across the batch.
When to use LayerNorm:
- Transformer models (BERT, GPT, T5, etc.) - this is the standard
- Recurrent networks (RNNs, LSTMs) - more stable than BatchNorm
- Fully connected networks with varying batch sizes
- Tasks where batch statistics are unreliable (small batches, online learning)
- When layer needs to be invariant to batch size and composition
- Distributed training where global batch statistics are expensive to compute
Trade-offs:
- vs BatchNorm: LayerNorm uses feature statistics (not batch); works better with small/variable batch sizes
- vs BatchNorm: BatchNorm trains faster but fails with batch size 1; LayerNorm always works
- vs GroupNorm: LayerNorm normalizes all features; GroupNorm divides into groups
- Computational cost: Same as BatchNorm (linear in feature dimension)
- Expressive power: Learnable weight (γ) and bias (β) allow flexible scaling/shifting
- No momentum: LayerNorm has no running statistics, no state to maintain
Algorithm: For input tensor with shape [..., normalized_shape]:
- Compute mean μ across the last len(normalized_shape) dimensions for each sample
- Compute variance σ² across the same dimensions
- Normalize: x_norm = (x - μ) / √(σ² + eps)
- Apply affine transform: y = γ * x_norm + β
Where γ (weight) and β (bias) are learnable parameters, initialized to 1 and 0. The eps parameter (default 1e-5) prevents division by zero with small variances.
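The steps above can be sketched in plain TypeScript on a 1-D feature vector (a minimal sketch using number arrays rather than the library's Tensor type, purely to illustrate the math):

```typescript
// Minimal LayerNorm over a single 1-D feature vector (illustrative only).
function layerNorm1d(
  x: number[],
  gamma: number[], // weight, initialized to 1s
  beta: number[],  // bias, initialized to 0s
  eps = 1e-5,
): number[] {
  const n = x.length;
  const mean = x.reduce((s, v) => s + v, 0) / n;
  // Biased (population) variance over the feature dimension
  const variance = x.reduce((s, v) => s + (v - mean) ** 2, 0) / n;
  const invStd = 1 / Math.sqrt(variance + eps);
  return x.map((v, i) => gamma[i] * (v - mean) * invStd + beta[i]);
}

// With identity affine params (γ=1, β=0) the output has mean ≈ 0, variance ≈ 1:
const out = layerNorm1d([1, 2, 3, 4], [1, 1, 1, 1], [0, 0, 0, 0]);
```

Note the variance is the biased estimate (divide by n, not n-1), matching how LayerNorm is defined.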
- Transformer standard: LayerNorm is used in every Transformer architecture (BERT, GPT, T5, etc.)
- Feature statistics: Normalizes based on features within each sample, not batch statistics
- Batch size invariant: Works correctly with batch_size=1, unlike BatchNorm
- No training/eval mode: LayerNorm behavior is the same in train() and eval() modes
- No running statistics: Unlike BatchNorm, LayerNorm has no moving average to track
- Initialization: Weight initialized to 1, bias to 0 for identity transformation initially
- Learnable parameters: γ and β are learnable and crucial for expressiveness
- Numerical stability: eps parameter crucial when variance is very small
- Pre-norm vs Post-norm: LayerNorm position (before or after attention/FFN) affects training dynamics
- Gradient flow: Symmetric around mean, good for gradient flow through deep networks
- Modern preference: Pre-norm LayerNorm (normalizing input before transformation) is now more common
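The pre-norm vs post-norm distinction comes down to where normalization sits relative to the residual connection. A sketch with plain functions (norm and sublayer here are generic stand-ins on number arrays, not the library API):

```typescript
type Vec = number[];
const add = (a: Vec, b: Vec): Vec => a.map((v, i) => v + b[i]);

// Post-norm (original Transformer): normalize after the residual sum.
function postNorm(x: Vec, sublayer: (v: Vec) => Vec, norm: (v: Vec) => Vec): Vec {
  return norm(add(x, sublayer(x)));
}

// Pre-norm (modern preference): normalize the sublayer's input, leaving an
// un-normalized residual path, which tends to give better gradient flow.
function preNorm(x: Vec, sublayer: (v: Vec) => Vec, norm: (v: Vec) => Vec): Vec {
  return add(x, sublayer(norm(x)));
}
```

Same three ingredients (residual, sublayer, LayerNorm); only the composition order differs.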
- Features in normalized_shape are normalized together (not independently)
- Currently only supports 1D normalized_shape (single trailing dimension)
- Input's last dimension must match normalized_shape
- eps too small can cause NaNs; eps too large reduces normalization effectiveness
- Weight and bias parameters are shared across the batch; only features are independent
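The eps caveat can be seen directly: with constant features the variance is exactly zero, and only eps keeps the division finite (plain-array sketch of the normalization step, not the library API):

```typescript
// Normalize a vector with explicit eps; a constant vector has variance 0.
function normalize(x: number[], eps: number): number[] {
  const n = x.length;
  const mean = x.reduce((s, v) => s + v, 0) / n;
  const variance = x.reduce((s, v) => s + (v - mean) ** 2, 0) / n;
  return x.map((v) => (v - mean) / Math.sqrt(variance + eps));
}

const constant = [3, 3, 3, 3];
const withEps = normalize(constant, 1e-5); // all zeros: (v - mean) = 0
const withoutEps = normalize(constant, 0); // 0 / 0 => NaN for every element
```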
Examples
// Simple FC layer with LayerNorm
const layernorm = new torch.nn.LayerNorm(512); // Normalize 512-dim features
const x = torch.randn([16, 512]); // Batch of 16 samples, 512 features
const normalized = layernorm.forward(x); // Each sample normalized independently
// Output shape: [16, 512], same as input

// Transformer encoder layer: standard use case
class TransformerEncoderBlock extends torch.nn.Module {
attention: torch.nn.MultiheadAttention;
ln1: torch.nn.LayerNorm;
ff_linear1: torch.nn.Linear;
ff_linear2: torch.nn.Linear;
ln2: torch.nn.LayerNorm;
constructor(d_model: number = 512, num_heads: number = 8) {
super();
this.attention = new torch.nn.MultiheadAttention(d_model, num_heads);
this.ln1 = new torch.nn.LayerNorm(d_model);
this.ff_linear1 = new torch.nn.Linear(d_model, d_model * 4);
this.ff_linear2 = new torch.nn.Linear(d_model * 4, d_model);
this.ln2 = new torch.nn.LayerNorm(d_model);
}
forward(x: torch.Tensor): torch.Tensor {
// Multi-head attention + residual + layer norm (post-norm architecture)
let attn_out = this.attention.forward(x, x, x);
x = x.add(attn_out); // Residual connection
x = this.ln1.forward(x); // Layer norm
// Feed-forward network + residual + layer norm
let ff = torch.relu(this.ff_linear1.forward(x));
let ff_out = this.ff_linear2.forward(ff);
x = x.add(ff_out); // Residual connection
x = this.ln2.forward(x); // Layer norm
return x;
}
}
// Usage in Transformer
const model = new TransformerEncoderBlock(512, 8);
const batch_size = 4, seq_len = 128;
const seq = torch.randn([batch_size, seq_len, 512]); // [B, T, d_model]
const output = model.forward(seq); // [B, T, 512]
// LayerNorm stabilizes training and allows higher learning rates

// RNN with LayerNorm: better stability than BatchNorm
class RNNCell extends torch.nn.Module {
input_proj: torch.nn.Linear;
hidden_proj: torch.nn.Linear;
layernorm: torch.nn.LayerNorm;
constructor(input_size: number, hidden_size: number) {
super();
this.input_proj = new torch.nn.Linear(input_size, hidden_size);
this.hidden_proj = new torch.nn.Linear(hidden_size, hidden_size);
this.layernorm = new torch.nn.LayerNorm(hidden_size);
}
forward(x: torch.Tensor, h_prev: torch.Tensor): torch.Tensor {
const input_part = this.input_proj.forward(x);
const hidden_part = this.hidden_proj.forward(h_prev);
const combined = input_part.add(hidden_part);
const normalized = this.layernorm.forward(combined); // Stabilize hidden state
const h_new = torch.tanh(normalized);
return h_new;
}
}
// LayerNorm in RNN prevents hidden state explosion/vanishing

// BERT-like model: multi-layer Transformer
class TransformerBlock extends torch.nn.Module {
embed: torch.nn.Embedding;
layers: torch.nn.LayerNorm[]; // One LayerNorm per transformer block
// ... attention and FFN modules ...
constructor(vocab_size: number, d_model: number, num_layers: number) {
super();
this.embed = new torch.nn.Embedding(vocab_size, d_model);
this.layers = [];
for (let i = 0; i < num_layers; i++) {
this.layers.push(new torch.nn.LayerNorm(d_model));
}
}
}
// BERT uses LayerNorm after every transformer layer for stability

// Multi-dimensional input: normalizes only last dimension
const x = torch.randn([32, 100, 512]); // [batch=32, seq=100, features=512]
const ln = new torch.nn.LayerNorm(512); // Normalize 512-dim features
const normalized = ln.forward(x); // [32, 100, 512]
// Each of 32*100=3200 samples (seq positions) gets normalized independently
// across the 512 features, NOT across the batch or sequence
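This per-position independence can be checked numerically with plain arrays (a sketch of what the library computes internally, not its API): two rows with very different scales both come out with mean ≈ 0 and variance ≈ 1, using only their own statistics.

```typescript
// Normalize each row independently, as LayerNorm does per sample/position.
function rowLayerNorm(rows: number[][], eps = 1e-5): number[][] {
  return rows.map((row) => {
    const n = row.length;
    const mean = row.reduce((s, v) => s + v, 0) / n;
    const variance = row.reduce((s, v) => s + (v - mean) ** 2, 0) / n;
    const invStd = 1 / Math.sqrt(variance + eps);
    return row.map((v) => (v - mean) * invStd);
  });
}

// Two "samples" at different scales; each is normalized using only its own stats,
// so both rows map to (approximately) the same normalized values.
const normalizedRows = rowLayerNorm([
  [1, 2, 3, 4],
  [100, 200, 300, 400],
]);
```

With BatchNorm the second row's large values would distort the first row's normalization; with LayerNorm they cannot, because no statistics cross sample boundaries.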