torch.nn.LayerNorm
new LayerNorm(normalized_shape: number | number[], options?: LayerNormOptions)
Properties:
- normalized_shape (number[]) - readonly
- eps (number) - readonly
- elementwise_affine (boolean) - readonly
- weight (Parameter)
- bias (Parameter)
Layer Normalization: normalizes features across the feature dimension for each sample.
Applies normalization independently to each sample in a batch, normalizing across the last
normalized_shape dimensions. Essential for:
- Transformer architectures (applied in every layer)
- Training stability in deep networks
- Reducing internal covariate shift
- Enabling higher learning rates
- Any architecture where input distribution varies across features
Unlike BatchNorm which normalizes across the batch dimension, LayerNorm normalizes across the feature dimension independently for each sample. This makes it ideal for Transformers where batch size may vary, and it's invariant to batch composition. The key insight: normalize based on statistics of the features within each sample, not across the batch.
When to use LayerNorm:
- Transformer models (BERT, GPT, T5, etc.) - this is the standard
- Recurrent networks (RNNs, LSTMs) - more stable than BatchNorm
- Fully connected networks with varying batch sizes
- Tasks where batch statistics are unreliable (small batches, online learning)
- When layer needs to be invariant to batch size and composition
- Distributed training where global batch statistics are expensive to compute
Trade-offs:
- vs BatchNorm: LayerNorm uses feature statistics (not batch); works better with small/variable batch sizes
- vs BatchNorm: BatchNorm trains faster but fails with batch size 1; LayerNorm always works
- vs GroupNorm: LayerNorm normalizes all features; GroupNorm divides into groups
- Computational cost: Same as BatchNorm (linear in feature dimension)
- Expressive power: Learnable weight (γ) and bias (β) allow flexible scaling/shifting
- No momentum: LayerNorm has no running statistics, no state to maintain
Algorithm: For input tensor with shape [..., normalized_shape]:
- Compute mean μ across the last len(normalized_shape) dimensions for each sample
- Compute variance σ² across the same dimensions
- Normalize: x_norm = (x - μ) / √(σ² + eps)
- Apply affine transform: y = γ * x_norm + β
Where γ (weight) and β (bias) are learnable parameters, initialized to 1 and 0. The eps parameter (default 1e-5) prevents division by zero with small variances.
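The steps above can be sketched in plain TypeScript on a 1-D feature vector (a minimal sketch using number arrays rather than the library's Tensor type, purely to illustrate the math):

```typescript
// Minimal LayerNorm over a single 1-D feature vector (illustrative only).
function layerNorm1d(
  x: number[],
  gamma: number[], // weight, initialized to 1s
  beta: number[],  // bias, initialized to 0s
  eps = 1e-5,
): number[] {
  const n = x.length;
  const mean = x.reduce((s, v) => s + v, 0) / n;
  // Biased (population) variance over the feature dimension
  const variance = x.reduce((s, v) => s + (v - mean) ** 2, 0) / n;
  const invStd = 1 / Math.sqrt(variance + eps);
  return x.map((v, i) => gamma[i] * (v - mean) * invStd + beta[i]);
}

// With identity affine params (γ=1, β=0) the output has mean ≈ 0, variance ≈ 1:
const out = layerNorm1d([1, 2, 3, 4], [1, 1, 1, 1], [0, 0, 0, 0]);
```

Note the variance is the biased estimate (divide by n, not n-1), matching how LayerNorm is defined.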
- Transformer standard: LayerNorm is used in every Transformer architecture (BERT, GPT, T5, etc.)
- Feature statistics: Normalizes based on features within each sample, not batch statistics
- Batch size invariant: Works correctly with batch_size=1, unlike BatchNorm
- No training/eval mode: LayerNorm behavior is the same in train() and eval() modes
- No running statistics: Unlike BatchNorm, LayerNorm has no moving average to track
- Initialization: Weight initialized to 1, bias to 0 for identity transformation initially
- Learnable parameters: γ and β are learnable and crucial for expressiveness
- Numerical stability: eps parameter crucial when variance is very small
- Pre-norm vs Post-norm: LayerNorm position (before or after attention/FFN) affects training dynamics
- Gradient flow: Symmetric around mean, good for gradient flow through deep networks
- Modern preference: Pre-norm LayerNorm (normalizing input before transformation) is now more common
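The pre-norm vs post-norm distinction comes down to where normalization sits relative to the residual connection. A sketch with plain functions (norm and sublayer here are generic stand-ins on number arrays, not the library API):

```typescript
type Vec = number[];
const add = (a: Vec, b: Vec): Vec => a.map((v, i) => v + b[i]);

// Post-norm (original Transformer): normalize after the residual sum.
function postNorm(x: Vec, sublayer: (v: Vec) => Vec, norm: (v: Vec) => Vec): Vec {
  return norm(add(x, sublayer(x)));
}

// Pre-norm (modern preference): normalize the sublayer's input, leaving an
// un-normalized residual path, which tends to give better gradient flow.
function preNorm(x: Vec, sublayer: (v: Vec) => Vec, norm: (v: Vec) => Vec): Vec {
  return add(x, sublayer(norm(x)));
}
```

Same three ingredients (residual, sublayer, LayerNorm); only the composition order differs.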
- Features in normalized_shape are normalized together (not independently)
- Currently only supports 1D normalized_shape (single trailing dimension)
- Input's last dimension must match normalized_shape
- eps too small can cause NaNs; eps too large reduces normalization effectiveness
- Weight and bias parameters are shared across the batch; only features are independent
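The eps caveat can be seen directly: with constant features the variance is exactly zero, and only eps keeps the division finite (plain-array sketch of the normalization step, not the library API):

```typescript
// Normalize a vector with explicit eps; a constant vector has variance 0.
function normalize(x: number[], eps: number): number[] {
  const n = x.length;
  const mean = x.reduce((s, v) => s + v, 0) / n;
  const variance = x.reduce((s, v) => s + (v - mean) ** 2, 0) / n;
  return x.map((v) => (v - mean) / Math.sqrt(variance + eps));
}

const constant = [3, 3, 3, 3];
const withEps = normalize(constant, 1e-5); // all zeros: (v - mean) = 0
const withoutEps = normalize(constant, 0); // 0 / 0 => NaN for every element
```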
Examples
// Simple FC layer with LayerNorm
const layernorm = new torch.nn.LayerNorm(512); // Normalize 512-dim features
const x = torch.randn([16, 512]); // Batch of 16 samples, 512 features
const normalized = layernorm.forward(x); // Each sample normalized independently
// Output shape: [16, 512], same as input

// Transformer encoder layer: standard use case
class TransformerEncoderBlock extends torch.nn.Module {
attention: torch.nn.MultiheadAttention;
ln1: torch.nn.LayerNorm;
ff_linear1: torch.nn.Linear;
ff_linear2: torch.nn.Linear;
ln2: torch.nn.LayerNorm;
constructor(d_model: number = 512, num_heads: number = 8) {
super();
this.attention = new torch.nn.MultiheadAttention(d_model, num_heads);
this.ln1 = new torch.nn.LayerNorm(d_model);
this.ff_linear1 = new torch.nn.Linear(d_model, d_model * 4);
this.ff_linear2 = new torch.nn.Linear(d_model * 4, d_model);
this.ln2 = new torch.nn.LayerNorm(d_model);
}
forward(x: torch.Tensor): torch.Tensor {
// Multi-head attention + residual + layer norm (post-norm architecture)
let attn_out = this.attention.forward(x, x, x);
x = x.add(attn_out); // Residual connection
x = this.ln1.forward(x); // Layer norm
// Feed-forward network + residual + layer norm
let ff = torch.relu(this.ff_linear1.forward(x));
let ff_out = this.ff_linear2.forward(ff);
x = x.add(ff_out); // Residual connection
x = this.ln2.forward(x); // Layer norm
return x;
}
}
// Usage in Transformer
const model = new TransformerEncoderBlock(512, 8);
const batch_size = 4, seq_len = 128;
const seq = torch.randn([batch_size, seq_len, 512]); // [B, T, d_model]
const output = model.forward(seq); // [B, T, 512]
// LayerNorm stabilizes training and allows higher learning rates

// RNN with LayerNorm: better stability than BatchNorm
class RNNCell extends torch.nn.Module {
input_proj: torch.nn.Linear;
hidden_proj: torch.nn.Linear;
layernorm: torch.nn.LayerNorm;
constructor(input_size: number, hidden_size: number) {
super();
this.input_proj = new torch.nn.Linear(input_size, hidden_size);
this.hidden_proj = new torch.nn.Linear(hidden_size, hidden_size);
this.layernorm = new torch.nn.LayerNorm(hidden_size);
}
forward(x: torch.Tensor, h_prev: torch.Tensor): torch.Tensor {
const input_part = this.input_proj.forward(x);
const hidden_part = this.hidden_proj.forward(h_prev);
const combined = input_part.add(hidden_part);
const normalized = this.layernorm.forward(combined); // Stabilize hidden state
const h_new = torch.tanh(normalized);
return h_new;
}
}
// LayerNorm in RNN prevents hidden state explosion/vanishing

// BERT-like model: multi-layer Transformer
class TransformerBlock extends torch.nn.Module {
embed: torch.nn.Embedding;
layers: torch.nn.LayerNorm[]; // One LayerNorm per transformer block
// ... attention and FFN modules ...
constructor(vocab_size: number, d_model: number, num_layers: number) {
super();
this.embed = new torch.nn.Embedding(vocab_size, d_model);
this.layers = [];
for (let i = 0; i < num_layers; i++) {
this.layers.push(new torch.nn.LayerNorm(d_model));
}
}
}
// BERT uses LayerNorm after every transformer layer for stability

// Multi-dimensional input: normalizes only last dimension
const x = torch.randn([32, 100, 512]); // [batch=32, seq=100, features=512]
const ln = new torch.nn.LayerNorm(512); // Normalize 512-dim features
const normalized = ln.forward(x); // [32, 100, 512]
// Each of 32*100=3200 samples (seq positions) gets normalized independently
// across the 512 features, NOT across the batch or sequence
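This per-position independence can be checked numerically with plain arrays (a sketch of what the library computes internally, not its API): two rows with very different scales both come out with mean ≈ 0 and variance ≈ 1, using only their own statistics.

```typescript
// Normalize each row independently, as LayerNorm does per sample/position.
function rowLayerNorm(rows: number[][], eps = 1e-5): number[][] {
  return rows.map((row) => {
    const n = row.length;
    const mean = row.reduce((s, v) => s + v, 0) / n;
    const variance = row.reduce((s, v) => s + (v - mean) ** 2, 0) / n;
    const invStd = 1 / Math.sqrt(variance + eps);
    return row.map((v) => (v - mean) * invStd);
  });
}

// Two "samples" at different scales; each is normalized using only its own stats,
// so both rows map to (approximately) the same normalized values.
const normalizedRows = rowLayerNorm([
  [1, 2, 3, 4],
  [100, 200, 300, 400],
]);
```

With BatchNorm the second row's large values would distort the first row's normalization; with LayerNorm they cannot, because no statistics cross sample boundaries.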