torch.nn.GroupNorm
new GroupNorm(num_groups: number, num_channels: number, options?: GroupNormOptions)
num_groups (number) - readonly
num_channels (number) - readonly
eps (number) - readonly
affine (boolean) - readonly
weight (Parameter | null)
bias (Parameter | null)
Group Normalization: normalizes features in groups within each sample independently.
Divides channels into groups and normalizes within each group independently for each sample: a hybrid between LayerNorm (which normalizes all features together) and InstanceNorm (which normalizes each channel separately). Essential for:
- Object detection (e.g., Faster R-CNN and Mask R-CNN variants commonly use GroupNorm)
- Semantic segmentation and dense prediction tasks
- Small batch sizes where BatchNorm is unreliable
- Video models and 3D convolutions (batch norm fails with small temporal batches)
- Instance-level normalization while maintaining channel correlation
- Tasks where neither LayerNorm nor BatchNorm is ideal
The Problem GroupNorm Solves: BatchNorm requires large batches to compute reliable statistics; LayerNorm treats all channels the same. GroupNorm divides channels into groups and normalizes within each group independently per sample. This works well for small batches (even batch_size=1) while still letting channels within a group share normalization statistics.
When to use GroupNorm:
- Small batch sizes (including batch_size=1 for inference)
- Object detection and dense prediction tasks
- 3D/video models where temporal batch size is small
- Tasks where batch composition matters (don't want batch statistics affecting results)
- When you want instance-level normalization but still want channels within a group to share statistics
- As substitute for BatchNorm in training code that needs to work with batch_size=1
- Modern CNNs (ResNets with GroupNorm match or outperform BatchNorm at small batch sizes)
Trade-offs:
- vs BatchNorm: Works with small batches; no train/eval behavior difference, no running statistics to maintain
- vs LayerNorm: Groups preserve channel correlations better than normalizing all channels together
- vs InstanceNorm: Shares information within groups (InstanceNorm normalizes each channel independently)
- Group hyperparameter: Tuning num_groups affects performance; 32 groups is often good
- Number of groups: Must divide num_channels evenly (num_channels % num_groups == 0)
Algorithm: For input [batch, channels, ...] with num_groups:
- Reshape to [batch, num_groups, channels_per_group, ...]
- For each group independently: compute mean and variance across the spatial dimensions and the channels in that group
- Normalize within group: x_norm = (x - μ_group) / √(σ_group² + eps)
- Apply learned affine: y = γ * x_norm + β (per channel, shared across groups)
Unlike BatchNorm which uses batch statistics, GroupNorm uses only sample-level group statistics. Unlike InstanceNorm which normalizes each channel, GroupNorm normalizes groups of channels.
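The algorithm above can be sketched in plain TypeScript (illustrative only; `groupNorm`, its signature, and the nested-array layout are assumptions for this example, not the library API):

```typescript
// Per-sample group normalization on a [batch][channels][spatial] array.
// gamma/beta are the optional per-channel affine parameters.
function groupNorm(
  x: number[][][],
  numGroups: number,
  eps = 1e-5,
  gamma?: number[],
  beta?: number[]
): number[][][] {
  const channels = x[0].length;
  if (channels % numGroups !== 0) {
    throw new Error("num_channels must be divisible by num_groups");
  }
  const perGroup = channels / numGroups;
  return x.map(sample =>
    sample.map((row, c) => {
      const g = Math.floor(c / perGroup);
      // All values in this channel's group for this sample
      const vals = sample.slice(g * perGroup, (g + 1) * perGroup).flat();
      const mean = vals.reduce((a, b) => a + b, 0) / vals.length;
      const variance = vals.reduce((a, b) => a + (b - mean) ** 2, 0) / vals.length;
      const scale = gamma ? gamma[c] : 1;
      const shift = beta ? beta[c] : 0;
      return row.map(v => scale * ((v - mean) / Math.sqrt(variance + eps)) + shift);
    })
  );
}
```

After normalization, each group's values within a sample have mean ~0 and variance ~1 (before the affine transform).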
- Group divisibility: num_channels must be divisible by num_groups (enforced at construction)
- Batch size agnostic: Works with any batch size, including batch_size=1
- No train/eval mode: GroupNorm behavior is identical in train() and eval() modes
- No running statistics: Unlike BatchNorm, no moving average to maintain
- Per-channel parameters: γ and β are per-channel (shared across groups)
- Spatial normalization: Normalizes across all spatial dimensions + channels in group
- num_groups=32: Often optimal for ImageNet-scale tasks on ResNets
- Detection standard: Widely used in modern object detection (e.g., Faster R-CNN and Mask R-CNN heads)
- Gradient behavior: Good gradient flow through normalization operation
- Practical choice: GroupNorm often outperforms BatchNorm on small batch tasks
- num_channels must be divisible by num_groups (will throw error otherwise)
- Each group must have at least one channel (num_groups <= num_channels)
- Input must be at least 2D (batch + channels)
- eps too small can cause NaNs; eps too large reduces normalization effectiveness
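Because of the divisibility constraint, it can be useful to enumerate the legal num_groups choices for a channel count up front. A small helper (hypothetical, not part of the API):

```typescript
// List every num_groups value that evenly divides the channel count,
// i.e. every value GroupNorm's constructor would accept.
function validGroupCounts(numChannels: number): number[] {
  const counts: number[] = [];
  for (let g = 1; g <= numChannels; g++) {
    if (numChannels % g === 0) counts.push(g);
  }
  return counts;
}

validGroupCounts(64); // [1, 2, 4, 8, 16, 32, 64]
```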
Examples
// ResNet block with GroupNorm instead of BatchNorm
const gn = new torch.nn.GroupNorm(32, 128); // 32 groups, 128 channels
const x = torch.randn([4, 128, 56, 56]); // [batch=4, channels=128, height=56, width=56]
const normalized = gn.forward(x); // Same shape
// 128 channels divided into 32 groups (4 channels per group)
// Each group normalized independently within each image

// Object detection with small batch size
class ObjectDetectionBackbone extends torch.nn.Module {
conv1: torch.nn.Conv2d;
gn1: torch.nn.GroupNorm;
conv2: torch.nn.Conv2d;
gn2: torch.nn.GroupNorm;
constructor() {
super();
this.conv1 = new torch.nn.Conv2d(3, 64, 7, { stride: 2, padding: 3 });
this.gn1 = new torch.nn.GroupNorm(32, 64); // 32 groups for 64 channels
this.conv2 = new torch.nn.Conv2d(64, 128, 3, { stride: 2, padding: 1 });
this.gn2 = new torch.nn.GroupNorm(32, 128); // 32 groups for 128 channels
}
forward(x: torch.Tensor): torch.Tensor {
x = torch.relu(this.gn1.forward(this.conv1.forward(x)));
x = torch.relu(this.gn2.forward(this.conv2.forward(x)));
return x;
}
}
// Can use batch_size=1 at inference - GroupNorm works perfectly
const backbone = new ObjectDetectionBackbone();
const image = torch.randn([1, 3, 512, 512]); // Single image
const features = backbone.forward(image); // Works fine! BatchNorm would fail

// Video model: small temporal batch
const video_frames = torch.randn([2, 3, 16, 112, 112]); // [batch=2, channels=3, frames=16, H, W]
// Normalizing the output of a 3D convolution with 64 channels
const gn = new torch.nn.GroupNorm(32, 64); // 64 channels in 32 groups (64 % 32 === 0)
const conv_out = torch.randn([2, 64, 8, 56, 56]);
const normalized = gn.forward(conv_out);
// Works even with batch_size=2 and a temporal dimension

// Exploring the num_groups parameter
const x = torch.randn([8, 64, 32, 32]); // [batch=8, channels=64, H=32, W=32]
// Different group configurations
const gn1 = new torch.nn.GroupNorm(1, 64); // Single group - equivalent to LayerNorm
const gn32 = new torch.nn.GroupNorm(32, 64); // 32 groups - common starting point
const gn64 = new torch.nn.GroupNorm(64, 64); // 64 groups - equivalent to InstanceNorm
// Typical choice: 32 groups works well for most architectures
// Smaller num_groups: more channel interaction within groups
// Larger num_groups: less channel interaction (extreme: InstanceNorm)
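The two extremes can be checked numerically with a plain-array sketch (illustrative; `normalizeGroups` is a hypothetical helper, not the library API):

```typescript
// Group normalization of one sample laid out as [channels][spatial], no affine.
function normalizeGroups(sample: number[][], numGroups: number, eps = 1e-5): number[][] {
  const perGroup = sample.length / numGroups;
  return sample.map((row, c) => {
    const g = Math.floor(c / perGroup);
    const vals = sample.slice(g * perGroup, (g + 1) * perGroup).flat();
    const mean = vals.reduce((a, b) => a + b, 0) / vals.length;
    const variance = vals.reduce((a, b) => a + (b - mean) ** 2, 0) / vals.length;
    return row.map(v => (v - mean) / Math.sqrt(variance + eps));
  });
}

const sample = [[1, 1, 1, 1], [9, 9, 9, 9]]; // two constant channels
// One group (LayerNorm-like): statistics mix both channels, so the
// channels keep distinct values (~ -1 and ~ +1)
const oneGroup = normalizeGroups(sample, 1);
// numGroups === channels (InstanceNorm-like): each constant channel is
// standardized on its own statistics and collapses to ~0
const perChannel = normalizeGroups(sample, 2);
```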