torch.nn.GroupNorm
new GroupNorm(num_groups: number, num_channels: number, options?: GroupNormOptions)
num_groups (number) - readonly
num_channels (number) - readonly
eps (number) - readonly
affine (boolean) - readonly
weight (Parameter | null)
bias (Parameter | null)
Group Normalization: normalizes features in groups within each sample independently.
Divides channels into groups and normalizes within each group independently for each sample: a hybrid between LayerNorm (which normalizes all features together) and InstanceNorm (which normalizes each channel separately). Essential for:
- Object detection (e.g., Faster R-CNN and Mask R-CNN variants commonly use GroupNorm)
- Semantic segmentation and dense prediction tasks
- Small batch sizes where BatchNorm is unreliable
- Video models and 3D convolutions (batch norm fails with small temporal batches)
- Instance-level normalization while maintaining channel correlation
- Tasks where neither LayerNorm nor BatchNorm is ideal
The Problem GroupNorm Solves: BatchNorm requires large batches to compute reliable statistics; LayerNorm treats all channels the same. GroupNorm divides channels into groups and normalizes within each group independently per sample. This works well for small batches (even batch_size=1) while still letting channels within a group share normalization statistics.
When to use GroupNorm:
- Small batch sizes (including batch_size=1 for inference)
- Object detection and dense prediction tasks
- 3D/video models where temporal batch size is small
- Tasks where batch composition matters (don't want batch statistics affecting results)
- When you want instance-level normalization but still want channels within a group to share statistics
- As substitute for BatchNorm in training code that needs to work with batch_size=1
- Modern CNNs (ResNets with GroupNorm match or outperform BatchNorm at small batch sizes)
Trade-offs:
- vs BatchNorm: Works with small batches; no train/eval behavior difference, no running statistics to maintain
- vs LayerNorm: Groups preserve channel correlations better than normalizing all channels together
- vs InstanceNorm: Shares information within groups (InstanceNorm normalizes each channel independently)
- Group hyperparameter: Tuning num_groups affects performance; 32 groups is often good
- Number of groups: Must divide num_channels evenly (num_channels % num_groups == 0)
Algorithm: For input [batch, channels, ...] with num_groups:
- Reshape to [batch, num_groups, channels_per_group, ...]
- For each group independently: compute mean and variance across the spatial dimensions and the channels in that group
- Normalize within group: x_norm = (x - μ_group) / √(σ_group² + eps)
- Apply learned affine: y = γ * x_norm + β (per channel, shared across groups)
Unlike BatchNorm which uses batch statistics, GroupNorm uses only sample-level group statistics. Unlike InstanceNorm which normalizes each channel, GroupNorm normalizes groups of channels.
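The algorithm above can be sketched in plain TypeScript (illustrative only; `groupNorm`, its signature, and the nested-array layout are assumptions for this example, not the library API):

```typescript
// Per-sample group normalization on a [batch][channels][spatial] array.
// gamma/beta are the optional per-channel affine parameters.
function groupNorm(
  x: number[][][],
  numGroups: number,
  eps = 1e-5,
  gamma?: number[],
  beta?: number[]
): number[][][] {
  const channels = x[0].length;
  if (channels % numGroups !== 0) {
    throw new Error("num_channels must be divisible by num_groups");
  }
  const perGroup = channels / numGroups;
  return x.map(sample =>
    sample.map((row, c) => {
      const g = Math.floor(c / perGroup);
      // All values in this channel's group for this sample
      const vals = sample.slice(g * perGroup, (g + 1) * perGroup).flat();
      const mean = vals.reduce((a, b) => a + b, 0) / vals.length;
      const variance = vals.reduce((a, b) => a + (b - mean) ** 2, 0) / vals.length;
      const scale = gamma ? gamma[c] : 1;
      const shift = beta ? beta[c] : 0;
      return row.map(v => scale * ((v - mean) / Math.sqrt(variance + eps)) + shift);
    })
  );
}
```

After normalization, each group's values within a sample have mean ~0 and variance ~1 (before the affine transform).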
- Group divisibility: num_channels must be divisible by num_groups (enforced at construction)
- Batch size agnostic: Works with any batch size, including batch_size=1
- No train/eval mode: GroupNorm behavior is identical in train() and eval() modes
- No running statistics: Unlike BatchNorm, no moving average to maintain
- Per-channel parameters: γ and β are per-channel (shared across groups)
- Spatial normalization: Normalizes across all spatial dimensions + channels in group
- num_groups=32: Often optimal for ImageNet-scale tasks on ResNets
- Detection standard: Widely used in modern object detection (e.g., Faster R-CNN and Mask R-CNN heads)
- Gradient behavior: Good gradient flow through normalization operation
- Practical choice: GroupNorm often outperforms BatchNorm on small batch tasks
- num_channels must be divisible by num_groups (will throw error otherwise)
- Each group must have at least one channel (num_groups <= num_channels)
- Input must be at least 2D (batch + channels)
- eps too small can cause NaNs; eps too large reduces normalization effectiveness
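Because of the divisibility constraint, it can be useful to enumerate the legal num_groups choices for a channel count up front. A small helper (hypothetical, not part of the API):

```typescript
// List every num_groups value that evenly divides the channel count,
// i.e. every value GroupNorm's constructor would accept.
function validGroupCounts(numChannels: number): number[] {
  const counts: number[] = [];
  for (let g = 1; g <= numChannels; g++) {
    if (numChannels % g === 0) counts.push(g);
  }
  return counts;
}

validGroupCounts(64); // [1, 2, 4, 8, 16, 32, 64]
```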
Examples
// ResNet block with GroupNorm instead of BatchNorm
const gn = new torch.nn.GroupNorm(32, 128); // 32 groups, 128 channels
const x = torch.randn([4, 128, 56, 56]); // [batch=4, channels=128, height=56, width=56]
const normalized = gn.forward(x); // Same shape
// 128 channels divided into 32 groups (4 channels per group)
// Each group normalized independently within each image

// Object detection with small batch size
class ObjectDetectionBackbone extends torch.nn.Module {
conv1: torch.nn.Conv2d;
gn1: torch.nn.GroupNorm;
conv2: torch.nn.Conv2d;
gn2: torch.nn.GroupNorm;
constructor() {
super();
this.conv1 = new torch.nn.Conv2d(3, 64, 7, { stride: 2, padding: 3 });
this.gn1 = new torch.nn.GroupNorm(32, 64); // 32 groups for 64 channels
this.conv2 = new torch.nn.Conv2d(64, 128, 3, { stride: 2, padding: 1 });
this.gn2 = new torch.nn.GroupNorm(32, 128); // 32 groups for 128 channels
}
forward(x: torch.Tensor): torch.Tensor {
x = torch.relu(this.gn1.forward(this.conv1.forward(x)));
x = torch.relu(this.gn2.forward(this.conv2.forward(x)));
return x;
}
}
// Can use batch_size=1 at inference - GroupNorm works perfectly
const backbone = new ObjectDetectionBackbone();
const image = torch.randn([1, 3, 512, 512]); // Single image
const features = backbone.forward(image); // Works fine! BatchNorm would fail

// Video model: small temporal batch
const video_frames = torch.randn([2, 3, 16, 112, 112]); // [batch=2, channels=3, frames=16, H, W]
// Normalizing the output of a 3D convolution with 64 channels
const gn = new torch.nn.GroupNorm(32, 64); // 64 channels in 32 groups (64 % 32 === 0)
const conv_out = torch.randn([2, 64, 8, 56, 56]);
const normalized = gn.forward(conv_out);
// Works even with batch_size=2 and a temporal dimension

// Exploring the num_groups parameter
const x = torch.randn([8, 64, 32, 32]); // [batch=8, channels=64, H=32, W=32]
// Different group configurations
const gn1 = new torch.nn.GroupNorm(1, 64); // Single group - equivalent to LayerNorm
const gn32 = new torch.nn.GroupNorm(32, 64); // 32 groups - common starting point
const gn64 = new torch.nn.GroupNorm(64, 64); // 64 groups - equivalent to InstanceNorm
// Typical choice: 32 groups works well for most architectures
// Smaller num_groups: more channel interaction within groups
// Larger num_groups: less channel interaction (extreme: InstanceNorm)
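The two extremes can be checked numerically with a plain-array sketch (illustrative; `normalizeGroups` is a hypothetical helper, not the library API):

```typescript
// Group normalization of one sample laid out as [channels][spatial], no affine.
function normalizeGroups(sample: number[][], numGroups: number, eps = 1e-5): number[][] {
  const perGroup = sample.length / numGroups;
  return sample.map((row, c) => {
    const g = Math.floor(c / perGroup);
    const vals = sample.slice(g * perGroup, (g + 1) * perGroup).flat();
    const mean = vals.reduce((a, b) => a + b, 0) / vals.length;
    const variance = vals.reduce((a, b) => a + (b - mean) ** 2, 0) / vals.length;
    return row.map(v => (v - mean) / Math.sqrt(variance + eps));
  });
}

const sample = [[1, 1, 1, 1], [9, 9, 9, 9]]; // two constant channels
// One group (LayerNorm-like): statistics mix both channels, so the
// channels keep distinct values (~ -1 and ~ +1)
const oneGroup = normalizeGroups(sample, 1);
// numGroups === channels (InstanceNorm-like): each constant channel is
// standardized on its own statistics and collapses to ~0
const perChannel = normalizeGroups(sample, 2);
```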