torch.nn.SyncBatchNorm

class SyncBatchNorm extends _BatchNorm

Synchronized Batch Normalization: batch norm with synchronization across devices/GPUs.

Extends BatchNorm to synchronize statistics across multiple GPUs/devices during distributed training. In single-device mode, behaves identically to regular BatchNorm. Essential for:

  • Multi-GPU training to compute batch statistics across all devices
  • Ensuring consistent normalization in distributed settings
  • Preventing batch size inconsistency issues across devices
  • Training with larger effective batch sizes in distributed scenarios
  • Fine-grained control over batch norm synchronization in multi-device setups

The Problem SyncBatchNorm Solves: In distributed training, each device computes BatchNorm statistics independently, leading to different normalizations on different GPUs. SyncBatchNorm synchronizes these statistics across all devices, effectively treating the batch as if all samples were on a single device. This provides better training stability and more consistent results.
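To make the difference concrete, here is a small TypeScript sketch using plain numbers (not torch.js tensors) to contrast per-device statistics with the synchronized global statistic; the device splits and values are hypothetical:

// Illustrative sketch: one channel's activations split across two devices.
const device0 = [1.0, 2.0, 3.0, 4.0];      // local batch on GPU 0
const device1 = [10.0, 11.0, 12.0, 13.0];  // local batch on GPU 1

const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;

// Regular BatchNorm: each device normalizes with its own mean.
const localMean0 = mean(device0);                  // 2.5
const localMean1 = mean(device1);                  // 11.5

// SyncBatchNorm: the mean is all-reduced, so both devices normalize
// with the same statistic, as if the batch lived on one device.
const globalMean = mean([...device0, ...device1]); // 7.0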

When to use SyncBatchNorm:

  • Multi-GPU training (data parallel, distributed data parallel)
  • When you want global batch statistics across all devices
  • Large-scale training where consistency across devices matters
  • Fine-tuning pretrained models trained with SyncBatchNorm
  • Research/production code that needs to work identically on 1 or N GPUs

Trade-offs:

  • vs regular BatchNorm: Adds communication overhead for synchronization
  • Single device: Equivalent to regular BatchNorm (no overhead, no syncing needed)
  • Multi-device: Slower but provides consistent statistics across all devices
  • Effective batch size: Using sync=true makes the effective batch size for normalization statistics equal to device_count * local_batch_size (e.g., 4 devices × 32 samples each yields statistics over 128 samples)
  • Memory: Requires extra buffers for synchronization (small overhead)

Algorithm: Same as BatchNorm, but with synchronization:

  1. Compute mean and variance locally on each device
  2. Synchronize: All-reduce mean and variance across devices
  3. Normalize and apply affine transform using synchronized statistics
  4. Update running mean/variance using synchronized statistics

In single-device mode (num_devices=1), synchronization is skipped. Communication pattern: All-reduce for mean, All-reduce for variance, Broadcast parameters.

\begin{aligned}
\mu_{\text{batch}} &= \frac{1}{BHW} \sum_{b,h,w} x[b,c,h,w], \qquad
\sigma_{\text{batch}}^2 = \frac{1}{BHW} \sum_{b,h,w} \left(x[b,c,h,w] - \mu_{\text{batch}}\right)^2 \quad \text{(per device)} \\
\mu_{\text{sync}} &= \frac{\text{AllReduce}(\mu_{\text{batch}})}{D}, \qquad
\sigma_{\text{sync}}^2 = \frac{\text{AllReduce}(\sigma_{\text{batch}}^2)}{D}, \qquad D = \text{num\_devices} \\
x_{\text{norm}} &= \frac{x - \mu_{\text{sync}}}{\sqrt{\sigma_{\text{sync}}^2 + \epsilon}}, \qquad
y = \gamma\, x_{\text{norm}} + \beta \\
\mu_{\text{run}} &\leftarrow (1-m)\,\mu_{\text{run}} + m\,\mu_{\text{sync}}, \qquad
\sigma_{\text{run}}^2 \leftarrow (1-m)\,\sigma_{\text{run}}^2 + m\,\sigma_{\text{sync}}^2
\end{aligned}
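A minimal TypeScript sketch of steps 1–3, following the formula above. This is illustrative only: the real kernels operate on tensors and use the runtime's collective ops, whereas here each entry stands for one device's statistics for a single channel and the all-reduce is modeled as an average:

type Stats = { mean: number; variance: number };

// Step 1: compute statistics locally on each device.
function localStats(xs: number[]): Stats {
  const mean = xs.reduce((a, b) => a + b, 0) / xs.length;
  const variance = xs.reduce((a, b) => a + (b - mean) ** 2, 0) / xs.length;
  return { mean, variance };
}

// Step 2: all-reduce across devices and divide by D (num_devices).
function syncStats(perDevice: Stats[]): Stats {
  const D = perDevice.length;
  return {
    mean: perDevice.reduce((a, s) => a + s.mean, 0) / D,
    variance: perDevice.reduce((a, s) => a + s.variance, 0) / D,
  };
}

// Step 3: normalize with the synchronized statistics and apply the affine transform.
function normalize(x: number, s: Stats, eps = 1e-5, gamma = 1, beta = 0): number {
  return gamma * ((x - s.mean) / Math.sqrt(s.variance + eps)) + beta;
}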
  • Single device mode: Behavior identical to BatchNorm, no communication overhead
  • Multi-device mode: Synchronizes statistics across all devices for consistency
  • Effective batch size: Much larger when using multiple devices (device_count * per_device_batch_size)
  • Training vs eval: Use train() mode during training (uses batch stats), eval() mode for inference (uses running stats)
  • Running statistics: Tracks exponential moving average of synchronized statistics
  • Distributed training: Essential for consistent model behavior across devices
  • Communication overhead: Small but non-zero; worth it for training stability
  • Momentum parameter: Higher momentum gives more weight to the current batch's statistics, so running stats adapt faster but are noisier (0.1 is a typical default)
  • Fine-tuning: Models trained with SyncBatchNorm may need to be fine-tuned with it
  • Synchronization only works with multiple devices (ignored on single device)
  • Running statistics accumulate across batches; reset at epoch boundaries if needed
  • eps too small can cause NaNs; eps too large reduces normalization effectiveness
  • Ensure all devices have same num_features (checked at construction)

Examples

// Convert regular BatchNorm to SyncBatchNorm for distributed training
const model = new MyModel();  // Contains BatchNorm1d/2d/3d layers

// For single GPU (development): use regular BatchNorm
// For multi-GPU (production): convert to SyncBatchNorm
if (num_gpus > 1) {
  // Convert all BatchNorm layers to SyncBatchNorm (PyTorch API)
  // In torch.js, create SyncBatchNorm directly:
  const sync_bn = new torch.nn.SyncBatchNorm(64, { momentum: 0.1 });
}
// Multi-GPU ResNet training with SyncBatchNorm
class ResNetWithSync extends torch.nn.Module {
  conv1: torch.nn.Conv2d;
  bn1: torch.nn.SyncBatchNorm;
  conv2: torch.nn.Conv2d;
  bn2: torch.nn.SyncBatchNorm;

  constructor() {
    super();
    this.conv1 = new torch.nn.Conv2d(3, 64, 7, { stride: 2, padding: 3 });
    this.bn1 = new torch.nn.SyncBatchNorm(64);  // Synced across devices

    this.conv2 = new torch.nn.Conv2d(64, 128, 3, { stride: 1, padding: 1 });
    this.bn2 = new torch.nn.SyncBatchNorm(128);  // Synced across devices
  }

  forward(x: torch.Tensor): torch.Tensor {
    x = torch.relu(this.bn1.forward(this.conv1.forward(x)));
    x = torch.relu(this.bn2.forward(this.conv2.forward(x)));
    return x;
  }
}

// Training with DataParallel (2 GPUs)
const model = new ResNetWithSync();
// With 2 GPUs and batch_size=64 per GPU:
// Effective batch_size = 128 for normalization statistics

// Single GPU usage: no difference from regular BatchNorm
const sync_bn = new torch.nn.SyncBatchNorm(256);
const x = torch.randn([32, 256, 28, 28]);  // Single GPU

// Synchronization is skipped (no other devices to sync with)
const output = sync_bn.forward(x);  // Same as regular BatchNorm
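Finally, a short sketch of the training vs. eval distinction noted above, assuming the standard Module train()/eval() toggles:

// Train vs eval: batch statistics during training, running statistics in eval
const bn = new torch.nn.SyncBatchNorm(64);
const batch = torch.randn([16, 64, 8, 8]);

bn.train();                         // training mode: uses (synchronized) batch stats
const y_train = bn.forward(batch);  //   and updates running mean/variance

bn.eval();                          // inference mode: uses accumulated running stats
const y_eval = bn.forward(batch);   //   no statistics are updated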

See Also

  • PyTorch torch.nn.SyncBatchNorm