torch.optim.lr_scheduler.CosineAnnealingLR
class CosineAnnealingLR extends LRScheduler

new CosineAnnealingLR(optimizer: Optimizer, options: {
/** Maximum number of iterations */
T_max: number;
/** Minimum learning rate (default: 0) */
eta_min?: number;
/** The index of last epoch (default: -1) */
last_epoch?: number;
/** Whether to print a message for each update (default: false) */
verbose?: boolean;
})
Constructor Parameters
optimizer (Optimizer) – Wrapped optimizer
options (object) – Scheduler options:
T_max (number) – Maximum number of iterations
eta_min (number) – Minimum learning rate (default: 0)
last_epoch (number) – The index of last epoch (default: -1)
verbose (boolean) – Whether to print a message for each update (default: false)
CosineAnnealingLR scheduler: Anneals the learning rate along a cosine curve down to eta_min (0 by default).
CosineAnnealingLR sets the learning rate of each parameter group using a cosine annealing schedule. The learning rate decreases smoothly from its initial value to eta_min following a cosine curve over T_max epochs, then remains constant. This smooth decay is empirically better than step-wise decay for many tasks.
Key advantages over StepLR:
- Smooth continuous decay instead of abrupt drops
- Often achieves better final accuracy in practice
- Reduces risk of missing good local minima at decay boundaries
- Motivated by work on loss landscape geometry and warm restarts (see the SGDR paper below)
When to use CosineAnnealingLR:
- Modern deep learning (recommended as default for many tasks)
- Transformer models (especially with warmup)
- When you want smooth learning rate decay
- When final model quality is more important than convergence speed
- Good baseline: use CosineAnnealingLR unless you have strong reasons otherwise
Trade-offs:
- Learning rate decays continuously, can be slower initially
- Requires specifying T_max (total training epochs)
- Doesn't adapt to actual training progress (metric-agnostic)
- Compare to ReduceOnPlateau for metric-based adaptation
- Compare to CosineAnnealingWarmRestarts for periodic restarts
Algorithm: Learning rate decays from η_base to η_min following a cosine curve:
- η_t = η_min + (η_base - η_min) / 2 * (1 + cos(π * t / T_max))
- t is current epoch (0 to T_max)
- cos goes from 1 (at t=0) to -1 (at t=T_max)
- Learning rate goes from η_base (at t=0) to η_min (at t=T_max)
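As a sketch of the formula above (plain TypeScript, no torch dependency; `cosineLR` is a hypothetical helper named here for illustration):

```typescript
// Closed-form cosine annealing:
// eta_t = eta_min + (eta_base - eta_min) / 2 * (1 + cos(pi * t / T_max))
function cosineLR(etaBase: number, etaMin: number, t: number, tMax: number): number {
  return etaMin + ((etaBase - etaMin) / 2) * (1 + Math.cos((Math.PI * t) / tMax));
}

cosineLR(0.1, 0, 0, 100);   // t = 0: full initial rate, 0.1
cosineLR(0.1, 0, 50, 100);  // t = T_max/2: halfway point, 0.05
cosineLR(0.1, 0, 100, 100); // t = T_max: eta_min, 0
```

Note that the halfway point lands exactly at the midpoint between eta_base and eta_min, which is a handy sanity check when wiring up a scheduler.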
- Smooth decay: Unlike StepLR's abrupt drops, cosine decay is continuous and smooth.
- Good default: CosineAnnealingLR is recommended as default schedule before trying alternatives.
- T_max critical: Set T_max to your planned total training epochs. Too small → lr becomes constant early.
- eta_min important: Default eta_min=0 means the learning rate reaches exactly 0 at T_max, halting updates. Consider setting eta_min > 0 for fine-tuning.
- Empirically strong: Cosine schedules often generalize better than step decay in modern architectures.
- Paper: SGDR: Stochastic Gradient Descent with Warm Restarts (Loshchilov & Hutter, 2016) motivates cosine annealing.
- Warm restarts: CosineAnnealingWarmRestarts offers periodic restarts - useful for escaping local minima.
- Warmup common: Typically combined with warmup (LinearLR) for transformer training: warmup → cosine decay.
- Not adaptive: Doesn't adapt to training progress. ReduceOnPlateau does adapt but is metric-aware.
- Multiple parameters: Works with parameter groups, decays each group's lr by same cosine schedule.
Examples
// Standard CosineAnnealingLR for 100 epochs
const scheduler = new torch.optim.CosineAnnealingLR(optimizer, { T_max: 100 });
for (let epoch = 0; epoch < 100; epoch++) {
train();
validate();
scheduler.step();
}

// CosineAnnealingLR with minimum learning rate
const scheduler = new torch.optim.CosineAnnealingLR(optimizer, {
T_max: 100,
eta_min: 0.0001 // Don't decay below 1e-4
});
// Without eta_min, lr decays to exactly 0 at T_max
// eta_min prevents optimization from completely stopping

// CosineAnnealingLR with warmup (chain with LinearLR)
const warmup = new torch.optim.LinearLR(optimizer, { total_iters: 10 });
const cosine = new torch.optim.CosineAnnealingLR(optimizer, { T_max: 90 });
const scheduler = new torch.optim.SequentialLR(optimizer, [warmup, cosine], [10]);
// Common in transformer training: warm up for 10 epochs, then cosine decay

// Resume training with CosineAnnealingLR
const checkpoint = load_checkpoint('model.pth');
const scheduler = new torch.optim.CosineAnnealingLR(optimizer, {
T_max: 100,
eta_min: 0.0001,
last_epoch: checkpoint.epoch - 1 // Resume at correct position
});

// Comparison: different T_max values affect schedule shape
const scheduler_short = new torch.optim.CosineAnnealingLR(optimizer, { T_max: 50 }); // Fast decay
const scheduler_long = new torch.optim.CosineAnnealingLR(optimizer, { T_max: 200 }); // Slow decay
// Longer T_max means more gradual decay, giving optimizer longer to converge
// Shorter T_max decays faster to minimum, good if eta_min is not too small
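To make the shape difference concrete, here is a hedged sketch evaluating the closed-form schedule from the Algorithm section at epoch 50 under both settings (plain TypeScript, no torch dependency; the values eta_base = 0.1 and eta_min = 0 are illustrative assumptions):

```typescript
// eta_t = eta_min + (eta_base - eta_min) / 2 * (1 + cos(pi * t / T_max))
const lr = (t: number, tMax: number, etaBase = 0.1, etaMin = 0): number =>
  etaMin + ((etaBase - etaMin) / 2) * (1 + Math.cos((Math.PI * t) / tMax));

lr(50, 50);  // short schedule: already at eta_min (0)
lr(50, 200); // long schedule: ~0.085, still about 85% of the initial rate
```

At the same epoch, the short schedule has finished decaying while the long one has barely started, so T_max should always match the planned training length.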