torch.optim.lr_scheduler.CosineAnnealingLR
class CosineAnnealingLR extends LRScheduler

new CosineAnnealingLR(optimizer: Optimizer, options: {
/** Maximum number of iterations */
T_max: number;
/** Minimum learning rate (default: 0) */
eta_min?: number;
/** The index of last epoch (default: -1) */
last_epoch?: number;
/** Whether to print a message for each update (default: false) */
verbose?: boolean;
})
Constructor Parameters
optimizer (Optimizer) – Wrapped optimizer
options (object) – Scheduler options:
T_max (number) – Maximum number of iterations
eta_min (number) – Minimum learning rate (default: 0)
last_epoch (number) – The index of last epoch (default: -1)
verbose (boolean) – Whether to print a message for each update (default: false)
CosineAnnealingLR scheduler: Anneals the learning rate along a cosine curve down to eta_min (0 by default).
CosineAnnealingLR sets the learning rate of each parameter group using a cosine annealing schedule. The learning rate decreases smoothly from its initial value to eta_min following a cosine curve over T_max epochs, then remains constant. This smooth decay is empirically better than step-wise decay for many tasks.
Key advantages over StepLR:
- Smooth continuous decay instead of abrupt drops
- Often achieves better final accuracy in practice
- Reduces risk of missing good local minima at decay boundaries
- Motivated by work on loss landscape geometry and warm restarts (see the SGDR paper below)
When to use CosineAnnealingLR:
- Modern deep learning (recommended as default for many tasks)
- Transformer models (especially with warmup)
- When you want smooth learning rate decay
- When final model quality is more important than convergence speed
- Good baseline: use CosineAnnealingLR unless you have strong reasons otherwise
Trade-offs:
- Learning rate decays continuously, can be slower initially
- Requires specifying T_max (total training epochs)
- Doesn't adapt to actual training progress (metric-agnostic)
- Compare to ReduceOnPlateau for metric-based adaptation
- Compare to CosineAnnealingWarmRestarts for periodic restarts
Algorithm: Learning rate decays from η_base to η_min following a cosine curve:
- η_t = η_min + (η_base - η_min) / 2 * (1 + cos(π * t / T_max))
- t is current epoch (0 to T_max)
- cos goes from 1 (at t=0) to -1 (at t=T_max)
- Learning rate goes from η_base (at t=0) to η_min (at t=T_max)
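As a sketch of the formula above (plain TypeScript, no torch dependency; `cosineLR` is a hypothetical helper named here for illustration):

```typescript
// Closed-form cosine annealing:
// eta_t = eta_min + (eta_base - eta_min) / 2 * (1 + cos(pi * t / T_max))
function cosineLR(etaBase: number, etaMin: number, t: number, tMax: number): number {
  return etaMin + ((etaBase - etaMin) / 2) * (1 + Math.cos((Math.PI * t) / tMax));
}

cosineLR(0.1, 0, 0, 100);   // t = 0: full initial rate, 0.1
cosineLR(0.1, 0, 50, 100);  // t = T_max/2: halfway point, 0.05
cosineLR(0.1, 0, 100, 100); // t = T_max: eta_min, 0
```

Note that the halfway point lands exactly at the midpoint between eta_base and eta_min, which is a handy sanity check when wiring up a scheduler.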
- Smooth decay: Unlike StepLR's abrupt drops, cosine decay is continuous and smooth.
- Good default: CosineAnnealingLR is recommended as default schedule before trying alternatives.
- T_max critical: Set T_max to your planned total training epochs. Too small → lr becomes constant early.
- eta_min important: Default eta_min=0 means the learning rate reaches exactly 0 at T_max, halting updates. Consider setting eta_min > 0 for fine-tuning.
- Empirically strong: Cosine schedules often generalize better than step decay in modern architectures.
- Paper: SGDR: Stochastic Gradient Descent with Warm Restarts (Loshchilov & Hutter, 2016) motivates cosine annealing.
- Warm restarts: CosineAnnealingWarmRestarts offers periodic restarts - useful for escaping local minima.
- Warmup common: Typically combined with warmup (LinearLR) for transformer training: warmup → cosine decay.
- Not adaptive: Doesn't adapt to training progress. ReduceOnPlateau does adapt but is metric-aware.
- Multiple parameters: Works with parameter groups, decays each group's lr by same cosine schedule.
Examples
// Standard CosineAnnealingLR for 100 epochs
const scheduler = new torch.optim.CosineAnnealingLR(optimizer, { T_max: 100 });
for (let epoch = 0; epoch < 100; epoch++) {
train();
validate();
scheduler.step();
}

// CosineAnnealingLR with minimum learning rate
const scheduler = new torch.optim.CosineAnnealingLR(optimizer, {
T_max: 100,
eta_min: 0.0001 // Don't decay below 1e-4
});
// Without eta_min, lr decays to exactly 0 at T_max
// eta_min prevents optimization from completely stopping

// CosineAnnealingLR with warmup (chain with LinearLR)
const warmup = new torch.optim.LinearLR(optimizer, { total_iters: 10 });
const cosine = new torch.optim.CosineAnnealingLR(optimizer, { T_max: 90 });
const scheduler = new torch.optim.SequentialLR(optimizer, [warmup, cosine], [10]);
// Common in transformer training: warm up for 10 epochs, then cosine decay

// Resume training with CosineAnnealingLR
const checkpoint = load_checkpoint('model.pth');
const scheduler = new torch.optim.CosineAnnealingLR(optimizer, {
T_max: 100,
eta_min: 0.0001,
last_epoch: checkpoint.epoch - 1 // Resume at correct position
});

// Comparison: different T_max values affect schedule shape
const scheduler_short = new torch.optim.CosineAnnealingLR(optimizer, { T_max: 50 }); // Fast decay
const scheduler_long = new torch.optim.CosineAnnealingLR(optimizer, { T_max: 200 }); // Slow decay
// Longer T_max means more gradual decay, giving optimizer longer to converge
// Shorter T_max decays faster to minimum, good if eta_min is not too small
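To make the shape difference concrete, here is a hedged sketch evaluating the closed-form schedule from the Algorithm section at epoch 50 under both settings (plain TypeScript, no torch dependency; the values eta_base = 0.1 and eta_min = 0 are illustrative assumptions):

```typescript
// eta_t = eta_min + (eta_base - eta_min) / 2 * (1 + cos(pi * t / T_max))
const lr = (t: number, tMax: number, etaBase = 0.1, etaMin = 0): number =>
  etaMin + ((etaBase - etaMin) / 2) * (1 + Math.cos((Math.PI * t) / tMax));

lr(50, 50);  // short schedule: already at eta_min (0)
lr(50, 200); // long schedule: ~0.085, still about 85% of the initial rate
```

At the same epoch, the short schedule has finished decaying while the long one has barely started, so T_max should always match the planned training length.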