torch.optim.RMSprop

class RMSprop extends Optimizer

new RMSprop(params: Tensor[] | Iterable<Tensor>, options: RMSpropOptions = {})

RMSprop optimizer: Root Mean Square Propagation for adaptive per-parameter learning rates.

RMSprop maintains a running average of squared gradients and divides the current gradient by the root of this average. This provides adaptive learning rates that scale inversely with the magnitude of recent gradients, helping parameters with large gradients take smaller steps and parameters with small gradients take larger steps.

Originally proposed by Geoff Hinton in his Coursera lecture notes, RMSprop was one of the first adaptive methods and predates Adam. While generally superseded by Adam/AdamW for new projects, RMSprop remains useful in specific scenarios:

Recommended for:

Recurrent neural networks (RNNs, LSTMs, GRUs) - often works better than Adam
Reinforcement learning - simpler than Adam, reduces oscillations in value function learning
Legacy codebases - proven track record in existing systems
Memory-constrained environments - less optimizer state than Adam (no first-moment average of gradients unless momentum is enabled)
Non-stationary objectives - faster adaptation than Adam in some cases
Hyperparameter sensitivity studies - simpler to tune than Adam variants

Key Properties:

Simpler than Adam: maintains only the second moment (squared gradient average)
No built-in first moment: momentum is optional and enabled through the momentum option
Centered variant: subtracts squared mean gradient, can reduce learning rate oscillations
Efficient: lower memory overhead than Adam (one accumulator vs two)
Stable: proven to work well across many tasks despite being older

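\begin{aligned} v_t &= \alpha v_{t-1} + (1 - \alpha)\, \mathbf{g}_t^2 \\ \mathbf{p} &\leftarrow \mathbf{p} - \frac{\alpha_{\text{step}}}{\sqrt{v_t + \epsilon}}\, \mathbf{g}_t \\ \text{(With momentum)}:\quad \mathbf{m}_t &= \mu \mathbf{m}_{t-1} + \frac{\mathbf{g}_t}{\sqrt{v_t + \epsilon}}, \qquad \mathbf{p} \leftarrow \mathbf{p} - \alpha_{\text{step}} \mathbf{m}_t \end{aligned}

Here \mathbf{g}_t is the gradient of the loss with respect to \mathbf{p} at step t (squared elementwise), \alpha is the smoothing constant, \alpha_{\text{step}} is the learning rate, and \mu is the momentum factor.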
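The update rule can be sketched on plain number arrays. This is an illustration only, not the library's implementation: `rmspropStep`, `RmsState`, and the default values are hypothetical names and assumptions chosen for the sketch, and epsilon is placed inside the square root as in the formula above.

```typescript
// Illustrative only: plain number arrays stand in for tensors.
interface RmsState {
  squareAvg: number[];   // v_t, running average of squared gradients
  momentumBuf: number[]; // m_t, used only when momentum > 0
}

function rmspropStep(
  params: number[],
  grads: number[],
  state: RmsState,
  lr = 0.01,      // alpha_step in the formula (assumed default)
  alpha = 0.99,   // smoothing constant (assumed default)
  eps = 1e-8,     // assumed default
  momentum = 0,
): void {
  for (let i = 0; i < params.length; i++) {
    const g = grads[i];
    // v_t = alpha * v_{t-1} + (1 - alpha) * g^2
    state.squareAvg[i] = alpha * state.squareAvg[i] + (1 - alpha) * g * g;
    const scaled = g / Math.sqrt(state.squareAvg[i] + eps);
    if (momentum > 0) {
      // m_t = mu * m_{t-1} + g / sqrt(v_t + eps)
      state.momentumBuf[i] = momentum * state.momentumBuf[i] + scaled;
      params[i] -= lr * state.momentumBuf[i];
    } else {
      params[i] -= lr * scaled;
    }
  }
}

// Two parameters whose gradients differ in magnitude by a factor of 8.
const demoParams = [1.0, -2.0];
const demoState: RmsState = { squareAvg: [0, 0], momentumBuf: [0, 0] };
rmspropStep(demoParams, [0.5, -4.0], demoState);
console.log(demoParams);
```

On this first step both parameters move by roughly the same distance (about 0.1) even though their gradients differ by a factor of 8, which is the per-parameter normalization the formula describes.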

When to use RMSprop: RNNs/LSTMs, reinforcement learning, memory-constrained environments. For new projects, prefer AdamW.
Learning rate: Higher default than Adam (1e-2 vs 1e-3). Start with 0.01 or 0.001 and adjust based on convergence.
Alpha parameter: Smoothing constant. 0.99 works for most tasks, 0.95 for faster adaptation, 0.999 for very smooth running average.
Momentum interaction: RMSprop combined with momentum is a common and effective pairing. Typical momentum values are 0.9 or 0.95.
Centered variant: Use when the mean gradient is large relative to its variance, so the uncentered second moment overstates how noisy the gradient is. Helps with stability in some cases.
RNN advantage: Often reported to work better than Adam for RNNs in practice; treat this as an empirical observation and compare both on your task.
RL proven: Demonstrated effectiveness in DQN, A3C, and other deep RL algorithms.
Memory efficient: Uses one accumulator per parameter (vs two for Adam) when momentum and centered are both off; each of those options adds one more buffer per parameter.
Coupled weight decay: Unlike AdamW, weight decay is coupled (added to gradient). For better generalization use decoupled decay or AdamW.
Hyperparameter stability: Simpler parameter set makes grid search more manageable than Adam variants.
Not recommended for Transformers: Adam/AdamW superior for attention-based models. RMSprop better for RNNs and RL.
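To make the alpha guidance above concrete, the sketch below estimates how many recent steps the running average effectively covers. It relies on the standard rule of thumb that an exponential moving average with smoothing constant alpha has a horizon of about 1 / (1 - alpha) steps; `effectiveWindow` and `weightOfPastGradient` are hypothetical helper names, not part of the library.

```typescript
// Approximate number of recent steps an exponential moving average covers.
function effectiveWindow(alpha: number): number {
  return 1 / (1 - alpha);
}

// Weight the running average still assigns to a squared gradient
// that was observed `stepsAgo` steps in the past.
function weightOfPastGradient(alpha: number, stepsAgo: number): number {
  return (1 - alpha) * Math.pow(alpha, stepsAgo);
}

for (const alpha of [0.95, 0.99, 0.999]) {
  console.log(`alpha=${alpha}: window of about ${Math.round(effectiveWindow(alpha))} steps`);
}
```

Under this rule of thumb, 0.95 averages over roughly 20 steps, 0.99 over roughly 100, and 0.999 over roughly 1000, which is why lower values adapt faster and higher values give a smoother estimate.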

Examples

// Basic RMSprop for RNN training
const model = new RNNModel();
const rmsprop = new torch.optim.RMSprop(model.parameters(), { lr: 0.01 });

for (let epoch = 0; epoch < 20; epoch++) {
  for (const batch of train_loader) {
    const outputs = model.forward(batch.x);
    const loss = criterion.forward(outputs, batch.y);

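    rmsprop.zero_grad();  // Clear gradients left over from the previous step
    loss.backward();      // Compute gradients; without this, step() has nothing to apply
    rmsprop.step();       // Apply the RMSprop update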
  }
}

// RMSprop with momentum for faster convergence
const rmsprop = new torch.optim.RMSprop(model.parameters(), {
  lr: 0.01,
  momentum: 0.9,  // Combines adaptive learning rate with momentum
  alpha: 0.99     // Smoothing constant
});

// Momentum + adaptive learning rates provides faster convergence than either alone
// Common choice for RNN and reinforcement learning

// Centered RMSprop for more stable convergence
const rmsprop = new torch.optim.RMSprop(model.parameters(), {
  lr: 0.001,
  centered: true,  // Subtract squared mean gradient from second moment
  momentum: 0.9
});

// Centered variant uses variance of gradients instead of second moment
// Can reduce oscillations when average gradient magnitude is significant
// Tracks one extra running average per parameter, so each step costs slightly more; it may still converge in fewer steps
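The difference between the centered and uncentered denominators can be sketched for a single scalar parameter. This is a minimal illustration under assumptions, not the library's code: `rmsDenominator` and `CenteredState` are hypothetical names, and epsilon is added inside the square root to match the formula earlier in this page (other implementations may add it outside).

```typescript
interface CenteredState {
  squareAvg: number; // running average of g^2
  gradAvg: number;   // running average of g (used only when centered)
}

// Returns the value the gradient is divided by at this step.
function rmsDenominator(
  g: number,
  s: CenteredState,
  centered: boolean,
  alpha = 0.99,
  eps = 1e-8,
): number {
  s.squareAvg = alpha * s.squareAvg + (1 - alpha) * g * g;
  if (!centered) {
    return Math.sqrt(s.squareAvg + eps);
  }
  s.gradAvg = alpha * s.gradAvg + (1 - alpha) * g;
  // Second moment minus squared mean is an estimate of the gradient variance.
  const variance = Math.max(s.squareAvg - s.gradAvg * s.gradAvg, 0);
  return Math.sqrt(variance + eps);
}

// Feed a perfectly steady gradient of 1.0 to both variants.
const plainState: CenteredState = { squareAvg: 0, gradAvg: 0 };
const centeredState: CenteredState = { squareAvg: 0, gradAvg: 0 };
let plainDenom = 0;
let centeredDenom = 0;
for (let step = 0; step < 2000; step++) {
  plainDenom = rmsDenominator(1.0, plainState, false);
  centeredDenom = rmsDenominator(1.0, centeredState, true);
}
console.log({ plainDenom, centeredDenom });
```

With a steady gradient the uncentered denominator settles near the gradient magnitude, while the centered one shrinks toward zero because the estimated variance vanishes. The centered variant therefore normalizes by how much the gradient fluctuates rather than by how large it is.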

// RMSprop for value function learning in reinforcement learning
const rmsprop = new torch.optim.RMSprop(q_network.parameters(), {
  lr: 0.00025,     // Lower learning rate for stable value function learning
  momentum: 0.95,  // High momentum to smooth out noisy gradients
  alpha: 0.99,
  eps: 1e-7
});

// Stable value functions are crucial in RL (DQN, A3C, etc.)
// RMSprop often outperforms Adam in this setting

// RMSprop with weight decay for regularization
const rmsprop = new torch.optim.RMSprop(model.parameters(), {
  lr: 0.01,
  alpha: 0.99,
  weight_decay: 1e-5,
  momentum: 0.9
});

// Note: RMSprop uses coupled weight decay (added to gradient)
// For better generalization, consider using AdamW with decoupled decay instead
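The practical difference between coupled and decoupled decay can be sketched for one scalar parameter. Both functions below are hypothetical illustrations: `coupledDecayStep` follows the "added to gradient" description above, while `decoupledDecayStep` mimics the AdamW-style approach for comparison and is not an option this page documents for RMSprop.

```typescript
interface ScalarState {
  p: number; // parameter value
  v: number; // running average of squared gradients
}

// Coupled: the decay term joins the gradient and is rescaled adaptively with it.
function coupledDecayStep(s: ScalarState, g: number, lr: number, wd: number, alpha = 0.99, eps = 1e-8): void {
  const gEff = g + wd * s.p;
  s.v = alpha * s.v + (1 - alpha) * gEff * gEff;
  s.p -= (lr * gEff) / Math.sqrt(s.v + eps);
}

// Decoupled (AdamW-style, shown only for comparison): the parameter is shrunk
// directly, and the adaptive step sees the raw gradient.
function decoupledDecayStep(s: ScalarState, g: number, lr: number, wd: number, alpha = 0.99, eps = 1e-8): void {
  s.p *= 1 - lr * wd;
  s.v = alpha * s.v + (1 - alpha) * g * g;
  s.p -= (lr * g) / Math.sqrt(s.v + eps);
}

// One step with a zero loss gradient, so only the decay term acts.
const coupled: ScalarState = { p: 1, v: 0 };
const decoupled: ScalarState = { p: 1, v: 0 };
coupledDecayStep(coupled, 0, 0.01, 0.01);
decoupledDecayStep(decoupled, 0, 0.01, 0.01);
console.log({ coupled: coupled.p, decoupled: decoupled.p });
```

In this single step the coupled version shrinks the parameter by about 0.1, because the small decay term is divided by its own small running average, while the decoupled version shrinks it by lr * wd = 0.0001. This is why the strength of coupled decay depends on the gradient scale rather than on weight_decay alone.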

// LSTM training with RMSprop (recommended over Adam for LSTMs)
const lstm = new torch.nn.LSTM(input_size, hidden_size, num_layers);
const rmsprop = new torch.optim.RMSprop(lstm.parameters(), {
  lr: 0.001,
  momentum: 0.9,
  alpha: 0.95    // Slightly lower alpha for faster adaptation
});

// LSTMs often work better with RMSprop than Adam due to:
// - Simpler adaptive mechanism avoiding complex interactions
// - Momentum helping with long sequence propagation
// - Proven empirical success in sequence modeling
