torch.optim.Adadelta
new Adadelta(params: Tensor[] | Iterable<Tensor>, options: AdadeltaOptions = {})
Adadelta optimizer: an adaptive learning-rate method that requires no manual learning-rate tuning.
Adadelta improves upon Adagrad by addressing the monotonically decreasing learning rate problem. Instead of accumulating all squared gradients indefinitely (like Adagrad), Adadelta uses an exponential moving window of squared gradients (similar to RMSprop). Additionally, it maintains a running average of squared parameter updates.
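The update rule described above can be sketched for a single scalar parameter. This is a minimal illustration of the algorithm, not the library's implementation; `adadeltaStep` and the accumulator names are made up for this sketch:

```typescript
// One Adadelta step for a single scalar parameter.
// sqAvg    = decaying average of squared gradients, E[g^2]
// deltaAcc = decaying average of squared updates,   E[dx^2]
function adadeltaStep(
  x: number, grad: number, sqAvg: number, deltaAcc: number,
  rho = 0.9, eps = 1e-6, lr = 1.0
): [number, number, number] {
  // Exponential moving window of squared gradients (RMSprop-style)
  sqAvg = rho * sqAvg + (1 - rho) * grad * grad;
  // Step = (RMS of past updates / RMS of past gradients) * grad.
  // The numerator is what gives the step the units of the parameter itself.
  const delta = (Math.sqrt(deltaAcc + eps) / Math.sqrt(sqAvg + eps)) * grad;
  // Exponential moving window of squared parameter updates
  deltaAcc = rho * deltaAcc + (1 - rho) * delta * delta;
  return [x - lr * delta, sqAvg, deltaAcc];
}

// Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3)
let x = 0, sqAvg = 0, deltaAcc = 0;
for (let t = 0; t < 2000; t++) {
  [x, sqAvg, deltaAcc] = adadeltaStep(x, 2 * (x - 3), sqAvg, deltaAcc);
}
console.log(x.toFixed(2)); // approaches the optimum at 3
```

Note how slowly the first steps move: `deltaAcc` starts at zero, so early updates are on the order of `sqrt(eps)`, and the step size grows as the update history accumulates.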
Key Innovation: Adadelta is designed to be "units invariant": each update carries the units of the parameter it modifies, so the algorithm is automatically invariant to the scale of the parameters. As a result, the default learning rate of 1.0 typically works across different problems without tuning, whereas Adam's learning rate usually benefits from tuning. The algorithm adapts to both the gradient history AND the parameter update history.
Unique Properties:
- No learning rate decay: Uses exponential moving average, not indefinite accumulation
- Adaptive update magnitude: Scales updates using history of previous deltas
- Units invariant: Works without tuning learning rate (default 1.0)
- Heuristic second-order approximation: The units correction loosely mimics a diagonal second-order rescaling
- Comparable memory to Adam: Stores two accumulators per parameter (squared gradients and squared updates), on par with Adam's two moment estimates
Limitations:
- Slower convergence than Adam for most deep learning tasks
- More complex update rule (harder to reason about)
- Generally outperformed by Adam/AdamW in practice
- Less commonly used in modern deep learning
Still useful for:
- Problems where you don't want to tune learning rate
- Tasks where natural gradient approximation helps (some convex problems)
- Environments with strict reproducibility requirements (no lr tuning)
- Comparison baselines (units-invariant baseline)
Notes:
- No learning rate tuning needed: Default lr=1.0 works across problems. This is unusual among optimizers.
- Units invariant design: Algorithm automatically handles different parameter scales without modification.
- Slower than Adam: Convergence usually slower than Adam/AdamW in practice, not recommended for new projects.
- Complex to understand: Update rule involving both gradient and parameter update history is non-intuitive.
- Rho parameter: Controls length of moving average. 0.9 standard, 0.95-0.99 for longer memory.
- Natural gradient approximation: Heuristic second-order approximation, but not true natural gradient.
- Memory efficient: Stores two accumulators per parameter, on par with Adam's two moment buffers.
- Fair baseline: Good for research comparing optimizers due to units-invariance and no lr tuning.
- Rarely outperforms: Adam/AdamW consistently better. Adadelta mainly useful for special cases.
- Historical importance: Bridge between Adagrad and modern optimizers (2012), influenced subsequent work.
- Reproducibility advantage: No lr tuning required = easier reproducibility across problems.
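The rho note above can be made concrete: the exponential moving average assigns weight (1 - rho) * rho^k to the gradient from k steps back, so roughly the most recent 1/(1 - rho) steps dominate the average. A small illustrative sketch (`weightWithinWindow` is a name made up here):

```typescript
// Total weight the squared-gradient EMA assigns to the most recent
// `window` gradients. The weight on a gradient from k steps back is
// (1 - rho) * rho^k, and the partial sum telescopes to 1 - rho^window.
function weightWithinWindow(rho: number, window: number): number {
  let w = 0;
  for (let k = 0; k < window; k++) w += (1 - rho) * Math.pow(rho, k);
  return w; // equals 1 - rho^window
}

console.log(weightWithinWindow(0.9, 10).toFixed(3));  // 0.651
console.log(weightWithinWindow(0.99, 10).toFixed(3)); // 0.096
```

With rho=0.9 about two thirds of the average comes from the last 10 gradients; with rho=0.99 the last 10 gradients contribute under 10%, which is why higher rho behaves like a longer, smoother memory.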
Examples
// Basic Adadelta without learning rate tuning
const model = new NeuralNetwork();
const adadelta = new torch.optim.Adadelta(model.parameters());
// Default learning rate of 1.0 works without tuning
for (let epoch = 0; epoch < 10; epoch++) {
  for (const batch of train_loader) {
    const loss = model.loss(batch.x, batch.y);
    adadelta.zero_grad();
    loss.backward();
    adadelta.step();
  }
}

// Adadelta with custom rho for different decay rates
const adadelta = new torch.optim.Adadelta(model.parameters(), {
  rho: 0.95 // Longer moving average window than the 0.9 default
});
// Higher rho (0.99): longer memory, smoother updates
// Lower rho (0.9): shorter memory, faster adaptation to recent gradients
// Adadelta still works without tuning lr

// Adadelta with weight decay for regularization
const adadelta = new torch.optim.Adadelta(model.parameters(), {
  rho: 0.9,
  weight_decay: 1e-5,
  eps: 1e-6
});
// Adadelta + weight decay adds L2-style regularization
// Still benefits from the algorithm's natural scale invariance

// Adadelta as baseline for comparison (units-invariant)
const adadelta = new torch.optim.Adadelta(model.parameters());
// Use as reference for comparing other optimizers
// Since it doesn't require learning rate tuning, provides fair baseline
// If another optimizer beats Adadelta, it's a real improvement, not a tuning artifact

// Adadelta with adjusted learning rate for convergence speed
const adadelta = new torch.optim.Adadelta(model.parameters(), {
  lr: 0.5, // Reduce for slower, more stable convergence
  rho: 0.95, // Longer moving average window
  eps: 1e-7 // Smaller epsilon for gradient-heavy problems
});
// Even with an adjusted lr, updates are still rescaled by both accumulators
// The default lr of 1.0 is rarely beaten by tuning