torch.optim.Adagrad
new Adagrad(params: Tensor[] | Iterable<Tensor>, options: AdagradOptions = {})
Adagrad optimizer: Adaptive Subgradient Methods with per-parameter learning rates.
Adagrad was one of the first adaptive gradient methods, designed for problems with sparse data where different parameters receive gradient updates at different frequencies. It adapts the learning rate for each parameter based on the sum of squared gradients it has received during training.
Key Innovation: Parameters that have historically had large gradients have lower effective learning rates, while parameters with small historical gradients maintain higher learning rates. This is particularly effective for sparse data where some features occur infrequently.
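The per-parameter rule can be sketched in a few lines. This is an illustrative sketch on plain number arrays, not the library's Tensor API; `adagradStep` is a hypothetical helper, and the defaults mirror the documented `lr` and `eps` values.

```typescript
// One Adagrad step on plain number arrays (illustration only).
function adagradStep(
  params: number[],
  grads: number[],
  accum: number[], // running sum of squared gradients, one slot per parameter
  lr = 0.01,
  eps = 1e-10
): void {
  for (let i = 0; i < params.length; i++) {
    accum[i] += grads[i] * grads[i]; // accumulate g^2 indefinitely
    params[i] -= (lr * grads[i]) / (Math.sqrt(accum[i]) + eps);
  }
}

// Parameter 0 receives a gradient on every step (a frequent feature);
// parameter 1 only on the final step (a rare feature).
const params = [1.0, 1.0];
const accum = [0, 0];
for (let t = 0; t < 9; t++) adagradStep(params, [1.0, 0.0], accum);
adagradStep(params, [1.0, 1.0], accum);
// After 10 steps, accum = [10, 1]: the rare parameter's final step,
// lr / sqrt(1), is about 3x larger than the frequent one's lr / sqrt(10).
```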
Limitations (why RMSprop/Adam are preferred now):
- Monotonically decreasing learning rates: accumulates squared gradients indefinitely
- Eventually learning rate becomes too small to make progress in non-convex problems
- Better suited for convex optimization than deep learning
- Learning rate decay cannot be reversed
Still useful for:
- Problems with very sparse data (e.g., NLP with rare words, recommendation systems)
- Online learning where you see each example once
- Convex optimization tasks
- Serving as baseline for sparse feature handling
- Legacy systems with proven Adagrad configurations
Historical Note: Adagrad (2011) was groundbreaking for introducing per-parameter adaptive learning rates. RMSprop (2012) and Adam (2014) improved on the concept by using exponential moving averages instead of indefinite accumulation, enabling better convergence on non-convex problems.
- Not recommended for general deep learning: Use RMSprop, Adam, or AdamW instead. The monotonic decay of the learning rate is problematic for non-convex optimization.
- Sparse data advantage: Excellent for sparse gradients where parameters have different update frequencies.
- Learning rate accumulation: The effective learning rate decreases over time as squared gradients accumulate. Unlike RMSprop's, this decay cannot be reversed.
- Better alternatives exist: RMSprop improved Adagrad for non-convex problems, Adam improved further for deep learning.
- Epsilon smaller than Adam: Default eps=1e-10 vs Adam's 1e-8, because the ever-growing accumulator keeps the denominator away from zero, so a larger epsilon is unnecessary.
- Online learning strength: Original intended use case. Still works well for streaming/one-pass learning.
- Sparse feature handling: Natural fit for NLP embeddings and recommendation systems with frequency-based adaptation.
- Learning rate decay: Additional lr_decay parameter provides manual control beyond automatic accumulation.
- Memory equivalent to RMSprop: Stores one accumulator per parameter (RMSprop also stores one; Adam stores two), making it memory efficient.
- Historical importance: First adaptive per-parameter learning rate method (2011). Influenced all subsequent adaptive optimizers.
Examples
// Basic Adagrad for online learning with sparse data
const model = new SparseFeatureModel();
const adagrad = new torch.optim.Adagrad(model.parameters(), { lr: 0.01 });
for (const example of sparse_data_stream) {
  const pred = model.forward(example.x);
  const loss = criterion.forward(pred, example.y);
  adagrad.zero_grad();
  loss.backward();
  adagrad.step();
}

// Adagrad for sparse embeddings (e.g., word embeddings in NLP)
const embedding = new torch.nn.Embedding(vocab_size, embedding_dim);
const adagrad = new torch.optim.Adagrad(embedding.parameters(), {
  lr: 0.01,
  eps: 1e-8 // Prevent division by zero for rare words
});
// Words that appear frequently get smaller gradient updates
// Words that appear rarely maintain larger updates (good for rare words!)
// This adaptive behavior is natural for language data

// Adagrad with learning rate decay for non-stationary problems
const adagrad = new torch.optim.Adagrad(model.parameters(), {
  lr: 0.01,
  lr_decay: 0.0001 // Decay learning rate over time
});
// Learning rate decay: lr_t = lr / (1 + (t - 1) * lr_decay), t = step count
// Helps prevent premature convergence in non-convex settings
// But note: Adagrad already has decreasing rates from accumulation

// Adagrad for recommendation systems with sparse user/item data
const model = new CollaborativeFilteringModel();
const adagrad = new torch.optim.Adagrad(model.parameters(), {
  lr: 0.01,
  eps: 1e-6,
  weight_decay: 1e-5
});
// User/item parameters that appear in many interactions get small updates
// User/item parameters that appear rarely maintain larger updates
// Weight decay helps generalization (prevents overfitting to popular items)

// Adagrad with an initial accumulator value (pre-warming)
const adagrad = new torch.optim.Adagrad(model.parameters(), {
  lr: 0.01,
  initial_accumulator_value: 0.1 // Start with a non-zero accumulator
});
// An initial accumulator value > 0 damps the first few updates
// Useful when you want more conservative early updates
// Common in online learning from streaming data

// Adagrad as a baseline for comparison with modern optimizers
const adagrad = new torch.optim.Adagrad(model.parameters(), {
  lr: 0.01,
  weight_decay: 1e-5
});
// For sparse data, Adagrad is often competitive with modern methods
// For dense neural networks, Adam/AdamW significantly outperform it
// Adagrad useful mainly for:
// - Very sparse features (NLP word embeddings)
// - Online learning scenarios
// - Convex problems