torch.nn.AdaptiveLogSoftmaxWithLoss

class AdaptiveLogSoftmaxWithLoss extends Module

new AdaptiveLogSoftmaxWithLoss(in_features: number, n_classes: number, cutoffs: number[], options?: AdaptiveLogSoftmaxOptions)

readonly in_features (number)
readonly n_classes (number)
readonly cutoffs (number[])
readonly div_value (number)
readonly head_bias (boolean)

AdaptiveLogSoftmaxWithLoss activation and loss function.

AdaptiveLogSoftmaxWithLoss is an efficient approximation of softmax designed for problems with very large output vocabularies (e.g., language modeling with 100K+ words). Standard softmax computes a score for every class, which becomes prohibitively expensive for huge vocabularies. Adaptive softmax partitions the classes into frequency-based clusters, which sharply reduces computation while maintaining good accuracy. The key insight is that most targets are frequent classes: these are scored directly by a full-capacity head classifier, while rare classes are grouped into tail clusters with smaller, lower-dimensional classifiers that are evaluated only when needed.

Core idea: Rather than a single softmax over all classes, divide classes into clusters (e.g., the most common 1000 classes, the next 10000 classes, and the rest). Later clusters are typically larger but use fewer features. During training, only the head and, when the target is a rare class, the one tail cluster containing it are evaluated, reducing computation from O(vocab_size) to O(cluster_size).

When to use AdaptiveLogSoftmaxWithLoss:

Large vocabularies: 100K+ output classes (NLP, language modeling, machine translation)
Word prediction: Next-token prediction in language models, sequence-to-sequence models
Computational constraints: Need to reduce softmax computational cost significantly
Memory constraints: Reduce parameter count for large embedding models
NOT for small vocabs: Use standard CrossEntropyLoss for <10K classes (overhead not worth it)

Clustering strategy:

Organize classes by frequency: most common classes in head, rare in tail
Example cutoffs=[1000, 11000] creates three clusters: [0-999], [1000-10999], [11000-...]
Tail cluster i (counting from 0) has reduced dimension: in_features / (div_value ^ (i + 1))
Common classes get rich representations; rare classes share coarser features
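The clustering rules above can be sketched in plain TypeScript. `clusterLayout` and `TailCluster` are hypothetical names introduced for illustration, not part of the library; the sketch assumes that tail 0 uses in_features / div_value (as in the worked example on this page) and that fractional dimensions are floored.

```typescript
// Hypothetical helper for illustration; not part of the library API.
interface TailCluster {
  start: number; // first class index (inclusive)
  end: number;   // last class index (exclusive)
  size: number;  // number of classes in the cluster
  dim: number;   // reduced feature dimension
}

function clusterLayout(
  inFeatures: number,
  nClasses: number,
  cutoffs: number[],
  divValue: number = 4.0
): { headOutputs: number; tails: TailCluster[] } {
  const bounds = [...cutoffs, nClasses];
  const tails: TailCluster[] = [];
  for (let i = 0; i < cutoffs.length; i++) {
    const start = bounds[i];
    const end = bounds[i + 1];
    // Assumption: tail i uses in_features / div_value^(i + 1), floored.
    const dim = Math.floor(inFeatures / Math.pow(divValue, i + 1));
    tails.push({ start, end, size: end - start, dim });
  }
  // The head scores the shortlist classes plus one routing logit per tail.
  return { headOutputs: cutoffs[0] + cutoffs.length, tails };
}

const layout = clusterLayout(768, 100000, [1000, 11000], 4.0);
console.log(layout.headOutputs); // 1002
console.log(layout.tails.map((t) => `${t.start}-${t.end - 1}: dim ${t.dim}`));
```

Printing the layout before training is a cheap way to check that the cutoffs produce the cluster sizes and dimensions you intended.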

Algorithm:

Forward (inference):

Compute logits over all classes by combining head and tail cluster logits
Returns log probabilities for all classes
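As a rough illustration of how head and tail logits combine, the sketch below works on plain number arrays instead of tensors. `logSoftmax` and `fullLogProbs` are hypothetical helpers, and the layout of the head logits (shortlist classes first, then one routing logit per tail) is an assumption based on the description on this page.

```typescript
// Hypothetical sketch; logits are passed in directly rather than computed by layers.
function logSoftmax(logits: number[]): number[] {
  const max = logits.reduce((m, x) => Math.max(m, x), -Infinity);
  const logSum = max + Math.log(logits.reduce((s, x) => s + Math.exp(x - max), 0));
  return logits.map((x) => x - logSum);
}

// headLogits: shortlist logits followed by one routing logit per tail cluster.
// tailLogits[i]: logits for the classes of tail cluster i.
function fullLogProbs(headLogits: number[], tailLogits: number[][]): number[] {
  const shortlistSize = headLogits.length - tailLogits.length;
  const headLp = logSoftmax(headLogits);
  // Shortlist classes take their log-probability straight from the head.
  const out = headLp.slice(0, shortlistSize);
  tailLogits.forEach((logits, i) => {
    // log p(class) = log p(tail i) + log p(class | tail i)
    const routing = headLp[shortlistSize + i];
    for (const lp of logSoftmax(logits)) out.push(routing + lp);
  });
  return out;
}
```

Because each tail's probabilities are scaled by its routing probability, the result is a proper distribution over all classes.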

Forward with target (training):

Given input and target class, compute loss more efficiently
Compute the head logits, plus the logits of one tail cluster only when the target falls in it
Much faster: O(cluster_size) vs O(vocab_size)
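To make the saving concrete, the following sketch finds which cluster a target belongs to and how many logits must be evaluated to score it. `locateTarget` and `TargetLocation` are hypothetical names used only for this illustration; they are not the library's internals.

```typescript
// Hypothetical helper; names are illustrative, not the library's internals.
interface TargetLocation {
  tail: number | null;      // null when the target is in the head shortlist
  indexInCluster: number;   // position of the target within its cluster
  outputsEvaluated: number; // logits needed to score this target
}

function locateTarget(target: number, cutoffs: number[], nClasses: number): TargetLocation {
  if (!Number.isInteger(target) || target < 0 || target >= nClasses) {
    throw new RangeError(`target ${target} is outside [0, ${nClasses})`);
  }
  // Head outputs: shortlist classes plus one routing logit per tail.
  const headOutputs = cutoffs[0] + cutoffs.length;
  if (target < cutoffs[0]) {
    return { tail: null, indexInCluster: target, outputsEvaluated: headOutputs };
  }
  const bounds = [...cutoffs, nClasses];
  // Advance to the first tail whose upper bound exceeds the target.
  let tail = 0;
  while (target >= bounds[tail + 1]) tail++;
  const size = bounds[tail + 1] - bounds[tail];
  return { tail, indexInCluster: target - bounds[tail], outputsEvaluated: headOutputs + size };
}

// With cutoffs [1000, 11000] and 100000 classes:
console.log(locateTarget(500, [1000, 11000], 100000).outputsEvaluated);  // 1002
console.log(locateTarget(5000, [1000, 11000], 100000).outputsEvaluated); // 11002
```

A frequent target costs only the head, while a rare target adds the size of its tail cluster, which is why ordering classes by frequency matters.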

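\begin{aligned} \text{Head projection: } & \text{output} = \text{Linear}(\text{in\_features},\ \text{cutoffs}[0] + \text{num\_tails}) \\ \text{Tail cluster } i \text{: } & \text{output} = \text{Linear}(\text{dim}_i,\ \text{size}_i), \quad \text{dim}_i = \text{in\_features} / \text{div\_value}^{\,i+1}, \quad i = 0, 1, \dots \end{aligned}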

Training vs Inference: Training uses efficient cluster-based loss; inference computes all probabilities.
Frequency-based clustering: Works best when classes ordered by frequency (common first).
Hyperparameter tuning: Cutoffs and div_value significantly affect speed/accuracy trade-off.
GPU-friendly: The method was designed around GPU matrix-multiplication costs, so it remains faster than standard softmax on GPU for large vocabularies.
Paper: Efficient softmax approximation for GPUs (Grave et al., ICML 2017).
Common in NLP: Standard for large-vocabulary language models and MT systems.
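The effect of cutoffs and div_value on cost can be estimated before training. The sketch below is a hypothetical cost model, not a library function: it counts multiply-adds per example, ignores biases, and assumes each tail consists of a projection from in_features to dim_i followed by a classifier over the cluster's classes. `tailProbs` (the fraction of targets that fall in each tail) is something you would measure on your own data.

```typescript
// Hypothetical cost model; assumes each tail is a projection to dim_i
// followed by a classifier over its classes. Biases are ignored.
interface CostBreakdown {
  full: number;    // multiply-adds for a standard softmax layer
  head: number;    // multiply-adds for the head (always evaluated)
  tails: number[]; // multiply-adds for each tail cluster
}

function multiplyAdds(
  inFeatures: number,
  nClasses: number,
  cutoffs: number[],
  divValue: number = 4.0
): CostBreakdown {
  const bounds = [...cutoffs, nClasses];
  const head = inFeatures * (cutoffs[0] + cutoffs.length);
  const tails = cutoffs.map((_, i) => {
    const dim = Math.floor(inFeatures / Math.pow(divValue, i + 1));
    const size = bounds[i + 1] - bounds[i];
    return inFeatures * dim + dim * size;
  });
  return { full: inFeatures * nClasses, head, tails };
}

// tailProbs[i]: fraction of training targets that fall in tail i.
function expectedTrainingCost(cost: CostBreakdown, tailProbs: number[]): number {
  return cost.head + cost.tails.reduce((s, c, i) => s + (tailProbs[i] ?? 0) * c, 0);
}

const cost = multiplyAdds(768, 100000, [1000, 11000], 4.0);
console.log(cost.full / expectedTrainingCost(cost, [0.2, 0.1]));
```

The printed ratio is only a rough estimate of the training-time saving; real speedups also depend on batching and hardware.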

Examples

// Language model with large vocabulary
const vocab_size = 100000;  // Large vocabulary (e.g., WordPiece)
const hidden_dim = 768;     // Model hidden dimension

// Adaptive softmax: 1K common words, next 10K, then rest
const softmax = new torch.nn.AdaptiveLogSoftmaxWithLoss(
  hidden_dim,       // in_features
  vocab_size,       // n_classes
  [1000, 11000],    // cutoffs: cluster boundaries
  { div_value: 4.0 } // options: dimension reduction factor per tail
);

const batch_size = 32;      // Example batch size
const hidden = torch.randn([batch_size, hidden_dim]);
const targets = torch.randint(0, vocab_size, [batch_size]);

// Training: efficient loss computation
const { output, loss } = softmax.forward_with_target(hidden, targets);
// Only the head and the cluster containing each target are evaluated

// Inference: get probabilities for all classes
const log_probs = softmax.log_prob(hidden);  // [batch_size, vocab_size]

// Cluster organization: how cutoffs divide vocabulary
// cutoffs = [5000, 50000], vocab_size = 100000, div_value = 4

// Head cluster (common classes): 5000 + 2 = 5002 outputs
// - Classes 0-4999: full in_features
// - Plus 2 routing outputs for 2 tail clusters

// Tail cluster 0: classes 5000-49999 (45K classes)
// - Dimension reduced: in_features / 4
// - Single linear layer projects to 45K

// Tail cluster 1: classes 50000-99999 (50K classes)
// - Dimension reduced: in_features / 16 (div_value^2)
// - Single linear layer projects to 50K

// Comparison: Standard softmax vs Adaptive softmax complexity
const hidden_dim = 768, vocab_size = 100000;

// Standard softmax: O(hidden_dim * vocab_size) = O(76.8M) per forward
const standard_softmax = new torch.nn.Linear(hidden_dim, vocab_size);
// During training: must compute loss over all 100K classes

// Adaptive softmax with cutoffs=[5000, 50000]: much smaller
const adaptive = new torch.nn.AdaptiveLogSoftmaxWithLoss(
  hidden_dim, vocab_size, [5000, 50000]
);
// During training: compute the head (5002 outputs) plus at most one tail cluster per target
// The speedup depends on how often targets fall in the head rather than a tail
