torch.nn.SoftmaxOptions
Softmax activation function.
Softmax is the standard output activation for multi-class classification. It converts raw logits (unconstrained outputs) into a probability distribution over K classes, whose probabilities sum to 1. Each output is in (0, 1), interpretable as the predicted probability of that class. Softmax with cross-entropy loss is the de facto standard for classification tasks in deep learning.
Core idea: Softmax(x_i) = exp(x_i) / Σ_j exp(x_j). The exponential amplifies relative differences between logits, making the largest logit have the highest probability. The division by the sum ensures outputs are normalized probabilities.
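As a concrete illustration of the formula, here is a minimal sketch in plain TypeScript (independent of the torch API):
// Illustrative only: softmax on a small logit vector, computed by hand
function softmaxByHand(logits: number[]): number[] {
  const exps = logits.map(Math.exp);            // exponentiate each logit
  const sum = exps.reduce((a, b) => a + b, 0);  // normalization constant Σ_j exp(x_j)
  return exps.map((e) => e / sum);              // probabilities that sum to 1
}
const p = softmaxByHand([2.0, 1.0, 0.1]); // ≈ [0.659, 0.242, 0.099]: largest logit gets the largest probability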
When to use Softmax:
- Multi-class classification: Output layer for all classification tasks (image, NLP, etc.)
- Standard practice: Pair a softmax output with cross-entropy loss for classification; note that PyTorch-style CrossEntropyLoss expects raw logits because it applies log-softmax internally, so the explicit Softmax is usually applied at inference (see the sketch after this list)
- Not for hidden layers: Use ReLU/GELU for hidden layers, softmax only for output
- Probability outputs: When you need normalized probabilities (not just class scores)
- Categorical predictions: Any task with discrete, mutually-exclusive classes
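The usual division of labor under that pairing, as a hedged sketch (it assumes torch.nn.CrossEntropyLoss, referenced in the examples below, mirrors PyTorch in consuming raw logits; its exact signature is not shown in this document):
// Training vs. inference split (sketch; CrossEntropyLoss usage is assumed, not shown here)
const head = new torch.nn.Linear(128, 10);   // toy 10-class classifier head
const softmax = new torch.nn.Softmax(-1);
const features = torch.randn([32, 128]);
const logits = head.forward(features);       // raw, unnormalized scores
// Training: feed `logits` directly to CrossEntropyLoss (it normalizes internally)
// Inference: apply Softmax only when calibrated probabilities are needed
const probs = softmax.forward(logits);       // shape [32, 10], each row sums to 1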
Key properties:
- Probabilistic output: Output sums to 1, all values in (0, 1), interpretable as probabilities
- Differentiable: Smooth everywhere, enables gradient-based learning
- Relative magnitudes matter: Softmax is shift-invariant, so only relative differences between logits matter, not absolute values (adding the same constant to every logit leaves the output unchanged)
- Temperature scaling: Sharpness can be controlled by dividing the logits by a temperature T before applying softmax ("heating" with T > 1 softens the distribution, "cooling" with T < 1 sharpens it)
- Dimension-specific: Applied along the class dimension (usually dim=-1), preserving batch structure (see the sketch after this list)
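A quick check of shift invariance and the dim behavior, assuming a Tensor.add(scalar) method analogous to the Tensor.div used in the examples below:
const softmax = new torch.nn.Softmax(-1);
const logits = torch.randn([4, 3]);            // batch of 4, 3 classes
const p1 = softmax.forward(logits);
const p2 = softmax.forward(logits.add(100.0)); // Tensor.add(scalar) is assumed here, not confirmed by this doc
// Softmax ran along dim=-1 (the class axis): each of the 4 rows of p1 sums to 1,
// and p2 equals p1 because the constant shift cancels in the normalization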
Algorithm:
Forward: σ(x)_i = exp(x_i - max(x)) / Σ_j exp(x_j - max(x)). Subtracting max(x) before exponentiating (the log-sum-exp trick) prevents overflow in exp() and adds numerical stability; it is mathematically a no-op because softmax is shift-invariant.
Backward: the Jacobian is J_ij = σ(x)_i * (δ_ij - σ(x)_j), so each output's gradient depends on all outputs along the softmax dimension (more complex than a simple element-wise op).
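A reference sketch of both passes in plain TypeScript, illustrative only and not the library's implementation:
function stableSoftmaxForward(x: number[]): number[] {
  const m = Math.max(...x);                    // max subtraction for numerical stability
  const exps = x.map((v) => Math.exp(v - m));  // arguments to exp are now <= 0, no overflow
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}
// Backward: for y = softmax(x) and upstream gradient dy,
// dx_i = y_i * (dy_i - Σ_j dy_j * y_j), i.e. the Jacobian J_ij = y_i * (δ_ij - y_j) applied to dy
function stableSoftmaxBackward(y: number[], dy: number[]): number[] {
  const dot = y.reduce((acc, yi, i) => acc + yi * dy[i], 0);
  return y.map((yi, i) => yi * (dy[i] - dot));
}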
Definition
export interface SoftmaxOptions {
  /** Dimension along which softmax is computed (default: -1) */
  dim?: number;
}

dim (number, optional) – Dimension along which softmax is computed (default: -1)
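This document does not show how the options object is passed to the module; the examples below supply the dim as a bare number. A hedged sketch:
const softmax = new torch.nn.Softmax(-1);  // dim passed as a number, as in the examples below
// new torch.nn.Softmax({ dim: -1 })       // options-object form: an assumption, not confirmed here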
Examples
// Multi-class classification: ResNet + Softmax
class ResNetClassifier extends torch.nn.Module {
  private backbone: torch.nn.Module; // Pretrained ResNet
  private fc: torch.nn.Linear;
  private softmax: torch.nn.Softmax;

  constructor() {
    super();
    // ... initialize backbone ...
    this.fc = new torch.nn.Linear(2048, 1000); // 1000 classes (ImageNet)
    this.softmax = new torch.nn.Softmax(-1); // Apply along class dimension
  }

  forward(x: torch.Tensor): torch.Tensor {
    x = this.backbone.forward(x);
    x = x.flatten(1);
    x = this.fc.forward(x);
    return this.softmax.forward(x); // Probabilities: shape [batch, 1000]
  }
}
// Note: when training with CrossEntropyLoss, pass the raw fc logits to the loss
// (it applies log-softmax internally); use this softmax output for inference-time probabilities

// NLP: Language model or text classifier
const batch_size = 32, seq_len = 128, vocab_size = 50000;
const logits = torch.randn([batch_size, seq_len, vocab_size]);
const softmax = new torch.nn.Softmax(-1); // Softmax over vocabulary
const probs = softmax.forward(logits); // Shape: [32, 128, 50000]
// Each position has a probability distribution over the vocabulary:
// probs[i, j, :].sum() == 1 for all i, j

// Temperature scaling for calibration/confidence control
const logits = torch.randn([10, 5]);
const softmax = new torch.nn.Softmax(-1);
const temperature = 2.0; // Higher = softer (lower confidence), lower = sharper (higher confidence)
const probs_sharp = softmax.forward(logits.div(0.5)); // T = 0.5: sharp, high-confidence probabilities
const probs_soft = softmax.forward(logits.div(temperature)); // T = 2.0: soft, low-confidence probabilities
// Lower temperature → max prob closer to 1, others closer to 0 (high confidence)
// Higher temperature → more uniform distribution (low confidence)