torch.nn.Softmax

class Softmax extends Module

new Softmax(options?: SoftmaxOptions)

readonly dim (number)

Softmax activation function.

Softmax is the standard output activation for multi-class classification. It converts raw logits (unconstrained outputs) into a probability distribution over K classes, where probabilities sum to 1. Each output is in (0, 1), interpretable as the predicted probability of that class. Softmax with cross-entropy loss is the de-facto standard for classification tasks in deep learning.

Core idea: Softmax(x_i) = exp(x_i) / Σ_j exp(x_j). The exponential amplifies relative differences between logits, making the largest logit have the highest probability. The division by the sum ensures outputs are normalized probabilities.
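As a concrete illustration of the formula above, the following is a minimal, library-independent sketch that applies it to a plain array of logits. The helper name `softmaxNaive` is hypothetical and not part of the torch API, and the sketch deliberately omits the max subtraction used for numerical stability.

```typescript
// Hypothetical helper (not part of the library): a direct translation of
// Softmax(x_i) = exp(x_i) / Σ_j exp(x_j) for a 1-D array of logits.
function softmaxNaive(logits: number[]): number[] {
  const exps = logits.map((v) => Math.exp(v));
  const total = exps.reduce((acc, v) => acc + v, 0);
  return exps.map((v) => v / total);
}

const exampleProbs = softmaxNaive([1, 2, 3]);
console.log(exampleProbs.map((p) => p.toFixed(4)).join(", "));
// prints "0.0900, 0.2447, 0.6652"
```

The largest logit receives the largest probability, and the three values sum to 1.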

When to use Softmax:

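Multi-class classification: Output layer for single-label classification tasks (image, NLP, etc.)
Standard practice: Pair softmax with cross-entropy loss for classification; when the loss applies softmax internally (as PyTorch's CrossEntropyLoss does), pass it raw logits instead of softmax outputs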
Not for hidden layers: Use ReLU/GELU for hidden layers, softmax only for output
Probability outputs: When you need normalized probabilities (not just class scores)
Categorical predictions: Any task with discrete, mutually-exclusive classes

Key properties:

Probabilistic output: Output sums to 1, all values in (0, 1), interpretable as probabilities
Differentiable: Smooth everywhere, enables gradient-based learning
Relative magnitudes matter: Only relative differences between logits matter, not absolute values
Temperature scaling: Softmax can be "heated" or "cooled" by dividing the logits by a temperature T before the exponential (T > 1 gives a softer distribution, T < 1 a sharper one)
Dimension-specific: Applied along the class dimension (usually dim=-1), preserving batch structure
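Two of the properties above, shift invariance and per-dimension application, can be checked with a small sketch on plain arrays. The helpers `softmax1d` and `softmaxLastDim` are hypothetical names introduced only for illustration; they are not the library's implementation.

```typescript
// Hypothetical helpers for illustration; not the library's implementation.
function softmax1d(logits: number[]): number[] {
  const maxLogit = Math.max(...logits);
  const exps = logits.map((v) => Math.exp(v - maxLogit));
  const total = exps.reduce((acc, v) => acc + v, 0);
  return exps.map((v) => v / total);
}

// Softmax along the last dimension of a [batch, classes] array:
// each row is normalized independently, so the batch structure is preserved.
function softmaxLastDim(batch: number[][]): number[][] {
  return batch.map((row) => softmax1d(row));
}

// Only relative differences matter: adding a constant to every logit
// leaves the output unchanged.
const base = softmax1d([2, 1, 0]);
const shifted = softmax1d([102, 101, 100]);

// Each row sums to 1; a row of equal logits gives the uniform distribution.
const rows = softmaxLastDim([[2, 1, 0], [0, 0, 0]]);
```

Here `base` and `shifted` agree, and `rows[1]` is `[1/3, 1/3, 1/3]`.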

Algorithm:

Forward: σ(x)_i = exp(x_i - max(x)) / Σ_j exp(x_j - max(x)). Subtracting max(x) (the log-sum-exp trick) prevents overflow in exp() and leaves the result unchanged.
Backward: The Jacobian is J_ij = σ(x)_i * (δ_ij - σ(x)_j). Every output depends on every input, so the backward pass is more involved than for element-wise activations.
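To see why the max subtraction matters, the sketch below compares an unshifted softmax with a shifted one on large logits. Both functions are hypothetical illustrations written on plain arrays, not the library's actual code.

```typescript
// Hypothetical illustration on plain arrays; not the library's code.
function softmaxUnshifted(logits: number[]): number[] {
  const exps = logits.map((v) => Math.exp(v));
  const total = exps.reduce((acc, v) => acc + v, 0);
  return exps.map((v) => v / total);
}

function softmaxStable(logits: number[]): number[] {
  const maxLogit = Math.max(...logits);
  // After the shift the largest exponent is 0, so exp() never exceeds 1.
  const exps = logits.map((v) => Math.exp(v - maxLogit));
  const total = exps.reduce((acc, v) => acc + v, 0);
  return exps.map((v) => v / total);
}

const large = [1000, 1001, 1002];
// Math.exp(1000) overflows to Infinity, and Infinity / Infinity is NaN.
const unstable = softmaxUnshifted(large);
// The shifted version yields the same probabilities as for the logits [0, 1, 2].
const stable = softmaxStable(large);
```

Every entry of `unstable` is NaN, while `stable` is a valid probability distribution.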

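\begin{aligned} \text{Softmax}(x)_i &= \frac{\exp(x_i)}{\sum_j \exp(x_j)}, \quad \sum_i \text{Softmax}(x)_i = 1 \\ \text{Softmax}(x)_i &= \frac{\exp(x_i - \max(x))}{\sum_j \exp(x_j - \max(x))} \quad \text{(log-sum-exp trick, numerically stable)} \\ \frac{\partial\, \text{Softmax}(x)_i}{\partial x_j} &= \text{Softmax}(x)_i \left(\delta_{ij} - \text{Softmax}(x)_j\right) \end{aligned}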
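The gradient formula can be sanity-checked numerically. The following sketch builds the Jacobian from the closed form and compares one entry against a central finite difference; `softmaxVec`, `softmaxJacobian`, and `numericalEntry` are hypothetical helper names used only for this check.

```typescript
// Hypothetical helpers for checking the gradient formula; not library code.
function softmaxVec(logits: number[]): number[] {
  const maxLogit = Math.max(...logits);
  const exps = logits.map((v) => Math.exp(v - maxLogit));
  const total = exps.reduce((acc, v) => acc + v, 0);
  return exps.map((v) => v / total);
}

// Closed form: J[i][j] = s_i * (δ_ij - s_j), where s = softmax(x).
function softmaxJacobian(logits: number[]): number[][] {
  const s = softmaxVec(logits);
  return s.map((si, i) => s.map((sj, j) => si * ((i === j ? 1 : 0) - sj)));
}

// Central finite difference for ∂softmax_i/∂x_j, used only as a reference.
function numericalEntry(logits: number[], i: number, j: number, eps = 1e-5): number {
  const plus = logits.slice();
  const minus = logits.slice();
  plus[j] += eps;
  minus[j] -= eps;
  return (softmaxVec(plus)[i] - softmaxVec(minus)[i]) / (2 * eps);
}

const jacLogits = [0.5, -1.0, 2.0];
const jac = softmaxJacobian(jacLogits);
// Every row and column of jac sums to ~0: shifting all logits leaves the
// output unchanged, and the outputs always sum to 1.
const analytic = jac[0][2];
const numeric = numericalEntry(jacLogits, 0, 2);
```

The values `analytic` and `numeric` agree up to finite-difference error.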

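Pairing with CrossEntropyLoss: PyTorch's CrossEntropyLoss expects raw logits and applies log-softmax internally, so do not apply Softmax before it. Apply Softmax explicitly only when you need the probabilities themselves, for example at inference time.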
Output layer only: Never use softmax for hidden layer activations (defeats feature learning).
Numerical stability: Implementation uses log-sum-exp trick to avoid overflow in exp().
Dimension parameter: Usually dim=-1 (class dimension), but can be any axis depending on tensor shape.
Mutually exclusive: Softmax assumes classes are mutually exclusive. For multi-label use Sigmoid instead.
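The note on raw logits can be made concrete with a sketch of cross-entropy computed directly from logits via log-softmax, followed by the common mistake of applying softmax first. The helpers `logSoftmax` and `crossEntropyFromLogits` are hypothetical and operate on plain arrays; they only mirror the idea and are not the library's loss implementation.

```typescript
// Hypothetical helpers on plain arrays; not the library's loss implementation.
// log Softmax(x)_i = x_i - max(x) - log Σ_j exp(x_j - max(x))
function logSoftmax(logits: number[]): number[] {
  const maxLogit = Math.max(...logits);
  const sumExp = logits.reduce((acc, v) => acc + Math.exp(v - maxLogit), 0);
  const logSumExp = Math.log(sumExp);
  return logits.map((v) => v - maxLogit - logSumExp);
}

// Cross-entropy for one example, given raw logits and the target class index.
function crossEntropyFromLogits(logits: number[], target: number): number {
  return -logSoftmax(logits)[target];
}

const rawLogits = [2.0, 0.5, -1.0];
const loss = crossEntropyFromLogits(rawLogits, 0);

// Mistake: converting to probabilities first, then treating them as logits.
// The second softmax flattens the distribution, which inflates the loss here.
const probsAsLogits = logSoftmax(rawLogits).map((v) => Math.exp(v));
const doubleSoftmaxLoss = crossEntropyFromLogits(probsAsLogits, 0);
```

In this example `doubleSoftmaxLoss` is larger than `loss` even though the prediction is the same, which is why a loss that applies softmax internally should receive logits.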

Examples

// Multi-class classification: ResNet + Softmax
class ResNetClassifier extends torch.nn.Module {
  private backbone: torch.nn.Module;  // Pretrained ResNet
  private fc: torch.nn.Linear;
  private softmax: torch.nn.Softmax;

  constructor() {
    super();
    // ... initialize backbone ...
    this.fc = new torch.nn.Linear(2048, 1000);  // 1000 classes (ImageNet)
    this.softmax = new torch.nn.Softmax(-1);     // Apply along class dimension
  }

  forward(x: torch.Tensor): torch.Tensor {
    x = this.backbone.forward(x);
    x = x.flatten(1);
    x = this.fc.forward(x);
    return this.softmax.forward(x);  // Probabilities: shape [batch, 1000]
  }
}
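// The probabilities returned here are for inference.
// For training, pass the raw logits (the output of fc) to CrossEntropyLoss,
// which applies softmax internally.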

// NLP: Language model or text classifier
const batch_size = 32, seq_len = 128, vocab_size = 50000;
const logits = torch.randn([batch_size, seq_len, vocab_size]);

const softmax = new torch.nn.Softmax(-1);  // Softmax over vocabulary
const probs = softmax.forward(logits);  // Shape: [32, 128, 50000]

// Each position has probability distribution over vocabulary
// probs[i, j, :].sum() ≈ 1 for all i, j (up to floating-point rounding)

// Temperature scaling for calibration/confidence control
const logits = torch.randn([10, 5]);
const temperature = 2.0;  // Higher = softer (lower confidence), lower = sharper (higher confidence)

const softmax = new torch.nn.Softmax(-1);
const probs_cold = softmax.forward(logits.div(0.5));        // T = 0.5: sharp probabilities
const probs_hot = softmax.forward(logits.div(temperature)); // T = 2.0: soft probabilities
// Lower temperature → max prob closer to 1, others closer to 0 (high confidence)
// Higher temperature → more uniform distribution (low confidence)
