torch.nn.Sigmoid
Sigmoid activation function.
Sigmoid (also called the logistic function) is a classic activation that squashes inputs into the range (0, 1). It was the standard activation in neural networks for decades but is now less common in hidden layers due to the vanishing gradient problem. Sigmoid is still widely used for:
- Binary classification output layers (e.g., binary cross-entropy loss)
- Probability gates in mechanisms like LSTMs and GRUs (element-wise multiplication)
- Attention mechanisms where (0, 1) outputs are needed
- Normalizing activations to bounded ranges when needed
When to use Sigmoid:
- Output layer for binary classification (with BCELoss)
- Gate mechanisms in RNNs (LSTM, GRU) where you need multiplicative masking
- Rarely for hidden layers (ReLU is better) unless explicitly modeling bounded outputs
- When you need normalized probabilities in range (0, 1)
Trade-offs vs ReLU:
- Vanishing gradient problem: Sigmoid's derivative σ(x)(1 - σ(x)) peaks at 0.25 and approaches zero for large |x|, so gradients shrink multiplicatively as they backpropagate through many layers, and deep networks learn slowly. ReLU (gradient exactly 1 for positive inputs) doesn't have this issue.
- Non-zero gradients everywhere: Unlike ReLU (zero gradient for x < 0, which can cause "dead" units), sigmoid has a non-zero gradient at every point (good for some architectures)
- Probability semantics: Output (0, 1) is naturally interpreted as probability (good for classification)
- Smooth and differentiable: Unlike ReLU's hard threshold, sigmoid is infinitely smooth
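To make the vanishing-gradient trade-off concrete, here is a small plain-TypeScript sketch (standalone helper functions, not part of the torch API) that evaluates the sigmoid derivative σ(x)(1 - σ(x)) at a few inputs:

```typescript
// Plain sigmoid and its derivative (no torch dependency).
function sigmoid(x: number): number {
  return 1 / (1 + Math.exp(-x));
}

function sigmoidGrad(x: number): number {
  const s = sigmoid(x);
  return s * (1 - s); // peaks at 0.25 when x = 0
}

console.log(sigmoidGrad(0));  // 0.25 — the maximum possible gradient
console.log(sigmoidGrad(5));  // ~0.0066 — already tiny
console.log(sigmoidGrad(10)); // ~0.000045 — effectively vanished
```

Since backpropagation multiplies these per-layer factors, a stack of n sigmoid layers scales gradients by at most 0.25^n, which is why ReLU or GELU is preferred in deep hidden layers.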
Algorithm: Forward: σ(x) = 1 / (1 + exp(-x)). For numerical stability, the implementation evaluates 1 / (1 + exp(-x)) when x ≥ 0 and the equivalent identity σ(x) = exp(x) / (1 + exp(x)) when x < 0, so the argument to exp is never positive and exp cannot overflow. Backward: ∂σ/∂x = σ(x) * (1 - σ(x))
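The stability trick can be sketched in plain TypeScript (a minimal standalone version, not the library's internal implementation):

```typescript
// Numerically stable sigmoid: the argument to Math.exp is never positive,
// so no intermediate value can overflow to Infinity (and Inf/Inf => NaN).
function stableSigmoid(x: number): number {
  if (x >= 0) {
    return 1 / (1 + Math.exp(-x)); // exp(-x) <= 1 here
  }
  const e = Math.exp(x); // x < 0, so exp(x) <= 1
  return e / (1 + e);
}

// Backward pass uses the forward result: dσ/dx = σ(x) * (1 - σ(x))
function stableSigmoidGrad(x: number): number {
  const s = stableSigmoid(x);
  return s * (1 - s);
}

console.log(stableSigmoid(1000));  // 1 (saturates cleanly, no overflow)
console.log(stableSigmoid(-1000)); // 0 (underflows cleanly, no NaN)
```

Note that the naive single-branch formula exp(x) / (1 + exp(x)) would produce Infinity / Infinity = NaN for large positive x, which is exactly what the branch avoids.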
- For classification: Use Sigmoid output layer with BCELoss for binary classification.
- Vanishing gradients: Not recommended for hidden layers in deep networks (ReLU is better).
- Gate operations: Essential for multiplicative masking in RNNs, Attention, and gating mechanisms.
- Modern use: Rarely used for hidden layer activations (ReLU/GELU preferred), but critical for gates.
- Numerical stability: Implementation handles overflow/underflow for extreme x values.
Examples
// Binary classification output layer
class BinaryClassifier extends torch.nn.Module {
  private linear1: torch.nn.Linear;
  private linear2: torch.nn.Linear;
  private sigmoid: torch.nn.Sigmoid;

  constructor() {
    super();
    this.linear1 = new torch.nn.Linear(10, 64);
    this.linear2 = new torch.nn.Linear(64, 1); // Single output
    this.sigmoid = new torch.nn.Sigmoid();
  }

  forward(x: torch.Tensor): torch.Tensor {
    x = this.linear1.forward(x);
    x = torch.nn.functional.relu(x); // ReLU in hidden layers
    x = this.linear2.forward(x);
    return this.sigmoid.forward(x); // Sigmoid for probability
  }
}
// Then use with BCELoss.

// LSTM gate mechanism (internal use, not directly visible):
// LSTMs use sigmoid internally for the input/forget/output gates:
//   f_t = sigmoid(W_f @ [h_{t-1}, x_t] + b_f)
// This controls information flow (0 = no flow, 1 = full flow).

// Attention mechanism with sigmoid gating
const query = torch.randn([32, 64]);
const key = torch.randn([32, 64]);
const attention_scores = torch.matmul(query, key.T());
const gate = new torch.nn.Sigmoid();
const masked = torch.mul(attention_scores, gate.forward(attention_scores));
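The forget-gate formula mentioned in the LSTM comments can be illustrated with plain arrays (a hypothetical standalone helper, not the library's LSTM internals): gate values near 0 block information, values near 1 pass it through unchanged.

```typescript
// Elementwise sigmoid gating, as used by LSTM forget gates:
// gate ~ 0 erases a cell entry, gate ~ 1 preserves it.
function sigmoid(x: number): number {
  return 1 / (1 + Math.exp(-x));
}

// preActivation plays the role of W_f @ [h_{t-1}, x_t] + b_f.
function applyForgetGate(preActivation: number[], cellState: number[]): number[] {
  return cellState.map((c, i) => sigmoid(preActivation[i]) * c);
}

const cell = [1.0, 1.0, 1.0];
const z = [-10, 0, 10]; // strongly closed, half-open, strongly open
console.log(applyForgetGate(z, cell)); // ≈ [0.0000454, 0.5, 0.99995]
```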