torch.nn.Softsign
Softsign activation function.
Softsign is a smooth activation function that maps inputs to (-1, 1) via the formula Softsign(x) = x / (1 + |x|). It's similar to Tanh in output range and smoothness, but with a simpler formula (no exponentials needed). Softsign is rarely used in modern deep learning (Tanh, ReLU, and smooth activations like GELU/SiLU are standard), but appears occasionally in specialized contexts.
Core idea: Softsign(x) = x / (1 + |x|) provides a smooth, bounded output in (-1, 1). Unlike Tanh, which requires exponential computation, Softsign needs only an absolute value and a division. It approaches its asymptotes polynomially (the gap to ±1 shrinks like 1/|x|), so it saturates more slowly than Tanh, whose gap to ±1 shrinks exponentially.
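The formula is simple enough to sketch in a few lines. The following is a minimal standalone illustration in plain TypeScript (no torch dependency; the `softsign` helper is defined here for demonstration, not part of any library):

```typescript
// Standalone sketch of the Softsign formula: x / (1 + |x|).
function softsign(x: number): number {
  return x / (1 + Math.abs(x)); // bounded in (-1, 1)
}

console.log(softsign(0));   // 0
console.log(softsign(1));   // 0.5
console.log(softsign(-1));  // -0.5
console.log(softsign(100)); // ≈ 0.9901 — approaches 1 only asymptotically
```

Note that even at x = 100 the output is still about 0.01 away from 1, reflecting the slow polynomial saturation.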
When to use Softsign:
- Alternative to Tanh when you want simpler computation (no exponentials)
- Output range (-1, 1) desired but don't need Tanh specifically
- Experimental/research comparing different smooth activations
- Rarely the best choice: Tanh or modern smooth activations (GELU, SiLU) usually work better
Trade-offs vs Tanh:
- Computation: No exponentials (just division) vs Tanh's exp-based computation
- Gradient: Softsign: ∂/∂x = 1/(1+|x|)² vs Tanh: ∂/∂x = 1 - tanh²(x)
- Saturation: Both saturate for large |x|, but Softsign saturates polynomially while Tanh saturates exponentially
- Output shape: Different curve but same range (-1, 1)
- Empirical quality: Very similar; minor differences in practice
- Popularity: Tanh much more common (more established)
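The saturation-rate difference above is easy to see numerically. A small sketch in plain TypeScript (the `softsign` helper is defined locally, not a library call):

```typescript
// Compare how fast Softsign and Tanh approach their asymptote at 1.
// Softsign's gap to 1 is exactly 1/(1 + x) for x > 0 (polynomial decay);
// Tanh's gap behaves like 2·e^(-2x) (exponential decay).
function softsign(x: number): number {
  return x / (1 + Math.abs(x));
}

for (const x of [2, 5, 10]) {
  const gapSoftsign = 1 - softsign(x);  // = 1 / (1 + x)
  const gapTanh = 1 - Math.tanh(x);
  console.log(x, gapSoftsign, gapTanh);
}
// At x = 10, Softsign is still ~0.09 away from 1,
// while Tanh is within ~4e-9 of 1.
```

This is why Softsign's gradients, while still decaying, stay nonzero over a wider input range than Tanh's.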
Trade-offs vs ReLU:
- Smoothness: Softsign smooth everywhere vs ReLU's kink at x=0
- Boundedness: Softsign outputs in (-1, 1) vs ReLU's [0, ∞)
- Gradient decay: Softsign gradients decay for large |x| (like Sigmoid saturation) vs ReLU's constant gradient of 1 for x > 0
- Empirical: ReLU usually better in deep networks (no saturation)
- Use case: ReLU for standard networks; Softsign rarely needed
Algorithm:
- Forward: Softsign(x) = x / (1 + |x|)
- Backward: ∂Softsign/∂x = 1 / (1 + |x|)² (always positive, symmetric around x = 0)
The gradient is always in (0, 1], decreasing as |x| increases (saturation effect).
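The backward formula can be sanity-checked against a finite-difference approximation. A minimal sketch in plain TypeScript (the `softsign` and `softsignGrad` helpers are defined here for illustration, not library functions):

```typescript
// Softsign forward and its analytic gradient, checked numerically.
function softsign(x: number): number {
  return x / (1 + Math.abs(x));
}

function softsignGrad(x: number): number {
  const d = 1 + Math.abs(x);
  return 1 / (d * d); // always in (0, 1], peaks at x = 0
}

// Central-difference check: (f(x+eps) - f(x-eps)) / (2*eps) ≈ f'(x)
const eps = 1e-6;
for (const x of [-3, -0.5, 0, 0.5, 3]) {
  const numeric = (softsign(x + eps) - softsign(x - eps)) / (2 * eps);
  console.log(x, softsignGrad(x), numeric); // analytic ≈ numeric
}
```

The gradient is largest (exactly 1) at x = 0 and decays toward 0 as |x| grows, which is the saturation effect described above.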
- Smooth everywhere: Continuously differentiable with smooth gradient.
- Bounded (-1, 1): Like Tanh; simpler to compute (division vs exponentials).
- Rarely used: Tanh is more established for (-1, 1) bounded activation.
- Zero-centered: Output zero-mean like Tanh; better than Sigmoid for training.
- Gradient decay: Gradients decay for large |x| (saturation effect like Sigmoid/Tanh).
- Legacy activation: Sometimes seen in older code; modern activations preferred.
Examples
// Network using Softsign (rare, mostly for comparison)
class MLPWithSoftsign extends torch.nn.Module {
  private fc1: torch.nn.Linear;
  private softsign: torch.nn.Softsign;
  private fc2: torch.nn.Linear;

  constructor() {
    super();
    this.fc1 = new torch.nn.Linear(10, 64);
    this.softsign = new torch.nn.Softsign(); // Smooth, bounded activation
    this.fc2 = new torch.nn.Linear(64, 1);
  }

  forward(x: torch.Tensor): torch.Tensor {
    x = this.fc1.forward(x);
    x = this.softsign.forward(x); // Smooth output in (-1, 1)
    return this.fc2.forward(x);
  }
}
// In practice, ReLU or Tanh would be more common choices

// Comparing smooth activations
const x = torch.linspace(-5, 5, [1000]);
const softsign = new torch.nn.Softsign();
const tanh = new torch.nn.Tanh();
const sigmoid = new torch.nn.Sigmoid();
const y_softsign = softsign.forward(x); // x / (1 + |x|), range (-1, 1)
const y_tanh = tanh.forward(x); // exp-based, range (-1, 1)
const y_sigmoid = sigmoid.forward(x); // exp-based, range (0, 1)
// All smooth but different curves; Softsign is mathematically simpler than Tanh