torch.nn.Tanhshrink
class Tanhshrink extends Module
Tanhshrink activation function.
Tanhshrink applies a shrinkage function that combines the identity with tanh: Tanhshrink(x) = x - tanh(x). This creates a "zero attractor" that shrinks small values towards zero while preserving larger values. Tanhshrink is rarely used in modern deep learning but appears in some sparse coding and noise reduction applications. Unlike hard shrinkage (Hardshrink) which zeroes out small values, Tanhshrink uses a smooth, continuous shrinkage function.
Core idea: Tanhshrink(x) = x - tanh(x). When x is near zero, tanh(x) ≈ x, so the output is close to zero. When |x| is large, tanh(x) ≈ ±1, so the output approaches x ∓ 1. This creates a smooth attractor at zero without the sharp discontinuity of hard thresholding.
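The two limiting regimes can be sketched in plain TypeScript using `Math.tanh` directly (a standalone function for illustration, independent of the torch.nn class):

```typescript
// Tanhshrink(x) = x - tanh(x)
function tanhshrink(x: number): number {
  return x - Math.tanh(x);
}

// Near zero: tanh(x) ≈ x, so the output is almost 0
console.log(tanhshrink(0.01)); // ≈ 3.3e-7

// Large |x|: tanh(x) ≈ ±1, so the output approaches x ∓ 1
console.log(tanhshrink(5));  // ≈ 4.0
console.log(tanhshrink(-5)); // ≈ -4.0
```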
When to use Tanhshrink:
- Sparse coding: Smooth shrinkage in sparse representation learning
- Denoising: Soft noise reduction that preserves signal structure
- Regularization: Acts as a form of implicit regularization towards sparsity
- Research: Appears in theoretical sparse coding literature
- NOT for typical deep learning: Use ReLU/GELU instead for standard networks
Trade-offs vs alternatives:
- vs Hardshrink: Tanhshrink is smooth and differentiable everywhere; Hardshrink is exactly zero inside [-λ, λ] and discontinuous at the threshold ±λ
- vs Softshrink: Tanhshrink shrinks by tanh(x), which saturates exponentially; Softshrink subtracts a fixed λ from the magnitude (linear shrinkage)
- vs ReLU: ReLU is much simpler and more common; Tanhshrink is for specialized sparse coding
Algorithm: Forward: Tanhshrink(x) = x - tanh(x)
- For x near 0: Tanhshrink(x) ≈ x - x = 0 (shrinks towards zero)
- For x = 1: Tanhshrink(1) ≈ 1 - 0.76 = 0.24
- For x = -1: Tanhshrink(-1) ≈ -1 - (-0.76) = -0.24
- For x → ±∞: Tanhshrink(x) → x ∓ 1 (approaches identity with offset)
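The tabulated forward values can be checked numerically in plain TypeScript (standalone function for illustration):

```typescript
function tanhshrink(x: number): number {
  return x - Math.tanh(x);
}

console.log(tanhshrink(0));  // 0 (exact)
console.log(tanhshrink(1));  // ≈ 0.2384 (tanh(1) ≈ 0.7616)
console.log(tanhshrink(-1)); // ≈ -0.2384 (odd function)
console.log(tanhshrink(10)); // ≈ 9.0 (approaches x - 1)
```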
Backward: ∂Tanhshrink(x)/∂x = 1 - (1 - tanh²(x)) = tanh²(x)
- Gradient is zero at x = 0 (flat spot, no learning from zero values)
- Gradient approaches 1 for large |x| (passes gradients through for large values)
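The analytic gradient tanh²(x) can be verified against a central finite difference (standalone sketch, not part of the torch.nn API):

```typescript
// d/dx [x - tanh(x)] = 1 - sech²(x) = tanh²(x)
function tanhshrinkGrad(x: number): number {
  const t = Math.tanh(x);
  return t * t;
}

// Central finite-difference approximation of the same derivative
const eps = 1e-6;
function numericalGrad(x: number): number {
  const f = (v: number) => v - Math.tanh(v);
  return (f(x + eps) - f(x - eps)) / (2 * eps);
}

console.log(tanhshrinkGrad(0)); // 0: flat spot at the origin
console.log(tanhshrinkGrad(3)); // ≈ 0.99: near-identity for large |x|
console.log(Math.abs(tanhshrinkGrad(0.5) - numericalGrad(0.5))); // tiny
```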
- Rarely used: Modern deep learning rarely uses Tanhshrink (ReLU family dominates).
- Smooth shrinkage: Continuous and differentiable everywhere (unlike Hardshrink).
- Fixed shrinkage: No learnable threshold parameter (unlike PReLU or Hardshrink).
- Zero attractor: Pulls small values towards zero while preserving large values.
- Sparse coding: Appears in theoretical sparse representation learning.
Examples
// Sparse coding with Tanhshrink shrinkage
const tanhshrink = new torch.nn.Tanhshrink();
// Learned sparse codes (before shrinkage)
const codes = torch.randn([batch_size, code_dim]);
// Apply smooth shrinkage to encourage sparsity
const sparse_codes = tanhshrink.forward(codes); // Most small values shrink towards zero
// Reconstruct from sparse codes (decoder would process sparse_codes)

// Comparison: Shrinkage functions on same input
const x = torch.linspace(-3, 3, 100);
const hardshrink = new torch.nn.Hardshrink(1.0); // Zero for |x| <= 1.0
const softshrink = new torch.nn.Softshrink(1.0); // Linear shrinkage
const tanhshrink = new torch.nn.Tanhshrink(); // Smooth tanh-based shrinkage
const hard = hardshrink.forward(x); // Sharp: values with |x| <= 1 become exactly 0, rest pass through
const soft = softshrink.forward(x); // Linear: values with |x| > 1 shifted towards zero by 1.0, rest 0
const tanh = tanhshrink.forward(x); // Smooth: shrinkage tapers off with |x|, no threshold parameter
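For reference, the three shrinkage rules compared above can be written as plain standalone TypeScript functions (hypothetical helpers, not the torch.nn classes):

```typescript
const lambda = 1.0; // threshold for the parameterized shrinkers

// Hardshrink: zero inside [-λ, λ], identity outside (discontinuous at ±λ)
const hardshrink = (x: number) => Math.abs(x) > lambda ? x : 0;

// Softshrink: subtract λ from the magnitude, continuous with kinks at ±λ
const softshrink = (x: number) =>
  x > lambda ? x - lambda : x < -lambda ? x + lambda : 0;

// Tanhshrink: smooth shrinkage, no threshold parameter
const tanhshrink = (x: number) => x - Math.tanh(x);

for (const x of [-2, -0.5, 0, 0.5, 2]) {
  console.log(x, hardshrink(x), softshrink(x), tanhshrink(x));
}
```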