torch.nn.LeakyReLU
new LeakyReLU(options?: LeakyReLUOptions)
- readonly negative_slope: number
LeakyReLU activation function (ReLU variant).
LeakyReLU is a simple fix for ReLU's "dying ReLU" problem. Instead of outputting zero for negative inputs like ReLU does, LeakyReLU multiplies negative inputs by a small positive slope, producing a small non-zero output. This allows gradients to flow through even for negative activations, preventing neurons from getting permanently stuck in the inactive state. The slope parameter (typically 0.01) controls the magnitude of the "leak".
When to use LeakyReLU:
- Suspect dying ReLU problem (many zero activations, especially without batch normalization)
- Networks without batch normalization (batch norm mitigates dying ReLU, making LeakyReLU less necessary)
- Generative models (GANs) where non-zero gradients help with training stability
- Comparing against ReLU to verify if dead neurons are the issue
- As default when unsure whether batch norm will be used
Trade-offs vs ReLU:
- Solves dying ReLU: Guarantees non-zero gradient for all inputs, no dead neurons
- Slightly higher compute: Additional multiplication for negative path (negligible cost)
- Different semantics: Some negative information flows through (not strictly one-directional like ReLU)
- Empirical quality: Often slightly worse than ReLU for standard CNNs/Vision (batch norm eliminates dying ReLU)
- GANs: Generally better than ReLU in adversarial settings (more stable training)
- Slope sensitive: Negative slope value matters; default 0.01 is usually good
Algorithm:
- Forward: LeakyReLU(x) = x if x > 0, else negative_slope * x
- Backward: ∂L/∂x = ∂L/∂y if x > 0, else negative_slope * ∂L/∂y (always non-zero)
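The forward and backward rules above can be sketched in plain TypeScript on scalar values (illustrative helper functions only, not part of this API; the module applies the same rule elementwise to tensors):

```typescript
// Forward: pass positive inputs through, scale negative inputs by the slope.
function leakyReluForward(x: number, negativeSlope: number = 0.01): number {
  return x > 0 ? x : negativeSlope * x;
}

// Backward: given the upstream gradient dL/dy, the local gradient is
// 1 for positive inputs and negativeSlope for negative inputs.
function leakyReluBackward(x: number, gradOut: number, negativeSlope: number = 0.01): number {
  return x > 0 ? gradOut : negativeSlope * gradOut;
}

console.log(leakyReluForward(2.0));        // 2
console.log(leakyReluForward(-2.0));       // -0.02
console.log(leakyReluBackward(-2.0, 1.0)); // 0.01 — non-zero, unlike ReLU
```

Note that the backward gradient is never zero, which is exactly the property that prevents dead neurons.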
- Default slope: 0.01 is standard (1% of gradient for negative inputs).
- GAN standard: GANs often use 0.2 slope for better training stability.
- No dead neurons: Unlike ReLU, always has non-zero gradient for all inputs.
- Batch norm alternative: If using batch normalization, dying ReLU is rare and ReLU is fine.
- Parameter sensitivity: Slope value matters; too large reduces ReLU benefit, too small defeats purpose.
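To make the "no dead neurons" point concrete, a small plain-TypeScript sketch (illustrative only, not part of this API) counting how many inputs receive exactly zero local gradient under each activation:

```typescript
// Sample inputs spanning negative and positive values.
const inputs = [-3, -1.5, -0.5, 0.5, 2];

// Local gradient of ReLU: 0 for non-positive inputs.
const reluGrad = (x: number): number => (x > 0 ? 1 : 0);
// Local gradient of LeakyReLU: the slope for non-positive inputs.
const leakyGrad = (x: number, slope: number = 0.01): number => (x > 0 ? 1 : slope);

const deadRelu = inputs.filter((x) => reluGrad(x) === 0).length;
const deadLeaky = inputs.filter((x) => leakyGrad(x) === 0).length;
console.log(deadRelu, deadLeaky); // 3 0
```

Every negative input is a zero-gradient point for ReLU but not for LeakyReLU, which is why swapping activations is a quick diagnostic for suspected dead neurons.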
Examples
// Using LeakyReLU to fix dying ReLU in a network without batch norm
class MLPWithLeakyReLU extends torch.nn.Module {
  private fc1: torch.nn.Linear;
  private leaky_relu: torch.nn.LeakyReLU;
  private fc2: torch.nn.Linear;
  private fc3: torch.nn.Linear;

  constructor() {
    super();
    this.fc1 = new torch.nn.Linear(10, 64);
    this.leaky_relu = new torch.nn.LeakyReLU(0.01); // 1% slope
    this.fc2 = new torch.nn.Linear(64, 64);
    this.fc3 = new torch.nn.Linear(64, 1);
  }

  forward(x: torch.Tensor): torch.Tensor {
    x = this.fc1.forward(x);
    x = this.leaky_relu.forward(x); // No batch norm, so LeakyReLU helps
    x = this.fc2.forward(x);
    x = this.leaky_relu.forward(x);
    return this.fc3.forward(x);
  }
}

// GAN discriminator with LeakyReLU (standard practice)
class Discriminator extends torch.nn.Module {
  private conv1: torch.nn.Conv2d;
  private leaky_relu: torch.nn.LeakyReLU;
  private conv2: torch.nn.Conv2d;
  private fc: torch.nn.Linear;

  constructor() {
    super();
    this.conv1 = new torch.nn.Conv2d(3, 32, { kernel_size: 4, stride: 2, padding: 1 });
    this.leaky_relu = new torch.nn.LeakyReLU(0.2); // GANs often use 0.2
    this.conv2 = new torch.nn.Conv2d(32, 64, { kernel_size: 4, stride: 2, padding: 1 });
    this.fc = new torch.nn.Linear(64 * 8 * 8, 1);
  }

  forward(x: torch.Tensor): torch.Tensor {
    x = this.conv1.forward(x);
    x = this.leaky_relu.forward(x);
    x = this.conv2.forward(x);
    x = this.leaky_relu.forward(x);
    x = x.flatten(1);
    return this.fc.forward(x);
  }
}
// LeakyReLU helps discriminator training stability

// Adjusting slope parameter for different behaviors
const x = torch.randn([100, 10]);
const relu = new torch.nn.ReLU();
const leaky_small = new torch.nn.LeakyReLU(0.01); // Standard: 1%
const leaky_large = new torch.nn.LeakyReLU(0.3); // Aggressive: 30%
const prelu = new torch.nn.PReLU(); // Learnable slope
const y1 = relu.forward(x); // Hard thresholding
const y2 = leaky_small.forward(x); // Small leak (safest)
const y3 = leaky_large.forward(x); // Larger leak (more info flow)