torch.nn.Dropout
new Dropout(options?: DropoutOptions)
Properties:
- p (number, readonly): probability of an element being zeroed
- inplace (boolean, readonly): if true, the operation is performed in place
Dropout: randomly zeros individual elements during training.
A powerful regularization technique that prevents co-adaptation by randomly zeroing elements with probability p during training, then scaling remaining values by 1/(1-p) to maintain expected magnitude. At test time, applies identity transformation. Essential for:
- Reducing overfitting in neural networks
- Preventing co-adaptation of neurons
- Model ensembling effect (multiple thinned networks)
- Improving generalization to unseen data
- Training deep networks without excessive regularization
Dropout acts as a form of regularization by creating a stochastic ensemble. Each forward pass uses a different random subset of neurons, forcing the network to learn redundant representations. This prevents any single neuron from becoming too specialized and improves robustness.
When to use Dropout:
- Fully connected layers (especially prone to overfitting)
- Recurrent networks (LSTM/GRU hidden states)
- Small datasets with large models
- When model overfits despite L1/L2 regularization
- Any layer that shows signs of co-adaptation
Dropout variants by layer type:
- Dropout: Element-wise (each neuron independently)
- Dropout1d: Channel-wise for 1D sequences (entire channels dropped together)
- Dropout2d: Channel-wise for 2D spatial (entire feature maps dropped together)
- Dropout3d: Channel-wise for 3D spatial (entire 3D feature volumes dropped together)
- AlphaDropout: Self-normalizing dropout for SELU networks
- FeatureAlphaDropout: Channel-wise alpha dropout
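The element-wise vs channel-wise distinction can be sketched with plain arrays (illustrative only; `elementWiseMask` and `channelWiseMask` are not part of the API):

```typescript
// Element-wise (Dropout): each element gets its own independent Bernoulli draw.
function elementWiseMask(rows: number, cols: number, p: number): number[][] {
  return Array.from({ length: rows }, () =>
    Array.from({ length: cols }, () => (Math.random() < p ? 0 : 1))
  );
}

// Channel-wise (Dropout1d): one draw per channel; the whole row is kept or zeroed.
function channelWiseMask(rows: number, cols: number, p: number): number[][] {
  return Array.from({ length: rows }, () => {
    const keep = Math.random() < p ? 0 : 1;
    return Array.from({ length: cols }, () => keep);
  });
}

// In a channel-wise mask every row is all-zeros or all-ones.
const m = channelWiseMask(4, 8, 0.5);
const channelConsistent = m.every(row => row.every(v => v === row[0]));
console.log(channelConsistent); // true
```

Channel-wise variants exist because adjacent elements within a feature map are strongly correlated, so element-wise dropout barely regularizes them.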
Trade-offs:
- vs L1/L2 regularization: Dropout is stochastic; weight decay is deterministic
- vs Batch Normalization: Dropout zeroes values; BatchNorm rescales (complementary)
- Dropout vs Dropout*d: Element-wise vs channel-wise correlation structure
- Computational cost: Adds negligible overhead (one random mask per forward)
- Training vs Inference: Behavior differs significantly (requires .train()/.eval())
- Tuning: p=0.5 is standard; lower values (0.1-0.3) for small networks/datasets
Dropout mechanics: During training with input x and probability p:
- Create random mask M ~ Bernoulli(1-p) (keep elements with probability 1-p)
- Zeroed output: y = M ⊙ x / (1-p) (element-wise multiply, then rescale)
- Expected value: E[y] = E[M ⊙ x / (1-p)] = (1-p) * E[x] / (1-p) = E[x] (unchanged expectation)
During inference (training=False):
- Apply identity: y = x (no dropout applied)
- Network uses full capacity and learned representations
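The mask-and-rescale step and the inference-time identity can be sketched on plain arrays (illustrative only; `dropoutForward` is a hypothetical helper, not the torch API):

```typescript
// Minimal sketch of dropout mechanics on plain number arrays.
function dropoutForward(x: number[], p: number, training: boolean): number[] {
  if (!training || p === 0) return x.slice(); // inference: identity
  // Training: keep each element with probability 1-p, rescale survivors by 1/(1-p).
  return x.map(v => (Math.random() < p ? 0 : v / (1 - p)));
}

// Expectation check: averaging many training passes recovers the input.
const xs = [1, 2, 3, 4];
const n = 20000;
const sums = [0, 0, 0, 0];
for (let i = 0; i < n; i++) {
  dropoutForward(xs, 0.5, true).forEach((v, j) => { sums[j] += v; });
}
const means = sums.map(s => s / n); // each mean is close to the input value
```

The rescaling by 1/(1-p) is what lets inference skip dropout entirely: the training-time expectation already matches the raw input.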
Bernoulli sampling: Each element is dropped with probability p, independently of all other elements.
- Expected value: Output has same expectation as input (rescaling ensures this)
- Training/Inference: MUST call .train() or .eval() to control behavior
- Variance: Dropout increases output variance but maintains expectation
- Element independence: Each element dropped independently (unlike Dropout2d)
- Computational efficiency: ~1% overhead (random mask generation is cheap)
- Gradient flow: Backprop through dropout is straightforward (same mask)
- Inplace safety: inplace=true can break gradient computation; use with care
- No effect during inference: eval() disables dropout completely
- Different masks per forward: Each call gets new random mask
- Tuning is important: Wrong p value reduces model capacity or doesn't regularize
- Can hurt performance: Too much dropout (p > 0.7) can reduce capacity
- Inplace modification: May cause issues in complex computation graphs
Examples
// Basic dropout in fully connected network
const dropout = new torch.nn.Dropout(0.5);
const x = torch.randn([32, 256]); // Batch of 256-dim vectors
// During training
dropout.train();
const train_out = dropout.forward(x); // ~50% elements zeroed
// During inference
dropout.eval();
const test_out = dropout.forward(x); // No dropout, returns x
// Dropout in deep network to prevent overfitting
class SimpleNet extends torch.nn.Module {
fc1: torch.nn.Linear;
dropout1: torch.nn.Dropout;
fc2: torch.nn.Linear;
dropout2: torch.nn.Dropout;
fc3: torch.nn.Linear;
constructor() {
super();
this.fc1 = new torch.nn.Linear(784, 256);
this.dropout1 = new torch.nn.Dropout(0.5);
this.fc2 = new torch.nn.Linear(256, 128);
this.dropout2 = new torch.nn.Dropout(0.5);
this.fc3 = new torch.nn.Linear(128, 10);
}
forward(x: torch.Tensor): torch.Tensor {
x = torch.relu(this.fc1.forward(x));
x = this.dropout1.forward(x); // Prevent co-adaptation in first layer
x = torch.relu(this.fc2.forward(x));
x = this.dropout2.forward(x); // Prevent co-adaptation in second layer
return this.fc3.forward(x);
}
}
const model = new SimpleNet();
// Dropout with lower probability for simpler models
const light_dropout = new torch.nn.Dropout(0.2); // Only 20% dropout
const heavy_dropout = new torch.nn.Dropout(0.7); // Aggressive 70% dropout
// Ensemble interpretation: dropout creates model averaging
const dropout = new torch.nn.Dropout(0.5);
const x = torch.randn([1, 100]);
// Each forward pass with different mask ~ different sub-network
dropout.train();
const out1 = dropout.forward(x); // Different random mask
const out2 = dropout.forward(x); // Different random mask
const out3 = dropout.forward(x); // Different random mask
// Training averages over these different configurations
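The "different mask per pass" behavior above can be sketched without the library (illustrative; `pass` is a hypothetical helper):

```typescript
// Two training-mode passes over the same input draw independent Bernoulli
// masks, so their outputs almost surely differ - each pass is a different
// "thinned" sub-network.
function pass(x: number[], p: number): number[] {
  return x.map(v => (Math.random() < p ? 0 : v / (1 - p)));
}

const input = Array.from({ length: 100 }, (_, i) => i + 1);
const a = pass(input, 0.5);
const b = pass(input, 0.5);
const differ = a.some((v, i) => v !== b[i]); // true with overwhelming probability
```

Over many training steps these random sub-networks are implicitly averaged, which is the model-ensembling effect described above.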