torch.nn.Dropout
new Dropout(options?: DropoutOptions)
Properties:
- p (number, readonly): probability of an element being zeroed
- inplace (boolean, readonly): if true, the operation is performed in place
Dropout: randomly zeros individual elements during training.
A powerful regularization technique that prevents co-adaptation by randomly zeroing elements with probability p during training, then scaling remaining values by 1/(1-p) to maintain expected magnitude. At test time, applies identity transformation. Essential for:
- Reducing overfitting in neural networks
- Preventing co-adaptation of neurons
- Model ensembling effect (multiple thinned networks)
- Improving generalization to unseen data
- Training deep networks without excessive regularization
Dropout acts as a form of regularization by creating a stochastic ensemble. Each forward pass uses a different random subset of neurons, forcing the network to learn redundant representations. This prevents any single neuron from becoming too specialized and improves robustness.
When to use Dropout:
- Fully connected layers (especially prone to overfitting)
- Recurrent networks (LSTM/GRU hidden states)
- Small datasets with large models
- When model overfits despite L1/L2 regularization
- Any layer that shows signs of co-adaptation
Dropout variants by layer type:
- Dropout: Element-wise (each neuron independently)
- Dropout1d: Channel-wise for 1D sequences (entire channels dropped together)
- Dropout2d: Channel-wise for 2D spatial (entire feature maps dropped together)
- Dropout3d: Channel-wise for 3D spatial (entire 3D feature volumes dropped together)
- AlphaDropout: Self-normalizing dropout for SELU networks
- FeatureAlphaDropout: Channel-wise alpha dropout
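The element-wise vs channel-wise distinction can be sketched with plain arrays (illustrative only; `elementWiseMask` and `channelWiseMask` are not part of the API):

```typescript
// Element-wise (Dropout): each element gets its own independent Bernoulli draw.
function elementWiseMask(rows: number, cols: number, p: number): number[][] {
  return Array.from({ length: rows }, () =>
    Array.from({ length: cols }, () => (Math.random() < p ? 0 : 1))
  );
}

// Channel-wise (Dropout1d): one draw per channel; the whole row is kept or zeroed.
function channelWiseMask(rows: number, cols: number, p: number): number[][] {
  return Array.from({ length: rows }, () => {
    const keep = Math.random() < p ? 0 : 1;
    return Array.from({ length: cols }, () => keep);
  });
}

// In a channel-wise mask every row is all-zeros or all-ones.
const m = channelWiseMask(4, 8, 0.5);
const channelConsistent = m.every(row => row.every(v => v === row[0]));
console.log(channelConsistent); // true
```

Channel-wise variants exist because adjacent elements within a feature map are strongly correlated, so element-wise dropout barely regularizes them.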
Trade-offs:
- vs L1/L2 regularization: Dropout is stochastic; weight decay is deterministic
- vs Batch Normalization: Dropout zeroes values; BatchNorm rescales (complementary)
- Dropout vs Dropout*d: Element-wise vs channel-wise correlation structure
- Computational cost: Adds negligible overhead (one random mask per forward)
- Training vs Inference: Behavior differs significantly (requires .train()/.eval())
- Tuning: p=0.5 is standard; lower values (0.1-0.3) for small networks/datasets
Dropout mechanics: During training with input x and probability p:
- Create random mask M ~ Bernoulli(1-p) (keep elements with probability 1-p)
- Zeroed output: y = M ⊙ x / (1-p) (element-wise multiply, then rescale)
- Expected value: E[y] = E[M ⊙ x / (1-p)] = (1-p) * E[x] / (1-p) = E[x] (unchanged expectation)
During inference (training=False):
- Apply identity: y = x (no dropout applied)
- Network uses full capacity and learned representations
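The mask-and-rescale step and the inference-time identity can be sketched on plain arrays (illustrative only; `dropoutForward` is a hypothetical helper, not the torch API):

```typescript
// Minimal sketch of dropout mechanics on plain number arrays.
function dropoutForward(x: number[], p: number, training: boolean): number[] {
  if (!training || p === 0) return x.slice(); // inference: identity
  // Training: keep each element with probability 1-p, rescale survivors by 1/(1-p).
  return x.map(v => (Math.random() < p ? 0 : v / (1 - p)));
}

// Expectation check: averaging many training passes recovers the input.
const xs = [1, 2, 3, 4];
const n = 20000;
const sums = [0, 0, 0, 0];
for (let i = 0; i < n; i++) {
  dropoutForward(xs, 0.5, true).forEach((v, j) => { sums[j] += v; });
}
const means = sums.map(s => s / n); // each mean is close to the input value
```

The rescaling by 1/(1-p) is what lets inference skip dropout entirely: the training-time expectation already matches the raw input.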
Bernoulli sampling: Each element is dropped with probability p, independently of all other elements.
- Expected value: Output has same expectation as input (rescaling ensures this)
- Training/Inference: MUST call .train() or .eval() to control behavior
- Variance: Dropout increases output variance but maintains expectation
- Element independence: Each element dropped independently (unlike Dropout2d)
- Computational efficiency: ~1% overhead (random mask generation is cheap)
- Gradient flow: Backprop through dropout is straightforward (same mask)
- Inplace safety: inplace=true can break gradient computation; use with care
- No effect during inference: eval() disables dropout completely
- Different masks per forward: Each call gets new random mask
- Tuning is important: Wrong p value reduces model capacity or doesn't regularize
- Can hurt performance: Too much dropout (p > 0.7) can reduce capacity
- Inplace modification: May cause issues in complex computation graphs
Examples
// Basic dropout in fully connected network
const dropout = new torch.nn.Dropout(0.5);
const x = torch.randn([32, 256]); // Batch of 256-dim vectors
// During training
dropout.train();
const train_out = dropout.forward(x); // ~50% elements zeroed
// During inference
dropout.eval();
const test_out = dropout.forward(x); // No dropout, returns x
// Dropout in deep network to prevent overfitting
class SimpleNet extends torch.nn.Module {
fc1: torch.nn.Linear;
dropout1: torch.nn.Dropout;
fc2: torch.nn.Linear;
dropout2: torch.nn.Dropout;
fc3: torch.nn.Linear;
constructor() {
super();
this.fc1 = new torch.nn.Linear(784, 256);
this.dropout1 = new torch.nn.Dropout(0.5);
this.fc2 = new torch.nn.Linear(256, 128);
this.dropout2 = new torch.nn.Dropout(0.5);
this.fc3 = new torch.nn.Linear(128, 10);
}
forward(x: torch.Tensor): torch.Tensor {
x = torch.relu(this.fc1.forward(x));
x = this.dropout1.forward(x); // Prevent co-adaptation in first layer
x = torch.relu(this.fc2.forward(x));
x = this.dropout2.forward(x); // Prevent co-adaptation in second layer
return this.fc3.forward(x);
}
}
const model = new SimpleNet();
// Dropout with lower probability for simpler models
const light_dropout = new torch.nn.Dropout(0.2); // Only 20% dropout
const heavy_dropout = new torch.nn.Dropout(0.7); // Aggressive 70% dropout
// Ensemble interpretation: dropout creates model averaging
const dropout = new torch.nn.Dropout(0.5);
const x = torch.randn([1, 100]);
// Each forward pass with different mask ~ different sub-network
dropout.train();
const out1 = dropout.forward(x); // Different random mask
const out2 = dropout.forward(x); // Different random mask
const out3 = dropout.forward(x); // Different random mask
// Training averages over these different configurations
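The "different mask per pass" behavior above can be sketched without the library (illustrative; `pass` is a hypothetical helper):

```typescript
// Two training-mode passes over the same input draw independent Bernoulli
// masks, so their outputs almost surely differ - each pass is a different
// "thinned" sub-network.
function pass(x: number[], p: number): number[] {
  return x.map(v => (Math.random() < p ? 0 : v / (1 - p)));
}

const input = Array.from({ length: 100 }, (_, i) => i + 1);
const a = pass(input, 0.5);
const b = pass(input, 0.5);
const differ = a.some((v, i) => v !== b[i]); // true with overwhelming probability
```

Over many training steps these random sub-networks are implicitly averaged, which is the model-ensembling effect described above.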