torch.nn.init.orthogonal_
function orthogonal_(tensor: Tensor, options?: OrthogonalOptions): Tensor
function orthogonal_(tensor: Tensor, gain: number, options?: OrthogonalOptions): Tensor

Fills the input tensor with a (semi-)orthogonal matrix, promoting stable gradient flow and fast convergence.
Orthogonal initialization creates weight matrices with orthogonal rows (or columns). It preserves signal magnitude through layers, which enables fast convergence. Particularly useful for:
- RNNs and LSTMs (orthogonal weight matrices preserve gradient magnitudes)
- Deep networks sensitive to vanishing/exploding gradients
- Networks where orthogonality helps training dynamics
- Sequence models where signal preservation through time is important
- Theoretical analysis of gradient flow
Provides excellent gradient flow properties: ∥Wx∥ ≈ ∥x∥ for orthogonal W.
Described in "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks" - Saxe, A. et al. (2013).
- Gradient flow: Orthogonal matrices preserve vector norms through multiplication
- RNNs: Especially beneficial for RNNs/LSTMs to prevent vanishing gradients
- Flattening: For multidimensional tensors (conv), rows are first dim, rest flattened
- Gain scaling: gain = √2 for ReLU networks, gain = 1 for tanh/sigmoid
- Approximation: Implementation uses simplified approach, not full QR decomposition
- Computational cost: More expensive than Xavier/He due to orthogonal computation
- In-place operation: Modifies tensor in-place; returns the same tensor
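The norm-preservation and gain-scaling properties in the notes above can be checked numerically. The sketch below uses NumPy rather than this library's API, building an orthogonal matrix via QR decomposition of a random Gaussian matrix (one standard construction; not necessarily this library's exact method):

```python
import numpy as np

rng = np.random.default_rng(0)

# Build an orthogonal matrix from a random Gaussian via QR decomposition.
a = rng.standard_normal((128, 128))
q, r = np.linalg.qr(a)

# Orthogonality: Q^T Q = I.
assert np.allclose(q.T @ q, np.eye(128), atol=1e-8)

# Norm preservation: ||Qx|| == ||x|| for any x.
x = rng.standard_normal(128)
assert np.isclose(np.linalg.norm(q @ x), np.linalg.norm(x))

# With gain g, norms scale by exactly g: ||(g*Q)x|| == g * ||x||.
gain = np.sqrt(2.0)  # the recommended gain for ReLU networks
assert np.isclose(np.linalg.norm(gain * q @ x), gain * np.linalg.norm(x))
```

This is why ReLU networks use gain = √2: ReLU zeroes roughly half of each activation vector, and the gain compensates so variance is preserved across layers.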
Parameters
tensorTensor- An n-dimensional Tensor where n ≥ 2. For n > 2, trailing dimensions are flattened into columns: the tensor is treated as shape (n_rows, n_cols) where n_cols = product of the remaining dims
optionsOrthogonalOptionsoptional- Optional settings for orthogonal initialization
Returns
Tensor– The input tensor, filled in place with an orthogonal initialization

Algorithm:
- Generate a random matrix with entries from N(0, 1)
- Flatten the tensor to 2D: (rows, cols)
- Compute an (approximate) orthogonal matrix
- Scale by the gain factor
- Reshape back to the original shape

Examples
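The algorithm steps above are language-agnostic; a runnable NumPy sketch (a hypothetical `orthogonal_` helper using QR with sign correction, one common construction rather than this library's exact implementation):

```python
import numpy as np

def orthogonal_(w: np.ndarray, gain: float = 1.0, seed: int = 0) -> np.ndarray:
    """Fill `w` in place with a (semi-)orthogonal matrix scaled by `gain`."""
    rng = np.random.default_rng(seed)
    rows = w.shape[0]
    cols = w.size // rows  # flatten trailing dims into columns
    # Random Gaussian matrix; take QR of the "tall" orientation.
    a = rng.standard_normal((rows, cols) if rows >= cols else (cols, rows))
    q, r = np.linalg.qr(a)
    # Sign correction keeps the distribution uniform over orthogonal matrices.
    q *= np.sign(np.diag(r))
    if rows < cols:
        q = q.T
    # Scale by gain and reshape back to the original shape.
    w[...] = gain * q.reshape(w.shape)
    return w

# Rows of a wide matrix (or columns of a tall one) come out orthonormal.
w = np.empty((3, 8))
orthogonal_(w)
print(np.allclose(w @ w.T, np.eye(3)))  # True
```

Note how a conv-shaped tensor such as (64, 3, 3, 3) is handled: rows = 64 and cols = 3·3·3 = 27, matching the flattening rule described in the notes.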
// RNN weight initialization
const rnn = torch.nn.RNN({ input_size: 64, hidden_size: 128, num_layers: 2 });
for (const [name, param] of rnn.named_parameters()) {
if (name.includes('weight_hh') || name.includes('recurrent')) {
// Use orthogonal for recurrent weights
torch.nn.init.orthogonal_(param);
} else if (name.includes('weight_ih') || name.includes('input')) {
// Xavier for input-to-hidden
const gain = torch.nn.init.calculate_gain('relu');
torch.nn.init.xavier_uniform_(param, { gain });
}
}

// LSTM with orthogonal initialization
const lstm = torch.nn.LSTM({ input_size: 128, hidden_size: 256, batch_first: true });
for (const [name, param] of lstm.named_parameters()) {
if (name.includes('weight_hh')) {
// Orthogonal for hidden-to-hidden (recurrent)
torch.nn.init.orthogonal_(param, { gain: 1.0 });
} else if (name.includes('weight_ih')) {
// Xavier for input-to-hidden
torch.nn.init.xavier_uniform_(param);
}
}

// Deep fully-connected network with orthogonal init
const layer1 = torch.nn.Linear(512, 512);
const layer2 = torch.nn.Linear(512, 512);
const layer3 = torch.nn.Linear(512, 10);
// Orthogonal init for hidden layers
torch.nn.init.orthogonal_(layer1.weight, { gain: torch.nn.init.calculate_gain('relu') });
torch.nn.init.orthogonal_(layer2.weight, { gain: torch.nn.init.calculate_gain('relu') });
// Xavier for output layer
torch.nn.init.xavier_uniform_(layer3.weight);
torch.nn.init.zeros_(layer1.bias);
torch.nn.init.zeros_(layer2.bias);
torch.nn.init.zeros_(layer3.bias);

// Convolutional layer with orthogonal initialization
const conv = torch.nn.Conv2d(3, 64, { kernel_size: 3, padding: 1 });
torch.nn.init.orthogonal_(conv.weight, { gain: 1.0 });
torch.nn.init.zeros_(conv.bias);

See Also
- PyTorch torch.nn.init.orthogonal_()
- torch.nn.init.xavier_uniform_ - Xavier initialization (cheaper, but does not guarantee orthogonality)
- torch.nn.init.kaiming_uniform_ - He initialization for ReLU
- torch.nn.init.calculate_gain - Get gain for specific activation function