torch.nn.functional.scaled_grouped_mm
function scaled_grouped_mm(input: Tensor, weight: Tensor, options?: ScaledGroupedMMFunctionalOptions): Tensor

Performs batched (grouped) matrix multiplication, scaling every batch element's result by a single shared factor.
Computes batched matrix multiplication where each batch element is independently
multiplied and then scaled by a single factor. More memory-efficient than grouped_mm
for regular batched operations. Common uses:
- Batched transformations: processing a batch of matrices in parallel
- Quantized operations: applying a constant quantization scale factor after the matmul
- Attention scaling: Q @ K^T with 1/√d scaling in transformer attention
- Efficient inference: a single scale applied uniformly across the batch
- Multi-sample processing: processing multiple matrices simultaneously
- GPU acceleration: leverages batched-matmul kernel optimizations
Operation:
For 3D tensors: output[b] = (input[b] @ weight[b]) * scale
The same scale factor is applied to all batch elements after multiplication, useful for learnable scaling or quantized inference.
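The formula above can be sketched on plain nested arrays. `scaledGroupedMMRef` below is a hypothetical reference implementation of `output[b] = (input[b] @ weight[b]) * scale`, not part of the library (the real op runs on Tensors via an optimized batched matmul):

```typescript
// Reference semantics of scaled_grouped_mm on plain nested arrays.
type Mat = number[][];

function scaledGroupedMMRef(input: Mat[], weight: Mat[], scale: number): Mat[] {
  return input.map((a, b) => {
    const w = weight[b]; // weight for batch element b
    const m = a.length, k = a[0].length, n = w[0].length;
    const out: Mat = Array.from({ length: m }, () => new Array(n).fill(0));
    for (let i = 0; i < m; i++) {
      for (let j = 0; j < n; j++) {
        let acc = 0;
        for (let p = 0; p < k; p++) acc += a[i][p] * w[p][j];
        out[i][j] = acc * scale; // same scale applied to every batch element
      }
    }
    return out;
  });
}
```

For example, a single 1×2 @ 2×1 batch element with scale 0.5 yields `(1*3 + 2*4) * 0.5 = 5.5`.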
Difference from grouped_mm:
- scaled_grouped_mm: a single scale shared across the entire batch, implemented via batched matmul (bmm), which maps directly onto optimized GPU kernels
- grouped_mm: different weights and potentially biases per group
Constraints:
- 3D tensors only: input and weight must be 3D [batch, *, *]
- Batch sizes must match: input.shape[0] must equal weight.shape[0]
- Inner dimensions must match: input.shape[2] must equal weight.shape[1]
- Scale=1 optimization: returns the unscaled result directly when scale=1.0
- Gradient propagation: gradients flow to both input and weight tensors
- Attention standard: common scaling pattern in transformer attention
Errors:
- Invalid dimensions: input and weight must be exactly 3D
- Batch size mismatch: batch dimensions must be identical
- Inner dimension mismatch: k dimensions must match for matmul
Notes:
- Scale interpretation: scale is multiplicative (not additive)
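The shape rules above can be expressed as a small check. `checkScaledGroupedMMShapes` is a hypothetical helper mirroring those constraints, not part of the library (the actual op performs equivalent validation internally):

```typescript
// Hypothetical shape validation mirroring scaled_grouped_mm's constraints.
function checkScaledGroupedMMShapes(input: number[], weight: number[]): void {
  if (input.length !== 3 || weight.length !== 3)
    throw new Error("scaled_grouped_mm: input and weight must be 3D");
  if (input[0] !== weight[0])
    throw new Error(`scaled_grouped_mm: batch size mismatch (${input[0]} vs ${weight[0]})`);
  if (input[2] !== weight[1])
    throw new Error(`scaled_grouped_mm: inner dimension mismatch (${input[2]} vs ${weight[1]})`);
}
```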
Parameters
input: Tensor - Input tensor of shape [batch, m, k]
  - batch: number of independent matrix multiplications
  - m: number of rows in each batch element
  - k: number of columns (inner dimension for matmul)
weight: Tensor - Weight tensor of shape [batch, k, n]
  - batch: must match input batch size
  - k: must match input's last dimension
  - n: number of output columns
options: ScaledGroupedMMFunctionalOptions (optional)
Returns
Tensor - Tensor of shape [batch, m, n] containing the scaled batched matmul result
Examples
// Attention scaling: Q @ K^T / sqrt(d)
const batch_size = 32;
const seq_len = 10;
const head_dim = 64;
const queries = torch.randn(batch_size, seq_len, head_dim); // Q: [32, 10, 64]
const keys_t = torch.randn(batch_size, head_dim, seq_len); // K^T: [32, 64, 10]
const scale = 1.0 / Math.sqrt(head_dim); // 1/√64 = 0.125
const attn_scores = torch.nn.functional.scaled_grouped_mm(queries, keys_t, scale);
// attn_scores: [32, 10, 10] - attention scores scaled by 1/√64

// Batched matrix multiplication with quantization scale
const input = torch.randn(16, 128, 256); // [batch, m, k]
const weights = torch.randn(16, 256, 128); // [batch, k, n]
const quant_scale = 0.1; // Quantization factor
const output = torch.nn.functional.scaled_grouped_mm(input, weights, quant_scale);
// output: [16, 128, 128] scaled by 0.1

// Batch processing with learnable scale
const batch = torch.randn(8, 100, 50);
const weights = torch.randn(8, 50, 30);
const learnable_scale = torch.tensor(1.5, { requires_grad: true });
// Caution: .item() extracts a plain number, detaching it from the autograd
// graph, so gradients will NOT flow back to learnable_scale through this call;
// keep the scale as a tensor in your own computation if it must be learnable.
const output = torch.nn.functional.scaled_grouped_mm(batch, weights, learnable_scale.item());

// Multi-head attention implementation
const batch = 32, heads = 8, seq = 10, dim = 64;
// Reshape [batch, seq, heads*dim] → [batch*heads, seq, dim] for faster matmul
const Q = torch.randn(batch * heads, seq, dim); // [256, 10, 64]
const K_t = torch.randn(batch * heads, dim, seq); // K^T: [256, 64, 10]
const scale = 1.0 / Math.sqrt(dim);
const attention = torch.nn.functional.scaled_grouped_mm(Q, K_t, scale);
// attention: [batch*heads, seq, seq]

// No scaling optimization: scale=1.0 is efficient
const input = torch.randn(64, 256, 512);
const weights = torch.randn(64, 512, 256);
const result = torch.nn.functional.scaled_grouped_mm(input, weights, 1.0);
// Equivalent to input @ weights; scaling is a no-op
See Also
- [PyTorch torch.nn.functional._scaled_grouped_mm (internal)](https://pytorch.org/docs/stable/generated/torch.nn.functional._scaled_grouped_mm.html)
- grouped_mm - Grouped matmul with independent weights per group
- scaled_mm - General scaled matrix multiplication with advanced options
- Tensor.bmm - Batched matrix multiplication without scaling
- Tensor.matmul - General matrix multiplication