torch.nn.functional.scaled_grouped_mm
function scaled_grouped_mm(input: Tensor, weight: Tensor, options?: ScaledGroupedMMFunctionalOptions): Tensor

Performs batched (grouped) matrix multiplication, scaling every batch element's result by a single shared factor.
Computes batched matrix multiplication where each batch element is independently
multiplied and then scaled by a single factor. More memory-efficient than grouped_mm
for regular batched operations. Common uses:
- Batched transformations: processing a batch of matrices in parallel
- Quantized operations: applying a constant quantization scale factor after the matmul
- Attention scaling: Q @ K^T with 1/√d scaling in transformer attention
- Efficient inference: a single scale applied uniformly across the batch
- Multi-sample processing: processing multiple matrices simultaneously
- GPU acceleration: leverages batched-matmul kernel optimizations
Operation:
For 3D tensors: output[b] = (input[b] @ weight[b]) * scale
The same scale factor is applied to all batch elements after multiplication, useful for learnable scaling or quantized inference.
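The formula above can be sketched on plain nested arrays. `scaledGroupedMMRef` below is a hypothetical reference implementation of `output[b] = (input[b] @ weight[b]) * scale`, not part of the library (the real op runs on Tensors via an optimized batched matmul):

```typescript
// Reference semantics of scaled_grouped_mm on plain nested arrays.
type Mat = number[][];

function scaledGroupedMMRef(input: Mat[], weight: Mat[], scale: number): Mat[] {
  return input.map((a, b) => {
    const w = weight[b]; // weight for batch element b
    const m = a.length, k = a[0].length, n = w[0].length;
    const out: Mat = Array.from({ length: m }, () => new Array(n).fill(0));
    for (let i = 0; i < m; i++) {
      for (let j = 0; j < n; j++) {
        let acc = 0;
        for (let p = 0; p < k; p++) acc += a[i][p] * w[p][j];
        out[i][j] = acc * scale; // same scale applied to every batch element
      }
    }
    return out;
  });
}
```

For example, a single 1×2 @ 2×1 batch element with scale 0.5 yields `(1*3 + 2*4) * 0.5 = 5.5`.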
Difference from grouped_mm:
- scaled_grouped_mm: a single scale shared across the entire batch, implemented via batched matmul (bmm), which maps directly onto optimized GPU kernels
- grouped_mm: different weights and potentially biases per group
Constraints:
- 3D tensors only: input and weight must be 3D [batch, *, *]
- Batch sizes must match: input.shape[0] must equal weight.shape[0]
- Inner dimensions must match: input.shape[2] must equal weight.shape[1]
- Scale=1 optimization: returns the unscaled result directly when scale=1.0
- Gradient propagation: gradients flow to both input and weight tensors
- Attention standard: common scaling pattern in transformer attention
Errors:
- Invalid dimensions: input and weight must be exactly 3D
- Batch size mismatch: batch dimensions must be identical
- Inner dimension mismatch: k dimensions must match for matmul
Notes:
- Scale interpretation: scale is multiplicative (not additive)
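The shape rules above can be expressed as a small check. `checkScaledGroupedMMShapes` is a hypothetical helper mirroring those constraints, not part of the library (the actual op performs equivalent validation internally):

```typescript
// Hypothetical shape validation mirroring scaled_grouped_mm's constraints.
function checkScaledGroupedMMShapes(input: number[], weight: number[]): void {
  if (input.length !== 3 || weight.length !== 3)
    throw new Error("scaled_grouped_mm: input and weight must be 3D");
  if (input[0] !== weight[0])
    throw new Error(`scaled_grouped_mm: batch size mismatch (${input[0]} vs ${weight[0]})`);
  if (input[2] !== weight[1])
    throw new Error(`scaled_grouped_mm: inner dimension mismatch (${input[2]} vs ${weight[1]})`);
}
```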
Parameters
input: Tensor - Input tensor of shape [batch, m, k]
  - batch: number of independent matrix multiplications
  - m: number of rows in each batch element
  - k: number of columns (inner dimension for matmul)
weight: Tensor - Weight tensor of shape [batch, k, n]
  - batch: must match input batch size
  - k: must match input's last dimension
  - n: number of output columns
options: ScaledGroupedMMFunctionalOptions (optional)
Returns
Tensor - Tensor of shape [batch, m, n] containing the scaled batched matmul result
Examples
// Attention scaling: Q @ K^T / sqrt(d)
const batch_size = 32;
const seq_len = 10;
const head_dim = 64;
const queries = torch.randn(batch_size, seq_len, head_dim); // Q: [32, 10, 64]
const keys_t = torch.randn(batch_size, head_dim, seq_len); // K^T: [32, 64, 10]
const scale = 1.0 / Math.sqrt(head_dim); // 1/√64 = 0.125
const attn_scores = torch.nn.functional.scaled_grouped_mm(queries, keys_t, scale);
// attn_scores: [32, 10, 10] - attention scores scaled by 1/√64

// Batched matrix multiplication with quantization scale
const input = torch.randn(16, 128, 256); // [batch, m, k]
const weights = torch.randn(16, 256, 128); // [batch, k, n]
const quant_scale = 0.1; // Quantization factor
const output = torch.nn.functional.scaled_grouped_mm(input, weights, quant_scale);
// output: [16, 128, 128] scaled by 0.1

// Batch processing with learnable scale
const batch = torch.randn(8, 100, 50);
const weights = torch.randn(8, 50, 30);
const learnable_scale = torch.tensor(1.5, { requires_grad: true });
// Caution: .item() extracts a plain number, detaching it from the autograd
// graph, so gradients will NOT flow back to learnable_scale through this call;
// keep the scale as a tensor in your own computation if it must be learnable.
const output = torch.nn.functional.scaled_grouped_mm(batch, weights, learnable_scale.item());

// Multi-head attention implementation
const batch = 32, heads = 8, seq = 10, dim = 64;
// Reshape [batch, seq, heads*dim] → [batch*heads, seq, dim] for faster matmul
const Q = torch.randn(batch * heads, seq, dim); // [256, 10, 64]
const K_t = torch.randn(batch * heads, dim, seq); // K^T: [256, 64, 10]
const scale = 1.0 / Math.sqrt(dim);
const attention = torch.nn.functional.scaled_grouped_mm(Q, K_t, scale);
// attention: [batch*heads, seq, seq]

// No scaling optimization: scale=1.0 is efficient
const input = torch.randn(64, 256, 512);
const weights = torch.randn(64, 512, 256);
const result = torch.nn.functional.scaled_grouped_mm(input, weights, 1.0);
// Equivalent to input @ weights; scaling is a no-op
See Also
- [PyTorch torch.nn.functional._scaled_grouped_mm (internal)](https://pytorch.org/docs/stable/generated/torch.nn.functional._scaled_grouped_mm.html)
- grouped_mm - Grouped matmul with independent weights per group
- scaled_mm - General scaled matrix multiplication with advanced options
- Tensor.bmm - Batched matrix multiplication without scaling
- Tensor.matmul - General matrix multiplication