torch.nn.functional.conv2d
function conv2d<S extends Shape, D extends DType = DType, Dev extends DeviceType = DeviceType>(input: Tensor<S, D, Dev>, weight: Tensor): Tensor<Shape, D, Dev>

function conv2d<S extends Shape, D extends DType = DType, Dev extends DeviceType = DeviceType>(input: Tensor<S, D, Dev>, weight: Tensor, bias: Tensor | null, stride: number | [number, number], padding: number | [number, number], dilation: number | [number, number], groups: number, options: Conv2dFunctionalOptions): Tensor<Shape, D, Dev>

2D Convolution: applies learned filters to extract spatial features from images.
Applies a 2D convolution (cross-correlation in practice, as in most deep learning frameworks) over an input image by sliding a kernel across the spatial dimensions and computing dot products. This is the core operation of convolutional neural networks and the foundation of most computer vision models. Essential for:
- Image classification (ResNet, VGG, Inception, MobileNet)
- Object detection (YOLO, Faster R-CNN, SSD)
- Semantic segmentation (FCN, U-Net, DeepLab)
- Image generation and super-resolution (GANs, diffusion models)
- Spatially-correlated feature extraction (audio spectrograms, time series with locality)
How Conv2D works: Slides a learnable kernel (filter) over input, computing element-wise products and summing results. Output at position (i,j) = Σ (kernel * input_patch) + bias. Captures local spatial patterns. Multiple kernels extract different features; deeper layers combine features hierarchically.
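The sliding-window computation described above can be sketched directly. This is a minimal illustration over plain arrays (single channel, stride 1, no padding), not the library's actual implementation:

```typescript
// Naive direct convolution: output(i,j) = sum(kernel * input_patch) + bias.
// Single channel, stride 1, no padding; for illustration only.
function naiveConv2d(input: number[][], kernel: number[][], bias = 0): number[][] {
  const H = input.length, W = input[0].length;
  const kH = kernel.length, kW = kernel[0].length;
  const out: number[][] = [];
  for (let i = 0; i <= H - kH; i++) {
    const row: number[] = [];
    for (let j = 0; j <= W - kW; j++) {
      let acc = bias;
      for (let ki = 0; ki < kH; ki++)
        for (let kj = 0; kj < kW; kj++)
          acc += kernel[ki][kj] * input[i + ki][j + kj]; // element-wise product, summed
      row.push(acc);
    }
    out.push(row);
  }
  return out;
}
```

A 2x2 kernel of ones over a 3x3 input of ones yields a 2x2 output where every entry is 4, matching the patch-sum definition.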
Key parameters:
- kernel_size: Receptive field size (e.g., 3x3 captures immediate neighbors; 5x5 broader context)
- stride: How many pixels to move kernel (larger = smaller output, faster but loses information)
- padding: Border zeros (preserves spatial dimensions with stride=1, centers kernel)
- dilation: Spacing between kernel elements (sparse sampling for broader context without extra parameters)
- groups: Grouped convolution (groups=1 = standard; groups>1 = grouped conv; groups=C_in = depthwise)
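These parameters interact through the standard output-size formula (the same one given under Returns below). A small helper, with an illustrative name that is not part of the library API, makes shape checking concrete:

```typescript
// One spatial output dimension of a conv layer:
// out = floor((in + 2*pad - dil*(k-1) - 1) / stride + 1)
// Helper name is hypothetical, not a library function.
function convOutputSize(
  inSize: number, kernel: number, stride = 1, padding = 0, dilation = 1
): number {
  return Math.floor((inSize + 2 * padding - dilation * (kernel - 1) - 1) / stride + 1);
}

convOutputSize(224, 3, 1, 1);    // 3x3, stride 1, padding 1 preserves size: 224
convOutputSize(112, 3, 2, 1);    // stride 2 halves spatial dims: 56
convOutputSize(32, 3, 1, 2, 2);  // dilation 2 with padding 2 preserves size: 32
```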
Common architectures:
- ResNet: 3x3 kernels, stride 1 or 2, batch norm, skip connections
- VGG: Multiple 3x3 kernels stacked (builds larger receptive field)
- Inception: Parallel 1x1, 3x3, 5x5 kernels for multi-scale features
- MobileNet: Depthwise separable convolutions (depthwise + pointwise) for efficiency
Computational complexity: O(C_out × C_in × kH × kW × output_H × output_W). Large kernels or many channels are expensive. Optimizations: grouped convolution, depthwise separable, dilated convolution, pruning.
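The complexity formula above can be turned into a quick cost estimate. This is a sketch with a hypothetical helper name, counting multiply-accumulate operations (MACs):

```typescript
// MACs for one conv layer: C_out * (C_in/groups) * kH * kW per output element,
// times output_H * output_W positions. Helper name is illustrative only.
function convMACs(
  cIn: number, cOut: number, kH: number, kW: number,
  outH: number, outW: number, groups = 1
): number {
  return cOut * (cIn / groups) * kH * kW * outH * outW;
}

// 3x3 conv, 64 -> 128 channels, 56x56 output:
convMACs(64, 128, 3, 3, 56, 56); // 231,211,008 MACs
// Same layer as a grouped conv with groups=4 costs 4x less:
convMACs(64, 128, 3, 3, 56, 56, 4); // 57,802,752 MACs
```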
- 1x1 convolutions: Used for channel dimension changes without spatial mixing
- Padding choices: padding=(kernel_size-1)//2 preserves spatial dims with stride=1
- Receptive field: Stacking convolutions grows the receptive field additively (two 3x3 convs see a 5x5 region); strided or pooled layers grow it multiplicatively
- Depthwise separable: Depthwise (groups=C_in) + pointwise (1x1) is much cheaper
- Dilated convolutions: Multi-rate processing captures context without spatial reduction
- Gradient flow: Convolution fully differentiable; gradients flow through efficiently
- Memory usage: Weight memory scales with (input_channels/groups) × output_channels × kernel_size²; activation memory scales with batch × output_channels × output_H × output_W
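The depthwise-separable saving noted above follows directly from the weight-count formulas. A worked comparison for a 64-to-128-channel 3x3 layer (numbers are arithmetic from the formulas, not library output):

```typescript
// Parameter counts: standard conv vs depthwise separable (depthwise + 1x1 pointwise).
const cIn = 64, cOut = 128, k = 3;
const standard = cOut * cIn * k * k;              // 73,728 weights
const depthwise = cIn * 1 * k * k;                // groups=C_in: 576 weights
const pointwise = cOut * cIn * 1 * 1;             // 1x1 conv: 8,192 weights
const ratio = standard / (depthwise + pointwise); // roughly 8.4x fewer parameters
```

This is why MobileNet-style architectures use the depthwise + pointwise pair almost everywhere.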
Parameters
input: Tensor<S, D, Dev> - Input tensor of shape [N, C_in, H, W]
- N: batch size
- C_in: number of input channels (e.g., 3 for RGB images)
- H, W: spatial height and width

weight: Tensor - Learnable filter tensor of shape [C_out, C_in/groups, kH, kW]
- C_out: number of output channels (filters)
- C_in/groups: input channels per group (C_in if groups=1)
- kH, kW: kernel height and width (e.g., 3x3 for a typical conv)
Returns
Tensor<Shape, D, Dev> - Output tensor of shape [N, C_out, H_out, W_out], where:
- H_out = floor((H_in + 2*padH - dilH*(kH-1) - 1) / strideH + 1)
- W_out = floor((W_in + 2*padW - dilW*(kW-1) - 1) / strideW + 1)
Examples
// Image classification: extract features from input image
const image = torch.randn([1, 3, 224, 224]); // [batch=1, RGB, height, width]
const kernel = torch.randn([64, 3, 3, 3]); // [out_channels=64, in_channels=3, kH=3, kW=3]
const bias = torch.randn([64]);
const output = torch.nn.functional.conv2d(image, kernel, bias, { stride: 1, padding: 1 });
// output shape: [1, 64, 224, 224]

// Strided convolution for downsampling
const input = torch.randn([4, 64, 112, 112]); // ResNet layer input
const weight = torch.randn([128, 64, 3, 3]);
const output = torch.nn.functional.conv2d(input, weight, undefined, { stride: 2, padding: 1 });
// stride=2 reduces spatial dims: [4, 128, 56, 56]
// Common in ResNet for dimension reduction between blocks

// Dilated convolution for broader receptive field
const x = torch.randn([1, 128, 32, 32]);
const w = torch.randn([128, 128, 3, 3]);
// Dilation=2 creates 5x5 receptive field with 3x3 kernel
const y = torch.nn.functional.conv2d(x, w, undefined, { padding: 2, dilation: 2 });

// Depthwise separable convolution: step 1 (depthwise)
const x = torch.randn([32, 64, 28, 28]);
const w = torch.randn([64, 1, 3, 3]); // One filter per input channel
// groups=C_in makes it depthwise
const depthwise = torch.nn.functional.conv2d(x, w, undefined, { padding: 1, groups: 64 });
// Output [32, 64, 28, 28] - spatial patterns extracted per channel independently

// Fully explicit options with [h, w] tuples for stride, padding, and dilation
const input = torch.randn([8, 128, 16, 16]);
const weight = torch.randn([256, 128, 3, 3]);
const output = torch.nn.functional.conv2d(input, weight, undefined, {
stride: [2, 2],
padding: [1, 1],
dilation: [1, 1],
groups: 1
});