torch.nn.functional.conv2d
function conv2d<S extends Shape, D extends DType = DType, Dev extends DeviceType = DeviceType>(input: Tensor<S, D, Dev>, weight: Tensor): Tensor<Shape, D, Dev>

function conv2d<S extends Shape, D extends DType = DType, Dev extends DeviceType = DeviceType>(input: Tensor<S, D, Dev>, weight: Tensor, bias: Tensor | null, stride: number | [number, number], padding: number | [number, number], dilation: number | [number, number], groups: number, options: Conv2dFunctionalOptions): Tensor<Shape, D, Dev>

2D Convolution: applies learned filters to extract spatial features from images.
Applies a 2D convolution (cross-correlation in practice, as in most deep learning frameworks) over an input image by sliding a kernel across the spatial dimensions and computing dot products. This is the core operation of convolutional neural networks and the foundation of most computer vision models. Essential for:
- Image classification (ResNet, VGG, Inception, MobileNet)
- Object detection (YOLO, Faster R-CNN, SSD)
- Semantic segmentation (FCN, U-Net, DeepLab)
- Image generation and super-resolution (GANs, diffusion models)
- Spatially-correlated feature extraction (audio spectrograms, time series with locality)
How Conv2D works: Slides a learnable kernel (filter) over input, computing element-wise products and summing results. Output at position (i,j) = Σ (kernel * input_patch) + bias. Captures local spatial patterns. Multiple kernels extract different features; deeper layers combine features hierarchically.
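The sliding-window computation described above can be sketched directly. This is a minimal illustration over plain arrays (single channel, stride 1, no padding), not the library's actual implementation:

```typescript
// Naive direct convolution: output(i,j) = sum(kernel * input_patch) + bias.
// Single channel, stride 1, no padding; for illustration only.
function naiveConv2d(input: number[][], kernel: number[][], bias = 0): number[][] {
  const H = input.length, W = input[0].length;
  const kH = kernel.length, kW = kernel[0].length;
  const out: number[][] = [];
  for (let i = 0; i <= H - kH; i++) {
    const row: number[] = [];
    for (let j = 0; j <= W - kW; j++) {
      let acc = bias;
      for (let ki = 0; ki < kH; ki++)
        for (let kj = 0; kj < kW; kj++)
          acc += kernel[ki][kj] * input[i + ki][j + kj]; // element-wise product, summed
      row.push(acc);
    }
    out.push(row);
  }
  return out;
}
```

A 2x2 kernel of ones over a 3x3 input of ones yields a 2x2 output where every entry is 4, matching the patch-sum definition.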
Key parameters:
- kernel_size: Receptive field size (e.g., 3x3 captures immediate neighbors; 5x5 broader context)
- stride: How many pixels to move kernel (larger = smaller output, faster but loses information)
- padding: Border zeros (preserves spatial dimensions with stride=1, centers kernel)
- dilation: Spacing between kernel elements (sparse sampling for broader context without extra parameters)
- groups: Grouped convolution (groups=1 = standard; groups>1 = grouped conv; groups=C_in = depthwise)
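These parameters interact through the standard output-size formula (the same one given under Returns below). A small helper, with an illustrative name that is not part of the library API, makes shape checking concrete:

```typescript
// One spatial output dimension of a conv layer:
// out = floor((in + 2*pad - dil*(k-1) - 1) / stride + 1)
// Helper name is hypothetical, not a library function.
function convOutputSize(
  inSize: number, kernel: number, stride = 1, padding = 0, dilation = 1
): number {
  return Math.floor((inSize + 2 * padding - dilation * (kernel - 1) - 1) / stride + 1);
}

convOutputSize(224, 3, 1, 1);    // 3x3, stride 1, padding 1 preserves size: 224
convOutputSize(112, 3, 2, 1);    // stride 2 halves spatial dims: 56
convOutputSize(32, 3, 1, 2, 2);  // dilation 2 with padding 2 preserves size: 32
```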
Common architectures:
- ResNet: 3x3 kernels, stride 1 or 2, batch norm, skip connections
- VGG: Multiple 3x3 kernels stacked (builds larger receptive field)
- Inception: Parallel 1x1, 3x3, 5x5 kernels for multi-scale features
- MobileNet: Depthwise separable convolutions (depthwise + pointwise) for efficiency
Computational complexity: O(C_out × C_in × kH × kW × output_H × output_W). Large kernels or many channels are expensive. Optimizations: grouped convolution, depthwise separable, dilated convolution, pruning.
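The complexity formula above can be turned into a quick cost estimate. This is a sketch with a hypothetical helper name, counting multiply-accumulate operations (MACs):

```typescript
// MACs for one conv layer: C_out * (C_in/groups) * kH * kW per output element,
// times output_H * output_W positions. Helper name is illustrative only.
function convMACs(
  cIn: number, cOut: number, kH: number, kW: number,
  outH: number, outW: number, groups = 1
): number {
  return cOut * (cIn / groups) * kH * kW * outH * outW;
}

// 3x3 conv, 64 -> 128 channels, 56x56 output:
convMACs(64, 128, 3, 3, 56, 56); // 231,211,008 MACs
// Same layer as a grouped conv with groups=4 costs 4x less:
convMACs(64, 128, 3, 3, 56, 56, 4); // 57,802,752 MACs
```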
- 1x1 convolutions: Used for channel dimension changes without spatial mixing
- Padding choices: padding=(kernel_size-1)//2 preserves spatial dims with stride=1
- Receptive field: Stacking convolutions grows the receptive field additively (two 3x3 convs see a 5x5 region); strided or pooled layers grow it multiplicatively
- Depthwise separable: Depthwise (groups=C_in) + pointwise (1x1) is much cheaper
- Dilated convolutions: Multi-rate processing captures context without spatial reduction
- Gradient flow: Convolution fully differentiable; gradients flow through efficiently
- Memory usage: Weight memory scales with (input_channels/groups) × output_channels × kernel_size²; activation memory scales with batch × output_channels × output_H × output_W
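The depthwise-separable saving noted above follows directly from the weight-count formulas. A worked comparison for a 64-to-128-channel 3x3 layer (numbers are arithmetic from the formulas, not library output):

```typescript
// Parameter counts: standard conv vs depthwise separable (depthwise + 1x1 pointwise).
const cIn = 64, cOut = 128, k = 3;
const standard = cOut * cIn * k * k;              // 73,728 weights
const depthwise = cIn * 1 * k * k;                // groups=C_in: 576 weights
const pointwise = cOut * cIn * 1 * 1;             // 1x1 conv: 8,192 weights
const ratio = standard / (depthwise + pointwise); // roughly 8.4x fewer parameters
```

This is why MobileNet-style architectures use the depthwise + pointwise pair almost everywhere.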
Parameters
input: Tensor<S, D, Dev> - Input tensor of shape [N, C_in, H, W]
- N: batch size
- C_in: number of input channels (e.g., 3 for RGB images)
- H, W: spatial height and width

weight: Tensor - Learnable filter tensor of shape [C_out, C_in/groups, kH, kW]
- C_out: number of output channels (filters)
- C_in/groups: input channels per group (C_in if groups=1)
- kH, kW: kernel height and width (e.g., 3x3 for a typical conv)
Returns
Tensor<Shape, D, Dev> - Output tensor of shape [N, C_out, H_out, W_out], where:
- H_out = floor((H_in + 2*padH - dilH*(kH-1) - 1) / strideH + 1)
- W_out = floor((W_in + 2*padW - dilW*(kW-1) - 1) / strideW + 1)
Examples
// Image classification: extract features from input image
const image = torch.randn([1, 3, 224, 224]); // [batch=1, RGB, height, width]
const kernel = torch.randn([64, 3, 3, 3]); // [out_channels=64, in_channels=3, kH=3, kW=3]
const bias = torch.randn([64]);
const output = torch.nn.functional.conv2d(image, kernel, bias, { stride: 1, padding: 1 });
// output shape: [1, 64, 224, 224]

// Strided convolution for downsampling
const input = torch.randn([4, 64, 112, 112]); // ResNet layer input
const weight = torch.randn([128, 64, 3, 3]);
const output = torch.nn.functional.conv2d(input, weight, undefined, { stride: 2, padding: 1 });
// stride=2 reduces spatial dims: [4, 128, 56, 56]
// Common in ResNet for dimension reduction between blocks

// Dilated convolution for broader receptive field
const x = torch.randn([1, 128, 32, 32]);
const w = torch.randn([128, 128, 3, 3]);
// Dilation=2 creates 5x5 receptive field with 3x3 kernel
const y = torch.nn.functional.conv2d(x, w, undefined, { padding: 2, dilation: 2 });

// Depthwise separable convolution: step 1 (depthwise)
const x = torch.randn([32, 64, 28, 28]);
const w = torch.randn([64, 1, 3, 3]); // One filter per input channel
// groups=C_in makes it depthwise
const depthwise = torch.nn.functional.conv2d(x, w, undefined, { padding: 1, groups: 64 });
// Output [32, 64, 28, 28] - spatial patterns extracted per channel independently

// Fully explicit options with [h, w] tuples for stride, padding, and dilation
const input = torch.randn([8, 128, 16, 16]);
const weight = torch.randn([256, 128, 3, 3]);
const output = torch.nn.functional.conv2d(input, weight, undefined, {
stride: [2, 2],
padding: [1, 1],
dilation: [1, 1],
groups: 1
});