
torch.nn.functional.conv2d

function conv2d<S extends Shape, D extends DType = DType, Dev extends DeviceType = DeviceType>(
  input: Tensor<S, D, Dev>,
  weight: Tensor
): Tensor<Shape, D, Dev>

function conv2d<S extends Shape, D extends DType = DType, Dev extends DeviceType = DeviceType>(
  input: Tensor<S, D, Dev>,
  weight: Tensor,
  bias: Tensor | null,
  stride: number | [number, number],
  padding: number | [number, number],
  dilation: number | [number, number],
  groups: number,
  options: Conv2dFunctionalOptions
): Tensor<Shape, D, Dev>

2D Convolution: applies learned filters to extract spatial features from images.

Applies a 2D convolution (cross-correlation in practice) over an input image by sliding a kernel across the spatial dimensions and computing dot products. It is the core operation of convolutional neural networks and the foundation of most classical computer vision models. Essential for:

  • Image classification (ResNet, VGG, Inception, MobileNet)
  • Object detection (YOLO, Faster R-CNN, SSD)
  • Semantic segmentation (FCN, U-Net, DeepLab)
  • Image generation and super-resolution (GANs, diffusion models)
  • Spatially-correlated feature extraction (audio spectrograms, time series with locality)

How Conv2D works: Slides a learnable kernel (filter) over input, computing element-wise products and summing results. Output at position (i,j) = Σ (kernel * input_patch) + bias. Captures local spatial patterns. Multiple kernels extract different features; deeper layers combine features hierarchically.
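The sliding-window computation above can be sketched as a minimal reference implementation over plain number arrays. This is illustration only (single channel, stride 1, no padding); `conv2dNaive` is a hypothetical name, not part of the torch.js API.

```typescript
// Naive 2D convolution (cross-correlation): slide the kernel over the
// input and take the dot product with each patch, plus an optional bias.
function conv2dNaive(
  input: number[][],   // [H, W]
  kernel: number[][],  // [kH, kW]
  bias = 0
): number[][] {
  const H = input.length, W = input[0].length;
  const kH = kernel.length, kW = kernel[0].length;
  const out: number[][] = [];
  for (let i = 0; i <= H - kH; i++) {
    const row: number[] = [];
    for (let j = 0; j <= W - kW; j++) {
      let acc = bias;
      // Element-wise product of kernel and the input patch at (i, j), summed
      for (let ki = 0; ki < kH; ki++)
        for (let kj = 0; kj < kW; kj++)
          acc += kernel[ki][kj] * input[i + ki][j + kj];
      row.push(acc);
    }
    out.push(row);
  }
  return out;
}

// A 3x3 all-ones kernel over a 4x4 input yields a 2x2 output;
// each output value is the sum of one 3x3 patch.
const img = [
  [1, 2, 3, 4],
  [5, 6, 7, 8],
  [9, 10, 11, 12],
  [13, 14, 15, 16],
];
const k = [
  [1, 1, 1],
  [1, 1, 1],
  [1, 1, 1],
];
const res = conv2dNaive(img, k);
// res[0][0] = 1+2+3+5+6+7+9+10+11 = 54
```

A real implementation adds batch and channel loops (or an im2col + matmul formulation), but the inner dot product is exactly this.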

Key parameters:

  • kernel_size: Receptive field size (e.g., 3x3 captures immediate neighbors; 5x5 broader context)
  • stride: How many pixels to move kernel (larger = smaller output, faster but loses information)
  • padding: Border zeros (preserves spatial dimensions with stride=1, centers kernel)
  • dilation: Spacing between kernel elements (sparse sampling for broader context without growth)
  • groups: Splits channels into independently convolved groups (groups=1 = standard; groups>1 = grouped conv; groups=C_in = depthwise)
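How these parameters interact is easiest to see through the output-size formula (given in full below). A small sketch, where `convOutSize` is an illustrative helper rather than a torch.js function:

```typescript
// Output size along one spatial dimension:
// floor((in + 2*padding - dilation*(kernel - 1) - 1) / stride) + 1
function convOutSize(
  inSize: number,
  kernel: number,
  stride = 1,
  padding = 0,
  dilation = 1
): number {
  return Math.floor(
    (inSize + 2 * padding - dilation * (kernel - 1) - 1) / stride
  ) + 1;
}

// stride=1 with padding=(kernel-1)/2 preserves spatial size:
convOutSize(224, 3, 1, 1);    // 224
// stride=2 halves it:
convOutSize(224, 3, 2, 1);    // 112
// dilation=2 with a 3x3 kernel needs padding=2 to preserve size:
convOutSize(32, 3, 1, 2, 2);  // 32
```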

Common architectures:

  • ResNet: 3x3 kernels, stride 1 or 2, batch norm, skip connections
  • VGG: Multiple 3x3 kernels stacked (builds larger receptive field)
  • Inception: Parallel 1x1, 3x3, 5x5 kernels for multi-scale features
  • MobileNet: Depthwise separable convolutions (depthwise + pointwise) for efficiency

Computational complexity: O(C_out × C_in × kH × kW × output_H × output_W). Large kernels or many channels are expensive. Optimizations: grouped convolution, depthwise separable, dilated convolution, pruning.
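To make the complexity concrete, here is a rough multiply-accumulate count for one layer, and why the depthwise-separable factorization is cheaper. `convMACs` is a hypothetical helper for illustration, not part of torch.js.

```typescript
// MACs for one conv layer, per the complexity formula above; with groups,
// each output channel only sees C_in/groups input channels.
function convMACs(
  cIn: number, cOut: number, k: number,
  hOut: number, wOut: number, groups = 1
): number {
  return cOut * (cIn / groups) * k * k * hOut * wOut;
}

const [cIn, cOut, h, w] = [64, 128, 56, 56];
const standard = convMACs(cIn, cOut, 3, h, w);
// Depthwise separable = depthwise 3x3 (groups = C_in) + pointwise 1x1:
const depthwise = convMACs(cIn, cIn, 3, h, w, cIn);
const pointwise = convMACs(cIn, cOut, 1, h, w);
const ratio = standard / (depthwise + pointwise);
// The separable pair is roughly 8x cheaper at these sizes.
```

This ratio (about k² for large channel counts) is the basis of MobileNet-style efficiency gains.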

\begin{aligned}
\text{Output}[n,c,i,j] &= \text{bias}[c] + \sum_{k_h,k_w,c'} \text{weight}[c,c',k_h,k_w] \times \text{input}[n, c', i \cdot s_h + k_h \cdot d_h,\, j \cdot s_w + k_w \cdot d_w] \\
\text{Output height: } H_{out} &= \left\lfloor \frac{H_{in} + 2 \cdot pad_h - dil_h \cdot (k_h - 1) - 1}{stride_h} \right\rfloor + 1 \\
\text{Output width: } W_{out} &= \left\lfloor \frac{W_{in} + 2 \cdot pad_w - dil_w \cdot (k_w - 1) - 1}{stride_w} \right\rfloor + 1 \\
\text{Receptive field: } &\left(1 + (k_h - 1) \cdot dil_h\right) \times \left(1 + (k_w - 1) \cdot dil_w\right)
\end{aligned}

Practical notes:

  • 1x1 convolutions: Used for channel dimension changes without spatial mixing
  • Padding choices: padding=(kernel_size-1)//2 preserves spatial dims with stride=1
  • Receptive field: Stacking convolutions grows the receptive field layer by layer (two 3x3 convs cover a 5x5 region); strides and dilations scale the per-layer growth
  • Depthwise separable: Depthwise (groups=C_in) + pointwise (1x1) is much cheaper
  • Dilated convolutions: Multi-rate processing captures context without spatial reduction
  • Gradient flow: Convolution fully differentiable; gradients flow through efficiently
  • Memory usage: Scales with input_channels × output_channels × kernel_size²
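The receptive-field growth of a stack of layers can be computed directly. A sketch, where `receptiveField` is an illustrative helper (not a torch.js function): each layer adds (k − 1) × dilation scaled by the product of all earlier strides.

```typescript
// Receptive field of a stack of conv layers on the original input.
function receptiveField(
  layers: { k: number; stride?: number; dilation?: number }[]
): number {
  let rf = 1;
  let jump = 1; // product of strides of all earlier layers
  for (const { k, stride = 1, dilation = 1 } of layers) {
    rf += (k - 1) * dilation * jump;
    jump *= stride;
  }
  return rf;
}

// Two stacked 3x3 convs see a 5x5 region; three see 7x7:
receptiveField([{ k: 3 }, { k: 3 }]);            // 5
receptiveField([{ k: 3 }, { k: 3 }, { k: 3 }]);  // 7
// A single 3x3 conv with dilation=2 also sees 5x5:
receptiveField([{ k: 3, dilation: 2 }]);         // 5
```

This is why VGG-style stacks of 3x3 kernels can replace a single large kernel with fewer parameters.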

Parameters

input: Tensor<S, D, Dev>
Input tensor of shape [N, C_in, H, W]
  • N: batch size
  • C_in: number of input channels (e.g., 3 for RGB images)
  • H, W: spatial height and width
weight: Tensor
Learnable filter tensor of shape [C_out, C_in/groups, kH, kW]
  • C_out: number of output channels (filters)
  • C_in/groups: input channels per group (C_in if groups=1)
  • kH, kW: kernel height and width (e.g., 3x3 for a typical conv)

Returns

Tensor<Shape, D, Dev> – Output tensor of shape [N, C_out, H_out, W_out]
  • H_out = floor((H_in + 2*padH - dilH*(kH-1) - 1) / strideH) + 1
  • W_out = floor((W_in + 2*padW - dilW*(kW-1) - 1) / strideW) + 1

Examples

// Image classification: extract features from an input image
const image = torch.randn([1, 3, 224, 224]);  // [batch=1, RGB, height, width]
const kernel = torch.randn([64, 3, 3, 3]);    // [out_channels=64, in_channels=3, kH=3, kW=3]
const bias = torch.randn([64]);
const output = torch.nn.functional.conv2d(image, kernel, bias, { stride: 1, padding: 1 });
// output shape: [1, 64, 224, 224]

// Strided convolution for downsampling
const input = torch.randn([4, 64, 112, 112]);  // ResNet layer input
const weight = torch.randn([128, 64, 3, 3]);
const output = torch.nn.functional.conv2d(input, weight, null, { stride: 2, padding: 1 });
// stride=2 halves the spatial dims: [4, 128, 56, 56]
// Common in ResNet for dimension reduction between blocks

// Dilated convolution for a broader receptive field
const x = torch.randn([1, 128, 32, 32]);
const w = torch.randn([128, 128, 3, 3]);
// dilation=2 gives a 3x3 kernel a 5x5 receptive field; padding=2 preserves the size
const y = torch.nn.functional.conv2d(x, w, null, { padding: 2, dilation: 2 });

// Depthwise separable convolution: step 1 (depthwise)
const x = torch.randn([32, 64, 28, 28]);
const w = torch.randn([64, 1, 3, 3]);  // one filter per input channel
// groups=C_in makes it depthwise
const depthwise = torch.nn.functional.conv2d(x, w, null, { padding: 1, groups: 64 });
// output [32, 64, 28, 28] - spatial patterns extracted per channel independently
// Step 2 (pointwise): a 1x1 convolution mixes channels
const pointwise = torch.randn([128, 64, 1, 1]);
const separable = torch.nn.functional.conv2d(depthwise, pointwise, null, { stride: 1 });

// Fully explicit call with array-valued options
const input = torch.randn([8, 128, 16, 16]);
const weight = torch.randn([256, 128, 3, 3]);
const output = torch.nn.functional.conv2d(input, weight, null, {
  stride: [2, 2],
  padding: [1, 1],
  dilation: [1, 1],
  groups: 1
});
// output shape: [8, 256, 8, 8]

See Also

  • PyTorch torch.nn.functional.conv2d