torch.nn.functional.conv_transpose2d
function conv_transpose2d<S extends Shape, D extends DType = DType, Dev extends DeviceType = DeviceType>(input: Tensor<S, D, Dev>, weight: Tensor): Tensor<Shape, D, Dev>

function conv_transpose2d<S extends Shape, D extends DType = DType, Dev extends DeviceType = DeviceType>(input: Tensor<S, D, Dev>, weight: Tensor, bias: Tensor | null, stride: number | [number, number], padding: number | [number, number], output_padding: number | [number, number], groups: number, dilation: number | [number, number], options: ConvTranspose2dFunctionalOptions): Tensor<Shape, D, Dev>

2D transposed convolution ("deconvolution"): upsamples spatial dimensions with learned parameters.
Applies a transposed 2D convolution to upsample spatial dimensions. This is NOT true deconvolution (which would mathematically invert a convolution), but learnable upsampling with a transposed kernel. It is the inverse of standard convolution in terms of spatial shape: stride > 1 increases spatial size instead of decreasing it. Essential for:
- Generative models (GANs, VAEs) - learnable upsampling for image generation
- Semantic segmentation (FCN, U-Net, DeepLab) - restoring spatial resolution
- Super-resolution and image enhancement (SRGAN, upsampling for perceptual losses)
- Video frame generation and temporal upsampling
- Dense prediction tasks requiring high-resolution outputs
- Replacing bilinear/nearest-neighbor upsampling with learned parameters
How transposed convolution works: the opposite of standard convolution, in that stride > 1 enlarges spatial dimensions instead of shrinking them. Think of it as spreading each input value over a larger grid, then applying a convolution. Mathematically it is equivalent to zero-interleaving and padding the input, then applying a standard convolution with stride=1.
Key insight: Output size increases with stride (stride=2 roughly quadruples spatial area). The kernel parameters let the layer learn fine-grained detail during upsampling, which is better than fixed upsampling.
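The zero-interleaving view can be made concrete with a tiny 1D sketch (plain arrays, no tensor library; 1D for brevity, since the 2D case factorizes the same way per axis): a transposed convolution with stride s equals inserting s-1 zeros between input samples, padding by kernel_size-1, then running an ordinary stride-1 correlation with the flipped kernel.

```typescript
// Direct 1D transposed convolution (cross-correlation convention),
// stride s, no padding: output length = (n - 1) * s + k.
function convTranspose1d(x: number[], w: number[], s: number): number[] {
  const out = new Array<number>((x.length - 1) * s + w.length).fill(0);
  for (let j = 0; j < x.length; j++) {
    for (let t = 0; t < w.length; t++) {
      out[j * s + t] += x[j] * w[t]; // each input value "stamps" a scaled kernel
    }
  }
  return out;
}

// Same result via zero-insertion: dilate the input with s-1 zeros between
// samples, pad by k-1 on both ends, then run a stride-1 correlation with
// the FLIPPED kernel.
function viaZeroInsertion(x: number[], w: number[], s: number): number[] {
  const k = w.length;
  const dilated = new Array<number>((x.length - 1) * s + 1).fill(0);
  x.forEach((v, j) => { dilated[j * s] = v; });
  const pad = new Array<number>(k - 1).fill(0);
  const padded = [...pad, ...dilated, ...pad];
  const flipped = [...w].reverse();
  const out: number[] = [];
  for (let i = 0; i + k <= padded.length; i++) {
    out.push(flipped.reduce((acc, wv, t) => acc + wv * padded[i + t], 0));
  }
  return out;
}

console.log(convTranspose1d([1, 2, 3], [1, 1], 2)); // [1, 1, 2, 2, 3, 3]
```

Note how each input value is copied (scaled by the kernel) into a stride-spaced window of the output, which is exactly why stride > 1 enlarges rather than shrinks.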
Common architectures:
- GAN generators: Stack transposed convolutions for progressive upsampling
- U-Net decoder: Transpose conv or concatenation + conv for skip connections
- FCN decoder: Transposed convolutions to restore input resolution
- Progressive GAN: Increasingly larger transposed conv layers (ProGAN)
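The repeated 2x doublings in these decoder stacks can be sanity-checked with shape arithmetic alone. A small sketch (hypothetical `outSize` helper, no tensors involved) shows that kernel=4, stride=2, padding=1, output_padding=0 doubles each spatial dimension exactly:

```typescript
// Spatial output size of a transposed convolution along one axis.
function outSize(n: number, k: number, s: number, p: number, op = 0, d = 1): number {
  return (n - 1) * s - 2 * p + d * (k - 1) + op + 1;
}

// Hypothetical generator stack: six layers, each with kernel=4, stride=2,
// padding=1, output_padding=0 -- every layer exactly doubles H and W.
const progression: number[] = [1];
for (let layer = 0; layer < 6; layer++) {
  progression.push(outSize(progression[progression.length - 1], 4, 2, 1));
}
console.log(progression); // [1, 2, 4, 8, 16, 32, 64]
```

Running the same check on your own layer configs before training catches off-by-one spatial mismatches early.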
Output size formula: output_size = (input_size - 1) * stride - 2 * padding + dilation * (kernel_size - 1) + output_padding + 1
- Learnable upsampling: Better than fixed bilinear/nearest; learns detail patterns
- Stride-determined expansion: stride=2 means ~4x spatial area, stride=3 means ~9x, etc.
- Checkerboard artifacts: Can create patterns when kernel_size is not divisible by stride; mitigated by kernel_size ≥ stride (ideally a multiple of stride)
- Output padding: Disambiguates the output size when stride > 1; a common exact-2x configuration is stride=2, kernel=4, padding=1, output_padding=0
- GAN standard: Transposed convolution is the de facto standard for GAN generators (rather than bilinear upsampling)
- Gradient flow: Fully differentiable; gradients flow efficiently through upsampling
- Checkerboard artifacts: Naive transposed conv (kernel_size not divisible by stride) creates visual artifacts
- Kernel-stride relationship: Use kernel_size ≥ stride to avoid artifacts (e.g., kernel=4 for stride=2)
- Output size ambiguity: Multiple (padding, output_padding) pairs give same output size
- Memory cost: Upsampling layers often memory-intensive; watch batch size
- NOT true deconvolution: Doesn't invert convolution exactly; learnable but imperfect inverse
- Dilation support limited: Some implementations don't support dilation > 1 on transposed convolution
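The output-size ambiguity noted above is easy to see by solving the output formula for (padding, output_padding) at a fixed target size. A small search helper (a sketch, assuming PyTorch's constraint that output_padding must be smaller than stride or dilation):

```typescript
// Spatial output size of a transposed convolution along one axis.
function outSize(n: number, k: number, s: number, p: number, op = 0, d = 1): number {
  return (n - 1) * s - 2 * p + d * (k - 1) + op + 1;
}

// All (padding, output_padding) pairs hitting `target` for fixed n, k, s.
// Assumes the PyTorch-style rule: output_padding < max(stride, dilation).
function configsForTarget(
  n: number, k: number, s: number, target: number, d = 1
): Array<[number, number]> {
  const pairs: Array<[number, number]> = [];
  for (let p = 0; p <= k; p++) {
    const op = target - ((n - 1) * s - 2 * p + d * (k - 1) + 1);
    if (op >= 0 && op < Math.max(s, d)) pairs.push([p, op]);
  }
  return pairs;
}

// With a large stride, several pairs yield the same output size:
console.log(configsForTarget(4, 4, 4, 15)); // [[1, 1], [2, 3]]
```

Because multiple configurations are size-equivalent but not value-equivalent (they crop different border regions), pick one convention and use it consistently across a model.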
Parameters
input: Tensor<S, D, Dev> - Input tensor of shape [N, C_in, H, W] (typically small spatial dims in a decoder)
- N: batch size
- C_in: number of input channels
- H, W: input spatial height and width

weight: Tensor - Learnable filter tensor of shape [C_in, C_out/groups, kH, kW]
- C_in: input channels (matches input.shape[1])
- C_out/groups: output channels per group (total C_out = C_out_per_group * groups)
- kH, kW: kernel height and width
Returns
Tensor<Shape, D, Dev> - Output tensor of shape [N, C_out, H_out, W_out]
- H_out = (H_in - 1) * strideH - 2 * padH + dilH * (kH - 1) + output_padH + 1
- W_out = (W_in - 1) * strideW - 2 * padW + dilW * (kW - 1) + output_padW + 1

Examples
// GAN generator: learnable upsampling from latent code
const latent = torch.randn([32, 128, 4, 4]); // [batch=32, channels=128, H=4, W=4]
const kernel = torch.randn([128, 256, 5, 5]); // [in=128, out=256, kH=5, kW=5]
const upsampled = torch.nn.functional.conv_transpose2d(
latent, kernel, undefined, 2, 2, 1
);
// stride=2: roughly 4x spatial expansion -> [32, 256, 8, 8]
// GAN stacks multiple layers: 4→8→16→32→64 for full resolution

// U-Net decoder: restore spatial resolution with learned features
const encoded = torch.randn([8, 512, 8, 8]); // Encoder output
const weight = torch.randn([512, 256, 4, 4]);
const decoded = torch.nn.functional.conv_transpose2d(
encoded, weight, undefined, 2, 1, 0
);
// Spatial dims: 8×8 → 16×16 with learned upsampling
// Skip connections concatenate high-res features from encoder

// Semantic segmentation: pixel-wise prediction upsampling
const features = torch.randn([4, 512, 16, 16]); // Feature maps from encoder
// Multiple transpose conv layers restore resolution
let x = features;
for (let i = 0; i < 4; i++) {
const kernel = torch.randn([x.shape[1], 256, 4, 4]); // Adaptive kernel
x = torch.nn.functional.conv_transpose2d(x, kernel, undefined, 2, 1, 0);
}
// 16×16 → 32×32 → 64×64 → 128×128 → 256×256
// Final shape: [4, 256, 256, 256] for full-resolution segmentation

// Super-resolution: upscale low-res image to high-res
const lowres = torch.randn([1, 3, 32, 32]); // Low-res input
const kernel2x = torch.randn([3, 32, 3, 3]);
const medres = torch.nn.functional.conv_transpose2d(lowres, kernel2x, undefined, 2, 1, 1);
// medres: [1, 32, 64, 64]
const kernel_final = torch.randn([32, 3, 3, 3]);
const highres = torch.nn.functional.conv_transpose2d(medres, kernel_final, undefined, 1, 1, 0);
// highres: [1, 3, 64, 64] - 2x super-resolved RGB image

// Progressive GAN: learnable hierarchical upsampling
const z = torch.randn([8, 512, 1, 1]); // Noise vector (1x1 spatial)
let x = z;
// Progressive upsampling: 1 → 2 → 4 → 8 → 16 → 32 → 64 (2x per layer)
const sizes = [2, 4, 8, 16, 32, 64];
for (const size of sizes) {
// kernel=4, stride=2, padding=1, output_padding=0 doubles spatial dims to `size`
const kernel = torch.randn([x.shape[1], 256, 4, 4]);
x = torch.nn.functional.conv_transpose2d(x, kernel, undefined, 2, 1, 0);
}
// Final: [8, 256, 64, 64] high-resolution generated image

// Fractional upsampling: non-power-of-2 scaling via output_padding
const input = torch.randn([1, 64, 16, 16]);
const kernel = torch.randn([64, 128, 3, 3]);
// output_padding helps with stride choices for exact target sizes
const output = torch.nn.functional.conv_transpose2d(
input, kernel, undefined, 2, 1, [1, 0]
);
// output: [1, 128, 32, 31] - per-axis output_padding for exact target sizes
// Careful size management for architectures requiring specific dimensions

See Also
- PyTorch torch.nn.functional.conv_transpose2d
- conv2d - Regular convolution (downsampling spatial dims with stride > 1)
- upsample - Non-learnable upsampling (bilinear, nearest-neighbor)
- interpolate - Non-learnable spatial resampling
- conv_transpose1d - 1D variant for sequences
- conv_transpose3d - 3D variant for volumetric data