export interface MultiheadAttentionOptions {
/**
* Total embedding dimension of the model. This is the dimension of query, key, and value embeddings
* before projection. Must be divisible by num_heads to split into independent attention heads.
* Common values: 256, 512, 768 (BERT), 1024, 2048.
*
* **Example:** embed_dim=512, num_heads=8 means each head processes 512/8=64 dimensions independently.
*/
embed_dim: number;
/**
* Number of parallel attention heads. Higher numbers enable learning diverse attention patterns but
* increase computation and parameters. Each head processes embed_dim/num_heads dimensions.
* Common values: 8, 12 (BERT), 16. Must divide embed_dim evenly.
*
* **Trade-offs:**
* - More heads: Better for capturing diverse patterns (syntax, semantics, position, etc.)
* - Fewer heads: Faster computation, fewer parameters, simpler learned patterns
* - Typical: 8-16 heads for reasonable size models
*/
num_heads: number;
/**
* Dropout probability applied to attention weights after softmax. Helps prevent overfitting by
* randomly zeroing attention connections during training. Set to 0 (default) for no dropout.
* Common values: 0.0 (no dropout), 0.1, 0.2. Automatically disabled during evaluation mode.
*
* **Effect:** Prevents co-adaptation of attention heads. During training, each head randomly
* ignores some key-value pairs with probability dropout.
*
* **Default:** 0.0 (no dropout)
*/
dropout?: number;
/**
* Whether to add learnable bias terms to query/key/value projections and output projection.
* Typically true for better expressiveness, false for minimal parameters. Most models use bias=true.
*
* **With bias=true:** Allows shifting the input space before dot-product attention.
* **With bias=false:** Simpler model, marginal accuracy difference in most cases.
*
* **Default:** true
*/
bias?: boolean;
/**
* Whether to append a learnable key and value bias vector to the key/value sequences.
* This is a specialized technique used in some Transformer variants. Rarely used in practice.
*
* **Effect:** When true, appends learned vectors to key and value sequences before attention
* (matching PyTorch's add_bias_kv, which adds bias to the key/value sequences at dim=0).
* Used in some variants for learnable positional biasing or auxiliary tokens.
*
* **Default:** false
*/
add_bias_kv?: boolean;
/**
* Whether to append a zero attention vector to key and value sequences. This is a specialized
* attention variant. Rarely used in modern architectures. When true, appends a batch of zeros to
* key/value sequences before computing attention (PyTorch appends at dim=1).
*
* **Effect:** Provides additional "no attention" positions that can attend to nothing explicitly.
*
* **Default:** false
*/
add_zero_attn?: boolean;
/**
* Total dimension of the key embeddings. Used for cross-attention where key comes from a different
* source than query. For self-attention, usually equals embed_dim. For cross-attention (like
* encoder-decoder), can differ if encoder has different hidden dimension.
*
* **Cross-attention example:** embed_dim=512 (decoder), kdim=768 (encoder output dimension)
* **Self-attention:** kdim defaults to embed_dim
*
* **Default:** undefined (uses embed_dim)
*/
kdim?: number;
/**
* Total dimension of the value embeddings. Used for cross-attention where value comes from a
* different source. Usually equals kdim. For self-attention, equals embed_dim.
*
* **Purpose:** Allows projecting encoder outputs to different space for cross-attention.
* **Cross-attention example:** vdim=768 when attending to encoder outputs with 768 dimensions
* **Self-attention:** vdim defaults to embed_dim
*
* **Default:** undefined (uses embed_dim)
*/
vdim?: number;
/**
* Whether input/output tensors follow batch-first format. Controls expected shape convention.
* **batch_first=false (default):** Shapes are (sequence_length, batch, embedding_dim)
* - Matches PyTorch default and RNN conventions
* - More natural for variable-length sequences
* **batch_first=true:** Shapes are (batch, sequence_length, embedding_dim)
* - More intuitive for most users (batch first like CNN conventions)
* - Requires automatic transposition internally
*
* **Behavior:** This parameter only controls shape interpretation. Internally, computation
* typically uses sequence-first format for efficiency; transposition is automatic based on this flag.
*
* **Example:**
* - batch_first=false: input shape [50, 32, 512] = [seq_len=50, batch=32, embed=512]
* - batch_first=true: input shape [32, 50, 512] = [batch=32, seq_len=50, embed=512]
*
* **Default:** false (sequence-first, matching PyTorch default)
*/
batch_first?: boolean;
}
embed_dim (number) – Total embedding dimension of the model. This is the dimension of query, key, and value embeddings before projection. Must be divisible by num_heads to split into independent attention heads. Common values: 256, 512, 768 (BERT), 1024, 2048. Example: embed_dim=512, num_heads=8 means each head processes 512/8=64 dimensions independently.
num_heads (number) – Number of parallel attention heads. Higher numbers enable learning diverse attention patterns but increase computation and parameters. Each head processes embed_dim/num_heads dimensions. Common values: 8, 12 (BERT), 16. Must divide embed_dim evenly. Trade-offs: - More heads: Better for capturing diverse patterns (syntax, semantics, position, etc.) - Fewer heads: Faster computation, fewer parameters, simpler learned patterns - Typical: 8-16 heads for reasonable size models
dropout (number, optional) – Dropout probability applied to attention weights after softmax. Helps prevent overfitting by randomly zeroing attention connections during training. Set to 0 (default) for no dropout. Common values: 0.0 (no dropout), 0.1, 0.2. Automatically disabled during evaluation mode. Effect: Prevents co-adaptation of attention heads. During training, each head randomly ignores some key-value pairs with probability dropout. Default: 0.0 (no dropout)
bias (boolean, optional) – Whether to add learnable bias terms to query/key/value projections and output projection. Typically true for better expressiveness, false for minimal parameters. Most models use bias=true. With bias=true: Allows shifting the input space before dot-product attention. With bias=false: Simpler model, marginal accuracy difference in most cases. Default: true
add_bias_kv (boolean, optional) – Whether to append a learnable key and value bias vector to the key/value sequences. This is a specialized technique used in some Transformer variants. Rarely used in practice. Effect: When true, appends learned vectors to key and value sequences before attention. Used in some variants for learnable positional biasing or auxiliary tokens. Default: false
add_zero_attn (boolean, optional) – Whether to append a zero attention vector to key and value sequences. This is a specialized attention variant. Rarely used in modern architectures. When true, appends zero vectors to key/value sequences before computing attention. Effect: Provides additional "no attention" positions that can attend to nothing explicitly. Default: false
kdim (number, optional) – Total dimension of the key embeddings. Used for cross-attention where key comes from a different source than query. For self-attention, usually equals embed_dim. For cross-attention (like encoder-decoder), can differ if encoder has different hidden dimension. Cross-attention example: embed_dim=512 (decoder), kdim=768 (encoder output dimension). Self-attention: kdim defaults to embed_dim. Default: undefined (uses embed_dim)
vdim (number, optional) – Total dimension of the value embeddings. Used for cross-attention where value comes from a different source. Usually equals kdim. For self-attention, equals embed_dim. Purpose: Allows projecting encoder outputs to different space for cross-attention. Cross-attention example: vdim=768 when attending to encoder outputs with 768 dimensions. Self-attention: vdim defaults to embed_dim. Default: undefined (uses embed_dim)
batch_first (boolean, optional) – Whether input/output tensors follow batch-first format. Controls expected shape convention. batch_first=false (default): Shapes are (sequence_length, batch, embedding_dim) - Matches PyTorch default and RNN conventions - More natural for variable-length sequences. batch_first=true: Shapes are (batch, sequence_length, embedding_dim) - More intuitive for most users (batch first like CNN conventions) - Requires automatic transposition internally. Behavior: This parameter only controls shape interpretation. Internally, computation typically uses sequence-first format for efficiency; transposition is automatic based on this flag. Example: - batch_first=false: input shape [50, 32, 512] = [seq_len=50, batch=32, embed=512] - batch_first=true: input shape [32, 50, 512] = [batch=32, seq_len=50, embed=512]. Default: false (sequence-first, matching PyTorch default)