export interface TransformerEncoderLayerOptions {
  /**
   * Total embedding dimension of the model. All sub-layers (attention, FFN) project to this dimension.
   * Must be divisible by nhead to enable multi-head attention. Common values: 256, 512 (original
   * Transformer), 768 (BERT-base, GPT-2 small), 1024 (BERT-large, T5-large), 1280 (GPT-2 large).
   *
   * **Example:** d_model=512 with nhead=8 means each attention head processes 512/8=64 dimensions.
   */
  d_model: number;
  /**
   * Number of parallel attention heads in the multi-head attention sublayer. Higher values enable
   * learning diverse attention patterns (different heads focus on different aspects of input).
   * Common values: 8 (original Transformer), 12 (BERT-base, GPT-2 small), 16 (BERT-large, T5-large).
   *
   * **Constraint:** d_model must be divisible by nhead (enforced in MultiheadAttention constructor).
   * **Trade-off:** More heads = more parameters and computation, but potentially better expressiveness.
   */
  nhead: number;
  /**
   * Dimensionality of the intermediate feed-forward network layer. The FFN projects from d_model to
   * dim_feedforward, applies activation, then projects back to d_model. This intermediate expansion
   * enables non-linear feature combinations.
   *
   * **Common ratio:** Typically 4x d_model (e.g., d_model=512 → dim_feedforward=2048).
   * **Effect:** Larger values increase model capacity and parameters, but also computation and memory.
   * **Recommendation:** Keep at 4x d_model unless you have specific reasons to change.
   *
   * **Default:** 2048
   */
  dim_feedforward?: number;
  /**
   * Dropout probability applied to attention weights and between FFN layers. Helps prevent overfitting
   * by randomly dropping attention connections and intermediate activations during training.
   * Automatically disabled in evaluation mode (when the layer's training flag is false).
   *
   * **Common values:** 0.0 (no dropout), 0.1 (10%), 0.2 (20%) for regularization.
   * **Effect:** Higher values = more regularization = potentially better generalization but slower convergence.
   * **Typical:** 0.1 for most models, 0.2 for smaller datasets to prevent overfitting.
   *
   * **Default:** 0.1
   */
  dropout?: number;
  /**
   * Activation function used in the feed-forward network. Two options: 'relu' (original Transformer)
   * or 'gelu' (used in BERT, GPT-2/3). ReLU is simpler and faster, GELU is smoother.
   *
   * **'relu':** max(0, x). Sharp activation, was standard in "Attention is All You Need" (2017).
   * Simpler, slightly faster, but has zero gradient for negative inputs (can cause "dead" units).
   * **'gelu':** x * Φ(x) where Φ is the CDF of the standard normal distribution. Smoother, used in
   * modern models (BERT, GPT-2). Better gradient flow, better generalization in practice.
   *
   * **Recommendation:** Use 'gelu' for new models unless you have compatibility requirements.
   *
   * **Default:** 'relu'
   */
  activation?: 'relu' | 'gelu';
  /**
   * Epsilon value for layer normalization to ensure numerical stability. The LayerNorm computes
   * (x - mean) / sqrt(var + eps). Prevents division by zero when variance is very small.
   *
   * **Typical values:** 1e-5, 1e-6, 1e-12.
   * **Effect:** Smaller values keep normalization closer to the exact statistics but risk numerical
   * instability (overflow/NaN) when variance is near zero. Larger values are numerically safer but
   * slightly bias the normalization.
   * **Recommendation:** Keep at default 1e-5 unless debugging numerical issues.
   *
   * **Default:** 1e-5
   */
  layer_norm_eps?: number;
  /**
   * Input/output shape format convention. Controls how tensors are interpreted without changing computation.
   *
   * **batch_first=false (default):** Shape is (sequence_length, batch_size, d_model).
   * - Standard in PyTorch RNNs and original Transformer implementations.
   * - More natural for variable-length sequences (pad to max_len).
   * - Internally, attention computation prefers this order for efficiency.
   *
   * **batch_first=true:** Shape is (batch_size, sequence_length, d_model).
   * - More intuitive for users familiar with CNN conventions (batch first).
   * - Requires automatic transposition internally (negligible overhead).
   *
   * **Behavior:** Setting this flag only changes shape interpretation. Computation remains identical.
   *
   * **Default:** false (sequence-first)
   */
  batch_first?: boolean;
  /**
   * Layer normalization placement within the block (Pre-LN vs Post-LN architecture).
   *
   * **norm_first=false (Post-LN, default):** LayerNorm applied AFTER each sub-layer.
   * - Original "Attention is All You Need" design (2017).
   * - Formula: x' = LayerNorm(x + SubLayer(x))
   * - Can suffer from training instability with many layers (gradient scaling issues).
   *
   * **norm_first=true (Pre-LN):** LayerNorm applied BEFORE each sub-layer.
   * - Formula: x' = x + SubLayer(LayerNorm(x))
   * - More stable for deep models (12+ layers). Used in GPT-2, GPT-3, modern Transformers.
   * - Better gradient flow, enables training without careful learning rate tuning.
   *
   * **Recommendation:** Use norm_first=true for models with 12+ layers or when training is unstable.
   *
   * **Default:** false (Post-LN, matching original Transformer)
   */
  norm_first?: boolean;
}
d_model (number) – Total embedding dimension of the model. All sub-layers (attention, FFN) project to this dimension. Must be divisible by nhead to enable multi-head attention. Common values: 256, 512 (original Transformer), 768 (BERT-base, GPT-2 small), 1024 (BERT-large, T5-large), 1280 (GPT-2 large). Example: d_model=512 with nhead=8 means each attention head processes 512/8=64 dimensions.
nhead (number) – Number of parallel attention heads in the multi-head attention sublayer. Higher values enable learning diverse attention patterns (different heads focus on different aspects of input). Common values: 8 (original Transformer), 12 (BERT-base, GPT-2 small), 16 (BERT-large, T5-large). Constraint: d_model must be divisible by nhead (enforced in MultiheadAttention constructor). Trade-off: More heads = more parameters and computation, but potentially better expressiveness.
dim_feedforward(number)optional- – Dimensionality of the intermediate feed-forward network layer. The FFN projects from d_model to dim_feedforward, applies activation, then projects back to d_model. This intermediate expansion enables non-linear feature combinations. Common ratio: Typically 4x d_model (e.g., d_model=512 → dim_feedforward=2048). Effect: Larger values increase model capacity and parameters, but also computation and memory. Recommendation: Keep at 4x d_model unless you have specific reasons to change. Default: 2048
dropout(number)optional- – Dropout probability applied to attention weights and between FFN layers. Helps prevent overfitting by randomly dropping attention connections and intermediate activations during training. Automatically disabled during evaluation mode (self.training=false). Common values: 0.0 (no dropout), 0.1 (10%), 0.2 (20%) for regularization. Effect: Higher values = more regularization = potentially better generalization but slower convergence. Typical: 0.1 for most models, 0.2 for smaller datasets to prevent overfitting. Default: 0.1
activation('relu' | 'gelu')optional- – Activation function used in the feed-forward network. Two options: 'relu' (original Transformer) or 'gelu' (used in BERT, GPT-2/3). ReLU is simpler and faster, GELU is smoother. 'relu': max(0, x). Sharp activation, was standard in "Attention is All You Need" (2017). Simpler, slightly faster, but gradient can be unstable. 'gelu': x * Φ(x) where Φ is CDF of normal distribution. Smoother, used in modern models (BERT, GPT-2). Better gradient flow, better generalization in practice. Recommendation: Use 'gelu' for new models unless you have compatibility requirements. Default: 'relu'
layer_norm_eps (number, optional) – Epsilon value for layer normalization to ensure numerical stability. The LayerNorm computes (x - mean) / sqrt(var + eps). Prevents division by zero when variance is very small. Typical values: 1e-5, 1e-6, 1e-12. Effect: Smaller values keep normalization closer to the exact statistics but risk numerical instability (overflow/NaN) when variance is near zero; larger values are numerically safer but slightly bias the normalization. Recommendation: Keep at default 1e-5 unless debugging numerical issues. Default: 1e-5
batch_first(boolean)optional- – Input/output shape format convention. Controls how tensors are interpreted without changing computation. batch_first=false (default): Shape is (sequence_length, batch_size, d_model). - Standard in PyTorch RNNs and original Transformer paper. - More natural for variable-length sequences (pad to max_len). - Internally, attention computation prefers this order for efficiency. batch_first=true: Shape is (batch_size, sequence_length, d_model). - More intuitive for users familiar with CNN conventions (batch first). - Requires automatic transposition internally (negligible overhead). Behavior: Setting this flag only changes shape interpretation. Computation remains identical. Default: false (sequence-first)
norm_first(boolean)optional- – Layer normalization placement within the block (Pre-LN vs Post-LN architecture). norm_first=false (Post-LN, default): LayerNorm applied AFTER each sub-layer. - Original "Attention is All You Need" design (2017). - Formula: x' = LayerNorm(x + SubLayer(x)) - Can suffer from training instability with many layers (gradient scaling issues). norm_first=true (Pre-LN): LayerNorm applied BEFORE each sub-layer. - Formula: x' = x + SubLayer(LayerNorm(x)) - More stable for deep models (12+ layers). Used in GPT-2, GPT-3, modern Transformers. - Better gradient flow, enables training without careful learning rate tuning. Recommendation: Use norm_first=true for models with 12+ layers or when training is unstable. Default: false (Post-LN, matching original Transformer)