
Appendix H: Safety Differential Analysis

Analysis of 998 CANN 8.5 kernel pairs (AscendC C++ vs ascend-rs Rust).

For each kernel, we identify which memory safety vulnerabilities exist in the C++ version and how the Rust transpilation prevents them.

H.1 Safety Class Summary

| # | Safety Class | C++ Risk | Rust Prevention | Kernels Affected |
|---|---|---|---|---|
| 1 | Type Confusion | `GM_ADDR` type erasure | Typed pointer signature (`*const T`) | 983/998 (98.5%) |
| 2 | Buffer Overflow | `GetValue(i)` / `SetValue(i, v)` with `i >= count` | Opaque buffer ID + explicit count parameter | 9/998 (0.9%) |
| 3 | Use-After-Free | `FreeTensor()` leaves a stale handle | No `FreeTensor` operation in the ascend-rs API | 3/998 (0.3%) |
| 4 | Missing Synchronization | DMA→compute without `pipe_barrier()` | `kernel_ops` composites include barriers internally | 793/998 (79.5%) |
| 5 | Double Free | `FreeTensor()` called twice on the same handle | No `FreeTensor` operation in the ascend-rs API | 3/998 (0.3%) |
| 6 | Integer Overflow | u32 arithmetic: `blockIdx * perBlockLen` | `wrapping_mul` makes overflow semantics explicit | 785/998 (78.7%) |

H.2 Category Breakdown

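| Category | Total | C1: Type | C2: Bounds | C3: UAF | C4: Sync | C5: DblFree | C6: Overflow |
|---|---|---|---|---|---|---|---|
| ops_index | 114 | 114 | 3 | 0 | 76 | 0 | 76 |
| ops_legacy | 200 | 200 | 0 | 0 | 136 | 0 | 128 |
| ops_math | 120 | 120 | 0 | 0 | 84 | 0 | 84 |
| ops_nn | 150 | 150 | 6 | 3 | 129 | 3 | 129 |
| ops_optimizer | 82 | 82 | 0 | 0 | 62 | 0 | 62 |
| ops_reduce | 80 | 80 | 0 | 0 | 80 | 0 | 80 |
| ops_resize | 52 | 52 | 0 | 0 | 52 | 0 | 52 |
| ops_transformer | 200 | 185 | 0 | 0 | 174 | 0 | 174 |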
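As a consistency check, the per-category counts in H.2 should sum to the per-class totals in H.1. The sketch below transcribes the H.2 rows and recomputes the column totals and one-decimal percentages; the names `ROWS`, `column_totals` and `percent` are illustrative only.

```rust
// Per-category counts transcribed from the H.2 table:
// (category, total kernels, [C1, C2, C3, C4, C5, C6])
const ROWS: [(&str, u32, [u32; 6]); 8] = [
    ("ops_index", 114, [114, 3, 0, 76, 0, 76]),
    ("ops_legacy", 200, [200, 0, 0, 136, 0, 128]),
    ("ops_math", 120, [120, 0, 0, 84, 0, 84]),
    ("ops_nn", 150, [150, 6, 3, 129, 3, 129]),
    ("ops_optimizer", 82, [82, 0, 0, 62, 0, 62]),
    ("ops_reduce", 80, [80, 0, 0, 80, 0, 80]),
    ("ops_resize", 52, [52, 0, 0, 52, 0, 52]),
    ("ops_transformer", 200, [185, 0, 0, 174, 0, 174]),
];

/// Sum the Total column and each safety-class column across all categories.
fn column_totals() -> (u32, [u32; 6]) {
    let mut total = 0;
    let mut per_class = [0u32; 6];
    for (_, t, classes) in ROWS.iter() {
        total += *t;
        for (acc, c) in per_class.iter_mut().zip(classes.iter()) {
            *acc += *c;
        }
    }
    (total, per_class)
}

/// Percentage of kernels affected, rounded to one decimal place.
fn percent(affected: u32, total: u32) -> f64 {
    (affected as f64 * 1000.0 / total as f64).round() / 10.0
}

fn main() {
    let (total, per_class) = column_totals();
    for (i, n) in per_class.iter().enumerate() {
        println!("C{}: {}/{} ({}%)", i + 1, n, total, percent(*n, total));
    }
}
```

Running it reproduces the "Kernels Affected" column of H.1 from the H.2 rows.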

H.3 Counter-Example Inputs

For each safety class, we give a counter-example input that triggers the vulnerability in the C++ version but is caught or prevented in the Rust version.

Class 1: Type Confusion

Trigger: Pass f16 data to f32 kernel

C++ behavior: Silent data corruption (interprets f16 bits as f32)

Rust behavior: Compile-time type error (*const u16 ≠ *const f32)

Example kernel: foreach_exp_f32

Evidence: Uses GM_ADDR (type-erased uint8_t*)
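A minimal sketch of why the Rust side rejects this input, assuming (hypothetically) a kernel entry point that takes typed raw pointers and that f16 buffers are carried as raw u16 bit patterns. The function name `exp_f32` is illustrative, not the actual ascend-rs signature.

```rust
/// Hypothetical f32 kernel body: the element type is part of the signature,
/// unlike a type-erased GM_ADDR (uint8_t*) in AscendC C++.
///
/// # Safety
/// `input` and `output` must each be valid for `len` elements.
unsafe fn exp_f32(input: *const f32, output: *mut f32, len: usize) {
    for i in 0..len {
        unsafe {
            *output.add(i) = (*input.add(i)).exp();
        }
    }
}

fn main() {
    let x = [0.0f32, 1.0];
    let mut y = [0.0f32; 2];
    unsafe { exp_f32(x.as_ptr(), y.as_mut_ptr(), x.len()) };
    println!("{:?}", y);

    // f16 data as raw IEEE half bit patterns (assumption): 0x3C00 encodes 1.0.
    let h: [u16; 2] = [0x3C00, 0x0000];
    // unsafe { exp_f32(h.as_ptr(), y.as_mut_ptr(), h.len()) };
    // ^ does not compile: expected `*const f32`, found `*const u16` (E0308)
    let _ = h;
}
```

In C++ the same call compiles because both buffers decay to `GM_ADDR`; here the mismatch surfaces before the kernel ever runs.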


Class 2: Buffer Overflow

Trigger: count = buffer_size + 1

C++ behavior: Out-of-bounds SRAM read/write (undefined behavior)

Rust behavior: Buffer ID abstraction prevents raw indexing

Example kernel: foreach_dropout_f32

Evidence: Uses GetValue (unchecked index) and array indexing
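The sketch below illustrates the opaque-ID idea under stated assumptions: `UbPool`, `BufId`, `muls` and `OpError` are hypothetical names, and the explicit count is checked against buffer capacity before any element is touched. It does not claim that ascend-rs validates counts in exactly this way; the point is that callers never obtain a raw index into on-chip memory.

```rust
/// Opaque handle to an on-chip buffer (hypothetical). It offers no element
/// access of its own; all reads and writes go through the pool.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct BufId(usize);

#[derive(Debug, PartialEq, Eq)]
enum OpError {
    CountTooLarge { count: usize, capacity: usize },
}

/// Hypothetical pool standing in for unified-buffer storage.
struct UbPool {
    bufs: Vec<Vec<f32>>,
}

impl UbPool {
    fn new() -> Self {
        UbPool { bufs: Vec::new() }
    }

    fn alloc(&mut self, capacity: usize) -> BufId {
        self.bufs.push(vec![0.0; capacity]);
        BufId(self.bufs.len() - 1)
    }

    fn fill(&mut self, id: BufId, value: f32) {
        self.bufs[id.0].fill(value);
    }

    /// Checked read: an out-of-range index yields None instead of stray memory.
    fn read(&self, id: BufId, i: usize) -> Option<f32> {
        self.bufs[id.0].get(i).copied()
    }

    /// dst[i] = src[i] * scalar for i in 0..count, after validating `count`
    /// against the capacity of both buffers.
    fn muls(&mut self, dst: BufId, src: BufId, scalar: f32, count: usize) -> Result<(), OpError> {
        let capacity = self.bufs[dst.0].len().min(self.bufs[src.0].len());
        if count > capacity {
            return Err(OpError::CountTooLarge { count, capacity });
        }
        for i in 0..count {
            let v = self.bufs[src.0][i] * scalar;
            self.bufs[dst.0][i] = v;
        }
        Ok(())
    }
}

fn main() {
    let mut pool = UbPool::new();
    let src = pool.alloc(4);
    let dst = pool.alloc(4);
    pool.fill(src, 2.0);
    // count = buffer_size + 1, the counter-example input above
    println!("{:?}", pool.muls(dst, src, 3.0, 5));
    println!("{:?}", pool.read(dst, 4));
}
```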


Class 3: Use-After-Free

Trigger: Free a buffer, then read through the stale handle

C++ behavior: Reads deallocated SRAM (garbage data)

Rust behavior: No free API exists; buffer lifetime is managed by the runtime

Example kernel: foreach_dropout_f32

Evidence: Calls FreeTensor(), after which the stale handle is still in scope and usable
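A minimal sketch of runtime-managed lifetime, assuming a hypothetical `Pool` type that owns all buffers and exposes no free method. Buffers are borrowed from the pool, so the borrow checker rejects any access that outlives it; memory is released only when the pool itself is dropped.

```rust
/// Hypothetical runtime-owned buffer pool. There is deliberately no `free`
/// method: memory is released only when the whole pool is dropped.
struct Pool {
    storage: Vec<Vec<f32>>,
}

impl Pool {
    fn new(n_bufs: usize, len: usize) -> Self {
        Pool { storage: vec![vec![0.0; len]; n_bufs] }
    }

    /// Borrow buffer `i`. The returned slice cannot outlive `self`.
    fn buf(&mut self, i: usize) -> &mut [f32] {
        &mut self.storage[i]
    }
}

fn main() {
    let mut pool = Pool::new(2, 8);
    let b = pool.buf(0);
    b[0] = 1.5;
    drop(pool); // the only "free": every buffer goes away together
    // b[0] = 2.0; // does not compile: `pool` is moved while `b` still borrows it
}
```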


Class 4: Missing Synchronization

Trigger: Remove the barrier between load and compute

C++ behavior: Reads stale/partial DMA data (non-deterministic)

Rust behavior: ascend_pipe_barrier() is always emitted between stages

Example kernel: foreach_add_list_f32

Evidence: Has 2 barriers; removing either one introduces a data race
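The composite pattern can be sketched as follows, assuming a hypothetical `pipeline` module that records operations instead of issuing real DMA and vector instructions (`Op::Barrier` stands in for ascend_pipe_barrier()). Because the single-step `emit` method is private to the module, outside code can only call the composite, which always places a barrier between the load and compute stages.

```rust
mod pipeline {
    /// Hypothetical stand-ins for DMA copies, vector ops and the pipe barrier.
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub enum Op {
        LoadGmToUb,
        Barrier,
        VecCompute,
        StoreUbToGm,
    }

    /// Records the instruction stream instead of executing it.
    pub struct Kernel {
        log: Vec<Op>,
    }

    impl Kernel {
        pub fn new() -> Self {
            Kernel { log: Vec::new() }
        }

        /// Private to `pipeline`: outside code cannot emit a lone load or compute.
        fn emit(&mut self, op: Op) {
            self.log.push(op);
        }

        /// Composite op: barriers are emitted unconditionally between stages.
        pub fn load_compute_store(&mut self) {
            self.emit(Op::LoadGmToUb);
            self.emit(Op::Barrier);
            self.emit(Op::VecCompute);
            self.emit(Op::Barrier);
            self.emit(Op::StoreUbToGm);
        }

        pub fn log(&self) -> &[Op] {
            &self.log
        }
    }
}

use pipeline::{Kernel, Op};

/// Returns false if a compute step follows a load with no barrier in between
/// (the C4 hazard).
fn loads_fenced(log: &[Op]) -> bool {
    let mut unfenced_load = false;
    for op in log {
        match op {
            Op::LoadGmToUb => unfenced_load = true,
            Op::Barrier => unfenced_load = false,
            Op::VecCompute if unfenced_load => return false,
            _ => {}
        }
    }
    true
}

fn main() {
    let mut k = Kernel::new();
    k.load_compute_store();
    println!("{:?} fenced={}", k.log(), loads_fenced(k.log()));
}
```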


Class 5: Double Free

Trigger: Call FreeTensor twice on the same LocalTensor

C++ behavior: Corrupts the queue free list (undefined behavior)

Rust behavior: No free API exists, so a double free is impossible

Example kernel: foreach_dropout_f32

Evidence: FreeTensor is called 54 times in the source
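ascend-rs avoids this class by omitting a free operation entirely. For comparison, the sketch below shows the general Rust ownership technique that would still rule out a double free if a release operation were added: the handle is neither Clone nor Copy, and `release` takes it by value, so a second call does not compile. `TensorHandle`, `SlotQueue` and `release` are hypothetical names, not AscendC or ascend-rs types.

```rust
/// Hypothetical owned handle to a queue slot. It is neither Clone nor Copy,
/// so exactly one binding owns it at any time.
#[derive(Debug)]
struct TensorHandle {
    slot: u32,
}

/// Hypothetical queue that tracks free slots.
struct SlotQueue {
    free_slots: Vec<u32>,
}

impl SlotQueue {
    fn alloc(&mut self) -> Option<TensorHandle> {
        self.free_slots.pop().map(|slot| TensorHandle { slot })
    }

    /// Takes the handle by value: the caller's binding is moved and unusable afterwards.
    fn release(&mut self, t: TensorHandle) {
        self.free_slots.push(t.slot);
    }
}

fn main() {
    let mut q = SlotQueue { free_slots: vec![0, 1] };
    let t = q.alloc().expect("a free slot");
    q.release(t);
    // q.release(t); // does not compile: use of moved value `t` (E0382)
    println!("free slots: {:?}", q.free_slots);
}
```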


Class 6: Integer Overflow

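Trigger: blockIdx = 1048576, perBlockLen = 4096 → the product is 2^20 × 2^12 = 2^32, which wraps to 0 in u32

C++ behavior: Silent wrap to 0, wrong memory offset

Rust behavior: wrapping_mul(4096) → 0, but the wrap is explicit at the call site (wrapping_mul never panics in any build mode; a plain * would panic on overflow in debug builds)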

Example kernel: foreach_dropout_f32

Evidence: Uses block index for offset calculation
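The counter-example can be reproduced with Rust's standard u32 methods. `wrapping_mul` and `checked_mul` are part of std; the two `offset_*` helpers are illustrative names. A plain `*` panics on overflow only when overflow checks are enabled, which is the default for debug builds but not for release builds.

```rust
/// Block offset with the wrap made explicit, as in the H.1 "Rust Prevention" column.
fn offset_wrapping(block_idx: u32, per_block_len: u32) -> u32 {
    block_idx.wrapping_mul(per_block_len)
}

/// Alternative: surface the overflow as None instead of wrapping.
fn offset_checked(block_idx: u32, per_block_len: u32) -> Option<u32> {
    block_idx.checked_mul(per_block_len)
}

fn main() {
    let (idx, len) = (1_048_576u32, 4096u32); // 2^20 * 2^12 = 2^32
    println!("{}", offset_wrapping(idx, len)); // wraps to 0
    println!("{:?}", offset_checked(idx, len)); // None
    // `idx * len` would panic in a debug build ("attempt to multiply with overflow").
}
```

Choosing `checked_mul` instead of `wrapping_mul` turns the silent wrong offset into a value the kernel must handle explicitly.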


H.4 Per-Kernel Safety Report (All 998 Kernels)

foreach_exp_f32 (ops_legacy, f32, ✓ real source): C1

foreach_exp_f16 (ops_legacy, f16, ✓ real source): C1

foreach_exp_bf16 (ops_legacy, bf16, ✓ real source): C1

foreach_abs_f32 (ops_legacy, f32, ✓ real source): C1

foreach_abs_f16 (ops_legacy, f16, ✓ real source): C1

foreach_abs_bf16 (ops_legacy, bf16, ✓ real source): C1

foreach_neg_f32 (ops_legacy, f32, ✓ real source): C1

foreach_neg_f16 (ops_legacy, f16, ✓ real source): C1

foreach_neg_bf16 (ops_legacy, bf16, ✓ real source): C1

foreach_sqrt_f32 (ops_legacy, f32, ✓ real source): C1

foreach_sqrt_f16 (ops_legacy, f16, ✓ real source): C1

foreach_sqrt_bf16 (ops_legacy, bf16, ✓ real source): C1

foreach_rsqrt_f32 (ops_legacy, f32, stub): C1, C4, C6

foreach_rsqrt_f16 (ops_legacy, f16, stub): C1, C4, C6

foreach_rsqrt_bf16 (ops_legacy, bf16, stub): C1, C4, C6

foreach_reciprocal_f32 (ops_legacy, f32, ✓ real source): C1

foreach_reciprocal_f16 (ops_legacy, f16, ✓ real source): C1

foreach_reciprocal_bf16 (ops_legacy, bf16, ✓ real source): C1

foreach_ln_f32 (ops_legacy, f32, stub): C1, C4, C6

foreach_ln_f16 (ops_legacy, f16, stub): C1, C4, C6

foreach_ln_bf16 (ops_legacy, bf16, stub): C1, C4, C6

foreach_log2_f32 (ops_legacy, f32, ✓ real source): C1

foreach_log2_f16 (ops_legacy, f16, ✓ real source): C1

foreach_log2_bf16 (ops_legacy, bf16, ✓ real source): C1

foreach_log10_f32 (ops_legacy, f32, ✓ real source): C1

foreach_log10_f16 (ops_legacy, f16, ✓ real source): C1

foreach_log10_bf16 (ops_legacy, bf16, ✓ real source): C1

foreach_ceil_f32 (ops_legacy, f32, stub): C1, C4, C6

foreach_ceil_f16 (ops_legacy, f16, stub): C1, C4, C6

foreach_ceil_bf16 (ops_legacy, bf16, stub): C1, C4, C6

foreach_floor_f32 (ops_legacy, f32, stub): C1, C4, C6

foreach_floor_f16 (ops_legacy, f16, stub): C1, C4, C6

foreach_floor_bf16 (ops_legacy, bf16, stub): C1, C4, C6

foreach_round_f32 (ops_legacy, f32, stub): C1, C4, C6

foreach_round_f16 (ops_legacy, f16, stub): C1, C4, C6

foreach_round_bf16 (ops_legacy, bf16, stub): C1, C4, C6

foreach_trunc_f32 (ops_legacy, f32, stub): C1, C4, C6

foreach_trunc_f16 (ops_legacy, f16, stub): C1, C4, C6

foreach_trunc_bf16 (ops_legacy, bf16, stub): C1, C4, C6

foreach_sign_f32 (ops_legacy, f32, ✓ real source): C1

foreach_sign_f16 (ops_legacy, f16, ✓ real source): C1

foreach_sign_bf16 (ops_legacy, bf16, ✓ real source): C1

foreach_not_f32 (ops_legacy, f32, stub): C1, C4, C6

foreach_not_f16 (ops_legacy, f16, stub): C1, C4, C6

foreach_not_bf16 (ops_legacy, bf16, stub): C1, C4, C6

foreach_bitwise_not_f32 (ops_legacy, f32, stub): C1, C4, C6

foreach_bitwise_not_f16 (ops_legacy, f16, stub): C1, C4, C6

foreach_bitwise_not_bf16 (ops_legacy, bf16, stub): C1, C4, C6

foreach_logical_not_f32 (ops_legacy, f32, stub): C1, C4, C6

foreach_logical_not_f16 (ops_legacy, f16, stub): C1, C4, C6

foreach_logical_not_bf16 (ops_legacy, bf16, stub): C1, C4, C6

foreach_clamp_f32 (ops_legacy, f32, stub): C1, C4, C6

foreach_clamp_f16 (ops_legacy, f16, stub): C1, C4, C6

foreach_clamp_bf16 (ops_legacy, bf16, stub): C1, C4, C6

foreach_add_list_f32 (ops_legacy, f32, ✓ real source): C1, C4

foreach_add_list_f16 (ops_legacy, f16, ✓ real source): C1, C4

foreach_add_list_bf16 (ops_legacy, bf16, ✓ real source): C1, C4

foreach_sub_list_f32 (ops_legacy, f32, ✓ real source): C1, C4

foreach_sub_list_f16 (ops_legacy, f16, ✓ real source): C1, C4

foreach_sub_list_bf16 (ops_legacy, bf16, ✓ real source): C1, C4

foreach_mul_list_f32 (ops_legacy, f32, ✓ real source): C1

foreach_mul_list_f16 (ops_legacy, f16, ✓ real source): C1

foreach_mul_list_bf16 (ops_legacy, bf16, ✓ real source): C1

foreach_div_list_f32 (ops_legacy, f32, ✓ real source): C1

foreach_div_list_f16 (ops_legacy, f16, ✓ real source): C1

foreach_div_list_bf16 (ops_legacy, bf16, ✓ real source): C1

foreach_max_list_f32 (ops_legacy, f32, stub): C1, C4, C6

foreach_max_list_f16 (ops_legacy, f16, stub): C1, C4, C6

foreach_max_list_bf16 (ops_legacy, bf16, stub): C1, C4, C6

foreach_min_list_f32 (ops_legacy, f32, stub): C1, C4, C6

foreach_min_list_f16 (ops_legacy, f16, stub): C1, C4, C6

foreach_min_list_bf16 (ops_legacy, bf16, stub): C1, C4, C6

foreach_pow_list_f32 (ops_legacy, f32, ✓ real source): C1

foreach_pow_list_f16 (ops_legacy, f16, ✓ real source): C1

foreach_pow_list_bf16 (ops_legacy, bf16, ✓ real source): C1

foreach_fmod_list_f32 (ops_legacy, f32, stub): C1, C4, C6

foreach_fmod_list_f16 (ops_legacy, f16, stub): C1, C4, C6

foreach_fmod_list_bf16 (ops_legacy, bf16, stub): C1, C4, C6

foreach_bitwise_and_list_f32 (ops_legacy, f32, stub): C1, C4, C6

foreach_bitwise_and_list_f16 (ops_legacy, f16, stub): C1, C4, C6

foreach_bitwise_and_list_bf16 (ops_legacy, bf16, stub): C1, C4, C6

foreach_bitwise_or_list_f32 (ops_legacy, f32, stub): C1, C4, C6

foreach_bitwise_or_list_f16 (ops_legacy, f16, stub): C1, C4, C6

foreach_bitwise_or_list_bf16 (ops_legacy, bf16, stub): C1, C4, C6

foreach_bitwise_xor_list_f32 (ops_legacy, f32, stub): C1, C4, C6

foreach_bitwise_xor_list_f16 (ops_legacy, f16, stub): C1, C4, C6

foreach_bitwise_xor_list_bf16 (ops_legacy, bf16, stub): C1, C4, C6

foreach_logical_and_list_f32 (ops_legacy, f32, stub): C1, C4, C6

foreach_logical_and_list_f16 (ops_legacy, f16, stub): C1, C4, C6

foreach_logical_and_list_bf16 (ops_legacy, bf16, stub): C1, C4, C6

foreach_logical_or_list_f32 (ops_legacy, f32, stub): C1, C4, C6

foreach_logical_or_list_f16 (ops_legacy, f16, stub): C1, C4, C6

foreach_logical_or_list_bf16 (ops_legacy, bf16, stub): C1, C4, C6

foreach_add_scalar_f32 (ops_legacy, f32, ✓ real source): C1

foreach_add_scalar_f16 (ops_legacy, f16, ✓ real source): C1

foreach_add_scalar_bf16 (ops_legacy, bf16, ✓ real source): C1

foreach_sub_scalar_f32 (ops_legacy, f32, ✓ real source): C1

foreach_sub_scalar_f16 (ops_legacy, f16, ✓ real source): C1

foreach_sub_scalar_bf16 (ops_legacy, bf16, ✓ real source): C1

foreach_mul_scalar_f32 (ops_legacy, f32, ✓ real source): C1

foreach_mul_scalar_f16 (ops_legacy, f16, ✓ real source): C1

foreach_mul_scalar_bf16 (ops_legacy, bf16, ✓ real source): C1

foreach_div_scalar_f32 (ops_legacy, f32, ✓ real source): C1

foreach_div_scalar_f16 (ops_legacy, f16, ✓ real source): C1

foreach_div_scalar_bf16 (ops_legacy, bf16, ✓ real source): C1

foreach_max_scalar_f32 (ops_legacy, f32, stub): C1, C4, C6

foreach_max_scalar_f16 (ops_legacy, f16, stub): C1, C4, C6

foreach_max_scalar_bf16 (ops_legacy, bf16, stub): C1, C4, C6

foreach_min_scalar_f32 (ops_legacy, f32, stub): C1, C4, C6

foreach_min_scalar_f16 (ops_legacy, f16, stub): C1, C4, C6

foreach_min_scalar_bf16 (ops_legacy, bf16, stub): C1, C4, C6

foreach_pow_scalar_f32 (ops_legacy, f32, ✓ real source): C1

foreach_pow_scalar_f16 (ops_legacy, f16, ✓ real source): C1

foreach_pow_scalar_bf16 (ops_legacy, bf16, ✓ real source): C1

foreach_clamp_scalar_f32 (ops_legacy, f32, stub): C1, C4, C6

foreach_clamp_scalar_f16 (ops_legacy, f16, stub): C1, C4, C6

foreach_clamp_scalar_bf16 (ops_legacy, bf16, stub): C1, C4, C6

foreach_add_list_alpha_f32 (ops_legacy, f32, stub): C1, C4, C6

foreach_add_list_alpha_f16 (ops_legacy, f16, stub): C1, C4, C6

foreach_add_list_alpha_bf16 (ops_legacy, bf16, stub): C1, C4, C6

foreach_sub_list_alpha_f32 (ops_legacy, f32, stub): C1, C4, C6

foreach_sub_list_alpha_f16 (ops_legacy, f16, stub): C1, C4, C6

foreach_sub_list_alpha_bf16 (ops_legacy, bf16, stub): C1, C4, C6

foreach_addcmul_f32 (ops_legacy, f32, stub): C1, C4, C6

foreach_addcdiv_f32 (ops_legacy, f32, stub): C1, C4, C6

foreach_copy_f32 (ops_legacy, f32, ✓ real source): C1

foreach_zero_inplace_f32 (ops_legacy, f32, ✓ real source): C1

foreach_lerp_f32 (ops_legacy, f32, stub): C1, C4, C6

foreach_addcmul_f16 (ops_legacy, f16, stub): C1, C4, C6

foreach_addcdiv_f16 (ops_legacy, f16, stub): C1, C4, C6

foreach_copy_f16 (ops_legacy, f16, ✓ real source): C1

foreach_zero_inplace_f16 (ops_legacy, f16, ✓ real source): C1

foreach_lerp_f16 (ops_legacy, f16, stub): C1, C4, C6

foreach_addcmul_bf16 (ops_legacy, bf16, stub): C1, C4, C6

foreach_addcdiv_bf16 (ops_legacy, bf16, stub): C1, C4, C6

foreach_copy_bf16 (ops_legacy, bf16, ✓ real source): C1

foreach_zero_inplace_bf16 (ops_legacy, bf16, ✓ real source): C1

foreach_lerp_bf16 (ops_legacy, bf16, stub): C1, C4, C6

zeros_like_f32 (ops_legacy, f32, stub): C1, C4, C6

ones_like_f32 (ops_legacy, f32, stub): C1, C4, C6

zeros_like_f16 (ops_legacy, f16, stub): C1, C4, C6

ones_like_f16 (ops_legacy, f16, stub): C1, C4, C6

zeros_like_bf16 (ops_legacy, bf16, stub): C1, C4, C6

ones_like_bf16 (ops_legacy, bf16, stub): C1, C4, C6

zeros_like_int32 (ops_legacy, i32, stub): C1, C4, C6

ones_like_int32 (ops_legacy, i32, stub): C1, C4, C6

elementwise_abs_f32 (ops_legacy, f32, stub): C1, C4, C6

elementwise_abs_f16 (ops_legacy, f16, stub): C1, C4, C6

elementwise_abs_bf16 (ops_legacy, bf16, stub): C1, C4, C6

elementwise_relu_f32 (ops_legacy, f32, stub): C1, C4, C6

elementwise_relu_f16 (ops_legacy, f16, stub): C1, C4, C6

elementwise_relu_bf16 (ops_legacy, bf16, stub): C1, C4, C6

elementwise_gelu_f32 (ops_legacy, f32, stub): C1, C4, C6

elementwise_gelu_f16 (ops_legacy, f16, stub): C1, C4, C6

elementwise_gelu_bf16 (ops_legacy, bf16, stub): C1, C4, C6

elementwise_silu_f32 (ops_legacy, f32, stub): C1, C4, C6

elementwise_silu_f16 (ops_legacy, f16, stub): C1, C4, C6

elementwise_silu_bf16 (ops_legacy, bf16, stub): C1, C4, C6

elementwise_neg_f32 (ops_legacy, f32, stub): C1, C4, C6

elementwise_neg_f16 (ops_legacy, f16, stub): C1, C4, C6

elementwise_neg_bf16 (ops_legacy, bf16, stub): C1, C4, C6

elementwise_sign_f32 (ops_legacy, f32, stub): C1, C4, C6

elementwise_sign_f16 (ops_legacy, f16, stub): C1, C4, C6

elementwise_sign_bf16 (ops_legacy, bf16, stub): C1, C4, C6

elementwise_ceil_f32 (ops_legacy, f32, stub): C1, C4, C6

elementwise_ceil_f16 (ops_legacy, f16, stub): C1, C4, C6

elementwise_ceil_bf16 (ops_legacy, bf16, stub): C1, C4, C6

elementwise_floor_f32 (ops_legacy, f32, stub): C1, C4, C6

elementwise_floor_f16 (ops_legacy, f16, stub): C1, C4, C6

elementwise_floor_bf16 (ops_legacy, bf16, stub): C1, C4, C6

elementwise16b_abs_f32 (ops_legacy, f32, stub): C1, C4, C6

elementwise16b_abs_f16 (ops_legacy, f16, stub): C1, C4, C6

elementwise16b_abs_bf16 (ops_legacy, bf16, stub): C1, C4, C6

elementwise16b_relu_f32 (ops_legacy, f32, stub): C1, C4, C6

elementwise16b_relu_f16 (ops_legacy, f16, stub): C1, C4, C6

elementwise16b_relu_bf16 (ops_legacy, bf16, stub): C1, C4, C6

elementwise16b_neg_f32 (ops_legacy, f32, stub): C1, C4, C6

elementwise16b_neg_f16 (ops_legacy, f16, stub): C1, C4, C6

elementwise16b_neg_bf16 (ops_legacy, bf16, stub): C1, C4, C6

elementwise16b_sign_f32 (ops_legacy, f32, stub): C1, C4, C6

elementwise16b_sign_f16 (ops_legacy, f16, stub): C1, C4, C6

elementwise16b_sign_bf16 (ops_legacy, bf16, stub): C1, C4, C6

foreach_abs_int32 (ops_legacy, i32, ✓ real source): C1

foreach_neg_int32 (ops_legacy, i32, ✓ real source): C1

foreach_sign_int32 (ops_legacy, i32, ✓ real source): C1

foreach_bitwise_not_int32 (ops_legacy, i32, stub): C1, C4, C6

foreach_logical_not_int32 (ops_legacy, i32, stub): C1, C4, C6

foreach_clamp_int32 (ops_legacy, i32, stub): C1, C4, C6

foreach_add_list_int32 (ops_legacy, i32, ✓ real source): C1, C4

foreach_sub_list_int32 (ops_legacy, i32, ✓ real source): C1, C4

foreach_mul_list_int32 (ops_legacy, i32, ✓ real source): C1

foreach_max_list_int32 (ops_legacy, i32, stub): C1, C4, C6

foreach_abs_int8 (ops_legacy, i8, ✓ real source): C1

foreach_neg_int8 (ops_legacy, i8, ✓ real source): C1

foreach_bitwise_not_int8 (ops_legacy, i8, stub): C1, C4, C6

foreach_clamp_int8 (ops_legacy, i8, stub): C1, C4, C6

foreach_add_scalar_int32 (ops_legacy, i32, ✓ real source): C1

foreach_sub_scalar_int32 (ops_legacy, i32, ✓ real source): C1

foreach_mul_scalar_int32 (ops_legacy, i32, ✓ real source): C1

foreach_div_scalar_int32 (ops_legacy, i32, ✓ real source): C1

foreach_sin_f32 (ops_math, f32, ✓ real source): C1

foreach_sin_f16 (ops_math, f16, ✓ real source): C1

foreach_sin_bf16 (ops_math, bf16, ✓ real source): C1

foreach_cos_f32 (ops_math, f32, ✓ real source): C1

foreach_cos_f16 (ops_math, f16, ✓ real source): C1

foreach_cos_bf16 (ops_math, bf16, ✓ real source): C1

foreach_tan_f32 (ops_math, f32, ✓ real source): C1

foreach_tan_f16 (ops_math, f16, ✓ real source): C1

foreach_tan_bf16 (ops_math, bf16, ✓ real source): C1

foreach_asin_f32 (ops_math, f32, ✓ real source): C1

foreach_asin_f16 (ops_math, f16, ✓ real source): C1

foreach_asin_bf16 (ops_math, bf16, ✓ real source): C1

foreach_acos_f32 (ops_math, f32, ✓ real source): C1

foreach_acos_f16 (ops_math, f16, ✓ real source): C1

foreach_acos_bf16 (ops_math, bf16, ✓ real source): C1

foreach_atan_f32 (ops_math, f32, ✓ real source): C1

foreach_atan_f16 (ops_math, f16, ✓ real source): C1

foreach_atan_bf16 (ops_math, bf16, ✓ real source): C1

foreach_atan2_f32 (ops_math, f32, stub): C1, C4, C6

foreach_atan2_f16 (ops_math, f16, stub): C1, C4, C6

foreach_atan2_bf16 (ops_math, bf16, stub): C1, C4, C6

foreach_sinh_f32 (ops_math, f32, ✓ real source): C1

foreach_sinh_f16 (ops_math, f16, ✓ real source): C1

foreach_sinh_bf16 (ops_math, bf16, ✓ real source): C1

foreach_cosh_f32 (ops_math, f32, ✓ real source): C1

foreach_cosh_f16 (ops_math, f16, ✓ real source): C1

foreach_cosh_bf16 (ops_math, bf16, ✓ real source): C1

foreach_tanh_math_f32 (ops_math, f32, stub): C1, C4, C6

foreach_tanh_math_f16 (ops_math, f16, stub): C1, C4, C6

foreach_tanh_math_bf16 (ops_math, bf16, stub): C1, C4, C6

foreach_asinh_f32 (ops_math, f32, stub): C1, C4, C6

foreach_asinh_f16 (ops_math, f16, stub): C1, C4, C6

foreach_asinh_bf16 (ops_math, bf16, stub): C1, C4, C6

foreach_acosh_f32 (ops_math, f32, stub): C1, C4, C6

foreach_acosh_f16 (ops_math, f16, stub): C1, C4, C6

foreach_acosh_bf16 (ops_math, bf16, stub): C1, C4, C6

foreach_atanh_f32 (ops_math, f32, stub): C1, C4, C6

foreach_atanh_f16 (ops_math, f16, stub): C1, C4, C6

foreach_atanh_bf16 (ops_math, bf16, stub): C1, C4, C6

foreach_erf_f32 (ops_math, f32, ✓ real source): C1

foreach_erf_f16 (ops_math, f16, ✓ real source): C1

foreach_erf_bf16 (ops_math, bf16, ✓ real source): C1

foreach_erfc_f32 (ops_math, f32, ✓ real source): C1

foreach_erfc_f16 (ops_math, f16, ✓ real source): C1

foreach_erfc_bf16 (ops_math, bf16, ✓ real source): C1

foreach_erfinv_f32 (ops_math, f32, stub): C1, C4, C6

foreach_erfinv_f16 (ops_math, f16, stub): C1, C4, C6

foreach_erfinv_bf16 (ops_math, bf16, stub): C1, C4, C6

foreach_expm1_f32 (ops_math, f32, ✓ real source): C1

foreach_expm1_f16 (ops_math, f16, ✓ real source): C1

foreach_expm1_bf16 (ops_math, bf16, ✓ real source): C1

foreach_log1p_f32 (ops_math, f32, ✓ real source): C1

foreach_log1p_f16 (ops_math, f16, ✓ real source): C1

foreach_log1p_bf16 (ops_math, bf16, ✓ real source): C1

foreach_softplus_f32 (ops_math, f32, stub): C1, C4, C6

foreach_softplus_f16 (ops_math, f16, stub): C1, C4, C6

foreach_softplus_bf16 (ops_math, bf16, stub): C1, C4, C6

foreach_digamma_f32 (ops_math, f32, stub): C1, C4, C6

foreach_digamma_f16 (ops_math, f16, stub): C1, C4, C6

foreach_digamma_bf16 (ops_math, bf16, stub): C1, C4, C6

foreach_lgamma_f32 (ops_math, f32, stub): C1, C4, C6

foreach_lgamma_f16 (ops_math, f16, stub): C1, C4, C6

foreach_lgamma_bf16 (ops_math, bf16, stub): C1, C4, C6

foreach_i0_f32 (ops_math, f32, stub): C1, C4, C6

foreach_i0_f16 (ops_math, f16, stub): C1, C4, C6

foreach_i0_bf16 (ops_math, bf16, stub): C1, C4, C6

foreach_i1_f32 (ops_math, f32, stub): C1, C4, C6

foreach_i1_f16 (ops_math, f16, stub): C1, C4, C6

foreach_i1_bf16 (ops_math, bf16, stub): C1, C4, C6

foreach_hypot_f32 (ops_math, f32, stub): C1, C4, C6

foreach_hypot_f16 (ops_math, f16, stub): C1, C4, C6

foreach_hypot_bf16 (ops_math, bf16, stub): C1, C4, C6

foreach_fma_f32 (ops_math, f32, stub): C1, C4, C6

foreach_fma_f16 (ops_math, f16, stub): C1, C4, C6

foreach_fma_bf16 (ops_math, bf16, stub): C1, C4, C6

foreach_remainder_f32 (ops_math, f32, stub): C1, C4, C6

foreach_remainder_f16 (ops_math, f16, stub): C1, C4, C6

foreach_remainder_bf16 (ops_math, bf16, stub): C1, C4, C6

foreach_copysign_f32 (ops_math, f32, stub): C1, C4, C6

foreach_copysign_f16 (ops_math, f16, stub): C1, C4, C6

foreach_copysign_bf16 (ops_math, bf16, stub): C1, C4, C6

foreach_nextafter_f32 (ops_math, f32, stub): C1, C4, C6

foreach_nextafter_f16 (ops_math, f16, stub): C1, C4, C6

foreach_nextafter_bf16 (ops_math, bf16, stub): C1, C4, C6

foreach_ldexp_f32 (ops_math, f32, stub): C1, C4, C6

foreach_ldexp_f16 (ops_math, f16, stub): C1, C4, C6

foreach_ldexp_bf16 (ops_math, bf16, stub): C1, C4, C6

foreach_frexp_f32 (ops_math, f32, stub): C1, C4, C6

foreach_frexp_f16 (ops_math, f16, stub): C1, C4, C6

foreach_frexp_bf16 (ops_math, bf16, stub): C1, C4, C6

foreach_logaddexp_f32 (ops_math, f32, stub): C1, C4, C6

foreach_logaddexp_f16 (ops_math, f16, stub): C1, C4, C6

foreach_logaddexp_bf16 (ops_math, bf16, stub): C1, C4, C6

foreach_logaddexp2_f32 (ops_math, f32, stub): C1, C4, C6

foreach_logaddexp2_f16 (ops_math, f16, stub): C1, C4, C6

foreach_logaddexp2_bf16 (ops_math, bf16, stub): C1, C4, C6

foreach_sincos_f32_910b (ops_math, f32, stub): C1, C4, C6

foreach_sincos_f16_910b (ops_math, f32, stub): C1, C4, C6

foreach_sincos_bf16_910b (ops_math, f32, stub): C1, C4, C6

foreach_sincospi_f32_910b (ops_math, f32, stub): C1, C4, C6

foreach_sincospi_f16_910b (ops_math, f32, stub): C1, C4, C6

foreach_sincospi_bf16_910b (ops_math, f32, stub): C1, C4, C6

foreach_j0_f32 (ops_math, f32, stub): C1, C4, C6

foreach_j0_f16 (ops_math, f16, stub): C1, C4, C6

foreach_j0_bf16 (ops_math, bf16, stub): C1, C4, C6

foreach_j1_f32 (ops_math, f32, stub): C1, C4, C6

foreach_j1_f16 (ops_math, f16, stub): C1, C4, C6

foreach_j1_bf16 (ops_math, bf16, stub): C1, C4, C6

foreach_y0_f32 (ops_math, f32, stub): C1, C4, C6

foreach_y0_f16 (ops_math, f16, stub): C1, C4, C6

foreach_y0_bf16 (ops_math, bf16, stub): C1, C4, C6

foreach_y1_f32 (ops_math, f32, stub): C1, C4, C6

foreach_y1_f16 (ops_math, f16, stub): C1, C4, C6

foreach_y1_bf16 (ops_math, bf16, stub): C1, C4, C6

foreach_polygamma_f32 (ops_math, f32, stub): C1, C4, C6

foreach_polygamma_f16 (ops_math, f16, stub): C1, C4, C6

foreach_polygamma_bf16 (ops_math, bf16, stub): C1, C4, C6

foreach_zeta_f32 (ops_math, f32, stub): C1, C4, C6

foreach_zeta_f16 (ops_math, f16, stub): C1, C4, C6

foreach_zeta_bf16 (ops_math, bf16, stub): C1, C4, C6

foreach_relu_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_relu_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_relu_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_relu6_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_relu6_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_relu6_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_leaky_relu_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_leaky_relu_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_leaky_relu_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_prelu_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_prelu_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_prelu_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_elu_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_elu_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_elu_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_selu_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_selu_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_selu_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_gelu_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_gelu_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_gelu_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_fast_gelu_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_fast_gelu_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_fast_gelu_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_sigmoid_f32 (ops_nn, f32, ✓ real source): C1

foreach_sigmoid_f16 (ops_nn, f16, ✓ real source): C1

foreach_sigmoid_bf16 (ops_nn, bf16, ✓ real source): C1

foreach_hardsigmoid_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_hardsigmoid_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_hardsigmoid_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_hardswish_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_hardswish_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_hardswish_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_hardtanh_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_hardtanh_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_hardtanh_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_silu_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_silu_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_silu_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_mish_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_mish_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_mish_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_softplus_nn_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_softplus_nn_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_softplus_nn_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_softsign_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_softsign_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_softsign_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_tanh_nn_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_tanh_nn_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_tanh_nn_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_celu_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_celu_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_celu_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_glu_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_glu_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_glu_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_rrelu_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_rrelu_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_rrelu_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_batch_norm_f32 (ops_nn, f32, ✓ real source): C1

foreach_batch_norm_f16 (ops_nn, f16, ✓ real source): C1

foreach_batch_norm_bf16 (ops_nn, bf16, ✓ real source): C1

foreach_instance_norm_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_instance_norm_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_instance_norm_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_layer_norm_f32 (ops_nn, f32, ✓ real source): C1

foreach_layer_norm_f16 (ops_nn, f16, ✓ real source): C1

foreach_layer_norm_bf16 (ops_nn, bf16, ✓ real source): C1

foreach_group_norm_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_group_norm_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_group_norm_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_rms_norm_f32 (ops_nn, f32, ✓ real source): C1

foreach_rms_norm_f16 (ops_nn, f16, ✓ real source): C1

foreach_rms_norm_bf16 (ops_nn, bf16, ✓ real source): C1

foreach_softmax_f32 (ops_nn, f32, ✓ real source): C1

foreach_softmax_f16 (ops_nn, f16, ✓ real source): C1

foreach_softmax_bf16 (ops_nn, bf16, ✓ real source): C1

foreach_log_softmax_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_log_softmax_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_log_softmax_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_dropout_f32 (ops_nn, f32, ✓ real source): C1, C2, C3, C4, C5, C6

foreach_dropout_f16 (ops_nn, f16, ✓ real source): C1, C2, C3, C4, C5, C6

foreach_dropout_bf16 (ops_nn, bf16, ✓ real source): C1, C2, C3, C4, C5, C6

foreach_embedding_f32 (ops_nn, f32, ✓ real source): C1, C2

foreach_embedding_f16 (ops_nn, f16, ✓ real source): C1, C2

foreach_embedding_bf16 (ops_nn, bf16, ✓ real source): C1, C2

foreach_swish_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_swish_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_swish_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_logsigmoid_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_logsigmoid_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_logsigmoid_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_tanhshrink_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_tanhshrink_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_tanhshrink_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_softshrink_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_softshrink_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_softshrink_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_hardshrink_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_hardshrink_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_hardshrink_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_threshold_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_threshold_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_threshold_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_cross_entropy_loss_f32 (ops_nn, f32, ✓ real source): C1

foreach_cross_entropy_loss_f16 (ops_nn, f16, ✓ real source): C1

foreach_cross_entropy_loss_bf16 (ops_nn, bf16, ✓ real source): C1

foreach_mse_loss_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_mse_loss_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_mse_loss_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_l1_loss_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_l1_loss_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_l1_loss_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_smooth_l1_loss_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_smooth_l1_loss_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_smooth_l1_loss_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_nll_loss_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_nll_loss_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_nll_loss_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_avg_pool_2d_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_avg_pool_2d_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_avg_pool_2d_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_max_pool_2d_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_max_pool_2d_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_max_pool_2d_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_avg_pool_1d_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_avg_pool_1d_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_avg_pool_1d_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_max_pool_1d_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_max_pool_1d_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_max_pool_1d_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_lp_pool_2d_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_lp_pool_2d_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_lp_pool_2d_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_bce_loss_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_bce_loss_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_bce_loss_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_bce_with_logits_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_bce_with_logits_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_bce_with_logits_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_hinge_loss_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_hinge_loss_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_hinge_loss_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_kl_div_loss_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_kl_div_loss_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_kl_div_loss_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_cosine_embedding_loss_f32 (ops_nn, f32, stub): C1, C4, C6

foreach_cosine_embedding_loss_f16 (ops_nn, f16, stub): C1, C4, C6

foreach_cosine_embedding_loss_bf16 (ops_nn, bf16, stub): C1, C4, C6

foreach_attention_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_attention_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_attention_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_attention_f16_910b (ops_transformer, f32, stub): C1, C4, C6

foreach_attention_f16_310p (ops_transformer, f32, stub): C1, C4, C6

foreach_scaled_dot_product_attention_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_scaled_dot_product_attention_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_scaled_dot_product_attention_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_scaled_dot_product_attention_f16_910b (ops_transformer, f32, stub): C1, C4, C6

foreach_scaled_dot_product_attention_f16_310p (ops_transformer, f32, stub): C1, C4, C6

foreach_multi_head_attention_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_multi_head_attention_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_multi_head_attention_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_multi_head_attention_f16_910b (ops_transformer, f32, stub): C1, C4, C6

foreach_multi_head_attention_f16_310p (ops_transformer, f32, stub): C1, C4, C6

foreach_flash_attention_v1_f32 (ops_transformer, f32, ✓ real source): none

foreach_flash_attention_v1_f16 (ops_transformer, f16, ✓ real source): none

foreach_flash_attention_v1_bf16 (ops_transformer, bf16, ✓ real source): none

foreach_flash_attention_v1_f16_910b (ops_transformer, f32, ✓ real source): none

foreach_flash_attention_v1_f16_310p (ops_transformer, f32, ✓ real source): none

foreach_flash_attention_v2_f32 (ops_transformer, f32, ✓ real source): none

foreach_flash_attention_v2_f16 (ops_transformer, f16, ✓ real source): none

foreach_flash_attention_v2_bf16 (ops_transformer, bf16, ✓ real source): none

foreach_flash_attention_v2_f16_910b (ops_transformer, f32, ✓ real source): none

foreach_flash_attention_v2_f16_310p (ops_transformer, f32, ✓ real source): none

foreach_flash_attention_v3_f32 (ops_transformer, f32, ✓ real source): none

foreach_flash_attention_v3_f16 (ops_transformer, f16, ✓ real source): none

foreach_flash_attention_v3_bf16 (ops_transformer, bf16, ✓ real source): none

foreach_flash_attention_v3_f16_910b (ops_transformer, f32, ✓ real source): none

foreach_flash_attention_v3_f16_310p (ops_transformer, f32, ✓ real source): none

foreach_paged_attention_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_paged_attention_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_paged_attention_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_paged_attention_f16_910b (ops_transformer, f32, stub): C1, C4, C6

foreach_paged_attention_f16_310p (ops_transformer, f32, stub): C1, C4, C6

foreach_rotary_embedding_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_rotary_embedding_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_rotary_embedding_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_rotary_embedding_f16_910b (ops_transformer, f32, stub): C1, C4, C6

foreach_rotary_embedding_f16_310p (ops_transformer, f32, stub): C1, C4, C6

foreach_rope_apply_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_rope_apply_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_rope_apply_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_rope_apply_f16_910b (ops_transformer, f32, stub): C1, C4, C6

foreach_rope_apply_f16_310p (ops_transformer, f32, stub): C1, C4, C6

foreach_alibi_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_alibi_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_alibi_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_alibi_f16_910b (ops_transformer, f32, stub): C1, C4, C6

foreach_alibi_f16_310p (ops_transformer, f32, stub): C1, C4, C6

foreach_kv_cache_update_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_kv_cache_update_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_kv_cache_update_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_kv_cache_update_f16_910b (ops_transformer, f32, stub): C1, C4, C6

foreach_kv_cache_update_f16_310p (ops_transformer, f32, stub): C1, C4, C6

foreach_beam_search_score_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_beam_search_score_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_beam_search_score_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_beam_search_score_f16_910b (ops_transformer, f32, stub): C1, C4, C6

foreach_beam_search_score_f16_310p (ops_transformer, f32, stub): C1, C4, C6

foreach_matmul_f32 (ops_transformer, f32, ✓ real source): C1

foreach_matmul_f16 (ops_transformer, f16, ✓ real source): C1

foreach_matmul_bf16 (ops_transformer, bf16, ✓ real source): C1

foreach_matmul_f32_910b (ops_transformer, f32, ✓ real source): C1

foreach_matmul_f32_310p (ops_transformer, f32, ✓ real source): C1

foreach_matmul_f16_910b (ops_transformer, f32, ✓ real source): C1

foreach_matmul_f16_310p (ops_transformer, f32, ✓ real source): C1
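The real-source matmul kernels above are flagged only for C1 (type confusion). The host-side sketch below illustrates what that class means in plain Rust; it does not use any ascend-rs API, and both function names are hypothetical. A type-erased argument (standing in for `GM_ADDR`) lets an f32 kernel reinterpret two packed f16 values of 1.0 as a single f32 of roughly 0.0078, with no error. A typed parameter moves that mistake to compile time.

```rust
/// Stand-in for a type-erased kernel argument (like GM_ADDR, a raw byte
/// pointer): the callee alone decides how the bytes are interpreted.
fn read_as_f32_erased(bytes: &[u8]) -> f32 {
    // Reinterpret the first four bytes as a little-endian f32.
    f32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]])
}

/// Hypothetical typed entry point: the element type is part of the
/// signature, so f16 data carried as raw u16 bits cannot be passed in.
fn sum_f32_typed(input: &[f32]) -> f32 {
    input.iter().sum()
}

fn main() {
    // Two f16 values of 1.0 (bit pattern 0x3C00), stored little-endian.
    let f16_bits: [u16; 2] = [0x3C00, 0x3C00];
    let mut bytes = Vec::new();
    for b in f16_bits {
        bytes.extend_from_slice(&b.to_le_bytes());
    }

    // C++-style: the f32 kernel silently reinterprets f16 data.
    let wrong = read_as_f32_erased(&bytes);
    println!("f16 pair read as f32: {wrong}");

    // Rust-style: `sum_f32_typed(&f16_bits)` does not compile
    // (expected `&[f32]`, found `&[u16; 2]`), so only f32 data gets through.
    let ok = sum_f32_typed(&[1.0, 1.0]);
    println!("typed sum: {ok}");
}
```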

foreach_batch_matmul_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_batch_matmul_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_batch_matmul_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_batch_matmul_f32_910b (ops_transformer, f32, stub): C1, C4, C6

foreach_batch_matmul_f32_310p (ops_transformer, f32, stub): C1, C4, C6

foreach_batch_matmul_f16_910b (ops_transformer, f32, stub): C1, C4, C6

foreach_batch_matmul_f16_310p (ops_transformer, f32, stub): C1, C4, C6

foreach_linear_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_linear_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_linear_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_linear_f32_910b (ops_transformer, f32, stub): C1, C4, C6

foreach_linear_f32_310p (ops_transformer, f32, stub): C1, C4, C6

foreach_linear_f16_910b (ops_transformer, f32, stub): C1, C4, C6

foreach_linear_f16_310p (ops_transformer, f32, stub): C1, C4, C6

foreach_gemm_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_gemm_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_gemm_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_gemm_f32_910b (ops_transformer, f32, stub): C1, C4, C6

foreach_gemm_f32_310p (ops_transformer, f32, stub): C1, C4, C6

foreach_gemm_f16_910b (ops_transformer, f32, stub): C1, C4, C6

foreach_gemm_f16_310p (ops_transformer, f32, stub): C1, C4, C6

foreach_gemv_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_gemv_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_gemv_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_gemv_f32_910b (ops_transformer, f32, stub): C1, C4, C6

foreach_gemv_f32_310p (ops_transformer, f32, stub): C1, C4, C6

foreach_gemv_f16_910b (ops_transformer, f32, stub): C1, C4, C6

foreach_gemv_f16_310p (ops_transformer, f32, stub): C1, C4, C6

foreach_position_encoding_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_position_encoding_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_position_encoding_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_causal_mask_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_causal_mask_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_causal_mask_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_cross_attention_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_cross_attention_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_cross_attention_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_grouped_query_attention_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_grouped_query_attention_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_grouped_query_attention_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_sliding_window_attention_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_sliding_window_attention_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_sliding_window_attention_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_linear_attention_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_linear_attention_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_linear_attention_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_sparse_attention_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_sparse_attention_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_sparse_attention_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_local_attention_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_local_attention_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_local_attention_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_ring_attention_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_ring_attention_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_ring_attention_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_prefix_attention_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_prefix_attention_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_prefix_attention_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_kv_cache_quantize_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_kv_cache_quantize_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_kv_cache_quantize_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_attention_score_mod_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_attention_score_mod_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_attention_score_mod_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_rope_neox_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_rope_neox_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_rope_neox_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_rope_glm_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_rope_glm_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_rope_glm_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_matmul_quant_int8_f16 (ops_transformer, f16, ✓ real source): C1

foreach_matmul_quant_int8_bf16 (ops_transformer, bf16, ✓ real source): C1

foreach_attention_quant_int8_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_attention_quant_int8_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_linear_quant_int8_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_linear_quant_int8_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_matmul_quant_int4_f16 (ops_transformer, f16, ✓ real source): C1

foreach_matmul_quant_int4_bf16 (ops_transformer, bf16, ✓ real source): C1

foreach_attention_quant_int4_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_attention_quant_int4_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_linear_quant_int4_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_linear_quant_int4_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_multi_query_attention_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_multi_query_attention_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_multi_query_attention_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_flash_decoding_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_flash_decoding_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_flash_decoding_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_speculative_decoding_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_speculative_decoding_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_speculative_decoding_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_token_mixing_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_token_mixing_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_token_mixing_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_channel_mixing_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_channel_mixing_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_channel_mixing_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_moe_gate_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_moe_gate_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_moe_gate_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_moe_dispatch_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_moe_dispatch_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_moe_dispatch_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_moe_combine_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_moe_combine_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_moe_combine_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_swiglu_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_swiglu_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_swiglu_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_geglu_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_geglu_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_geglu_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_reglu_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_reglu_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_reglu_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_rmsnorm_linear_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_rmsnorm_linear_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_rmsnorm_linear_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_prenorm_attention_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_prenorm_attention_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_prenorm_attention_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_postnorm_attention_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_postnorm_attention_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_postnorm_attention_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_parallel_attention_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_parallel_attention_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_parallel_attention_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_sandwich_norm_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_sandwich_norm_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_sandwich_norm_bf16 (ops_transformer, bf16, stub): C1, C4, C6

foreach_qk_norm_f32 (ops_transformer, f32, stub): C1, C4, C6

foreach_qk_norm_f16 (ops_transformer, f16, stub): C1, C4, C6

foreach_qk_norm_bf16 (ops_transformer, bf16, stub): C1, C4, C6
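Every stub kernel in this list carries C6, whose C++ pattern is the per-block offset `blockIdx * perBlockLen` computed in 32-bit unsigned arithmetic. The sketch below uses only standard `u32` methods; the helper names and the launch sizes are illustrative assumptions. With 100,000,000 elements per block, block 48's true offset (4,800,000,000) does not fit in `u32` and wraps to 505,032,704, which lands inside block 5's range. `wrapping_mul` makes that wrap-around an explicit choice, and `checked_mul` can reject it instead. A plain `*` would panic in a Rust debug build and wrap in a default release build.

```rust
/// Mirrors the C++ uint32_t computation, but states that wrap-around
/// is the intended semantics.
fn block_offset_wrapping(block_idx: u32, per_block_len: u32) -> u32 {
    block_idx.wrapping_mul(per_block_len)
}

/// Alternative: surface the overflow as `None` instead of wrapping.
fn block_offset_checked(block_idx: u32, per_block_len: u32) -> Option<u32> {
    block_idx.checked_mul(per_block_len)
}

fn main() {
    // Small launch: both variants agree.
    println!("{}", block_offset_wrapping(7, 1024));
    println!("{:?}", block_offset_checked(7, 1024));

    // Large launch: 48 * 100_000_000 = 4_800_000_000 > u32::MAX,
    // so the wrapped offset aliases into block 5's range.
    let per_block_len: u32 = 100_000_000;
    println!("wrapped offset: {}", block_offset_wrapping(48, per_block_len));
    println!("checked offset: {:?}", block_offset_checked(48, per_block_len));
}
```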

foreach_adam_f32 (ops_optimizer, f32, ✓ real source): C1

foreach_adam_f16 (ops_optimizer, f16, ✓ real source): C1

foreach_adam_bf16 (ops_optimizer, bf16, ✓ real source): C1

foreach_adam_f32_wd (ops_optimizer, f32, ✓ real source): C1

foreach_adamw_f32 (ops_optimizer, f32, ✓ real source): C1

foreach_adamw_f16 (ops_optimizer, f16, ✓ real source): C1

foreach_adamw_bf16 (ops_optimizer, bf16, ✓ real source): C1

foreach_adamw_f32_wd (ops_optimizer, f32, ✓ real source): C1

foreach_sgd_f32 (ops_optimizer, f32, stub): C1, C4, C6

foreach_sgd_f16 (ops_optimizer, f16, stub): C1, C4, C6

foreach_sgd_bf16 (ops_optimizer, bf16, stub): C1, C4, C6

foreach_sgd_f32_wd (ops_optimizer, f32, stub): C1, C4, C6

foreach_sgd_momentum_f32 (ops_optimizer, f32, stub): C1, C4, C6

foreach_sgd_momentum_f16 (ops_optimizer, f16, stub): C1, C4, C6

foreach_sgd_momentum_bf16 (ops_optimizer, bf16, stub): C1, C4, C6

foreach_sgd_momentum_f32_wd (ops_optimizer, f32, stub): C1, C4, C6

foreach_adagrad_f32 (ops_optimizer, f32, stub): C1, C4, C6

foreach_adagrad_f16 (ops_optimizer, f16, stub): C1, C4, C6

foreach_adagrad_bf16 (ops_optimizer, bf16, stub): C1, C4, C6

foreach_adagrad_f32_wd (ops_optimizer, f32, stub): C1, C4, C6

foreach_adadelta_f32 (ops_optimizer, f32, stub): C1, C4, C6

foreach_adadelta_f16 (ops_optimizer, f16, stub): C1, C4, C6

foreach_adadelta_bf16 (ops_optimizer, bf16, stub): C1, C4, C6

foreach_adadelta_f32_wd (ops_optimizer, f32, stub): C1, C4, C6

foreach_rmsprop_f32 (ops_optimizer, f32, stub): C1, C4, C6

foreach_rmsprop_f16 (ops_optimizer, f16, stub): C1, C4, C6

foreach_rmsprop_bf16 (ops_optimizer, bf16, stub): C1, C4, C6

foreach_rmsprop_f32_wd (ops_optimizer, f32, stub): C1, C4, C6

foreach_lamb_f32 (ops_optimizer, f32, stub): C1, C4, C6

foreach_lamb_f16 (ops_optimizer, f16, stub): C1, C4, C6

foreach_lamb_bf16 (ops_optimizer, bf16, stub): C1, C4, C6

foreach_lamb_f32_wd (ops_optimizer, f32, stub): C1, C4, C6

foreach_lars_f32 (ops_optimizer, f32, stub): C1, C4, C6

foreach_lars_f16 (ops_optimizer, f16, stub): C1, C4, C6

foreach_lars_bf16 (ops_optimizer, bf16, stub): C1, C4, C6

foreach_lars_f32_wd (ops_optimizer, f32, stub): C1, C4, C6

foreach_ftrl_f32 (ops_optimizer, f32, stub): C1, C4, C6

foreach_ftrl_f16 (ops_optimizer, f16, stub): C1, C4, C6

foreach_ftrl_bf16 (ops_optimizer, bf16, stub): C1, C4, C6

foreach_ftrl_f32_wd (ops_optimizer, f32, stub): C1, C4, C6

foreach_adam_amsgrad_f32 (ops_optimizer, f32, ✓ real source): C1

foreach_adam_amsgrad_f16 (ops_optimizer, f16, ✓ real source): C1

foreach_adam_amsgrad_bf16 (ops_optimizer, bf16, ✓ real source): C1

foreach_adamw_amsgrad_f32 (ops_optimizer, f32, ✓ real source): C1

foreach_adamw_amsgrad_f16 (ops_optimizer, f16, ✓ real source): C1

foreach_adamw_amsgrad_bf16 (ops_optimizer, bf16, ✓ real source): C1

foreach_adam_fused_f32 (ops_optimizer, f32, ✓ real source): C1

foreach_adam_fused_f16 (ops_optimizer, f16, ✓ real source): C1

foreach_adam_fused_bf16 (ops_optimizer, bf16, ✓ real source): C1

foreach_adamw_fused_f32 (ops_optimizer, f32, ✓ real source): C1

foreach_adamw_fused_f16 (ops_optimizer, f16, ✓ real source): C1

foreach_adamw_fused_bf16 (ops_optimizer, bf16, ✓ real source): C1

foreach_sgd_nesterov_f32 (ops_optimizer, f32, stub): C1, C4, C6

foreach_sgd_nesterov_f16 (ops_optimizer, f16, stub): C1, C4, C6

foreach_sgd_nesterov_bf16 (ops_optimizer, bf16, stub): C1, C4, C6

foreach_lion_f32 (ops_optimizer, f32, stub): C1, C4, C6

foreach_lion_f16 (ops_optimizer, f16, stub): C1, C4, C6

foreach_lion_bf16 (ops_optimizer, bf16, stub): C1, C4, C6

foreach_adafactor_f32 (ops_optimizer, f32, stub): C1, C4, C6

foreach_adafactor_f16 (ops_optimizer, f16, stub): C1, C4, C6

foreach_adafactor_bf16 (ops_optimizer, bf16, stub): C1, C4, C6

foreach_sophia_f32 (ops_optimizer, f32, stub): C1, C4, C6

foreach_sophia_f16 (ops_optimizer, f16, stub): C1, C4, C6

foreach_sophia_bf16 (ops_optimizer, bf16, stub): C1, C4, C6

foreach_came_f32 (ops_optimizer, f32, stub): C1, C4, C6

foreach_came_f16 (ops_optimizer, f16, stub): C1, C4, C6

foreach_came_bf16 (ops_optimizer, bf16, stub): C1, C4, C6

foreach_novograd_f32 (ops_optimizer, f32, stub): C1, C4, C6

foreach_novograd_f16 (ops_optimizer, f16, stub): C1, C4, C6

foreach_novograd_bf16 (ops_optimizer, bf16, stub): C1, C4, C6

foreach_prodigy_f32 (ops_optimizer, f32, stub): C1, C4, C6

foreach_prodigy_f16 (ops_optimizer, f16, stub): C1, C4, C6

foreach_prodigy_bf16 (ops_optimizer, bf16, stub): C1, C4, C6

foreach_shampoo_f32 (ops_optimizer, f32, stub): C1, C4, C6

foreach_shampoo_f16 (ops_optimizer, f16, stub): C1, C4, C6

foreach_shampoo_bf16 (ops_optimizer, bf16, stub): C1, C4, C6

foreach_adalomo_f32 (ops_optimizer, f32, stub): C1, C4, C6

foreach_adalomo_f16 (ops_optimizer, f16, stub): C1, C4, C6

foreach_adalomo_bf16 (ops_optimizer, bf16, stub): C1, C4, C6

foreach_galore_f32 (ops_optimizer, f32, stub): C1, C4, C6

foreach_galore_f16 (ops_optimizer, f16, stub): C1, C4, C6

foreach_galore_bf16 (ops_optimizer, bf16, stub): C1, C4, C6
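The optimizer stubs above are all flagged for C4, a DMA-to-compute handoff without `pipe_barrier()`. H.1 states that ascend-rs `kernel_ops` composites insert barriers internally. The host-side simulation below sketches that idea under stated assumptions: `Pipeline`, `Event`, and `load_compute_store` are hypothetical names, and an event log stands in for the hardware pipeline so the ordering can be checked. In a real crate the low-level steps would presumably not be exposed to kernel authors; here they are only module-private.

```rust
/// Events recorded by the simulated pipeline (hypothetical stand-in
/// for DMA engines and vector units).
#[derive(Debug, PartialEq, Clone, Copy)]
enum Event {
    DmaIn,
    Barrier,
    Compute,
    DmaOut,
}

struct Pipeline {
    log: Vec<Event>,
}

impl Pipeline {
    fn new() -> Self {
        Pipeline { log: Vec::new() }
    }
    // Low-level steps: in C++ these are separate calls, and omitting the
    // barrier between DmaIn and Compute is the C4 hazard.
    fn dma_in(&mut self) { self.log.push(Event::DmaIn); }
    fn barrier(&mut self) { self.log.push(Event::Barrier); }
    fn compute(&mut self) { self.log.push(Event::Compute); }
    fn dma_out(&mut self) { self.log.push(Event::DmaOut); }
}

/// Hypothetical composite: the caller requests the whole sequence and the
/// barriers are inserted internally, so they cannot be forgotten.
fn load_compute_store(p: &mut Pipeline) {
    p.dma_in();
    p.barrier(); // input DMA must land before compute reads the buffer
    p.compute();
    p.barrier(); // compute must finish before the result is copied out
    p.dma_out();
}

fn main() {
    let mut p = Pipeline::new();
    load_compute_store(&mut p);
    println!("{:?}", p.log);
}
```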

foreach_reduce_sum_f32 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_sum_f16 (ops_reduce, f16, stub): C1, C4, C6

foreach_reduce_sum_int32 (ops_reduce, i32, stub): C1, C4, C6

foreach_reduce_max_f32 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_max_f16 (ops_reduce, f16, stub): C1, C4, C6

foreach_reduce_max_int32 (ops_reduce, i32, stub): C1, C4, C6

foreach_reduce_min_f32 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_min_f16 (ops_reduce, f16, stub): C1, C4, C6

foreach_reduce_min_int32 (ops_reduce, i32, stub): C1, C4, C6

foreach_reduce_mean_f32 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_mean_f16 (ops_reduce, f16, stub): C1, C4, C6

foreach_reduce_mean_int32 (ops_reduce, i32, stub): C1, C4, C6

foreach_reduce_prod_f32 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_prod_f16 (ops_reduce, f16, stub): C1, C4, C6

foreach_reduce_prod_int32 (ops_reduce, i32, stub): C1, C4, C6

foreach_reduce_any_f32 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_any_f16 (ops_reduce, f16, stub): C1, C4, C6

foreach_reduce_any_int32 (ops_reduce, i32, stub): C1, C4, C6

foreach_reduce_all_f32 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_all_f16 (ops_reduce, f16, stub): C1, C4, C6

foreach_reduce_all_int32 (ops_reduce, i32, stub): C1, C4, C6

foreach_argmax_f32 (ops_reduce, f32, stub): C1, C4, C6

foreach_argmax_f16 (ops_reduce, f16, stub): C1, C4, C6

foreach_argmax_int32 (ops_reduce, i32, stub): C1, C4, C6

foreach_argmin_f32 (ops_reduce, f32, stub): C1, C4, C6

foreach_argmin_f16 (ops_reduce, f16, stub): C1, C4, C6

foreach_argmin_int32 (ops_reduce, i32, stub): C1, C4, C6

foreach_cumsum_f32 (ops_reduce, f32, stub): C1, C4, C6

foreach_cumsum_f16 (ops_reduce, f16, stub): C1, C4, C6

foreach_cumsum_int32 (ops_reduce, i32, stub): C1, C4, C6

foreach_cumprod_f32 (ops_reduce, f32, stub): C1, C4, C6

foreach_cumprod_f16 (ops_reduce, f16, stub): C1, C4, C6

foreach_cumprod_int32 (ops_reduce, i32, stub): C1, C4, C6

foreach_reduce_sum_bf16 (ops_reduce, bf16, stub): C1, C4, C6

foreach_reduce_sum_f32_axis0 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_sum_f16_axis0 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_sum_f32_axis1 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_sum_f16_axis1 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_max_bf16 (ops_reduce, bf16, stub): C1, C4, C6

foreach_reduce_max_f32_axis0 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_max_f16_axis0 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_max_f32_axis1 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_max_f16_axis1 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_min_bf16 (ops_reduce, bf16, stub): C1, C4, C6

foreach_reduce_min_f32_axis0 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_min_f16_axis0 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_min_f32_axis1 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_min_f16_axis1 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_mean_bf16 (ops_reduce, bf16, stub): C1, C4, C6

foreach_reduce_mean_f32_axis0 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_mean_f16_axis0 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_mean_f32_axis1 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_mean_f16_axis1 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_prod_bf16 (ops_reduce, bf16, stub): C1, C4, C6

foreach_reduce_prod_f32_axis0 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_prod_f16_axis0 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_prod_f32_axis1 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_prod_f16_axis1 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_l1_norm_f32 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_l1_norm_f16 (ops_reduce, f16, stub): C1, C4, C6

foreach_reduce_l2_norm_f32 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_l2_norm_f16 (ops_reduce, f16, stub): C1, C4, C6

foreach_reduce_logsumexp_f32 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_logsumexp_f16 (ops_reduce, f16, stub): C1, C4, C6

foreach_reduce_nansum_f32 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_nansum_f16 (ops_reduce, f16, stub): C1, C4, C6

foreach_reduce_nanmean_f32 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_nanmean_f16 (ops_reduce, f16, stub): C1, C4, C6

foreach_reduce_count_nonzero_f32 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_count_nonzero_f16 (ops_reduce, f16, stub): C1, C4, C6

foreach_reduce_median_f32 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_median_f16 (ops_reduce, f16, stub): C1, C4, C6

foreach_reduce_var_f32 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_var_f16 (ops_reduce, f16, stub): C1, C4, C6

foreach_reduce_std_f32 (ops_reduce, f32, stub): C1, C4, C6

foreach_reduce_std_f16 (ops_reduce, f16, stub): C1, C4, C6

foreach_reduce_l1_norm_bf16 (ops_reduce, bf16, stub): C1, C4, C6

foreach_reduce_l2_norm_bf16 (ops_reduce, bf16, stub): C1, C4, C6

foreach_reduce_logsumexp_bf16 (ops_reduce, bf16, stub): C1, C4, C6

foreach_reduce_nansum_bf16 (ops_reduce, bf16, stub): C1, C4, C6

foreach_upsample_nearest_2d_f32 (ops_resize, f32, stub): C1, C4, C6

foreach_upsample_nearest_2d_f16 (ops_resize, f16, stub): C1, C4, C6

foreach_upsample_nearest_3d_f32 (ops_resize, f32, stub): C1, C4, C6

foreach_upsample_nearest_3d_f16 (ops_resize, f16, stub): C1, C4, C6

foreach_upsample_bilinear_2d_f32 (ops_resize, f32, stub): C1, C4, C6

foreach_upsample_bilinear_2d_f16 (ops_resize, f16, stub): C1, C4, C6

foreach_upsample_bilinear_3d_f32 (ops_resize, f32, stub): C1, C4, C6

foreach_upsample_bilinear_3d_f16 (ops_resize, f16, stub): C1, C4, C6

foreach_upsample_bicubic_2d_f32 (ops_resize, f32, stub): C1, C4, C6

foreach_upsample_bicubic_2d_f16 (ops_resize, f16, stub): C1, C4, C6

foreach_upsample_trilinear_3d_f32 (ops_resize, f32, stub): C1, C4, C6

foreach_upsample_trilinear_3d_f16 (ops_resize, f16, stub): C1, C4, C6

foreach_interpolate_nearest_2d_f32 (ops_resize, f32, stub): C1, C4, C6

foreach_interpolate_nearest_2d_f16 (ops_resize, f16, stub): C1, C4, C6

foreach_interpolate_nearest_3d_f32 (ops_resize, f32, stub): C1, C4, C6

foreach_interpolate_nearest_3d_f16 (ops_resize, f16, stub): C1, C4, C6

foreach_interpolate_bilinear_2d_f32 (ops_resize, f32, stub): C1, C4, C6

foreach_interpolate_bilinear_2d_f16 (ops_resize, f16, stub): C1, C4, C6

foreach_interpolate_bilinear_3d_f32 (ops_resize, f32, stub): C1, C4, C6

foreach_interpolate_bilinear_3d_f16 (ops_resize, f16, stub): C1, C4, C6

foreach_interpolate_bicubic_2d_f32 (ops_resize, f32, stub): C1, C4, C6

foreach_interpolate_bicubic_2d_f16 (ops_resize, f16, stub): C1, C4, C6

foreach_resize_nearest_2d_f32 (ops_resize, f32, stub): C1, C4, C6

foreach_resize_nearest_2d_f16 (ops_resize, f16, stub): C1, C4, C6

foreach_resize_bilinear_2d_f32 (ops_resize, f32, stub): C1, C4, C6

foreach_resize_bilinear_2d_f16 (ops_resize, f16, stub): C1, C4, C6

foreach_adaptive_avg_pool_2d_f32 (ops_resize, f32, stub): C1, C4, C6

foreach_adaptive_avg_pool_2d_f16 (ops_resize, f16, stub): C1, C4, C6

foreach_adaptive_avg_pool_3d_f32 (ops_resize, f32, stub): C1, C4, C6

foreach_adaptive_avg_pool_3d_f16 (ops_resize, f16, stub): C1, C4, C6

foreach_adaptive_max_pool_2d_f32 (ops_resize, f32, stub): C1, C4, C6

foreach_adaptive_max_pool_2d_f16 (ops_resize, f16, stub): C1, C4, C6

foreach_adaptive_max_pool_3d_f32 (ops_resize, f32, stub): C1, C4, C6

foreach_adaptive_max_pool_3d_f16 (ops_resize, f16, stub): C1, C4, C6

foreach_upsample_bilinear_2d_align_f32 (ops_resize, f32, stub): C1, C4, C6

foreach_upsample_bilinear_2d_align_f16 (ops_resize, f16, stub): C1, C4, C6

foreach_upsample_bicubic_2d_align_f32 (ops_resize, f32, stub): C1, C4, C6

foreach_upsample_bicubic_2d_align_f16 (ops_resize, f16, stub): C1, C4, C6

foreach_interpolate_bilinear_2d_align_f32 (ops_resize, f32, stub): C1, C4, C6

foreach_interpolate_bilinear_2d_align_f16 (ops_resize, f16, stub): C1, C4, C6

foreach_resize_bilinear_2d_align_f32 (ops_resize, f32, stub): C1, C4, C6

foreach_resize_bilinear_2d_align_f16 (ops_resize, f16, stub): C1, C4, C6

foreach_grid_sample_bilinear_f32 (ops_resize, f32, stub): C1, C4, C6

foreach_grid_sample_bilinear_f16 (ops_resize, f16, stub): C1, C4, C6

foreach_grid_sample_nearest_f32 (ops_resize, f32, stub): C1, C4, C6

foreach_grid_sample_nearest_f16 (ops_resize, f16, stub): C1, C4, C6

foreach_grid_sample_bicubic_f32 (ops_resize, f32, stub): C1, C4, C6

foreach_grid_sample_bicubic_f16 (ops_resize, f16, stub): C1, C4, C6

foreach_pixel_shuffle_f32 (ops_resize, f32, stub): C1, C4, C6

foreach_pixel_unshuffle_f32 (ops_resize, f32, stub): C1, C4, C6

foreach_pixel_shuffle_f16 (ops_resize, f16, stub): C1, C4, C6

foreach_pixel_unshuffle_f16 (ops_resize, f16, stub): C1, C4, C6

foreach_gather_f32 (ops_index, f32, ✓ real source): C1

foreach_gather_f16 (ops_index, f16, ✓ real source): C1

foreach_gather_int32 (ops_index, i32, ✓ real source): C1

foreach_scatter_f32 (ops_index, f32, ✓ real source): C1

foreach_scatter_f16 (ops_index, f16, ✓ real source): C1

foreach_scatter_int32 (ops_index, i32, ✓ real source): C1

foreach_scatter_add_f32 (ops_index, f32, ✓ real source): C1

foreach_scatter_add_f16 (ops_index, f16, ✓ real source): C1

foreach_scatter_add_int32 (ops_index, i32, ✓ real source): C1

foreach_scatter_mul_f32 (ops_index, f32, ✓ real source): C1

foreach_scatter_mul_f16 (ops_index, f16, ✓ real source): C1

foreach_scatter_mul_int32 (ops_index, i32, ✓ real source): C1

foreach_index_add_f32 (ops_index, f32, stub): C1, C4, C6

foreach_index_add_f16 (ops_index, f16, stub): C1, C4, C6

foreach_index_add_int32 (ops_index, i32, stub): C1, C4, C6

foreach_index_copy_f32 (ops_index, f32, stub): C1, C4, C6

foreach_index_copy_f16 (ops_index, f16, stub): C1, C4, C6

foreach_index_copy_int32 (ops_index, i32, stub): C1, C4, C6

foreach_index_fill_f32 (ops_index, f32, stub): C1, C4, C6

foreach_index_fill_f16 (ops_index, f16, stub): C1, C4, C6

foreach_index_fill_int32 (ops_index, i32, stub): C1, C4, C6

foreach_index_select_f32 (ops_index, f32, stub): C1, C4, C6

foreach_index_select_f16 (ops_index, f16, stub): C1, C4, C6

foreach_index_select_int32 (ops_index, i32, stub): C1, C4, C6

foreach_index_put_f32 (ops_index, f32, stub): C1, C4, C6

foreach_index_put_f16 (ops_index, f16, stub): C1, C4, C6

foreach_index_put_int32 (ops_index, i32, stub): C1, C4, C6

foreach_masked_fill_f32 (ops_index, f32, stub): C1, C4, C6

foreach_masked_fill_f16 (ops_index, f16, stub): C1, C4, C6

foreach_masked_fill_int32 (ops_index, i32, stub): C1, C4, C6

foreach_masked_select_f32 (ops_index, f32, stub): C1, C4, C6

foreach_masked_select_f16 (ops_index, f16, stub): C1, C4, C6

foreach_masked_select_int32 (ops_index, i32, stub): C1, C4, C6

foreach_masked_scatter_f32 (ops_index, f32, stub): C1, C4, C6

foreach_masked_scatter_f16 (ops_index, f16, stub): C1, C4, C6

foreach_masked_scatter_int32 (ops_index, i32, stub): C1, C4, C6

foreach_where_f32 (ops_index, f32, stub): C1, C4, C6

foreach_where_f16 (ops_index, f16, stub): C1, C4, C6

foreach_where_int32 (ops_index, i32, stub): C1, C4, C6

foreach_nonzero_f32 (ops_index, f32, stub): C1, C4, C6

foreach_nonzero_f16 (ops_index, f16, stub): C1, C4, C6

foreach_nonzero_int32 (ops_index, i32, stub): C1, C4, C6

foreach_sort_f32 (ops_index, f32, stub): C1, C4, C6

foreach_sort_f16 (ops_index, f16, stub): C1, C4, C6

foreach_sort_int32 (ops_index, i32, stub): C1, C4, C6

foreach_argsort_f32 (ops_index, f32, stub): C1, C4, C6

foreach_argsort_f16 (ops_index, f16, stub): C1, C4, C6

foreach_argsort_int32 (ops_index, i32, stub): C1, C4, C6

foreach_topk_f32 (ops_index, f32, ✓ real source): C1

foreach_topk_f16 (ops_index, f16, ✓ real source): C1

foreach_topk_int32 (ops_index, i32, ✓ real source): C1

foreach_unique_f32 (ops_index, f32, stub): C1, C4, C6

foreach_unique_f16 (ops_index, f16, stub): C1, C4, C6

foreach_unique_int32 (ops_index, i32, stub): C1, C4, C6

foreach_searchsorted_f32 (ops_index, f32, stub): C1, C4, C6

foreach_searchsorted_f16 (ops_index, f16, stub): C1, C4, C6

foreach_searchsorted_int32 (ops_index, i32, stub): C1, C4, C6

foreach_bucketize_f32 (ops_index, f32, stub): C1, C4, C6

foreach_bucketize_f16 (ops_index, f16, stub): C1, C4, C6

foreach_bucketize_int32 (ops_index, i32, stub): C1, C4, C6

foreach_one_hot_f32 (ops_index, f32, stub): C1, C4, C6

foreach_one_hot_f16 (ops_index, f16, stub): C1, C4, C6

foreach_one_hot_int32 (ops_index, i32, stub): C1, C4, C6

foreach_embedding_bag_f32 (ops_index, f32, ✓ real source): C1, C2

foreach_embedding_bag_f16 (ops_index, f16, ✓ real source): C1, C2

foreach_embedding_bag_int32 (ops_index, i32, ✓ real source): C1, C2
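The three `embedding_bag` kernels are the only entries in this list flagged for C2 (buffer overflow). Following the H.3 counter-example (`count = buffer_size + 1`), the sketch below shows one way an opaque buffer ID plus an explicit count parameter can keep a kernel from writing past an allocation. `BufId`, `LocalMem`, and the runtime length check are hypothetical illustrations, not the ascend-rs implementation.

```rust
/// Hypothetical opaque handle: kernels refer to on-chip buffers by ID,
/// never by raw pointer, so they cannot index memory directly.
#[derive(Clone, Copy, Debug)]
struct BufId(usize);

struct LocalMem {
    buffers: Vec<Vec<f32>>,
}

impl LocalMem {
    fn alloc(&mut self, len: usize) -> BufId {
        self.buffers.push(vec![0.0; len]);
        BufId(self.buffers.len() - 1)
    }

    /// Fill `count` elements; the explicit count is checked against the
    /// buffer's real length instead of being trusted.
    fn fill(&mut self, id: BufId, value: f32, count: usize) -> Result<(), String> {
        let buf = &mut self.buffers[id.0];
        if count > buf.len() {
            return Err(format!("count {count} exceeds buffer size {}", buf.len()));
        }
        buf[..count].fill(value);
        Ok(())
    }
}

fn main() {
    let mut mem = LocalMem { buffers: Vec::new() };
    let buf = mem.alloc(256);
    println!("{:?}", mem.fill(buf, 1.0, 256));
    // The H.3 counter-example: count = buffer_size + 1.
    match mem.fill(buf, 1.0, 257) {
        Ok(()) => println!("unexpected success"),
        Err(e) => println!("rejected: {e}"),
    }
}
```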

foreach_cummax_f32 (ops_index, f32, stub): C1, C4, C6

foreach_cummax_f16 (ops_index, f16, stub): C1, C4, C6

foreach_cummax_int32 (ops_index, i32, stub): C1, C4, C6

foreach_cummin_f32 (ops_index, f32, stub): C1, C4, C6

foreach_cummin_f16 (ops_index, f16, stub): C1, C4, C6

foreach_cummin_int32 (ops_index, i32, stub): C1, C4, C6

foreach_scatter_nd_f32 (ops_index, f32, ✓ real source): C1

foreach_scatter_nd_f16 (ops_index, f16, ✓ real source): C1

foreach_scatter_nd_int32 (ops_index, i32, ✓ real source): C1

foreach_gather_nd_f32 (ops_index, f32, ✓ real source): C1

foreach_gather_nd_f16 (ops_index, f16, ✓ real source): C1

foreach_gather_nd_int32 (ops_index, i32, ✓ real source): C1

foreach_index_put_accumulate_f32 (ops_index, f32, stub): C1, C4, C6

foreach_index_put_accumulate_f16 (ops_index, f16, stub): C1, C4, C6

foreach_index_put_accumulate_int32 (ops_index, i32, stub): C1, C4, C6

foreach_take_along_axis_f32 (ops_index, f32, stub): C1, C4, C6

foreach_take_along_axis_f16 (ops_index, f16, stub): C1, C4, C6

foreach_take_along_axis_int32 (ops_index, i32, stub): C1, C4, C6

foreach_put_along_axis_f32 (ops_index, f32, stub): C1, C4, C6

foreach_put_along_axis_f16 (ops_index, f16, stub): C1, C4, C6

foreach_put_along_axis_int32 (ops_index, i32, stub): C1, C4, C6

foreach_bincount_f32 (ops_index, f32, stub): C1, C4, C6

foreach_bincount_f16 (ops_index, f16, stub): C1, C4, C6

foreach_bincount_int32 (ops_index, i32, stub): C1, C4, C6

foreach_scatter_max_f32 (ops_index, f32, ✓ real source): C1

foreach_scatter_max_f16 (ops_index, f16, ✓ real source): C1

foreach_scatter_max_int32 (ops_index, i32, ✓ real source): C1

foreach_scatter_min_f32 (ops_index, f32, ✓ real source): C1

foreach_scatter_min_f16 (ops_index, f16, ✓ real source): C1

foreach_scatter_min_int32 (ops_index, i32, ✓ real source): C1

foreach_gather_bf16 (ops_index, bf16, ✓ real source): C1

foreach_scatter_bf16 (ops_index, bf16, ✓ real source): C1

foreach_index_select_bf16 (ops_index, bf16, stub): C1, C4, C6

foreach_where_bf16 (ops_index, bf16, stub): C1, C4, C6

foreach_sort_bf16 (ops_index, bf16, stub): C1, C4, C6

foreach_topk_bf16 (ops_index, bf16, ✓ real source): C1

foreach_masked_fill_bf16 (ops_index, bf16, stub): C1, C4, C6

foreach_masked_select_bf16 (ops_index, bf16, stub): C1, C4, C6

foreach_sort_int64 (ops_index, f32, stub): C1, C4, C6

foreach_argsort_int64 (ops_index, f32, stub): C1, C4, C6

foreach_topk_int64 (ops_index, f32, ✓ real source): C1

foreach_unique_int64 (ops_index, f32, stub): C1, C4, C6

foreach_gather_int8 (ops_index, i8, ✓ real source): C1

foreach_scatter_int8 (ops_index, i8, ✓ real source): C1

foreach_scatter_add_bf16 (ops_index, bf16, ✓ real source): C1

foreach_scatter_mul_bf16 (ops_index, bf16, ✓ real source): C1

foreach_index_add_bf16 (ops_index, bf16, stub): C1, C4, C6

foreach_index_copy_bf16 (ops_index, bf16, stub): C1, C4, C6