Appendix C: Vulnerability Analysis of 300 MultiKernelBench Kernels
The 300 kernels in MultiKernelBench span 15 categories. Implemented as standard AscendC C++ kernels, each inherits the structural vulnerability patterns of the GM_ADDR/LocalTensor/FreeTensor API. We systematically classify which patterns affect which kernel categories, count the resulting exposures, and show the highest-risk C++ vs. ascend-rs comparisons.
C.1 Vulnerability Pattern Prevalence
| Vulnerability Pattern | Affected Kernel Categories | Count (/300) | Severity |
|---|---|---|---|
| V1: GM_ADDR type erasure | All 15 categories | 300 | High |
| V2: Unchecked GetValue/SetValue OOB | Index (12), Conv (34), Pooling (6), Resize (10), Architecture (50), Attention (15), Math (6) | 133 | Critical |
| V3: Integer overflow in offset calc | All 15 categories (any kernel computing multi-block or multi-dimensional offsets) | 300 | High |
| V4: FreeTensor use-after-free | All tiled/pipelined kernels | 300 | High |
| V5: Double-free of LocalTensor | All tiled/pipelined kernels | 300 | Medium |
| V6: Missing pipe_barrier sync | All DMA+compute kernels | 300 | Critical |
Key finding: every AscendC C++ kernel is structurally exposed to V1 (type erasure), V4 (use-after-free), V5 (double-free), and V6 (missing sync) because these are properties of the API itself, not of specific algorithms; and because every kernel computes block or element offsets, V3 (integer overflow) applies across the board as well. Only V2 (unchecked element-indexed access) is algorithm-dependent: it affects the 133 kernels that read or write through GetValue/SetValue with computed indices.
C.2 Highest-Risk Category: Index Operations (12 kernels)
Index kernels (gather, scatter, scatter_add, index_select, index_copy, index_add, embedding, masked_fill, inplace_update, take_along_dim, argmax, argmin) are the highest-risk category because they combine all six vulnerability patterns simultaneously:
- V1: `GM_ADDR` erases tensor element types
- V2: User-provided index values access arbitrary offsets with no bounds check
- V3: `idx * row_len + j` can overflow `uint32_t` for large tensors
- V4/V5: Tiled implementations use the `FreeTensor` lifecycle
- V6: DMA ↔ compute synchronization required
C++ AscendC gather (vulnerable):
```cpp
#include "kernel_operator.h"

// GM_ADDR erases all type info — caller can pass any dtype
extern "C" __global__ __aicore__
void gather(GM_ADDR input, GM_ADDR index, GM_ADDR output, GM_ADDR len_buf) {
    uint32_t n = *((__gm__ uint32_t *)len_buf);
    // V1: Manual cast from GM_ADDR — no compile-time type safety
    __gm__ float *in_ptr = (__gm__ float *)input;
    __gm__ uint32_t *idx_ptr = (__gm__ uint32_t *)index;
    __gm__ float *out_ptr = (__gm__ float *)output;
    for (uint32_t i = 0; i < n; i++) {
        uint32_t idx = idx_ptr[i];
        // V2: No bounds check on idx — attacker-controlled index
        // reads arbitrary memory within GM address space
        out_ptr[i] = in_ptr[idx]; // OOB if idx >= input_len
    }
}
```
ascend-rs gather (mitigated):
```rust
#[ascend_std::aiv_kernel]
pub unsafe fn gather(
    input: *const f32, // V1 mitigated: typed pointer, not GM_ADDR
    index: *const u32,
    output: *mut f32,
    len: *const u32,
) {
    unsafe {
        let n = *len;
        let mut i = 0u32;
        loop {
            if i >= n { break; } // Loop bounds explicit
            let idx = *index.wrapping_add(i as usize);
            // V2: wrapping_add is explicit about pointer arithmetic semantics
            // V3: no integer overflow — each offset cast individually
            *output.wrapping_add(i as usize) = *input.wrapping_add(idx as usize);
            i = i + 1;
        }
        // V4/V5: No FreeTensor — buffer IDs auto-managed
        // V6: No DMA/compute split — scalar ops on GM directly
    }
}
```
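For contrast, the same gather written against safe slices (a host-side sketch, not the ascend-rs device API, which must work through raw pointers) turns the V2 hazard into a recoverable error instead of a silent out-of-bounds read:

```rust
// Host-side sketch: bounds-checked gather over safe slices.
// On an OOB index, returns Err with the offending position instead of
// reading arbitrary memory.
fn gather_checked(input: &[f32], index: &[u32], output: &mut [f32]) -> Result<(), usize> {
    for (i, &idx) in index.iter().enumerate() {
        // `get` returns None for an out-of-range index: V2 closed
        let v = *input.get(idx as usize).ok_or(i)?;
        output[i] = v;
    }
    Ok(())
}

fn main() {
    let input = [10.0_f32, 20.0, 30.0];
    let index = [2_u32, 0, 7]; // 7 is out of bounds for a 3-element input
    let mut output = [0.0_f32; 3];
    let r = gather_checked(&input, &index, &mut output);
    assert_eq!(r, Err(2)); // rejected at element 2, not silently read
    assert_eq!(&output[..2], &[30.0, 10.0]);
}
```

This is the direction the mitigation table in C.6 points at: where element-indexed access survives, push it behind an interface that carries the length.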
C.3 High-Risk Category: Convolution Kernels (34 kernels)
Convolution kernels have deeply nested loops with complex multi-dimensional index arithmetic (oc * in_ch * k_h * k_w + ic * k_h * k_w + kh * k_w + kw). A single wrong dimension in the index expression silently reads from wrong memory.
C++ AscendC conv2d index calculation (vulnerable):
```cpp
// V2+V3: 6-level nested index arithmetic — easy to get a dimension wrong
for (int oc = 0; oc < out_ch; oc++) {
    for (int oh = 0; oh < out_h; oh++) {
        for (int ow = 0; ow < out_w; ow++) {
            float sum = 0.0f;
            for (int ic = 0; ic < in_ch; ic++) {
                for (int kh = 0; kh < k_h; kh++) {
                    for (int kw = 0; kw < k_w; kw++) {
                        int ih = oh * stride + kh * dilation;
                        int iw = ow * stride + kw * dilation;
                        // V3: 32-bit multiply chain can overflow
                        int in_idx = ic * in_h * in_w + ih * in_w + iw;
                        int w_idx = oc * in_ch * k_h * k_w
                                  + ic * k_h * k_w + kh * k_w + kw;
                        // V2: No bounds check — if ih >= in_h or iw >= in_w,
                        // reads out-of-bounds from GM
                        sum += (float)inLocal.GetValue(in_idx)
                             * (float)wLocal.GetValue(w_idx);
                    }
                }
            }
            outLocal.SetValue(oc * out_h * out_w + oh * out_w + ow, sum);
        }
    }
}
```
ascend-rs conv2d (mitigated):
```rust
#[ascend_std::aiv_kernel]
pub unsafe fn conv_standard_2d(
    input: *const f32, weight: *const f32, output: *mut f32,
    params: *const u32, // [in_ch, out_ch, in_h, in_w, k_h, k_w, stride, dilation]
) {
    unsafe {
        // All params read from typed pointer — no GM_ADDR cast
        let in_ch = *params;
        let out_ch = *params.wrapping_add(1);
        // ... (read remaining params)
        let out_h = (in_h - (k_h - 1) * dilation - 1) / stride + 1;
        let out_w = (in_w - (k_w - 1) * dilation - 1) / stride + 1;
        let mut oc = 0u32;
        loop {
            if oc >= out_ch { break; }
            // ... nested loops with explicit bounds ...
            let ih = oh * stride + kh * dilation;
            let iw = ow * stride + kw * dilation;
            // V3 mitigated: wrapping semantics explicit via `as usize`
            // Debug builds panic on overflow, release wraps intentionally
            let in_idx = (ic * in_h * in_w + ih * in_w + iw) as usize;
            let w_idx = (oc * in_ch * k_h * k_w
                + ic * k_h * k_w + kh * k_w + kw) as usize;
            sum = sum + *input.wrapping_add(in_idx) * *weight.wrapping_add(w_idx);
            // V4/V5: No FreeTensor needed
            // V6: No DMA — scalar GM access
        }
    }
}
```
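The 32-bit wraparound called out as V3 can be reproduced on the host. A minimal sketch, with illustrative dimensions (not taken from the benchmark) chosen so the weight-index product is exactly 2^32:

```rust
// Host-side demonstration of V3: the 32-bit index chain silently wraps.
// Dimensions are illustrative only.
fn main() {
    let (oc, in_ch, k_h, k_w): (u32, u32, u32, u32) = (4096, 4096, 16, 16);

    // checked_mul surfaces the overflow instead of silently wrapping
    let checked = oc.checked_mul(in_ch)
        .and_then(|v| v.checked_mul(k_h))
        .and_then(|v| v.checked_mul(k_w));
    assert_eq!(checked, None); // 4096 * 4096 * 16 * 16 = 2^32, too big for u32

    // wrapping_mul is what the C++ uint32_t multiply chain does implicitly
    let wrapped = oc.wrapping_mul(in_ch).wrapping_mul(k_h).wrapping_mul(k_w);
    assert_eq!(wrapped, 0); // wraps to 0: the kernel would silently read element 0

    // widening to u64 before multiplying avoids the wrap entirely
    let wide = (oc as u64) * (in_ch as u64) * (k_h as u64) * (k_w as u64);
    assert_eq!(wide, 1u64 << 32);
}
```

In a debug build the plain `*` operator would panic at the overflow, which is the "debug builds panic on overflow, release wraps" behavior the mitigated kernel comments rely on.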
C.4 High-Risk Category: Fused Operations (100 kernels)
Fused kernels (matmul+activation, conv+norm+activation, etc.) chain multiple pipeline stages. In C++, each stage requires its own AllocTensor/FreeTensor/pipe_barrier — missing any one produces silent data corruption.
C++ fused matmul+sigmoid (vulnerable):
```cpp
// Fused matmul + sigmoid: C = sigmoid(A * B)
// V4: 4 tensors allocated/freed — each is a use-after-free opportunity
// V5: Copy-paste between fused variants can duplicate FreeTensor
// V6: 3 pipeline transitions (DMA→cube, cube→vector, vector→DMA)
//     — each requires pipe_barrier, forgetting any one = stale data
AscendC::LocalTensor<half> aLocal = inQueueA.AllocTensor<half>();
AscendC::DataCopy(aLocal, aGm, m * k);
inQueueA.EnQue(aLocal);
// V6: Need barrier here for DMA → cube
aLocal = inQueueA.DeQue<half>();
// ... matmul ...
inQueueA.FreeTensor(aLocal);
// V4: aLocal handle still valid — accidental read compiles and runs
AscendC::LocalTensor<float> cLocal = outQueue.AllocTensor<float>();
// V6: Need barrier here for cube → vector
AscendC::Muls(cLocal, cLocal, -1.0f, total);     // sigmoid step 1
AscendC::Exp(cLocal, cLocal, total);             // sigmoid step 2
// V6: Need inter-op barriers for in-place chained ops on 310P
AscendC::Adds(cLocal, cLocal, 1.0f, total);      // sigmoid step 3
AscendC::Reciprocal(cLocal, cLocal, total);      // sigmoid step 4
outQueue.FreeTensor(cLocal);
```
ascend-rs fused matmul+sigmoid (mitigated):
```rust
#[ascend_std::aiv_kernel]
pub unsafe fn fused_matmul_sigmoid(
    a: *const u16, b: *const u16, c: *mut f32, dims: *const u32,
) {
    unsafe {
        let m = *dims;
        let k = *dims.wrapping_add(1);
        let n = *dims.wrapping_add(2);
        // V6 mitigated: matmul_f16 handles DMA+cube internally
        ascend_std::kernel_ops::matmul_f16(c, a, b, m, k, n);
        ascend_std::ascend_pipe_barrier(); // Explicit, visible
        let total = m * n;
        let buf_c = ascend_std::ascend_buf_alloc(total);
        ascend_std::ascend_buf_load_f32(buf_c, c as *const f32, total);
        ascend_std::ascend_pipe_barrier(); // Explicit, visible
        // V6 mitigated: sigmoid_f32 includes ALL internal barriers
        // (muls → barrier → exp → barrier → adds → barrier → reciprocal)
        ascend_std::kernel_ops::sigmoid_f32(buf_c, buf_c, total);
        ascend_std::ascend_pipe_barrier(); // Explicit, visible
        ascend_std::ascend_buf_store_f32(c, buf_c, total);
        // V4/V5: No FreeTensor — buf_c auto-managed
    }
}
```
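The four-step decomposition that the sigmoid stage performs (negate, exponentiate, add one, take the reciprocal) can be checked on the host. This sketch mirrors those steps over a plain slice and compares the result against 1/(1+e^-x):

```rust
// Host-side check that the 4-step in-place chain equals sigmoid(x) = 1/(1+e^-x).
fn sigmoid_chain(buf: &mut [f32]) {
    for v in buf.iter_mut() { *v = *v * -1.0; } // step 1: Muls by -1    -> -x
    for v in buf.iter_mut() { *v = v.exp(); }   // step 2: Exp           -> e^-x
    for v in buf.iter_mut() { *v = *v + 1.0; }  // step 3: Adds 1        -> 1 + e^-x
    for v in buf.iter_mut() { *v = 1.0 / *v; }  // step 4: Reciprocal    -> 1/(1 + e^-x)
}

fn main() {
    let mut buf = [-2.0_f32, 0.0, 3.0];
    let expected: Vec<f32> = buf.iter().map(|x| 1.0 / (1.0 + (-x).exp())).collect();
    sigmoid_chain(&mut buf);
    for (a, b) in buf.iter().zip(expected.iter()) {
        assert!((a - b).abs() < 1e-6);
    }
    assert!((buf[1] - 0.5).abs() < 1e-6); // sigmoid(0) = 0.5
}
```

Each arrow in that chain is a point where the C++ version needs an inter-op barrier on 310P; folding the whole chain into one `sigmoid_f32` call is what removes those failure points.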
C.5 Vulnerability Tally: 300 Kernels × 6 Patterns
| Category | Kernels | V1 Type | V2 OOB | V3 Overflow | V4 UAF | V5 DblFree | V6 Sync | Total Exposures |
|---|---|---|---|---|---|---|---|---|
| Activation | 16 | 16 | 0 | 16 | 16 | 16 | 16 | 80 |
| Architecture | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 300 |
| Attention | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 90 |
| Broadcast | 10 | 10 | 0 | 10 | 10 | 10 | 10 | 50 |
| Convolution | 34 | 34 | 34 | 34 | 34 | 34 | 34 | 204 |
| Fuse | 100 | 100 | 0 | 100 | 100 | 100 | 100 | 500 |
| Index | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 72 |
| Loss | 7 | 7 | 0 | 7 | 7 | 7 | 7 | 35 |
| Math | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 36 |
| Matmul | 17 | 17 | 0 | 17 | 17 | 17 | 17 | 85 |
| Normalization | 8 | 8 | 0 | 8 | 8 | 8 | 8 | 40 |
| Optimizer | 5 | 5 | 0 | 5 | 5 | 5 | 5 | 25 |
| Pooling | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 36 |
| Reduce | 5 | 5 | 0 | 5 | 5 | 5 | 5 | 25 |
| Resize | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 60 |
| Total | 300 | 300 | 133 | 300 | 300 | 300 | 300 | 1,633 |
C.6 How ascend-rs Eliminates Each Pattern
| Pattern | C++ Root Cause | ascend-rs Mitigation | Residual Risk |
|---|---|---|---|
| V1: Type erasure | GM_ADDR = uint8_t* for all tensors | Typed *const f32 / *const u16 in fn signatures | None (compile-time) |
| V2: Unchecked OOB | GetValue(i) / SetValue(i,v) with no bounds check | Vector intrinsics with explicit count n; scalar loops use wrapping_add | unsafe pointer arithmetic still unchecked at runtime |
| V3: Integer overflow | blockIdx * perBlockLen silent wraparound | wrapping_mul makes overflow explicit; debug builds panic | Developer must choose wrapping_* vs checked_* |
| V4: Use-after-free | FreeTensor() invalidates handle, C++ allows continued use | No FreeTensor API; buffer IDs are typed newtypes (UbBuf, L1Buf, etc.), not owning handles | None (API-level) |
| V5: Double-free | FreeTensor() called twice corrupts free list | No FreeTensor API; buffer lifecycle auto-managed | None (API-level) |
| V6: Missing sync | Manual pipe_barrier() between every pipeline transition | kernel_ops composites include all internal barriers; DMA barriers explicit and few | Developer must place DMA↔compute barriers (2 per kernel, not per-op) |
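The V4/V5 rows rest on the type design, which a minimal host-side sketch can illustrate (the type and function names here are illustrative, not the actual ascend-rs definitions): distinct newtype wrappers make it a compile error to pass a unified-buffer ID where an L1 ID is expected, and with no free() in the API there is nothing to double-free or use after free.

```rust
// Illustrative sketch of non-owning, typed buffer IDs.
// Each memory space gets its own newtype, so IDs cannot be confused.
#[derive(Debug, Clone, Copy, PartialEq)]
struct UbBuf(u32); // unified-buffer ID
#[derive(Debug, Clone, Copy, PartialEq)]
struct L1Buf(u32); // L1-buffer ID

// A stand-in for a DMA setup call: the signature pins down which
// space each argument lives in.
fn copy_ub_to_l1(src: UbBuf, dst: L1Buf) -> (u32, u32) {
    (src.0, dst.0)
}

fn main() {
    let ub = UbBuf(3);
    let l1 = L1Buf(9);
    assert_eq!(copy_ub_to_l1(ub, l1), (3, 9));
    // copy_ub_to_l1(l1, ub); // error[E0308]: mismatched types — caught at compile time
}
```

Contrast with `LocalTensor`, where a freed handle is indistinguishable from a live one at the type level, so both V4 and V5 only surface at runtime, if at all.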
Net result: of the 1,633 total vulnerability exposures across 300 kernels, ascend-rs eliminates 1,500 at the API/type level (V1, V3, V4, V5 fully; V6 reduced from per-op barrier discipline to a few per-kernel barriers). The remaining 133 OOB exposures (V2) are mitigated by replacing element-indexed access with whole-vector operations, though unsafe pointer arithmetic in scalar fallback kernels remains the programmer's responsibility.
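As a quick cross-check of the tally: five patterns apply to all 300 kernels and V2 applies to the 133 element-indexed kernels, which reproduces both headline numbers.

```rust
// Arithmetic cross-check of the C.5 totals.
fn main() {
    let kernels = 300u32;
    let universal_patterns = 5; // V1, V3, V4, V5, V6 affect every kernel
    let v2_kernels = 133;       // element-indexed categories only

    let total_exposures = kernels * universal_patterns + v2_kernels;
    assert_eq!(total_exposures, 1_633);

    // Eliminated at the API/type level: everything except the 133 V2 exposures
    assert_eq!(total_exposures - v2_kernels, 1_500);
}
```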