Appendix C: Vulnerability Analysis of 300 MultiKernelBench Kernels
The 300 kernels in MultiKernelBench span 15 categories. Implemented as standard AscendC C++ kernels, each inherits the structural vulnerability patterns of the GM_ADDR/LocalTensor/FreeTensor API. We systematically classify which patterns affect which kernel categories, count the resulting exposures, and show the highest-risk C++ vs. ascend-rs comparisons.
C.1 Vulnerability Pattern Prevalence
| Vulnerability Pattern | Affected Kernel Categories | Count (/300) | Severity |
|---|---|---|---|
| V1: GM_ADDR type erasure | All 15 categories | 300 | High |
| V2: Unchecked GetValue/SetValue OOB | Index (12), Conv (34), Pooling (6), Resize (10), Architecture (50), Attention (15), Math (6) | 133 | Critical |
| V3: Integer overflow in offset calc | All 15 categories (any kernel computing multi-block or multi-dimensional offsets) | 300 | High |
| V4: FreeTensor use-after-free | All tiled/pipelined kernels | 300 | High |
| V5: Double-free of LocalTensor | All tiled/pipelined kernels | 300 | Medium |
| V6: Missing pipe_barrier sync | All DMA+compute kernels | 300 | Critical |
Key finding: every AscendC C++ kernel is structurally exposed to V1 (type erasure), V4 (use-after-free), V5 (double-free), and V6 (missing sync) because these are properties of the API itself, not of specific algorithms; and because every kernel computes block or element offsets, V3 (integer overflow) applies across the board as well. Only V2 (unchecked element-indexed access) is algorithm-dependent: it affects the 133 kernels that read or write through GetValue/SetValue with computed indices.
C.2 Highest-Risk Category: Index Operations (12 kernels)
Index kernels (gather, scatter, scatter_add, index_select, index_copy, index_add, embedding, masked_fill, inplace_update, take_along_dim, argmax, argmin) are the highest-risk category because they combine all six vulnerability patterns simultaneously:
- V1: `GM_ADDR` erases tensor element types
- V2: User-provided index values access arbitrary offsets with no bounds check
- V3: `idx * row_len + j` can overflow `uint32_t` for large tensors
- V4/V5: Tiled implementations use the `FreeTensor` lifecycle
- V6: DMA ↔ compute synchronization required
C++ AscendC gather (vulnerable):
```cpp
#include "kernel_operator.h"

// GM_ADDR erases all type info — caller can pass any dtype
extern "C" __global__ __aicore__
void gather(GM_ADDR input, GM_ADDR index, GM_ADDR output, GM_ADDR len_buf) {
    uint32_t n = *((__gm__ uint32_t *)len_buf);
    // V1: Manual cast from GM_ADDR — no compile-time type safety
    __gm__ float *in_ptr = (__gm__ float *)input;
    __gm__ uint32_t *idx_ptr = (__gm__ uint32_t *)index;
    __gm__ float *out_ptr = (__gm__ float *)output;
    for (uint32_t i = 0; i < n; i++) {
        uint32_t idx = idx_ptr[i];
        // V2: No bounds check on idx — attacker-controlled index
        // reads arbitrary memory within GM address space
        out_ptr[i] = in_ptr[idx]; // OOB if idx >= input_len
    }
}
```
ascend-rs gather (mitigated):
```rust
#[ascend_std::aiv_kernel]
pub unsafe fn gather(
    input: *const f32, // V1 mitigated: typed pointer, not GM_ADDR
    index: *const u32,
    output: *mut f32,
    len: *const u32,
) {
    unsafe {
        let n = *len;
        let mut i = 0u32;
        loop {
            if i >= n { break; } // Loop bounds explicit
            let idx = *index.wrapping_add(i as usize);
            // V2: wrapping_add is explicit about pointer arithmetic semantics
            // V3: no integer overflow — each offset cast individually
            *output.wrapping_add(i as usize) = *input.wrapping_add(idx as usize);
            i = i + 1;
        }
        // V4/V5: No FreeTensor — buffer IDs auto-managed
        // V6: No DMA/compute split — scalar ops on GM directly
    }
}
```
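For contrast, the same gather written against safe slices (a host-side sketch, not the ascend-rs device API, which must work through raw pointers) turns the V2 hazard into a recoverable error instead of a silent out-of-bounds read:

```rust
// Host-side sketch: bounds-checked gather over safe slices.
// On an OOB index, returns Err with the offending position instead of
// reading arbitrary memory.
fn gather_checked(input: &[f32], index: &[u32], output: &mut [f32]) -> Result<(), usize> {
    for (i, &idx) in index.iter().enumerate() {
        // `get` returns None for an out-of-range index: V2 closed
        let v = *input.get(idx as usize).ok_or(i)?;
        output[i] = v;
    }
    Ok(())
}

fn main() {
    let input = [10.0_f32, 20.0, 30.0];
    let index = [2_u32, 0, 7]; // 7 is out of bounds for a 3-element input
    let mut output = [0.0_f32; 3];
    let r = gather_checked(&input, &index, &mut output);
    assert_eq!(r, Err(2)); // rejected at element 2, not silently read
    assert_eq!(&output[..2], &[30.0, 10.0]);
}
```

This is the direction the mitigation table in C.6 points at: where element-indexed access survives, push it behind an interface that carries the length.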
C.3 High-Risk Category: Convolution Kernels (34 kernels)
Convolution kernels have deeply nested loops with complex multi-dimensional index arithmetic (oc * in_ch * k_h * k_w + ic * k_h * k_w + kh * k_w + kw). A single wrong dimension in the index expression silently reads from wrong memory.
C++ AscendC conv2d index calculation (vulnerable):
```cpp
// V2+V3: 6-level nested index arithmetic — easy to get a dimension wrong
for (int oc = 0; oc < out_ch; oc++) {
    for (int oh = 0; oh < out_h; oh++) {
        for (int ow = 0; ow < out_w; ow++) {
            float sum = 0.0f;
            for (int ic = 0; ic < in_ch; ic++) {
                for (int kh = 0; kh < k_h; kh++) {
                    for (int kw = 0; kw < k_w; kw++) {
                        int ih = oh * stride + kh * dilation;
                        int iw = ow * stride + kw * dilation;
                        // V3: 32-bit multiply chain can overflow
                        int in_idx = ic * in_h * in_w + ih * in_w + iw;
                        int w_idx = oc * in_ch * k_h * k_w
                                  + ic * k_h * k_w + kh * k_w + kw;
                        // V2: No bounds check — if ih >= in_h or iw >= in_w,
                        // reads out-of-bounds from GM
                        sum += (float)inLocal.GetValue(in_idx)
                             * (float)wLocal.GetValue(w_idx);
                    }
                }
            }
            outLocal.SetValue(oc * out_h * out_w + oh * out_w + ow, sum);
        }
    }
}
```
ascend-rs conv2d (mitigated):
```rust
#[ascend_std::aiv_kernel]
pub unsafe fn conv_standard_2d(
    input: *const f32, weight: *const f32, output: *mut f32,
    params: *const u32, // [in_ch, out_ch, in_h, in_w, k_h, k_w, stride, dilation]
) {
    unsafe {
        // All params read from typed pointer — no GM_ADDR cast
        let in_ch = *params;
        let out_ch = *params.wrapping_add(1);
        // ... (read remaining params)
        let out_h = (in_h - (k_h - 1) * dilation - 1) / stride + 1;
        let out_w = (in_w - (k_w - 1) * dilation - 1) / stride + 1;
        let mut oc = 0u32;
        loop {
            if oc >= out_ch { break; }
            // ... nested loops with explicit bounds ...
            let ih = oh * stride + kh * dilation;
            let iw = ow * stride + kw * dilation;
            // V3 mitigated: wrapping semantics explicit via `as usize`
            // Debug builds panic on overflow, release wraps intentionally
            let in_idx = (ic * in_h * in_w + ih * in_w + iw) as usize;
            let w_idx = (oc * in_ch * k_h * k_w
                + ic * k_h * k_w + kh * k_w + kw) as usize;
            sum = sum + *input.wrapping_add(in_idx) * *weight.wrapping_add(w_idx);
            // V4/V5: No FreeTensor needed
            // V6: No DMA — scalar GM access
        }
    }
}
```
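The 32-bit wraparound called out as V3 can be reproduced on the host. A minimal sketch, with illustrative dimensions (not taken from the benchmark) chosen so the weight-index product is exactly 2^32:

```rust
// Host-side demonstration of V3: the 32-bit index chain silently wraps.
// Dimensions are illustrative only.
fn main() {
    let (oc, in_ch, k_h, k_w): (u32, u32, u32, u32) = (4096, 4096, 16, 16);

    // checked_mul surfaces the overflow instead of silently wrapping
    let checked = oc.checked_mul(in_ch)
        .and_then(|v| v.checked_mul(k_h))
        .and_then(|v| v.checked_mul(k_w));
    assert_eq!(checked, None); // 4096 * 4096 * 16 * 16 = 2^32, too big for u32

    // wrapping_mul is what the C++ uint32_t multiply chain does implicitly
    let wrapped = oc.wrapping_mul(in_ch).wrapping_mul(k_h).wrapping_mul(k_w);
    assert_eq!(wrapped, 0); // wraps to 0: the kernel would silently read element 0

    // widening to u64 before multiplying avoids the wrap entirely
    let wide = (oc as u64) * (in_ch as u64) * (k_h as u64) * (k_w as u64);
    assert_eq!(wide, 1u64 << 32);
}
```

In a debug build the plain `*` operator would panic at the overflow, which is the "debug builds panic on overflow, release wraps" behavior the mitigated kernel comments rely on.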
C.4 High-Risk Category: Fused Operations (100 kernels)
Fused kernels (matmul+activation, conv+norm+activation, etc.) chain multiple pipeline stages. In C++, each stage requires its own AllocTensor/FreeTensor/pipe_barrier — missing any one produces silent data corruption.
C++ fused matmul+sigmoid (vulnerable):
```cpp
// Fused matmul + sigmoid: C = sigmoid(A * B)
// V4: 4 tensors allocated/freed — each is a use-after-free opportunity
// V5: Copy-paste between fused variants can duplicate FreeTensor
// V6: 3 pipeline transitions (DMA→cube, cube→vector, vector→DMA)
//     — each requires pipe_barrier, forgetting any one = stale data
AscendC::LocalTensor<half> aLocal = inQueueA.AllocTensor<half>();
AscendC::DataCopy(aLocal, aGm, m * k);
inQueueA.EnQue(aLocal);
// V6: Need barrier here for DMA → cube
aLocal = inQueueA.DeQue<half>();
// ... matmul ...
inQueueA.FreeTensor(aLocal);
// V4: aLocal handle still valid — accidental read compiles and runs
AscendC::LocalTensor<float> cLocal = outQueue.AllocTensor<float>();
// V6: Need barrier here for cube → vector
AscendC::Muls(cLocal, cLocal, -1.0f, total);     // sigmoid step 1
AscendC::Exp(cLocal, cLocal, total);             // sigmoid step 2
// V6: Need inter-op barriers for in-place chained ops on 310P
AscendC::Adds(cLocal, cLocal, 1.0f, total);      // sigmoid step 3
AscendC::Reciprocal(cLocal, cLocal, total);      // sigmoid step 4
outQueue.FreeTensor(cLocal);
```
ascend-rs fused matmul+sigmoid (mitigated):
```rust
#[ascend_std::aiv_kernel]
pub unsafe fn fused_matmul_sigmoid(
    a: *const u16, b: *const u16, c: *mut f32, dims: *const u32,
) {
    unsafe {
        let m = *dims;
        let k = *dims.wrapping_add(1);
        let n = *dims.wrapping_add(2);
        // V6 mitigated: matmul_f16 handles DMA+cube internally
        ascend_std::kernel_ops::matmul_f16(c, a, b, m, k, n);
        ascend_std::ascend_pipe_barrier(); // Explicit, visible
        let total = m * n;
        let buf_c = ascend_std::ascend_buf_alloc(total);
        ascend_std::ascend_buf_load_f32(buf_c, c as *const f32, total);
        ascend_std::ascend_pipe_barrier(); // Explicit, visible
        // V6 mitigated: sigmoid_f32 includes ALL internal barriers
        // (muls → barrier → exp → barrier → adds → barrier → reciprocal)
        ascend_std::kernel_ops::sigmoid_f32(buf_c, buf_c, total);
        ascend_std::ascend_pipe_barrier(); // Explicit, visible
        ascend_std::ascend_buf_store_f32(c, buf_c, total);
        // V4/V5: No FreeTensor — buf_c auto-managed
    }
}
```
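The four-step decomposition that the sigmoid stage performs (negate, exponentiate, add one, take the reciprocal) can be checked on the host. This sketch mirrors those steps over a plain slice and compares the result against 1/(1+e^-x):

```rust
// Host-side check that the 4-step in-place chain equals sigmoid(x) = 1/(1+e^-x).
fn sigmoid_chain(buf: &mut [f32]) {
    for v in buf.iter_mut() { *v = *v * -1.0; } // step 1: Muls by -1    -> -x
    for v in buf.iter_mut() { *v = v.exp(); }   // step 2: Exp           -> e^-x
    for v in buf.iter_mut() { *v = *v + 1.0; }  // step 3: Adds 1        -> 1 + e^-x
    for v in buf.iter_mut() { *v = 1.0 / *v; }  // step 4: Reciprocal    -> 1/(1 + e^-x)
}

fn main() {
    let mut buf = [-2.0_f32, 0.0, 3.0];
    let expected: Vec<f32> = buf.iter().map(|x| 1.0 / (1.0 + (-x).exp())).collect();
    sigmoid_chain(&mut buf);
    for (a, b) in buf.iter().zip(expected.iter()) {
        assert!((a - b).abs() < 1e-6);
    }
    assert!((buf[1] - 0.5).abs() < 1e-6); // sigmoid(0) = 0.5
}
```

Each arrow in that chain is a point where the C++ version needs an inter-op barrier on 310P; folding the whole chain into one `sigmoid_f32` call is what removes those failure points.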
C.5 Vulnerability Tally: 300 Kernels × 6 Patterns
| Category | Kernels | V1 Type | V2 OOB | V3 Overflow | V4 UAF | V5 DblFree | V6 Sync | Total Exposures |
|---|---|---|---|---|---|---|---|---|
| Activation | 16 | 16 | 0 | 16 | 16 | 16 | 16 | 80 |
| Architecture | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 300 |
| Attention | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 90 |
| Broadcast | 10 | 10 | 0 | 10 | 10 | 10 | 10 | 50 |
| Convolution | 34 | 34 | 34 | 34 | 34 | 34 | 34 | 204 |
| Fuse | 100 | 100 | 0 | 100 | 100 | 100 | 100 | 500 |
| Index | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 72 |
| Loss | 7 | 7 | 0 | 7 | 7 | 7 | 7 | 35 |
| Math | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 36 |
| Matmul | 17 | 17 | 0 | 17 | 17 | 17 | 17 | 85 |
| Normalization | 8 | 8 | 0 | 8 | 8 | 8 | 8 | 40 |
| Optimizer | 5 | 5 | 0 | 5 | 5 | 5 | 5 | 25 |
| Pooling | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 36 |
| Reduce | 5 | 5 | 0 | 5 | 5 | 5 | 5 | 25 |
| Resize | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 60 |
| Total | 300 | 300 | 133 | 300 | 300 | 300 | 300 | 1,633 |
C.6 How ascend-rs Eliminates Each Pattern
| Pattern | C++ Root Cause | ascend-rs Mitigation | Residual Risk |
|---|---|---|---|
| V1: Type erasure | GM_ADDR = uint8_t* for all tensors | Typed *const f32 / *const u16 in fn signatures | None (compile-time) |
| V2: Unchecked OOB | GetValue(i) / SetValue(i,v) with no bounds check | Vector intrinsics with explicit count n; scalar loops use wrapping_add | unsafe pointer arithmetic still unchecked at runtime |
| V3: Integer overflow | blockIdx * perBlockLen silent wraparound | wrapping_mul makes overflow explicit; debug builds panic | Developer must choose wrapping_* vs checked_* |
| V4: Use-after-free | FreeTensor() invalidates handle, C++ allows continued use | No FreeTensor API; buffer IDs are typed newtypes (UbBuf, L1Buf, etc.), not owning handles | None (API-level) |
| V5: Double-free | FreeTensor() called twice corrupts free list | No FreeTensor API; buffer lifecycle auto-managed | None (API-level) |
| V6: Missing sync | Manual pipe_barrier() between every pipeline transition | kernel_ops composites include all internal barriers; DMA barriers explicit and few | Developer must place DMA↔compute barriers (2 per kernel, not per-op) |
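The V4/V5 rows rest on the type design, which a minimal host-side sketch can illustrate (the type and function names here are illustrative, not the actual ascend-rs definitions): distinct newtype wrappers make it a compile error to pass a unified-buffer ID where an L1 ID is expected, and with no free() in the API there is nothing to double-free or use after free.

```rust
// Illustrative sketch of non-owning, typed buffer IDs.
// Each memory space gets its own newtype, so IDs cannot be confused.
#[derive(Debug, Clone, Copy, PartialEq)]
struct UbBuf(u32); // unified-buffer ID
#[derive(Debug, Clone, Copy, PartialEq)]
struct L1Buf(u32); // L1-buffer ID

// A stand-in for a DMA setup call: the signature pins down which
// space each argument lives in.
fn copy_ub_to_l1(src: UbBuf, dst: L1Buf) -> (u32, u32) {
    (src.0, dst.0)
}

fn main() {
    let ub = UbBuf(3);
    let l1 = L1Buf(9);
    assert_eq!(copy_ub_to_l1(ub, l1), (3, 9));
    // copy_ub_to_l1(l1, ub); // error[E0308]: mismatched types — caught at compile time
}
```

Contrast with `LocalTensor`, where a freed handle is indistinguishable from a live one at the type level, so both V4 and V5 only surface at runtime, if at all.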
Net result: of the 1,633 total vulnerability exposures across 300 kernels, ascend-rs eliminates 1,500 at the API/type level (V1, V3, V4, V5 fully; V6 reduced from per-op barrier discipline to a few per-kernel barriers). The remaining 133 OOB exposures (V2) are mitigated by replacing element-indexed access with whole-vector operations, though unsafe pointer arithmetic in scalar fallback kernels remains the programmer's responsibility.
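As a quick cross-check of the tally: five patterns apply to all 300 kernels and V2 applies to the 133 element-indexed kernels, which reproduces both headline numbers.

```rust
// Arithmetic cross-check of the C.5 totals.
fn main() {
    let kernels = 300u32;
    let universal_patterns = 5; // V1, V3, V4, V5, V6 affect every kernel
    let v2_kernels = 133;       // element-indexed categories only

    let total_exposures = kernels * universal_patterns + v2_kernels;
    assert_eq!(total_exposures, 1_633);

    // Eliminated at the API/type level: everything except the 133 V2 exposures
    assert_eq!(total_exposures - v2_kernels, 1_500);
}
```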