
6. Memory Safety Case Studies: AscendC C++ vs ascend-rs

With 16 kernels deployed on NPU hardware, 413 passing NPU correctness tests on Ascend 910B3, and 505 total kernels compiling through the MLIR backend, ascend-rs’s value proposition extends beyond performance parity — the key advantage is memory safety. Below we present 6 paired case studies where each AscendC C++ kernel contains a real, exploitable memory safety vulnerability that the equivalent Rust ascend-rs kernel structurally prevents.

These aren’t contrived toy examples. Each vulnerability class is a real pattern that occurs in production AscendC C++ kernel development:

| Case | Vulnerability | C++ Root Cause | Rust Prevention |
|------|---------------|----------------|-----------------|
| 1 | Type Confusion | GM_ADDR erases all type info at entry | Function signature encodes element type |
| 2 | Buffer Overflow | GetValue(i)/SetValue(i,v) unchecked | Buffer-ID API with explicit count |
| 3 | Use-After-Free | FreeTensor() then stale LocalTensor access | No manual free in API |
| 4 | Missing Sync | Forgetting pipe_barrier() between DMA and compute | kernel_ops composites include barriers |
| 5 | Double Free | FreeTensor() called twice | No free operation exists |
| 6 | Integer Overflow | Silent u32 wrap in offset calculation | wrapping_mul makes overflow explicit |

6.1 Type Confusion via GM_ADDR Type Erasure

AscendC kernel entry points receive all tensor pointers as GM_ADDR (= uint8_t*). The kernel must manually cast to the correct element type. If the host passes f16 data but the kernel casts to float*, each element reads 4 bytes instead of 2 — producing garbage values with no warning. This occurs whenever a kernel is reused for a different dtype without updating the cast, or when a host wrapper passes the wrong tensor format.

C++ — Vulnerable:

#include "kernel_operator.h"

class KernelSoftmaxConfused {
public:
    __aicore__ inline void Init(GM_ADDR input, GM_ADDR output, GM_ADDR len_buf) {
        uint32_t n = *((__gm__ uint32_t *)len_buf);

        // BUG: Host passed half-precision (f16) data, but we cast to float.
        // Each "float" element reads 4 bytes instead of 2, so we get:
        //   - Half the expected number of meaningful values
        //   - Each value is garbage (two f16 bit patterns reinterpreted as one float)
        // The compiler cannot catch this because GM_ADDR is just uint8_t*.
        inputGm.SetGlobalBuffer((__gm__ float *)input, n);
        outputGm.SetGlobalBuffer((__gm__ float *)output, n);
        // ...
    }

    __aicore__ inline void Compute(int32_t len) {
        AscendC::LocalTensor<float> xLocal = inQueue.DeQue<float>();
        AscendC::LocalTensor<float> yLocal = outQueue.AllocTensor<float>();
        // All computation operates on garbage values due to the type confusion.
        // Silently wrong output — no crash, no error.
        AscendC::Exp(yLocal, xLocal, len);
        outQueue.EnQue<float>(yLocal);
        inQueue.FreeTensor(xLocal);
    }
    // ...
};

// The entry point uses GM_ADDR (= uint8_t*) for all tensor arguments.
// The caller can pass any data type — no type checking at this boundary.
extern "C" __global__ __aicore__ void softmax_confused(
        GM_ADDR input, GM_ADDR output, GM_ADDR len_buf) {
    KernelSoftmaxConfused op;
    op.Init(input, output, len_buf);
    op.Process();
}

Rust — Safe:

#![allow(unused)]
#![feature(no_core)]
#![no_std]
#![no_core]

fn main() {
/// The signature `input: *const f32` means the host MUST pass an f32 tensor.
/// If the host has f16 data (*const u16), calling this function is a type error:
///     softmax(f16_ptr, ...)  // ERROR: expected *const f32, found *const u16
#[ascend_std::aiv_kernel]
pub unsafe fn softmax(input: *const f32, output: *mut f32, len: *const u32) {
    unsafe {
        let n = *len;
        let buf_in = ascend_std::ascend_buf_alloc(n);
        let buf_out = ascend_std::ascend_buf_alloc(n);
        let buf_work = ascend_std::ascend_buf_alloc(n);

        // Load f32 data — the _f32 suffix matches the pointer type.
        // There is no way to accidentally load f16 data through an f32 API.
        ascend_std::ascend_buf_load_f32(buf_in, input, n);
        ascend_std::ascend_pipe_barrier();

        // softmax_f32 expects f32 buffers — type consistency maintained
        // throughout the entire pipeline without manual casts.
        ascend_std::kernel_ops::softmax_f32(buf_out, buf_in, buf_work, n);

        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_buf_store_f32(output, buf_out, n);
    }
}
}

Key insight: In C++, GM_ADDR is a type-erased uint8_t* that accepts any data format. In Rust, the function signature *const f32 is part of the type system — the compiler rejects mismatched types at compile time.
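The mechanism is easy to reproduce on the host. Below is a plain-Rust sketch (not the ascend-rs API): the same eight bytes hold either four f16 values or two f32 values, and reading them through the wrong type halves the element count and garbles every value.

```rust
/// Reinterpret four f16 bit patterns as two f32 values, which is exactly
/// what the wrong `(__gm__ float *)` cast does to half-precision data.
fn reinterpret_f16_as_f32(halves: [u16; 4]) -> [f32; 2] {
    // Little-endian packing: each f32 swallows two adjacent f16 patterns.
    let lo = ((halves[1] as u32) << 16) | (halves[0] as u32);
    let hi = ((halves[3] as u32) << 16) | (halves[2] as u32);
    [f32::from_bits(lo), f32::from_bits(hi)]
}

fn main() {
    // f16 bit patterns for 1.0, 2.0, 3.0, 4.0
    let halves = [0x3C00u16, 0x4000, 0x4200, 0x4400];
    let floats = reinterpret_f16_as_f32(halves);
    // Two "elements" instead of four, and neither is near the originals:
    // roughly [2.0037, 513.03]. Silently wrong, no crash.
    println!("{:?}", floats);
}
```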

6.2 Buffer Overflow via Unchecked Tensor Index

AscendC’s GetValue(i) and SetValue(i, v) perform no bounds checking. If the loop bound is wrong — an off-by-one error, using the wrong length variable, or confusing input/output sizes — the kernel reads or writes out of bounds on local SRAM. This is especially dangerous because local SRAM is shared across all tensor allocations within a tile — an OOB write silently overwrites a neighboring tensor’s data.

C++ — Vulnerable:

#include "kernel_operator.h"

class KernelScalarSoftmax {
    // ...
    __aicore__ inline void Compute(int32_t len, int32_t alignedLen) {
        AscendC::LocalTensor<float> xLocal = inQueue.DeQue<float>();
        AscendC::LocalTensor<float> yLocal = outQueue.AllocTensor<float>();

        // Step 1: Find max (scalar loop)
        float maxVal = xLocal.GetValue(0);
        for (int32_t i = 1; i < len; i++) {
            float v = xLocal.GetValue(i);
            if (v > maxVal) maxVal = v;
        }

        // Step 2: Shift by the max and accumulate the sum (the scalar
        // exp is elided for brevity; the off-by-one bug below does not
        // depend on it)
        float sum = 0.0f;
        for (int32_t i = 0; i < len; i++) {
            float v = xLocal.GetValue(i) - maxVal;
            yLocal.SetValue(i, v);
            sum += v;
        }

        // Step 3: Normalize
        float invSum = 1.0f / sum;

        // BUG: Off-by-one — loop condition uses <= instead of <.
        // When i == len, SetValue writes one element past the allocated buffer.
        // This overwrites whatever is adjacent in SRAM (another tensor's data,
        // queue metadata, etc.) with no error or warning.
        for (int32_t i = 0; i <= len; i++) {  // should be i < len
            yLocal.SetValue(i, yLocal.GetValue(i) * invSum);  // OOB at i==len
        }

        outQueue.EnQue<float>(yLocal);
        inQueue.FreeTensor(xLocal);
    }
    // ...
};

Rust — Safe:

#![allow(unused)]
#![feature(no_core)]
#![no_std]
#![no_core]

fn main() {
/// The count `n` passed to each vector op is the same value used to allocate
/// the buffer. There is no separate loop variable that could drift out of
/// sync. No element-wise indexing means no off-by-one.
#[ascend_std::aiv_kernel]
pub unsafe fn softmax(input: *const f32, output: *mut f32, len: *const u32) {
    unsafe {
        let n = *len;
        let buf_in = ascend_std::ascend_buf_alloc(n);
        let buf_out = ascend_std::ascend_buf_alloc(n);
        let buf_work = ascend_std::ascend_buf_alloc(n);

        ascend_std::ascend_buf_load_f32(buf_in, input, n);
        ascend_std::ascend_pipe_barrier();

        // softmax_f32 operates on the entire buffer of `n` elements.
        // There is no loop index, no GetValue(i), no SetValue(i, v).
        // The count `n` is the same value used in ascend_buf_alloc —
        // the allocation and the operation are inherently consistent.
        ascend_std::kernel_ops::softmax_f32(buf_out, buf_in, buf_work, n);

        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_buf_store_f32(output, buf_out, n);
    }
}
}

Key insight: The C++ API exposes GetValue(i)/SetValue(i, v) with no bounds check — a classic source of off-by-one errors. The Rust buffer-ID API operates on whole buffers with an explicit count parameter, eliminating element-wise indexing entirely.
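The same off-by-one, translated into ordinary safe Rust, fails loudly instead of silently: slice indexing is bounds-checked, so the write at i == len panics at the first out-of-bounds access. (A plain-Rust sketch, not the ascend-rs buffer API.)

```rust
use std::panic;

/// Run the C++ kernel's buggy normalize loop (`i <= len`) against a
/// bounds-checked Rust Vec; returns true if the OOB access panicked.
fn normalize_off_by_one(len: usize) -> bool {
    let mut y = vec![1.0f32; len];
    let inv_sum = 0.5f32;
    panic::catch_unwind(move || {
        for i in 0..=len {      // should be 0..len
            y[i] *= inv_sum;    // bounds-checked: panics at i == len
        }
    })
    .is_err()
}

fn main() {
    // The bug surfaces immediately as a panic, never as silent
    // corruption of a neighboring allocation.
    assert!(normalize_off_by_one(4));
}
```

Idiomatic Rust would drop the index entirely (`for v in y.iter_mut() { *v *= inv_sum; }`), which is the same property the buffer-ID API provides: the iteration bound comes from the container itself.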

6.3 Use-After-Free of LocalTensor

AscendC requires manual FreeTensor() calls to return SRAM buffers to the queue’s free pool. After FreeTensor(), the LocalTensor handle remains valid at the C++ type level — it still holds the original buffer address. Any subsequent GetValue() or SetValue() compiles and runs, reading/writing memory that may already be reallocated for a different tensor.

C++ — Vulnerable:

#include "kernel_operator.h"

class KernelVecAddUAF {
    // ...
    __aicore__ inline void Compute(int32_t len) {
        AscendC::LocalTensor<half> xLocal = inQueueX.DeQue<half>();
        AscendC::LocalTensor<half> yLocal = inQueueY.DeQue<half>();
        AscendC::LocalTensor<half> zLocal = outQueueZ.AllocTensor<half>();

        AscendC::Add(zLocal, xLocal, yLocal, len);

        // Return buffers to the free pool
        inQueueX.FreeTensor(xLocal);
        inQueueY.FreeTensor(yLocal);

        // BUG: xLocal was freed above, but the C++ handle still compiles.
        // The SRAM region has been returned to inQueueX's free list.
        // In a multi-tile kernel, this buffer may already be reallocated
        // by the next iteration's AllocTensor() call.
        half check = xLocal.GetValue(0);  // use-after-free!

        // The stale value may cause incorrect control flow decisions
        if ((float)check > 100.0f) {
            AscendC::Muls(zLocal, zLocal, (half)0.5f, len);  // based on garbage
        }

        outQueueZ.EnQue<half>(zLocal);
    }
    // ...
};

Rust — Safe:

#![allow(unused)]
#![feature(no_core)]
#![no_std]
#![no_core]

fn main() {
/// buf_x is a typed UbBuf ID — it never becomes invalid.
/// Compare with C++ where FreeTensor(xLocal) invalidates the buffer,
/// but xLocal.GetValue(0) still compiles and accesses freed SRAM.
#[ascend_std::aiv_kernel]
pub unsafe fn vec_add(x: *const u16, y: *const u16, z: *mut u16, len: *const u32) {
    unsafe {
        let n = *len;
        let block_idx = ascend_std::get_block_idx() as u32;
        let base = block_idx * n;

        let tile_size = 256u32;
        let buf_x = ascend_std::ascend_buf_alloc(tile_size);
        let buf_y = ascend_std::ascend_buf_alloc(tile_size);
        let buf_z = ascend_std::ascend_buf_alloc(tile_size);

        let mut offset = 0u32;
        loop {
            if offset >= n { break; }
            let mut len = tile_size;
            if offset + len > n { len = n - offset; }
            let gm_off = (base + offset) as usize;

            ascend_std::ascend_buf_load_f16(buf_x, x.wrapping_add(gm_off), len);
            ascend_std::ascend_buf_load_f16(buf_y, y.wrapping_add(gm_off), len);
            ascend_std::ascend_pipe_barrier();

            ascend_std::ascend_add_f16(buf_z, buf_x, buf_y, len);
            ascend_std::ascend_pipe_barrier();

            // No FreeTensor needed. buf_x, buf_y, buf_z are still valid.
            // The same buffer IDs are reused in the next tile iteration.
            ascend_std::ascend_buf_store_f16(z.wrapping_add(gm_off), buf_z, len);
            offset = offset + tile_size;
        }
        // Kernel returns. All buffers implicitly released.
    }
}
}

Key insight: C++ LocalTensor handles remain syntactically valid after FreeTensor() — the compiler cannot distinguish freed from live handles. In Rust, buffer IDs are #[repr(transparent)] newtype wrappers (UbBuf, L1Buf, L0aBuf, L0bBuf, L0cBuf) with no free operation; “using a buffer after it’s freed” is not a meaningful concept. The newtypes also prevent passing a buffer to the wrong memory level — e.g., passing an L0aBuf to a vector operation that expects UbBuf is a compile error.
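ascend-rs sidesteps the problem by having no free at all, but it is worth noting that even an API with an explicit free can be made safe in Rust through move semantics. A minimal sketch with a hypothetical stand-in type (not the real ascend_std::UbBuf):

```rust
/// Stand-in for a buffer handle; `#[repr(transparent)]` mirrors the
/// newtype pattern described above.
#[repr(transparent)]
struct UbBuf(u32);

/// A hypothetical free that takes the handle *by value*: ownership
/// moves in, so the caller's binding is dead afterwards.
fn free(_buf: UbBuf) {}

fn get_value(buf: &UbBuf) -> u32 {
    buf.0
}

fn main() {
    let buf = UbBuf(7);
    assert_eq!(get_value(&buf), 7);
    free(buf);
    // get_value(&buf); // ERROR: borrow of moved value `buf` (use-after-free)
    // free(buf);       // ERROR: use of moved value `buf` (double free)
}
```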

6.4 Missing Synchronization Between Pipeline Stages

Ascend NPUs execute DMA (MTE2/MTE3), vector (V), and scalar (S) pipelines concurrently. A pipe_barrier() is required between a DMA load and a subsequent vector operation to ensure the data has actually arrived in local SRAM before computation begins. Forgetting this barrier is the single most common NPU bug — the kernel compiles and runs without error, but produces silently wrong results.

C++ — Vulnerable:

#include "kernel_operator.h"

class KernelSigmoidNoSync {
    // ...
    __aicore__ inline void CopyIn(int32_t offset, int32_t len) {
        AscendC::LocalTensor<float> xLocal = inQueue.AllocTensor<float>();
        AscendC::DataCopy(xLocal, inputGm[offset], len);
        // BUG: Missing pipe_barrier() between DMA load and EnQue.
        // The EnQue only marks the tensor as "available" in the queue,
        // but does NOT ensure the DMA transfer has completed.
        // If the DMA pipeline (MTE2) is slower than the scalar pipeline (S),
        // the subsequent DeQue + vector operations will read stale SRAM data.
        inQueue.EnQue(xLocal);
    }

    __aicore__ inline void Compute(int32_t len) {
        AscendC::LocalTensor<float> xLocal = inQueue.DeQue<float>();
        AscendC::LocalTensor<float> yLocal = outQueue.AllocTensor<float>();

        // Sigmoid = 1 / (1 + exp(-x))
        // Each of these vector operations may execute before the DMA load
        // completes, reading uninitialized or stale data from SRAM.
        AscendC::Muls(yLocal, xLocal, -1.0f, len);       // -x (stale data?)
        AscendC::Exp(yLocal, yLocal, len);                // exp(-x)
        AscendC::Adds(yLocal, yLocal, 1.0f, len);         // 1 + exp(-x)
        AscendC::Reciprocal(yLocal, yLocal, len);          // 1 / (1 + exp(-x))

        outQueue.EnQue<float>(yLocal);
        inQueue.FreeTensor(xLocal);
    }
    // ...
};

Rust — Safe:

#![allow(unused)]
#![feature(no_core)]
#![no_std]
#![no_core]

fn main() {
/// The pipe_barrier() between DMA load and compute is explicit and visible.
/// The sigmoid_f32 composite includes all internal barriers between its
/// four steps (muls → exp → adds → reciprocal).
#[ascend_std::aiv_kernel]
pub unsafe fn sigmoid(input: *const f32, output: *mut f32, len: *const u32) {
    unsafe {
        let n = *len;
        let buf_in = ascend_std::ascend_buf_alloc(n);
        let buf_out = ascend_std::ascend_buf_alloc(n);

        // DMA load from GM to UB
        ascend_std::ascend_buf_load_f32(buf_in, input, n);

        // Explicit barrier: guarantees DMA load is complete before
        // any vector operations read from buf_in.
        ascend_std::ascend_pipe_barrier();

        // sigmoid_f32 is a composite that internally does:
        //   muls(-1) → pipe_barrier → exp → pipe_barrier →
        //   adds(1) → pipe_barrier → reciprocal
        // All internal barriers are included — no way to forget one.
        ascend_std::kernel_ops::sigmoid_f32(buf_out, buf_in, n);

        // Explicit barrier: guarantees vector compute is complete
        // before DMA store reads from buf_out.
        ascend_std::ascend_pipe_barrier();

        // DMA store from UB to GM
        ascend_std::ascend_buf_store_f32(output, buf_out, n);
    }
}
}

Key insight: The C++ queue model (EnQue/DeQue) provides the illusion of synchronization but does not actually ensure DMA completion. In Rust, every barrier is explicit (ascend_pipe_barrier()), and kernel_ops composites include all internal barriers — the programmer cannot accidentally omit one within a composite operation.
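The composite pattern itself is straightforward: the barriers are hard-wired between the primitive steps inside one function, so no call site can interleave them incorrectly. A mock sketch (the strings only record ordering; the real kernel_ops composites emit hardware barriers):

```rust
// Mock primitives: each pushes its name so the ordering can be checked.
fn muls(log: &mut Vec<&'static str>) { log.push("muls"); }
fn exp_op(log: &mut Vec<&'static str>) { log.push("exp"); }
fn adds(log: &mut Vec<&'static str>) { log.push("adds"); }
fn reciprocal(log: &mut Vec<&'static str>) { log.push("reciprocal"); }
fn pipe_barrier(log: &mut Vec<&'static str>) { log.push("barrier"); }

/// Sigmoid composite: every inter-step barrier is baked in. Callers get
/// all four steps and all three barriers, or nothing.
fn sigmoid_composite(log: &mut Vec<&'static str>) {
    muls(log);       // -x
    pipe_barrier(log);
    exp_op(log);     // exp(-x)
    pipe_barrier(log);
    adds(log);       // 1 + exp(-x)
    pipe_barrier(log);
    reciprocal(log); // 1 / (1 + exp(-x))
}

fn main() {
    let mut log = Vec::new();
    sigmoid_composite(&mut log);
    assert_eq!(
        log,
        ["muls", "barrier", "exp", "barrier", "adds", "barrier", "reciprocal"]
    );
}
```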

6.5 Double-Free of Tensor Buffers

Calling FreeTensor() twice on the same LocalTensor inserts the same buffer address into the queue’s free list twice. The next two AllocTensor() calls will both return the same buffer, causing two “different” tensors to alias the same SRAM region. This manifests as intermittent data corruption that is tile-count-dependent.

C++ — Vulnerable:

#include "kernel_operator.h"

class KernelVecAddDoubleFree {
    // ...
    __aicore__ inline void Compute(int32_t len) {
        AscendC::LocalTensor<half> xLocal = inQueueX.DeQue<half>();
        AscendC::LocalTensor<half> yLocal = inQueueY.DeQue<half>();
        AscendC::LocalTensor<half> zLocal = outQueueZ.AllocTensor<half>();

        AscendC::Add(zLocal, xLocal, yLocal, len);

        inQueueX.FreeTensor(xLocal);
        inQueueY.FreeTensor(yLocal);
        outQueueZ.EnQue<half>(zLocal);

        // BUG: Copy-paste error from a refactoring — FreeTensor called again.
        // xLocal's buffer is now in inQueueX's free list TWICE.
        // On the next two tile iterations, AllocTensor will return the same
        // buffer address for two "different" tensors, causing them to alias.
        // One tile's DMA load will silently overwrite another tile's data.
        inQueueX.FreeTensor(xLocal);  // double-free! Corrupts free list
    }
    // ...
};

Rust — Safe:

#![allow(unused)]
#![feature(no_core)]
#![no_std]
#![no_core]

fn main() {
/// Buffer IDs (buf_x, buf_y, buf_z) are allocated once and reused across
/// all tile iterations. No manual lifecycle management means no double-free.
#[ascend_std::aiv_kernel]
pub unsafe fn vec_add(x: *const u16, y: *const u16, z: *mut u16, len: *const u32) {
    unsafe {
        let n = *len;
        let block_idx = ascend_std::get_block_idx() as u32;
        let base = block_idx * n;
        let tile_size = 256u32;

        // Allocate buffers once. These IDs are valid for the entire kernel.
        let buf_x = ascend_std::ascend_buf_alloc(tile_size);
        let buf_y = ascend_std::ascend_buf_alloc(tile_size);
        let buf_z = ascend_std::ascend_buf_alloc(tile_size);

        let mut offset = 0u32;
        loop {
            if offset >= n { break; }
            let mut len = tile_size;
            if offset + len > n { len = n - offset; }
            let gm_off = (base + offset) as usize;

            ascend_std::ascend_buf_load_f16(buf_x, x.wrapping_add(gm_off), len);
            ascend_std::ascend_buf_load_f16(buf_y, y.wrapping_add(gm_off), len);
            ascend_std::ascend_pipe_barrier();

            ascend_std::ascend_add_f16(buf_z, buf_x, buf_y, len);
            ascend_std::ascend_pipe_barrier();

            ascend_std::ascend_buf_store_f16(z.wrapping_add(gm_off), buf_z, len);

            // No FreeTensor here. Even if this line were duplicated by
            // copy-paste, there is simply no free function to call.
            offset = offset + tile_size;
        }
        // Kernel returns — all buffers implicitly released.
    }
}
}

Key insight: In C++, FreeTensor() is a manual operation that can be accidentally duplicated. In Rust, there is no free operation — buffer IDs are typed newtype wrappers (UbBuf, L1Buf, etc.) that encode the memory level at compile time. “Double-freeing” a buffer ID is meaningless.
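"Implicitly released at kernel return" is the RAII discipline Rust enforces with Drop: the release runs exactly once, at scope exit, inserted by the compiler rather than written by hand. A plain-Rust sketch (the atomic counter exists only to observe the drop):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

static RELEASES: AtomicU32 = AtomicU32::new(0);

struct Buf;

impl Drop for Buf {
    fn drop(&mut self) {
        // Stands in for returning SRAM to the allocator.
        RELEASES.fetch_add(1, Ordering::SeqCst);
    }
}

/// Allocate a buffer, use it across a scope, and report how many times
/// it was released: always exactly one, never zero, never two.
fn releases_in_one_scope() -> u32 {
    let before = RELEASES.load(Ordering::SeqCst);
    {
        let _buf = Buf; // "allocated"
        // ... a tile loop would reuse _buf here ...
    } // compiler inserts the single drop here
    RELEASES.load(Ordering::SeqCst) - before
}

fn main() {
    assert_eq!(releases_in_one_scope(), 1);
}
```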

6.6 Silent Integer Overflow in Multi-Block Offset

Multi-block kernels distribute work across NPU cores by computing offset = blockIdx * perBlockLen. With uint32_t arithmetic, this multiplication silently wraps on overflow — e.g., 8192 * 524288 = 0x100000000 wraps to 0. The kernel reads/writes from the wrong memory region, potentially aliasing another block’s data. In C++, unsigned overflow is defined behavior (modular arithmetic), so no warning is generated.

C++ — Vulnerable:

#include "kernel_operator.h"

class KernelVecAddOverflow {
    // ...
    __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z, GM_ADDR len_buf) {
        uint32_t perBlockLen = *((__gm__ uint32_t *)len_buf);

        // BUG: Silent uint32_t overflow when blockIdx * perBlockLen >= 2^32.
        //
        // Example: with perBlockLen = 524288 (512K elements per block),
        // any blockIdx >= 8192 overflows. At blockIdx = 8192:
        //   offset = 8192 * 524288 = 4294967296 = 0x100000000
        // But uint32_t wraps: offset = 0, so that block aliases block 0's
        // data region.
        //
        // C++ provides no warning: unsigned overflow is well-defined as
        // modular arithmetic. The kernel silently reads the wrong data.
        uint32_t offset = AscendC::GetBlockIdx() * perBlockLen;

        xGm.SetGlobalBuffer((__gm__ half *)x + offset, perBlockLen);
        yGm.SetGlobalBuffer((__gm__ half *)y + offset, perBlockLen);
        zGm.SetGlobalBuffer((__gm__ half *)z + offset, perBlockLen);
        // ...
    }
    // ...
};

Rust — Safe:

#![allow(unused)]
#![feature(no_core)]
#![no_std]
#![no_core]

fn main() {
/// wrapping_mul documents that this multiplication may overflow for large
/// tensor sizes. A reviewer seeing wrapping_mul knows to check whether
/// the overflow is actually safe. In debug builds, plain `*` panics.
#[ascend_std::aiv_kernel]
pub unsafe fn vec_add(x: *const u16, y: *const u16, z: *mut u16, len: *const u32) {
    unsafe {
        let n = *len;
        let block_idx = ascend_std::get_block_idx() as u32;

        // wrapping_mul makes overflow semantics explicit.
        // A developer reading this line knows that:
        //   1. This multiplication CAN overflow for large inputs
        //   2. The overflow behavior is intentionally wrapping
        //   3. This is a potential correctness concern worth reviewing
        //
        // In debug builds (CPU-side testing), plain `*` would panic:
        //   let offset = block_idx * n;  // panics in debug if overflows!
        let offset = block_idx.wrapping_mul(n);

        let tile_size = 256u32;
        let buf_x = ascend_std::ascend_buf_alloc(tile_size);
        let buf_y = ascend_std::ascend_buf_alloc(tile_size);
        let buf_z = ascend_std::ascend_buf_alloc(tile_size);

        let mut tile_off = 0u32;
        loop {
            if tile_off >= n { break; }
            let mut len = tile_size;
            if tile_off + len > n { len = n - tile_off; }
            let gm_off = (offset.wrapping_add(tile_off)) as usize;

            ascend_std::ascend_buf_load_f16(buf_x, x.wrapping_add(gm_off), len);
            ascend_std::ascend_buf_load_f16(buf_y, y.wrapping_add(gm_off), len);
            ascend_std::ascend_pipe_barrier();

            ascend_std::ascend_add_f16(buf_z, buf_x, buf_y, len);
            ascend_std::ascend_pipe_barrier();

            ascend_std::ascend_buf_store_f16(z.wrapping_add(gm_off), buf_z, len);
            tile_off = tile_off + tile_size;
        }
    }
}
}

Key insight: In C++, blockIdx * perBlockLen silently wraps with no indication the developer considered overflow. In Rust, wrapping_mul explicitly documents the intent, and in debug builds regular * panics on overflow — catching the bug during development before it reaches hardware.
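The arithmetic in this case study can be checked directly in plain Rust; `checked_mul` and widening to u64 are the two standard alternatives when wrapping is not actually intended:

```rust
/// The offset computation with explicit wrap semantics, as in the kernel.
fn block_offset_wrapping(block_idx: u32, per_block_len: u32) -> u32 {
    block_idx.wrapping_mul(per_block_len)
}

/// Widening alternative: compute in u64 and reject anything past u32.
fn block_offset_checked(block_idx: u32, per_block_len: u32) -> Option<u32> {
    let wide = (block_idx as u64) * (per_block_len as u64);
    u32::try_from(wide).ok()
}

fn main() {
    // 8192 * 524288 = 2^13 * 2^19 = 2^32, one past u32::MAX.
    assert_eq!(block_offset_wrapping(8192, 524288), 0);   // wraps to 0
    assert_eq!(block_offset_checked(8192, 524288), None); // overflow surfaced
    assert_eq!(8192u32.checked_mul(524288), None);        // stdlib form
    // One block earlier still fits:
    assert_eq!(block_offset_checked(8191, 524288), Some(4294443008));
}
```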