## 4. A More Realistic Example: Softmax
Vector multiplication demonstrates the basics, but real neural network workloads require math functions like exp(), log(), and sqrt(). The softmax function — used in attention layers, classification heads, and probability normalization — is a perfect example:
$$\text{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}}$$
### 4.1 Math Intrinsics in `ascend_std`
ascend-rs exposes hardware math operations as Rust methods on primitive types. Under the hood, `f32::exp()` maps to the `expf32` compiler intrinsic, which the MLIR codegen backend lowers to `llvm.intr.exp` — ultimately executing as a native NPU math instruction.
```rust
// In ascend_std: these methods are available on f32/f64 in kernel code
let y = x.exp();  // expf32 → llvm.intr.exp
let y = x.ln();   // logf32 → llvm.intr.log
let y = x.sqrt(); // sqrtf32 → llvm.intr.sqrt
```
### 4.2 The Softmax Kernel
Here is a complete softmax kernel written in Rust for the Ascend NPU:
```rust
#![feature(no_core)]
#![no_std]
#![no_core]

#[ascend_std::aiv_kernel]
pub unsafe fn softmax(input: *const f32, output: *mut f32, len: *const u32) {
    unsafe {
        let n = *len as usize;

        // Step 1: Find max value for numerical stability
        let mut max_val = *input;
        let mut i = 1usize;
        loop {
            if i >= n { break; }
            let val = *input.wrapping_add(i);
            if val > max_val { max_val = val; }
            i = i + 1;
        }

        // Step 2: Compute exp(x_i - max) and accumulate sum
        let mut sum: f32 = 0.0;
        i = 0;
        loop {
            if i >= n { break; }
            let exp_val = (*input.wrapping_add(i) - max_val).exp();
            *output.wrapping_add(i) = exp_val;
            sum = sum + exp_val;
            i = i + 1;
        }

        // Step 3: Normalize
        i = 0;
        loop {
            if i >= n { break; }
            *output.wrapping_add(i) = *output.wrapping_add(i) / sum;
            i = i + 1;
        }
    }
}
```
The key line is `(*input.wrapping_add(i) - max_val).exp()` — this calls `f32::exp()`, which compiles through the MLIR backend into a native NPU exponential instruction. The subtraction of `max_val` before exponentiation is the standard numerical-stability trick that prevents overflow.
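To see why the max-subtraction matters, here is a small standalone CPU sketch (plain Rust, no `ascend_std`) contrasting the naive and stabilized computations:

```rust
// Stabilized softmax: shifting by max(x) keeps every exponent <= 0,
// where f32 exp() cannot overflow.
fn stable_softmax(x: &[f32]) -> Vec<f32> {
    let m = x.iter().cloned().fold(f32::MIN, f32::max);
    let exps: Vec<f32> = x.iter().map(|v| (v - m).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    let x = [1000.0f32, 1001.0, 1002.0];
    // Naive path overflows: e^1000 is inf in f32, so inf/inf would be NaN.
    assert!(x[0].exp().is_infinite());
    // Stabilized path stays finite and sums to 1.
    let p = stable_softmax(&x);
    assert!(p.iter().all(|v| v.is_finite()));
    assert!((p.iter().sum::<f32>() - 1.0).abs() < 1e-6);
}
```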
This demonstrates that ascend-rs kernel code isn’t limited to simple arithmetic — it can express the same algorithms you’d write in C++ AscendC, with Rust’s safety guarantees.
### 4.3 Performance: Rust vs C++ on Real Hardware
How does a Rust kernel perform compared to hand-written C++ on actual NPU hardware? We benchmarked the softmax kernel on an Ascend 310P NPU with four implementations:
- **C++ naive (scalar)** — a hand-written C++ kernel using scalar loops with `GetValue`/`SetValue` accessors
- **C++ optimized (vector)** — an expert-written C++ kernel using AscendC vector intrinsics (`ReduceMax`, `Exp`, `Muls`)
- **Rust scalar** — the Rust kernel above, compiled through the MLIR-to-C++ codegen pipeline
- **Rust vector** — a Rust kernel using ascend-rs vector intrinsics (`ascend_reduce_max_f32`, `ascend_exp_f32`, `ascend_muls_f32`), compiled through the same pipeline
Each kernel processes f32 input arrays, with 1 warmup iteration and 10 timed iterations per configuration. All results are verified against a CPU reference for correctness.
| Size | C++ Naive (ms) | C++ Opt (ms) | Rust Scalar (ms) | Rust Vector (ms) | Scalar vs Naive | Vector vs Opt |
|---|---|---|---|---|---|---|
| 256 | 0.100 | 0.078 | 0.099 | 0.077 | 0.99x | 0.99x |
| 1,024 | 0.191 | 0.077 | 0.202 | 0.076 | 1.06x | 0.99x |
| 4,096 | 0.568 | 0.079 | 0.607 | 0.079 | 1.07x | 1.00x |
| 16,384 | 2.073 | 0.089 | 2.221 | 0.087 | 1.07x | 0.98x |
Key findings:

- **Rust vector matches C++ optimized performance.** The Rust vectorized kernel, using `ascend_std` vector intrinsics that map to AscendC operations, performs within 1-2% of the hand-optimized C++ kernel across all sizes. At 16,384 elements, the Rust vector kernel (0.087 ms) is actually slightly faster than C++ optimized (0.089 ms). This means there is zero performance penalty for writing vectorized NPU kernels in Rust instead of C++.
- **Vector intrinsics provide massive speedups.** Both vectorized kernels are 1.3x faster at small sizes and up to 25x faster at 16,384 elements compared to their scalar counterparts. The vector pipeline processes 256 bits (8 f32 values) per cycle vs one element per cycle for scalar code.
- **Rust scalar is within 5-7% of C++ scalar.** The scalar codegen path also produces competitive code; the small overhead comes from different UB (Unified Buffer) access patterns (direct pointer arithmetic vs accessor methods).
- **All implementations are numerically correct.** Every kernel-size combination produces results matching the CPU reference (max error < 1e-8, output sum ≈ 1.0). The vector implementations achieve even lower error than scalar (max_err ~1e-10 vs ~1e-8) due to hardware-optimized math operations.
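The timing protocol (1 warmup iteration, then 10 timed iterations, reporting the per-iteration average) can be sketched host-side as follows. The real harness launches NPU kernels; here a scalar CPU softmax stands in as the workload, and the `bench` helper is illustrative, not part of `ascend_std`:

```rust
use std::time::Instant;

// Average milliseconds per iteration after a warmup phase.
fn bench<F: FnMut()>(mut f: F, warmup: usize, iters: usize) -> f64 {
    for _ in 0..warmup {
        f();
    }
    let t0 = Instant::now();
    for _ in 0..iters {
        f();
    }
    t0.elapsed().as_secs_f64() * 1e3 / iters as f64
}

fn main() {
    let x: Vec<f32> = (0..4096).map(|i| (i % 97) as f32 * 0.01).collect();
    let mut out = vec![0.0f32; x.len()];
    let ms = bench(
        || {
            // Scalar CPU softmax as the workload under test.
            let m = x.iter().cloned().fold(f32::MIN, f32::max);
            let mut s = 0.0f32;
            for (o, v) in out.iter_mut().zip(&x) {
                *o = (v - m).exp();
                s += *o;
            }
            for o in out.iter_mut() {
                *o /= s;
            }
        },
        1,
        10,
    );
    assert!(ms >= 0.0);
}
```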
Here is what the Rust vectorized softmax kernel looks like — it reads almost identically to the C++ version:
```rust
#[ascend_std::aiv_kernel]
pub unsafe fn softmax(input: *const f32, output: *mut f32, len_buf: *const u32) {
    unsafe {
        let n = *len_buf;
        let in_buf = ascend_std::ascend_buf_alloc(n);
        let out_buf = ascend_std::ascend_buf_alloc(n);
        let work = ascend_std::ascend_buf_alloc(n);
        let rwork = ascend_std::ascend_buf_alloc(n);
        ascend_std::ascend_buf_load_f32(in_buf, input, n);
        ascend_std::ascend_pipe_barrier();
        let max_val = ascend_std::ascend_reduce_max_f32(work, in_buf, rwork, n);
        ascend_std::ascend_adds_f32(out_buf, in_buf, 0.0f32 - max_val, n);
        ascend_std::ascend_exp_f32(out_buf, out_buf, n);
        let sum_val = ascend_std::ascend_reduce_sum_f32(work, out_buf, rwork, n);
        ascend_std::ascend_muls_f32(out_buf, out_buf, 1.0f32 / sum_val, n);
        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_buf_store_f32(output, out_buf, n);
    }
}
```
The `ascend_buf_alloc` / `ascend_buf_load_f32` / `ascend_reduce_max_f32` calls are `extern "C"` stubs in `ascend_std` that the MLIR codegen backend recognizes and translates to AscendC API calls (`TBuf`, `DataCopy`, `ReduceMax`, etc.) during C++ code generation. This gives Rust kernels direct access to the NPU's vector pipeline with zero overhead.
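Because the stubs have simple data-parallel semantics, a kernel's step sequence can be sanity-checked off-device with a plain-Rust model. The sketch below is an assumption-laden CPU stand-in (slices instead of NPU local buffers, no barriers, function names merely mirror the intrinsics), not part of `ascend_std`:

```rust
// CPU stand-ins for the vector intrinsics (assumed elementwise/reduction
// semantics; real stubs operate on NPU local buffers).
fn adds_f32(dst: &mut [f32], src: &[f32], s: f32) {
    for i in 0..src.len() {
        dst[i] = src[i] + s;
    }
}
fn exp_f32(buf: &mut [f32]) {
    for v in buf.iter_mut() {
        *v = v.exp();
    }
}
fn muls_f32(buf: &mut [f32], s: f32) {
    for v in buf.iter_mut() {
        *v *= s;
    }
}
fn reduce_max_f32(src: &[f32]) -> f32 {
    src.iter().cloned().fold(f32::MIN, f32::max)
}
fn reduce_sum_f32(src: &[f32]) -> f32 {
    src.iter().sum()
}

// Same step sequence as the vectorized kernel: max → shift → exp → sum → scale.
fn softmax_ref(input: &[f32]) -> Vec<f32> {
    let mut out = vec![0.0f32; input.len()];
    let max_val = reduce_max_f32(input);
    adds_f32(&mut out, input, 0.0 - max_val);
    exp_f32(&mut out);
    let sum_val = reduce_sum_f32(&out);
    muls_f32(&mut out, 1.0 / sum_val);
    out
}

fn main() {
    let p = softmax_ref(&[1.0, 2.0, 3.0, 4.0]);
    assert!((p.iter().sum::<f32>() - 1.0).abs() < 1e-6);
    assert!(p.windows(2).all(|w| w[0] < w[1])); // monotone input → monotone output
}
```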
### 4.4 Beyond Softmax: Activation Function Benchmarks
To validate the breadth of the vector intrinsic API, we benchmarked three additional activation functions — Relu, Sigmoid, and Tanh — each composed from the same primitive operations. Unlike softmax, these activations don’t have dedicated AscendC builtins; instead they are constructed from composable vector primitives:
- Relu(x) = max(x, 0) → `Maxs`
- Sigmoid(x) = 1 / (1 + exp(-x)) → `Muls` → `Exp` → `Adds` → `Reciprocal`
- Tanh(x) = 2 · sigmoid(2x) - 1 → `Muls` → `Exp` → `Adds` → `Reciprocal` → `Muls` → `Adds`
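The Tanh composition relies on the identity tanh(x) = 2·sigmoid(2x) - 1, which a quick CPU check confirms:

```rust
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

fn main() {
    // tanh(x) = 2·sigmoid(2x) - 1, the composition used for the Tanh kernel.
    for &x in &[-2.0f32, -0.5, 0.0, 0.5, 2.0] {
        let composed = 2.0 * sigmoid(2.0 * x) - 1.0;
        assert!((composed - x.tanh()).abs() < 1e-5);
    }
}
```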
For each function, we compare a C++ implementation (`TQue` pipeline) against the equivalent Rust-style code (`TBuf` pipeline matching the `mlir_to_cpp` output):
| Size | Relu C++ (ms) | Relu Rust (ms) | Sigmoid C++ (ms) | Sigmoid Rust (ms) | Tanh C++ (ms) | Tanh Rust (ms) |
|---|---|---|---|---|---|---|
| 256 | 0.078 | 0.075 | 0.075 | 0.075 | 0.075 | 0.077 |
| 1,024 | 0.075 | 0.076 | 0.075 | 0.074 | 0.075 | 0.076 |
| 4,096 | 0.075 | 0.076 | 0.077 | 0.077 | 0.076 | 0.078 |
| 16,384 | 0.083 | 0.083 | 0.086 | 0.086 | 0.085 | 0.086 |
All six kernels perform identically within measurement noise. Relu achieves exact correctness (max_err = 0), while Sigmoid and Tanh achieve max_err < 3e-3 at sizes ≥ 1024. The size=256 correctness issue affects both C++ and Rust equally — it’s an AscendC hardware-level precision artifact at small vector sizes, not a codegen issue.
This confirms that the Rust vector intrinsic API generalizes beyond softmax. For the activation functions tested here — each a composition of AscendC vector primitives — Rust and C++ produce identical performance. We expect this to hold for any kernel composed purely from vector intrinsics, since the codegen maps each Rust intrinsic call 1:1 to the same AscendC C++ call. Cube engine operations (matmul via `Mmad`) and multi-level buffer hierarchies (L1/L0A/L0B/L0C) are supported at the API level but have not yet been hardware-verified through the full pipeline.
### 4.5 Formal Equivalence Verification: AscendC vs AscendRS
Performance parity is compelling, but the strongest argument for the Rust codegen pipeline is bitwise equivalence — proving that Rust-generated kernels produce exactly the same numerical results as hand-written AscendC C++ kernels on real NPU hardware.
We selected three representative kernels that cover the most common neural network operation patterns:
- **ReLU** — single vector op: `output[i] = max(input[i], 0)` → `ascend_maxs_f32`
- **Sigmoid** — chained vector ops: `output[i] = 1/(1 + exp(-input[i]))` → `Muls` → `Exp` → `Adds` → `Reciprocal`
- **Vec Add** — binary vector op: `z[i] = x[i] + y[i]` → `ascend_add_f32`
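As a CPU reference for what each of these three kernels computes (plain Rust, assuming the elementwise semantics of the intrinsics named above; the `_ref` helpers are illustrative):

```rust
// ReLU: elementwise max with scalar 0 (what a maxs op does per element).
fn relu_ref(x: &[f32]) -> Vec<f32> {
    x.iter().map(|v| v.max(0.0)).collect()
}
// Sigmoid: the Muls → Exp → Adds → Reciprocal chain, fused on the CPU.
fn sigmoid_ref(x: &[f32]) -> Vec<f32> {
    x.iter().map(|v| 1.0 / (1.0 + (-v).exp())).collect()
}
// Vec Add: elementwise binary addition.
fn vec_add_ref(x: &[f32], y: &[f32]) -> Vec<f32> {
    x.iter().zip(y).map(|(a, b)| a + b).collect()
}

fn main() {
    assert_eq!(relu_ref(&[-1.0, 2.0]), vec![0.0, 2.0]);
    assert_eq!(vec_add_ref(&[1.0, 2.0], &[3.0, 4.0]), vec![4.0, 6.0]);
    assert!((sigmoid_ref(&[0.0])[0] - 0.5).abs() < 1e-6);
}
```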
For each kernel, we compiled two implementations:
- **AscendC original** — idiomatic C++ using the `TQue` pipeline (`EnQue`/`DeQue` implicit synchronization), as a 910B production engineer would write it
- **AscendRS equivalent** — C++ generated from Rust source via the `mlir_to_cpp` pipeline (`TBuf` + explicit `pipe_barrier(PIPE_ALL)`)
Both were run on the 310P NPU with identical inputs (256 f32 elements, deterministic PRNG) and compared at three levels:
| Test | C++ vs CPU | RS vs CPU | C++ vs RS |
|---|---|---|---|
| ReLU | PASS (err=0.00) | PASS (err=0.00) | PASS (err=0.00) |
| Sigmoid | PASS (err=2.4e-3) | PASS (err=2.4e-3) | PASS (err=0.00) |
| Vec Add | PASS (err=0.00) | PASS (err=0.00) | PASS (err=0.00) |
The C++ vs RS column shows bitwise identical output (max error = 0.0) for all three kernels: the NPU produces exactly the same bits whether the kernel was written in C++ or Rust. The small sigmoid-vs-CPU difference (2.4e-3) reflects the precision of the NPU's `Exp()` vector unit vs x86 `expf()` — it affects both implementations equally and is not a codegen issue.
Here is the Rust sigmoid kernel — four lines of vector intrinsic calls that produce identical NPU output to the 40-line AscendC C++ class:
```rust
#[ascend_std::aiv_kernel]
pub unsafe fn sigmoid(input: *const f32, output: *mut f32, len: *const u32) {
    unsafe {
        let n = *len;
        let buf_in = ascend_std::ascend_buf_alloc(n);
        let buf_out = ascend_std::ascend_buf_alloc(n);
        ascend_std::ascend_buf_load_f32(buf_in, input, n);
        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_muls_f32(buf_out, buf_in, -1.0f32, n);
        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_exp_f32(buf_out, buf_out, n);
        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_adds_f32(buf_out, buf_out, 1.0f32, n);
        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_reciprocal_f32(buf_out, buf_out, n);
        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_buf_store_f32(output, buf_out, n);
    }
}
```
A notable discovery during this work: in-place chained vector operations on the 310P require an explicit `pipe_barrier(PIPE_ALL)` between each step. Without barriers between `Muls` → `Exp` → `Adds` → `Reciprocal` on the same buffer, the next operation reads stale data. This is a hardware synchronization requirement that the Rust codegen pipeline now handles correctly — and the equivalence test serves as a regression test for this behavior.