
4. A More Realistic Example: Softmax

Vector multiplication demonstrates the basics, but real neural network workloads require math functions like exp(), log(), and sqrt(). The softmax function — used in attention layers, classification heads, and probability normalization — is a perfect example:

$$\text{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}}$$
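
As a concrete reference point, here is a host-side sketch of this formula in plain Rust — the same kind of CPU reference the benchmarks below are verified against (`softmax_ref` is a name invented here):

```rust
// Host-side CPU reference for numerically stable softmax.
// Subtracting max(x) first keeps every exponent <= 0, so exp() cannot overflow.
fn softmax_ref(x: &[f32]) -> Vec<f32> {
    let max = x.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = x.iter().map(|&v| (v - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    let y = softmax_ref(&[1.0, 2.0, 3.0]);
    assert!((y.iter().sum::<f32>() - 1.0).abs() < 1e-5);
    // Without the max subtraction, exp(1000.0) would overflow to +inf in f32.
    let z = softmax_ref(&[1000.0, 1000.5]);
    assert!(z.iter().all(|v| v.is_finite()));
}
```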

4.1 Math Intrinsics in ascend_std

ascend-rs exposes hardware math operations as Rust methods on primitive types. Under the hood, f32::exp() maps to the expf32 compiler intrinsic, which the MLIR codegen backend lowers to llvm.intr.exp — ultimately executing as a native NPU math instruction.

// In ascend_std: these methods are available on f32/f64 in kernel code
let y = x.exp();   // expf32 → llvm.intr.exp
let y = x.ln();    // logf32 → llvm.intr.log
let y = x.sqrt();  // sqrtf32 → llvm.intr.sqrt

4.2 The Softmax Kernel

Here is a complete softmax kernel written in Rust for the Ascend NPU:

#![feature(no_core)]
#![no_std]
#![no_core]

#[ascend_std::aiv_kernel]
pub unsafe fn softmax(input: *const f32, output: *mut f32, len: *const u32) {
    unsafe {
        let n = *len as usize;

        // Step 1: Find max value for numerical stability
        let mut max_val = *input;
        let mut i = 1usize;
        loop {
            if i >= n { break; }
            let val = *input.wrapping_add(i);
            if val > max_val { max_val = val; }
            i = i + 1;
        }

        // Step 2: Compute exp(x_i - max) and accumulate sum
        let mut sum: f32 = 0.0;
        i = 0;
        loop {
            if i >= n { break; }
            let exp_val = (*input.wrapping_add(i) - max_val).exp();
            *output.wrapping_add(i) = exp_val;
            sum = sum + exp_val;
            i = i + 1;
        }

        // Step 3: Normalize
        i = 0;
        loop {
            if i >= n { break; }
            *output.wrapping_add(i) = *output.wrapping_add(i) / sum;
            i = i + 1;
        }
    }
}

The key line is (*input.wrapping_add(i) - max_val).exp() — this calls f32::exp(), which compiles through the MLIR backend into a native NPU exponential instruction. The subtraction of max_val before exponentiation is the standard numerical stability trick that prevents overflow.

This demonstrates that ascend-rs kernel code isn’t limited to simple arithmetic — it can express the same algorithms you’d write in C++ AscendC, with Rust’s safety guarantees.

4.3 Performance: Rust vs C++ on Real Hardware

How does a Rust kernel perform compared to hand-written C++ on actual NPU hardware? We benchmarked the softmax kernel on an Ascend 310P NPU with four implementations:

  • C++ naive (scalar) — A hand-written C++ kernel using scalar loops with GetValue/SetValue accessors
  • C++ optimized (vector) — An expert-written C++ kernel using AscendC vector intrinsics (ReduceMax, Exp, Muls)
  • Rust scalar — The Rust kernel above, compiled through the MLIR-to-C++ codegen pipeline
  • Rust vector — A Rust kernel using ascend-rs vector intrinsics (ascend_reduce_max_f32, ascend_exp_f32, ascend_muls_f32), compiled through the same pipeline

Each kernel processes f32 input arrays, with 1 warmup iteration and 10 timed iterations per configuration. All results are verified against a CPU reference for correctness.
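
The measurement protocol (1 warmup iteration, 10 timed iterations) can be sketched host-side. `bench_median_ms` is a hypothetical helper, and the closure stands in for a real NPU kernel launch:

```rust
use std::time::Instant;

// Sketch of the benchmark protocol: 1 warmup iteration, then 10 timed
// iterations, reporting the median (robust to scheduler noise).
fn bench_median_ms<F: FnMut()>(mut run_kernel: F) -> f64 {
    run_kernel(); // warmup: first launch pays one-time setup costs
    let mut times: Vec<f64> = (0..10)
        .map(|_| {
            let t0 = Instant::now();
            run_kernel();
            t0.elapsed().as_secs_f64() * 1e3
        })
        .collect();
    times.sort_by(|a, b| a.partial_cmp(b).unwrap());
    times[times.len() / 2]
}

fn main() {
    let mut calls = 0u32;
    let ms = bench_median_ms(|| calls += 1);
    assert_eq!(calls, 11); // 1 warmup + 10 timed
    assert!(ms >= 0.0);
}
```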

| Size | C++ Naive (ms) | C++ Opt (ms) | Rust Scalar (ms) | Rust Vector (ms) | Scalar vs Naive | Vector vs Opt |
|---|---|---|---|---|---|---|
| 256 | 0.100 | 0.078 | 0.099 | 0.077 | 0.99x | 0.99x |
| 1,024 | 0.191 | 0.077 | 0.202 | 0.076 | 1.06x | 0.99x |
| 4,096 | 0.568 | 0.079 | 0.607 | 0.079 | 1.07x | 1.00x |
| 16,384 | 2.073 | 0.089 | 2.221 | 0.087 | 1.07x | 0.98x |

Key findings:

  1. Rust vector matches C++ optimized performance. The Rust vectorized kernel, using ascend_std vector intrinsics that map to AscendC operations, performs within 1-2% of the hand-optimized C++ kernel across all sizes. At 16,384 elements, the Rust vector kernel (0.087ms) is actually slightly faster than C++ optimized (0.089ms). This means there is zero performance penalty for writing vectorized NPU kernels in Rust instead of C++.

  2. Vector intrinsics provide massive speedups. Both vectorized kernels are 1.3x faster at small sizes and up to 25x faster at 16,384 elements compared to their scalar counterparts. The vector pipeline processes 256 bits (8 floats) per cycle vs one element per cycle for scalar code.

  3. Rust scalar is within 5-7% of C++ scalar. The scalar codegen path also produces competitive code, with the small overhead coming from different UB access patterns (direct pointer arithmetic vs accessor methods).

  4. All implementations are numerically correct. Every kernel-size combination produces results matching the CPU reference (max error < 1e-8, output sum ≈ 1.0). The vector implementations achieve even lower error than scalar (max_err ~1e-10 vs ~1e-8) due to hardware-optimized math operations.
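
A minimal sketch of such a correctness check, with hypothetical helper names and the tolerances quoted above:

```rust
// Largest element-wise deviation between kernel output and CPU reference.
fn max_abs_err(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0f32, f32::max)
}

// A softmax output is accepted if it matches the reference within `tol`
// and satisfies the softmax invariant: outputs sum to ~1.0.
fn verify_softmax(out: &[f32], reference: &[f32], tol: f32) -> bool {
    let sum: f32 = out.iter().sum();
    max_abs_err(out, reference) < tol && (sum - 1.0).abs() < 1e-4
}

fn main() {
    let reference = [0.5f32, 0.5];
    assert!(verify_softmax(&[0.5, 0.5], &reference, 1e-8));
    assert!(!verify_softmax(&[0.6, 0.5], &reference, 1e-8)); // off reference, sum != 1
}
```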

Here is what the Rust vectorized softmax kernel looks like — it reads almost identically to the C++ version:

#[ascend_std::aiv_kernel]
pub unsafe fn softmax(input: *const f32, output: *mut f32, len_buf: *const u32) {
    unsafe {
        let n = *len_buf;
        let in_buf  = ascend_std::ascend_buf_alloc(n);
        let out_buf = ascend_std::ascend_buf_alloc(n);
        let work    = ascend_std::ascend_buf_alloc(n);
        let rwork   = ascend_std::ascend_buf_alloc(n);

        ascend_std::ascend_buf_load_f32(in_buf, input, n);
        ascend_std::ascend_pipe_barrier();

        let max_val = ascend_std::ascend_reduce_max_f32(work, in_buf, rwork, n);
        ascend_std::ascend_adds_f32(out_buf, in_buf, 0.0f32 - max_val, n);
        ascend_std::ascend_exp_f32(out_buf, out_buf, n);
        let sum_val = ascend_std::ascend_reduce_sum_f32(work, out_buf, rwork, n);
        ascend_std::ascend_muls_f32(out_buf, out_buf, 1.0f32 / sum_val, n);

        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_buf_store_f32(output, out_buf, n);
    }
}

The ascend_buf_alloc / ascend_buf_load_f32 / ascend_reduce_max_f32 calls are extern "C" stubs in ascend_std that the MLIR codegen backend recognizes and translates to AscendC API calls (TBuf, DataCopy, ReduceMax, etc.) during C++ code generation. This gives Rust kernels direct access to the NPU’s vector pipeline with zero overhead.

4.4 Beyond Softmax: Activation Function Benchmarks

To validate the breadth of the vector intrinsic API, we benchmarked three additional activation functions — Relu, Sigmoid, and Tanh — each composed from the same primitive operations. Unlike softmax, these activations don’t have dedicated AscendC builtins; instead they are constructed from composable vector primitives:

  • Relu(x) = max(x, 0) → Maxs
  • Sigmoid(x) = 1 / (1 + exp(-x)) → Muls → Exp → Adds → Reciprocal
  • Tanh(x) = 2 · sigmoid(2x) - 1 → Muls → Exp → Adds → Reciprocal → Muls → Adds
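
These decompositions can be spelled out as scalar Rust helpers, each step tagged with the vector primitive it mirrors; the tanh-via-sigmoid identity checks out numerically against the built-in:

```rust
fn relu(x: f32) -> f32 { x.max(0.0) }                 // Maxs
fn sigmoid(x: f32) -> f32 {
    let t = (x * -1.0).exp();                          // Muls -> Exp
    1.0 / (t + 1.0)                                    // Adds -> Reciprocal
}
// Muls -> Exp -> Adds -> Reciprocal -> Muls -> Adds
fn tanh_via_sigmoid(x: f32) -> f32 { 2.0 * sigmoid(2.0 * x) - 1.0 }

fn main() {
    for &x in &[-3.0f32, -0.5, 0.0, 0.5, 3.0] {
        // identity: tanh(x) = 2 * sigmoid(2x) - 1
        assert!((tanh_via_sigmoid(x) - x.tanh()).abs() < 1e-5);
    }
    assert_eq!(relu(-2.0), 0.0);
}
```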

For each function, we compare a C++ implementation (TQue pipeline) against the equivalent Rust-style code (TBuf pipeline matching the mlir_to_cpp output):

| Size | Relu C++ (ms) | Relu Rust (ms) | Sigmoid C++ (ms) | Sigmoid Rust (ms) | Tanh C++ (ms) | Tanh Rust (ms) |
|---|---|---|---|---|---|---|
| 256 | 0.078 | 0.075 | 0.075 | 0.075 | 0.075 | 0.077 |
| 1,024 | 0.075 | 0.076 | 0.075 | 0.074 | 0.075 | 0.076 |
| 4,096 | 0.075 | 0.076 | 0.077 | 0.077 | 0.076 | 0.078 |
| 16,384 | 0.083 | 0.083 | 0.086 | 0.086 | 0.085 | 0.086 |

All six kernels perform identically within measurement noise. Relu achieves exact correctness (max_err = 0), while Sigmoid and Tanh achieve max_err < 3e-3 at sizes ≥ 1024. The size=256 correctness issue affects both C++ and Rust equally — it’s an AscendC hardware-level precision artifact at small vector sizes, not a codegen issue.

This confirms that the Rust vector intrinsic API generalizes beyond softmax. For the activation functions tested here — each a composition of AscendC vector primitives — Rust and C++ produce identical performance. We expect this to hold for any kernel composed purely from vector intrinsics, since the codegen maps each Rust intrinsic call 1:1 to the same AscendC C++ call. Cube engine operations (matmul via Mmad) and multi-level buffer hierarchies (L1/L0A/L0B/L0C) are supported at the API level but have not yet been hardware-verified through the full pipeline.

4.5 Formal Equivalence Verification: AscendC vs AscendRS

Performance parity is compelling, but the strongest argument for the Rust codegen pipeline is bitwise equivalence — proving that Rust-generated kernels produce exactly the same numerical results as hand-written AscendC C++ kernels on real NPU hardware.

We selected three representative kernels that cover the most common neural network operation patterns:

  • ReLU — single vector op: output[i] = max(input[i], 0) → ascend_maxs_f32
  • Sigmoid — chained vector ops: output[i] = 1/(1 + exp(-input[i])) → Muls → Exp → Adds → Reciprocal
  • Vec Add — binary vector op: z[i] = x[i] + y[i] → ascend_add_f32

For each kernel, we compiled two implementations:

  1. AscendC original — idiomatic C++ using the TQue pipeline (EnQue/DeQue implicit synchronization), as a 910B production engineer would write it
  2. AscendRS equivalent — C++ generated from Rust source via the mlir_to_cpp pipeline (TBuf + explicit pipe_barrier(PIPE_ALL))

Both were run on the 310P NPU with identical inputs (256 f32 elements, deterministic PRNG) and compared at three levels:

| Test | C++ vs CPU | RS vs CPU | C++ vs RS |
|---|---|---|---|
| ReLU | PASS (err=0.00) | PASS (err=0.00) | PASS (err=0.00) |
| Sigmoid | PASS (err=2.4e-3) | PASS (err=2.4e-3) | PASS (err=0.00) |
| Vec Add | PASS (err=0.00) | PASS (err=0.00) | PASS (err=0.00) |

The C++ vs RS column shows bitwise identical output (max error = 0.0) for all three kernels. The NPU produces exactly the same bits whether the kernel was written in C++ or Rust. The small sigmoid CPU difference (2.4e-3) is the NPU’s Exp() vector unit precision vs x86 expf() — it affects both implementations equally and is not a codegen issue.
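
A check of the kind used for the C++ vs RS column can be sketched by comparing IEEE-754 bit patterns rather than absolute error (`bitwise_equal` is a name invented here):

```rust
// Two f32 buffers are "bitwise identical" iff every element has the same
// IEEE-754 bit pattern. This is stricter than max-abs-error == 0.0:
// it also distinguishes +0.0 from -0.0 and detects NaN payload differences.
fn bitwise_equal(a: &[f32], b: &[f32]) -> bool {
    a.len() == b.len()
        && a.iter().zip(b).all(|(x, y)| x.to_bits() == y.to_bits())
}

fn main() {
    assert!(bitwise_equal(&[1.0, -0.0], &[1.0, -0.0]));
    // An abs-error comparison would call +0.0 and -0.0 equal; bits differ.
    assert!(!bitwise_equal(&[0.0], &[-0.0]));
}
```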

Here is the Rust sigmoid kernel — four lines of vector intrinsic calls that produce identical NPU output to the 40-line AscendC C++ class:

#[ascend_std::aiv_kernel]
pub unsafe fn sigmoid(input: *const f32, output: *mut f32, len: *const u32) {
    unsafe {
        let n = *len;
        let buf_in = ascend_std::ascend_buf_alloc(n);
        let buf_out = ascend_std::ascend_buf_alloc(n);

        ascend_std::ascend_buf_load_f32(buf_in, input, n);
        ascend_std::ascend_pipe_barrier();

        ascend_std::ascend_muls_f32(buf_out, buf_in, -1.0f32, n);
        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_exp_f32(buf_out, buf_out, n);
        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_adds_f32(buf_out, buf_out, 1.0f32, n);
        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_reciprocal_f32(buf_out, buf_out, n);

        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_buf_store_f32(output, buf_out, n);
    }
}

A notable discovery during this work: in-place chained vector operations on the 310P require explicit pipe_barrier(PIPE_ALL) between each step. Without barriers between Muls→Exp→Adds→Reciprocal on the same buffer, the next operation reads stale data. This is a hardware synchronization requirement that the Rust codegen pipeline now handles correctly — and the equivalence test serves as a regression test for this behavior.

4.6 The PTO Tile API Pipeline: Higher-Level Abstractions

The mlir_to_cpp path compiles Rust kernels by generating AscendC C++ with explicit TBuf + pipe_barrier patterns — equivalent to what a C++ programmer writes manually. A second codegen path, mlir_to_pto, targets the PTO (Programmable Tile Operations) dialect: a higher-level MLIR representation that lets kernels be expressed as operations on rectangular tiles of data rather than individual vector operations.

In the tile API, a softmax kernel is just four function calls:

#[ascend_std::aiv_kernel]
pub unsafe fn softmax(input: *const f32, output: *mut f32) {
    let bid = ascend_std::get_block_idx() as usize;
    let offset = bid * ROWS * COLS;
    let t = tile_load_f32::<ROWS, COLS>(input.wrapping_add(offset));
    let r = tile_softmax_f32::<ROWS, COLS>(t);
    tile_store_f32::<ROWS, COLS>(output.wrapping_add(offset), r);
}

The tile_softmax_f32 call expands at compile time to the standard softmax decomposition (trowmax → trowexpandsub → texp → trowsum → trowexpanddiv). The shape parameters ROWS and COLS are compile-time constants, allowing ptoas (the PTO assembler) to assign optimal UB buffer offsets and synchronization flags automatically.
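
What that expansion computes can be sketched on the CPU as a row-wise pass over a ROWS × COLS tile stored row-major, one helper step per PTO op (toy sizes, hypothetical function name):

```rust
const ROWS: usize = 4;
const COLS: usize = 8;

// CPU sketch of the tile softmax decomposition; each step mirrors one PTO op.
fn tile_softmax(t: &mut [f32; ROWS * COLS]) {
    for r in 0..ROWS {
        let row = &mut t[r * COLS..(r + 1) * COLS];
        // trowmax: per-row maximum
        let m = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        // trowexpandsub + texp: broadcast-subtract the max, then exponentiate
        for v in row.iter_mut() { *v = (*v - m).exp(); }
        // trowsum: per-row sum
        let s: f32 = row.iter().sum();
        // trowexpanddiv: broadcast-divide each element by the row sum
        for v in row.iter_mut() { *v /= s; }
    }
}

fn main() {
    let mut t = [1.0f32; ROWS * COLS];
    tile_softmax(&mut t);
    // Uniform input: every output element is 1/COLS and each row sums to 1.
    assert!((t[0] - 1.0 / COLS as f32).abs() < 1e-6);
}
```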

Compilation Pipeline

Rust source
  → rustc + mlir_to_pto codegen backend
    → PTO-MLIR (.pto)           [ascend_tile_* → pto.trowmax / pto.texp / ...]
      → ptoas --enable-insert-sync
        → AscendC C++ (.cpp)    [TROWMAX / TEXP / TROWEXPANDDIV + auto sync]
          → bisheng (CANN 8.5)
            → AICore kernel binary (.o)

Benchmark Results (Ascend 910B2, dav-c220)

We benchmarked 6 kernel variants covering both 1D (single-row) and 2D (multi-row) tile shapes on an Ascend 910B2 NPU. Each variant processes ROWS × COLS f32 values in a single AICore block, with 1 warmup iteration and 10 timed iterations. All results are verified for correctness against a CPU reference.

| Shape | Elements | Median (ms) | Max Error | Correctness |
|---|---|---|---|---|
| 1×1024 | 1,024 | 0.0046 | 1.05e-9 | PASS |
| 1×4096 | 4,096 | 0.0063 | 1.75e-10 | PASS |
| 1×8192 | 8,192 | 0.0086 | 2.62e-10 | PASS |
| 4×256 | 1,024 | 0.0054 | 2.79e-9 | PASS |
| 16×256 | 4,096 | 0.0049 | 3.26e-9 | PASS |
| 16×512 | 8,192 | 0.0049 | 2.79e-9 | PASS |

All six kernels pass correctness checks (max error < 1e-8, row sums = 1.0). The multi-row shapes (16×256, 16×512) are faster than the equivalent single-row shapes (1×4096, 1×8192) at the same element count — wider tiles allow the hardware’s vector pipeline to process more rows in parallel.

Compared to the mlir_to_cpp vectorized softmax on the 310P (which ran at ~0.087 ms for 16,384 elements), the PTO tile kernels on the 910B2 run 10–18× faster at similar element counts. This reflects both the architectural advantages of the 910B2 (higher frequency, larger UB) and the efficiency of the PTO tile access pattern (a single TLOAD/TSTORE per block vs. per-element loads in scalar code).

Numerical Precision

The PTO path achieves higher numerical precision than the scalar mlir_to_cpp path. Where the 310P scalar kernels showed max_err ≈ 1e-8, the 910B2 tile kernels show max_err ≈ 1e-9 to 1e-10 — an order of magnitude improvement. This comes from the PTO decomposition using hardware reduction instructions (TROWMAX, TROWSUM) that accumulate in higher internal precision before returning a float result.
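
The effect of higher internal accumulation precision is easy to demonstrate in isolation. This toy example (not the NPU path itself) sums a million f32 values two ways: accumulating in f32 at every step vs. accumulating in f64 and rounding once at the end:

```rust
// Naive reduction: rounds to f32 after every addition, so error accumulates.
fn sum_f32(xs: &[f32]) -> f32 { xs.iter().sum() }

// Higher internal precision: accumulate in f64, round to f32 once at the end.
fn sum_via_f64(xs: &[f32]) -> f32 {
    xs.iter().map(|&v| v as f64).sum::<f64>() as f32
}

fn main() {
    let xs = vec![0.1f32; 1_000_000];
    let exact = 100_000.0f32; // 1e6 * 0.1, up to f32 input rounding
    let e32 = (sum_f32(&xs) - exact).abs();
    let e64 = (sum_via_f64(&xs) - exact).abs();
    // f64 accumulation loses far less than step-by-step f32 accumulation.
    assert!(e64 < e32);
}
```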

4.7 Async Rust Kernels: Maintainability and Scheduler Freedom

The tile softmax kernel above is already barrier-free from the programmer’s perspective. But the underlying principle deserves deeper examination — because it motivates the long-term direction of the ascend-rs programming model and explains why the PTO path delivers more than just a cleaner API.

The Barrier Maintenance Problem

Look at the buffer-API kernel from section 4.3, or the hand-written AscendC C++ equivalent it was benchmarked against. Even at this simple scale, the programmer must:

  1. Allocate named queues for each pipeline stage (TQue<QuePosition::VECIN, 1>)
  2. Issue EnQue/DeQue at every producer/consumer boundary
  3. Insert pipe_barrier(PIPE_ALL) at function exit to drain all in-flight ops
  4. Know the Ascend pipeline model (Mte2 → Vector → Mte1 DMA stages) well enough to place barriers correctly

A missing barrier is a silent data race — no compiler error, no runtime fault at small sizes, a subtle wrong-answer bug at scale. A spurious PIPE_ALL stall is a performance regression that is invisible in correctness tests. As kernels grow — Flash Attention, multi-head attention, fused softmax+dropout — this hand-maintained barrier graph diverges from the actual data dependencies. Bugs compound.

Ownership as Implicit Sequencing

The tile API sidesteps this through Rust’s ownership model:

// Each step consumes its input — you cannot accidentally reuse t_in after softmax
let t_in:  Tile<1, 1024, f32> = tile_load_f32::<1, 1024>(input_ptr);
let t_out: Tile<1, 1024, f32> = tile_softmax_f32::<1, 1024>(t_in);   // t_in moved
tile_store_f32::<1, 1024>(output_ptr, t_out);                          // t_out moved

This encodes the data-flow graph in the type system:

  • tile_load_f32 produces a Tile carrying a logical “Mte2 pending” token
  • tile_softmax_f32 waits for that token, then produces a Tile with a “V pending” token
  • tile_store_f32 waits for the V token, then issues Mte1
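
One way to encode such tokens, sketched here with hypothetical types (not the real ascend_std API), is Rust's typestate pattern: each pipeline stage is a zero-sized marker type, and each step consumes its input tile by move:

```rust
use std::marker::PhantomData;

// Hypothetical stage markers mirroring the tokens described above.
struct Mte2Pending; // data in flight on the GM->UB DMA pipe
struct VPending;    // result pending on the Vector pipe

struct Tile<Stage, const R: usize, const C: usize> {
    data: [[f32; C]; R],
    _stage: PhantomData<Stage>, // zero-sized: no runtime cost
}

fn tile_load<const R: usize, const C: usize>(src: &[[f32; C]; R]) -> Tile<Mte2Pending, R, C> {
    Tile { data: *src, _stage: PhantomData }
}

// Consumes the Mte2 token by move; produces a tile in the V-pending stage.
fn tile_softmax<const R: usize, const C: usize>(t: Tile<Mte2Pending, R, C>) -> Tile<VPending, R, C> {
    let mut data = t.data;
    for row in data.iter_mut() {
        let m = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let mut s = 0.0f32;
        for v in row.iter_mut() { *v = (*v - m).exp(); s += *v; }
        for v in row.iter_mut() { *v /= s; }
    }
    Tile { data, _stage: PhantomData }
}

// Only a V-pending tile can be stored; storing consumes the token.
fn tile_store<const R: usize, const C: usize>(dst: &mut [[f32; C]; R], t: Tile<VPending, R, C>) {
    *dst = t.data;
}

fn main() {
    let src = [[0.0f32; 4]; 2];
    let mut dst = [[0.0f32; 4]; 2];
    let loaded = tile_load(&src);
    let result = tile_softmax(loaded); // `loaded` moved: reusing it is a compile error
    tile_store(&mut dst, result);
    assert!((dst[0][0] - 0.25).abs() < 1e-6); // uniform row -> each element 1/4
}
```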

mlir_to_pto.rs translates this ownership chain to PTO-MLIR ops with no barrier calls at all (line 503 explicitly suppresses ascend_pipe_barrier). ptoas then sees a clean dependency graph and places set_flag/wait_flag only at the minimal required points.

What Async Rust Would Add

Ownership chains handle sequential pipelines well. For more complex patterns — double-buffering, speculative prefetch, interleaved load-compute-store across multiple tiles — a sequential chain forces an artificial total order on operations that could overlap.

An async-based tile API would express independent ops as concurrent futures:

// Hypothetical async tile API — two independent loads can overlap on Mte2
async fn softmax_kernel(input: *const f32, output: *mut f32) {
    let (t0, t1) = join!(
        tile_load_f32::<1, 1024>(input),
        tile_load_f32::<1, 1024>(input.wrapping_add(1024)),
    );

    let (r0, r1) = join!(
        tile_softmax_f32::<1, 1024>(t0),
        tile_softmax_f32::<1, 1024>(t1),
    );

    tile_store_f32::<1, 1024>(output, r0).await;
    tile_store_f32::<1, 1024>(output.wrapping_add(1024), r1).await;
}

The await points mark where one stage must wait for another’s result — and nowhere else. join! expresses that the two loads can be issued to the Mte2 DMA engine simultaneously, letting the hardware overlap them.

What This Gives ptoas

The Ascend NPU has five independent hardware pipes: Scalar, Mte1 (UB→GM), Mte2 (GM→UB), Vector, and Cube. With async tile ops, mlir_to_pto.rs emits PTO-MLIR where the only sequencing edges are true data dependencies. ptoas’s --enable-insert-sync then inserts set_flag/wait_flag pairs only where a dst-pipe op consumes a src-pipe op’s output — no other barriers.

For the softmax decomposition, this means:

  • trowmax (Vector) waits for tload (Mte2) → one set_flag(MTE2, V, 0)
  • trowexpandsub → texp → trowsum → trowexpanddiv are all Vector ops with sequential deps → no barriers between them (same pipe, hardware queues enforce order)
  • tstore (Mte1) waits for trowexpanddiv (Vector) → one set_flag(V, MTE1, 0)

Total: 2 fine-grained flags, compared to pipe_barrier(PIPE_ALL) at every step in the buffer-API path. The 16×512 shape reaching 12.9 GB/s is a direct measurement of this — 16 independent row-softmax ops exposed to ptoas as a single wide tile op, letting the scheduler find the optimal overlap.
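
The flag-placement rule can be illustrated with a toy model that assumes a straight-line dependency chain (real ptoas operates on a full dependency graph, not a list):

```rust
// Hypothetical pipe labels; names follow this chapter's convention.
#[derive(Clone, Copy, PartialEq)]
enum Pipe { Mte2, Vector, Mte1 }

// A set_flag/wait_flag pair is needed only on a cross-pipe edge; consecutive
// ops on the same pipe are ordered by the hardware queue for free.
fn count_flags(ops: &[(&str, Pipe)]) -> usize {
    ops.windows(2).filter(|w| w[0].1 != w[1].1).count()
}

fn main() {
    let softmax = [
        ("tload", Pipe::Mte2),
        ("trowmax", Pipe::Vector),
        ("trowexpandsub", Pipe::Vector),
        ("texp", Pipe::Vector),
        ("trowsum", Pipe::Vector),
        ("trowexpanddiv", Pipe::Vector),
        ("tstore", Pipe::Mte1),
    ];
    // Two cross-pipe edges: Mte2->Vector and Vector->Mte1.
    assert_eq!(count_flags(&softmax), 2);
}
```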

Current State

| Layer | Status |
|---|---|
| Tile API (sync ownership chain) | ✅ Working, benchmarked on 910B2 |
| mlir_to_pto.rs barrier suppression | ✅ Done — ascend_pipe_barrier dropped |
| ptoas --enable-insert-sync | ✅ Working — auto-inserts fine-grained sync |
| Async tile API (tile_join_load, tile_prefetch) | ✅ Done — tile_join_load_f32 and tile_prefetch_f32 added to ascend_std |
| Multi-tile double-buffering | ✅ Done — GEP offset fix in mlir_to_pto.rs; verified on 910B2 |

Double-Buffering Results (910B2, 2026-04-02)

tile_softmax_double_buf processes two 1×1024 tiles per launch using tile_prefetch_f32 to issue the second load before the first tile’s compute begins. ptoas schedules the two pto.tload ops concurrently on Mte2 because they have distinct partition_view offsets ([%c0,%c0] and [%c1,%c0]) — no data dependency between them.

| Kernel | Tiles/launch | Per-tile avg | Per-tile min |
|---|---|---|---|
| tile_softmax_1x1024 (baseline) | 1 | 0.0055 ms | 0.0045 ms |
| tile_softmax_double_buf | 2 | 0.0034 ms | 0.0025 ms |

1.62× per-tile throughput (avg); 1.82× best-case. See Appendix J §J.4 for full kernel source, generated PTO-MLIR, and the two-bug fix in mlir_to_pto.rs that made this possible.