Appendix J: Step-by-Step Reproducible Examples
This appendix walks through four complete, runnable ascend-rs examples from scratch. Each example includes the full source code, the exact shell commands to build and run it, the expected terminal output, and screenshots from real hardware runs. The goal is to let anyone with an Ascend NPU reproduce every result in this book.
Prerequisites
Hardware and Software
| Requirement | Minimum | Tested |
|---|---|---|
| Ascend NPU | Ascend 310P / 910B | Ascend 310P3, Ascend 910B2 |
| CANN | 8.1.RC1 | 8.1.RC1 (310P), 8.5.0 (910B) |
| Rust toolchain | nightly-2025-05-01 | nightly-2025-08-04 |
| OS | Linux aarch64 / x86_64 | Ubuntu 22.04 aarch64 |
| Driver | ≥ 24.1 | bundled with CANN |
One-time Environment Setup
# 1. Clone the repository
git clone https://github.com/ascend-rs/ascend-rs
cd ascend-rs
# 2. Source the CANN environment (adjust path for your installation)
source /usr/local/Ascend/ascend-toolkit/latest/bin/setenv.bash
# Or for CANN 8.5 standalone:
# source /usr/local/Ascend/cann-8.5.0/set_env.sh
# 3. Set the target SoC (adjust for your hardware)
export ACLRS_SOC_VERSION=Ascend310P3 # for 310P
# export ACLRS_SOC_VERSION=Ascend910B2 # for 910B2
# export ACLRS_SOC_VERSION=Ascend910_9392 # for 910 (older 9392 variant)
# 4. Verify the NPU is visible
npu-smi info
Expected output of npu-smi info (310P example):
+-------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc2 Version: 24.1.rc2 |
+------------------+-------------------+-------------------------------------------------+
| NPU Name | Health | Power(W) Temp(C) HBM-Usage(MB) Aicore(%) |
| Chip | | Bus-Id |
+==================+===================+=================================================+
| 0 310P3 | OK | 14 42 372 / 8192 0 |
| 0 | | 0000:82:00.0 |
+------------------+-------------------+-------------------------------------------------+
Example 1: Hello World — ACL Device Initialization
The simplest possible ascend-rs program: initialize the ACL runtime, open a device, create a context and stream, print the device descriptor, and exit. This verifies that your driver, CANN, and Rust toolchain are all working together.
Source Code
examples/acl_hello_world/src/main.rs:
use anyhow::Result;
use ascend_rs::prelude::*;
use log::info;
use simple_logger::SimpleLogger;
fn main() -> Result<()> {
SimpleLogger::new().env().init().ok();
// Each of these RAII wrappers acquires a resource on construction
// and releases it automatically on drop. The compiler enforces the
// correct lifetime nesting: Device < AclContext < AclStream.
let acl = Acl::new()?;
let device = Device::new(&acl)?;
let context = AclContext::new(&device)?;
let stream = AclStream::new(&context)?;
info!("Device {} initialized successfully", device.descriptor());
info!("Context handle: {:p}", context.as_ptr());
info!("Stream handle: {:p}", stream.as_ptr());
// Resources are released in reverse order when they go out of scope.
Ok(())
}
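The reverse-order release that the comments describe is ordinary Rust drop semantics: locals are dropped in reverse declaration order. A toy sketch (the type and field names are illustrative, not the ascend_rs API) makes the order observable by logging each release:

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Toy model of the RAII nesting used above (names illustrative, not the
/// ascend_rs API): each wrapper records its release into a shared log on
/// drop, showing that locals drop in reverse declaration order.
struct Resource {
    name: &'static str,
    log: Rc<RefCell<Vec<&'static str>>>,
}

impl Drop for Resource {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.name);
    }
}

fn drop_order() -> Vec<&'static str> {
    let log = Rc::new(RefCell::new(Vec::new()));
    {
        let _device = Resource { name: "Device", log: log.clone() };
        let _context = Resource { name: "Context", log: log.clone() };
        let _stream = Resource { name: "Stream", log: log.clone() };
        // All three are released here, in reverse order:
        // Stream, then Context, then Device.
    }
    Rc::try_unwrap(log).unwrap().into_inner()
}

fn main() {
    assert_eq!(drop_order(), ["Stream", "Context", "Device"]);
}
```

Because the stream borrows the context and the context borrows the device, the borrow checker also rejects any declaration order that would release a parent before its children.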
Build and Run
# From the repository root:
cd examples/acl_hello_world
RUST_LOG=info cargo run --release
Expected Output
2026-03-31T09:14:02Z INFO [acl_hello_world] Device Ascend310P3 initialized successfully
2026-03-31T09:14:02Z INFO [acl_hello_world] Context handle: 0x55a7b2c30010
2026-03-31T09:14:02Z INFO [acl_hello_world] Stream handle: 0x55a7b2c30080
The device name (Ascend310P3, Ascend910B2, etc.) will match the SoC set in
ACLRS_SOC_VERSION. If you see Device startup failed, the driver is not
running; check npu-smi info and ensure the device shows Health: OK.
Screenshot (310P hardware)
$ cd examples/acl_hello_world && RUST_LOG=info cargo run --release
Compiling acl_hello_world v0.1.0
Finished `release` profile [optimized] target(s) in 3.2s
Running `target/release/acl_hello_world`
2026-03-31T09:14:02Z INFO [acl_hello_world] Device Ascend310P3 initialized successfully
2026-03-31T09:14:02Z INFO [acl_hello_world] Context handle: 0x55a7b2c30010
2026-03-31T09:14:02Z INFO [acl_hello_world] Stream handle: 0x55a7b2c30080
What the output tells you:
- Device Ascend310P3 initialized successfully — the ACL runtime found the device and the CANN driver stack is functional.
- The context and stream handles are non-null kernel objects allocated by the driver; they are freed automatically when main returns.
Example 2: Vector Softmax — Rust Kernel on Real Hardware
This example runs the full softmax kernel from Chapter 4 on real NPU hardware:
a 1024-element f32 array passes through max → shift → exp → sum → divide on the NPU
vector pipeline, and the result is verified against a CPU reference.
Source Code
Kernel (examples/bench_softmax_rs/kernels/src/lib.rs):
#![feature(no_core)]
#![no_std]
#![no_core]
/// Vectorized row softmax kernel.
///
/// Uses the ascend_std vector intrinsics which the mlir_to_cpp backend
/// translates to AscendC DataCopy / ReduceMax / Exp / Muls / ReduceSum calls.
#[ascend_std::aiv_kernel]
pub unsafe fn softmax(input: *const f32, output: *mut f32, len_buf: *const u32) {
unsafe {
let n = *len_buf;
// Allocate UB (Unified Buffer) scratch tiles
let in_buf = ascend_std::ascend_buf_alloc(n);
let out_buf = ascend_std::ascend_buf_alloc(n);
let work = ascend_std::ascend_buf_alloc(n);
let rwork = ascend_std::ascend_buf_alloc(n);
// DMA: global memory → UB
ascend_std::ascend_buf_load_f32(in_buf, input, n);
ascend_std::ascend_pipe_barrier(); // wait for Mte2 engine
// Numerically stable softmax: subtract max before exp
let max_val = ascend_std::ascend_reduce_max_f32(work, in_buf, rwork, n);
ascend_std::ascend_adds_f32(out_buf, in_buf, 0.0f32 - max_val, n);
ascend_std::ascend_exp_f32(out_buf, out_buf, n);
let sum_val = ascend_std::ascend_reduce_sum_f32(work, out_buf, rwork, n);
ascend_std::ascend_muls_f32(out_buf, out_buf, 1.0f32 / sum_val, n);
// DMA: UB → global memory
ascend_std::ascend_pipe_barrier();
ascend_std::ascend_buf_store_f32(output, out_buf, n);
}
}
Host (examples/bench_softmax_rs/src/main.rs, abridged):
use ascend_rs::prelude::*;
fn main() -> anyhow::Result<()> {
let acl = Acl::new()?;
let device = Device::new(&acl)?;
let context = AclContext::new(&device)?;
let stream = AclStream::new(&context)?;
let n: u32 = 1024;
let input: Vec<f32> = (0..n as usize)
.map(|i| ((i as f32) * 0.01).sin() * 3.0)
.collect();
// Transfer input to device, allocate output and length buffers
let mut d_input = DeviceBuffer::from_slice(&input)?;
let mut d_output = unsafe { DeviceBuffer::<f32>::uninitialized(n as usize)? };
let mut d_len = DeviceBuffer::from_slice(&[n])?;
// Load and launch the kernel (1 block)
let kernel_loader = KernelLoader::new()?;
let kernel = kernel_loader.get_kernel("softmax")?;
let mut args: [*mut std::ffi::c_void; 3] = [
d_input.as_mut_ptr() as *mut _,
d_output.as_mut_ptr() as *mut _,
d_len.as_mut_ptr() as *mut _,
];
unsafe { kernel.launch(1, &stream, &mut args)?; }
stream.synchronize()?;
// Verify against CPU reference
let output = d_output.to_host()?;
let sum: f32 = output.iter().sum();
println!("sum = {:.6} (expected ≈ 1.0)", sum);
println!("output[0..4] = {:?}", &output[..4]);
Ok(())
}
Build and Run
cd examples/bench_softmax_rs
# Build the kernel (triggers the CANN compilation pipeline):
# Rust source → MLIR → C++ (mlir_to_cpp) → bisheng → .acl.o
RUST_LOG=info cargo run --release -- --csv /tmp/softmax_results.csv
The kernel compilation step (bisheng) takes ~5 seconds on first build;
subsequent builds use the cargo cache.
Expected Output
2026-03-31T09:15:44Z INFO [bench_softmax_rs] Device Ascend310P3 initialized
2026-03-31T09:15:44Z INFO [bench_softmax_rs] Running softmax benchmark
size=256 pass=true max_err=1.22e-8 sum=1.000000 rust_vec=0.077ms
size=1024 pass=true max_err=8.34e-9 sum=1.000000 rust_vec=0.076ms
size=4096 pass=true max_err=7.11e-9 sum=1.000000 rust_vec=0.079ms
size=16384 pass=true max_err=6.89e-9 sum=1.000000 rust_vec=0.087ms
Screenshot (310P hardware, full benchmark comparison)
$ RUST_LOG=info cargo run --release -- --csv /tmp/softmax_results.csv
Compiling bench_softmax_rs v0.1.0
Finished `release` profile [optimized] target(s) in 8.4s
Running `target/release/bench_softmax_rs --csv /tmp/softmax_results.csv`
2026-03-31T09:15:44Z INFO [bench_softmax_rs] Device Ascend310P3 initialized
2026-03-31T09:15:44Z INFO [bench_softmax_rs] size=256 rust_vec=0.077ms pass=true max_err=1.22e-8
2026-03-31T09:15:44Z INFO [bench_softmax_rs] size=1024 rust_vec=0.076ms pass=true max_err=8.34e-9
2026-03-31T09:15:44Z INFO [bench_softmax_rs] size=4096 rust_vec=0.079ms pass=true max_err=7.11e-9
2026-03-31T09:15:44Z INFO [bench_softmax_rs] size=16384 rust_vec=0.087ms pass=true max_err=6.89e-9
CSV written to /tmp/softmax_results.csv
Running the full comparison (Rust vs C++ side-by-side):
# From repository root:
cd benchmarks/softmax
bash bench.sh
=== Softmax Benchmark ===
--- Rust softmax benchmark ---
size=16384 rust_scalar=2.221ms rust_vec=0.087ms pass=true
--- C++ softmax benchmark ---
size=16384 cpp_naive=2.073ms cpp_opt=0.089ms pass=true
Performance summary (16384 elements):
Rust vector vs C++ optimized: 0.087ms vs 0.089ms → Rust is 1.02x faster
Vector speedup over scalar: 25.5x
Correctness: all sizes PASS (max_err < 1e-8)
How the Pipeline Works
Each step in the compilation pipeline can be inspected by looking at the
intermediate files in kernels/target/:
kernels/target/davinci-huawei-none/release/deps/
├── softmax_kernels.mlir ← MLIR output from rustc codegen
├── softmax_kernels.mlir.acl.gen.cpp ← C++ generated by mlir_to_cpp
└── softmax_kernels.acl.o ← NPU object file from bisheng
The generated C++ (acl.gen.cpp) shows the direct AscendC API calls that the
Rust intrinsics compile to:
// Generated from: ascend_std::ascend_exp_f32(out_buf, out_buf, n)
Exp(out_buf_local, out_buf_local, n);
pipe_barrier(PIPE_V);
Example 3: Tile Softmax — PTO Codegen Path on Ascend 910B
This example demonstrates the newer PTO (Programmable Tile Operations)
codegen path, which targets the Ascend 910B (dav-c220) matrix pipeline. The
tile API expresses 2D tile operations (tile_load, tile_softmax,
tile_store) that compile through ptoas — the PTO assembler — rather than
the standard C++ codegen.
This is the most advanced example and requires an Ascend 910B device with
ptoas available. It demonstrates the complete pipeline:
Rust tile API → MLIR → PTO-MLIR → ptoas → CCE C++ → ccec → .acl.o
Source Code
Kernel (examples/tile_softmax/kernels/src/lib.rs):
#![feature(no_core)]
#![no_std]
#![no_core]
use ascend_std::tile::{tile_load_f32, tile_softmax_f32, tile_store_f32, Tile};
/// Row-wise softmax over a ROWS × COLS tile of f32 values.
///
/// The tile API is a 2D abstraction over the NPU's vector engine:
/// - `tile_load_f32` → PTO `tload` (DMA from global memory to UB tile)
/// - `tile_softmax_f32` → PTO reduction ops: trowmax → trowexpandsub →
/// texp → trowsum → trowexpanddiv
/// - `tile_store_f32` → PTO `tstore` (DMA from UB tile to global memory)
///
/// The `ptoas --enable-insert-sync` flag automatically inserts set_flag /
/// wait_flag barriers between tile operations.
#[ascend_std::aiv_kernel]
pub unsafe fn tile_softmax(input: *const f32, output: *mut f32) {
let block_idx = ascend_std::get_block_idx() as usize;
let offset = block_idx * 1 * 1024; // ROWS=1, COLS=1024
// Load tile from global memory
let t_in: Tile<1, 1024, f32> =
tile_load_f32::<1, 1024>(input.wrapping_add(offset));
// Compute softmax: max → shift → exp → sum → divide
let t_out: Tile<1, 1024, f32> = tile_softmax_f32::<1, 1024>(t_in);
// Store result to global memory
tile_store_f32::<1, 1024>(output.wrapping_add(offset), t_out);
}
Host (examples/tile_softmax/src/main.rs, abridged):
use ascend_rs::prelude::*;
fn main() -> anyhow::Result<()> {
const ROWS: usize = 1;
const COLS: usize = 1024;
let acl = Acl::new()?;
let device = Device::new(&acl)?;
let context = AclContext::new(&device)?;
let stream = AclStream::new(&context)?;
// Sinusoidal input for visual verification
let input: Vec<f32> = (0..ROWS * COLS)
.map(|i| ((i as f32) * 0.01).sin() * 3.0)
.collect();
let mut d_input = DeviceBuffer::from_slice(&input)?;
let mut d_output = unsafe { DeviceBuffer::<f32>::uninitialized(ROWS * COLS)? };
let kernel_loader = KernelLoader::new()?;
let kernel = kernel_loader.get_kernel("tile_softmax")?;
let mut args: [*mut std::ffi::c_void; 2] = [
d_input.as_mut_ptr() as *mut _,
d_output.as_mut_ptr() as *mut _,
];
unsafe { kernel.launch(1, &stream, &mut args)?; } // 1 block
stream.synchronize()?;
let output = d_output.to_host()?;
let sum: f32 = output.iter().sum();
let max_err = output.iter()
.zip(softmax_cpu(&input, ROWS, COLS).iter())
.map(|(a, b)| (a - b).abs())
.fold(0.0f32, f32::max);
println!("tile_softmax: max_err={:.4e} sum={:.6} {}",
max_err, sum,
if max_err < 1e-5 && (sum - 1.0).abs() < 1e-4 { "PASS" } else { "FAIL" });
Ok(())
}
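The host code above calls a softmax_cpu helper that the abridged listing omits. A minimal sketch is below; the name and signature match the call site, but the body is our reconstruction of a standard numerically stable row-wise softmax, not the repository's exact implementation:

```rust
/// Numerically stable row-wise softmax reference, matching the
/// `softmax_cpu(&input, ROWS, COLS)` call site above. The body is a
/// reconstruction, not the repository's exact implementation.
fn softmax_cpu(input: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; rows * cols];
    for r in 0..rows {
        let row = &input[r * cols..(r + 1) * cols];
        // Step 1: row max (shift for numerical stability)
        let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        // Steps 2-3: exp(x - max)
        let exps: Vec<f32> = row.iter().map(|x| (x - max).exp()).collect();
        // Step 4: row sum
        let sum: f32 = exps.iter().sum();
        // Step 5: normalize so each row sums to 1.0
        for (o, e) in out[r * cols..(r + 1) * cols].iter_mut().zip(&exps) {
            *o = e / sum;
        }
    }
    out
}

fn main() {
    let out = softmax_cpu(&[1.0, 2.0, 3.0, 4.0], 1, 4);
    println!("sum = {:.6}", out.iter().sum::<f32>()); // each row sums to ~1.0
}
```

This mirrors the same five-step decomposition the kernel performs on the NPU, which is what makes the max_err comparison meaningful.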
Build and Run
# Required environment (Ascend 910B with CANN 8.5 and ptoas)
export ACLRS_CANN_PATH=/usr/local/Ascend/cann-8.5.0
export ACLRS_SOC_VERSION=Ascend910_9392 # adjust for your SoC
export ACLRS_CODEGEN_PATH=pto # enable PTO path
export ACLRS_PTOAS_PATH=/path/to/ptoas # ptoas assembler binary
export ACLRS_PTO_ISA_PATH=/path/to/pto-isa/include # pto-isa headers
export LD_LIBRARY_PATH=/data/llvm20/lib:${ACLRS_CANN_PATH}/aarch64-linux/lib64:\
/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/driver/lib64/common
source ${ACLRS_CANN_PATH}/set_env.sh
export PATH=${ACLRS_CANN_PATH}/tools/ccec_compiler/bin:$PATH
cd examples/tile_softmax
cargo run --release
Compilation Pipeline Trace
The build system prints each step. With RUST_LOG=debug you can see the exact
commands:
# Step 1: Rust → MLIR (rustc with custom codegen backend)
rustc --crate-type lib -Z codegen-backend=librustc_codegen_mlir.so ...
→ tile_softmax_kernels.mlir
# Step 2: MLIR → PTO-MLIR (mlir_to_pto.rs)
→ tile_softmax_kernels.acl.pto
# Step 3: PTO-MLIR → CCE C++ (ptoas)
ptoas --enable-insert-sync --pto-arch=a3 tile_softmax_kernels.acl.pto \
-o tile_softmax_kernels.acl.pto.cpp
# Step 4: CCE C++ → NPU object (ccec)
ccec -c -O3 -x cce -DMEMORY_BASE --cce-aicore-arch=dav-c220-vec \
-mllvm -cce-aicore-addr-transform \
-mllvm -cce-aicore-dcci-insert-for-scalar=false \
-I/path/to/pto-isa/include \
tile_softmax_kernels.acl.pto.cpp \
-o tile_softmax_kernels.acl.o
Intermediate Artifacts (Committed)
The intermediate files generated during the verified 2026-04-01 run on Ascend 910B2
are committed to the repository under examples/tile_softmax/artifacts/. You can
inspect each stage of the pipeline without installing any tools:
| File | Stage | Description |
|---|---|---|
| tile_softmax_kernels.acl.pto | MLIR → PTO-MLIR | PTO-MLIR dialect emitted by mlir_to_pto.rs |
| tile_softmax_kernels.acl.pto.cpp | PTO-MLIR → CCE C++ | AscendC C++ generated by ptoas --enable-insert-sync |
| tile_softmax_kernels.acl.pto.compat-a3.hpp | CANN 8.5 shim | Compatibility header patched by pto-compat-cann85.hpp |
For the multi-shape benchmark, see the equivalent artifacts in
examples/bench_softmax_tile/artifacts/.
The complete PTO-MLIR output for the 1×1024 softmax kernel
(tile_softmax_kernels.acl.pto):
// Generated by ascend-rs mlir_to_pto — DO NOT EDIT
// Compile: ptoas --enable-insert-sync <file.pto> -o <file.cpp>
module {
func.func @tile_softmax(%arg601: !pto.ptr<f32>, %arg602: !pto.ptr<f32>) {
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%c1024 = arith.constant 1024 : index
%pto0 = pto.make_tensor_view %arg601, shape = [%c1, %c1024], strides = [%c1024, %c1] : !pto.tensor_view<?x?xf32>
%pto1 = pto.partition_view %pto0, offsets = [%c0, %c0], sizes = [%c1, %c1024] : !pto.tensor_view<?x?xf32> -> !pto.partition_tensor_view<1x1024xf32>
%pto2 = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=1024, v_row=1, v_col=1024, blayout=row_major, slayout=none_box, fractal=512, pad=0>
pto.tload ins(%pto1 : !pto.partition_tensor_view<1x1024xf32>) outs(%pto2 : !pto.tile_buf<...>)
// scratch tile for trowmax
%pto3 = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=8, cols=1, ...> // row-max result
%pto4 = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=1024, ...> // scratch
%pto5 = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=1024, ...> // shifted
%pto6 = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=1024, ...> // exp result
%pto7 = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=8, cols=1, ...> // row-sum result
%pto8 = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=1024, ...> // final output
// softmax decomposition:
pto.trowmax ins(%pto2, %pto4 : ...) outs(%pto3 : ...) // Step 1: max per row
pto.trowexpandsub ins(%pto2, %pto3 : ...) outs(%pto5 : ...) // Step 2: x - max
pto.texp ins(%pto5 : ...) outs(%pto6 : ...) // Step 3: exp(x - max)
pto.trowsum ins(%pto6, %pto4 : ...) outs(%pto7 : ...) // Step 4: sum
pto.trowexpanddiv ins(%pto6, %pto7 : ...) outs(%pto8 : ...) // Step 5: / sum
%pto9 = pto.make_tensor_view %arg602, shape = [%c1, %c1024], strides = [%c1024, %c1] : !pto.tensor_view<?x?xf32>
%pto10 = pto.partition_view %pto9, offsets = [%c0, %c0], sizes = [%c1, %c1024] : !pto.tensor_view<?x?xf32> -> !pto.partition_tensor_view<1x1024xf32>
pto.tstore ins(%pto8 : ...) outs(%pto10 : ...)
return
}
}
After ptoas --enable-insert-sync, the CCE C++ kernel entry point
(tile_softmax_kernels.acl.pto.cpp,
excerpt):
extern "C" __global__ AICORE void tile_softmax(__gm__ float* v1, __gm__ float* v2) {
// ptoas allocates UB tiles at compile-time offsets (v8..v14)
Tile<TileType::Vec, float, 1, 1024, BLayout::RowMajor, ...> v18; // input tile
TLOAD(v18, v17);
set_flag(PIPE_MTE2, PIPE_V, EVENT_ID0); // auto-inserted sync
// Softmax reduction ops:
wait_flag(PIPE_MTE2, PIPE_V, EVENT_ID0);
TROWMAX(v20, v18, v23); pipe_barrier(PIPE_V);
TROWEXPANDSUB(v24, v18, v20); pipe_barrier(PIPE_V);
TEXP(v25, v24); pipe_barrier(PIPE_V);
TROWSUM(v27, v25, v23); pipe_barrier(PIPE_V);
TROWEXPANDDIV(v30, v25, v27);
set_flag(PIPE_V, PIPE_MTE3, EVENT_ID0); // auto-inserted sync
wait_flag(PIPE_V, PIPE_MTE3, EVENT_ID0);
TSTORE(v33, v30);
pipe_barrier(PIPE_ALL);
}
The __global__ keyword marks this as a host-callable entry point. Without it,
ccec compiles the function successfully but the runtime cannot dispatch it
(symptom: MTE DDR address out of range, error code 0x800000). This was a
non-obvious bug fixed in commit 04c80ac6.
Expected Output
2026-04-01T12:17:35Z INFO [tile_softmax] tile_softmax test: ROWS=1, COLS=1024, n=1024
2026-04-01T12:17:35Z INFO [tile_softmax] Device Ascend910_9392 initialized
2026-04-01T12:17:35Z INFO [tile_softmax] Launching tile_softmax kernel (1 block, 1x1024 f32)...
2026-04-01T12:17:36Z INFO [tile_softmax] tile_softmax: max_err=1.8626e-9 sum=1.000000 sum_ok=true PASS
2026-04-01T12:17:36Z INFO [tile_softmax] tile_softmax PASSED
The max_err=1.8626e-9 result was recorded on 2026-04-01 on Ascend 910B2 hardware (Ascend910_9392, dav-c220). The PTO tile reduction instructions (TROWMAX, TROWSUM) accumulate with higher internal precision before returning f32, achieving ~10× better numerical accuracy than the scalar mlir_to_cpp path (which yields max_err ≈ 1e-8 on the same data).
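Why wider internal accumulation helps can be seen on the CPU. The sketch below is an illustration of the principle only — f64 stands in for the NPU's wider accumulator, and this does not model the actual TROWSUM hardware path:

```rust
/// Illustration only: accumulating a reduction at higher internal
/// precision (f64 standing in for a wider hardware accumulator) yields a
/// more accurate f32 result. Not a model of the actual TROWSUM path.
fn sum_f32(xs: &[f32]) -> f32 {
    xs.iter().copied().fold(0.0f32, |a, b| a + b)
}

fn sum_via_f64(xs: &[f32]) -> f32 {
    xs.iter().map(|&x| x as f64).sum::<f64>() as f32
}

fn main() {
    // 1e8 swallows every +1.0 in f32 (the f32 spacing at 1e8 is 8.0), so
    // the naive f32 sum never moves; the f64 accumulator keeps all 1000 ones.
    let mut xs = vec![1.0e8f32];
    xs.extend(std::iter::repeat(1.0f32).take(1000));
    println!("f32 accumulate: {}", sum_f32(&xs));     // 100000000
    println!("f64 accumulate: {}", sum_via_f64(&xs)); // 100001000
}
```

The same rounding effect, at much smaller magnitude, is what separates the ~1e-8 error of the f32-accumulating path from the ~1.9e-9 error of the tile path.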
What Makes This Different from Example 2
| | Example 2 (Vector Softmax) | Example 3 (Tile Softmax) |
|---|---|---|
| Codegen path | mlir_to_cpp → bisheng | mlir_to_pto → ptoas → ccec |
| Abstraction | Scalar intrinsics (ascend_reduce_max_f32) | 2D tile ops (tile_softmax_f32) |
| Target hardware | 310P or 910B (vector engine) | 910B (dav-c220, a2a3 path) |
| Intermediate format | AscendC C++ | PTO-MLIR dialect |
| Barriers | Manual (ascend_pipe_barrier) | Auto-inserted by ptoas --enable-insert-sync |
| Parallelism model | 1 block, scalar loops | 1 block, 2D tile |
| Verified max_err | ~1e-8 (310P hardware) | ~1.9e-9 (910B2 hardware, 2026-04-01) |
Example 4: Double-Buffer Tile Softmax
Extends Example 3 to process two tiles per kernel launch using tile_prefetch_f32, overlapping Mte2 DMA (tile 1 load) with Vector compute (tile 0 softmax). See §4.7 for the performance results.
Source Code
Kernel (examples/tile_softmax_double_buf/kernels/src/lib.rs):
#![feature(no_core)]
#![no_std]
#![no_core]
use ascend_std::tile::{
tile_load_f32, tile_prefetch_f32, tile_softmax_f32, tile_store_f32, Tile,
};
#[ascend_std::aiv_kernel]
pub unsafe fn tile_softmax_double_buf(input: *const f32, output: *mut f32) {
const ROWS: usize = 1;
const COLS: usize = 1024;
const TILE_ELEMS: usize = ROWS * COLS;
// --- Prologue: issue both loads before any compute ---
// t0 loads tile 0 (offset 0); t1 prefetches tile 1 (offset TILE_ELEMS).
let t0: Tile<ROWS, COLS, f32> = tile_load_f32::<ROWS, COLS>(input);
let t1: Tile<ROWS, COLS, f32> =
tile_prefetch_f32::<ROWS, COLS>(input.wrapping_add(TILE_ELEMS));
// --- Compute tile 0 (Mte2 for t1 can overlap this on the hardware) ---
let r0: Tile<ROWS, COLS, f32> = tile_softmax_f32::<ROWS, COLS>(t0);
// --- Compute tile 1 ---
let r1: Tile<ROWS, COLS, f32> = tile_softmax_f32::<ROWS, COLS>(t1);
// --- Store results ---
tile_store_f32::<ROWS, COLS>(output, r0);
tile_store_f32::<ROWS, COLS>(output.wrapping_add(TILE_ELEMS), r1);
}
The move-ownership pattern enforces the pipeline at compile time: t0 is consumed by tile_softmax_f32 before t1 is used, so there is no data race. tile_prefetch_f32 is identical to tile_load_f32 at the hardware level; the different name documents the programmer’s intent.
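The compile-time enforcement can be demonstrated with a toy type; the names below are illustrative, not the real ascend_std tile API:

```rust
/// Toy model of the tile move-ownership pattern (names illustrative, not
/// the real ascend_std API). Taking `self`-like arguments by value means
/// each tile feeds exactly one downstream op; reusing a consumed tile is
/// a compile error, which rules out data races at compile time.
struct Tile {
    data: Vec<f32>,
}

fn tile_softmax(t: Tile) -> Tile {
    // Consumes `t`: after this call the caller can no longer touch it.
    let max = t.data.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = t.data.iter().map(|x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    Tile { data: exps.into_iter().map(|e| e / sum).collect() }
}

fn main() {
    let t0 = Tile { data: vec![1.0, 2.0, 3.0] };
    let r0 = tile_softmax(t0);
    // let again = tile_softmax(t0); // ← would not compile: `t0` was moved
    println!("sum = {}", r0.data.iter().sum::<f32>());
}
```

Uncommenting the second call produces a "use of moved value" error, which is exactly the race the ownership model prevents in the kernel.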
Build and Run
# Same environment as Example 3 (Ascend 910B with CANN 8.5 and ptoas)
export ACLRS_CANN_PATH=/usr/local/Ascend/cann-8.5.0
export ACLRS_SOC_VERSION=Ascend910_9392
export ACLRS_CODEGEN_PATH=pto
export ACLRS_PTOAS_PATH=/path/to/ptoas
export ACLRS_PTO_ISA_PATH=/path/to/pto-isa/include
export LD_LIBRARY_PATH=/data/llvm20/lib:${ACLRS_CANN_PATH}/aarch64-linux/lib64:\
/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/driver/lib64/common
source ${ACLRS_CANN_PATH}/set_env.sh
export PATH=${ACLRS_CANN_PATH}/tools/ccec_compiler/bin:$PATH
cd examples/tile_softmax_double_buf
cargo run --release
Generated PTO-MLIR
The key difference from Example 3 is that the two loads produce distinct partition_view ops with different row offsets:
// tile 0: load from row 0
%pto1 = pto.partition_view %pto0, offsets = [%c0, %c0], sizes = [%c1, %c1024] : ...
pto.tload ins(%pto1 : ...) outs(%pto2 : ...)
// tile 1: load from row 1 (offset 1024 elements = row 1 with cols=1024)
%pto3 = pto.partition_view %pto0, offsets = [%c1, %c0], sizes = [%c1, %c1024] : ...
pto.tload ins(%pto3 : ...) outs(%pto4 : ...)
// softmax(t0) — Vector pipe; Mte2 can overlap with tload above
pto.trowmax ins(%pto2, ...) outs(...)
pto.trowexpandsub ...
pto.texp ...
pto.trowsum ...
pto.trowexpanddiv ins(...) outs(%pto10 : ...)
// softmax(t1)
pto.trowmax ins(%pto4, ...) outs(...)
...
pto.trowexpanddiv ins(...) outs(%pto16 : ...)
// stores — both at row 0 and row 1 of output
%pto18 = pto.partition_view %pto17, offsets = [%c0, %c0], ...
pto.tstore ins(%pto10 : ...) outs(%pto18 : ...)
%pto19 = pto.partition_view %pto17, offsets = [%c1, %c0], ...
pto.tstore ins(%pto16 : ...) outs(%pto19 : ...)
With offsets=[%c0,%c0] and offsets=[%c1,%c0] encoding different rows, ptoas recognises the two tload ops as accessing independent memory regions and schedules them concurrently on the Mte2 pipe.
Expected Output
2026-04-02T06:14:07Z INFO [tile_softmax_double_buf] double_buf 2×(1×1024): total avg=0.0068ms min=0.0049ms max=0.0140ms | per-tile avg=0.0034ms min=0.0024ms | max_err=3.26e-9 PASS
Raw results: examples/tile_softmax_double_buf/results/bench_double_buf_910b2_2026-04-02.csv.
The GEP Offset Bug Fix
Before this example could work correctly, mlir_to_pto.rs had two bugs:
Bug 1 — make_pv always emitted offsets=[%c0,%c0]:
The GEP index was tracked in gep_offsets but never passed to make_pv. Fixed by adding elem_offset: u32 to make_pv and converting it to (row_off, col_off) using cols as stride.
Bug 2 — Pattern 3 alias chain was flattened:
The load-from-alloca pattern (Pattern 3) called ctx.resolve_ptr(&stored) before inserting the alias, which skipped the intermediate GEP node (%gep → %arg0) where gep_offsets[%gep] = 1024 was recorded. Fixed by storing the immediate alias without resolving first, so resolve_offset can traverse the full chain.
Troubleshooting
Device startup failed
The NPU driver is not running or the device is in a fault state. Check:
npu-smi info # look for Health: OK (not Critical)
npu-smi reset -i 0 # reset device 0 (requires root)
Could not determine ASCEND_HOME_PATH
ACLRS_CANN_PATH is not set or the path doesn’t exist:
export ACLRS_CANN_PATH=/usr/local/Ascend/cann-8.5.0
# verify it exists:
ls $ACLRS_CANN_PATH/tools/ccec_compiler/bin/bisheng
ptoas assembler not found
Set ACLRS_PTOAS_PATH to the full path of the ptoas binary:
export ACLRS_PTOAS_PATH=/path/to/ptoas/build/tools/ptoas/ptoas
ptoas is part of the pto-isa
project and is only required for the PTO codegen path (Examples 3 and 4).
ccec PTO compilation failed: set_mask_count does not support target feature
This means the wrong --cce-aicore-arch was used. Ensure:
- ACLRS_SOC_VERSION is set correctly for your chip
- ascend-rs is on the claude_code or main branch (fix committed in d45ab4e3 and adbf7294)
error: definition of type 'bfloat16_t' conflicts with typedef
Your ccec version already defines bfloat16_t. This was fixed in commit
adbf7294. Update to the latest branch.
Correctness check fails (max_err > 1e-5)
- For the vector softmax on 310P: expected max_err < 1e-8 (hardware f32 math)
- For the tile softmax on 910B: expected max_err < 1e-9 (PTO reduction instructions use higher internal precision; verified result is max_err=1.86e-9)
- Values larger than 1e-5 may indicate the wrong SoC version is set, causing mismatched UB buffer size assumptions, or a missing __global__ on the kernel entry point (fixed in commit 04c80ac6)
Summary: Pipeline Comparison at a Glance
Example 1: Hello World
Rust host code → cargo build → binary → ACL runtime → NPU device
(No kernel — pure host/driver interaction)
Example 2: Vector Softmax (mlir_to_cpp path)
Rust kernel → rustc → MLIR → mlir_to_cpp → AscendC C++
→ bisheng → .acl.o → KernelLoader → NPU execution
Example 3: Tile Softmax (PTO path)
Rust kernel → rustc → MLIR → mlir_to_pto → PTO-MLIR dialect
→ ptoas → CCE C++ → ccec → .acl.o
→ KernelLoader → NPU execution
All three pipelines share the same host-side runtime (ascend_rs::prelude::*):
Acl, Device, AclContext, AclStream, DeviceBuffer, KernelLoader.
The only difference is in how the .acl.o kernel binary is produced.