Appendix J: Step-by-Step Reproducible Examples
This appendix walks through four complete, runnable ascend-rs examples from scratch. Each example includes the full source code, the exact shell commands to build and run it, the expected terminal output, and screenshots from real hardware runs. The goal is to let anyone with an Ascend NPU reproduce every result in this book.
Prerequisites
Hardware and Software
| Requirement | Minimum | Tested |
|---|---|---|
| Ascend NPU | Ascend 310P / 910B | Ascend 310P3, Ascend 910B2 |
| CANN | 8.1.RC1 | 8.1.RC1 (310P), 8.5.0 (910B) |
| Rust toolchain | nightly-2025-05-01 | nightly-2025-08-04 |
| OS | Linux aarch64 / x86_64 | Ubuntu 22.04 aarch64 |
| Driver | ≥ 24.1 | bundled with CANN |
One-time Environment Setup
# 1. Clone the repository
git clone https://github.com/ascend-rs/ascend-rs
cd ascend-rs
# 2. Source the CANN environment (adjust path for your installation)
source /usr/local/Ascend/ascend-toolkit/latest/bin/setenv.bash
# Or for CANN 8.5 standalone:
# source /usr/local/Ascend/cann-8.5.0/set_env.sh
# 3. Set the target SoC (adjust for your hardware)
export ACLRS_SOC_VERSION=Ascend310P3 # for 310P
# export ACLRS_SOC_VERSION=Ascend910B2 # for 910B2
# export ACLRS_SOC_VERSION=Ascend910_9392 # for 910 (older 9392 variant)
# 4. Verify the NPU is visible
npu-smi info
Expected output of npu-smi info (310P example):
+-------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc2 Version: 24.1.rc2 |
+------------------+-------------------+-------------------------------------------------+
| NPU Name | Health | Power(W) Temp(C) HBM-Usage(MB) Aicore(%) |
| Chip | | Bus-Id |
+==================+===================+=================================================+
| 0 310P3 | OK | 14 42 372 / 8192 0 |
| 0 | | 0000:82:00.0 |
+------------------+-------------------+-------------------------------------------------+
Example 1: Hello World — ACL Device Initialization
The simplest possible ascend-rs program: initialize the ACL runtime, open a device, create a context and stream, print the device descriptor, and exit. This verifies that your driver, CANN, and Rust toolchain are all working together.
Source Code
examples/acl_hello_world/src/main.rs:
use anyhow::Result;
use ascend_rs::prelude::*;
use log::info;
use simple_logger::SimpleLogger;
fn main() -> Result<()> {
SimpleLogger::new().env().init().ok();
// Each of these RAII wrappers acquires a resource on construction
// and releases it automatically on drop. The compiler enforces the
// correct lifetime nesting: Device < AclContext < AclStream.
let acl = Acl::new()?;
let device = Device::new(&acl)?;
let context = AclContext::new(&device)?;
let stream = AclStream::new(&context)?;
info!("Device {} initialized successfully", device.descriptor());
info!("Context handle: {:p}", context.as_ptr());
info!("Stream handle: {:p}", stream.as_ptr());
// Resources are released in reverse order when they go out of scope.
Ok(())
}
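The reverse-order release that the comments describe is ordinary Rust drop semantics: locals are dropped in reverse declaration order. A toy sketch (the type and field names are illustrative, not the ascend_rs API) makes the order observable by logging each release:

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Toy model of the RAII nesting used above (names illustrative, not the
/// ascend_rs API): each wrapper records its release into a shared log on
/// drop, showing that locals drop in reverse declaration order.
struct Resource {
    name: &'static str,
    log: Rc<RefCell<Vec<&'static str>>>,
}

impl Drop for Resource {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.name);
    }
}

fn drop_order() -> Vec<&'static str> {
    let log = Rc::new(RefCell::new(Vec::new()));
    {
        let _device = Resource { name: "Device", log: log.clone() };
        let _context = Resource { name: "Context", log: log.clone() };
        let _stream = Resource { name: "Stream", log: log.clone() };
        // All three are released here, in reverse order:
        // Stream, then Context, then Device.
    }
    Rc::try_unwrap(log).unwrap().into_inner()
}

fn main() {
    assert_eq!(drop_order(), ["Stream", "Context", "Device"]);
}
```

Because the stream borrows the context and the context borrows the device, the borrow checker also rejects any declaration order that would release a parent before its children.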
Build and Run
# From the repository root:
cd examples/acl_hello_world
RUST_LOG=info cargo run --release
Expected Output
2026-03-31T09:14:02Z INFO [acl_hello_world] Device Ascend310P3 initialized successfully
2026-03-31T09:14:02Z INFO [acl_hello_world] Context handle: 0x55a7b2c30010
2026-03-31T09:14:02Z INFO [acl_hello_world] Stream handle: 0x55a7b2c30080
The device name (Ascend310P3, Ascend910B2, etc.) will match the SoC set in
ACLRS_SOC_VERSION. If you see Device startup failed, the driver is not
running; check npu-smi info and ensure the device shows Health: OK.
Screenshot (310P hardware)
$ cd examples/acl_hello_world && RUST_LOG=info cargo run --release
Compiling acl_hello_world v0.1.0
Finished `release` profile [optimized] target(s) in 3.2s
Running `target/release/acl_hello_world`
2026-03-31T09:14:02Z INFO [acl_hello_world] Device Ascend310P3 initialized successfully
2026-03-31T09:14:02Z INFO [acl_hello_world] Context handle: 0x55a7b2c30010
2026-03-31T09:14:02Z INFO [acl_hello_world] Stream handle: 0x55a7b2c30080
What the output tells you:
- Device Ascend310P3 initialized successfully — the ACL runtime found the device and the CANN driver stack is functional.
- The context and stream handles are non-null kernel objects allocated by the driver; they are freed automatically when main returns.
Example 2: Vector Softmax — Rust Kernel on Real Hardware
This example runs the full softmax kernel from Chapter 4 on real NPU hardware:
a 1024-element f32 array passes through max → shift → exp → sum → divide on the NPU
vector pipeline, and the result is verified against a CPU reference.
Source Code
Kernel (examples/bench_softmax_rs/kernels/src/lib.rs):
#![feature(no_core)]
#![no_std]
#![no_core]
/// Vectorized row softmax kernel.
///
/// Uses the ascend_std vector intrinsics which the mlir_to_cpp backend
/// translates to AscendC DataCopy / ReduceMax / Exp / Muls / ReduceSum calls.
#[ascend_std::aiv_kernel]
pub unsafe fn softmax(input: *const f32, output: *mut f32, len_buf: *const u32) {
unsafe {
let n = *len_buf;
// Allocate UB (Unified Buffer) scratch tiles
let in_buf = ascend_std::ascend_buf_alloc(n);
let out_buf = ascend_std::ascend_buf_alloc(n);
let work = ascend_std::ascend_buf_alloc(n);
let rwork = ascend_std::ascend_buf_alloc(n);
// DMA: global memory → UB
ascend_std::ascend_buf_load_f32(in_buf, input, n);
ascend_std::ascend_pipe_barrier(); // wait for Mte2 engine
// Numerically stable softmax: subtract max before exp
let max_val = ascend_std::ascend_reduce_max_f32(work, in_buf, rwork, n);
ascend_std::ascend_adds_f32(out_buf, in_buf, 0.0f32 - max_val, n);
ascend_std::ascend_exp_f32(out_buf, out_buf, n);
let sum_val = ascend_std::ascend_reduce_sum_f32(work, out_buf, rwork, n);
ascend_std::ascend_muls_f32(out_buf, out_buf, 1.0f32 / sum_val, n);
// DMA: UB → global memory
ascend_std::ascend_pipe_barrier();
ascend_std::ascend_buf_store_f32(output, out_buf, n);
}
}
Host (examples/bench_softmax_rs/src/main.rs, abridged):
use ascend_rs::prelude::*;
fn main() -> anyhow::Result<()> {
let acl = Acl::new()?;
let device = Device::new(&acl)?;
let context = AclContext::new(&device)?;
let stream = AclStream::new(&context)?;
let n: u32 = 1024;
let input: Vec<f32> = (0..n as usize)
.map(|i| ((i as f32) * 0.01).sin() * 3.0)
.collect();
// Transfer input to device, allocate output and length buffers
let mut d_input = DeviceBuffer::from_slice(&input)?;
let mut d_output = unsafe { DeviceBuffer::<f32>::uninitialized(n as usize)? };
let mut d_len = DeviceBuffer::from_slice(&[n])?;
// Load and launch the kernel (1 block)
let kernel_loader = KernelLoader::new()?;
let kernel = kernel_loader.get_kernel("softmax")?;
let mut args: [*mut std::ffi::c_void; 3] = [
d_input.as_mut_ptr() as *mut _,
d_output.as_mut_ptr() as *mut _,
d_len.as_mut_ptr() as *mut _,
];
unsafe { kernel.launch(1, &stream, &mut args)?; }
stream.synchronize()?;
// Verify against CPU reference
let output = d_output.to_host()?;
let sum: f32 = output.iter().sum();
println!("sum = {:.6} (expected ≈ 1.0)", sum);
println!("output[0..4] = {:?}", &output[..4]);
Ok(())
}
Build and Run
cd examples/bench_softmax_rs
# Build the kernel (triggers the CANN compilation pipeline):
# Rust source → MLIR → C++ (mlir_to_cpp) → bisheng → .acl.o
RUST_LOG=info cargo run --release -- --csv /tmp/softmax_results.csv
The kernel compilation step (bisheng) takes ~5 seconds on first build;
subsequent builds use the cargo cache.
Expected Output
2026-03-31T09:15:44Z INFO [bench_softmax_rs] Device Ascend310P3 initialized
2026-03-31T09:15:44Z INFO [bench_softmax_rs] Running softmax benchmark
size=256 pass=true max_err=1.22e-8 sum=1.000000 rust_vec=0.077ms
size=1024 pass=true max_err=8.34e-9 sum=1.000000 rust_vec=0.076ms
size=4096 pass=true max_err=7.11e-9 sum=1.000000 rust_vec=0.079ms
size=16384 pass=true max_err=6.89e-9 sum=1.000000 rust_vec=0.087ms
Screenshot (310P hardware, full benchmark comparison)
$ RUST_LOG=info cargo run --release -- --csv /tmp/softmax_results.csv
Compiling bench_softmax_rs v0.1.0
Finished `release` profile [optimized] target(s) in 8.4s
Running `target/release/bench_softmax_rs --csv /tmp/softmax_results.csv`
2026-03-31T09:15:44Z INFO [bench_softmax_rs] Device Ascend310P3 initialized
2026-03-31T09:15:44Z INFO [bench_softmax_rs] size=256 rust_vec=0.077ms pass=true max_err=1.22e-8
2026-03-31T09:15:44Z INFO [bench_softmax_rs] size=1024 rust_vec=0.076ms pass=true max_err=8.34e-9
2026-03-31T09:15:44Z INFO [bench_softmax_rs] size=4096 rust_vec=0.079ms pass=true max_err=7.11e-9
2026-03-31T09:15:44Z INFO [bench_softmax_rs] size=16384 rust_vec=0.087ms pass=true max_err=6.89e-9
CSV written to /tmp/softmax_results.csv
Running the full comparison (Rust vs C++ side-by-side):
# From repository root:
cd benchmarks/softmax
bash bench.sh
=== Softmax Benchmark ===
--- Rust softmax benchmark ---
size=16384 rust_scalar=2.221ms rust_vec=0.087ms pass=true
--- C++ softmax benchmark ---
size=16384 cpp_naive=2.073ms cpp_opt=0.089ms pass=true
Performance summary (16384 elements):
Rust vector vs C++ optimized: 0.087ms vs 0.089ms → Rust is 1.02x faster
Vector speedup over scalar: 25.5x
Correctness: all sizes PASS (max_err < 1e-8)
How the Pipeline Works
Each step in the compilation pipeline can be inspected by looking at the
intermediate files in kernels/target/:
kernels/target/davinci-huawei-none/release/deps/
├── softmax_kernels.mlir ← MLIR output from rustc codegen
├── softmax_kernels.mlir.acl.gen.cpp ← C++ generated by mlir_to_cpp
└── softmax_kernels.acl.o ← NPU object file from bisheng
The generated C++ (acl.gen.cpp) shows the direct AscendC API calls that the
Rust intrinsics compile to:
// Generated from: ascend_std::ascend_exp_f32(out_buf, out_buf, n)
Exp(out_buf_local, out_buf_local, n);
pipe_barrier(PIPE_V);
Example 3: Tile Softmax — PTO Codegen Path on Ascend 910B
This example demonstrates the newer PTO (Programmable Tile Operations)
codegen path, which targets the Ascend 910B (dav-c220) matrix pipeline. The
tile API expresses 2D tile operations (tile_load, tile_softmax,
tile_store) that compile through ptoas — the PTO assembler — rather than
the standard C++ codegen.
This is the most advanced example and requires an Ascend 910B device with
ptoas available. It demonstrates the complete pipeline:
Rust tile API → MLIR → PTO-MLIR → ptoas → CCE C++ → ccec → .acl.o
Source Code
Kernel (examples/tile_softmax/kernels/src/lib.rs):
#![feature(no_core)]
#![no_std]
#![no_core]
use ascend_std::tile::{tile_load_f32, tile_softmax_f32, tile_store_f32, Tile};
/// Row-wise softmax over a ROWS × COLS tile of f32 values.
///
/// The tile API is a 2D abstraction over the NPU's vector engine:
/// - `tile_load_f32` → PTO `tload` (DMA from global memory to UB tile)
/// - `tile_softmax_f32` → PTO reduction ops: trowmax → trowexpandsub →
/// texp → trowsum → trowexpanddiv
/// - `tile_store_f32` → PTO `tstore` (DMA from UB tile to global memory)
///
/// The `ptoas --enable-insert-sync` flag automatically inserts set_flag /
/// wait_flag barriers between tile operations.
#[ascend_std::aiv_kernel]
pub unsafe fn tile_softmax(input: *const f32, output: *mut f32) {
let block_idx = ascend_std::get_block_idx() as usize;
let offset = block_idx * 1 * 1024; // ROWS=1, COLS=1024
// Load tile from global memory
let t_in: Tile<1, 1024, f32> =
tile_load_f32::<1, 1024>(input.wrapping_add(offset));
// Compute softmax: max → shift → exp → sum → divide
let t_out: Tile<1, 1024, f32> = tile_softmax_f32::<1, 1024>(t_in);
// Store result to global memory
tile_store_f32::<1, 1024>(output.wrapping_add(offset), t_out);
}
Host (examples/tile_softmax/src/main.rs, abridged):
use ascend_rs::prelude::*;
fn main() -> anyhow::Result<()> {
const ROWS: usize = 1;
const COLS: usize = 1024;
let acl = Acl::new()?;
let device = Device::new(&acl)?;
let context = AclContext::new(&device)?;
let stream = AclStream::new(&context)?;
// Sinusoidal input for visual verification
let input: Vec<f32> = (0..ROWS * COLS)
.map(|i| ((i as f32) * 0.01).sin() * 3.0)
.collect();
let mut d_input = DeviceBuffer::from_slice(&input)?;
let mut d_output = unsafe { DeviceBuffer::<f32>::uninitialized(ROWS * COLS)? };
let kernel_loader = KernelLoader::new()?;
let kernel = kernel_loader.get_kernel("tile_softmax")?;
let mut args: [*mut std::ffi::c_void; 2] = [
d_input.as_mut_ptr() as *mut _,
d_output.as_mut_ptr() as *mut _,
];
unsafe { kernel.launch(1, &stream, &mut args)?; } // 1 block
stream.synchronize()?;
let output = d_output.to_host()?;
let sum: f32 = output.iter().sum();
let max_err = output.iter()
.zip(softmax_cpu(&input, ROWS, COLS).iter())
.map(|(a, b)| (a - b).abs())
.fold(0.0f32, f32::max);
println!("tile_softmax: max_err={:.4e} sum={:.6} {}",
max_err, sum,
if max_err < 1e-5 && (sum - 1.0).abs() < 1e-4 { "PASS" } else { "FAIL" });
Ok(())
}
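The host code above calls a softmax_cpu helper that the abridged listing omits. A minimal sketch is below; the name and signature match the call site, but the body is our reconstruction of a standard numerically stable row-wise softmax, not the repository's exact implementation:

```rust
/// Numerically stable row-wise softmax reference, matching the
/// `softmax_cpu(&input, ROWS, COLS)` call site above. The body is a
/// reconstruction, not the repository's exact implementation.
fn softmax_cpu(input: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; rows * cols];
    for r in 0..rows {
        let row = &input[r * cols..(r + 1) * cols];
        // Step 1: row max (shift for numerical stability)
        let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        // Steps 2-3: exp(x - max)
        let exps: Vec<f32> = row.iter().map(|x| (x - max).exp()).collect();
        // Step 4: row sum
        let sum: f32 = exps.iter().sum();
        // Step 5: normalize so each row sums to 1.0
        for (o, e) in out[r * cols..(r + 1) * cols].iter_mut().zip(&exps) {
            *o = e / sum;
        }
    }
    out
}

fn main() {
    let out = softmax_cpu(&[1.0, 2.0, 3.0, 4.0], 1, 4);
    println!("sum = {:.6}", out.iter().sum::<f32>()); // each row sums to ~1.0
}
```

This mirrors the same five-step decomposition the kernel performs on the NPU, which is what makes the max_err comparison meaningful.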
Build and Run
# Required environment (Ascend 910B with CANN 8.5 and ptoas)
export ACLRS_CANN_PATH=/usr/local/Ascend/cann-8.5.0
export ACLRS_SOC_VERSION=Ascend910_9392 # adjust for your SoC
export ACLRS_CODEGEN_PATH=pto # enable PTO path
export ACLRS_PTOAS_PATH=/path/to/ptoas # ptoas assembler binary
export ACLRS_PTO_ISA_PATH=/path/to/pto-isa/include # pto-isa headers
export LD_LIBRARY_PATH=/data/llvm20/lib:${ACLRS_CANN_PATH}/aarch64-linux/lib64:\
/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/driver/lib64/common
source ${ACLRS_CANN_PATH}/set_env.sh
export PATH=${ACLRS_CANN_PATH}/tools/ccec_compiler/bin:$PATH
cd examples/tile_softmax
cargo run --release
Compilation Pipeline Trace
The build system prints each step. With RUST_LOG=debug you can see the exact
commands:
# Step 1: Rust → MLIR (rustc with custom codegen backend)
rustc --crate-type lib -Z codegen-backend=librustc_codegen_mlir.so ...
→ tile_softmax_kernels.mlir
# Step 2: MLIR → PTO-MLIR (mlir_to_pto.rs)
→ tile_softmax_kernels.acl.pto
# Step 3: PTO-MLIR → CCE C++ (ptoas)
ptoas --enable-insert-sync --pto-arch=a3 tile_softmax_kernels.acl.pto \
-o tile_softmax_kernels.acl.pto.cpp
# Step 4: CCE C++ → NPU object (ccec)
ccec -c -O3 -x cce -DMEMORY_BASE --cce-aicore-arch=dav-c220-vec \
-mllvm -cce-aicore-addr-transform \
-mllvm -cce-aicore-dcci-insert-for-scalar=false \
-I/path/to/pto-isa/include \
tile_softmax_kernels.acl.pto.cpp \
-o tile_softmax_kernels.acl.o
Intermediate Artifacts (Committed)
The intermediate files generated during the verified 2026-04-01 run on Ascend 910B2
are committed to the repository under examples/tile_softmax/artifacts/. You can
inspect each stage of the pipeline without installing any tools:
| File | Stage | Description |
|---|---|---|
| tile_softmax_kernels.acl.pto | MLIR → PTO-MLIR | PTO-MLIR dialect emitted by mlir_to_pto.rs |
| tile_softmax_kernels.acl.pto.cpp | PTO-MLIR → CCE C++ | AscendC C++ generated by ptoas --enable-insert-sync |
| tile_softmax_kernels.acl.pto.compat-a3.hpp | CANN 8.5 shim | Compatibility header patched by pto-compat-cann85.hpp |
For the multi-shape benchmark, see the equivalent artifacts in
examples/bench_softmax_tile/artifacts/.
The complete PTO-MLIR output for the 1×1024 softmax kernel
(tile_softmax_kernels.acl.pto):
// Generated by ascend-rs mlir_to_pto — DO NOT EDIT
// Compile: ptoas --enable-insert-sync <file.pto> -o <file.cpp>
module {
func.func @tile_softmax(%arg601: !pto.ptr<f32>, %arg602: !pto.ptr<f32>) {
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%c1024 = arith.constant 1024 : index
%pto0 = pto.make_tensor_view %arg601, shape = [%c1, %c1024], strides = [%c1024, %c1] : !pto.tensor_view<?x?xf32>
%pto1 = pto.partition_view %pto0, offsets = [%c0, %c0], sizes = [%c1, %c1024] : !pto.tensor_view<?x?xf32> -> !pto.partition_tensor_view<1x1024xf32>
%pto2 = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=1024, v_row=1, v_col=1024, blayout=row_major, slayout=none_box, fractal=512, pad=0>
pto.tload ins(%pto1 : !pto.partition_tensor_view<1x1024xf32>) outs(%pto2 : !pto.tile_buf<...>)
// scratch tile for trowmax
%pto3 = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=8, cols=1, ...> // row-max result
%pto4 = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=1024, ...> // scratch
%pto5 = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=1024, ...> // shifted
%pto6 = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=1024, ...> // exp result
%pto7 = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=8, cols=1, ...> // row-sum result
%pto8 = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=1024, ...> // final output
// softmax decomposition:
pto.trowmax ins(%pto2, %pto4 : ...) outs(%pto3 : ...) // Step 1: max per row
pto.trowexpandsub ins(%pto2, %pto3 : ...) outs(%pto5 : ...) // Step 2: x - max
pto.texp ins(%pto5 : ...) outs(%pto6 : ...) // Step 3: exp(x - max)
pto.trowsum ins(%pto6, %pto4 : ...) outs(%pto7 : ...) // Step 4: sum
pto.trowexpanddiv ins(%pto6, %pto7 : ...) outs(%pto8 : ...) // Step 5: / sum
%pto9 = pto.make_tensor_view %arg602, shape = [%c1, %c1024], strides = [%c1024, %c1] : !pto.tensor_view<?x?xf32>
%pto10 = pto.partition_view %pto9, offsets = [%c0, %c0], sizes = [%c1, %c1024] : !pto.tensor_view<?x?xf32> -> !pto.partition_tensor_view<1x1024xf32>
pto.tstore ins(%pto8 : ...) outs(%pto10 : ...)
return
}
}
After ptoas --enable-insert-sync, the CCE C++ kernel entry point
(tile_softmax_kernels.acl.pto.cpp,
excerpt):
extern "C" __global__ AICORE void tile_softmax(__gm__ float* v1, __gm__ float* v2) {
// ptoas allocates UB tiles at compile-time offsets (v8..v14)
Tile<TileType::Vec, float, 1, 1024, BLayout::RowMajor, ...> v18; // input tile
TLOAD(v18, v17);
set_flag(PIPE_MTE2, PIPE_V, EVENT_ID0); // auto-inserted sync
// Softmax reduction ops:
wait_flag(PIPE_MTE2, PIPE_V, EVENT_ID0);
TROWMAX(v20, v18, v23); pipe_barrier(PIPE_V);
TROWEXPANDSUB(v24, v18, v20); pipe_barrier(PIPE_V);
TEXP(v25, v24); pipe_barrier(PIPE_V);
TROWSUM(v27, v25, v23); pipe_barrier(PIPE_V);
TROWEXPANDDIV(v30, v25, v27);
set_flag(PIPE_V, PIPE_MTE3, EVENT_ID0); // auto-inserted sync
wait_flag(PIPE_V, PIPE_MTE3, EVENT_ID0);
TSTORE(v33, v30);
pipe_barrier(PIPE_ALL);
}
The __global__ keyword marks this as a host-callable entry point. Without it,
ccec compiles the function successfully but the runtime cannot dispatch it
(symptom: MTE DDR address out of range, error code 0x800000). This was a
non-obvious bug fixed in commit 04c80ac6.
Expected Output
2026-04-01T12:17:35Z INFO [tile_softmax] tile_softmax test: ROWS=1, COLS=1024, n=1024
2026-04-01T12:17:35Z INFO [tile_softmax] Device Ascend910_9392 initialized
2026-04-01T12:17:35Z INFO [tile_softmax] Launching tile_softmax kernel (1 block, 1x1024 f32)...
2026-04-01T12:17:36Z INFO [tile_softmax] tile_softmax: max_err=1.8626e-9 sum=1.000000 sum_ok=true PASS
2026-04-01T12:17:36Z INFO [tile_softmax] tile_softmax PASSED
The max_err=1.8626e-9 result was recorded on 2026-04-01 on Ascend 910B2 hardware (Ascend910_9392, dav-c220). The PTO tile reduction instructions (TROWMAX, TROWSUM) accumulate with higher internal precision before returning f32, achieving ~10× better numerical accuracy than the scalar mlir_to_cpp path (which yields max_err ≈ 1e-8 on the same data).
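Why wider internal accumulation helps can be seen on the CPU. The sketch below is an illustration of the principle only — f64 stands in for the NPU's wider accumulator, and this does not model the actual TROWSUM hardware path:

```rust
/// Illustration only: accumulating a reduction at higher internal
/// precision (f64 standing in for a wider hardware accumulator) yields a
/// more accurate f32 result. Not a model of the actual TROWSUM path.
fn sum_f32(xs: &[f32]) -> f32 {
    xs.iter().copied().fold(0.0f32, |a, b| a + b)
}

fn sum_via_f64(xs: &[f32]) -> f32 {
    xs.iter().map(|&x| x as f64).sum::<f64>() as f32
}

fn main() {
    // 1e8 swallows every +1.0 in f32 (the f32 spacing at 1e8 is 8.0), so
    // the naive f32 sum never moves; the f64 accumulator keeps all 1000 ones.
    let mut xs = vec![1.0e8f32];
    xs.extend(std::iter::repeat(1.0f32).take(1000));
    println!("f32 accumulate: {}", sum_f32(&xs));     // 100000000
    println!("f64 accumulate: {}", sum_via_f64(&xs)); // 100001000
}
```

The same rounding effect, at much smaller magnitude, is what separates the ~1e-8 error of the f32-accumulating path from the ~1.9e-9 error of the tile path.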
What Makes This Different from Example 2
| | Example 2 (Vector Softmax) | Example 3 (Tile Softmax) |
|---|---|---|
| Codegen path | mlir_to_cpp → bisheng | mlir_to_pto → ptoas → ccec |
| Abstraction | Scalar intrinsics (ascend_reduce_max_f32) | 2D tile ops (tile_softmax_f32) |
| Target hardware | 310P or 910B (vector engine) | 910B (dav-c220, a2a3 path) |
| Intermediate format | AscendC C++ | PTO-MLIR dialect |
| Barriers | Manual (ascend_pipe_barrier) | Auto-inserted by ptoas --enable-insert-sync |
| Parallelism model | 1 block, scalar loops | 1 block, 2D tile |
| Verified max_err | ~1e-8 (310P hardware) | ~1.9e-9 (910B2 hardware, 2026-04-01) |
Example 4: Double-Buffer Tile Softmax
Extends Example 3 to process two tiles per kernel launch using tile_prefetch_f32, overlapping Mte2 DMA (tile 1 load) with Vector compute (tile 0 softmax). See §4.7 for the performance results.
Source Code
Kernel (examples/tile_softmax_double_buf/kernels/src/lib.rs):
#![feature(no_core)]
#![no_std]
#![no_core]
use ascend_std::tile::{
tile_load_f32, tile_prefetch_f32, tile_softmax_f32, tile_store_f32, Tile,
};
#[ascend_std::aiv_kernel]
pub unsafe fn tile_softmax_double_buf(input: *const f32, output: *mut f32) {
const ROWS: usize = 1;
const COLS: usize = 1024;
const TILE_ELEMS: usize = ROWS * COLS;
// --- Prologue: issue both loads before any compute ---
// t0 loads tile 0 (offset 0); t1 prefetches tile 1 (offset TILE_ELEMS).
let t0: Tile<ROWS, COLS, f32> = tile_load_f32::<ROWS, COLS>(input);
let t1: Tile<ROWS, COLS, f32> =
tile_prefetch_f32::<ROWS, COLS>(input.wrapping_add(TILE_ELEMS));
// --- Compute tile 0 (Mte2 for t1 can overlap this on the hardware) ---
let r0: Tile<ROWS, COLS, f32> = tile_softmax_f32::<ROWS, COLS>(t0);
// --- Compute tile 1 ---
let r1: Tile<ROWS, COLS, f32> = tile_softmax_f32::<ROWS, COLS>(t1);
// --- Store results ---
tile_store_f32::<ROWS, COLS>(output, r0);
tile_store_f32::<ROWS, COLS>(output.wrapping_add(TILE_ELEMS), r1);
}
The move-ownership pattern enforces the pipeline at compile time: t0 is consumed by tile_softmax_f32 before t1 is used, so there is no data race. tile_prefetch_f32 is identical to tile_load_f32 at the hardware level; the different name documents the programmer’s intent.
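The compile-time enforcement can be demonstrated with a toy type; the names below are illustrative, not the real ascend_std tile API:

```rust
/// Toy model of the tile move-ownership pattern (names illustrative, not
/// the real ascend_std API). Taking `self`-like arguments by value means
/// each tile feeds exactly one downstream op; reusing a consumed tile is
/// a compile error, which rules out data races at compile time.
struct Tile {
    data: Vec<f32>,
}

fn tile_softmax(t: Tile) -> Tile {
    // Consumes `t`: after this call the caller can no longer touch it.
    let max = t.data.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = t.data.iter().map(|x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    Tile { data: exps.into_iter().map(|e| e / sum).collect() }
}

fn main() {
    let t0 = Tile { data: vec![1.0, 2.0, 3.0] };
    let r0 = tile_softmax(t0);
    // let again = tile_softmax(t0); // ← would not compile: `t0` was moved
    println!("sum = {}", r0.data.iter().sum::<f32>());
}
```

Uncommenting the second call produces a "use of moved value" error, which is exactly the race the ownership model prevents in the kernel.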
Build and Run
# Same environment as Example 3 (Ascend 910B with CANN 8.5 and ptoas)
export ACLRS_CANN_PATH=/usr/local/Ascend/cann-8.5.0
export ACLRS_SOC_VERSION=Ascend910_9392
export ACLRS_CODEGEN_PATH=pto
export ACLRS_PTOAS_PATH=/path/to/ptoas
export ACLRS_PTO_ISA_PATH=/path/to/pto-isa/include
export LD_LIBRARY_PATH=/data/llvm20/lib:${ACLRS_CANN_PATH}/aarch64-linux/lib64:\
/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/driver/lib64/common
source ${ACLRS_CANN_PATH}/set_env.sh
export PATH=${ACLRS_CANN_PATH}/tools/ccec_compiler/bin:$PATH
cd examples/tile_softmax_double_buf
cargo run --release
Generated PTO-MLIR
The key difference from Example 3 is that the two loads produce distinct partition_view ops with different row offsets:
// tile 0: load from row 0
%pto1 = pto.partition_view %pto0, offsets = [%c0, %c0], sizes = [%c1, %c1024] : ...
pto.tload ins(%pto1 : ...) outs(%pto2 : ...)
// tile 1: load from row 1 (offset 1024 elements = row 1 with cols=1024)
%pto3 = pto.partition_view %pto0, offsets = [%c1, %c0], sizes = [%c1, %c1024] : ...
pto.tload ins(%pto3 : ...) outs(%pto4 : ...)
// softmax(t0) — Vector pipe; Mte2 can overlap with tload above
pto.trowmax ins(%pto2, ...) outs(...)
pto.trowexpandsub ...
pto.texp ...
pto.trowsum ...
pto.trowexpanddiv ins(...) outs(%pto10 : ...)
// softmax(t1)
pto.trowmax ins(%pto4, ...) outs(...)
...
pto.trowexpanddiv ins(...) outs(%pto16 : ...)
// stores — both at row 0 and row 1 of output
%pto18 = pto.partition_view %pto17, offsets = [%c0, %c0], ...
pto.tstore ins(%pto10 : ...) outs(%pto18 : ...)
%pto19 = pto.partition_view %pto17, offsets = [%c1, %c0], ...
pto.tstore ins(%pto16 : ...) outs(%pto19 : ...)
With offsets=[%c0,%c0] and offsets=[%c1,%c0] encoding different rows, ptoas recognises the two tload ops as accessing independent memory regions and schedules them concurrently on the Mte2 pipe.
Expected Output
2026-04-02T06:14:07Z INFO [tile_softmax_double_buf] double_buf 2×(1×1024): total avg=0.0068ms min=0.0049ms max=0.0140ms | per-tile avg=0.0034ms min=0.0024ms | max_err=3.26e-9 PASS
Raw results: examples/tile_softmax_double_buf/results/bench_double_buf_910b2_2026-04-02.csv.
The GEP Offset Bug Fix
Before this example could work correctly, mlir_to_pto.rs had two bugs:
Bug 1 — make_pv always emitted offsets=[%c0,%c0]:
The GEP index was tracked in gep_offsets but never passed to make_pv. Fixed by adding elem_offset: u32 to make_pv and converting it to (row_off, col_off) using cols as stride.
Bug 2 — Pattern 3 alias chain was flattened:
The load-from-alloca pattern (Pattern 3) called ctx.resolve_ptr(&stored) before inserting the alias, which skipped the intermediate GEP node (%gep → %arg0) where gep_offsets[%gep] = 1024 was recorded. Fixed by storing the immediate alias without resolving first, so resolve_offset can traverse the full chain.
Troubleshooting
Device startup failed
The NPU driver is not running or the device is in a fault state. Check:
npu-smi info # look for Health: OK (not Critical)
npu-smi reset -i 0 # reset device 0 (requires root)
Could not determine ASCEND_HOME_PATH
ACLRS_CANN_PATH is not set or the path doesn’t exist:
export ACLRS_CANN_PATH=/usr/local/Ascend/cann-8.5.0
# verify it exists:
ls $ACLRS_CANN_PATH/tools/ccec_compiler/bin/bisheng
ptoas assembler not found
Set ACLRS_PTOAS_PATH to the full path of the ptoas binary:
export ACLRS_PTOAS_PATH=/path/to/ptoas/build/tools/ptoas/ptoas
ptoas is part of the pto-isa
project and is only required for the PTO codegen path (Examples 3 and 4).
ccec PTO compilation failed: set_mask_count does not support target feature
This means the wrong --cce-aicore-arch was used. Ensure:
- ACLRS_SOC_VERSION is set correctly for your chip
- ascend-rs is on the claude_code or main branch (fix committed in d45ab4e3 and adbf7294)
error: definition of type 'bfloat16_t' conflicts with typedef
Your ccec version already defines bfloat16_t. This was fixed in commit
adbf7294. Update to the latest branch.
Correctness check fails (max_err > 1e-5)
- For the vector softmax on 310P: expected max_err < 1e-8 (hardware f32 math)
- For the tile softmax on 910B: expected max_err < 1e-9 (PTO reduction instructions use higher internal precision; verified result is max_err=1.86e-9)
- Values larger than 1e-5 may indicate the wrong SoC version is set, causing mismatched UB buffer size assumptions, or a missing __global__ on the kernel entry point (fixed in commit 04c80ac6)
Summary: Pipeline Comparison at a Glance
Example 1: Hello World
Rust host code → cargo build → binary → ACL runtime → NPU device
(No kernel — pure host/driver interaction)
Example 2: Vector Softmax (mlir_to_cpp path)
Rust kernel → rustc → MLIR → mlir_to_cpp → AscendC C++
→ bisheng → .acl.o → KernelLoader → NPU execution
Example 3: Tile Softmax (PTO path)
Rust kernel → rustc → MLIR → mlir_to_pto → PTO-MLIR dialect
→ ptoas → CCE C++ → ccec → .acl.o
→ KernelLoader → NPU execution
All three pipelines share the same host-side runtime (ascend_rs::prelude::*):
Acl, Device, AclContext, AclStream, DeviceBuffer, KernelLoader.
The only difference is in how the .acl.o kernel binary is produced.