Appendix J: Reproducible Step-by-Step Examples

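This appendix walks through three complete, runnable ascend-rs examples from scratch, followed by a double-buffered extension of the third. Each of the three main examples includes full source code, exact build and run commands, expected terminal output, and a capture from a run on real hardware, so that anyone with an Ascend NPU can reproduce the results in this book.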

Prerequisites

Hardware and software requirements

Requirement	Minimum	Tested with
Ascend NPU	Ascend 310P / 910B	Ascend 310P3, Ascend 910B2
CANN	8.1.RC1	8.1.RC1 (310P), 8.5.0 (910B)
Rust toolchain	nightly-2025-05-01	nightly-2025-08-04
Operating system	Linux aarch64 / x86_64	Ubuntu 22.04 aarch64
Driver	≥ 24.1	Bundled with CANN

One-time environment setup

# 1. Clone the repository
git clone https://github.com/ascend-rs/ascend-rs
cd ascend-rs

# 2. Initialize the CANN environment (adjust to your installation path)
source /usr/local/Ascend/ascend-toolkit/latest/bin/setenv.bash
# Or, for a standalone CANN 8.5 installation:
# source /usr/local/Ascend/cann-8.5.0/set_env.sh

# 3. Set the target SoC (adjust to your hardware)
export ACLRS_SOC_VERSION=Ascend310P3   # 310P
# export ACLRS_SOC_VERSION=Ascend910B2  # 910B2
# export ACLRS_SOC_VERSION=Ascend910_9392  # older 910 (9392 variant)

# 4. Verify that the NPU is visible
npu-smi info

Expected output of npu-smi info (310P example):

+-------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc2                 Version: 24.1.rc2                                       |
+------------------+-------------------+-------------------------------------------------+
| NPU   Name       | Health            | Power(W)  Temp(C)   HBM-Usage(MB) Aicore(%)     |
| Chip             |                   | Bus-Id                                           |
+==================+===================+=================================================+
| 0     310P3      | OK                | 14         42       372 / 8192    0              |
| 0                |                   | 0000:82:00.0                                     |
+------------------+-------------------+-------------------------------------------------+

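Example 1: Hello World (ACL Device Initialization)

The simplest ascend-rs program: it initializes the ACL runtime, opens a device, creates a context and a stream, prints the device descriptor, and exits. This step verifies that the driver, CANN, and the Rust toolchain work together.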

Source code

examples/acl_hello_world/src/main.rs:

use anyhow::Result;
use ascend_rs::prelude::*;
use log::info;
use simple_logger::SimpleLogger;

fn main() -> Result<()> {
    SimpleLogger::new().env().init().ok();

    // Each RAII wrapper acquires its resource on construction and releases it on drop.
    // The compiler enforces correct lifetime nesting: an AclStream cannot outlive its
    // AclContext, which in turn cannot outlive its Device.
    let acl     = Acl::new()?;
    let device  = Device::new(&acl)?;
    let context = AclContext::new(&device)?;
    let stream  = AclStream::new(&context)?;

    info!("Device {} initialized successfully", device.descriptor());
    info!("Context handle: {:p}", context.as_ptr());
    info!("Stream  handle: {:p}", stream.as_ptr());

    // Resources are released automatically, in reverse order, when the variables go out of scope.
    Ok(())
}

Build and run

# From the repository root:
cd examples/acl_hello_world

RUST_LOG=info cargo run --release

Expected output

2026-03-31T09:14:02Z INFO  [acl_hello_world] Device Ascend310P3 initialized successfully
2026-03-31T09:14:02Z INFO  [acl_hello_world] Context handle: 0x55a7b2c30010
2026-03-31T09:14:02Z INFO  [acl_hello_world] Stream  handle: 0x55a7b2c30080

The device name (Ascend310P3, Ascend910B2, and so on) corresponds to the SoC set in ACLRS_SOC_VERSION. If you see Device startup failed, the driver is not running or the device is in a fault state; check that the device's Health is OK in npu-smi info.

Screenshot (310P, real hardware)

$ cd examples/acl_hello_world && RUST_LOG=info cargo run --release
   Compiling acl_hello_world v0.1.0
    Finished `release` profile [optimized] target(s) in 3.2s
     Running `target/release/acl_hello_world`
2026-03-31T09:14:02Z INFO  [acl_hello_world] Device Ascend310P3 initialized successfully
2026-03-31T09:14:02Z INFO  [acl_hello_world] Context handle: 0x55a7b2c30010
2026-03-31T09:14:02Z INFO  [acl_hello_world] Stream  handle: 0x55a7b2c30080

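Reading the output:

The first log line (device initialization succeeded) means the ACL runtime found the device and the CANN driver stack is working.
The Context and Stream handles on the next two lines are non-null objects allocated by the driver; they are released automatically when main returns.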
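
The reverse-order release described above is ordinary Rust drop semantics and can be observed without any NPU. The sketch below uses a hypothetical `Guard` type as a stand-in for the RAII wrappers (it is not part of the ascend-rs API) and records the order in which four guards are dropped:

```rust
use std::cell::RefCell;

/// Hypothetical stand-in for an RAII handle; NOT the ascend-rs API.
/// It appends its name to a shared log when dropped.
struct Guard<'a> {
    name: &'static str,
    log: &'a RefCell<Vec<&'static str>>,
}

impl<'a> Guard<'a> {
    fn new(name: &'static str, log: &'a RefCell<Vec<&'static str>>) -> Self {
        Guard { name, log }
    }
}

impl Drop for Guard<'_> {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.name);
    }
}

/// Creates four guards in the same order as the example above and returns
/// the order in which they were dropped.
fn drop_order() -> Vec<&'static str> {
    let log = RefCell::new(Vec::new());
    {
        let _acl = Guard::new("acl", &log);
        let _device = Guard::new("device", &log);
        let _context = Guard::new("context", &log);
        let _stream = Guard::new("stream", &log);
        // Locals are dropped in reverse declaration order when this scope ends.
    }
    log.into_inner()
}

fn main() {
    // prints ["stream", "context", "device", "acl"]
    println!("{:?}", drop_order());
}
```

The real wrappers additionally borrow from one another (for example `Device::new(&acl)`), so the borrow checker also rejects code that tries to drop a parent while a child is still alive; the sketch only shows the ordering.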

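Example 2: Vector Softmax (Running a Rust Kernel on Real Hardware)

This example runs the complete softmax kernel from Chapter 4 on real NPU hardware: 1024 f32 elements pass through max → exp → sum → divide on the NPU vector pipeline, and the result is verified against a CPU reference.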

Source code

Kernel (examples/bench_softmax_rs/kernels/src/lib.rs):

#![feature(no_core)]
#![no_std]
#![no_core]

/// Vectorized row softmax kernel.
///
/// Uses the ascend_std vector intrinsics, which the mlir_to_cpp backend translates into
/// AscendC DataCopy / ReduceMax / Exp / Muls / ReduceSum calls.
#[ascend_std::aiv_kernel]
pub unsafe fn softmax(input: *const f32, output: *mut f32, len_buf: *const u32) {
    unsafe {
        let n = *len_buf;

        // Allocate temporary tiles in the Unified Buffer (UB)
        let in_buf  = ascend_std::ascend_buf_alloc(n);
        let out_buf = ascend_std::ascend_buf_alloc(n);
        let work    = ascend_std::ascend_buf_alloc(n);
        let rwork   = ascend_std::ascend_buf_alloc(n);

        // DMA: global memory → UB
        ascend_std::ascend_buf_load_f32(in_buf, input, n);
        ascend_std::ascend_pipe_barrier();  // wait for the Mte2 engine

        // Numerically stable softmax: subtract the max before taking exp
        let max_val = ascend_std::ascend_reduce_max_f32(work, in_buf, rwork, n);
        ascend_std::ascend_adds_f32(out_buf, in_buf, 0.0f32 - max_val, n);
        ascend_std::ascend_exp_f32(out_buf, out_buf, n);
        let sum_val = ascend_std::ascend_reduce_sum_f32(work, out_buf, rwork, n);
        ascend_std::ascend_muls_f32(out_buf, out_buf, 1.0f32 / sum_val, n);

        // DMA: UB → global memory
        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_buf_store_f32(output, out_buf, n);
    }
}

Host side (examples/bench_softmax_rs/src/main.rs, condensed):

use ascend_rs::prelude::*;

fn main() -> anyhow::Result<()> {
    let acl     = Acl::new()?;
    let device  = Device::new(&acl)?;
    let context = AclContext::new(&device)?;
    let stream  = AclStream::new(&context)?;

    let n: u32 = 1024;
    let input: Vec<f32> = (0..n as usize)
        .map(|i| ((i as f32) * 0.01).sin() * 3.0)
        .collect();

    // Copy the input to the device; allocate the output and length buffers
    let mut d_input  = DeviceBuffer::from_slice(&input)?;
    let mut d_output = unsafe { DeviceBuffer::<f32>::uninitialized(n as usize)? };
    let mut d_len    = DeviceBuffer::from_slice(&[n])?;

    // Load and launch the kernel (1 block)
    let kernel_loader = KernelLoader::new()?;
    let kernel = kernel_loader.get_kernel("softmax")?;
    let mut args: [*mut std::ffi::c_void; 3] = [
        d_input.as_mut_ptr() as *mut _,
        d_output.as_mut_ptr() as *mut _,
        d_len.as_mut_ptr() as *mut _,
    ];
    unsafe { kernel.launch(1, &stream, &mut args)?; }
    stream.synchronize()?;

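    // Copy the result back and sanity-check it: softmax outputs should sum to ≈ 1.0
    // (the full example also compares each element against a CPU reference)
    let output = d_output.to_host()?;
    let sum: f32 = output.iter().sum();
    println!("sum = {:.6}  (expected ≈ 1.0)", sum);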
    println!("output[0..4] = {:?}", &output[..4]);

    Ok(())
}

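Build and run

cd examples/bench_softmax_rs

# Build the kernel (triggers the CANN compilation pipeline):
#   Rust source → MLIR → C++ (mlir_to_cpp) → bisheng → .acl.o
RUST_LOG=info cargo run --release -- --csv /tmp/softmax_results.csv

On the first build, the kernel compilation step (bisheng) takes about 5 seconds; later builds reuse the cargo cache.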

Expected output

2026-03-31T09:15:44Z INFO  [bench_softmax_rs] Device Ascend310P3 initialized
2026-03-31T09:15:44Z INFO  [bench_softmax_rs] Running softmax benchmark
size=256   pass=true  max_err=1.22e-8  sum=1.000000  rust_vec=0.077ms
size=1024  pass=true  max_err=8.34e-9  sum=1.000000  rust_vec=0.076ms
size=4096  pass=true  max_err=7.11e-9  sum=1.000000  rust_vec=0.079ms
size=16384 pass=true  max_err=6.89e-9  sum=1.000000  rust_vec=0.087ms

Screenshot (310P, real hardware, full benchmark comparison)

$ RUST_LOG=info cargo run --release -- --csv /tmp/softmax_results.csv
   Compiling bench_softmax_rs v0.1.0
    Finished `release` profile [optimized] target(s) in 8.4s
     Running `target/release/bench_softmax_rs --csv /tmp/softmax_results.csv`
2026-03-31T09:15:44Z INFO  [bench_softmax_rs] Device Ascend310P3 initialized
2026-03-31T09:15:44Z INFO  [bench_softmax_rs] size=256   rust_vec=0.077ms  pass=true  max_err=1.22e-8
2026-03-31T09:15:44Z INFO  [bench_softmax_rs] size=1024  rust_vec=0.076ms  pass=true  max_err=8.34e-9
2026-03-31T09:15:44Z INFO  [bench_softmax_rs] size=4096  rust_vec=0.079ms  pass=true  max_err=7.11e-9
2026-03-31T09:15:44Z INFO  [bench_softmax_rs] size=16384 rust_vec=0.087ms  pass=true  max_err=6.89e-9
CSV written to /tmp/softmax_results.csv

To run the full comparison (Rust and C++ side by side):

# From the repository root:
cd benchmarks/softmax
bash bench.sh

=== Softmax benchmark ===
--- Rust softmax benchmark ---
size=16384  rust_scalar=2.221ms  rust_vec=0.087ms  pass=true
--- C++ softmax benchmark ---
size=16384  cpp_naive=2.073ms    cpp_opt=0.089ms    pass=true

Performance summary (16384 elements):
  Rust vector vs C++ optimized:  0.087ms vs 0.089ms  → Rust is 1.02x faster
  Vector vs scalar speedup:      25.5x
  Correctness: PASS at all sizes (max_err < 1e-8)
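
The two ratios in the performance summary follow directly from the timings printed by bench.sh. As a quick check, the sketch below recomputes them; the `speedup` helper is illustrative and is not part of the benchmark script:

```rust
/// Ratio of a baseline time to an optimized time (both in milliseconds).
fn speedup(baseline_ms: f64, optimized_ms: f64) -> f64 {
    baseline_ms / optimized_ms
}

fn main() {
    // Timings for 16384 elements, copied from the bench.sh output above.
    let rust_scalar = 2.221;
    let rust_vec = 0.087;
    let cpp_opt = 0.089;

    // prints "vector vs scalar: 25.5x"
    println!("vector vs scalar: {:.1}x", speedup(rust_scalar, rust_vec));
    // prints "Rust vec vs C++ opt: 1.02x"
    println!("Rust vec vs C++ opt: {:.2}x", speedup(cpp_opt, rust_vec));
}
```

A 2% gap between the Rust and C++ vector kernels is small compared with the run-to-run variation visible across sizes, so the two are best read as performing on par.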

How the compilation pipeline works

The intermediate files from each compilation step are kept under kernels/target/ and can be inspected:

kernels/target/davinci-huawei-none/release/deps/
├── softmax_kernels.mlir              ← MLIR emitted by rustc codegen
├── softmax_kernels.mlir.acl.gen.cpp  ← C++ generated by mlir_to_cpp
└── softmax_kernels.acl.o             ← NPU object file produced by bisheng

The generated C++ (acl.gen.cpp) shows the AscendC API calls that correspond to the Rust intrinsics:

// Generated from ascend_std::ascend_exp_f32(out_buf, out_buf, n)
Exp(out_buf_local, out_buf_local, n);
pipe_barrier(PIPE_V);

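Example 3: Tile Softmax (the PTO Compilation Path on Ascend 910B)

This example demonstrates the newer PTO (Programmable Tile Operations) compilation path, which targets the matrix pipeline of the Ascend 910B (dav-c220). The Tile API expresses computation as 2D tile operations such as tile_load, tile_softmax, and tile_store, and compiles through ptoas (the PTO assembler) instead of the standard C++ path.

This is the most advanced of the first three examples and requires an Ascend 910B device with ptoas installed. It exercises the full pipeline: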

Rust Tile API  →  MLIR  →  PTO-MLIR  →  ptoas  →  CCE C++  →  ccec  →  .acl.o

Source code

Kernel (examples/tile_softmax/kernels/src/lib.rs):

#![feature(no_core)]
#![no_std]
#![no_core]

use ascend_std::tile::{tile_load_f32, tile_softmax_f32, tile_store_f32, Tile};

/// Row-wise softmax over a ROWS × COLS f32 tile.
///
/// The Tile API is a 2D abstraction over the NPU vector engine:
/// - `tile_load_f32`    → PTO `tload` (DMA: global memory → UB tile)
/// - `tile_softmax_f32` → PTO reduction sequence: trowmax → trowexpandsub →
///                        texp → trowsum → trowexpanddiv
/// - `tile_store_f32`   → PTO `tstore` (DMA: UB tile → global memory)
///
/// The `ptoas --enable-insert-sync` flag automatically inserts
/// set_flag / wait_flag barriers between tile operations.
#[ascend_std::aiv_kernel]
pub unsafe fn tile_softmax(input: *const f32, output: *mut f32) {
    let block_idx = ascend_std::get_block_idx() as usize;
    let offset = block_idx * 1 * 1024;  // ROWS=1, COLS=1024

    // Load the tile from global memory
    let t_in: Tile<1, 1024, f32> =
        tile_load_f32::<1, 1024>(input.wrapping_add(offset));

    // Compute softmax: max → shift → exp → sum → divide
    let t_out: Tile<1, 1024, f32> = tile_softmax_f32::<1, 1024>(t_in);

    // Store the result back to global memory
    tile_store_f32::<1, 1024>(output.wrapping_add(offset), t_out);
}

Host side (examples/tile_softmax/src/main.rs, condensed):

use ascend_rs::prelude::*;

fn main() -> anyhow::Result<()> {
    const ROWS: usize = 1;
    const COLS: usize = 1024;

    let acl     = Acl::new()?;
    let device  = Device::new(&acl)?;
    let context = AclContext::new(&device)?;
    let stream  = AclStream::new(&context)?;

    // Sine-wave input, which makes the result easy to check visually
    let input: Vec<f32> = (0..ROWS * COLS)
        .map(|i| ((i as f32) * 0.01).sin() * 3.0)
        .collect();

    let mut d_input  = DeviceBuffer::from_slice(&input)?;
    let mut d_output = unsafe { DeviceBuffer::<f32>::uninitialized(ROWS * COLS)? };

    let kernel_loader = KernelLoader::new()?;
    let kernel = kernel_loader.get_kernel("tile_softmax")?;
    let mut args: [*mut std::ffi::c_void; 2] = [
        d_input.as_mut_ptr() as *mut _,
        d_output.as_mut_ptr() as *mut _,
    ];
    unsafe { kernel.launch(1, &stream, &mut args)?; }  // 1 block
    stream.synchronize()?;

    let output = d_output.to_host()?;
    let sum: f32 = output.iter().sum();
    let max_err = output.iter()
        .zip(softmax_cpu(&input, ROWS, COLS).iter())
        .map(|(a, b)| (a - b).abs())
        .fold(0.0f32, f32::max);

    println!("tile_softmax: max_err={:.4e} sum={:.6} {}",
        max_err, sum,
        if max_err < 1e-5 && (sum - 1.0).abs() < 1e-4 { "PASS" } else { "FAIL" });

    Ok(())
}
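
The condensed host program calls `softmax_cpu` without showing it. A minimal sketch of such a reference follows, assuming a row-major layout and the signature implied by the call site; the helper in the repository may be written differently:

```rust
/// CPU reference: numerically stable softmax applied independently to each
/// row of a row-major `rows x cols` matrix. Signature inferred from the call
/// site in the host program; this is an assumption, not the repository code.
fn softmax_cpu(input: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(input.len(), rows * cols, "input length must be rows * cols");
    let mut out = Vec::with_capacity(input.len());
    for row in input.chunks_exact(cols) {
        // Subtract the row maximum so exp() cannot overflow.
        let max = row.iter().copied().fold(f32::NEG_INFINITY, f32::max);
        let exps: Vec<f32> = row.iter().map(|&x| (x - max).exp()).collect();
        let sum: f32 = exps.iter().sum();
        out.extend(exps.iter().map(|&e| e / sum));
    }
    out
}

fn main() {
    // Same sine-wave input as the host program above.
    let input: Vec<f32> = (0..1024).map(|i| ((i as f32) * 0.01).sin() * 3.0).collect();
    let reference = softmax_cpu(&input, 1, 1024);
    let sum: f32 = reference.iter().sum();
    println!("reference sum = {:.6}", sum);
}
```

This mirrors the max → shift → exp → sum → divide order used by the kernel, so differences between device and host results come from arithmetic precision and not from a different formulation.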

Build and run

# Required environment (Ascend 910B with CANN 8.5 and ptoas)
export ACLRS_CANN_PATH=/usr/local/Ascend/cann-8.5.0
export ACLRS_SOC_VERSION=Ascend910_9392          # adjust to your SoC
export ACLRS_CODEGEN_PATH=pto                     # enable the PTO path
export ACLRS_PTOAS_PATH=/path/to/ptoas            # path to the ptoas assembler
export ACLRS_PTO_ISA_PATH=/path/to/pto-isa/include  # path to the pto-isa headers
export LD_LIBRARY_PATH=/data/llvm20/lib:${ACLRS_CANN_PATH}/aarch64-linux/lib64:\
/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/driver/lib64/common

source ${ACLRS_CANN_PATH}/set_env.sh
export PATH=${ACLRS_CANN_PATH}/tools/ccec_compiler/bin:$PATH

cd examples/tile_softmax
cargo run --release

Tracing the compilation pipeline

The build system prints each step. Set RUST_LOG=debug to see the full commands:

# Step 1: Rust → MLIR (rustc with the custom codegen backend)
rustc --crate-type lib -Z codegen-backend=librustc_codegen_mlir.so ...
  → tile_softmax_kernels.mlir

# Step 2: MLIR → PTO-MLIR (mlir_to_pto.rs)
  → tile_softmax_kernels.acl.pto

# Step 3: PTO-MLIR → CCE C++ (ptoas)
ptoas --enable-insert-sync --pto-arch=a3 tile_softmax_kernels.acl.pto \
      -o tile_softmax_kernels.acl.pto.cpp

# Step 4: CCE C++ → NPU object file (ccec)
ccec -c -O3 -x cce -DMEMORY_BASE --cce-aicore-arch=dav-c220-vec \
     -mllvm -cce-aicore-addr-transform \
     -mllvm -cce-aicore-dcci-insert-for-scalar=false \
     -I/path/to/pto-isa/include \
     tile_softmax_kernels.acl.pto.cpp \
     -o tile_softmax_kernels.acl.o

Intermediate files

After cargo build --release completes, the PTO-MLIR dialect form of the softmax decomposition can be inspected in kernels/target/davinci-huawei-none/release/deps/:

// tile_softmax_kernels.acl.pto: PTO-MLIR dialect (excerpt)
module {
  func.func @ascend_tile_softmax_f32(
      %input:  !pto.ptr<f32>,
      %output: !pto.ptr<f32>) {

    // --- tload: global memory → UB tile ---
    %c0   = arith.constant 0 : index
    %cR   = arith.constant 1 : index
    %cC   = arith.constant 1024 : index
    %tv_in = pto.make_tensor_view %input,
               shape=[%cR, %cC] strides=[%cC, %c1]
               : !pto.tensor_view<1x1024xf32>
    %pv_in = pto.partition_view %tv_in,
               offsets=[%c0, %c0], sizes=[%cR, %cC]
               : !pto.tensor_view<1x1024xf32> -> !pto.partition_tensor_view<1x1024xf32>
    %tile_in = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=1024, ...>
    pto.tload ins(%pv_in : ...) outs(%tile_in : ...)

    // --- softmax decomposition ---
    %tmp_max = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=1, ...>
    %row_max = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=1, ...>
    pto.trowmax ins(%tile_in, %tmp_max : ...) outs(%row_max : ...)    // step 1: row max

    %shifted = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=1024, ...>
    pto.trowexpandsub ins(%tile_in, %row_max : ...) outs(%shifted : ...)  // step 2: x - max

    pto.texp ins(%shifted : ...) outs(%shifted : ...)                  // step 3: exp

    %tmp_sum = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=1, ...>
    %row_sum = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=1, ...>
    pto.trowsum ins(%shifted, %tmp_sum : ...) outs(%row_sum : ...)     // step 4: row sum

    %result  = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=1024, ...>
    pto.trowexpanddiv ins(%shifted, %row_sum : ...) outs(%result : ...)  // step 5: ÷ sum

    // --- tstore: UB tile → global memory ---
    pto.tstore ins(%result : ...) outs(%pv_out : ...)
    return
  }
}

Expected output

2026-03-31T18:32:35Z INFO  [tile_softmax] tile_softmax test: ROWS=1, COLS=1024, n=1024
2026-03-31T18:32:35Z INFO  [tile_softmax] Device Ascend910_9392 initialized
2026-03-31T18:32:35Z INFO  [tile_softmax] Launching tile_softmax kernel (1 block, 1×1024 f32)...
2026-03-31T18:32:35Z INFO  [tile_softmax] tile_softmax: max_err=2.38e-7 sum=1.000000 sum_ok=true PASS
2026-03-31T18:32:35Z INFO  [tile_softmax] tile_softmax PASSED

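A note on hardware availability: the 910C server used for these tests occasionally enters a hardware fault state (Device startup failed). When that happens the compilation pipeline still completes successfully; only runtime execution is blocked. The PTO compilation result (a 1960-byte .acl.o file) was verified manually to compile correctly for dav-c220-vec.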

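Key differences from Example 2

	Example 2 (Vector Softmax)	Example 3 (Tile Softmax)
Compilation path	`mlir_to_cpp` → `bisheng`	`mlir_to_pto` → `ptoas` → `ccec`
Abstraction level	1D vector intrinsics (`ascend_reduce_max_f32`)	2D tile operations (`tile_softmax_f32`)
Target hardware	310P or 910B (vector engine)	910B (dav-c220, a2a3 path)
Intermediate format	AscendC C++	PTO-MLIR dialect
Synchronization barriers	Manual (`ascend_pipe_barrier`)	Inserted automatically by `ptoas --enable-insert-sync`
Parallelism model	1 block, 1D vector operations	1 block, 2D tiles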

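Example 4: Double-Buffered Tile Softmax

This extends Example 3 to process two tiles in a single launch, using tile_prefetch_f32 so that the Mte2 load (tile 1) overlaps with the Vector computation (softmax on tile 0). Performance data is in Section 4.7.

Source code

Kernel (examples/tile_softmax_double_buf/kernels/src/lib.rs):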

#![feature(no_core)]
#![no_std]
#![no_core]

use ascend_std::tile::{
    tile_load_f32, tile_prefetch_f32, tile_softmax_f32, tile_store_f32, Tile,
};

#[ascend_std::aiv_kernel]
pub unsafe fn tile_softmax_double_buf(input: *const f32, output: *mut f32) {
    const ROWS: usize = 1;
    const COLS: usize = 1024;
    const TILE_ELEMS: usize = ROWS * COLS;

    // --- Prologue: issue both loads before any compute starts ---
    let t0: Tile<ROWS, COLS, f32> = tile_load_f32::<ROWS, COLS>(input);
    let t1: Tile<ROWS, COLS, f32> =
        tile_prefetch_f32::<ROWS, COLS>(input.wrapping_add(TILE_ELEMS));

    // --- Compute tile 0 (on hardware, the Mte2 load of t1 can overlap with this) ---
    let r0: Tile<ROWS, COLS, f32> = tile_softmax_f32::<ROWS, COLS>(t0);

    // --- Compute tile 1 ---
    let r1: Tile<ROWS, COLS, f32> = tile_softmax_f32::<ROWS, COLS>(t1);

    // --- Store the results ---
    tile_store_f32::<ROWS, COLS>(output, r0);
    tile_store_f32::<ROWS, COLS>(output.wrapping_add(TILE_ELEMS), r1);
}

Generated PTO-MLIR

The key difference from Example 3 is that the two loads produce partition_view operations with different row offsets:

// tile 0: load from row 0
%pto1 = pto.partition_view %pto0, offsets = [%c0, %c0], sizes = [%c1, %c1024] : ...
pto.tload ins(%pto1 : ...) outs(%pto2 : ...)

// tile 1: load from row 1 (an offset of 1024 elements is row 1 when cols=1024)
%pto3 = pto.partition_view %pto0, offsets = [%c1, %c0], sizes = [%c1, %c1024] : ...
pto.tload ins(%pto3 : ...) outs(%pto4 : ...)

// softmax(t0): Vector pipeline; Mte2 can overlap with the tload above
pto.trowmax ...
pto.trowexpanddiv ins(...) outs(%pto10 : ...)

// softmax(t1)
pto.trowmax ...
pto.trowexpanddiv ins(...) outs(%pto16 : ...)

// Stores: rows 0 and 1 of the output
%pto18 = pto.partition_view %pto17, offsets = [%c0, %c0], ...
pto.tstore ins(%pto10 : ...) outs(%pto18 : ...)
%pto19 = pto.partition_view %pto17, offsets = [%c1, %c0], ...
pto.tstore ins(%pto16 : ...) outs(%pto19 : ...)
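
The row offsets in the partition_view operations are the flat pointer offsets from the kernel, re-expressed in a row-major view with 1024 columns. The sketch below shows that mapping; `row_col` is an illustrative helper and not part of ascend-rs or the PTO tooling:

```rust
/// Splits a flat element offset into (row, col) for a row-major view with
/// `cols` columns. Illustrative helper, not part of ascend-rs.
fn row_col(offset: usize, cols: usize) -> (usize, usize) {
    (offset / cols, offset % cols)
}

fn main() {
    const ROWS: usize = 1;
    const COLS: usize = 1024;
    const TILE_ELEMS: usize = ROWS * COLS;

    // Tile 0 starts at element 0, tile 1 at element TILE_ELEMS.
    for tile in 0..2 {
        let (row, col) = row_col(tile * TILE_ELEMS, COLS);
        // prints "tile 0: offsets = [0, 0]" then "tile 1: offsets = [1, 0]"
        println!("tile {tile}: offsets = [{row}, {col}]");
    }
}
```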

Expected output

2026-04-02T06:14:07Z INFO  [tile_softmax_double_buf] double_buf 2×(1×1024): total avg=0.0068ms min=0.0049ms max=0.0140ms | per-tile avg=0.0034ms min=0.0024ms | max_err=3.26e-9 PASS

Raw data: examples/tile_softmax_double_buf/results/bench_double_buf_910b2_2026-04-02.csv.

Troubleshooting

`Device startup failed`

The NPU driver is not running, or the device is in a fault state. Check:

npu-smi info          # check that Health is OK (not Critical)
npu-smi reset -i 0    # reset device 0 (requires root)

`Could not determine ASCEND_HOME_PATH`

ACLRS_CANN_PATH is unset or points to a path that does not exist:

export ACLRS_CANN_PATH=/usr/local/Ascend/cann-8.5.0
# Verify that the path exists:
ls $ACLRS_CANN_PATH/tools/ccec_compiler/bin/bisheng

`ptoas assembler not found`

Set ACLRS_PTOAS_PATH to the full path of the ptoas binary:

export ACLRS_PTOAS_PATH=/path/to/ptoas/build/tools/ptoas/ptoas

ptoas is part of the pto-isa project and is needed only for the PTO compilation path (Examples 3 and 4).

`ccec PTO compilation failed: set_mask_count does not support target feature`

The wrong --cce-aicore-arch was used. Confirm that:

ACLRS_SOC_VERSION matches your chip
ascend-rs is on the claude_code or main branch (the fixes landed in d45ab4e3 and adbf7294)

`error: definition of type 'bfloat16_t' conflicts with typedef`

Your ccec version already defines bfloat16_t. This was fixed in commit adbf7294; update to the latest branch.

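Correctness check fails (`max_err > 1e-5`)

Vector softmax on 310P: expect max_err on the order of 1e-8 (hardware f32 precision); the Example 2 runs peak at 1.22e-8
Tile softmax on 910B: expect max_err < 1e-5 (PTO reduction precision)
Values outside these ranges may indicate a wrong SoC version setting, which makes the assumed UB buffer size mismatch the hardware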
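
When a run fails this check, it can help to recompute the error on the host to see how far off the result is. The sketch below is modelled on the pass criterion in the tile_softmax host program (error below a tolerance, outputs summing to roughly 1); the helper names are illustrative, and the per-row sum check is a generalization that the original single-row program does not need:

```rust
/// Largest absolute element-wise difference between two slices.
fn max_abs_err(actual: &[f32], expected: &[f32]) -> f32 {
    assert_eq!(actual.len(), expected.len(), "slices must have equal length");
    actual
        .iter()
        .zip(expected)
        .map(|(a, b)| (a - b).abs())
        .fold(0.0f32, f32::max)
}

/// Passes when the error is below `tol` and every row of `cols` values
/// sums to 1 within 1e-4. Illustrative helper, not part of ascend-rs.
fn softmax_ok(actual: &[f32], expected: &[f32], cols: usize, tol: f32) -> bool {
    let rows_sum_to_one = actual
        .chunks_exact(cols)
        .all(|row| (row.iter().sum::<f32>() - 1.0).abs() < 1e-4);
    max_abs_err(actual, expected) < tol && rows_sum_to_one
}

fn main() {
    let expected = [0.25f32, 0.25, 0.25, 0.25];
    let actual = [0.25f32, 0.25, 0.25, 0.25000003];
    // 1e-5 is the tile softmax tolerance quoted above.
    println!("max_err = {:e}", max_abs_err(&actual, &expected));
    println!("pass = {}", softmax_ok(&actual, &expected, 4, 1e-5));
}
```

Printing max_err alongside the verdict distinguishes a precision issue (errors just above the tolerance) from a configuration problem, where the output is typically wrong by many orders of magnitude.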

Overview: the three compilation paths compared

Example 1: Hello World
  Rust host code  →  cargo build  →  executable  →  ACL runtime  →  NPU device
  (no kernel: pure host/driver interaction)

Example 2: Vector Softmax (mlir_to_cpp path)
  Rust kernel  →  rustc  →  MLIR  →  mlir_to_cpp  →  AscendC C++
               →  bisheng  →  .acl.o  →  KernelLoader  →  NPU execution

Example 3: Tile Softmax (PTO path)
  Rust kernel  →  rustc  →  MLIR  →  mlir_to_pto  →  PTO-MLIR dialect
               →  ptoas  →  CCE C++  →  ccec  →  .acl.o
               →  KernelLoader  →  NPU execution

All three paths share the same host-side runtime (ascend_rs::prelude::*): Acl, Device, AclContext, AclStream, DeviceBuffer, and KernelLoader. The only difference is how the .acl.o kernel binary is produced.

ascend-rs: Memory-Safe NPU Kernel Programming in Rust