
Appendix D: Ecosystem Integration: Workflows, Demos, and Vulnerability Prevention

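NPU programming tools in the Python ecosystem (TileLang, PyTorch, Triton, PyPTO) typically invoke the bisheng compiler directly to compile AscendC C++ into NPU binaries. This path bypasses all hardware-level validation: the compiler itself does not check whether synchronization barriers are present, whether buffers exceed physical SRAM, or whether the entry-point annotation is correct. This appendix shows how ascend_compile acts as an integration hub that gives each tool pre-compilation validation, and uses concrete code examples to illustrate the vulnerabilities it catches.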

D.1 The `ascend_compile` Integration Hub

ascend_compile provides four interfaces to suit different integration scenarios:

Interface	Form	Typical consumers
Rust API	`ascend_compile::compile()`	ascend-rs internals
C ABI	`libascend_compile.so` (FFI exports)	PyTorch Ascend backend
CLI	`ascend-compile kernel.cpp --soc Ascend910B3`	Scripts, CI pipelines
Python wrapper	`ascend_compile.py` (ctypes wrapper around the C ABI)	TileLang, Triton, PyPTO
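
For scripts and CI pipelines, the CLI row above can be driven from Python with the standard library alone. The following is a minimal sketch, not part of ascend_compile: the helper names `build_cli_command` and `compile_via_cli` are hypothetical, and only the argument shape `ascend-compile kernel.cpp --soc Ascend910B3` is taken from the table. How the CLI reports failures (exit code, stderr format) is not specified here, so the sketch simply returns the completed process for the caller to inspect.

```python
import subprocess
from pathlib import Path


def build_cli_command(source: Path, soc: str) -> list[str]:
    """Build the argv for the CLI form shown in the interface table."""
    return ["ascend-compile", str(source), "--soc", soc]


def compile_via_cli(source: Path, soc: str) -> subprocess.CompletedProcess:
    """Run the CLI and capture stdout/stderr as text.

    Raises FileNotFoundError if `ascend-compile` is not on PATH.
    The caller decides how to interpret returncode and stderr.
    """
    return subprocess.run(
        build_cli_command(source, soc),
        capture_output=True,
        text=True,
        check=False,
    )
```

Keeping the argv construction separate from the process launch makes the command easy to log or unit-test in a CI job without the toolchain installed.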

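Before invoking the bisheng compiler, ascend_compile runs three pre-compilation validation checks:

Check 1: entry point. The kernel source must contain the __aicore__ annotation. Functions that lack it are not compiled as NPU device code.

Check 2: DMA/synchronization barriers. The validator scans for DMA patterns such as DataCopy and copy_gm_to_ubuf. If DMA is present but no pipe_barrier() / set_flag / wait_flag is found:

310P targets: reported as an error (the 310P has no automatic synchronization, so a missing barrier is guaranteed to hang)
910B targets: reported as a warning (the compiler's automatic synchronization may handle it, but an explicit barrier is safer)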

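Check 3: buffer size. The validator parses the numeric argument of InitBuffer calls (multiplication expressions such as 256 * 1024 are supported) and checks it against the actual Unified Buffer (UB) limit of the target hardware:

910B: 192 KB (196,608 bytes)
310P: 256 KB (262,144 bytes)

All three checks are lightweight string scans that require no compilation and add less than 1 ms to the pipeline.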
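
To make the three checks concrete, here is a rough Python sketch of what such a string scan could look like. It is not the ascend_compile implementation: the names `UB_LIMITS`, `Diagnostic`, `soc_family`, and `validate` are hypothetical, SoC names are matched by prefix as an assumption, and the regular expression only handles a final `InitBuffer` argument that is a product of integer literals.

```python
import re
from dataclasses import dataclass

# UB limits quoted in this section, in bytes.
UB_LIMITS = {"Ascend910B": 192 * 1024, "Ascend310P": 256 * 1024}

DMA_PATTERNS = ("DataCopy", "copy_gm_to_ubuf")
SYNC_PATTERNS = ("pipe_barrier", "set_flag", "wait_flag")
# Captures the last InitBuffer argument when it is a product of integers.
INIT_BUFFER_RE = re.compile(r"InitBuffer\s*\([^;]*?,\s*(\d+(?:\s*\*\s*\d+)*)\s*\)")


@dataclass
class Diagnostic:
    severity: str  # "error" or "warning"
    line: int      # 1-based; 0 when the finding is not tied to a line
    message: str


def soc_family(soc: str) -> str:
    """Map e.g. 'Ascend910B3' to 'Ascend910B' (prefix match is an assumption)."""
    for family in UB_LIMITS:
        if soc.startswith(family):
            return family
    raise ValueError(f"unknown SoC: {soc}")


def validate(source: str, soc: str) -> list[Diagnostic]:
    family = soc_family(soc)
    lines = source.split("\n")
    diags = []

    # Check 1: the source must mention __aicore__ somewhere.
    if "__aicore__" not in source:
        diags.append(Diagnostic("error", 0, "no __aicore__ entry point found"))

    # Check 2: DMA without any sync primitive; report the first DMA line.
    if not any(p in source for p in SYNC_PATTERNS):
        for number, text in enumerate(lines, start=1):
            if any(p in text for p in DMA_PATTERNS):
                severity = "error" if family == "Ascend310P" else "warning"
                diags.append(Diagnostic(
                    severity, number,
                    "DMA operations found but no pipe_barrier/sync"))
                break

    # Check 3: evaluate each InitBuffer size and compare with the UB limit.
    limit = UB_LIMITS[family]
    for number, text in enumerate(lines, start=1):
        match = INIT_BUFFER_RE.search(text)
        if match is None:
            continue
        size = 1
        for factor in match.group(1).split("*"):
            size *= int(factor)
        if size > limit:
            diags.append(Diagnostic(
                "error", number,
                f"InitBuffer size {size} bytes exceeds {soc} UB limit of {limit} bytes"))
    return diags
```

Because the scan is purely textual, a barrier that appears only inside a comment would satisfy Check 2 in this sketch; a production validator would presumably need to be more careful about comments and string literals.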

D.2 TileLang Integration

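Note: the ascend_compile validation layer (D.1) can be used today with any C++ kernel source. The "ascend-rs mitigation" workflows described in D.2–D.5 are architectural designs that show how each tool could target Rust instead of C++. The Rust kernel examples compile through the MLIR backend, but end-to-end integration (tool → Rust → MLIR → C++ → NPU) has not yet been implemented in any upstream tool. These sections describe feasible paths, not delivered features.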

Workflow: TileLang generates AscendC C++ source from its Python DSL → replace the bare subprocess.run(bisheng, ...) call with ascend_compile.compile_kernel() to gain pre-compilation validation.

Demo:

from ascend_compile import compile_kernel

# C++ source that TileLang generated from the Python DSL
kernel_source = '''
#include "kernel_operator.h"
extern "C" __global__ __aicore__ void tilelang_matmul(
    GM_ADDR a, GM_ADDR b, GM_ADDR c, GM_ADDR workspace) {
    AscendC::GlobalTensor<half> aGm;
    aGm.SetGlobalBuffer((__gm__ half*)a);
    // DMA load
    AscendC::DataCopy(aLocal, aGm, {1, 32, 0, 0});
    // Compute
    AscendC::Mmad(cLocal, aLocal, bLocal, 16, 16, 16);
    // DMA store
    AscendC::DataCopy(cGm, cLocal, {1, 32, 0, 0});
}
'''

# Compile with validation: catches the missing pipe_barrier!
try:
    binary = compile_kernel(
        kernel_source,
        soc="Ascend310P1",    # the 310P requires explicit barriers
        shared=True,
        validate=True,
    )
except RuntimeError as e:
    print(f"Caught: {e}")
    # "validation failed:
    #   error: line 8: DMA operations found but no pipe_barrier/sync
    #   — required on Ascend310P1"

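Vulnerability: without ascend_compile, TileLang's bare subprocess.run(bisheng) compiles this kernel successfully. On the 310P the kernel then hangs silently: with no pipe_barrier(PIPE_ALL) between the DMA and the computation, nothing orders the compute unit after the transfer, so it can read stale data from UB. This is vulnerability pattern V6 (missing synchronization) from Appendix C. ascend_compile catches it at compile time.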

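ascend-rs mitigation: ascend_compile can detect a missing barrier, whereas ascend-rs eliminates this class of vulnerability at the root. In the safer workflow, TileLang's Python DSL generates a Rust kernel instead of C++, and the ascend-rs code generator then produces C++ whose barriers are guaranteed by construction: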

// Rust kernel: TileLang DSL → ascend-rs instead of raw C++
#[ascend_std::aiv_kernel]
pub unsafe fn tilelang_softmax(input: *const f32, output: *mut f32, n_ptr: *const u32) {
    unsafe {
        let n = *n_ptr;
        let buf_in  = ascend_std::ascend_buf_alloc(n);
        let buf_out = ascend_std::ascend_buf_alloc(n);
        let work    = ascend_std::ascend_buf_alloc(n);

        ascend_std::ascend_buf_load_f32(buf_in, input, n);
        ascend_std::ascend_pipe_barrier();  // the code generator also inserts this automatically after a DMA load

        // kernel_ops::softmax_f32 contains 4 pipe_barrier() calls internally,
        // so none of them can be forgotten
        ascend_std::kernel_ops::softmax_f32(buf_out, buf_in, work, n);

        ascend_std::ascend_pipe_barrier();  // the code generator also inserts this automatically before a DMA store
        ascend_std::ascend_buf_store_f32(output, buf_out, n);
    }
}

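The kernel_ops::softmax_f32 composite operator expands to ReduceMax → Adds → Exp → ReduceSum → Muls, with a pipe_barrier(PIPE_ALL) between each pair of steps. In addition, the MLIR→C++ code generator (mlir_to_cpp.rs) automatically inserts pipe_barrier(PIPE_ALL) after every DMA load and before every DMA store, which provides a second layer of protection even if the programmer omits the explicit call. The result: synchronization bugs are structurally impossible in ascend-rs kernels, not merely detected.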
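
The insertion rule described above (a barrier after every DMA load and before every DMA store) can be modeled on a flat list of operation names. This is an illustrative Python sketch only, not the logic of mlir_to_cpp.rs: the operation names and the function `insert_dma_barriers` are hypothetical, and collapsing adjacent duplicate barriers is an assumption made here so that explicit and automatic barriers do not stack.

```python
BARRIER = "pipe_barrier"
DMA_LOAD = "dma_load"
DMA_STORE = "dma_store"


def insert_dma_barriers(ops: list[str]) -> list[str]:
    """Return ops with a barrier after each load and before each store."""
    out: list[str] = []
    for op in ops:
        if op == BARRIER:
            # Keep an explicit barrier unless one was just emitted.
            if not out or out[-1] != BARRIER:
                out.append(op)
            continue
        if op == DMA_STORE and (not out or out[-1] != BARRIER):
            out.append(BARRIER)  # barrier before the store
        out.append(op)
        if op == DMA_LOAD:
            out.append(BARRIER)  # barrier after the load
    return out
```

Running the pass on a sequence that already contains the barriers leaves it unchanged, which mirrors the "second layer" idea: the explicit calls in the Rust kernel and the automatic ones describe the same synchronization points.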

D.3 PyTorch Integration

Workflow: torch.compile with the Ascend backend generates AscendC C++ kernels → call ascend_compile through the C ABI (libascend_compile.so) or the Python wrapper to gain buffer-size validation.

Demo:

import torch

# Step 1: define a model that uses a custom Ascend kernel
@torch.compile(backend="ascend")
def fused_gelu(x):
    return x * 0.5 * (1.0 + torch.tanh(
        0.7978845608 * (x + 0.044715 * x ** 3)))

# Step 2: the Ascend backend generates AscendC C++
from ascend_compile import compile_kernel

generated_cpp = '''
#include "kernel_operator.h"
extern "C" __global__ __aicore__ void gelu_kernel(
    GM_ADDR input, GM_ADDR output, GM_ADDR workspace) {
    AscendC::TPipe pipe;
    pipe.InitBuffer(inQueue, 1, 300000);  // 300,000 bytes > the 910B's 192 KB UB limit!
}
'''

try:
    binary = compile_kernel(generated_cpp, soc="Ascend910B3")
except RuntimeError as e:
    print(f"Caught: {e}")
    # "validation failed:
    #   error: line 6: InitBuffer size 300000 bytes exceeds
    #   Ascend910B3 UB limit of 196608 bytes"

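Vulnerability: without ascend_compile, a buffer size that exceeds the NPU's Unified Buffer compiles normally but raises a hardware exception at runtime: the kernel writes past the physical SRAM boundary and may corrupt data belonging to other cores. This is a hardware-level buffer overflow that the C++ compiler cannot catch. ascend_compile validates InitBuffer sizes against the target's actual UB limit.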

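ascend-rs mitigation: in the safer workflow, torch.compile's Ascend backend generates Rust kernels instead of C++. Buffers are managed through typed newtype IDs (UbBuf, L1Buf, L0aBuf, and so on) returned by ascend_buf_alloc(), not raw pointers and not FreeTensor handles. The newtypes prevent mixing buffers from different storage levels (for example, passing an L0aBuf to a UB vector operation is a compile error). The code generator translates these IDs into AscendC TBuf<TPosition::VECCALC> objects, with sizes computed by data-flow analysis of the kernel: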

// Rust kernel: torch.compile → ascend-rs instead of raw C++
#[ascend_std::aiv_kernel]
pub unsafe fn fused_gelu(input: *const f32, output: *mut f32, n_ptr: *const u32) {
    unsafe {
        let n = *n_ptr;
        // Typed buffer IDs (UbBuf): no pointer arithmetic, no size errors
        let buf = ascend_std::ascend_buf_alloc(n);
        let tmp = ascend_std::ascend_buf_alloc(n);
        let work = ascend_std::ascend_buf_alloc(n);

        ascend_std::ascend_buf_load_f32(buf, input, n);
        ascend_std::ascend_pipe_barrier();

        // GELU via a composite operator: x * sigmoid(1.702 * x)
        ascend_std::kernel_ops::gelu_f32(tmp, buf, work, n);

        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_buf_store_f32(output, tmp, n);
    }
}

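The code generator determines the InitBuffer size from the kernel's ascend_buf_alloc(n) calls and the target's UB limit; if n elements exceed UB capacity, it can tile the computation automatically. The programmer never computes buffer sizes by hand and never passes a raw byte count to InitBuffer. The result: buffer overflows are eliminated by design, not merely detected.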
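
One way to picture the automatic tiling mentioned above is as a small planning function: given the element count, the element size, the number of buffers that must live in UB at once, and the UB limit, pick the largest tile that fits and split the range accordingly. The sketch below is a hypothetical model, not the ascend-rs algorithm; the function `plan_tiles` and the 32-byte alignment default are assumptions made for illustration.

```python
def plan_tiles(n: int, elem_bytes: int, num_buffers: int,
               ub_limit: int, align: int = 32) -> list[tuple[int, int]]:
    """Split n elements into (offset, length) tiles that fit in UB.

    Each of the num_buffers buffers receives an equal, align-byte-aligned
    share of ub_limit; the tile length is how many elements fit in one share.
    """
    if n < 0 or elem_bytes <= 0 or num_buffers <= 0:
        raise ValueError("n must be >= 0; elem_bytes and num_buffers must be > 0")
    share = ub_limit // num_buffers
    share -= share % align          # round down to the alignment boundary
    tile = share // elem_bytes      # elements per tile
    if tile == 0:
        raise ValueError("UB too small for even one element per buffer")
    return [(offset, min(tile, n - offset)) for offset in range(0, n, tile)]


# Three f32 buffers (as in fused_gelu) on a 910B with 196,608 bytes of UB.
tiles = plan_tiles(100_000, elem_bytes=4, num_buffers=3, ub_limit=196_608)
print(len(tiles), tiles[-1])
```

With these inputs each buffer gets 65,536 bytes, so a tile holds 16,384 f32 elements and 100,000 elements need seven tiles, the last one partial.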

D.4 Triton Integration

Workflow: Triton IR → the Ascend backend lowers it to AscendC C++ → ascend_compile performs the final compilation and validates the entry-point annotation.

Demo:

from ascend_compile import compile_kernel

# The Triton backend lowers a GPU kernel to AscendC C++,
# but the entry-point annotation is wrong (a common GPU→NPU porting mistake)
triton_generated = '''
extern "C" __global__ void vector_add(  // missing __aicore__!
    GM_ADDR x, GM_ADDR y, GM_ADDR z, GM_ADDR workspace) {
    AscendC::GlobalTensor<float> xGm;
    xGm.SetGlobalBuffer((__gm__ float*)x);
}
'''

try:
    binary = compile_kernel(triton_generated, soc="Ascend910B3")
except RuntimeError as e:
    print(f"Caught: {e}")
    # "validation failed:
    #   error: no __aicore__ entry point found"

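Vulnerability: the __aicore__ attribute tells the compiler to generate code for the NPU's AI Core rather than the host CPU. Without it, bisheng may compile the function as a host function, or produce a binary that crashes when launched on the NPU because its calling convention and register allocation are wrong. This is a silent, catastrophic failure: the binary exists and loads, but it computes garbage or hangs.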

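ascend-rs mitigation: in the safer workflow, the Triton-Ascend backend lowers Triton IR to Rust kernels annotated with #[aiv_kernel]. The code generator unconditionally emits the correct MLIR attributes (hacc.entry, hacc.function_kind = #hacc.function_kind<DEVICE>) and a C++ entry point carrying both __global__ and __aicore__: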

// Rust kernel: Triton IR → ascend-rs instead of raw C++
#[ascend_std::aiv_kernel]  // ← triggers automatic __aicore__ in the code generator
pub unsafe fn vector_add(
    x: *const f32, y: *const f32, z: *mut f32, n_ptr: *const u32,
) {
    unsafe {
        let n = *n_ptr;
        let bx = ascend_std::ascend_buf_alloc(n);
        let by = ascend_std::ascend_buf_alloc(n);

        ascend_std::ascend_buf_load_f32(bx, x, n);
        ascend_std::ascend_buf_load_f32(by, y, n);
        ascend_std::ascend_pipe_barrier();

        ascend_std::ascend_add_f32(bx, bx, by, n);

        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_buf_store_f32(z, bx, n);
    }
}

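When the code generator in declare.rs detects the #[aiv_kernel] attribute, it unconditionally adds the MLIR entry-point attributes. There is no code path by which a Rust kernel function can compile without the __aicore__ annotation: the attribute is applied by the compiler, not the programmer. This turns an error-prone manual annotation task into an automatic, toolchain-guaranteed property.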

D.5 PyPTO Integration

Workflow: PyPTO's PTO virtual instruction set (about 90 instructions) compiles to AscendC C++ → ascend_compile validates buffer allocations and compiles.

Demo:

from ascend_compile import compile_kernel

# C++ that PyPTO generated from tile-level Python operations
pypto_generated = '''
#include "kernel_operator.h"
extern "C" __global__ __aicore__ void pypto_tile_op(
    GM_ADDR input, GM_ADDR output, GM_ADDR workspace) {
    AscendC::TPipe pipe;
    // PyPTO allocated 512KB for double-buffered tiles
    pipe.InitBuffer(inQueue, 2, 256 * 1024);  // 2 x 256KB = 512KB
    // But the 910B UB is only 192KB in total!

    AscendC::LocalTensor<float> aLocal = inQueue.DeQue();
    AscendC::DataCopy(outputGm, aLocal, {1, 64, 0, 0});
    pipe_barrier(PIPE_ALL);
}
'''

try:
    binary = compile_kernel(pypto_generated, soc="Ascend910B3")
except RuntimeError as e:
    print(f"Caught: {e}")
    # "validation failed:
    #   error: line 6: InitBuffer size 262144 bytes exceeds
    #   Ascend910B3 UB limit of 196608 bytes"

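Vulnerability: PyPTO's tile scheduler optimizes for throughput and may allocate tiles larger than the target's physical SRAM. Without target-aware validation, the compiled kernel tries to use more Unified Buffer than actually exists, which corrupts memory either between the kernel's own buffers or between co-resident kernels on adjacent AI Cores. ascend_compile catches this because it knows the exact UB size of each target (192 KB on the 910B, 256 KB on the 310P).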
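
The arithmetic behind this demo is worth spelling out. The short Python sketch below uses only the numbers quoted in this appendix; the helper name `queue_bytes` is hypothetical, and it assumes that `InitBuffer(queue, num, len)` reserves `num` blocks of `len` bytes each, as the "2 x 256KB = 512KB" comment in the demo implies.

```python
UB_LIMIT_910B = 192 * 1024  # 196,608 bytes, per D.1


def queue_bytes(num_blocks: int, block_bytes: int) -> int:
    """Total UB bytes reserved by a queue of num_blocks equal blocks."""
    return num_blocks * block_bytes


block = 256 * 1024                    # 262,144 bytes per block
total = queue_bytes(2, block)         # 524,288 bytes for the double buffer

print(block > UB_LIMIT_910B)          # a single block already exceeds UB
print(total > UB_LIMIT_910B)          # the double buffer exceeds it further
print(UB_LIMIT_910B // 2)             # largest block that fits when doubled
```

A single block is already over the limit, which is why the error message reports 262144 bytes; under the stated assumption, the largest block that could be double-buffered on a 910B is 98,304 bytes (96 KB), before accounting for any other UB users.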

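ascend-rs mitigation: in the safer workflow, PyPTO's tile-level operations map onto ascend-rs kernel_ops composite operators. Buffers are allocated with ascend_buf_alloc(n) in element counts, not byte sizes. The code generator computes the physical InitBuffer byte count from the element count and the data type, and validates it against the target's UB limit during code generation: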

// Rust kernel: PyPTO tile operations → ascend-rs instead of raw C++
#[ascend_std::aiv_kernel]
pub unsafe fn pypto_tile_matmul(
    a: *const u16, b: *const u16, c: *mut f32, n_ptr: *const u32,
) {
    unsafe {
        let n = *n_ptr;
        // Typed buffer allocation: the code generator maps each to a TBuf with the correct TPosition
        let l1_a  = ascend_std::ascend_buf_alloc_l1(n);   // L1 buffer
        let l0a   = ascend_std::ascend_buf_alloc_l0a(n);  // L0A buffer (Cube input A)
        let l0b   = ascend_std::ascend_buf_alloc_l0b(n);  // L0B buffer (Cube input B)
        let l0c   = ascend_std::ascend_buf_alloc_l0c(n);  // L0C buffer (Cube output)

        // Each alloc maps to a specific TBuf<TPosition::*> in the code generator:
        // L0A → TBuf<TPosition::A1>, L0B → TBuf<TPosition::B1>, and so on.
        // Mixing positions is a compile error in the generated C++
        ascend_std::ascend_mmad_f16(l0c, l0a, l0b, n, n, n, 1);
    }
}

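The code generator emits TBuf<TPosition::A1> for L0A, TBuf<TPosition::B1> for L0B, and TBuf<TPosition::CO1> for L0C; the AscendC type system then ensures that an L0A buffer cannot be passed to an L0B operation, and vice versa. Combined with allocation by element count rather than raw byte count, buffer-size errors are caught during code generation rather than at hardware runtime. PyPTO's tile scheduler can target ascend-rs kernels with confidence that buffer position and size constraints are enforced by the type system.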
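
For readers more familiar with Python than Rust, the typed-buffer idea can be approximated with distinct classes per storage level. This is only a runtime analogue under stated assumptions: Rust newtypes reject a mix-up at compile time, whereas this sketch raises TypeError when the call executes. The class names follow the newtypes mentioned in D.3, but `BufId`, `mmad`, and the returned string are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BufId:
    """Opaque buffer handle: an index, never a raw address."""
    index: int


class UbBuf(BufId):
    """Unified Buffer handle."""


class L0aBuf(BufId):
    """L0A handle (Cube input A)."""


class L0bBuf(BufId):
    """L0B handle (Cube input B)."""


class L0cBuf(BufId):
    """L0C handle (Cube output)."""


def mmad(dst: L0cBuf, a: L0aBuf, b: L0bBuf) -> str:
    """Accept only buffers from the expected storage level."""
    for value, expected in ((dst, L0cBuf), (a, L0aBuf), (b, L0bBuf)):
        if type(value) is not expected:
            raise TypeError(
                f"expected {expected.__name__}, got {type(value).__name__}")
    return f"mmad(c{dst.index}, a{a.index}, b{b.index})"
```

Calling `mmad(L0cBuf(0), L0aBuf(1), L0bBuf(2))` succeeds, while swapping in an `L0bBuf` for the second argument is rejected, which is the same class of mistake the generated `TBuf<TPosition::*>` types are described as preventing.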

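D.6 Detection vs. Structural Mitigation

ascend_compile detects vulnerabilities in C++ code; ascend-rs eliminates entire vulnerability classes. The table below compares the two layers of defense: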

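Tool	Vulnerability	`ascend_compile` detection	ascend-rs structural mitigation
TileLang	V6: missing synchronization barrier	Error on 310P when `DataCopy` has no `pipe_barrier`	`kernel_ops` composite operators embed all barriers; the code generator inserts DMA barriers automatically
PyTorch	Buffer size overflow	Error when `InitBuffer` > target UB limit	`ascend_buf_alloc(n)` takes an element count; the code generator computes the byte size
Triton	Missing `__aicore__` entry point	Error when `__aicore__` is not found in the source	`#[aiv_kernel]` triggers an unconditional `hacc.entry` attribute in the code generator
PyPTO	Buffer exceeds UB limit	Error when `InitBuffer` > target UB limit	Typed `TBuf<TPosition::*>` positions; allocation by element count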

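The two layers complement each other. ascend_compile validation works on any C++ kernel source regardless of origin, so it can protect the whole ecosystem today. The ascend-rs mitigation goes further and makes these vulnerabilities structurally impossible in kernels written through its Rust→MLIR→C++ pipeline. Tools that adopt ascend-rs as a backend automatically gain both layers of protection. At the time of writing, ascend_compile validation is available for integration; the ascend-rs Rust backend is an architectural option that tool developers can adopt in future releases.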

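The three validation checks are lightweight (string scans, no compilation) and add less than 1 ms to the compilation pipeline. On the NPU, a hung kernel produces no stack trace, core dump, or error message, only a timeout. ascend_compile turns these opaque runtime failures into actionable compile-time errors with line numbers and target-specific explanations.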

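D.7 PyTorch Golden-Value Testing

Besides being a downstream consumer of the compilation integration, PyTorch also serves as the golden reference for verifying ascend-rs correctness. tests/kernel_correctness/golden/generate.py uses PyTorch and NumPy to generate reference outputs for six categories: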

# tests/kernel_correctness/golden/generate.py
import torch
import torch.nn.functional as F

# Generate the conv2d reference output
torch.manual_seed(42)
x = torch.randn(1, 3, 7, 7)
w = torch.randn(8, 3, 3, 3)
y = F.conv2d(x, w, stride=1, padding=0)
# → conv_golden.json: loaded by `cargo test -p kernel_correctness`

Golden values across the six categories:

Category	JSON file	Test cases
Convolution	`conv_golden.json`	16
Indexing	`index_golden.json`	14
Pooling	`pooling_golden.json`	12
Matrix multiplication	`matmul_golden.json`	13
Resize	`resize_golden.json`	8
Miscellaneous	`misc_golden.json`	9
Total		72

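The Rust test suite loads these JSON files via cargo test -p kernel_correctness and compares the CPU-simulated output of each Rust kernel element by element against the PyTorch reference, with a tolerance of 1e-5.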
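
The element-wise comparison can be expressed in a few lines. The sketch below is written in Python for illustration only (the real suite is Rust), and it makes two assumptions that the text does not confirm: the 1e-5 tolerance is treated as an absolute tolerance, and the JSON layout with `input` and `expected` keys is hypothetical.

```python
import json
import math


def compare_to_golden(actual, expected, tol=1e-5):
    """Return the indices where |actual - expected| exceeds tol.

    NaN never compares close, so NaN outputs are always reported.
    """
    if len(actual) != len(expected):
        raise ValueError(f"length mismatch: {len(actual)} vs {len(expected)}")
    return [
        i for i, (a, e) in enumerate(zip(actual, expected))
        if not math.isclose(a, e, rel_tol=0.0, abs_tol=tol)
    ]


# Hypothetical golden record; the real JSON schema may differ.
golden = json.loads(
    '{"name": "relu_small", "input": [-1.0, 0.5, 2.0], "expected": [0.0, 0.5, 2.0]}'
)
output = [max(0.0, x) for x in golden["input"]]  # stand-in for a kernel result
mismatches = compare_to_golden(output, golden["expected"])
print(mismatches)
```

Returning the mismatching indices rather than a single boolean makes a failing test report which elements diverged, which is usually the first thing needed when debugging an indexing bug.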

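Vulnerability prevention: comparing Rust kernel output against the PyTorch reference catches incorrect implementations before deployment. For example, a gather kernel with an off-by-one indexing error (V2 in Appendix C: unchecked out-of-bounds access) produces output that deviates from the PyTorch reference, so golden-value testing catches the defect automatically in CI without access to real NPU hardware.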
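
A toy version of the gather example shows why the comparison is sufficient to expose the bug. Everything in this Python sketch is hypothetical: `gather_reference` stands in for the PyTorch reference, `gather_off_by_one` is a deliberately buggy kernel, and the trailing sentinel element simulates whatever memory happens to sit past the end of the buffer.

```python
def gather_reference(data, indices):
    """Correct gather: out[k] = data[indices[k]]."""
    return [data[i] for i in indices]


def gather_off_by_one(data, indices, sentinel=0.0):
    """Deliberately buggy gather that reads one element too far.

    The buffer is padded with a sentinel so the out-of-bounds read
    returns a value instead of raising, as unchecked hardware would.
    """
    padded = list(data) + [sentinel]
    return [padded[i + 1] for i in indices]


def mismatching_indices(actual, expected, tol=1e-5):
    """Indices where the absolute difference exceeds tol."""
    return [i for i, (a, e) in enumerate(zip(actual, expected))
            if abs(a - e) > tol]


data = [10.0, 20.0, 30.0, 40.0]
indices = [0, 2, 3]
expected = gather_reference(data, indices)
actual = gather_off_by_one(data, indices)
print(expected, actual, mismatching_indices(actual, expected))
```

Here every output element differs from the reference, and the last one comes from past the end of the buffer, so the test fails on the host long before the kernel reaches hardware.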


ascend-rs: Memory-Safe NPU Kernel Programming in Rust