
11. Catching ptoas's Blind Spots with a Rust Safety Oracle

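Summary: ptoas, the PTO-MLIR compiler, is the lowering tool for the cube path on Ascend NPUs. It validates input MLIR against its own dialect rules, but it never re-validates the output of its own PlanMemoryPass, the pass that assigns each tile a concrete byte range in UB, L1, L0A/L0B/L0C, or FB. Once placement is done, a bad placement survives all the way to codegen. This chapter builds a small Rust crate, pto_to_rust, that reconstructs ptoas's stage-2 plan as typed Rust values, runs six safety checks over it, and reports violations anchored to the original .acl.pto file. It closes with an end-to-end demonstration on two hand-written smoke kernels: both return rc=0 under ptoas 0.26, yet silently corrupt data on real hardware.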

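Versions used in this chapter: ptoas 0.26 (CANN 8.5.0, installed at /usr/local/bin/ptoas-bin/ptoas on an Ascend 910B2 test machine), pto_to_rust 0.1.0 (tag pto_checks, commit f41b29b1), rustc 1.91.0-nightly (f34ba774c 2025-08-03). All numeric results reproduce exactly under these versions; newer ptoas releases may change placement decisions, so concrete byte offsets can differ.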

11.1 Why ptoas needs an external oracle

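ptoas is a staged-lowering compiler: it takes PTO-MLIR (the tile dialect) as input and emits AscendC C++ that bisheng can consume. The most important pass in its internal pipeline is PlanMemoryPass, the point at which every abstract pto.alloc_tile is materialized as an (address_space, offset, rows, cols, dtype, blayout, slayout) record. The IR is still MLIR after that, and ptoas --print-after-all can dump it, but ptoas itself never goes back to validate the invariants listed below, even though each of them is easy to check once the post-pass plan is in hand.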

The six invariants it silently skips:

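#	Invariant	Failure mode when violated
1	Two live tiles of different shape must not occupy overlapping bytes in the same address space	Silent overwrite at run time; the kernel outputs wrong data
2	High-water byte usage in each address space must not exceed the device capacity (`DeviceSpec`)	SRAM overflow; the kernel crashes or corrupts neighbouring tiles
3	`pto.tmatmul` operands must sit in the correct L0 sub-space (lhs∈Left, rhs∈Right, acc∈Acc), and the dtype triple must be in the set the cube unit accepts	Garbage in the descriptor; numerically wrong results on some CANN versions
4	ptoas descriptor limits: OUTER < 2²⁴, ROW < 2¹⁶	Descriptor is truncated; wrong N dimension
5	Every allocated tile should be used	Wasted UB budget; not a bug, but a "correctness smell" that ptoas never mentions
6	Linear tile use: after a write there should be at least one read before the next write (advisory; loops are flattened)	Dead write; the previous value is lost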

The rest of this chapter builds the smallest tool that enforces all six, and uses real violations to show that it earns its keep.
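Invariant 4 is plain arithmetic, which makes it a convenient first illustration. The following is a minimal sketch, not the crate's code: `MatmulDescriptor` and `check_matmul_bounds` are hypothetical names, and how `outer` and `row` are derived from a tile's shape is an assumption. Only the two limits come from the table above.

```rust
/// Limits taken from the invariant table: OUTER < 2^24, ROW < 2^16.
const OUTER_LIMIT: u64 = 1 << 24;
const ROW_LIMIT: u64 = 1 << 16;

/// Hypothetical view of the two descriptor fields the invariant talks about.
#[derive(Debug, Clone, Copy)]
pub struct MatmulDescriptor {
    pub outer: u64,
    pub row: u64,
}

/// Returns one message per field that would not fit the descriptor encoding.
pub fn check_matmul_bounds(d: MatmulDescriptor) -> Vec<String> {
    let mut out = Vec::new();
    if d.outer >= OUTER_LIMIT {
        out.push(format!("OUTER {} must be < {}", d.outer, OUTER_LIMIT));
    }
    if d.row >= ROW_LIMIT {
        out.push(format!("ROW {} must be < {}", d.row, ROW_LIMIT));
    }
    out
}
```

Both comparisons are strict, so a value of exactly 2²⁴ or 2¹⁶ is already reported.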

11.2 Design: three steps, three artifacts

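The oracle is built around a deliberately simple pipeline. Each step produces one artifact for the next step to consume, and every artifact is plain text that a person can read at any intermediate stage.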

    [Step 1]                  [Step 2]                     [Step 3]
┌──────────────┐   .pto   ┌──────────────┐   plan.rs   ┌───────────────┐   report   ┌────────────────┐
│  ptoas       │ ───────▶ │ pto_to_rust::│ ──────────▶ │ pto_to_rust:: │ ─────────▶ │ pto-diff CLI   │
│ --print-...  │          │ parse_stage2 │             │   check_all   │            │(human-readable)│
└──────────────┘          └──────────────┘             └───────────────┘            └────────────────┘
 MLIR after                typed Rust                   SafetyReport                 error/warn lines
 PlanMemoryPass            `Plan { funcs }`             { violations }               file: severity: [kind] func: msg

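Step 1. Dump the stage-2 PTO-MLIR. Run ptoas --print-after-all <file.acl.pto> and keep the last module that follows IR Dump After PlanMemoryPass. That IR carries a concrete (offset, size) annotation for every tile, which is exactly what the oracle needs.
Step 2. Parse into typed Rust. pto_to_rust::parse_stage2(&str) -> Plan turns the MLIR text into Plan { arch, funcs: Vec<PlanFunc> }, where each PlanFunc holds a BTreeMap<Ssa, TileSlotX> of concrete tile slots plus the Vec<PlanOp> that references them. From here Rust's type system takes over: once the parser accepts the input, all later reasoning runs on statically typed values.
Step 3. Run check_all and map violations back to the .acl.pto. SafetyReport::check_all(&plan, &device_spec) runs the six checks above and produces SafetyReport { violations: Vec<SafetyViolation> }. The pto-diff CLI takes the path of the original .acl.pto, prepends it to each violation message, and prints lines of the form file: severity: [kind] func: message. The output is diffable, greppable, and reads like an ordinary compiler diagnostic.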
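To make the three artifacts concrete, here is a hedged sketch of what the typed values could look like. The type names (`Plan`, `PlanFunc`, `TileSlotX`, `PlanOp`, `Ssa`, `SafetyViolation`, `Severity`) follow the text; every field other than `arch`, `funcs`, and `violations` is an assumption made for illustration, and `render` is a hypothetical helper that reproduces the `file: severity: [kind] func: message` line format.

```rust
use std::collections::BTreeMap;

/// SSA value name such as "%5" (assumed here to be a plain string wrapper).
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct Ssa(pub String);

/// One concrete tile slot, trimmed to three assumed fields.
#[derive(Debug, Clone)]
pub struct TileSlotX {
    pub space: String, // address space, e.g. "vec" or "scaling"
    pub offset: u64,   // byte offset inside that space
    pub bytes: u64,    // byte length of the tile
}

/// An op and the slots it touches (assumed shape).
#[derive(Debug, Clone)]
pub struct PlanOp {
    pub name: String,
    pub reads: Vec<Ssa>,
    pub writes: Vec<Ssa>,
}

#[derive(Debug, Clone)]
pub struct PlanFunc {
    pub name: String,
    pub slots: BTreeMap<Ssa, TileSlotX>,
    pub ops: Vec<PlanOp>,
}

#[derive(Debug, Clone)]
pub struct Plan {
    pub arch: String,
    pub funcs: Vec<PlanFunc>,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Error,
    Warning,
}

#[derive(Debug, Clone)]
pub struct SafetyViolation {
    pub severity: Severity,
    pub kind: &'static str, // e.g. "capacity", "aliasing"
    pub func: String,
    pub message: String,
}

impl SafetyViolation {
    /// Renders `file: severity: [kind] func: message` for a caller-chosen path.
    pub fn render(&self, file: &str) -> String {
        let sev = match self.severity {
            Severity::Error => "error",
            Severity::Warning => "warn",
        };
        format!("{}: {}: [{}] {}: {}", file, sev, self.kind, self.func, self.message)
    }
}
```

Because the file path is an argument to `render` rather than a field of the violation, the same report can be relabelled onto the stage-2 dump or onto the original .acl.pto without rerunning any check.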

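The key design decision is in step 1: rather than rewriting PlanMemoryPass in Rust (months of engineering that would never stay in sync with ptoas), the oracle trusts ptoas's placement and checks only the invariants that must hold over that placement. This keeps pto_to_rust at roughly 600 lines of Rust while staying sharp enough to catch real bugs.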

11.3 Walking the three steps with `smoke_tstore_fp_v1.acl.pto`

11.3.1 Kernel background

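smoke_tstore_fp_v1.acl.pto is a 47-line hand-written kernel. It sinks an [M,N] f32 accumulator to GM through a pto.tstore_fp (a fused dequantize-and-store) and uses an f16 scaling tile for the per-channel scale. ptoas accepts it and returns rc=0, but on a real 910B2 the generated kernel (a) silently exceeds the capacity of the scaling space and (b) gives the scaling tile a non-default RowMajor layout that the fb-dequant path does not support. Neither problem can be identified statically from the original .acl.pto, yet both can be pinpointed from the post-PlanMemoryPass plan.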

11.3.2 Running the three steps by hand

$ /usr/local/bin/ptoas-bin/ptoas \
    --print-after-all /tmp/smoke_tstore_fp_v1.acl.pto \
    -o /tmp/out.cpp 2> /tmp/stage2.dump
$ echo "ptoas rc=$?"
ptoas rc=0

# Keep the IR that follows the "IR Dump After PlanMemoryPass" header
$ awk '/IR Dump After PlanMemoryPass/{flag=1; next} flag' /tmp/stage2.dump > /tmp/stage2.mlir
$ wc -l /tmp/stage2.mlir
74 /tmp/stage2.mlir

# Step 2: parse into typed Rust (pto-diff calls the library)
# Step 3: run the checks and print the diagnostics
$ ./target/release/pto-diff /tmp/stage2.mlir
/tmp/stage2.mlir: error: [capacity] m: scaling high-water 4352 B exceeds capacity 4096 B (on Ascend910B2 (CANN 8.5))
/tmp/stage2.mlir: warn: [op-constraint] m: pto.tstore_fp: scaling tile `%11` has slayout RowMajor, typical is none_box
/tmp/stage2.mlir: 1 error(s), 1 warning(s)

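Both diagnostics are real. The error decides whether the kernel is correct (SRAM overflow); the warning decides whether it is usable (fb-dequant is silently dropped). Neither appears anywhere in ptoas's own output.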
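The awk extraction in the session above can also be done in Rust, for example inside a wrapper that drives all three steps. This is a sketch under assumptions: `extract_stage2` is a hypothetical helper, and the header prefix `IR Dump After ` is inferred from the awk pattern rather than from ptoas documentation. Unlike the awk one-liner, it starts at the last PlanMemoryPass header and stops at the next dump header, so IR printed after later passes is not swept in.

```rust
/// Marker text taken from the awk pattern used in the walkthrough.
const MARKER: &str = "IR Dump After PlanMemoryPass";
/// Assumed prefix shared by every per-pass dump header.
const ANY_DUMP: &str = "IR Dump After ";

/// Returns the IR text that follows the last PlanMemoryPass dump header,
/// stopping at the next dump header if a later pass was also dumped.
/// Returns `None` when the dump contains no PlanMemoryPass header at all.
pub fn extract_stage2(dump: &str) -> Option<String> {
    let lines: Vec<&str> = dump.lines().collect();
    // Index of the last header line for PlanMemoryPass.
    let start = lines.iter().rposition(|l| l.contains(MARKER))?;
    let body: Vec<&str> = lines[start + 1..]
        .iter()
        .copied()
        .take_while(|l| !l.contains(ANY_DUMP))
        .collect();
    Some(body.join("\n"))
}
```

Returning `None` for a dump without the marker lets a caller distinguish "ptoas printed nothing useful" from "the plan is empty", which matters when ptoas writes the dump to stderr alongside other messages.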

11.3.3 Running all three steps with one command

For convenience, pto-diff provides --from-pto, which runs the whole pipeline in one go:

$ ./target/release/pto-diff --from-pto /tmp/smoke_tstore_fp_v1.acl.pto
/tmp/smoke_tstore_fp_v1.acl.pto: error: [capacity] m: scaling high-water 4352 B exceeds capacity 4096 B (on Ascend910B2 (CANN 8.5))
/tmp/smoke_tstore_fp_v1.acl.pto: warn: [op-constraint] m: pto.tstore_fp: scaling tile `%11` has slayout RowMajor, typical is none_box
/tmp/smoke_tstore_fp_v1.acl.pto: 1 error(s), 1 warning(s)

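The file path at the start of each line is the original .acl.pto, not the intermediate dump, so an IDE or a git diff view can jump straight to the right file. This is the map-back step: the checks run on the post-PlanMemoryPass Plan, but the diagnostics can be relabelled onto any upstream artifact.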

11.3.4 What each diagnostic field means

/tmp/smoke_tstore_fp_v1.acl.pto: error: [capacity] m: scaling high-water 4352 B exceeds capacity 4096 B (on Ascend910B2 (CANN 8.5))
└────────── location ─────────┘  │      │          │
                                 │      │          └── function name within the module
                                 │      └── SafetyKind tag (aliasing/capacity/op-constraint/
                                 │          matmul-bounds/dead-tile/linear-use)
                                 └── severity (error = the kernel is wrong; warn = suspected bug, advisory)

The DeviceSpec named in the message (Ascend910B2 (CANN 8.5)) is the capacity table used for this check. Pass pto-diff --device spec.toml to supply a custom spec that targets a different SoC version.
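The capacity check behind that message reduces to a per-space maximum of offset plus length, compared against the spec. A minimal sketch follows, assuming a hypothetical `DeviceSpec` shape (a name plus a map from address-space name to bytes) and a hypothetical `check_capacity` function. The 4096 B scaling capacity is the one quoted in the diagnostic; the slot layout used to reach 4352 B in the usage note is invented for illustration.

```rust
use std::collections::BTreeMap;

/// Hypothetical capacity table: address-space name -> capacity in bytes.
pub struct DeviceSpec {
    pub name: String,
    pub capacity: BTreeMap<String, u64>,
}

/// (address space, byte offset, byte length) of one placed tile.
pub type Slot = (String, u64, u64);

/// Reports every address space whose high-water mark exceeds its capacity.
pub fn check_capacity(slots: &[Slot], spec: &DeviceSpec) -> Vec<String> {
    // High-water mark per space = max over slots of offset + length.
    let mut high_water: BTreeMap<&str, u64> = BTreeMap::new();
    for (space, offset, len) in slots {
        let end = offset + len;
        let hw = high_water.entry(space.as_str()).or_insert(0);
        *hw = (*hw).max(end);
    }
    let mut out = Vec::new();
    for (space, hw) in &high_water {
        // Spaces missing from the spec are skipped rather than guessed.
        if let Some(cap) = spec.capacity.get(*space) {
            if hw > cap {
                out.push(format!(
                    "{} high-water {} B exceeds capacity {} B (on {})",
                    space, hw, cap, spec.name
                ));
            }
        }
    }
    out
}
```

For example, two scaling slots at (offset 0, 4096 B) and (offset 4096, 256 B) give a high-water mark of 4352 B, which exceeds a 4096 B capacity and yields the same message text as the diagnostic above.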

11.4 A second kernel: aliasing and dead tiles

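The same three-step flow applied to smoke_tdequant_v3.acl.pto surfaces two different kinds of violation, which shows that the oracle is not specific to one kernel.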

$ ./target/release/pto-diff --from-pto /tmp/smoke_tdequant_v3.acl.pto
/tmp/smoke_tdequant_v3.acl.pto: error: [aliasing] m: slots %7 and %5 overlap in vec at [1024, 5120) and [4096, 4352)
/tmp/smoke_tdequant_v3.acl.pto: warn: [dead-tile] m: slot `%3` allocated in vec at offset 8192 but never used
/tmp/smoke_tdequant_v3.acl.pto: 1 error(s), 1 warning(s)

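Aliasing (error). %5 is an i8 tile placed at UB offset 4096; the diagnostic gives its range as [4096, 4352), i.e. 256 B. %7 is a 16×64 f32 tile placed at UB offset 1024 with length 4096 B. Their byte ranges [4096, 4352) and [1024, 5120) overlap on [4096, 4352): 256 bytes of the f32 tile are the i8 tile. PlanMemoryPass reused the region on purpose because its liveness analysis concluded that the two never coexist, but the two tiles have different shapes, so the oracle downgrades the reuse from "intentional" to "possibly a bug". Here it really is a bug: both tiles are live at the same time in the op schedule.
Dead tile (warning). %3 is allocated but never read or written by any op, wasting 4 KiB of UB budget. ptoas neither reclaims it nor warns about it.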
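The overlap test itself is a sort followed by one linear scan. The sketch below is an illustration under assumptions, not the crate's `check_aliasing`: `Placed` and `find_overlaps` are hypothetical names, the input is the slots of a single address space, and liveness and shape are ignored, so it reports every byte-level overlap.

```rust
/// A placed tile: SSA name plus its half-open byte range [start, end)
/// inside one address space.
#[derive(Debug, Clone, PartialEq)]
pub struct Placed {
    pub ssa: String,
    pub start: u64,
    pub end: u64,
}

/// Sorts by start offset, then scans once while remembering the slot that
/// reaches furthest to the right (the frontier). A slot that starts before
/// the frontier ends overlaps it. Every overlapping slot is reported with
/// at least one partner; not every overlapping pair is enumerated.
pub fn find_overlaps(mut slots: Vec<Placed>) -> Vec<(String, String)> {
    slots.sort_by_key(|s| (s.start, s.end));
    let mut out = Vec::new();
    let mut frontier: Option<Placed> = None;
    for s in slots {
        match &frontier {
            Some(f) if s.start < f.end => out.push((f.ssa.clone(), s.ssa.clone())),
            _ => {}
        }
        // Advance the frontier when this slot reaches further right.
        if frontier.as_ref().map_or(true, |f| s.end > f.end) {
            frontier = Some(s);
        }
    }
    out
}
```

Fed the three UB slots from this kernel, %7 at [1024, 5120), %5 at [4096, 4352), and %3 at [8192, 12288), the scan returns the single pair (%7, %5), in the same order as the diagnostic. Ranges that merely touch, such as [0, 10) and [10, 20), are not reported because the comparison is strict.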

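Both kernels make it through ptoas and yield a runnable .cpp. Both silently go wrong on hardware. The oracle exposes the failure at compile time, before ccec, before bisheng, and before the long edit-compile-run loop on the NPU.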

11.5 Mapping the oracle's violations back to ptoas

Because the oracle runs on ptoas's own output (the stage-2 MLIR), every violation it finds is a concrete candidate for an upstream patch:

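Oracle check	How to fold it back into ptoas
`[aliasing]`	Add a `VerifyAfterPlanMemoryPass` that sorts slots by offset within each address space and scans pairs. The oracle's sort-and-scan implementation in `check_aliasing` (`O(n log n)` per space, with `n < 64` in practice) can be ported almost verbatim.
`[capacity]`	Already known inside `PlanMemoryPass`: it is the very number the pass computes. A single `assert(high_water <= cap)` at the end of the pass turns a run-time crash into a compile-time error.
`[op-constraint]` lhs/rhs/acc	Op verifiers on `pto.tmatmul` / `pto.tmatmul.acc` / `pto.tstore_fp`. ptoas already has op-verifier infrastructure; each one is about 10 lines.
`[matmul-bounds]`	A stage-2 verifier that runs on the plan. Knowledge of the descriptor limits (OUTER < 2²⁴, ROW < 2¹⁶) already lives in the lowering; exposing it to a verifier is a refactor, not a new analysis.
`[dead-tile]`	A cheap post-pass: for each slot, check whether its SSA name appears in any op's `reads() ∪ writes()`. Emit a warning only; not every dead tile is a bug.
`[linear-use]`	An advisory heuristic. Promoting it to a hard rule requires scope-aware analysis (`scf.for` is currently flattened).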
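The `[dead-tile]` row describes the cheapest of the six checks: collect every SSA name that any op reads or writes, then report allocated slots outside that set. A minimal sketch, with `OpUse` and `dead_tiles` as hypothetical names and SSA names simplified to plain strings:

```rust
use std::collections::BTreeSet;

/// Minimal op view: the SSA names an op reads and writes.
pub struct OpUse {
    pub reads: Vec<String>,
    pub writes: Vec<String>,
}

/// Returns every allocated slot whose SSA name appears in no op's
/// reads or writes, preserving the order of `allocated`.
pub fn dead_tiles(allocated: &[String], ops: &[OpUse]) -> Vec<String> {
    // Union of reads and writes over all ops.
    let used: BTreeSet<&str> = ops
        .iter()
        .flat_map(|op| op.reads.iter().chain(op.writes.iter()))
        .map(String::as_str)
        .collect();
    allocated
        .iter()
        .filter(|ssa| !used.contains(ssa.as_str()))
        .cloned()
        .collect()
}
```

With slots %3, %5, %7 allocated and a single op that reads %5 and writes %7, the function returns only %3, matching the warning in 11.4.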

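Folding the first four into ptoas would make the oracle redundant for those checks, and that is the point. The oracle exists to demonstrate which invariants can get compile-time guarantees without rewriting ptoas, and to give users a safety net until upstream support lands.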
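As an indication of how small such a verifier can be, here is a sketch of the operand-placement half of the `[op-constraint]` check for `pto.tmatmul`. The names `L0Space`, `MatmulOperands`, and `check_matmul_operands` are hypothetical, and the dtype-triple half is left out because the accepted set is not listed in this chapter.

```rust
/// L0 sub-spaces named in invariant 3, plus a catch-all for anything else.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum L0Space {
    Left,
    Right,
    Acc,
    Other,
}

/// Where each `pto.tmatmul` operand was actually placed.
pub struct MatmulOperands {
    pub lhs: L0Space,
    pub rhs: L0Space,
    pub acc: L0Space,
}

/// Returns one message per operand that sits in the wrong sub-space.
pub fn check_matmul_operands(op: &MatmulOperands) -> Vec<String> {
    // (role, actual placement, required placement)
    let expected = [
        ("lhs", op.lhs, L0Space::Left),
        ("rhs", op.rhs, L0Space::Right),
        ("acc", op.acc, L0Space::Acc),
    ];
    expected
        .iter()
        .filter(|(_, got, want)| got != want)
        .map(|(role, got, want)| {
            format!("pto.tmatmul: {} is in {:?}, expected {:?}", role, got, want)
        })
        .collect()
}
```

The table-driven form keeps the rule readable as data, so adding an operand or a second op with different requirements means adding rows rather than branches.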

11.6 End-to-end reproduction script

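The repository's blog/mdbook/scripts/ch11_safety_demo.sh runs the whole demonstration non-interactively: it builds pto-diff, places the two smoke .acl.pto files in /tmp, runs the oracle on each, and prints the expected diagnostics verbatim.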

$ bash blog/mdbook/scripts/ch11_safety_demo.sh
== Tool versions ==
ptoas 0.26
pto_to_rust 0.1.0  (tag pto_checks, commit f41b29b1)
rustc 1.91.0-nightly

== Demo 1: smoke_tstore_fp_v1 ==
ptoas rc=0
oracle findings:
  error: [capacity] m: scaling high-water 4352 B exceeds capacity 4096 B (on Ascend910B2 (CANN 8.5))
  warn:  [op-constraint] m: pto.tstore_fp: scaling tile `%11` has slayout RowMajor, typical is none_box

== Demo 2: smoke_tdequant_v3 ==
ptoas rc=0
oracle findings:
  error: [aliasing] m: slots %7 and %5 overlap in vec at [1024, 5120) and [4096, 4352)
  warn:  [dead-tile] m: slot `%3` allocated in vec at offset 8192 but never used

== Summary ==
ptoas accepted both files with rc=0.
Oracle found 2 errors + 2 warnings across the two files.

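The script is read-only (it writes nothing outside /tmp) and runs as long as ptoas is on PATH and the oracle binary has been built at target/release/pto-diff. On the 910B2 test machine the whole demo finishes in under two seconds.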

11.7 Limitations and non-goals

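The oracle trusts ptoas's placement. If PlanMemoryPass emits a wrong offset (a ptoas bug), the oracle either misses the violation or reports the wrong byte range. The goal is not to re-audit ptoas's allocator but to check its output against an independent set of invariants.
Loops are flattened. check_linear_use collapses scf.for bodies, so a loop in which each iteration legitimately rewrites the same tile may be misreported as a WAW. That is why this check is Severity::Warning rather than Error. A scope-aware liveness analysis would lift the limitation, at the cost of a more complex pass.
DeviceSpec is per SoC. The built-in spec is Ascend910B2 (CANN 8.5). Other SoC versions (Ascend 910_9392, 310P3, the upcoming 910C) have different capacities and dtype rules; they can be expressed as TOML files and passed in with --device.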
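The flattening limitation is easier to see next to the check itself. The following is a sketch of a write-after-write scan over a flat op list, under assumptions: `FlatOp` and `dead_writes` are hypothetical names, and the real `check_linear_use` may track more state. Each tile carries at most one pending write; a read clears it, and a second write while it is still pending is reported.

```rust
use std::collections::BTreeMap;

/// One op in the flattened schedule (loop bodies already inlined once).
pub struct FlatOp {
    pub name: String,
    pub reads: Vec<String>,
    pub writes: Vec<String>,
}

/// Reports tiles that are written twice with no read in between.
pub fn dead_writes(ops: &[FlatOp]) -> Vec<String> {
    // tile -> name of the op whose write has not been read yet
    let mut pending: BTreeMap<&str, &str> = BTreeMap::new();
    let mut out = Vec::new();
    for op in ops {
        // An op consumes its inputs before producing outputs, so reads
        // clear pending writes first.
        for r in &op.reads {
            pending.remove(r.as_str());
        }
        for w in &op.writes {
            if let Some(prev) = pending.insert(w.as_str(), op.name.as_str()) {
                out.push(format!(
                    "{}: written by {} then by {} with no read in between",
                    w, prev, op.name
                ));
            }
        }
    }
    out
}
```

Because the scan sees only a linear sequence, it cannot tell which writes and reads belong to the same loop scope, which is the source of the false positives described above and the reason the result is advisory.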

11.8 Where this chapter fits in the bigger picture

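The oracle is a small tool (roughly 600 lines of Rust, two smoke kernels, one bash script), but it embodies a theme that recurs throughout this book: bringing Rust's type system into the accelerator toolchain turns hidden correctness failures into compile-time errors. Chapter 4 did this at the kernel-source level; Chapter 6 did it for the whole MKB corpus; this chapter shows that the same idea applies to the intermediate IR of a vendor's PTO compiler. Because ptoas is a critical link on the 910B2's M-pipeline cube path, catching four real issues early (two errors and two warnings), even on just two hand-written smoke kernels, is worth more than the cost of about 600 lines of code.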


ascend-rs: Memory-Safe NPU Kernel Programming in Rust