7. End-to-End Pipeline Walkthrough
Let’s trace the complete journey from source code to NPU execution during a single cargo run.
7.1 Compilation Phase
```mermaid
graph TD
A["Rust Kernel Source<br/>kernels/src/lib.rs"] -->|"rustc + rustc_codegen_mlir"| B["Rust MIR<br/>Type-checked, monomorphized"]
B -->|"builder_methods.rs:<br/>MIR ops → MLIR ops"| C["MLIR Modules<br/>LLVM · Arith · CF dialects<br/>hacc.entry attribute"]
C -->|"compile_ascend.rs:<br/>merge all modules"| D["Merged MLIR<br/>kernel code + ascend_std deps"]
D -->|"mlir_to_cpp"| E["Generated C++<br/>AscendC class with TBuf,<br/>DataCopy, ReduceMax, Exp, ..."]
E --> F["ascend_compile crate<br/>Target abstraction · Validation<br/>Bisheng invocation · C ABI + CLI"]
F -->|"310P: --cce-aicore-arch=dav-m200"| G["NPU Binary · kernel.acl.o<br/>Ascend 310P machine code"]
F -->|"910B: --cce-aicore-arch=dav-c220"| H["NPU Binary · kernel.acl.o<br/>Ascend 910B machine code<br/>(413 tests verified)"]
```
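The last step of the diagram, where one generated C++ source is compiled for two different chips, can be sketched as a small target abstraction. This is a minimal illustration, not the actual `ascend_compile` API: the `AscendTarget` enum and its method names are hypothetical, and only the two `--cce-aicore-arch` values are taken from the diagram above.

```rust
use std::str::FromStr;

/// Hypothetical target enum; the real `ascend_compile` types may look different.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AscendTarget {
    Ascend310P,
    Ascend910B,
}

impl AscendTarget {
    /// AI Core architecture flag, as listed in the compilation diagram.
    pub fn aicore_arch_flag(self) -> &'static str {
        match self {
            AscendTarget::Ascend310P => "--cce-aicore-arch=dav-m200",
            AscendTarget::Ascend910B => "--cce-aicore-arch=dav-c220",
        }
    }
}

impl FromStr for AscendTarget {
    type Err = String;

    /// Parse the short chip names used in the diagram ("310P", "910B").
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "310P" => Ok(AscendTarget::Ascend310P),
            "910B" => Ok(AscendTarget::Ascend910B),
            other => Err(format!("unsupported target: {other}")),
        }
    }
}
```

Keeping the chip-specific flag behind a single `match` means an unknown target fails early with an error instead of reaching the compiler invocation.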
7.1.1 The ascend_compile Compilation Hub
The ascend_compile crate (crates/ascend_compile/) is a standalone compilation library that decouples kernel compilation from the rustc_codegen_mlir backend. Any C++ kernel generator — ascend-rs’s own MLIR-to-C++ pipeline, the PyPTO / PTO-MLIR path we integrate against today, or future frontends such as TileLang, Triton, or PyTorch — can use it to compile AscendC kernels:
```mermaid
graph TD
A1["ascend-rs<br/>Rust→MLIR→C++"] --> E["AscendC C++ kernel source"]
A5["PyPTO / PTO-MLIR<br/>mlir_to_pto → ptoas<br/>(integrated)"] ==> E
A2["TileLang<br/>Python DSL→AscendC (planned)"] -.-> E
A3["Triton<br/>GPU kernel compiler (planned)"] -.-> E
A4["PyTorch<br/>torch.compile (planned)"] -.-> E
E --> F["ascend_compile<br/><br/>Rust API · C ABI · CLI · Python<br/><br/>3 validation passes<br/>Dual flag paths · 310P + 910B<br/>Object or shared library output"]
F --> G["NPU Binary · .o / .so"]
```
PyPTO is not a future plan — it is the tile-level path we already ship. The mlir_to_pto backend in rustc_codegen_mlir emits PTO-MLIR (pto.tmatmul, pto.tadd, pto.tstore_fp, cube-unit placement via PlanMemoryPass), which is lowered by ptoas 0.26 (CANN 8.5.0) into AscendC C++ and then handed to ascend_compile. Concretely, on Ascend 910B2:
- PTO softmax passes on-device with max_err 1.86e-9 (matching hand-tuned AscendC);
- The four DeepSeek-R1-Distill-Qwen-1.5B decode matmuls run on emitter-built PTO and beat aclnnMatmul by 1.75–2.98×, lifting end-to-end decode from 53.4 → 72.4 tok/s, then to 114–187 tok/s after f16 / fused / cached-executor work (see Chapter 10);
- The PTO safety oracle (pto_to_rust, tag pto_checks) catches stage-2 placement bugs that ptoas itself accepts with rc=0 (Chapter 11).
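To put the decode throughput figures in perspective, the end-to-end speedups over the 53.4 tok/s baseline can be computed directly. The helper below is only arithmetic on the numbers quoted above; the function and constant names are illustrative.

```rust
/// Baseline end-to-end decode throughput quoted in the text (tok/s).
pub const BASELINE_TOK_S: f64 = 53.4;

/// Ratio of improved throughput to baseline throughput.
pub fn speedup(baseline: f64, improved: f64) -> f64 {
    improved / baseline
}

/// End-to-end speedups for the three throughput figures quoted in the text.
pub fn reported_speedups() -> [(&'static str, f64); 3] {
    [
        ("PTO matmuls (72.4 tok/s)", speedup(BASELINE_TOK_S, 72.4)),
        ("later work, low end (114 tok/s)", speedup(BASELINE_TOK_S, 114.0)),
        ("later work, high end (187 tok/s)", speedup(BASELINE_TOK_S, 187.0)),
    ]
}
```

The first step works out to roughly 1.36× end to end, and the later range to roughly 2.1–3.5×. The first figure is smaller than the 1.75–2.98× per-matmul gain, which is what one would expect if the matmuls account for only part of the decode time.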
The bold edge from PyPTO / PTO-MLIR is therefore not a planned integration — it is the path through which our most performant 910B2 kernels reach the device today. Dashed edges remain planned frontends.
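The decoupling described in this section, where any frontend that can emit AscendC C++ feeds the same compilation hub, can be sketched with a trait. Everything here is hypothetical: `KernelFrontend`, `StaticFrontend`, and `compile_with_hub` are illustrative names, and the stand-in hub only checks that the source is non-empty rather than running real validation passes or invoking a compiler.

```rust
/// Hypothetical abstraction: anything that can produce AscendC C++ source.
pub trait KernelFrontend {
    fn name(&self) -> &str;
    fn emit_ascendc(&self) -> String;
}

/// Simple frontend that returns a pre-generated source string.
/// It stands in for paths such as Rust→MLIR→C++ or mlir_to_pto → ptoas.
pub struct StaticFrontend {
    pub name: &'static str,
    pub source: String,
}

impl KernelFrontend for StaticFrontend {
    fn name(&self) -> &str {
        self.name
    }

    fn emit_ascendc(&self) -> String {
        self.source.clone()
    }
}

/// Stand-in for the compilation hub. It accepts any frontend, rejects empty
/// source, and reports what it would hand to the compiler.
pub fn compile_with_hub(frontend: &dyn KernelFrontend) -> Result<String, String> {
    let src = frontend.emit_ascendc();
    if src.trim().is_empty() {
        return Err(format!("{}: empty kernel source", frontend.name()));
    }
    Ok(format!(
        "{}: {} bytes of AscendC source accepted",
        frontend.name(),
        src.len()
    ))
}
```

Because the hub depends only on the trait, adding a planned frontend would not require changing the compilation code itself.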
7.2 Runtime Phase
```mermaid
graph TD
subgraph Host["Host CPU"]
    H1["Acl::new()"] --> H2["Device::new"]
    H2 --> H3["AclContext"]
    H3 --> H4["AclStream"]
    H4 --> H5["DeviceBuffer::from_slice()"]
    H5 --> H6["kernel.launch()"]
    H6 --> H7["stream.sync()"]
    H7 --> H8["z_device.to_host()"]
    H8 --> H9["Verify results"]
    H9 --> H10["RAII Drop · auto-clean"]
end
subgraph Device["NPU Device"]
    D1["AI Core 0<br/>block_idx=0<br/>Process x 0..8"]
    D2["AI Core 1<br/>block_idx=1<br/>Process x 8..16"]
    D3["Device Memory<br/>x: Input A · y: Input B<br/>z: Output = A * B"]
end
H4 -.->|"stream binds"| D3
H5 -.->|"Host → Device copy"| D3
H6 -.->|"Kernel execution"| D1
H6 -.->|"Kernel execution"| D2
H7 -.->|"Completion signal"| Device
H8 -.->|"Device → Host transfer"| D3
H10 -.->|"Resources freed"| Device
```
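The work split shown in the diagram, where AI Core 0 handles elements 0..8 and AI Core 1 handles 8..16, can be mimicked on the host. This is a sketch that assumes an even split of the input across blocks; `block_range` and `simulate_mul` are illustrative helpers, not part of ascend-rs, and the loop over blocks runs sequentially here whereas on the device each block runs on its own core.

```rust
use std::ops::Range;

/// Element range handled by one block, assuming `total` divides evenly.
pub fn block_range(block_idx: usize, block_dim: usize, total: usize) -> Range<usize> {
    assert!(block_dim > 0 && total % block_dim == 0, "uneven split not handled");
    let per_block = total / block_dim;
    let start = block_idx * per_block;
    start..start + per_block
}

/// Host-side simulation of z = x * y, computed one block at a time.
pub fn simulate_mul(x: &[f32], y: &[f32], block_dim: usize) -> Vec<f32> {
    assert_eq!(x.len(), y.len(), "inputs must have the same length");
    let mut z = vec![0.0f32; x.len()];
    for block_idx in 0..block_dim {
        // Each iteration plays the role of one AI Core.
        for i in block_range(block_idx, block_dim, x.len()) {
            z[i] = x[i] * y[i];
        }
    }
    z
}
```

With 16 elements and two blocks, `block_range(0, 2, 16)` yields `0..8` and `block_range(1, 2, 16)` yields `8..16`, matching the two AI Core nodes in the diagram.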
7.3 Memory Safety Guarantees
Throughout this process, ascend-rs provides the following safety guarantees, most of them checked at compile time:
| Safety Issue | C++ Approach | ascend-rs Approach |
|---|---|---|
| Device memory leak | Manual aclrtFree | Drop on DeviceBuffer<T> |
| Wrong deallocation order | Programmer convention | Lifetimes prevent it at compile time |
| Use of a stream after it is freed | No check | Compile error |
| Sending an unsafe type to the device | No check | DeviceSend trait bound |
| Forgetting to synchronize | Silent data corruption | Not enforced yet; the type system can be extended to enforce it |
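Three of the rows above (Drop-based cleanup, lifetime-ordered deallocation, and the DeviceSend bound) can be illustrated with mock types. This is a simplified sketch and not the ascend-rs implementation: `MockContext`, `MockBuffer`, and the logging are hypothetical stand-ins, and the `DeviceSend` trait here only borrows the name from the table.

```rust
use std::cell::RefCell;
use std::marker::PhantomData;
use std::rc::Rc;

/// Marker trait loosely modeled on the DeviceSend bound in the table.
/// It is `unsafe` because the implementer promises the type is plain data.
pub unsafe trait DeviceSend: Copy {}
unsafe impl DeviceSend for f32 {}
unsafe impl DeviceSend for u32 {}

/// Shared log used to observe the order of cleanup in this mock.
pub type Log = Rc<RefCell<Vec<String>>>;

pub struct MockContext {
    log: Log,
}

impl MockContext {
    pub fn new(log: Log) -> Self {
        MockContext { log }
    }

    /// The returned buffer borrows the context, so it cannot outlive it.
    pub fn alloc<'ctx, T: DeviceSend>(&'ctx self, data: &[T]) -> MockBuffer<'ctx, T> {
        self.log.borrow_mut().push(format!("alloc {}", data.len()));
        MockBuffer {
            host_copy: data.to_vec(),
            log: Rc::clone(&self.log),
            _ctx: PhantomData,
        }
    }
}

impl Drop for MockContext {
    fn drop(&mut self) {
        self.log.borrow_mut().push("destroy context".to_string());
    }
}

pub struct MockBuffer<'ctx, T: DeviceSend> {
    host_copy: Vec<T>,
    log: Log,
    _ctx: PhantomData<&'ctx MockContext>,
}

impl<'ctx, T: DeviceSend> Drop for MockBuffer<'ctx, T> {
    /// Plays the role of the manual free call in the C++ column.
    fn drop(&mut self) {
        self.log.borrow_mut().push(format!("free {}", self.host_copy.len()));
    }
}

/// Allocate a buffer inside a scope and return the recorded event order.
pub fn demo() -> Vec<String> {
    let log: Log = Rc::new(RefCell::new(Vec::new()));
    {
        let ctx = MockContext::new(Rc::clone(&log));
        let _buf = ctx.alloc(&[1.0f32, 2.0, 3.0]);
        // `_buf` was declared after `ctx`, so it is dropped first.
        // `ctx.alloc(&[String::new()])` would not compile: String is not DeviceSend.
    }
    let events = log.borrow().clone();
    events
}
```

In this mock the buffer is always freed before its context is destroyed, and trying to return the buffer from the inner scope while the context is dropped there would be rejected by the borrow checker.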