
# 7. End-to-End Pipeline Walkthrough

Let’s trace the complete journey from source code to NPU execution during a single `cargo run`.

## 7.1 Compilation Phase

```mermaid
graph TD
    A["Rust Kernel Source<br/>kernels/src/lib.rs"] -->|"rustc + rustc_codegen_mlir"| B["Rust MIR<br/>Type-checked, monomorphized"]
    B -->|"builder_methods.rs:<br/>MIR ops → MLIR ops"| C["MLIR Modules<br/>LLVM · Arith · CF dialects<br/>hacc.entry attribute"]
    C -->|"compile_ascend.rs:<br/>merge all modules"| D["Merged MLIR<br/>kernel code + ascend_std deps"]
    D -->|"mlir_to_cpp"| E["Generated C++<br/>AscendC class with TBuf,<br/>DataCopy, ReduceMax, Exp, ..."]
    E --> F["ascend_compile crate<br/>Target abstraction · Validation<br/>Bisheng invocation · C ABI + CLI"]
    F -->|"310P: --cce-aicore-arch=dav-m200"| G["NPU Binary · kernel.acl.o<br/>Ascend 310P machine code"]
    F -->|"910B: --cce-aicore-arch=dav-c220"| H["NPU Binary · kernel.acl.o<br/>Ascend 910B machine code<br/>(413 tests verified)"]
```
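The two Bisheng invocations above differ only in the `--cce-aicore-arch` flag. A minimal sketch of that target-to-flag mapping (the `NpuTarget` enum and function name are hypothetical illustrations; the flag strings are the ones shown in the diagram):

```rust
/// Hypothetical target enum; the flag values come from the
/// Bisheng invocations in the diagram above.
#[derive(Clone, Copy, Debug)]
enum NpuTarget {
    Ascend310P, // dav-m200
    Ascend910B, // dav-c220
}

/// Map a target to its AI Core architecture flag.
fn aicore_arch_flag(target: NpuTarget) -> &'static str {
    match target {
        NpuTarget::Ascend310P => "--cce-aicore-arch=dav-m200",
        NpuTarget::Ascend910B => "--cce-aicore-arch=dav-c220",
    }
}

fn main() {
    println!("{}", aicore_arch_flag(NpuTarget::Ascend910B));
    // → --cce-aicore-arch=dav-c220
}
```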

### 7.1.1 The `ascend_compile` Compilation Hub

The `ascend_compile` crate (`crates/ascend_compile/`) is a standalone compilation library that decouples kernel compilation from the `rustc_codegen_mlir` backend. Any C++ kernel generator — ascend-rs’s own MLIR-to-C++ pipeline, the PyPTO / PTO-MLIR path we integrate against today, or future frontends such as TileLang, Triton, or PyTorch — can use it to compile AscendC kernels:

```mermaid
graph TD
    A1["ascend-rs<br/>Rust→MLIR→C++"] --> E["AscendC C++ kernel source"]
    A5["PyPTO / PTO-MLIR<br/>mlir_to_pto → ptoas<br/>(integrated)"] ==> E
    A2["TileLang<br/>Python DSL→AscendC (planned)"] -.-> E
    A3["Triton<br/>GPU kernel compiler (planned)"] -.-> E
    A4["PyTorch<br/>torch.compile (planned)"] -.-> E
    E --> F["ascend_compile<br/><br/>Rust API · C ABI · CLI · Python<br/><br/>3 validation passes<br/>Dual flag paths · 310P + 910B<br/>Object or shared library output"]
    F --> G["NPU Binary · .o / .so"]
```

PyPTO is not a future plan — it is the tile-level path we already ship. The `mlir_to_pto` backend in `rustc_codegen_mlir` emits PTO-MLIR (`pto.tmatmul`, `pto.tadd`, `pto.tstore_fp`, cube-unit placement via `PlanMemoryPass`), which is lowered by `ptoas` 0.26 (CANN 8.5.0) into AscendC C++ and then handed to `ascend_compile`. Concretely, on Ascend 910B2:

- PTO softmax passes on-device with max_err 1.86e-9 (matching hand-tuned AscendC);
- The four DeepSeek-R1-Distill-Qwen-1.5B decode matmuls run on emitter-built PTO and beat `aclnnMatmul` by 1.75–2.98×, lifting end-to-end decode from 53.4 → 72.4 tok/s, then to 114–187 tok/s after f16 / fused / cached-executor work (see Chapter 10);
- The PTO safety oracle (`pto_to_rust`, tag `pto_checks`) catches stage-2 placement bugs that `ptoas` itself accepts with rc=0 (Chapter 11).

The bold edge from PyPTO / PTO-MLIR is therefore not a planned integration — it is the path through which our most performant 910B2 kernels reach the device today. Dashed edges remain planned frontends.
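As a back-of-the-envelope check on the throughput figures cited above, the end-to-end uplift can be computed directly from the numbers in the text (the helper function is purely illustrative):

```rust
/// Ratio of two throughput measurements (tok/s).
fn speedup(after: f64, before: f64) -> f64 {
    after / before
}

fn main() {
    // 53.4 → 72.4 tok/s once the PTO matmuls land:
    println!("PTO matmuls: {:.2}x", speedup(72.4, 53.4)); // ≈ 1.36x
    // 114–187 tok/s after f16 / fused / cached-executor work:
    println!(
        "full stack:  {:.2}x–{:.2}x",
        speedup(114.0, 53.4),
        speedup(187.0, 53.4)
    ); // ≈ 2.13x–3.50x
}
```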

## 7.2 Runtime Phase

```mermaid
graph TD
    subgraph Host["Host CPU"]
        H1["Acl::new()"] --> H2["Device::new"]
        H2 --> H3["AclContext"]
        H3 --> H4["AclStream"]
        H4 --> H5["DeviceBuffer::from_slice()"]
        H5 --> H6["kernel.launch()"]
        H6 --> H7["stream.sync()"]
        H7 --> H8["z_device.to_host()"]
        H8 --> H9["Verify results"]
        H9 --> H10["RAII Drop · auto-clean"]
    end
    subgraph Device["NPU Device"]
        D1["AI Core 0<br/>block_idx=0<br/>Process x 0..8"]
        D2["AI Core 1<br/>block_idx=1<br/>Process x 8..16"]
        D3["Device Memory<br/>x: Input A · y: Input B<br/>z: Output = A * B"]
    end
    H4 -.->|"stream binds"| D3
    H5 -.->|"Host → Device copy"| D3
    H6 -.->|"Kernel execution"| D1
    H6 -.->|"Kernel execution"| D2
    H7 -.->|"Completion signal"| Device
    H8 -.->|"Device → Host transfer"| D3
    H10 -.->|"Resources freed"| Device
```
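The per-core work split in the diagram (AI Core 0 handles x[0..8], AI Core 1 handles x[8..16]) is the usual `block_idx`-based partitioning. A self-contained sketch of that index arithmetic (the function name is hypothetical; ascend-rs’s actual launch API is not shown here):

```rust
/// Compute the half-open range of elements a given AI core processes,
/// assuming the total length divides evenly across cores, as in the
/// 16-element, 2-core example from the diagram.
fn block_range(block_idx: usize, block_num: usize, total: usize) -> std::ops::Range<usize> {
    let per_core = total / block_num;
    let start = block_idx * per_core;
    start..start + per_core
}

fn main() {
    for idx in 0..2 {
        // AI Core 0 → x[0..8], AI Core 1 → x[8..16]
        println!("AI Core {idx}: x[{:?}]", block_range(idx, 2, 16));
    }
}
```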

## 7.3 Memory Safety Guarantees

Throughout this process, ascend-rs provides the following compile-time safety guarantees:

| Safety Issue | C++ Approach | ascend-rs Approach |
|---|---|---|
| Device memory leak | Manual `aclrtFree` | `Drop` on `DeviceBuffer<T>` |
| Wrong deallocation order | Programmer convention | Lifetime system prevents at compile time |
| Use-after-free stream | No check | Compile error |
| Send unsafe type to device | No check | `DeviceSend` trait bound |
| Forgetting to synchronize | Silent data corruption | Type system extensible to enforce |
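The `Drop`-based cleanup in the first row can be illustrated with a mock: the `DeviceBuffer` below is a stand-in that tracks live allocations with a counter, whereas the real type would release device memory (via `aclrtFree`) in its `Drop` impl. The point is that cleanup runs at scope exit whether or not the programmer remembers it:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Mock bookkeeping: how many "device" allocations are still live.
static LIVE_ALLOCS: AtomicUsize = AtomicUsize::new(0);

/// Mock device buffer: acquire in `from_slice`, guaranteed release in `Drop`.
struct DeviceBuffer<T> {
    _data: Vec<T>, // stands in for device memory
}

impl<T: Clone> DeviceBuffer<T> {
    fn from_slice(host: &[T]) -> Self {
        LIVE_ALLOCS.fetch_add(1, Ordering::SeqCst);
        DeviceBuffer { _data: host.to_vec() }
    }
}

impl<T> Drop for DeviceBuffer<T> {
    fn drop(&mut self) {
        // The real implementation would call aclrtFree here.
        LIVE_ALLOCS.fetch_sub(1, Ordering::SeqCst);
    }
}

fn main() {
    {
        let _x = DeviceBuffer::from_slice(&[1.0f32, 2.0]);
        let _y = DeviceBuffer::from_slice(&[3.0f32, 4.0]);
        assert_eq!(LIVE_ALLOCS.load(Ordering::SeqCst), 2);
    } // both buffers dropped here; no manual free needed
    assert_eq!(LIVE_ALLOCS.load(Ordering::SeqCst), 0);
    println!("all device buffers freed");
}
```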