
3. Going Deeper: Writing NPU Kernels in Rust

Hello World demonstrated host-side safety. But ascend-rs has a bigger vision: using Rust on the device side too. This means writing NPU kernel code in Rust, not C++.

Let’s walk through a complete vector multiplication (vec_mul) example to demonstrate this.

3.1 The Rust Kernel

This is the Rust code that runs on the NPU:

// kernels/src/lib.rs

// Key: #![no_core] indicates a completely bare-metal environment
#![feature(no_core)]
#![no_std]
#![no_core]

/// Element-wise vector multiplication: z[i] = x[i] * y[i]
///
/// #[ascend_std::aiv_kernel] marks this function as an NPU kernel entry point
#[ascend_std::aiv_kernel]
pub unsafe fn mul(x: *const u16, y: *const u16, z: *mut u16) {
    unsafe {
        // Total elements = 16, divide work evenly across parallel blocks
        let block_size = 16usize / ascend_std::get_block_num();
        let start = ascend_std::get_block_idx() * block_size;
        let mut i = start;
        loop {
            // Multiply element-wise and write to output
            *z.wrapping_add(i) = *x.wrapping_add(i) * *y.wrapping_add(i);

            i = i + 1;
            if i == block_size + start {
                break;
            }
        }
    }
}

Several things worth noting about this code:

#![no_core] environment: The NPU has no operating system or standard library. ascend_std provides a minimal reimplementation of Rust’s core traits and types (Copy, Clone, Add, Mul, etc.) so that Rust code can compile in a bare-metal environment.

#[ascend_std::aiv_kernel]: This attribute macro marks the function as an AIV (AI Vector core) kernel entry point. It expands to #[unsafe(no_mangle)] (so the host can look up the symbol by name) and #[ascend::aiv_kernel] (so the MLIR codegen backend recognizes it and adds the hacc.entry attribute).

NPU parallel model: Similar to CUDA’s block/thread model, the Ascend NPU uses blocks and sub-blocks to organize parallel computation. get_block_idx() and get_block_num() provide execution context so the kernel knows which data slice to process.
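The index arithmetic in mul can be checked with ordinary host-side Rust. Here is a minimal sketch (the helper name is invented for illustration; block_num and block_idx stand in for get_block_num() and get_block_idx()) of how two blocks split the 16 elements:

```rust
// Host-side sketch of the kernel's work partitioning (hypothetical helper,
// not part of ascend_std).
fn block_range(total: usize, block_num: usize, block_idx: usize) -> std::ops::Range<usize> {
    // Same arithmetic as the kernel: an even split, assuming
    // `total` is divisible by `block_num`.
    let block_size = total / block_num;
    let start = block_idx * block_size;
    start..start + block_size
}
```

With total = 16 and block_num = 2 (matching the host's block_dim = 2), block 0 processes indices 0..8 and block 1 processes 8..16, so the blocks cover the output exactly once with no overlap.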

3.2 The Host Code

The host code handles data transfer, kernel loading, and result verification:

// src/main.rs
use ascend_rs::prelude::*;

fn main() -> anyhow::Result<()> {
    // ── Phase 1: Initialization ──
    let acl = Acl::new()?;
    let device = Device::new(&acl)?;
    let context = AclContext::new(&device)?;
    let stream = AclStream::new(&context)?;

    // ── Phase 2: Data preparation ──
    let x_host = common::read_buf_from_file::<u16>("test_data/input_x.bin");
    let y_host = common::read_buf_from_file::<u16>("test_data/input_y.bin");

    // Allocate device memory with HugeFirst policy (prefer huge pages for TLB efficiency)
    let mut x_device = DeviceBuffer::from_slice_with_policy(
        x_host.as_slice(), AclrtMemMallocPolicy::HugeFirst
    )?;
    let mut y_device = DeviceBuffer::from_slice_with_policy(
        y_host.as_slice(), AclrtMemMallocPolicy::HugeFirst
    )?;
    let mut z_device = unsafe {
        DeviceBuffer::<u16>::uninitialized_with_policy(
            x_host.len(), AclrtMemMallocPolicy::HugeFirst
        )?
    };

    // ── Phase 3: Kernel execution ──
    unsafe {
        // KernelLoader loads NPU binary from build.rs compilation artifacts
        let kernel_loader = KernelLoader::new()?;

        // Get kernel handle by symbol name "mul"
        let kernel = kernel_loader.get_kernel("mul")?;

        // Launch kernel with 2 parallel blocks
        let block_dim: u32 = 2;
        let mut args = [
            x_device.as_mut_ptr() as *mut _,
            y_device.as_mut_ptr() as *mut _,
            z_device.as_mut_ptr() as *mut _,
        ];
        kernel.launch(block_dim, &stream, &mut args)?;
    }

    // ── Phase 4: Synchronize and verify ──
    stream.synchronize()?;
    let res = z_device.to_host()?;

    for (idx, elem) in res.iter().enumerate() {
        let expected = x_host[idx].wrapping_mul(y_host[idx]);
        assert_eq!(*elem, expected);
    }

    Ok(())
}
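The common::read_buf_from_file helper used in Phase 2 is not shown. Assuming the .bin files hold raw little-endian u16 values, a minimal stand-in could look like this (the name and file format are assumptions, not the crate's actual implementation):

```rust
use std::fs;

// Hypothetical stand-in for common::read_buf_from_file::<u16>:
// read a raw binary file and reinterpret its bytes as little-endian u16s.
fn read_buf_from_file(path: &str) -> Vec<u16> {
    let bytes = fs::read(path).expect("failed to read input file");
    assert!(bytes.len() % 2 == 0, "file length must be a multiple of 2 bytes");
    bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect()
}
```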

3.3 The Build System

build.rs bridges the Rust toolchain and the CANN compiler:

// build.rs
use ascend_rs_builder::KernelBuilder;
use std::path::PathBuf;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    println!("cargo:rerun-if-changed=kernels");
    ascend_rs_builder::add_ascend_link_args()?;

    let out_path = PathBuf::from(std::env::var("OUT_DIR").unwrap());
    let kernel = out_path.join("kernel.o");

    // Detects "kernels" is a directory → triggers Rust kernel compilation pipeline
    KernelBuilder::new("kernels").copy_to(&kernel).build()?;
    Ok(())
}

When KernelBuilder detects the input is a directory (containing Cargo.toml), it:

  1. Runs cargo build targeting nvptx64-nvidia-cuda
  2. Specifies -Zcodegen-backend=rustc_codegen_mlir for the custom codegen backend
  3. The backend translates Rust MIR to MLIR
  4. The mlir_to_cpp pass converts MLIR into C++ source with AscendC API calls (DMA, vector ops, pipe barriers)
  5. Invokes bisheng (CANN C++ compiler) to compile the generated C++ into NPU binary (.acl.o)
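Steps 1–2 amount to a nested cargo invocation. A sketch of how such a command might be assembled (the target triple and backend flag come from the steps above; the exact mechanism KernelBuilder uses may differ):

```rust
use std::process::Command;

// Assemble the nested `cargo build` for the kernel crate. Passing the
// codegen-backend flag through RUSTFLAGS is an assumption for illustration.
fn kernel_build_command(kernel_dir: &str) -> Command {
    let mut cmd = Command::new("cargo");
    cmd.current_dir(kernel_dir)
        .arg("build")
        .arg("--target")
        .arg("nvptx64-nvidia-cuda")
        .env("RUSTFLAGS", "-Zcodegen-backend=rustc_codegen_mlir");
    cmd
}
```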

Steps 4–5 are key: although CANN includes bishengir-compile (an MLIR-native compiler for 910B), the production pipeline uses the mlir_to_cpp path for all targets (both 310P and 910B). This C++ codegen approach provides access to the full AscendC feature set — DMA operations via DataCopy, TPipe infrastructure, and vector intrinsics. When the Rust kernel calls functions like ascend_reduce_max_f32, the mlir_to_cpp pass recognizes these in the MLIR and emits the corresponding AscendC vector operations (ReduceMax, Exp, etc.). All 522 tests passing on 910B3 hardware use this path.