Appendix F: Performance Benchmarks

This appendix provides an interactive comparison of kernel performance between hand-optimized AscendC C++ reference kernels and Rust kernels generated by ascend-rs, across the Ascend 310P and 910B3 NPU targets.

Methodology

  • Wall-clock timing: clock_gettime(CLOCK_MONOTONIC) around kernel launch + aclrtSynchronizeStream
  • Iterations: 1 warmup + 10 timed, median reported
  • Compilation: Both C++ and Rust kernels compiled with bisheng at -O2
  • Ratio: Rust time / C++ time (< 1.0 = Rust is faster)
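The procedure above (warmup, repeated timing, median, ratio) can be sketched in a few lines of Rust. This is a minimal illustration, not the benchmark harness itself: `median_time_ms`, `median`, and `ratio` are hypothetical names, the closure stands in for "kernel launch + aclrtSynchronizeStream", and `std::time::Instant` is used as a monotonic clock in place of a direct `clock_gettime(CLOCK_MONOTONIC)` call. Averaging the two middle samples for an even iteration count is also an assumption; the source only says the median is reported.

```rust
use std::time::Instant;

/// Median of a set of samples. For an even count, the two middle values
/// are averaged (an assumption; the source does not specify this).
fn median(samples: &mut [f64]) -> f64 {
    assert!(!samples.is_empty(), "need at least one sample");
    samples.sort_by(|a, b| a.total_cmp(b));
    let n = samples.len();
    if n % 2 == 1 {
        samples[n / 2]
    } else {
        (samples[n / 2 - 1] + samples[n / 2]) / 2.0
    }
}

/// Runs `launch_and_sync` for `warmup_iters` untimed iterations, then
/// `timed_iters` timed iterations, and returns the median in milliseconds.
/// The closure is a placeholder for launching a kernel and waiting for
/// the stream to finish.
fn median_time_ms<F: FnMut()>(
    mut launch_and_sync: F,
    warmup_iters: usize,
    timed_iters: usize,
) -> f64 {
    for _ in 0..warmup_iters {
        launch_and_sync();
    }
    let mut samples = Vec::with_capacity(timed_iters);
    for _ in 0..timed_iters {
        let start = Instant::now(); // monotonic wall-clock timestamp
        launch_and_sync();
        samples.push(start.elapsed().as_secs_f64() * 1e3);
    }
    median(&mut samples)
}

/// Ratio as defined in the methodology: Rust time / C++ time.
/// Values below 1.0 mean the Rust kernel was faster.
fn ratio(rust_ms: f64, cpp_ms: f64) -> f64 {
    rust_ms / cpp_ms
}
```

With the methodology's settings, a call would look like `median_time_ms(run_kernel, 1, 10)`, where `run_kernel` is whatever closure launches the kernel and synchronizes the stream. Because the timed region includes launch and synchronization, the reported figures are end-to-end latencies rather than pure on-device execution time.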

Interactive Results

Note: If the interactive table does not render (e.g., in PDF), see the static table below.

Static Summary

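| Kernel  | Size  | Target | C++ (ms) | Rust (ms) | Ratio |
|---------|-------|--------|----------|-----------|-------|
| relu    | 256   | 310P   | 0.078    | 0.075     | 0.96x |
| relu    | 1024  | 310P   | 0.075    | 0.076     | 1.01x |
| relu    | 4096  | 310P   | 0.075    | 0.076     | 1.01x |
| relu    | 16384 | 310P   | 0.083    | 0.083     | 1.00x |
| sigmoid | 256   | 310P   | 0.075    | 0.075     | 1.00x |
| sigmoid | 1024  | 310P   | 0.075    | 0.074     | 0.99x |
| sigmoid | 4096  | 310P   | 0.077    | 0.077     | 1.00x |
| sigmoid | 16384 | 310P   | 0.086    | 0.086     | 1.00x |
| softmax | 256   | 310P   | 0.078    | 0.077     | 0.99x |
| softmax | 1024  | 310P   | 0.077    | 0.076     | 0.99x |
| softmax | 4096  | 310P   | 0.079    | 0.079     | 1.00x |
| softmax | 16384 | 310P   | 0.089    | 0.087     | 0.98x |
| tanh    | 256   | 310P   | 0.075    | 0.077     | 1.03x |
| tanh    | 1024  | 310P   | 0.075    | 0.076     | 1.01x |
| tanh    | 4096  | 310P   | 0.076    | 0.078     | 1.03x |
| tanh    | 16384 | 310P   | 0.085    | 0.086     | 1.01x |
| gelu    | 256   | 910B3  | 0.023    | 0.019     | 0.83x |
| gelu    | 1024  | 910B3  | 0.022    | 0.019     | 0.86x |
| gelu    | 4096  | 910B3  | 0.023    | 0.019     | 0.83x |
| gelu    | 16384 | 910B3  | 0.024    | 0.023     | 0.96x |
| relu    | 256   | 910B3  | 0.030    | 0.030     | 1.00x |
| relu    | 1024  | 910B3  | 0.028    | 0.028     | 1.00x |
| relu    | 4096  | 910B3  | 0.029    | 0.026     | 0.90x |
| relu    | 16384 | 910B3  | 0.029    | 0.031     | 1.07x |
| sigmoid | 256   | 910B3  | 0.028    | 0.028     | 1.00x |
| sigmoid | 1024  | 910B3  | 0.028    | 0.024     | 0.86x |
| sigmoid | 4096  | 910B3  | 0.029    | 0.028     | 0.97x |
| sigmoid | 16384 | 910B3  | 0.029    | 0.030     | 1.03x |
| softmax | 256   | 910B3  | 0.031    | 0.032     | 1.03x |
| softmax | 1024  | 910B3  | 0.031    | 0.031     | 1.00x |
| softmax | 4096  | 910B3  | 0.021    | 0.021     | 1.00x |
| tanh    | 256   | 910B3  | 0.029    | 0.030     | 1.03x |
| tanh    | 1024  | 910B3  | 0.028    | 0.026     | 0.93x |
| tanh    | 4096  | 910B3  | 0.028    | 0.028     | 1.00x |
| tanh    | 16384 | 910B3  | 0.029    | 0.030     | 1.03x |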

Benchmarks collected on Ascend 910B3 and 310P hardware. Auto-generated from kernels.db.
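For readers who want a single figure per target, one common way to aggregate ratios is the geometric mean. The sketch below is only an illustration of that idea and is not part of the published benchmark tooling: the `Row` struct, `sample_rows`, and `geomean_ratio` are hypothetical names, and the four rows are copied from the static summary as sample input.

```rust
/// One row of the static summary table (field names are illustrative).
struct Row {
    kernel: &'static str,
    size: u32,
    target: &'static str,
    cpp_ms: f64,
    rust_ms: f64,
}

/// A few rows copied from the static summary, used as sample input.
fn sample_rows() -> Vec<Row> {
    vec![
        Row { kernel: "relu", size: 256, target: "310P", cpp_ms: 0.078, rust_ms: 0.075 },
        Row { kernel: "relu", size: 1024, target: "310P", cpp_ms: 0.075, rust_ms: 0.076 },
        Row { kernel: "gelu", size: 256, target: "910B3", cpp_ms: 0.023, rust_ms: 0.019 },
        Row { kernel: "gelu", size: 1024, target: "910B3", cpp_ms: 0.022, rust_ms: 0.019 },
    ]
}

/// Geometric mean of Rust/C++ ratios for one target, computed in log
/// space. Returns `None` if no rows match the target.
fn geomean_ratio(rows: &[Row], target: &str) -> Option<f64> {
    let logs: Vec<f64> = rows
        .iter()
        .filter(|r| r.target == target)
        .map(|r| (r.rust_ms / r.cpp_ms).ln())
        .collect();
    if logs.is_empty() {
        return None;
    }
    Some((logs.iter().sum::<f64>() / logs.len() as f64).exp())
}
```

The geometric mean is chosen here because it treats a 0.5x and a 2.0x ratio symmetrically, which an arithmetic mean of ratios does not. Since the published timings are rounded to three decimal places, any aggregate computed from them is approximate.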