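Appendix F: Performance Benchmarks

This appendix provides an interactive performance comparison between AscendC C++ kernels (hand-optimized reference kernels) and ascend-rs kernels (generated from Rust) across different NPU targets.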

Methodology

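  • Wall-clock timing: clock_gettime(CLOCK_MONOTONIC) around the kernel launch + aclrtSynchronizeStream
  • Iterations: 1 warm-up run + 10 timed runs; the median is reported
  • Compilation: both the C++ and Rust kernels are compiled with bisheng at -O2
  • Ratio: Rust time / C++ time (< 1.0 = Rust is faster)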
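
The measurement procedure above can be sketched roughly as follows. This is a minimal illustration, not the harness actually used for these numbers: `time_kernel_ms`, `now_ms`, `median`, and `ratio` are hypothetical helper names, and the `launch_and_sync` callable is a stand-in for the real kernel launch followed by aclrtSynchronizeStream, so the sketch compiles without the Ascend toolkit. With an even number of samples the median is taken here as the mean of the two middle values, which is an assumption; the source only says "median".

```cpp
#include <time.h>      // clock_gettime, CLOCK_MONOTONIC (POSIX)
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Current monotonic time in milliseconds.
static double now_ms() {
    timespec ts{};
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return static_cast<double>(ts.tv_sec) * 1e3 + static_cast<double>(ts.tv_nsec) / 1e6;
}

// Median of the samples. For an even count this averages the two middle
// values (an assumption; the appendix does not specify the convention).
static double median(std::vector<double> v) {
    if (v.empty()) return 0.0;
    std::sort(v.begin(), v.end());
    const std::size_t n = v.size();
    return (n % 2 == 1) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

// Hypothetical harness: `launch_and_sync` stands in for
// "launch the kernel, then call aclrtSynchronizeStream".
// Runs `warmup` untimed iterations, then `timed` timed ones,
// and returns the median wall-clock time in milliseconds.
static double time_kernel_ms(const std::function<void()>& launch_and_sync,
                             int warmup = 1, int timed = 10) {
    for (int i = 0; i < warmup; ++i) launch_and_sync();

    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(timed > 0 ? timed : 0));
    for (int i = 0; i < timed; ++i) {
        const double t0 = now_ms();
        launch_and_sync();
        samples.push_back(now_ms() - t0);
    }
    return median(samples);
}

// Ratio as defined above: Rust time / C++ time (< 1.0 means Rust is faster).
static double ratio(double rust_ms, double cpp_ms) {
    return rust_ms / cpp_ms;
}
```

Timing each kernel with `time_kernel_ms` and passing the two medians to `ratio` would yield one value per row of the kind reported in the table. Wrapping the synchronize call inside the timed region matters because the kernel launch is asynchronous; without it only the enqueue cost would be measured.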

Interactive Results

Note: If the interactive table does not render (for example, in a PDF), see the static table below.

Static Summary

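| Kernel  | Size  | Target | C++ (ms) | Rust (ms) | Ratio |
|---------|-------|--------|----------|-----------|-------|
| relu    | 256   | 310P   | 0.078    | 0.075     | 0.96x |
| relu    | 1024  | 310P   | 0.075    | 0.076     | 1.01x |
| relu    | 4096  | 310P   | 0.075    | 0.076     | 1.01x |
| relu    | 16384 | 310P   | 0.083    | 0.083     | 1.00x |
| sigmoid | 256   | 310P   | 0.075    | 0.075     | 1.00x |
| sigmoid | 1024  | 310P   | 0.075    | 0.074     | 0.99x |
| sigmoid | 4096  | 310P   | 0.077    | 0.077     | 1.00x |
| sigmoid | 16384 | 310P   | 0.086    | 0.086     | 1.00x |
| softmax | 256   | 310P   | 0.078    | 0.077     | 0.99x |
| softmax | 1024  | 310P   | 0.077    | 0.076     | 0.99x |
| softmax | 4096  | 310P   | 0.079    | 0.079     | 1.00x |
| softmax | 16384 | 310P   | 0.089    | 0.087     | 0.98x |
| tanh    | 256   | 310P   | 0.075    | 0.077     | 1.03x |
| tanh    | 1024  | 310P   | 0.075    | 0.076     | 1.01x |
| tanh    | 4096  | 310P   | 0.076    | 0.078     | 1.03x |
| tanh    | 16384 | 310P   | 0.085    | 0.086     | 1.01x |
| gelu    | 256   | 910B3  | 0.023    | 0.019     | 0.83x |
| gelu    | 1024  | 910B3  | 0.022    | 0.019     | 0.86x |
| gelu    | 4096  | 910B3  | 0.023    | 0.019     | 0.83x |
| gelu    | 16384 | 910B3  | 0.024    | 0.023     | 0.96x |
| relu    | 256   | 910B3  | 0.030    | 0.030     | 1.00x |
| relu    | 1024  | 910B3  | 0.028    | 0.028     | 1.00x |
| relu    | 4096  | 910B3  | 0.029    | 0.026     | 0.90x |
| relu    | 16384 | 910B3  | 0.029    | 0.031     | 1.07x |
| sigmoid | 256   | 910B3  | 0.028    | 0.028     | 1.00x |
| sigmoid | 1024  | 910B3  | 0.028    | 0.024     | 0.86x |
| sigmoid | 4096  | 910B3  | 0.029    | 0.028     | 0.97x |
| sigmoid | 16384 | 910B3  | 0.029    | 0.030     | 1.03x |
| softmax | 256   | 910B3  | 0.031    | 0.032     | 1.03x |
| softmax | 1024  | 910B3  | 0.031    | 0.031     | 1.00x |
| softmax | 4096  | 910B3  | 0.021    | 0.021     | 1.00x |
| tanh    | 256   | 910B3  | 0.029    | 0.030     | 1.03x |
| tanh    | 1024  | 910B3  | 0.028    | 0.026     | 0.93x |
| tanh    | 4096  | 910B3  | 0.028    | 0.028     | 1.00x |
| tanh    | 16384 | 910B3  | 0.029    | 0.030     | 1.03x |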

Benchmarks were collected on Ascend 910B3 and 310P hardware. This page is auto-generated from kernels.db.