Appendix J: Reproducible Step-by-Step Examples

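This appendix walks you through six complete, runnable ascend-rs examples, starting from scratch. Each example includes complete source code, exact build and run commands, expected terminal output, and, where available, a capture from real hardware, so that anyone with an Ascend NPU can reproduce the results in this book.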

Prerequisites

Hardware and Software Requirements

Requirement	Minimum	Tested with
Ascend NPU	Ascend 310P / 910B	Ascend 310P3, Ascend 910B2
CANN	8.1.RC1	8.1.RC1 (310P), 8.5.0 (910B)
Rust toolchain	nightly-2025-05-01	nightly-2025-08-04
Operating system	Linux aarch64 / x86_64	Ubuntu 22.04 aarch64
Driver	≥ 24.1	Bundled with CANN

One-Time Environment Setup

# 1. Clone the repository
git clone https://github.com/ascend-rs/ascend-rs
cd ascend-rs

# 2. Initialize the CANN environment (adjust to your actual install path)
source /usr/local/Ascend/ascend-toolkit/latest/bin/setenv.bash
# Or, for a standalone CANN 8.5 install:
# source /usr/local/Ascend/cann-8.5.0/set_env.sh

# 3. Set the target SoC (adjust to your hardware)
export ACLRS_SOC_VERSION=Ascend310P3   # 310P
# export ACLRS_SOC_VERSION=Ascend910B2  # 910B2
# export ACLRS_SOC_VERSION=Ascend910_9392  # older 910 (9392 variant)

# 4. Verify that the NPU is visible
npu-smi info

Expected npu-smi info output (310P example):

+-------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc2                 Version: 24.1.rc2                                       |
+------------------+-------------------+-------------------------------------------------+
| NPU   Name       | Health            | Power(W)  Temp(C)   HBM-Usage(MB) Aicore(%)     |
| Chip             |                   | Bus-Id                                           |
+==================+===================+=================================================+
| 0     310P3      | OK                | 14         42       372 / 8192    0              |
| 0                |                   | 0000:82:00.0                                     |
+------------------+-------------------+-------------------------------------------------+
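If you script this check, the Health column can be read from the table programmatically. The following is a minimal Rust sketch that assumes the exact layout shown above (pipe-separated cells, with the NPU id and name in the first cell and Health in the second). `parse_npu_health` is a hypothetical helper written for this appendix; it is not part of ascend-rs or npu-smi, and other driver versions may print a different layout.

```rust
/// Parses one device row of `npu-smi info` output into (name, health).
/// Assumes the layout shown above: `| <id> <name> | <health> | ... |`.
/// Returns None for separator, header, and continuation rows.
fn parse_npu_health(line: &str) -> Option<(String, String)> {
    let cells: Vec<&str> = line.split('|').map(str::trim).collect();
    // A row that starts with '|' yields an empty first element after splitting.
    if cells.len() < 4 || !cells[0].is_empty() {
        return None;
    }
    let mut first = cells[1].split_whitespace();
    let id = first.next()?;
    let name = first.next()?;
    // Device rows start with a numeric NPU id; the header row does not.
    id.parse::<u32>().ok()?;
    if cells[2].is_empty() {
        return None;
    }
    Some((name.to_string(), cells[2].to_string()))
}
```

Feeding each line of the captured output through this function and keeping the `Some` results gives one `(name, health)` pair per device, which a setup script can compare against `"OK"`.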

Example 1: Hello World - ACL Device Initialization

The simplest ascend-rs program: initialize the ACL runtime, open a device, create a context and a stream, print the device descriptor, and exit. This step verifies that the driver, CANN, and the Rust toolchain work together.

Source Code

examples/acl_hello_world/src/main.rs:

use anyhow::Result;
use ascend_rs::prelude::*;
use log::info;
use simple_logger::SimpleLogger;

fn main() -> Result<()> {
    SimpleLogger::new().env().init().ok();

    // Each RAII wrapper acquires its resource on construction and releases it on drop.
    // The compiler enforces correct lifetime nesting: Acl outlives Device,
    // which outlives AclContext, which outlives AclStream.
    let acl     = Acl::new()?;
    let device  = Device::new(&acl)?;
    let context = AclContext::new(&device)?;
    let stream  = AclStream::new(&context)?;

    info!("Device {} initialized successfully", device.descriptor());
    info!("Context handle: {:p}", context.as_ptr());
    info!("Stream  handle: {:p}", stream.as_ptr());

    // When the variables go out of scope, resources are released in reverse order.
    Ok(())
}

Build and Run

# From the repository root:
cd examples/acl_hello_world

RUST_LOG=info cargo run --release

Expected Output

2026-03-31T09:14:02Z INFO  [acl_hello_world] Device Ascend310P3 initialized successfully
2026-03-31T09:14:02Z INFO  [acl_hello_world] Context handle: 0x55a7b2c30010
2026-03-31T09:14:02Z INFO  [acl_hello_world] Stream  handle: 0x55a7b2c30080

The device name (Ascend310P3, Ascend910B2, and so on) corresponds to the SoC set in ACLRS_SOC_VERSION. If you see Device startup failed, the driver is not running; check that the device's Health in npu-smi info is OK.

Screenshot (real 310P hardware)

$ cd examples/acl_hello_world && RUST_LOG=info cargo run --release
   Compiling acl_hello_world v0.1.0
    Finished `release` profile [optimized] target(s) in 3.2s
     Running `target/release/acl_hello_world`
2026-03-31T09:14:02Z INFO  [acl_hello_world] Device Ascend310P3 initialized successfully
2026-03-31T09:14:02Z INFO  [acl_hello_world] Context handle: 0x55a7b2c30010
2026-03-31T09:14:02Z INFO  [acl_hello_world] Stream  handle: 0x55a7b2c30080

Reading the output:

"Device Ascend310P3 initialized successfully": the ACL runtime found the device, and the CANN driver stack is working.
The Context and Stream handles are non-null kernel objects allocated by the driver; they are released automatically when main returns.
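The reverse-order release described above is ordinary Rust drop semantics, and it can be observed without any NPU. The sketch below uses a hypothetical `Resource` type as a stand-in for the real wrappers (it is not the ascend-rs API): each child borrows its parent, so the borrow checker rejects any program in which a parent would be dropped first, and a shared log records the order in which the mock resources are released.

```rust
use std::cell::RefCell;
use std::marker::PhantomData;
use std::rc::Rc;

type Log = Rc<RefCell<Vec<&'static str>>>;

/// Hypothetical stand-in for an RAII wrapper such as Device or AclStream.
/// The lifetime ties each resource to the parent it was created from.
struct Resource<'parent> {
    name: &'static str,
    log: Log,
    _parent: PhantomData<&'parent ()>,
}

impl Resource<'static> {
    /// A root resource (the analogue of the runtime handle) has no parent.
    fn root(name: &'static str, log: Log) -> Self {
        Resource { name, log, _parent: PhantomData }
    }
}

impl<'parent> Resource<'parent> {
    /// The child borrows `self`, so it cannot outlive its parent.
    fn child<'a>(&'a self, name: &'static str) -> Resource<'a> {
        Resource { name, log: self.log.clone(), _parent: PhantomData }
    }
}

impl Drop for Resource<'_> {
    /// "Releases" the resource by recording its name in the shared log.
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.name);
    }
}

/// Creates acl, device, context, stream in that order and returns the release order.
fn release_order() -> Vec<&'static str> {
    let log: Log = Rc::new(RefCell::new(Vec::new()));
    {
        let acl = Resource::root("acl", log.clone());
        let device = acl.child("device");
        let context = device.child("context");
        let _stream = context.child("stream");
        // Locals are dropped in reverse declaration order at the end of this scope.
    }
    let order = log.borrow().clone();
    order
}
```

Calling `release_order()` yields `["stream", "context", "device", "acl"]`. Inserting `drop(acl)` before the end of the inner scope would fail to compile, because `device` still borrows `acl`.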

Example 2: Vector Softmax - Running a Rust Kernel on Real Hardware

This example runs the complete softmax kernel from Chapter 4 on real NPU hardware: 1024 f32 elements pass through max → exp → sum → divide on the NPU vector pipeline, and the result is checked against a CPU reference.

Source Code

Kernel (examples/bench_softmax_rs/kernels/src/lib.rs):

#![feature(no_core)]
#![no_std]
#![no_core]

/// Vectorized row softmax kernel.
///
/// Uses ascend_std vector intrinsics, which the mlir_to_cpp backend translates
/// into AscendC DataCopy / ReduceMax / Exp / Muls / ReduceSum calls.
#[ascend_std::aiv_kernel]
pub unsafe fn softmax(input: *const f32, output: *mut f32, len_buf: *const u32) {
    unsafe {
        let n = *len_buf;

        // Allocate temporary tiles in the Unified Buffer (UB)
        let in_buf  = ascend_std::ascend_buf_alloc(n);
        let out_buf = ascend_std::ascend_buf_alloc(n);
        let work    = ascend_std::ascend_buf_alloc(n);
        let rwork   = ascend_std::ascend_buf_alloc(n);

        // DMA: global memory → UB
        ascend_std::ascend_buf_load_f32(in_buf, input, n);
        ascend_std::ascend_pipe_barrier();  // wait for the Mte2 engine

        // Numerically stable softmax: subtract the max before taking exp
        let max_val = ascend_std::ascend_reduce_max_f32(work, in_buf, rwork, n);
        ascend_std::ascend_adds_f32(out_buf, in_buf, 0.0f32 - max_val, n);
        ascend_std::ascend_exp_f32(out_buf, out_buf, n);
        let sum_val = ascend_std::ascend_reduce_sum_f32(work, out_buf, rwork, n);
        ascend_std::ascend_muls_f32(out_buf, out_buf, 1.0f32 / sum_val, n);

        // DMA: UB → global memory
        ascend_std::ascend_pipe_barrier();
        ascend_std::ascend_buf_store_f32(output, out_buf, n);
    }
}
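The reason the kernel subtracts the maximum before calling exp can be reproduced on the CPU. The sketch below is plain Rust with no NPU dependency; `softmax_naive` and `softmax_stable` are illustrative names, not part of ascend_std. In f32, exp(x) overflows to infinity once x exceeds roughly 88.7, so the naive form produces NaN for large inputs, while the shifted form only ever exponentiates values that are at most zero.

```rust
/// Naive softmax: exp(x) overflows to +inf in f32 once x exceeds about 88.7,
/// after which inf / inf yields NaN.
fn softmax_naive(x: &[f32]) -> Vec<f32> {
    let exps: Vec<f32> = x.iter().map(|v| v.exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Stable softmax, following the same step order as the kernel above:
/// reduce max, add -max, exp, reduce sum, multiply by 1/sum.
fn softmax_stable(x: &[f32]) -> Vec<f32> {
    let max = x.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = x.iter().map(|v| (v - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let inv = 1.0 / sum;
    exps.iter().map(|e| e * inv).collect()
}
```

For the input `[1000.0, 1001.0, 1002.0]`, `softmax_naive` returns only NaN values, whereas `softmax_stable` returns a valid distribution that sums to 1. The benchmark input in this example stays within [-3, 3], so the shift is not strictly needed there, but it keeps the kernel correct for arbitrary inputs.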

Host (examples/bench_softmax_rs/src/main.rs, condensed):

use ascend_rs::prelude::*;

fn main() -> anyhow::Result<()> {
    let acl     = Acl::new()?;
    let device  = Device::new(&acl)?;
    let context = AclContext::new(&device)?;
    let stream  = AclStream::new(&context)?;

    let n: u32 = 1024;
    let input: Vec<f32> = (0..n as usize)
        .map(|i| ((i as f32) * 0.01).sin() * 3.0)
        .collect();

    // Copy the input to the device; allocate the output and length buffers
    let mut d_input  = DeviceBuffer::from_slice(&input)?;
    let mut d_output = unsafe { DeviceBuffer::<f32>::uninitialized(n as usize)? };
    let mut d_len    = DeviceBuffer::from_slice(&[n])?;

    // Load and launch the kernel (1 block)
    let kernel_loader = KernelLoader::new()?;
    let kernel = kernel_loader.get_kernel("softmax")?;
    let mut args: [*mut std::ffi::c_void; 3] = [
        d_input.as_mut_ptr() as *mut _,
        d_output.as_mut_ptr() as *mut _,
        d_len.as_mut_ptr() as *mut _,
    ];
    unsafe { kernel.launch(1, &stream, &mut args)?; }
    stream.synchronize()?;

    // Sanity check: softmax outputs must sum to 1
    // (the full benchmark additionally compares against a CPU reference)
    let output = d_output.to_host()?;
    let sum: f32 = output.iter().sum();
    println!("sum = {:.6}  (expected ≈ 1.0)", sum);
    println!("output[0..4] = {:?}", &output[..4]);

    Ok(())
}

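Build and Run

cd examples/bench_softmax_rs

# Build the kernel (triggers the CANN compilation pipeline):
#   Rust source → MLIR → C++ (mlir_to_cpp) → bisheng → .acl.o
RUST_LOG=info cargo run --release -- --csv /tmp/softmax_results.csv

On the first build, the kernel compilation step (bisheng) takes about 5 seconds; later builds reuse the cargo cache.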

Expected Output

2026-03-31T09:15:44Z INFO  [bench_softmax_rs] Device Ascend310P3 initialized
2026-03-31T09:15:44Z INFO  [bench_softmax_rs] Running softmax benchmark
size=256   pass=true  max_err=1.22e-8  sum=1.000000  rust_vec=0.077ms
size=1024  pass=true  max_err=8.34e-9  sum=1.000000  rust_vec=0.076ms
size=4096  pass=true  max_err=7.11e-9  sum=1.000000  rust_vec=0.079ms
size=16384 pass=true  max_err=6.89e-9  sum=1.000000  rust_vec=0.087ms

Screenshot (real 310P hardware, full benchmark comparison)

$ RUST_LOG=info cargo run --release -- --csv /tmp/softmax_results.csv
   Compiling bench_softmax_rs v0.1.0
    Finished `release` profile [optimized] target(s) in 8.4s
     Running `target/release/bench_softmax_rs --csv /tmp/softmax_results.csv`
2026-03-31T09:15:44Z INFO  [bench_softmax_rs] Device Ascend310P3 initialized
2026-03-31T09:15:44Z INFO  [bench_softmax_rs] size=256   rust_vec=0.077ms  pass=true  max_err=1.22e-8
2026-03-31T09:15:44Z INFO  [bench_softmax_rs] size=1024  rust_vec=0.076ms  pass=true  max_err=8.34e-9
2026-03-31T09:15:44Z INFO  [bench_softmax_rs] size=4096  rust_vec=0.079ms  pass=true  max_err=7.11e-9
2026-03-31T09:15:44Z INFO  [bench_softmax_rs] size=16384 rust_vec=0.087ms  pass=true  max_err=6.89e-9
CSV written to /tmp/softmax_results.csv

Run the full comparison (Rust and C++ side by side):

# From the repository root:
cd benchmarks/softmax
bash bench.sh

=== Softmax benchmark ===
--- Rust softmax benchmark ---
size=16384  rust_scalar=2.221ms  rust_vec=0.087ms  pass=true
--- C++ softmax benchmark ---
size=16384  cpp_naive=2.073ms    cpp_opt=0.089ms    pass=true

Performance summary (16384 elements):
  Rust vector vs C++ optimized:  0.087ms vs 0.089ms  → Rust is 1.02x faster
  Vector vs scalar speedup:      25.5x
  Correctness: all sizes PASS (max_err < 1e-8)

How the Compilation Pipeline Works

The intermediate files from each compilation step are kept under kernels/target/ and can be inspected:

kernels/target/davinci-huawei-none/release/deps/
├── softmax_kernels.mlir              ← MLIR emitted by rustc codegen
├── softmax_kernels.mlir.acl.gen.cpp  ← C++ generated by mlir_to_cpp
└── softmax_kernels.acl.o             ← NPU object file produced by bisheng

The generated C++ (acl.gen.cpp) shows the AscendC API call that each Rust intrinsic maps to:

// Generated from ascend_std::ascend_exp_f32(out_buf, out_buf, n)
Exp(out_buf_local, out_buf_local, n);
pipe_barrier(PIPE_V);

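Example 3: Tile Softmax - The PTO Compile Path on the Ascend 910B

This example demonstrates the newer PTO (Programmable Tile Operations) compile path, which targets the matrix pipeline of the Ascend 910B (dav-c220). The Tile API expresses computation as 2-D tile operations such as tile_load, tile_softmax, and tile_store, and compiles through ptoas (the PTO assembler) rather than the standard C++ path.

This is the most advanced of the first three examples and requires an Ascend 910B device with ptoas installed. It exercises the full pipeline:

Rust Tile API  →  MLIR  →  PTO-MLIR  →  ptoas  →  CCE C++  →  ccec  →  .acl.o

Source Code

Kernel (examples/tile_softmax/kernels/src/lib.rs):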

#![feature(no_core)]
#![no_std]
#![no_core]

use ascend_std::tile::{tile_load_f32, tile_softmax_f32, tile_store_f32, Tile};

/// Row-wise softmax over a ROWS × COLS f32 tile.
///
/// The Tile API is a 2-D abstraction over the NPU vector engine:
/// - `tile_load_f32`    → PTO `tload` (DMA: global memory → UB tile)
/// - `tile_softmax_f32` → a sequence of PTO reduction ops: trowmax → trowexpandsub →
///                        texp → trowsum → trowexpanddiv
/// - `tile_store_f32`   → PTO `tstore` (DMA: UB tile → global memory)
///
/// The `ptoas --enable-insert-sync` flag automatically inserts
/// set_flag / wait_flag barriers between tile operations.
#[ascend_std::aiv_kernel]
pub unsafe fn tile_softmax(input: *const f32, output: *mut f32) {
    let block_idx = ascend_std::get_block_idx() as usize;
    let offset = block_idx * 1 * 1024;  // ROWS=1, COLS=1024

    // Load the tile from global memory
    let t_in: Tile<1, 1024, f32> =
        tile_load_f32::<1, 1024>(input.wrapping_add(offset));

    // Compute softmax: max → shift → exp → sum → divide
    let t_out: Tile<1, 1024, f32> = tile_softmax_f32::<1, 1024>(t_in);

    // Store the result back to global memory
    tile_store_f32::<1, 1024>(output.wrapping_add(offset), t_out);
}

Host (examples/tile_softmax/src/main.rs, condensed):

use ascend_rs::prelude::*;

fn main() -> anyhow::Result<()> {
    const ROWS: usize = 1;
    const COLS: usize = 1024;

    let acl     = Acl::new()?;
    let device  = Device::new(&acl)?;
    let context = AclContext::new(&device)?;
    let stream  = AclStream::new(&context)?;

    // Sine-wave input, easy to verify visually
    let input: Vec<f32> = (0..ROWS * COLS)
        .map(|i| ((i as f32) * 0.01).sin() * 3.0)
        .collect();

    let mut d_input  = DeviceBuffer::from_slice(&input)?;
    let mut d_output = unsafe { DeviceBuffer::<f32>::uninitialized(ROWS * COLS)? };

    let kernel_loader = KernelLoader::new()?;
    let kernel = kernel_loader.get_kernel("tile_softmax")?;
    let mut args: [*mut std::ffi::c_void; 2] = [
        d_input.as_mut_ptr() as *mut _,
        d_output.as_mut_ptr() as *mut _,
    ];
    unsafe { kernel.launch(1, &stream, &mut args)?; }  // 1 block
    stream.synchronize()?;

    let output = d_output.to_host()?;
    let sum: f32 = output.iter().sum();
    let max_err = output.iter()
        .zip(softmax_cpu(&input, ROWS, COLS).iter())
        .map(|(a, b)| (a - b).abs())
        .fold(0.0f32, f32::max);

    println!("tile_softmax: max_err={:.4e} sum={:.6} {}",
        max_err, sum,
        if max_err < 1e-5 && (sum - 1.0).abs() < 1e-4 { "PASS" } else { "FAIL" });

    Ok(())
}
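The condensed host listing calls `softmax_cpu(&input, ROWS, COLS)` without showing it. One plausible shape for that helper is sketched below; this is an assumption written for this appendix, not the code in the repository. It treats the input as a row-major ROWS × COLS buffer and accumulates in f64, so that the reference is more precise than the f32 result it is compared against. `max_abs_err` is a hypothetical companion that computes the same quantity as the `max_err` fold in the host code.

```rust
/// Hypothetical CPU reference: row-wise softmax over a row-major rows × cols buffer.
/// Intermediate values use f64 so the reference is more precise than the
/// f32 device result under test.
fn softmax_cpu(input: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert!(cols > 0, "cols must be non-zero");
    assert_eq!(input.len(), rows * cols, "input must hold rows * cols elements");
    let mut out = Vec::with_capacity(input.len());
    for row in input.chunks(cols) {
        // Subtract the row maximum before exp for numerical stability.
        let max = row.iter().copied().fold(f32::NEG_INFINITY, f32::max) as f64;
        let exps: Vec<f64> = row.iter().map(|&v| (v as f64 - max).exp()).collect();
        let sum: f64 = exps.iter().sum();
        out.extend(exps.iter().map(|e| (e / sum) as f32));
    }
    out
}

/// Largest absolute element-wise difference between two buffers.
fn max_abs_err(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0f32, f32::max)
}
```

With these two helpers, the PASS condition in the host code reads as `max_abs_err(&output, &softmax_cpu(&input, ROWS, COLS)) < 1e-5` together with the row-sum check.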

Build and Run

# Required environment (Ascend 910B with CANN 8.5 and ptoas)
export ACLRS_CANN_PATH=/usr/local/Ascend/cann-8.5.0
export ACLRS_SOC_VERSION=Ascend910_9392          # adjust to your SoC
export ACLRS_CODEGEN_PATH=pto                     # enable the PTO path
export ACLRS_PTOAS_PATH=/path/to/ptoas            # path to the ptoas assembler
export ACLRS_PTO_ISA_PATH=/path/to/pto-isa/include  # path to the pto-isa headers
export LD_LIBRARY_PATH=/data/llvm20/lib:${ACLRS_CANN_PATH}/aarch64-linux/lib64:\
/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/driver/lib64/common

source ${ACLRS_CANN_PATH}/set_env.sh
export PATH=${ACLRS_CANN_PATH}/tools/ccec_compiler/bin:$PATH

cd examples/tile_softmax
cargo run --release

Tracing the Compilation Pipeline

The build system prints each step. Set RUST_LOG=debug to see the full commands:

# Step 1: Rust → MLIR (rustc with the custom codegen backend)
rustc --crate-type lib -Z codegen-backend=librustc_codegen_mlir.so ...
  → tile_softmax_kernels.mlir

# Step 2: MLIR → PTO-MLIR (mlir_to_pto.rs)
  → tile_softmax_kernels.acl.pto

# Step 3: PTO-MLIR → CCE C++ (ptoas)
ptoas --enable-insert-sync --pto-arch=a3 tile_softmax_kernels.acl.pto \
      -o tile_softmax_kernels.acl.pto.cpp

# Step 4: CCE C++ → NPU object file (ccec)
ccec -c -O3 -x cce -DMEMORY_BASE --cce-aicore-arch=dav-c220-vec \
     -mllvm -cce-aicore-addr-transform \
     -mllvm -cce-aicore-dcci-insert-for-scalar=false \
     -I/path/to/pto-isa/include \
     tile_softmax_kernels.acl.pto.cpp \
     -o tile_softmax_kernels.acl.o

Intermediate Files

After cargo build --release completes, the PTO-MLIR dialect for the softmax decomposition can be inspected under kernels/target/davinci-huawei-none/release/deps/:

// tile_softmax_kernels.acl.pto: PTO-MLIR dialect (excerpt)
module {
  func.func @ascend_tile_softmax_f32(
      %input:  !pto.ptr<f32>,
      %output: !pto.ptr<f32>) {

    // --- tload: global memory → UB tile ---
    %c0   = arith.constant 0 : index
    %c1   = arith.constant 1 : index
    %cR   = arith.constant 1 : index
    %cC   = arith.constant 1024 : index
    %tv_in = pto.make_tensor_view %input,
               shape=[%cR, %cC] strides=[%cC, %c1]
               : !pto.tensor_view<1x1024xf32>
    %pv_in = pto.partition_view %tv_in,
               offsets=[%c0, %c0], sizes=[%cR, %cC]
               : !pto.tensor_view<1x1024xf32> -> !pto.partition_tensor_view<1x1024xf32>
    %tile_in = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=1024, ...>
    pto.tload ins(%pv_in : ...) outs(%tile_in : ...)

    // --- softmax decomposition ---
    %tmp_max = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=1, ...>
    %row_max = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=1, ...>
    pto.trowmax ins(%tile_in, %tmp_max : ...) outs(%row_max : ...)    // step 1: row max

    %shifted = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=1024, ...>
    pto.trowexpandsub ins(%tile_in, %row_max : ...) outs(%shifted : ...)  // step 2: x - max

    pto.texp ins(%shifted : ...) outs(%shifted : ...)                  // step 3: exp

    %tmp_sum = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=1, ...>
    %row_sum = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=1, ...>
    pto.trowsum ins(%shifted, %tmp_sum : ...) outs(%row_sum : ...)     // step 4: row sum

    %result  = pto.alloc_tile : !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=1024, ...>
    pto.trowexpanddiv ins(%shifted, %row_sum : ...) outs(%result : ...)  // step 5: ÷ sum

    // --- tstore: UB tile → global memory ---
    pto.tstore ins(%result : ...) outs(%pv_out : ...)
    return
  }
}

Expected Output

2026-03-31T18:32:35Z INFO  [tile_softmax] tile_softmax test: ROWS=1, COLS=1024, n=1024
2026-03-31T18:32:35Z INFO  [tile_softmax] Device Ascend910_9392 initialized
2026-03-31T18:32:35Z INFO  [tile_softmax] Launching tile_softmax kernel (1 block, 1×1024 f32)...
2026-03-31T18:32:35Z INFO  [tile_softmax] tile_softmax: max_err=2.38e-7 sum=1.000000 sum_ok=true PASS
2026-03-31T18:32:35Z INFO  [tile_softmax] tile_softmax PASSED

A note on hardware availability: The 910c server used for these tests occasionally enters a hardware fault state (Device startup failed). When that happens, the compilation pipeline still completes successfully; only runtime execution is blocked. The PTO compilation result (a 1960-byte .acl.o file) was manually verified to compile correctly for dav-c220-vec.

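Key Differences from Example 2

	Example 2 (Vector Softmax)	Example 3 (Tile Softmax)
Compile path	`mlir_to_cpp` → `bisheng`	`mlir_to_pto` → `ptoas` → `ccec`
Abstraction level	Vector intrinsics on 1-D buffers (`ascend_reduce_max_f32`)	2-D tile operations (`tile_softmax_f32`)
Target hardware	310P or 910B (vector engine)	910B (dav-c220, a2a3 path)
Intermediate format	AscendC C++	PTO-MLIR dialect
Synchronization barriers	Manual (`ascend_pipe_barrier`)	Inserted automatically by `ptoas --enable-insert-sync`
Parallelism model	1 block, whole-buffer vector operations	1 block, 2-D tile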

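Example 4: Double-Buffered Tile Softmax

This example extends Example 3 to process two tiles in a single launch, using tile_prefetch_f32 so that the Mte2 load of tile 1 overlaps the Vector computation of tile 0 (its softmax). Performance data is in Section 4.7.

Source Code

Kernel (examples/tile_softmax_double_buf/kernels/src/lib.rs):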

#![feature(no_core)]
#![no_std]
#![no_core]

use ascend_std::tile::{
    tile_load_f32, tile_prefetch_f32, tile_softmax_f32, tile_store_f32, Tile,
};

#[ascend_std::aiv_kernel]
pub unsafe fn tile_softmax_double_buf(input: *const f32, output: *mut f32) {
    const ROWS: usize = 1;
    const COLS: usize = 1024;
    const TILE_ELEMS: usize = ROWS * COLS;

    // --- Prologue: issue both loads before any compute starts ---
    let t0: Tile<ROWS, COLS, f32> = tile_load_f32::<ROWS, COLS>(input);
    let t1: Tile<ROWS, COLS, f32> =
        tile_prefetch_f32::<ROWS, COLS>(input.wrapping_add(TILE_ELEMS));

    // --- Compute tile 0 (on hardware, the Mte2 load of t1 can overlap this) ---
    let r0: Tile<ROWS, COLS, f32> = tile_softmax_f32::<ROWS, COLS>(t0);

    // --- Compute tile 1 ---
    let r1: Tile<ROWS, COLS, f32> = tile_softmax_f32::<ROWS, COLS>(t1);

    // --- Store the results ---
    tile_store_f32::<ROWS, COLS>(output, r0);
    tile_store_f32::<ROWS, COLS>(output.wrapping_add(TILE_ELEMS), r1);
}
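The benefit of overlapping the load of tile 1 with the computation of tile 0 can be estimated with a simple two-stage pipeline model. The sketch below is an idealized illustration, not a measurement and not part of ascend-rs: it assumes a fixed load time and compute time per tile, ignores stores and synchronization overhead, and uses hypothetical function names.

```rust
/// Idealized time for `n` tiles when every load and compute runs back to back.
fn serial_time(n: u32, load: f64, compute: f64) -> f64 {
    n as f64 * (load + compute)
}

/// Idealized time when the load of tile i+1 overlaps the compute of tile i.
/// The first load and the last compute are exposed; each of the n-1 steps in
/// between costs the slower of the two stages.
fn overlapped_time(n: u32, load: f64, compute: f64) -> f64 {
    if n == 0 {
        return 0.0;
    }
    load + (n - 1) as f64 * load.max(compute) + compute
}
```

For the two-tile kernel above with, say, one time unit per load and two per softmax, the model gives `serial_time(2, 1.0, 2.0) = 6.0` and `overlapped_time(2, 1.0, 2.0) = 5.0`: the second load is hidden entirely behind the first softmax. When loads dominate instead, compute is the stage that gets hidden, and the saving is bounded by the shorter stage.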

Generated PTO-MLIR

The key difference from Example 3 is that the two loads produce partition_view operations with different row offsets:

// tile 0: load from row 0
%pto1 = pto.partition_view %pto0, offsets = [%c0, %c0], sizes = [%c1, %c1024] : ...
pto.tload ins(%pto1 : ...) outs(%pto2 : ...)

// tile 1: load from row 1 (an offset of 1024 elements is row 1 when cols=1024)
%pto3 = pto.partition_view %pto0, offsets = [%c1, %c0], sizes = [%c1, %c1024] : ...
pto.tload ins(%pto3 : ...) outs(%pto4 : ...)

// softmax(t0): Vector pipeline; Mte2 can overlap with the tload above
pto.trowmax ...
pto.trowexpanddiv ins(...) outs(%pto10 : ...)

// softmax(t1)
pto.trowmax ...
pto.trowexpanddiv ins(...) outs(%pto16 : ...)

// Store: rows 0 and 1 of the output
%pto18 = pto.partition_view %pto17, offsets = [%c0, %c0], ...
pto.tstore ins(%pto10 : ...) outs(%pto18 : ...)
%pto19 = pto.partition_view %pto17, offsets = [%c1, %c0], ...
pto.tstore ins(%pto16 : ...) outs(%pto19 : ...)

Expected Output

2026-04-02T06:14:07Z INFO  [tile_softmax_double_buf] double_buf 2×(1×1024): total avg=0.0068ms min=0.0049ms max=0.0140ms | per-tile avg=0.0034ms min=0.0024ms | max_err=3.26e-9 PASS

Raw data: examples/tile_softmax_double_buf/results/bench_double_buf_910b2_2026-04-02.csv.

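Example 5: Softmax via the Linalg Bridge - Upstream MLIR Running on the 910B2

This example takes the same softmax kernel through the linalg ingress bridge and runs it on real 910B2 hardware. The Rust frontend is not involved at all: the source is a two-line linalg.softmax op in upstream MLIR, the kind of fixture found in the upstream MLIR test suite or extracted from a torch-mlir FX export. See Section 4.7 for background.

Source Code

The complete fixture is two lines of upstream linalg: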

// benchmarks/linalg/kernels_upstream_shape_matched/softmax_upstream_1x1024.mlir
func.func @upstream_softmax_1x1024(%arg0: tensor<1x1024xf32>) -> tensor<1x1024xf32> {
  %0 = tensor.empty() : tensor<1x1024xf32>
  %1 = linalg.softmax dimension(1) ins(%arg0 : tensor<1x1024xf32>)
                                   outs(%0   : tensor<1x1024xf32>) -> tensor<1x1024xf32>
  return %1 : tensor<1x1024xf32>
}

The equivalent form that torch-mlir exports from a 4-line PyTorch script (/tmp/torch_mlir_linalg/dump_simple.py on adablue) is largely the same; see add_tm.mlir, exp_tm.mlir, and silumul_tm.mlir in benchmarks/linalg/kernels_torch_mlir_shape_matched/. The torch-mlir wheel used (torch-mlir-20260421.789) does not export linalg.softmax directly; it lowers to a sequence of linalg.generic reductions, which the bridge handles through the GenericUnaryKind::Exp + GenericBinop matcher added in commit 299de147.

Build and Run

# adablue (host-side build): convert upstream linalg to AscendC C++
cd /home/y00577373/ascend-rs-priv
cargo build -p mlir_to_cpp_tests --release --bin linalg_to_ascendc

crates/mlir_to_cpp_tests/target/release/linalg_to_ascendc \
  benchmarks/linalg/kernels_upstream_shape_matched/softmax_upstream_1x1024.mlir \
  /tmp/sm_upstream.cce

# 910c (NPU-side build and run): after syncing the code, compile to .acl.o and execute
ssh 910c
cd /data/yuyijun/ascend-rs/benchmarks/linalg_bridge_bench
ASCEND_DEVICE_ID=2 cargo run --release

Expected Output (910B2 chip 2, 2026-04-22, 3 repetitions)

[bridge_bench] pair=softmax_1x1024
  ascendrs (hand-written) : min= 4.83 µs  p50= 5.21 µs  mean= 5.34 µs
  upstream linalg (bridge): min= 4.95 µs  p50= 5.27 µs  mean= 5.42 µs
  Δmin= 0.12 µs  Δp50= 0.06 µs  Δmean= 0.08 µs   (each <8%)
  max_err vs CPU reference = 1.86e-9   PASS

[bridge_bench] pair=add_1x1024
  ascendrs : min= 4.18 µs  upstream: min= 4.20 µs  Δ= 0.02 µs   PASS
[bridge_bench] pair=exp_1x1024
  ascendrs : min= 4.46 µs  upstream: min= 4.54 µs  Δ= 0.08 µs   PASS
[bridge_bench] pair=matmul_32x64x32
  ascendrs : min= 1586.1 µs  upstream: min= 1586.4 µs  Δ= 0.3 µs   PASS  (<0.02%)
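The min, p50, and mean columns and the percentage bounds in this report are straightforward to recompute from raw samples. The sketch below shows one way to do so in Rust. `Stats`, `summarize`, and `overhead_percent` are hypothetical names, and the median convention (mean of the two middle samples for an even count) is an assumption; the benchmark's own aggregation code is not shown in this appendix.

```rust
/// Summary of a set of latency samples, in microseconds.
#[derive(Debug, PartialEq)]
struct Stats {
    min: f64,
    p50: f64,
    mean: f64,
}

/// Computes min, median (p50), and mean. Returns None for an empty slice.
fn summarize(samples: &[f64]) -> Option<Stats> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len();
    let p50 = if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    };
    let mean = sorted.iter().sum::<f64>() / n as f64;
    Some(Stats { min: sorted[0], p50, mean })
}

/// Relative overhead of `candidate` over `baseline`, in percent.
fn overhead_percent(baseline: f64, candidate: f64) -> f64 {
    (candidate - baseline) / baseline * 100.0
}
```

Applied to the softmax pair above, `overhead_percent(4.83, 4.95)` is about 2.5%, well inside the stated 8% bound, and `overhead_percent(1586.1, 1586.4)` for matmul is just under 0.02%.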

Byte-Identity Proof

"Byte-identical emit" is the core claim. The host-only test that proves it:

$ cargo test -p mlir_to_cpp_tests --release \
    --test upstream_matches_ascendrs_byte_identical -- --nocapture
running 5 tests
test add_1x1024_byte_identical          ... ok
test exp_1x1024_byte_identical          ... ok
test softmax_1x1024_byte_identical      ... ok
test matmul_32x64x32_byte_identical     ... ok
test silumul_1x1024_byte_identical      ... ok  (CPU side; cannot yet run on the 910B2)
5 passed; 0 failed

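Each test runs linalg_to_ascendc on both kernels_ascendrs/<name>.mlir (the hand-written ascendrs form) and kernels_upstream_shape_matched/<name>_upstream.mlir (upstream linalg), then compares the generated .cce files byte for byte. Zero differing bytes means the bridge is a structural no-op after hop 1; the downstream mlir_to_cpp emitter cannot tell the two apart.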
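The comparison step of such a test reduces to finding the first differing byte between two generated files. A minimal Rust sketch of that check follows; `first_diff` is a hypothetical helper for illustration, and the actual test in mlir_to_cpp_tests may be implemented differently.

```rust
/// Returns the offset of the first differing byte, or None if the two buffers
/// are byte-identical. A length mismatch counts as a difference at the end of
/// the shorter buffer.
fn first_diff(a: &[u8], b: &[u8]) -> Option<usize> {
    if let Some(i) = a.iter().zip(b).position(|(x, y)| x != y) {
        return Some(i);
    }
    if a.len() != b.len() {
        Some(a.len().min(b.len()))
    } else {
        None
    }
}
```

In a test, the two inputs would typically come from `std::fs::read` on the two emitted .cce files, followed by `assert_eq!(first_diff(&a, &b), None)`. Reporting the offset rather than a plain boolean makes a failure much easier to locate in the generated C++.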

Pipeline Diagram

                              Rust path (Examples 2–4)
                              ┌────────────────────────────┐
softmax.rs ── rustc ──┐       │    rustc_codegen_mlir       │
                      │       │           │                 │
                      │       │           ▼                 │
                      │       │       MLIR (LLVM-D)         │
                      │       └─────────────┬───────────────┘
                      │                     │
                      │       Bridge path (this example)
                      │       ┌─────────────────────────────┐
upstream.mlir ────────┴─────► │  linalg_to_ascend_tile      │
torch-mlir.mlir ──────────►   │           │                 │
                              │           ▼                 │
                              │      ascend_tile MLIR       │
                              └─────────────┬───────────────┘
                                            │
                                            ▼   (shared emitter from here on)
                                      mlir_to_cpp
                                            │
                                            ▼
                                       AscendC C++
                                            │
                                            ▼
                                          bisheng
                                            │
                                            ▼
                                         910B2 NPU

The two branches converge at mlir_to_cpp. From that point on, the bytes the hardware sees do not depend on which branch the kernel came from.

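Example 6: A Safety Guard on Softmax - ptoas Says OK, the Guard Says No

The previous five examples all show kernels that work. This example shows a kernel that appears to work (it passes ptoas, ccec, and bisheng) but silently produces wrong results, and shows the guard catching it. The discussion is in §11.3; this section is the runnable demo.

Two Fixtures, One Compiler

Both fixtures are PTO-MLIR .acl.pto files for a 1×1024 f32 softmax. The "good" one is what mlir_to_pto emits from the upstream linalg fixture in Example 5 (or, equivalently, from the Rust tile API kernel in Example 3). The "bad" one is the same file with 48 extra pto.alloc_tile + pto.tload pairs injected before the reduction sequence. Each injected tile is 1×1024 f32 and is never read downstream, yet ptoas's PlanMemoryPass places several of them at the same UB offsets as the live tiles %3 and %11.

# Prepare the two fixtures: generate the bad one
cd /home/y00577373/ascend-rs-priv
python3 blog/mdbook/scripts/ch11_make_bad_softmax.py /tmp/ch11_sm_bad.acl.pto

# The good file is already committed
cp examples/tile_softmax/artifacts/tile_softmax_kernels.acl.pto /tmp/ch11_sm_good.acl.pto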

Both Pass ptoas

PTOAS=/usr/local/bin/ptoas-bin/ptoas   # on adablue: $HOME/ptoas-x86/bin/ptoas

$PTOAS /tmp/ch11_sm_good.acl.pto -o /tmp/good.cpp
echo "good rc=$?"
$PTOAS /tmp/ch11_sm_bad.acl.pto  -o /tmp/bad.cpp
echo "bad  rc=$?"

good rc=0
bad  rc=0

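ptoas accepts both. ccec accepts both. bisheng links both. On the 910B2, the "good" kernel gives max_err=1.86e-9; the "bad" kernel gives garbage that differs on every run, depending on which bytes the dead tiles happened to clobber that time.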

Running Both Through the Guard

PTO_DIFF=/data/yuyijun/ascend-rs/target/release/pto-diff   # or a local build

$PTO_DIFF --from-pto /tmp/ch11_sm_good.acl.pto --ptoas $PTOAS
$PTO_DIFF --from-pto /tmp/ch11_sm_bad.acl.pto  --ptoas $PTOAS

=== /tmp/ch11_sm_good.acl.pto ===
0 errors, 0 warnings  (clean)

=== /tmp/ch11_sm_bad.acl.pto ===
[error] capacity: vec high-water 393216 B exceeds capacity 196608 B
        (on Ascend910B2 (CANN 8.5))
[error] aliasing: tiles `%3` and `%108` overlap at vec offset 0x1000
[error] dead-tile: tile `%108` is written but never read
... (94 more findings) ...
96 errors, 0 warnings

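Exit codes: 3 for the bad fixture, 0 for the good one. It is the same pto-diff binary with the same ptoas underneath; the only difference between the two outcomes is that the guard inspects the MLIR after PlanMemoryPass in a way that ptoas does not.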
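The aliasing and dead-tile findings in the report above boil down to two simple checks on the planned tiles: do two byte ranges intersect, and is a tile ever read. The Rust sketch below illustrates those checks on a hypothetical data model. It is not pto-diff's implementation: in particular it treats every tile as live at the same time, whereas a real checker would also have to consider tile lifetimes, since a memory planner may legitimately reuse an offset once a tile is dead.

```rust
/// Hypothetical view of one planned tile: its name, its UB byte range, and
/// whether any later op reads it. Not pto-diff's real data model.
struct PlannedTile {
    name: &'static str,
    offset: usize,
    size: usize,
    is_read: bool,
}

/// True when the byte ranges [offset, offset + size) of the two tiles intersect.
fn overlaps(a: &PlannedTile, b: &PlannedTile) -> bool {
    a.offset < b.offset + b.size && b.offset < a.offset + a.size
}

/// Collects findings worded like the report above.
/// Simplification: all tiles are assumed to be live simultaneously.
fn check(tiles: &[PlannedTile]) -> Vec<String> {
    let mut findings = Vec::new();
    for (i, a) in tiles.iter().enumerate() {
        for b in &tiles[i + 1..] {
            if overlaps(a, b) {
                findings.push(format!(
                    "aliasing: tiles `{}` and `{}` overlap at vec offset {:#x}",
                    a.name,
                    b.name,
                    a.offset.max(b.offset)
                ));
            }
        }
        if !a.is_read {
            findings.push(format!("dead-tile: tile `{}` is written but never read", a.name));
        }
    }
    findings
}
```

Given a live tile `%3` and a dead tile `%108` that both start at offset 0x1000 with 4096 bytes each, `check` reports one aliasing finding and one dead-tile finding, matching the shape of the second and third lines of the report.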

One-Command Demo Script

Both runs are packaged in blog/mdbook/scripts/ch11_bad_demo.sh, which is also the driver script for the §11.6 demo recording. To reproduce locally:

PTOAS=/usr/local/bin/ptoas-bin/ptoas \
PTO_DIFF=/data/yuyijun/ascend-rs/target/release/pto-diff \
  bash blog/mdbook/scripts/ch11_bad_demo.sh

The Same Comparison on the Linalg Ingress Path

For completeness, here is an end-to-end demo that starts from upstream linalg (rather than hand-written PTO) and exercises both Path A (the projector) and Path C (the full ptoas pipeline):

BIN=crates/mlir_to_cpp_tests/target/release/linalg_to_ascendc
SM=benchmarks/linalg/kernels_upstream_shape_matched/softmax_upstream_1x1024.mlir
ADV=benchmarks/linalg/kernels_adversarial/capacity_overflow_1x131072.mlir

echo "--- clean softmax via Path A ---"
ACLRS_LINALG_SAFETY=path-a $BIN $SM /tmp/clean.cce 2>&1 \
  | grep linalg-safety || echo "(clean - no findings)"

echo "--- adversarial fixture via Path A ---"
ACLRS_LINALG_SAFETY=path-a $BIN $ADV /tmp/adv.cce 2>&1 \
  | grep linalg-safety || echo "(clean)"

echo "--- adversarial fixture via Path C ---"
ACLRS_PTOAS_BIN=$HOME/ptoas-x86/bin/ptoas \
ACLRS_LINALG_SAFETY=path-c $BIN $ADV /tmp/adv.cce 2>&1 \
  | grep linalg-safety || echo "(clean)"

--- clean softmax via Path A ---
(clean - no findings)
--- adversarial fixture via Path A ---
linalg-safety [path-a] [error] capacity: vec high-water 1048576 B exceeds capacity 196608 B
  (on Ascend910B2 (CANN 8.5)) (in `adv_capacity_overflow`)
--- adversarial fixture via Path C ---
linalg-safety [path-c] [error] ptoas: vec overflow, requires 8388608 bits while 1572864 bits avaliable
  (in `adv_capacity_overflow`)

Both paths catch the capacity bug on the same input through different mechanisms, which is the whole reason for giving the bridge two complementary safety layers.
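The two error messages above report the same quantities in different units: Path A prints bytes, while the ptoas message printed by Path C uses bits. The arithmetic can be checked with a few lines of Rust. The helper names below are hypothetical, and reading the 1048576 B high-water mark as two simultaneously live 1×131072 f32 tiles is an assumption based on the fixture name and the reported total, not something the tool output states.

```rust
/// Vec (UB) capacity in bytes, as reported above for Ascend910B2 (CANN 8.5).
const UB_CAPACITY_BYTES: usize = 196_608;

/// Bytes occupied by one rows × cols tile of f32.
fn tile_bytes_f32(rows: usize, cols: usize) -> usize {
    rows * cols * std::mem::size_of::<f32>()
}

/// Sums the tiles that are live at the same time and compares the total to
/// `capacity`. Returns Ok(high_water) if it fits and Err(high_water) otherwise.
fn check_capacity(live_tile_bytes: &[usize], capacity: usize) -> Result<usize, usize> {
    let high_water: usize = live_tile_bytes.iter().sum();
    if high_water > capacity {
        Err(high_water)
    } else {
        Ok(high_water)
    }
}
```

A single 1×131072 f32 tile is 524288 B, so two of them give the 1048576 B that Path A reports. Multiplying by 8 gives the 8388608 bits in the ptoas message, and the 196608 B capacity is the same as its 1572864 bits. By contrast, a 1×1024 f32 tile is only 4096 B.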

Troubleshooting

`Device startup failed`

The NPU driver is not running, or the device is in a fault state. Check:

npu-smi info          # check that Health is OK (not Critical)
npu-smi reset -i 0    # reset device 0 (requires root)

`Could not determine ASCEND_HOME_PATH`

ACLRS_CANN_PATH is not set, or the path does not exist:

export ACLRS_CANN_PATH=/usr/local/Ascend/cann-8.5.0
# Verify that the path exists:
ls $ACLRS_CANN_PATH/tools/ccec_compiler/bin/bisheng

`ptoas assembler not found`

Set ACLRS_PTOAS_PATH to the full path of the ptoas binary:

export ACLRS_PTOAS_PATH=/path/to/ptoas/build/tools/ptoas/ptoas

ptoas is part of the pto-isa project and is needed only for the PTO compile path (Examples 3 and 4) and for the guard demo in Example 6.
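Several of the failures above come down to an unset or empty environment variable, so it can help to check them all at once before building. The following Rust sketch does this for a subset of the variables used in the Example 3 setup. `REQUIRED_VARS` and `missing_vars` are hypothetical names for illustration; ascend-rs's build scripts perform their own checks and may read additional variables.

```rust
/// Variables exported in the Example 3 setup (a subset, chosen for illustration).
const REQUIRED_VARS: [&str; 4] = [
    "ACLRS_CANN_PATH",
    "ACLRS_SOC_VERSION",
    "ACLRS_PTOAS_PATH",
    "ACLRS_PTO_ISA_PATH",
];

/// Returns the names of required variables that are unset or blank.
/// `lookup` abstracts the environment so the check can be exercised without
/// mutating the real process environment.
fn missing_vars<F>(lookup: F) -> Vec<&'static str>
where
    F: Fn(&str) -> Option<String>,
{
    REQUIRED_VARS
        .iter()
        .copied()
        .filter(|&name| lookup(name).map_or(true, |v| v.trim().is_empty()))
        .collect()
}
```

Against the real environment, the call is `missing_vars(|k| std::env::var(k).ok())`; an empty result means all four variables are set. This only checks presence, so a path that is set but does not exist still needs the `ls` check shown above.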

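`ccec PTO compilation failed: set_mask_count does not support target feature`

The wrong --cce-aicore-arch was used. Confirm that:

ACLRS_SOC_VERSION matches your chip
ascend-rs is on the claude_code or main branch (the fixes landed in d45ab4e3 and adbf7294)

`error: definition of type 'bfloat16_t' conflicts with typedef`

Your ccec version already defines bfloat16_t. This was fixed in commit adbf7294; update to the latest branch.

Correctness check fails (`max_err > 1e-5`)

Vector softmax on the 310P: expect max_err on the order of 1e-8 (hardware f32 precision; the sample run in Example 2 ranges from 6.89e-9 to 1.22e-8)
Tile softmax on the 910B: expect max_err < 1e-5 (PTO reduction precision)
Errors outside these ranges may indicate a wrong SoC version setting, which leads to mismatched UB buffer size assumptions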

Overview: Comparing the Three Compile Paths

Example 1: Hello World
  Rust host code  →  cargo build  →  executable  →  ACL runtime  →  NPU device
  (no kernel; pure host/driver interaction)

Example 2: Vector Softmax (mlir_to_cpp path)
  Rust kernel  →  rustc  →  MLIR  →  mlir_to_cpp  →  AscendC C++
               →  bisheng  →  .acl.o  →  KernelLoader  →  NPU execution

Example 3: Tile Softmax (PTO path)
  Rust kernel  →  rustc  →  MLIR  →  mlir_to_pto  →  PTO-MLIR dialect
               →  ptoas  →  CCE C++  →  ccec  →  .acl.o
               →  KernelLoader  →  NPU execution

All three paths share the same host-side runtime (ascend_rs::prelude::*): Acl, Device, AclContext, AclStream, DeviceBuffer, and KernelLoader. The only difference is how the .acl.o kernel binary is produced.

ascend-rs: Memory-Safe NPU Kernel Programming in Rust