Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

English | 中文版

附录 I:性能差异分析

对 998 个 CANN 8.5 kernel 往返性能模式的分析。

ascend-rs 编译流水线 (Rust → MLIR → C++ → bisheng) 相比手写 AscendC C++ 会引入一些特定的代码生成模式。本附录识别这些模式,对其影响进行分级, 并提出可复用的优化方案。

I.1 性能分级

分级数量%描述
EQUIVALENT12112%生成代码与原始 C++ 性能相当
SLOW_1.2X00%因中等影响模式导致约 20% 变慢
SLOW_1.5X00%因高影响模式 (TBuf、barrier) 导致约 50% 变慢
SLOW_2X+00%因多个高影响模式叠加导致 2 倍以上变慢

I.2 性能劣化模式

使用 TBuf 而非 TQue (高)

受影响 kernel: 998/998

问题: 使用 TBuf<VECCALC> 而非 TQue<VECIN/VECOUT>。TBuf 在每个同步点都需要显式的 pipe_barrier(PIPE_ALL),而 TQue 通过硬件 flag 实现细粒度的 pipe 重叠。

修复方案: 生成带 AllocTensor/FreeTensor 生命周期的 TQue<QuePosition::VECIN, depth>,取代 TBuf.Get/TBuf.Get 模式。


PIPE_ALL barrier (整流水线停顿) (高)

受影响 kernel: 998/998

问题: 每次 ascend_pipe_barrier() 都会生成 pipe_barrier(PIPE_ALL),从而同时停顿所有硬件 pipe。原始 C++ 通过 TQue 或有选择的 PIPE_V/PIPE_MTE2 flag 做按 pipe 同步。

修复方案: 用 pipe_barrier(PIPE_V) 做仅计算的同步,用 PIPE_MTE2 做 DMA 同步,或者通过 TQue 完全消除 barrier。


无双缓冲 (高)

受影响 kernel: 998/998

问题: DMA 与计算完全串行化:load→barrier→compute→barrier→store。原始 C++ 通过 TQue depth=2 把第 N+1 个 tile 的 DMA 与第 N 个 tile 的计算重叠。

修复方案: 检测分块循环并生成 depth=2 的 TQue。通过 EnQue/DeQue 在 tile 之间实现 DMA 与计算重叠。


统一最大缓冲区尺寸 (低)

受影响 kernel: 998/998

问题: 所有 TBuf 都被分配相同的最大尺寸 = (UB_SIZE - 8KB) / num_bufs。原始 C++ 会按每个缓冲区的实际数据需求分配大小。当缓冲区使用差异较大时,这会浪费 UB 空间。

修复方案: 在 MLIR 中追踪实际缓冲区使用量,并按比例分配。


标量数学的矢量化变通方案 (中)

受影响 kernel: 1/998

问题: 由于某些 NPU 型号上标量 pipe 会挂死,标量 log/exp/sqrt 操作通过 1KB 临时缓冲区被矢量化。这给每次标量数学操作增加了 DMA 和缓冲区开销。

修复方案: 在支持的型号上使用标量 pipe;在其他型号上,通过批量化标量操作来摊销开销。


I.3 优化机会

barrier 消除机会 (中)

适用 kernel: 998/998

描述: 作用在不同缓冲区上的连续矢量操作之间不需要 barrier。当前 codegen 只要 dirty_bufs 重叠就插入 barrier,但许多操作其实彼此独立。

实现: 在 MLIR 层面实现按缓冲区的 dirty 追踪。仅当同一缓冲区上存在 RAW 冒险 (read-after-write hazard) 时才插入 barrier。


循环展开候选 (低)

适用 kernel: 998/998

描述: 小的固定迭代次数循环 (例如 softmax 的两遍 reduce) 可以展开。当前 codegen 发射的是通用的 while(true) 循环。

实现: 检测已知小迭代次数的循环并将其展开。


算子融合候选 (中)

适用 kernel: 0/998

描述: 作用在同一缓冲区上的连续矢量操作 (例如 Sub→Exp 或 Div→Cast) 可以融合为一条矢量指令,或者至少共享一个 barrier。当前 codegen 把它们各自当作独立操作。

实现: 检测在同一缓冲区上的一元/二元操作链,并将其融合为复合 AscendC 指令。


I.4 可复用的优化计划

基于上述模式分析,三项优化就能为大多数 kernel 弥合性能差距:

优先级 1:TQue 迁移 (约弥合差距的 50%)

在 MLIR→C++ codegen 中将 TBuf<VECCALC> 替换为 TQue<VECIN/VECOUT>。 这样会用基于硬件 flag 的同步替代 PIPE_ALL barrier,并启用双缓冲实现 DMA 与计算重叠。

受影响文件: crates/rustc_codegen_mlir/src/mlir_to_cpp.rs

所需改动:

  1. 把缓冲区声明从 TBuf<TPosition::VECCALC> 改为 TQue<QuePosition::VECIN> / TQue<QuePosition::VECOUT>
  2. inQueue.AllocTensor<T>() / inQueue.DeQue<T>() 替代 tbuf.Get<T>()
  3. 加入 inQueue.EnQue(tensor) / outQueue.FreeTensor(tensor) 生命周期
  4. 用隐式 TQue 同步替代 pipe_barrier(PIPE_ALL)

优先级 2:barrier 消除 (约弥合差距的 20%)

实现按缓冲区的 dirty 追踪,以消除独立矢量操作之间的 barrier。 只有当同一缓冲区上存在 RAW 冒险 (read-after-write hazard) 时才插入 barrier。

当前行为: 任何读取 dirty 缓冲区的矢量操作都会触发 PIPE_ALL。

建议行为: 按缓冲区追踪 dirty 状态。只在以下情况插入 barrier:

  • DMA load 写入缓冲区 B,随后矢量操作读取缓冲区 B
  • 矢量操作写入缓冲区 B,随后 DMA store 读取缓冲区 B
  • 当 buf0 不是 dirty 时,跳过 Add(buf0, buf1, buf2)Mul(buf3, buf0, buf4) 之间的 barrier

优先级 3:算子融合 (约弥合差距的 10%)

把同一缓冲区上的连续矢量操作融合为复合操作:

  • Sub(buf, x, max) → Exp(buf, buf) → 单次 AscendC 调用同时完成 Sub+Exp
  • Muls(buf, buf, scale) → Adds(buf, buf, bias) → MulAdd 复合操作
  • 消除融合操作之间的中间 barrier

I.5 按类别的性能汇总

类别总数EquivalentSlow 1.2xSlow 1.5xSlow 2x+
ops_index1146000
ops_legacy2000000
ops_math1209000
ops_nn1500000
ops_optimizer820000
ops_reduce800000
ops_resize520000
ops_transformer200106000

I.6 逐 Kernel 模式细节

ops_index (114 个 kernel)

  • foreach_gather_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gather_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gather_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scatter_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scatter_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scatter_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scatter_add_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scatter_add_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scatter_add_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scatter_mul_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scatter_mul_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scatter_mul_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_add_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_add_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_add_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_copy_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_copy_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_copy_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_fill_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_fill_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_fill_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_select_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_select_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_select_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_put_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_put_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_put_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_masked_fill_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_masked_fill_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_masked_fill_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_masked_select_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_masked_select_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_masked_select_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_masked_scatter_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_masked_scatter_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_masked_scatter_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_where_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_where_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_where_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_nonzero_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_nonzero_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_nonzero_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sort_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sort_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sort_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_argsort_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_argsort_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_argsort_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_topk_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_topk_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_topk_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_unique_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_unique_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_unique_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_searchsorted_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_searchsorted_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_searchsorted_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bucketize_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bucketize_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bucketize_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_one_hot_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_one_hot_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_one_hot_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_embedding_bag_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_embedding_bag_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_embedding_bag_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cummax_f32 [EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_cummax_f16 [EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_cummax_int32 [EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_cummin_f32 [EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_cummin_f16 [EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_cummin_int32 [EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_scatter_nd_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scatter_nd_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scatter_nd_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gather_nd_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gather_nd_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gather_nd_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_put_accumulate_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_put_accumulate_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_put_accumulate_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_take_along_axis_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_take_along_axis_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_take_along_axis_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_put_along_axis_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_put_along_axis_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_put_along_axis_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bincount_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bincount_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bincount_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scatter_max_f32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_scatter_max_f16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_scatter_max_int32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_scatter_min_f32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_scatter_min_f16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_scatter_min_int32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_gather_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scatter_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_select_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_where_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sort_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_topk_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_masked_fill_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_masked_select_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sort_int64 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_argsort_int64 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_topk_int64 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_unique_int64 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gather_int8 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scatter_int8 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scatter_add_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scatter_mul_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_add_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_copy_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE

ops_legacy (200 个 kernel)

  • foreach_exp_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_exp_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_exp_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_abs_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_abs_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_abs_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_neg_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_neg_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_neg_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sqrt_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sqrt_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sqrt_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rsqrt_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rsqrt_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rsqrt_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_reciprocal_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_reciprocal_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_reciprocal_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_ln_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_ln_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_ln_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_log2_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_log2_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_log2_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_log10_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_log10_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_log10_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_ceil_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_ceil_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_ceil_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_floor_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_floor_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_floor_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_round_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_round_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_round_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_trunc_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_trunc_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_trunc_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sign_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sign_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sign_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_not_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_not_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_not_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bitwise_not_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bitwise_not_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bitwise_not_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logical_not_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logical_not_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logical_not_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_clamp_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_clamp_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_clamp_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_add_list_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_add_list_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_add_list_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sub_list_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sub_list_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sub_list_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_mul_list_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_mul_list_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_mul_list_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_div_list_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_div_list_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_div_list_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_max_list_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_max_list_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_max_list_bf16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_min_list_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_min_list_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_min_list_bf16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_pow_list_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_pow_list_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_pow_list_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_fmod_list_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_fmod_list_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_fmod_list_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bitwise_and_list_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bitwise_and_list_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bitwise_and_list_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bitwise_or_list_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bitwise_or_list_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bitwise_or_list_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bitwise_xor_list_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bitwise_xor_list_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bitwise_xor_list_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logical_and_list_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logical_and_list_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logical_and_list_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logical_or_list_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logical_or_list_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logical_or_list_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_add_scalar_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_add_scalar_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_add_scalar_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sub_scalar_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sub_scalar_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sub_scalar_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_mul_scalar_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_mul_scalar_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_mul_scalar_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_div_scalar_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_div_scalar_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_div_scalar_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_max_scalar_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_max_scalar_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_max_scalar_bf16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_min_scalar_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_min_scalar_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_min_scalar_bf16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_pow_scalar_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_pow_scalar_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_pow_scalar_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_clamp_scalar_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_clamp_scalar_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_clamp_scalar_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_add_list_alpha_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_add_list_alpha_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_add_list_alpha_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sub_list_alpha_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sub_list_alpha_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sub_list_alpha_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_addcmul_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_addcdiv_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_copy_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_zero_inplace_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lerp_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_addcmul_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_addcdiv_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_copy_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_zero_inplace_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lerp_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_addcmul_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_addcdiv_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_copy_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_zero_inplace_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lerp_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • zeros_like_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • ones_like_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • zeros_like_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • ones_like_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • zeros_like_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • ones_like_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • zeros_like_int32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • ones_like_int32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_abs_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_abs_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_abs_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_relu_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_relu_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_relu_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_gelu_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_gelu_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_gelu_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_silu_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_silu_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_silu_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_neg_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_neg_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_neg_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_sign_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_sign_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_sign_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_ceil_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_ceil_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_ceil_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_floor_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_floor_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_floor_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise16b_abs_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise16b_abs_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise16b_abs_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise16b_relu_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise16b_relu_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise16b_relu_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise16b_neg_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise16b_neg_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise16b_neg_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise16b_sign_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise16b_sign_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise16b_sign_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_abs_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_neg_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sign_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bitwise_not_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logical_not_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_clamp_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_add_list_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sub_list_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_mul_list_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_max_list_int32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_abs_int8 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_neg_int8 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bitwise_not_int8 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_clamp_int8 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_add_scalar_int32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sub_scalar_int32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_mul_scalar_int32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_div_scalar_int32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE

ops_math (120 个 kernel)

  • foreach_sin_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sin_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sin_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cos_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cos_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cos_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_tan_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_tan_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_tan_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_asin_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_asin_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_asin_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_acos_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_acos_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_acos_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_atan_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_atan_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_atan_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_atan2_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_atan2_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_atan2_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sinh_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sinh_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sinh_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cosh_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cosh_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cosh_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_tanh_math_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_tanh_math_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_tanh_math_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_asinh_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_asinh_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_asinh_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_acosh_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_acosh_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_acosh_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_atanh_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_atanh_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_atanh_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_erf_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_erf_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_erf_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_erfc_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_erfc_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_erfc_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_erfinv_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_erfinv_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_erfinv_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_expm1_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_expm1_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_expm1_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_log1p_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_log1p_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_log1p_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_softplus_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_softplus_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_softplus_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_digamma_f32 [EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_digamma_f16 [EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_digamma_bf16 [EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lgamma_f32 [EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lgamma_f16 [EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lgamma_bf16 [EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_i0_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_i0_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_i0_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_i1_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_i1_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_i1_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hypot_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hypot_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hypot_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_fma_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_fma_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_fma_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_remainder_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_remainder_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_remainder_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_copysign_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_copysign_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_copysign_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_nextafter_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_nextafter_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_nextafter_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_ldexp_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_ldexp_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_ldexp_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_frexp_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_frexp_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_frexp_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logaddexp_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logaddexp_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logaddexp_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logaddexp2_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logaddexp2_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logaddexp2_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sincos_f32_910b [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sincos_f16_910b [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sincos_bf16_910b [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sincospi_f32_910b [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sincospi_f16_910b [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sincospi_bf16_910b [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_j0_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_j0_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_j0_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_j1_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_j1_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_j1_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_y0_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_y0_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_y0_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_y1_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_y1_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_y1_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_polygamma_f32 [EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_polygamma_f16 [EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_polygamma_bf16 [EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_zeta_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_zeta_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_zeta_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE

ops_nn (150 个 kernel)

  • foreach_relu_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_relu_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_relu_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_relu6_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_relu6_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_relu6_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_leaky_relu_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_leaky_relu_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_leaky_relu_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_prelu_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_prelu_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_prelu_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_elu_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_elu_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_elu_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_selu_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_selu_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_selu_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gelu_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gelu_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gelu_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_fast_gelu_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_fast_gelu_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_fast_gelu_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sigmoid_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sigmoid_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sigmoid_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hardsigmoid_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hardsigmoid_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hardsigmoid_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hardswish_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hardswish_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hardswish_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hardtanh_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hardtanh_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hardtanh_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_silu_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_silu_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_silu_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_mish_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_mish_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_mish_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_softplus_nn_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_softplus_nn_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_softplus_nn_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_softsign_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_softsign_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_softsign_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_tanh_nn_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_tanh_nn_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_tanh_nn_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_celu_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_celu_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_celu_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_glu_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_glu_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_glu_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rrelu_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rrelu_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rrelu_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_batch_norm_f32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_batch_norm_f16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_batch_norm_bf16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_instance_norm_f32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_instance_norm_f16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_instance_norm_bf16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_layer_norm_f32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_layer_norm_f16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_layer_norm_bf16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_group_norm_f32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_group_norm_f16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_group_norm_bf16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_rms_norm_f32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_rms_norm_f16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_rms_norm_bf16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_softmax_f32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_softmax_f16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_softmax_bf16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_log_softmax_f32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_log_softmax_f16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_log_softmax_bf16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_dropout_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_dropout_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_dropout_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_embedding_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_embedding_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_embedding_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_swish_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_swish_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_swish_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logsigmoid_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logsigmoid_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logsigmoid_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_tanhshrink_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_tanhshrink_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_tanhshrink_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_softshrink_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_softshrink_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_softshrink_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hardshrink_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hardshrink_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hardshrink_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_threshold_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_threshold_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_threshold_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cross_entropy_loss_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cross_entropy_loss_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cross_entropy_loss_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_mse_loss_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_mse_loss_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_mse_loss_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_l1_loss_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_l1_loss_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_l1_loss_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_smooth_l1_loss_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_smooth_l1_loss_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_smooth_l1_loss_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_nll_loss_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_nll_loss_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_nll_loss_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_avg_pool_2d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_avg_pool_2d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_avg_pool_2d_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_max_pool_2d_f32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_max_pool_2d_f16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_max_pool_2d_bf16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_avg_pool_1d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_avg_pool_1d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_avg_pool_1d_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_max_pool_1d_f32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_max_pool_1d_f16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_max_pool_1d_bf16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_lp_pool_2d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lp_pool_2d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lp_pool_2d_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bce_loss_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bce_loss_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bce_loss_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bce_with_logits_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bce_with_logits_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bce_with_logits_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hinge_loss_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hinge_loss_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hinge_loss_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_kl_div_loss_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_kl_div_loss_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_kl_div_loss_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cosine_embedding_loss_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cosine_embedding_loss_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cosine_embedding_loss_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE

ops_optimizer (82 个 kernel)

  • foreach_adam_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adam_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adam_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adam_f32_wd [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adamw_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adamw_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adamw_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adamw_f32_wd [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sgd_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sgd_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sgd_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sgd_f32_wd [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sgd_momentum_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sgd_momentum_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sgd_momentum_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sgd_momentum_f32_wd [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adagrad_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adagrad_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adagrad_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adagrad_f32_wd [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adadelta_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adadelta_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adadelta_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adadelta_f32_wd [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rmsprop_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rmsprop_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rmsprop_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rmsprop_f32_wd [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lamb_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lamb_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lamb_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lamb_f32_wd [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lars_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lars_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lars_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lars_f32_wd [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_ftrl_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_ftrl_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_ftrl_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_ftrl_f32_wd [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adam_amsgrad_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adam_amsgrad_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adam_amsgrad_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adamw_amsgrad_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adamw_amsgrad_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adamw_amsgrad_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adam_fused_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adam_fused_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adam_fused_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adamw_fused_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adamw_fused_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adamw_fused_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sgd_nesterov_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sgd_nesterov_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sgd_nesterov_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lion_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lion_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lion_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adafactor_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adafactor_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adafactor_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sophia_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sophia_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sophia_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_came_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_came_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_came_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_novograd_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_novograd_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_novograd_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_prodigy_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_prodigy_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_prodigy_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_shampoo_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_shampoo_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_shampoo_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adalomo_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adalomo_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adalomo_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_galore_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_galore_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_galore_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE

ops_reduce (80 个 kernel)

  • foreach_reduce_sum_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_sum_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_sum_int32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_max_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_max_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_max_int32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_min_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_min_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_min_int32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_mean_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_mean_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_mean_int32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_prod_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_prod_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_prod_int32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_any_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_any_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_any_int32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_all_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_all_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_all_int32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_argmax_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_argmax_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_argmax_int32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_argmin_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_argmin_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_argmin_int32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_cumsum_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_cumsum_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_cumsum_int32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_cumprod_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cumprod_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cumprod_int32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_reduce_sum_bf16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_sum_f32_axis0 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_sum_f16_axis0 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_sum_f32_axis1 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_sum_f16_axis1 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_max_bf16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_max_f32_axis0 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_max_f16_axis0 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_max_f32_axis1 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_max_f16_axis1 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_min_bf16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_min_f32_axis0 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_min_f16_axis0 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_min_f32_axis1 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_min_f16_axis1 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_mean_bf16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_mean_f32_axis0 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_mean_f16_axis0 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_mean_f32_axis1 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_mean_f16_axis1 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_prod_bf16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_prod_f32_axis0 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_prod_f16_axis0 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_prod_f32_axis1 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_prod_f16_axis1 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_l1_norm_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_l1_norm_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_l2_norm_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_l2_norm_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_logsumexp_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_logsumexp_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_nansum_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_nansum_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_nanmean_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_nanmean_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_count_nonzero_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_count_nonzero_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_median_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_median_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_var_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_var_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_std_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_std_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_l1_norm_bf16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_l2_norm_bf16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_logsumexp_bf16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_nansum_bf16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE

ops_resize (52 个 kernel)

  • foreach_upsample_nearest_2d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_upsample_nearest_2d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_upsample_nearest_3d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_upsample_nearest_3d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_upsample_bilinear_2d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_upsample_bilinear_2d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_upsample_bilinear_3d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_upsample_bilinear_3d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_upsample_bicubic_2d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_upsample_bicubic_2d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_upsample_trilinear_3d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_upsample_trilinear_3d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_interpolate_nearest_2d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_interpolate_nearest_2d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_interpolate_nearest_3d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_interpolate_nearest_3d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_interpolate_bilinear_2d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_interpolate_bilinear_2d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_interpolate_bilinear_3d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_interpolate_bilinear_3d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_interpolate_bicubic_2d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_interpolate_bicubic_2d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_resize_nearest_2d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_resize_nearest_2d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_resize_bilinear_2d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_resize_bilinear_2d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adaptive_avg_pool_2d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adaptive_avg_pool_2d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adaptive_avg_pool_3d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adaptive_avg_pool_3d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adaptive_max_pool_2d_f32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_adaptive_max_pool_2d_f16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_adaptive_max_pool_3d_f32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_adaptive_max_pool_3d_f16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_upsample_bilinear_2d_align_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_upsample_bilinear_2d_align_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_upsample_bicubic_2d_align_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_upsample_bicubic_2d_align_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_interpolate_bilinear_2d_align_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_interpolate_bilinear_2d_align_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_resize_bilinear_2d_align_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_resize_bilinear_2d_align_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_grid_sample_bilinear_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_grid_sample_bilinear_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_grid_sample_nearest_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_grid_sample_nearest_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_grid_sample_bicubic_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_grid_sample_bicubic_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_pixel_shuffle_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_pixel_unshuffle_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_pixel_shuffle_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_pixel_unshuffle_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE

ops_transformer (200 个 kernel)

  • foreach_attention_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_attention_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_attention_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_attention_f16_910b [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_attention_f16_310p [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scaled_dot_product_attention_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scaled_dot_product_attention_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scaled_dot_product_attention_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scaled_dot_product_attention_f16_910b [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scaled_dot_product_attention_f16_310p [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_multi_head_attention_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_multi_head_attention_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_multi_head_attention_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_multi_head_attention_f16_910b [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_multi_head_attention_f16_310p [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_attention_v1_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_attention_v1_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_attention_v1_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_attention_v1_f16_910b [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_attention_v1_f16_310p [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_attention_v2_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_attention_v2_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_attention_v2_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_attention_v2_f16_910b [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_attention_v2_f16_310p [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_attention_v3_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_attention_v3_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_attention_v3_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_attention_v3_f16_910b [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_attention_v3_f16_310p [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_paged_attention_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_paged_attention_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_paged_attention_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_paged_attention_f16_910b [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_paged_attention_f16_310p [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rotary_embedding_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rotary_embedding_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rotary_embedding_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rotary_embedding_f16_910b [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rotary_embedding_f16_310p [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rope_apply_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rope_apply_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rope_apply_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rope_apply_f16_910b [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rope_apply_f16_310p [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_alibi_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_alibi_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_alibi_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_alibi_f16_910b [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_alibi_f16_310p [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_kv_cache_update_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_kv_cache_update_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_kv_cache_update_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_kv_cache_update_f16_910b [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_kv_cache_update_f16_310p [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_beam_search_score_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_beam_search_score_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_beam_search_score_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_beam_search_score_f16_910b [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_beam_search_score_f16_310p [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_matmul_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_matmul_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_matmul_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_matmul_f32_910b [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_matmul_f32_310p [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_matmul_f16_910b [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_matmul_f16_310p [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_batch_matmul_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_batch_matmul_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_batch_matmul_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_batch_matmul_f32_910b [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_batch_matmul_f32_310p [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_batch_matmul_f16_910b [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_batch_matmul_f16_310p [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_linear_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_linear_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_linear_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_linear_f32_910b [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_linear_f32_310p [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_linear_f16_910b [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_linear_f16_310p [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gemm_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gemm_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gemm_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gemm_f32_910b [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gemm_f32_310p [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gemm_f16_910b [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gemm_f16_310p [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gemv_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gemv_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gemv_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gemv_f32_910b [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gemv_f32_310p [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gemv_f16_910b [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gemv_f16_310p [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_position_encoding_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_position_encoding_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_position_encoding_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_causal_mask_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_causal_mask_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_causal_mask_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cross_attention_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cross_attention_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cross_attention_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_grouped_query_attention_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_grouped_query_attention_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_grouped_query_attention_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sliding_window_attention_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sliding_window_attention_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sliding_window_attention_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_linear_attention_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_linear_attention_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_linear_attention_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sparse_attention_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sparse_attention_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sparse_attention_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_local_attention_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_local_attention_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_local_attention_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_ring_attention_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_ring_attention_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_ring_attention_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_prefix_attention_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_prefix_attention_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_prefix_attention_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_kv_cache_quantize_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_kv_cache_quantize_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_kv_cache_quantize_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_attention_score_mod_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_attention_score_mod_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_attention_score_mod_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rope_neox_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rope_neox_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rope_neox_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rope_glm_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rope_glm_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rope_glm_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_matmul_quant_int8_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_matmul_quant_int8_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_attention_quant_int8_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_attention_quant_int8_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_linear_quant_int8_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_linear_quant_int8_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_matmul_quant_int4_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_matmul_quant_int4_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_attention_quant_int4_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_attention_quant_int4_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_linear_quant_int4_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_linear_quant_int4_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_multi_query_attention_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_multi_query_attention_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_multi_query_attention_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_decoding_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_decoding_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_decoding_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_speculative_decoding_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_speculative_decoding_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_speculative_decoding_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_token_mixing_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_token_mixing_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_token_mixing_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_channel_mixing_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_channel_mixing_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_channel_mixing_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_moe_gate_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_moe_gate_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_moe_gate_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_moe_dispatch_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_moe_dispatch_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_moe_dispatch_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_moe_combine_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_moe_combine_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_moe_combine_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_swiglu_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_swiglu_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_swiglu_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_geglu_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_geglu_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_geglu_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_reglu_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_reglu_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_reglu_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rmsnorm_linear_f32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_rmsnorm_linear_f16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_rmsnorm_linear_bf16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_prenorm_attention_f32 [EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_prenorm_attention_f16 [EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_prenorm_attention_bf16 [EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_postnorm_attention_f32 [EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_postnorm_attention_f16 [EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_postnorm_attention_bf16 [EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_parallel_attention_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_parallel_attention_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_parallel_attention_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sandwich_norm_f32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_sandwich_norm_f16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_sandwich_norm_bf16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_qk_norm_f32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_qk_norm_f16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_qk_norm_bf16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE