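Appendix I: Performance Gap Analysis
An analysis of round-trip performance patterns across 998 CANN 8.5 kernels.
Compared with hand-written AscendC C++, the ascend-rs compilation pipeline (Rust → MLIR → C++ → bisheng) introduces a number of specific code-generation patterns. This appendix identifies those patterns, rates their impact, and proposes reusable optimizations.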
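I.1 Performance Tiers
| Tier | Count | % | Description |
|---|---|---|---|
| EQUIVALENT | 121 | 12% | Generated code performs on par with the original C++ |
| SLOW_1.2X | 0 | 0% | About 20% slower because of medium-impact patterns |
| SLOW_1.5X | 0 | 0% | About 50% slower because of high-impact patterns (TBuf, barriers) |
| SLOW_2X+ | 0 | 0% | 2x or more slower because several high-impact patterns compound |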
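I.2 Performance Degradation Patterns
TBuf used instead of TQue (High)
Affected kernels: 998/998
Problem: The generated code uses TBuf<VECCALC> rather than TQue<VECIN/VECOUT>. TBuf requires an explicit pipe_barrier(PIPE_ALL) at every synchronization point, whereas TQue overlaps pipes at fine granularity through hardware flags.
Fix: Generate TQue<QuePosition::VECIN, depth> with an AllocTensor/FreeTensor lifecycle in place of the TBuf.Get pattern.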
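PIPE_ALL barriers (full pipeline stall) (High)
Affected kernels: 998/998
Problem: Every ascend_pipe_barrier() call generates pipe_barrier(PIPE_ALL), which stalls all hardware pipes at once. The original C++ synchronizes per pipe, either through TQue or through selective PIPE_V/PIPE_MTE2 flags.
Fix: Use pipe_barrier(PIPE_V) for compute-only synchronization and PIPE_MTE2 for DMA synchronization, or eliminate the barriers entirely through TQue.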
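The fix above amounts to choosing the narrowest barrier that still covers a dependency. A minimal Rust sketch of that choice follows, assuming the codegen knows which pipe produced a value and which pipe consumes it. The `Pipe` enum and `barrier_for` are hypothetical helpers, not names from ascend-rs, and falling back to PIPE_ALL for every cross-pipe dependency is a conservative assumption for the period before TQue migration.

```rust
/// Hardware pipes referred to in the text. The enum is a hypothetical helper.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Pipe {
    /// DMA load pipe (PIPE_MTE2 in the text).
    Mte2,
    /// Vector compute pipe (PIPE_V in the text).
    V,
    /// DMA store pipe; not given a narrow barrier in this sketch.
    Mte3,
}

/// Returns the C++ barrier statement to emit between a producer and a consumer.
/// Same-pipe dependencies on V or MTE2 get a narrow barrier; everything else
/// keeps the conservative PIPE_ALL barrier (assumption of this sketch).
pub fn barrier_for(producer: Pipe, consumer: Pipe) -> &'static str {
    match (producer, consumer) {
        (Pipe::V, Pipe::V) => "pipe_barrier(PIPE_V);",
        (Pipe::Mte2, Pipe::Mte2) => "pipe_barrier(PIPE_MTE2);",
        _ => "pipe_barrier(PIPE_ALL);",
    }
}
```

Under these assumptions, a vector-to-vector dependency no longer stalls the DMA pipes, while a load-to-compute dependency still uses the full barrier until queues take over that synchronization.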
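No double buffering (High)
Affected kernels: 998/998
Problem: DMA and compute are fully serialized: load→barrier→compute→barrier→store. The original C++ uses TQue with depth=2 to overlap the DMA for tile N+1 with the compute for tile N.
Fix: Detect tiled loops and generate TQue with depth=2, so that EnQue/DeQue overlap DMA and compute across tiles.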
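To get a feel for how much serialization costs, the two schedules can be compared with a small timing model. This is an idealized sketch, not a measurement: it assumes fixed per-tile stage costs, three independent pipes, and enough queue depth that the slowest stage is the only bottleneck. `TileCost` and both functions are hypothetical names.

```rust
/// Per-tile stage costs in arbitrary time units (illustrative, not measured).
#[derive(Clone, Copy, Debug)]
pub struct TileCost {
    pub load: u64,
    pub compute: u64,
    pub store: u64,
}

/// Serialized schedule: load -> barrier -> compute -> barrier -> store,
/// repeated for every tile with no overlap.
pub fn serialized_time(cost: TileCost, tiles: u64) -> u64 {
    tiles * (cost.load + cost.compute + cost.store)
}

/// Idealized overlapped schedule: the first tile pays the full latency,
/// and every further tile only pays for the slowest stage.
pub fn overlapped_time(cost: TileCost, tiles: u64) -> u64 {
    if tiles == 0 {
        return 0;
    }
    let bottleneck = cost.load.max(cost.compute).max(cost.store);
    cost.load + cost.compute + cost.store + (tiles - 1) * bottleneck
}
```

With hypothetical costs of load=4, compute=3, store=2 over 8 tiles, the serialized schedule takes 72 units and the overlapped one 37. Real kernels will land somewhere in between, since barrier overhead and queue stalls are ignored here.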
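Uniform maximum buffer size (Low)
Affected kernels: 998/998
Problem: Every TBuf is allocated the same maximum size = (UB_SIZE - 8KB) / num_bufs. The original C++ sizes each buffer according to its actual data requirements. When buffer usage varies widely, the uniform scheme wastes UB space.
Fix: Track actual buffer usage in MLIR and allocate proportionally.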
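The difference between the two sizing schemes can be sketched in a few lines of Rust. The uniform formula comes from the text. The proportional split, the 32-byte alignment, and the function names are assumptions made for illustration, and the UB size is left as a parameter because it depends on the NPU model.

```rust
/// Bytes held back from the UB budget (the 8KB in the formula above).
pub const RESERVED_BYTES: usize = 8 * 1024;
/// Assumed allocation granularity; adjust to the target's real alignment rule.
pub const ALIGN: usize = 32;

/// Current scheme: every buffer gets (UB_SIZE - 8KB) / num_bufs.
pub fn uniform_size(ub_size: usize, num_bufs: usize) -> usize {
    if num_bufs == 0 {
        return 0;
    }
    ub_size.saturating_sub(RESERVED_BYTES) / num_bufs
}

/// Proposed scheme (sketch): split the same budget in proportion to the
/// usage tracked per buffer, rounding each share down to ALIGN bytes.
pub fn proportional_sizes(ub_size: usize, usage: &[usize]) -> Vec<usize> {
    let budget = ub_size.saturating_sub(RESERVED_BYTES);
    let total: usize = usage.iter().sum();
    if total == 0 {
        return vec![0; usage.len()];
    }
    usage
        .iter()
        .map(|&used| (budget * used / total) / ALIGN * ALIGN)
        .collect()
}
```

For a kernel with two large data buffers and one small scratch buffer, the proportional split gives the data buffers larger tiles than the uniform split would, while the total still stays within the same budget.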
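Vectorized workaround for scalar math (Medium)
Affected kernels: 1/998
Problem: Because the scalar pipe hangs on some NPU models, scalar log/exp/sqrt operations are vectorized through a 1KB temporary buffer. This adds DMA and buffer overhead to every scalar math operation.
Fix: Use the scalar pipe on models that support it; on other models, batch scalar operations to amortize the overhead.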
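I.3 Optimization Opportunities
Barrier elimination opportunities (Medium)
Applicable kernels: 998/998
Description: Consecutive vector operations that act on different buffers need no barrier between them. The current codegen inserts a barrier whenever dirty_bufs overlap, although many of these operations are in fact independent.
Implementation: Implement per-buffer dirty tracking at the MLIR level. Insert a barrier only when a read-after-write (RAW) hazard exists on the same buffer.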
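Loop unrolling candidates (Low)
Applicable kernels: 998/998
Description: Loops with a small, fixed trip count (for example, the two reduce passes of softmax) can be unrolled. The current codegen emits a generic while(true) loop.
Implementation: Detect loops with a known small trip count and unroll them.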
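As a rough illustration of the unrolling decision, the sketch below emits either repeated copies of a loop body or a generic loop, depending on whether the trip count is known and small. The threshold, the `{i}` placeholder convention, and the shape of the fallback loop are all assumptions for this example and are not taken from the actual codegen.

```rust
/// Largest trip count that gets unrolled (hypothetical threshold).
pub const UNROLL_LIMIT: usize = 4;

/// Emits C++ text for a loop. `body` may contain the placeholder `{i}`,
/// which is replaced by the iteration index (a literal when unrolled,
/// the loop variable otherwise).
pub fn emit_loop(trip_count: Option<usize>, bound_expr: &str, body: &str) -> String {
    match trip_count {
        // Known, small trip count: repeat the body with a literal index.
        Some(n) if n <= UNROLL_LIMIT => (0..n)
            .map(|i| body.replace("{i}", &i.to_string()))
            .collect::<Vec<_>>()
            .join("\n"),
        // Unknown or large trip count: keep a generic loop.
        _ => format!(
            "int32_t i = 0;\nwhile (true) {{\n    if (i >= {}) break;\n    {}\n    ++i;\n}}",
            bound_expr,
            body.replace("{i}", "i")
        ),
    }
}
```

For a two-pass reduce, `emit_loop(Some(2), "2", "ReducePass({i});")` yields two straight-line calls, which removes the loop counter and the exit branch from the generated kernel.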
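Operator fusion candidates (Medium)
Applicable kernels: 0/998
Description: Consecutive vector operations on the same buffer (for example Sub→Exp or Div→Cast) can be fused into a single vector instruction, or can at least share one barrier. The current codegen treats each of them as an independent operation.
Implementation: Detect chains of unary/binary operations on the same buffer and fuse them into compound AscendC instructions.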
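I.4 Reusable Optimization Plan
Based on the pattern analysis above, three optimizations close the performance gap for most kernels:
Priority 1: TQue migration (closes about 50% of the gap)
Replace TBuf<VECCALC> with TQue<VECIN/VECOUT> in the MLIR→C++ codegen. This replaces PIPE_ALL barriers with hardware-flag-based synchronization and enables double buffering, so that DMA overlaps with compute.
Affected file: crates/rustc_codegen_mlir/src/mlir_to_cpp.rs
Required changes:
- Change buffer declarations from TBuf<TPosition::VECCALC> to TQue<QuePosition::VECIN>/TQue<QuePosition::VECOUT>
- Replace tbuf.Get<T>() with inQueue.AllocTensor<T>()/inQueue.DeQue<T>()
- Add the inQueue.EnQue(tensor)/outQueue.FreeTensor(tensor) lifecycle
- Replace pipe_barrier(PIPE_ALL) with implicit TQue synchronization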
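The required changes above could be sketched as small emitter helpers on the Rust side of the codegen. Everything here is illustrative: the function names are hypothetical and do not come from mlir_to_cpp.rs, the local tensor is always called `t`, and the emitted C++ has not been checked against the bisheng compiler.

```rust
/// Which side of the vector unit a queue feeds
/// (mirrors QuePosition::VECIN / QuePosition::VECOUT).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum QueRole {
    VecIn,
    VecOut,
}

/// Emits the declaration that would replace a TBuf<TPosition::VECCALC> member.
pub fn emit_queue_decl(name: &str, role: QueRole, depth: u32) -> String {
    let pos = match role {
        QueRole::VecIn => "VECIN",
        QueRole::VecOut => "VECOUT",
    };
    format!("TQue<QuePosition::{}, {}> {};", pos, depth, name)
}

/// Producer side of the lifecycle: AllocTensor, fill the tensor, EnQue.
/// `fill_stmt` is the statement that writes into the local tensor `t`.
pub fn emit_produce(queue: &str, elem_ty: &str, fill_stmt: &str) -> String {
    [
        format!("LocalTensor<{}> t = {}.AllocTensor<{}>();", elem_ty, queue, elem_ty),
        fill_stmt.to_string(),
        format!("{}.EnQue(t);", queue),
    ]
    .join("\n")
}

/// Consumer side of the lifecycle: DeQue, use the tensor, FreeTensor.
/// `use_stmt` is the statement that reads from the local tensor `t`.
pub fn emit_consume(queue: &str, elem_ty: &str, use_stmt: &str) -> String {
    [
        format!("LocalTensor<{}> t = {}.DeQue<{}>();", elem_ty, queue, elem_ty),
        use_stmt.to_string(),
        format!("{}.FreeTensor(t);", queue),
    ]
    .join("\n")
}
```

Declaring both queues with depth 2 is what lets the queue-based lifecycle double-buffer: one slot can be filled by DMA while the other is consumed by compute, with no explicit pipe_barrier in the emitted text.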
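Priority 2: Barrier elimination (closes about 20% of the gap)
Implement per-buffer dirty tracking to eliminate barriers between independent vector operations. Insert a barrier only when a read-after-write (RAW) hazard exists on the same buffer.
Current behavior: Any vector operation that reads a dirty buffer triggers PIPE_ALL.
Proposed behavior: Track dirty state per buffer. Insert a barrier only in the following cases:
- A DMA load writes buffer B, and a vector operation then reads buffer B
- A vector operation writes buffer B, and a DMA store then reads buffer B
Conversely, skip the barrier between Add(buf0, buf1, buf2) and Mul(buf3, buf0, buf4) when buf0 is not dirty.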
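The proposed behavior can be sketched as a small tracker that remembers, per buffer, which pipe last wrote it. This is an illustration under stated assumptions rather than the ascend-rs implementation: `DirtyTracker`, `Op`, and `Pipe` are hypothetical names, a buffer counts as dirty for a reader only if a different pipe wrote it since the last barrier, and a barrier is assumed to clear all pending writes. The sketch follows the RAW-only rule from the text; a production pass might also need to consider WAR and WAW hazards.

```rust
use std::collections::HashMap;

/// Pipes that can touch a UB buffer (hypothetical helper enum).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Pipe {
    Mte2, // DMA load
    V,    // vector compute
    Mte3, // DMA store
}

/// One operation, described by its pipe and the buffers it writes and reads.
pub struct Op<'a> {
    pub pipe: Pipe,
    pub writes: &'a [&'a str],
    pub reads: &'a [&'a str],
}

/// Tracks, per buffer, the pipe of the last write not yet covered by a barrier.
#[derive(Default)]
pub struct DirtyTracker {
    pending: HashMap<String, Pipe>,
    pub barriers: usize,
}

impl DirtyTracker {
    pub fn new() -> Self {
        Self::default()
    }

    /// Returns true if a barrier must be emitted before `op`.
    pub fn visit(&mut self, op: &Op) -> bool {
        // RAW hazard: the op reads a buffer whose pending write came from another pipe.
        let hazard = op
            .reads
            .iter()
            .any(|buf| matches!(self.pending.get(*buf), Some(writer) if *writer != op.pipe));
        if hazard {
            // Assumption: the emitted barrier synchronizes every pending write.
            self.pending.clear();
            self.barriers += 1;
        }
        for buf in op.writes {
            self.pending.insert((*buf).to_string(), op.pipe);
        }
        hazard
    }
}
```

Feeding the sequence load buf1, load buf2, Add(buf0, buf1, buf2), Mul(buf3, buf0, buf4), store buf3 through this tracker yields two barriers, one before Add and one before the store, and none between Add and Mul.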
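Priority 3: Operator fusion (closes about 10% of the gap)
Fuse consecutive vector operations on the same buffer into compound operations:
- Sub(buf, x, max) → Exp(buf, buf) → a single AscendC call that performs Sub+Exp
- Muls(buf, buf, scale) → Adds(buf, buf, bias) → a MulAdd compound operation
- Eliminate the intermediate barriers between fused operations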
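A peephole pass for the Muls→Adds case could look roughly like the sketch below. The `VecOp` enum is a simplified, hypothetical IR in which every operation works in place on a named buffer, and `MulsAdds` is a placeholder compound node, not a confirmed AscendC instruction.

```rust
/// Simplified in-place vector ops on a named buffer (hypothetical IR).
#[derive(Clone, Debug, PartialEq)]
pub enum VecOp {
    /// buf = buf * scale
    Muls { buf: String, scale: f32 },
    /// buf = buf + bias
    Adds { buf: String, bias: f32 },
    /// Placeholder compound node: buf = buf * scale + bias
    MulsAdds { buf: String, scale: f32, bias: f32 },
}

/// Fuses each Muls that is immediately followed by an Adds on the same buffer.
pub fn fuse(ops: &[VecOp]) -> Vec<VecOp> {
    let mut out: Vec<VecOp> = Vec::with_capacity(ops.len());
    for op in ops {
        // Look at the previously emitted op and the current one together.
        let fused = match (out.last(), op) {
            (Some(VecOp::Muls { buf: prev, scale }), VecOp::Adds { buf: cur, bias })
                if prev == cur =>
            {
                Some(VecOp::MulsAdds {
                    buf: prev.clone(),
                    scale: *scale,
                    bias: *bias,
                })
            }
            _ => None,
        };
        match fused {
            Some(compound) => {
                out.pop(); // drop the Muls that is being absorbed
                out.push(compound);
            }
            None => out.push(op.clone()),
        }
    }
    out
}
```

Because the fused pair becomes one node, the barrier that the codegen would otherwise place between the two operations has nothing left to separate.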
I.5 Performance Summary by Category
| Category | Total | Equivalent | Slow 1.2x | Slow 1.5x | Slow 2x+ |
|---|---|---|---|---|---|
| ops_index | 114 | 6 | 0 | 0 | 0 |
| ops_legacy | 200 | 0 | 0 | 0 | 0 |
| ops_math | 120 | 9 | 0 | 0 | 0 |
| ops_nn | 150 | 0 | 0 | 0 | 0 |
| ops_optimizer | 82 | 0 | 0 | 0 | 0 |
| ops_reduce | 80 | 0 | 0 | 0 | 0 |
| ops_resize | 52 | 0 | 0 | 0 | 0 |
| ops_transformer | 200 | 106 | 0 | 0 | 0 |
I.6 Per-Kernel Pattern Details
ops_index (114 kernels)
foreach_gather_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_gather_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_gather_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_scatter_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_scatter_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_scatter_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_scatter_add_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_scatter_add_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_scatter_add_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_scatter_mul_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_scatter_mul_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, 
S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_scatter_mul_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_add_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_add_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_add_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_copy_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_copy_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_copy_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_fill_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_fill_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_fill_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_select_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, 
S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_select_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_select_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_put_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_put_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_put_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_masked_fill_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_masked_fill_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_masked_fill_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_masked_select_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_masked_select_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_masked_select_int32[SLOW_1.02X]: 
barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_masked_scatter_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_masked_scatter_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_masked_scatter_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_where_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_where_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_where_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_nonzero_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_nonzero_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_nonzero_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_sort_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, 
S6_HARDCODED_GM_SIZEforeach_sort_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_sort_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_argsort_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_argsort_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_argsort_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_topk_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_topk_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_topk_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_unique_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_unique_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_unique_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, 
S6_HARDCODED_GM_SIZEforeach_searchsorted_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_searchsorted_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_searchsorted_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_bucketize_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_bucketize_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_bucketize_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_one_hot_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_one_hot_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_one_hot_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_embedding_bag_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_embedding_bag_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, 
S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_embedding_bag_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_cummax_f32[EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_cummax_f16[EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_cummax_int32[EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_cummin_f32[EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_cummin_f16[EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_cummin_int32[EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_scatter_nd_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_scatter_nd_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_scatter_nd_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_gather_nd_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — 
S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_gather_nd_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_gather_nd_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_put_accumulate_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_put_accumulate_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_put_accumulate_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_take_along_axis_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_take_along_axis_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_take_along_axis_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_put_along_axis_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_put_along_axis_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, 
S6_HARDCODED_GM_SIZEforeach_put_along_axis_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_bincount_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_bincount_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_bincount_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_scatter_max_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_scatter_max_f16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_scatter_max_int32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_scatter_min_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_scatter_min_f16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_scatter_min_int32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_gather_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, 
S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_scatter_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_select_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_where_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_sort_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_topk_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_masked_fill_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_masked_select_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_sort_int64[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_argsort_int64[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_topk_int64[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_unique_int64[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — 
S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_gather_int8[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_scatter_int8[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_scatter_add_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_scatter_mul_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_add_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_copy_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
ops_legacy (200 kernels)
foreach_exp_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_exp_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_exp_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_abs_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_abs_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_abs_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_neg_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_neg_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_neg_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_sqrt_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_sqrt_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_sqrt_bf16[SLOW_1.02X]: 
barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_rsqrt_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_rsqrt_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_rsqrt_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_reciprocal_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_reciprocal_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_reciprocal_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_ln_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_ln_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_ln_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_log2_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_log2_f16[SLOW_1.02X]: barriers=3, 
bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_log2_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_log10_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_log10_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_log10_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_ceil_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_ceil_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_ceil_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_floor_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_floor_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_floor_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_round_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_round_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_round_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_trunc_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_trunc_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_trunc_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sign_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sign_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sign_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_not_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_not_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_not_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_bitwise_not_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_bitwise_not_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_bitwise_not_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_logical_not_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_logical_not_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_logical_not_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_clamp_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_clamp_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_clamp_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_add_list_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_add_list_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_add_list_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sub_list_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sub_list_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sub_list_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_mul_list_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_mul_list_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_mul_list_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_div_list_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_div_list_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_div_list_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_max_list_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_max_list_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_max_list_bf16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_min_list_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_min_list_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_min_list_bf16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_pow_list_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_pow_list_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_pow_list_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_fmod_list_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_fmod_list_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_fmod_list_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_bitwise_and_list_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_bitwise_and_list_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_bitwise_and_list_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_bitwise_or_list_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_bitwise_or_list_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_bitwise_or_list_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_bitwise_xor_list_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_bitwise_xor_list_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_bitwise_xor_list_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_logical_and_list_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_logical_and_list_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_logical_and_list_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_logical_or_list_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_logical_or_list_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_logical_or_list_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_add_scalar_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_add_scalar_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_add_scalar_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sub_scalar_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sub_scalar_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sub_scalar_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_mul_scalar_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_mul_scalar_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_mul_scalar_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_div_scalar_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_div_scalar_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_div_scalar_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_max_scalar_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_max_scalar_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_max_scalar_bf16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_min_scalar_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_min_scalar_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_min_scalar_bf16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_pow_scalar_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_pow_scalar_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_pow_scalar_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_clamp_scalar_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_clamp_scalar_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_clamp_scalar_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_add_list_alpha_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_add_list_alpha_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_add_list_alpha_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sub_list_alpha_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sub_list_alpha_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sub_list_alpha_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_addcmul_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_addcdiv_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_copy_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_zero_inplace_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_lerp_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_addcmul_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_addcdiv_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_copy_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_zero_inplace_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_lerp_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_addcmul_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_addcdiv_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_copy_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_zero_inplace_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_lerp_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
zeros_like_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
ones_like_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
zeros_like_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
ones_like_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
zeros_like_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
ones_like_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
zeros_like_int32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
ones_like_int32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_abs_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_abs_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_abs_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_relu_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_relu_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_relu_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_gelu_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_gelu_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_gelu_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_silu_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_silu_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_silu_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_neg_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_neg_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_neg_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_sign_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_sign_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_sign_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_ceil_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_ceil_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_ceil_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_floor_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_floor_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_floor_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise16b_abs_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise16b_abs_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise16b_abs_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise16b_relu_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise16b_relu_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise16b_relu_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise16b_neg_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise16b_neg_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise16b_neg_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise16b_sign_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise16b_sign_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise16b_sign_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_abs_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_neg_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sign_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_bitwise_not_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_logical_not_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_clamp_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_add_list_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sub_list_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_mul_list_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_max_list_int32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_abs_int8[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_neg_int8[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_bitwise_not_int8[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_clamp_int8[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_add_scalar_int32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sub_scalar_int32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_mul_scalar_int32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_div_scalar_int32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
ops_math (120 kernels)
foreach_sin_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sin_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sin_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_cos_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_cos_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_cos_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_tan_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_tan_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_tan_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_asin_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_asin_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_asin_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_acos_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_acos_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_acos_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_atan_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_atan_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_atan_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_atan2_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_atan2_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_atan2_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sinh_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sinh_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sinh_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_cosh_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_cosh_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_cosh_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_tanh_math_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_tanh_math_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_tanh_math_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_asinh_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_asinh_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_asinh_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_acosh_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_acosh_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_acosh_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_atanh_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_atanh_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_atanh_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_erf_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_erf_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_erf_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_erfc_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_erfc_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_erfc_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — 
S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_erfinv_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_erfinv_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_erfinv_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_expm1_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_expm1_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_expm1_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_log1p_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_log1p_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_log1p_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_softplus_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_softplus_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — 
S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_softplus_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_digamma_f32[EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_digamma_f16[EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_digamma_bf16[EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_lgamma_f32[EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_lgamma_f16[EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_lgamma_bf16[EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_i0_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_i0_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_i0_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_i1_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — 
S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_i1_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_i1_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_hypot_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_hypot_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_hypot_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_fma_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_fma_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_fma_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_remainder_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_remainder_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_remainder_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — 
S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_copysign_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_copysign_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_copysign_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_nextafter_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_nextafter_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_nextafter_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_ldexp_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_ldexp_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_ldexp_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_frexp_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_frexp_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, 
tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_frexp_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_logaddexp_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_logaddexp_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_logaddexp_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_logaddexp2_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_logaddexp2_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_logaddexp2_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_sincos_f32_910b[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_sincos_f16_910b[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_sincos_bf16_910b[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, 
S6_HARDCODED_GM_SIZEforeach_sincospi_f32_910b[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_sincospi_f16_910b[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_sincospi_bf16_910b[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_j0_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_j0_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_j0_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_j1_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_j1_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_j1_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_y0_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_y0_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, 
S6_HARDCODED_GM_SIZEforeach_y0_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_y1_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_y1_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_y1_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_polygamma_f32[EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_polygamma_f16[EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_polygamma_bf16[EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_zeta_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_zeta_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_zeta_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
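Each entry in the listing above has the same shape: a kernel name, a tier in square brackets, four counters (barriers, bufs, vec_ops, tiling), and a comma-separated list of pattern IDs. The following is a minimal sketch of how such entries could be parsed and tallied by tier. The regular expression and the helper names `parse_entries` and `tally_tiers` are hypothetical and are not part of ascend-rs; the sketch only assumes the field layout visible in the listing.

```python
import re
from collections import Counter

# Hypothetical pattern, written against the entry layout seen in the listing.
# Kernel names start with a lowercase letter and pattern IDs are upper-case,
# so entries that run together without a separator can still be told apart.
ENTRY_RE = re.compile(
    r"(?P<name>[a-z]\w*)\[(?P<tier>[^\]]+)\]:\s*"
    r"barriers=(?P<barriers>\d+),\s*bufs=(?P<bufs>\d+),\s*"
    r"vec_ops=(?P<vec_ops>\d+),\s*tiling=(?P<tiling>yes|no)"
    r"\s*\u2014\s*"  # the dash that separates counters from pattern IDs
    r"(?P<patterns>S\d+_[A-Z0-9_]+(?:,\s*S\d+_[A-Z0-9_]+)*)"
)


def parse_entries(text):
    """Return one dict per kernel entry found in `text`."""
    entries = []
    for m in ENTRY_RE.finditer(text):
        entries.append({
            "name": m.group("name"),
            "tier": m.group("tier"),
            "barriers": int(m.group("barriers")),
            "bufs": int(m.group("bufs")),
            "vec_ops": int(m.group("vec_ops")),
            "tiling": m.group("tiling") == "yes",
            "patterns": [p.strip() for p in m.group("patterns").split(",")],
        })
    return entries


def tally_tiers(entries):
    """Count how many kernels fall into each tier."""
    return Counter(e["tier"] for e in entries)


# Two entries, deliberately fused without a line break between them.
sample = (
    "foreach_sin_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes "
    "\u2014 S1_TBUF_NOT_TQUE, S6_HARDCODED_GM_SIZE"
    "foreach_batch_norm_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, "
    "tiling=yes \u2014 S1_TBUF_NOT_TQUE, S4_ALIAS_SCRATCH_COPY"
)
entries = parse_entries(sample)
tiers = tally_tiers(entries)
```

Because the regular expression relies on letter case to find entry boundaries, it works on both one-entry-per-line text and run-together text, but it would need adjusting if a pattern ID ever contained lowercase letters.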
ops_nn (150 kernels)
foreach_relu_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_relu_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_relu_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_relu6_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_relu6_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_relu6_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_leaky_relu_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_leaky_relu_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_leaky_relu_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_prelu_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_prelu_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_prelu_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_elu_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_elu_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_elu_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_selu_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_selu_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_selu_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_gelu_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_gelu_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_gelu_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_fast_gelu_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_fast_gelu_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_fast_gelu_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sigmoid_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sigmoid_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sigmoid_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_hardsigmoid_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_hardsigmoid_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_hardsigmoid_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_hardswish_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_hardswish_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_hardswish_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_hardtanh_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_hardtanh_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_hardtanh_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_silu_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_silu_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_silu_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_mish_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_mish_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_mish_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_softplus_nn_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_softplus_nn_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_softplus_nn_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_softsign_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_softsign_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_softsign_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_tanh_nn_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_tanh_nn_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_tanh_nn_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_celu_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_celu_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_celu_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_glu_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_glu_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_glu_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rrelu_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rrelu_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rrelu_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_batch_norm_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_batch_norm_f16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_batch_norm_bf16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_instance_norm_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_instance_norm_f16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_instance_norm_bf16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_layer_norm_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_layer_norm_f16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_layer_norm_bf16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_group_norm_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_group_norm_f16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_group_norm_bf16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_rms_norm_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_rms_norm_f16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_rms_norm_bf16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_softmax_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_softmax_f16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_softmax_bf16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_log_softmax_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_log_softmax_f16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_log_softmax_bf16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_dropout_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_dropout_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_dropout_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_embedding_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_embedding_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_embedding_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_swish_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_swish_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_swish_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_logsigmoid_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_logsigmoid_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_logsigmoid_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_tanhshrink_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_tanhshrink_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_tanhshrink_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_softshrink_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_softshrink_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_softshrink_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_hardshrink_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_hardshrink_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_hardshrink_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_threshold_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_threshold_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_threshold_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_cross_entropy_loss_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_cross_entropy_loss_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_cross_entropy_loss_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_mse_loss_f32[SLOW_1.02X]: barriers=3, bufs=3, 
vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_mse_loss_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_mse_loss_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_l1_loss_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_l1_loss_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_l1_loss_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_smooth_l1_loss_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_smooth_l1_loss_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_smooth_l1_loss_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_nll_loss_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_nll_loss_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, 
S6_HARDCODED_GM_SIZEforeach_nll_loss_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_avg_pool_2d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_avg_pool_2d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_avg_pool_2d_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_max_pool_2d_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_max_pool_2d_f16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_max_pool_2d_bf16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_avg_pool_1d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_avg_pool_1d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_avg_pool_1d_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_max_pool_1d_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, 
S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_max_pool_1d_f16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_max_pool_1d_bf16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_lp_pool_2d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_lp_pool_2d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_lp_pool_2d_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_bce_loss_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_bce_loss_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_bce_loss_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_bce_with_logits_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_bce_with_logits_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_bce_with_logits_bf16[SLOW_1.02X]: 
barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_hinge_loss_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_hinge_loss_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_hinge_loss_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_kl_div_loss_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_kl_div_loss_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_kl_div_loss_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_cosine_embedding_loss_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_cosine_embedding_loss_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_cosine_embedding_loss_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
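The per-kernel records in this listing share one layout, so they can be tallied mechanically. The following is a minimal sketch, not part of the ascend-rs tooling: the record grammar is inferred from the entries shown here, and the names `RECORD`, `parse_records`, `tally`, and `SAMPLE` are hypothetical. It assumes kernel names are lowercase and pattern tags are uppercase, which lets it split records whether they are newline-separated or run together.

```python
import re
from collections import Counter

# Record layout inferred from the listing (an assumption, not a documented format):
#   name[TIER]: barriers=N, bufs=N, vec_ops=N, tiling=yes|no <sep> S1_..., S2_..., ...
# Kernel names are assumed lowercase and pattern tags uppercase, so a tag that is
# fused to the next kernel name ("..._BUF_SIZEforeach_...") still splits correctly.
RECORD = re.compile(
    r"(?P<name>[a-z][a-z0-9_]*)\[(?P<tier>[A-Z0-9_.+]+)\]:\s*"
    r"barriers=(?P<barriers>\d+),\s*bufs=(?P<bufs>\d+),\s*"
    r"vec_ops=(?P<vec_ops>\d+),\s*tiling=(?P<tiling>\w+)"
    r"\s*[^\w\s]+\s*"  # separator between metrics and tags (any punctuation run)
    r"(?P<patterns>S\d+_[A-Z0-9_]+(?:,\s*S\d+_[A-Z0-9_]+)*)"
)


def parse_records(text):
    """Return one dict per kernel record found in text."""
    records = []
    for m in RECORD.finditer(text):
        records.append({
            "name": m["name"],
            "tier": m["tier"],
            "barriers": int(m["barriers"]),
            "bufs": int(m["bufs"]),
            "vec_ops": int(m["vec_ops"]),
            "tiling": m["tiling"] == "yes",
            "patterns": [p.strip() for p in m["patterns"].split(",")],
        })
    return records


def tally(records):
    """Count kernels per tier and per degradation pattern."""
    tiers = Counter(r["tier"] for r in records)
    patterns = Counter(p for r in records for p in r["patterns"])
    return tiers, patterns


# Two records copied from the listing, deliberately concatenated without a newline.
SAMPLE = (
    "foreach_softmax_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes \u2014 "
    "S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, "
    "S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE"
    "foreach_dropout_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes \u2014 "
    "S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, "
    "S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE"
)

if __name__ == "__main__":
    tiers, patterns = tally(parse_records(SAMPLE))
    print(dict(tiers))
    print(dict(patterns))
```

Run over a whole category, the tier counter gives the per-tier kernel counts and the pattern counter shows how many kernels carry each tag. Partial records at the edges of an excerpt are skipped rather than guessed.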
ops_optimizer (82 kernels)
foreach_adam_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adam_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adam_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adam_f32_wd[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adamw_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adamw_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adamw_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adamw_f32_wd[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sgd_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sgd_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sgd_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sgd_f32_wd[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sgd_momentum_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sgd_momentum_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sgd_momentum_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sgd_momentum_f32_wd[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adagrad_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adagrad_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adagrad_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adagrad_f32_wd[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adadelta_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adadelta_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adadelta_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adadelta_f32_wd[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rmsprop_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rmsprop_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rmsprop_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rmsprop_f32_wd[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_lamb_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_lamb_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_lamb_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_lamb_f32_wd[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_lars_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_lars_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_lars_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_lars_f32_wd[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_ftrl_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_ftrl_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_ftrl_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_ftrl_f32_wd[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adam_amsgrad_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adam_amsgrad_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adam_amsgrad_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adamw_amsgrad_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adamw_amsgrad_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adamw_amsgrad_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adam_fused_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adam_fused_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adam_fused_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adamw_fused_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adamw_fused_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adamw_fused_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sgd_nesterov_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sgd_nesterov_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sgd_nesterov_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_lion_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_lion_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_lion_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adafactor_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adafactor_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adafactor_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sophia_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sophia_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sophia_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_came_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_came_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_came_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_novograd_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_novograd_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_novograd_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_prodigy_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_prodigy_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_prodigy_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_shampoo_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_shampoo_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_shampoo_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adalomo_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adalomo_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adalomo_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_galore_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_galore_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_galore_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
ops_reduce (80 kernels)
foreach_reduce_sum_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_sum_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_sum_int32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_max_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_max_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_max_int32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_min_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_min_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_min_int32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_mean_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_mean_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_mean_int32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_prod_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_prod_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_prod_int32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_any_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_any_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_any_int32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_all_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_all_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_all_int32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_argmax_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_argmax_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_argmax_int32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_argmin_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_argmin_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_argmin_int32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_cumsum_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_cumsum_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_cumsum_int32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_cumprod_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_cumprod_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_cumprod_int32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_reduce_sum_bf16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_sum_f32_axis0[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_sum_f16_axis0[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_sum_f32_axis1[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_sum_f16_axis1[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_max_bf16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_max_f32_axis0[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_max_f16_axis0[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_max_f32_axis1[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_max_f16_axis1[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_min_bf16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_min_f32_axis0[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_min_f16_axis0[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_min_f32_axis1[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_min_f16_axis1[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_mean_bf16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_mean_f32_axis0[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_mean_f16_axis0[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_mean_f32_axis1[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_mean_f16_axis1[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, 
S5_UNIFORM_BUF_SIZEforeach_reduce_prod_bf16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_prod_f32_axis0[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_prod_f16_axis0[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_prod_f32_axis1[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_prod_f16_axis1[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_l1_norm_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_l1_norm_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_l2_norm_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_l2_norm_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_logsumexp_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_logsumexp_f16[SLOW_1.05X]: barriers=4, bufs=3, 
vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_nansum_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_nansum_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_nanmean_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_nanmean_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_count_nonzero_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_count_nonzero_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_median_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_median_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_var_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_var_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, 
S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_std_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_std_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_l1_norm_bf16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_l2_norm_bf16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_logsumexp_bf16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_nansum_bf16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
ops_resize (52 kernels)
foreach_upsample_nearest_2d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_upsample_nearest_2d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_upsample_nearest_3d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_upsample_nearest_3d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_upsample_bilinear_2d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_upsample_bilinear_2d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_upsample_bilinear_3d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_upsample_bilinear_3d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_upsample_bicubic_2d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_upsample_bicubic_2d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_upsample_trilinear_3d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_upsample_trilinear_3d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_interpolate_nearest_2d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_interpolate_nearest_2d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_interpolate_nearest_3d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_interpolate_nearest_3d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_interpolate_bilinear_2d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_interpolate_bilinear_2d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_interpolate_bilinear_3d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_interpolate_bilinear_3d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_interpolate_bicubic_2d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_interpolate_bicubic_2d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_resize_nearest_2d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_resize_nearest_2d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_resize_bilinear_2d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_resize_bilinear_2d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adaptive_avg_pool_2d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adaptive_avg_pool_2d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adaptive_avg_pool_3d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adaptive_avg_pool_3d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adaptive_max_pool_2d_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_adaptive_max_pool_2d_f16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_adaptive_max_pool_3d_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_adaptive_max_pool_3d_f16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_upsample_bilinear_2d_align_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_upsample_bilinear_2d_align_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_upsample_bicubic_2d_align_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_upsample_bicubic_2d_align_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_interpolate_bilinear_2d_align_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_interpolate_bilinear_2d_align_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_resize_bilinear_2d_align_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_resize_bilinear_2d_align_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_grid_sample_bilinear_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_grid_sample_bilinear_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_grid_sample_nearest_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_grid_sample_nearest_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_grid_sample_bicubic_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_grid_sample_bicubic_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_pixel_shuffle_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_pixel_unshuffle_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_pixel_shuffle_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_pixel_unshuffle_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
ops_transformer (200 kernels)
foreach_attention_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_attention_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_attention_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_attention_f16_910b[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_attention_f16_310p[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_scaled_dot_product_attention_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_scaled_dot_product_attention_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_scaled_dot_product_attention_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_scaled_dot_product_attention_f16_910b[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_scaled_dot_product_attention_f16_310p[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_multi_head_attention_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_multi_head_attention_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_multi_head_attention_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_multi_head_attention_f16_910b[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_multi_head_attention_f16_310p[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_flash_attention_v1_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_flash_attention_v1_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_flash_attention_v1_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_flash_attention_v1_f16_910b[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_flash_attention_v1_f16_310p[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_flash_attention_v2_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_flash_attention_v2_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_flash_attention_v2_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_flash_attention_v2_f16_910b[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_flash_attention_v2_f16_310p[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_flash_attention_v3_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_flash_attention_v3_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_flash_attention_v3_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_flash_attention_v3_f16_910b[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_flash_attention_v3_f16_310p[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_paged_attention_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_paged_attention_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_paged_attention_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_paged_attention_f16_910b[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_paged_attention_f16_310p[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rotary_embedding_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rotary_embedding_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rotary_embedding_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rotary_embedding_f16_910b[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rotary_embedding_f16_310p[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rope_apply_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rope_apply_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rope_apply_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rope_apply_f16_910b[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rope_apply_f16_310p[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_alibi_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_alibi_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_alibi_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_alibi_f16_910b[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_alibi_f16_310p[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_kv_cache_update_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_kv_cache_update_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_kv_cache_update_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_kv_cache_update_f16_910b[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_kv_cache_update_f16_310p[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_beam_search_score_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_beam_search_score_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_beam_search_score_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_beam_search_score_f16_910b[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_beam_search_score_f16_310p[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_matmul_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_matmul_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_matmul_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_matmul_f32_910b[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_matmul_f32_310p[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_matmul_f16_910b[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_matmul_f16_310p[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_batch_matmul_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_batch_matmul_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_batch_matmul_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_batch_matmul_f32_910b[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_batch_matmul_f32_310p[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_batch_matmul_f16_910b[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_batch_matmul_f16_310p[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_linear_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_linear_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_linear_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_linear_f32_910b[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_linear_f32_310p[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_linear_f16_910b[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_linear_f16_310p[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_gemm_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_gemm_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_gemm_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_gemm_f32_910b[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_gemm_f32_310p[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_gemm_f16_910b[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_gemm_f16_310p[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_gemv_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_gemv_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_gemv_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_gemv_f32_910b[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_gemv_f32_310p[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_gemv_f16_910b[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, 
S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_gemv_f16_310p[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_position_encoding_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_position_encoding_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_position_encoding_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_causal_mask_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_causal_mask_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_causal_mask_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_cross_attention_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_cross_attention_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_cross_attention_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, 
S6_HARDCODED_GM_SIZEforeach_grouped_query_attention_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_grouped_query_attention_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_grouped_query_attention_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_sliding_window_attention_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_sliding_window_attention_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_sliding_window_attention_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_linear_attention_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_linear_attention_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_linear_attention_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_sparse_attention_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, 
S6_HARDCODED_GM_SIZEforeach_sparse_attention_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_sparse_attention_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_local_attention_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_local_attention_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_local_attention_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_ring_attention_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_ring_attention_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_ring_attention_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_prefix_attention_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_prefix_attention_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_prefix_attention_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, 
tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_kv_cache_quantize_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_kv_cache_quantize_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_kv_cache_quantize_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_attention_score_mod_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_attention_score_mod_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_attention_score_mod_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_rope_neox_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_rope_neox_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_rope_neox_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_rope_glm_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, 
S6_HARDCODED_GM_SIZEforeach_rope_glm_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_rope_glm_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_matmul_quant_int8_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_matmul_quant_int8_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_attention_quant_int8_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_attention_quant_int8_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_linear_quant_int8_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_linear_quant_int8_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_matmul_quant_int4_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_matmul_quant_int4_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_attention_quant_int4_f16[EQUIVALENT]: barriers=3, bufs=2, 
vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_attention_quant_int4_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_linear_quant_int4_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_linear_quant_int4_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_multi_query_attention_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_multi_query_attention_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_multi_query_attention_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_flash_decoding_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_flash_decoding_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_flash_decoding_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_speculative_decoding_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, 
S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_speculative_decoding_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_speculative_decoding_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_token_mixing_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_token_mixing_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_token_mixing_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_channel_mixing_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_channel_mixing_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_channel_mixing_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_moe_gate_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_moe_gate_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_moe_gate_bf16[SLOW_1.02X]: barriers=3, 
bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_moe_dispatch_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_moe_dispatch_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_moe_dispatch_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_moe_combine_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_moe_combine_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_moe_combine_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_swiglu_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_swiglu_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_swiglu_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_geglu_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, 
S6_HARDCODED_GM_SIZEforeach_geglu_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_geglu_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_reglu_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_reglu_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_reglu_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_rmsnorm_linear_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_rmsnorm_linear_f16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_rmsnorm_linear_bf16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_prenorm_attention_f32[EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_prenorm_attention_f16[EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_prenorm_attention_bf16[EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, 
S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_postnorm_attention_f32[EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_postnorm_attention_f16[EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_postnorm_attention_bf16[EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_parallel_attention_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_parallel_attention_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_parallel_attention_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_sandwich_norm_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_sandwich_norm_f16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_sandwich_norm_bf16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_qk_norm_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, 
S5_UNIFORM_BUF_SIZEforeach_qk_norm_f16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_qk_norm_bf16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE