Appendix I: Performance Differential Analysis
Analysis of 998 CANN 8.5 kernel roundtrip performance patterns.
The ascend-rs compilation pipeline (Rust → MLIR → C++ → bisheng) introduces specific code generation patterns compared to hand-written AscendC C++. This appendix identifies those patterns, classifies their impact, and proposes generalizable optimizations.
I.1 Performance Classification
| Classification | Count | % | Description |
|---|---|---|---|
| EQUIVALENT | 121 | 12% | Generated code matches original C++ performance |
| SLOW_1.2X | 0 | 0% | ~20% slower due to medium-impact patterns |
| SLOW_1.5X | 0 | 0% | ~50% slower due to high-impact patterns (TBuf, barriers) |
| SLOW_2X+ | 0 | 0% | 2x+ slower due to multiple high-impact patterns |
I.2 Slowdown Patterns
TBuf instead of TQue (HIGH)
Affected kernels: 998/998
Problem: Uses TBuf<VECCALC> instead of TQue<VECIN/VECOUT>. TBuf requires explicit pipe_barrier(PIPE_ALL) for every sync point, while TQue uses hardware flags for fine-grained pipe overlap.
Fix: Generate TQue<QuePosition::VECIN, depth> with an AllocTensor/FreeTensor lifecycle instead of the repeated TBuf.Get pattern.
PIPE_ALL barriers (full pipeline stall) (HIGH)
Affected kernels: 998/998
Problem: Every ascend_pipe_barrier() generates pipe_barrier(PIPE_ALL) which stalls ALL hardware pipes simultaneously. Original C++ uses per-pipe sync via TQue or selective PIPE_V/PIPE_MTE2 flags.
Fix: Use pipe_barrier(PIPE_V) for compute-only sync, PIPE_MTE2 for DMA sync, or eliminate barriers entirely with TQue.
No double-buffering (HIGH)
Affected kernels: 998/998
Problem: DMA and compute are fully serialized: load→barrier→compute→barrier→store. Original C++ overlaps tile N+1 DMA with tile N compute using TQue depth=2.
Fix: Detect tiling loops and generate TQue with depth=2. Use EnQue/DeQue to overlap DMA with compute across tiles.
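A rough timing model shows why the overlap matters. The sketch below is idealized and not a measurement: `TileCost`, `serialized_time`, and `overlapped_time` are hypothetical names, the stage costs are arbitrary units, and it assumes each stage (load, compute, store) runs on its own pipe with enough queue depth to keep all three busy.

```rust
/// Per-tile stage costs in arbitrary time units (hypothetical values).
#[derive(Clone, Copy)]
pub struct TileCost {
    pub load: u64,
    pub compute: u64,
    pub store: u64,
}

/// Fully serialized schedule: load, barrier, compute, barrier, store for every tile.
pub fn serialized_time(cost: TileCost, tiles: u64) -> u64 {
    tiles * (cost.load + cost.compute + cost.store)
}

/// Idealized overlapped schedule: the first tile fills the pipeline,
/// after which one tile completes every `max(stage)` time units.
pub fn overlapped_time(cost: TileCost, tiles: u64) -> u64 {
    if tiles == 0 {
        return 0;
    }
    let bottleneck = cost.load.max(cost.compute).max(cost.store);
    (cost.load + cost.compute + cost.store) + (tiles - 1) * bottleneck
}
```

With stage costs of 4, 3, and 2 units over 8 tiles, the serialized schedule takes 72 units and the idealized overlapped one takes 37. Real kernels will land somewhere between the two, depending on queue depth and sync overhead.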
Uniform maximum buffer sizing (LOW)
Affected kernels: 998/998
Problem: Every TBuf gets the same maximum size, (UB_SIZE - 8KB) / num_bufs. Original C++ sizes each buffer to its actual data needs, so uniform sizing wastes UB space when buffers have different usage.
Fix: Track actual buffer usage in MLIR and allocate proportionally.
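The arithmetic behind both strategies can be sketched as follows. This is an illustration under assumptions, not the codegen's actual allocator: the function names are hypothetical, the 32-byte alignment is an assumed UB alignment, and the UB capacity is passed in because it differs between NPU models.

```rust
/// Uniform sizing as described above: every buffer gets the same share
/// of the UB space left after the reserved region.
pub fn uniform_size(ub_size: usize, reserved: usize, num_bufs: usize) -> usize {
    if num_bufs == 0 {
        return 0;
    }
    ub_size.saturating_sub(reserved) / num_bufs
}

/// Proportional sizing: split the available space according to each
/// buffer's tracked need, rounding each share down to an assumed
/// 32-byte alignment.
pub fn proportional_sizes(ub_size: usize, reserved: usize, needs: &[usize]) -> Vec<usize> {
    const ALIGN: usize = 32; // assumption: UB allocations are 32-byte aligned
    let avail = ub_size.saturating_sub(reserved);
    let total: usize = needs.iter().sum();
    if total == 0 {
        return vec![0; needs.len()];
    }
    needs
        .iter()
        .map(|&need| (avail * need / total) / ALIGN * ALIGN)
        .collect()
}
```

For example, assuming a 192 KB UB with 8 KB reserved, three buffers whose needs are in a 1:1:2 ratio would receive 47104, 47104, and 94208 bytes instead of three equal shares.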
Scalar math vectorization workaround (MEDIUM)
Affected kernels: 1/998
Problem: Scalar log/exp/sqrt operations are vectorized via a 1KB scratch buffer because the scalar pipe hangs on some NPU models. This adds DMA and buffer overhead for each scalar math op.
Fix: Use scalar pipe on models that support it; on others, amortize by batching scalar ops.
I.3 Optimization Opportunities
Barrier elision opportunity (MEDIUM)
Applicable kernels: 998/998
Description: Consecutive vector ops on DIFFERENT buffers don’t need barriers between them. The current codegen inserts barriers whenever dirty_bufs overlap, but many ops are independent.
Implementation: Implement per-buffer dirty tracking at the MLIR level. Only insert barrier when a read-after-write hazard exists on the SAME buffer.
Loop unrolling candidate (LOW)
Applicable kernels: 998/998
Description: Small fixed-iteration loops (e.g., softmax’s 2-pass reduce) could be unrolled. The current codegen emits generic while(true) loops.
Implementation: Detect loops with known small trip counts and unroll.
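One possible shape for this decision is sketched below. `LoopPlan`, `plan_loop`, `unroll`, the `$i` placeholder, and the threshold parameter are all hypothetical and do not come from the existing codegen.

```rust
/// Decision for one loop (hypothetical type, not part of the existing codegen).
#[derive(Debug, PartialEq)]
pub enum LoopPlan {
    /// Emit the body this many times with the induction variable substituted.
    Unroll(u32),
    /// Fall back to the generic loop form.
    Generic,
}

/// Unroll only when the trip count is known at codegen time and small enough.
pub fn plan_loop(trip_count: Option<u32>, max_unroll: u32) -> LoopPlan {
    match trip_count {
        Some(n) if n > 0 && n <= max_unroll => LoopPlan::Unroll(n),
        _ => LoopPlan::Generic,
    }
}

/// Expand `body_template`, replacing the placeholder `$i` with each iteration index.
pub fn unroll(body_template: &str, n: u32) -> String {
    (0..n)
        .map(|i| body_template.replace("$i", &i.to_string()))
        .collect::<Vec<_>>()
        .join("\n")
}
```

A two-pass reduce with a known trip count of 2 would be planned as `Unroll(2)`, while a loop with an unknown or large trip count keeps the generic form.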
Operation fusion candidate (MEDIUM)
Applicable kernels: 0/998
Description: Sequential vector ops on the same buffer (e.g., Sub→Exp or Div→Cast) could be fused into a single vector instruction or at least share a barrier. Current codegen treats each as independent.
Implementation: Detect chains of unary/binary ops on the same buffer and fuse into composite AscendC instructions.
I.4 Generalizable Optimization Plan
Based on the pattern analysis, three optimizations would close the performance gap for the majority of kernels:
Priority 1: TQue Migration (closes ~50% of gap)
Replace TBuf<VECCALC> with TQue<VECIN/VECOUT> in the MLIR→C++ codegen.
This eliminates PIPE_ALL barriers in favor of hardware flag-based sync,
and enables double-buffering for DMA/compute overlap.
Affected files: crates/rustc_codegen_mlir/src/mlir_to_cpp.rs
Changes required:
- Change buffer declarations from TBuf<TPosition::VECCALC> to TQue<QuePosition::VECIN>/TQue<QuePosition::VECOUT>
- Replace tbuf.Get<T>() with inQueue.AllocTensor<T>()/inQueue.DeQue<T>()
- Add inQueue.EnQue(tensor)/outQueue.FreeTensor(tensor) lifecycle
- Replace pipe_barrier(PIPE_ALL) with implicit TQue sync
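The changes listed above could look roughly like this on the emitter side. This is a minimal sketch, not the contents of mlir_to_cpp.rs: the Rust function names and the C++ identifiers `xGm`, `offset`, `tileLen`, `x`, `inQueue`, and `outQueue` are assumptions chosen for illustration.

```rust
/// Emit AscendC queue declarations for one input and one output buffer.
/// `depth` is the queue depth; 2 is what double-buffering would use.
pub fn emit_queue_decls(depth: u32) -> String {
    format!(
        "AscendC::TPipe pipe;\n\
         AscendC::TQue<AscendC::QuePosition::VECIN, {d}> inQueue;\n\
         AscendC::TQue<AscendC::QuePosition::VECOUT, {d}> outQueue;\n",
        d = depth
    )
}

/// Emit the copy-in stage: allocate from the queue, start the DMA, then
/// enqueue. The EnQue/DeQue pair carries the synchronization, so no
/// explicit pipe barrier is emitted.
pub fn emit_copy_in(elem_ty: &str) -> String {
    format!(
        "AscendC::LocalTensor<{t}> x = inQueue.AllocTensor<{t}>();\n\
         AscendC::DataCopy(x, xGm[offset], tileLen);\n\
         inQueue.EnQue(x);\n",
        t = elem_ty
    )
}
```

The compute and copy-out stages would follow the same pattern, with DeQue on the consuming side and FreeTensor once the tensor is no longer needed.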
Priority 2: Barrier Elision (closes ~20% of gap)
Implement per-buffer dirty tracking to eliminate barriers between independent vector operations. Only insert barriers when a read-after-write hazard exists on the SAME buffer.
Current behavior: Every vector op that reads a dirty buffer triggers PIPE_ALL.
Proposed behavior: Track dirty state per-buffer. Only barrier when:
- A DMA load writes buffer B, then a vector op reads buffer B
- A vector op writes buffer B, then a DMA store reads buffer B
- Skip barriers between Add(buf0, buf1, buf2) and Mul(buf3, buf0, buf4) when buf0 is not dirty
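The rules above can be sketched as a small tracker. This is a simplified model under assumptions: the `Op`, `DirtyTracker`, and `count_barriers` names are hypothetical, only the two read-after-write cases listed above are checked, and any emitted barrier is assumed to synchronize all outstanding writes. Write-after-read hazards are not modeled and would need separate handling.

```rust
use std::collections::HashSet;

pub type BufId = u32;

/// Simplified op stream (hypothetical; the real MLIR ops carry more detail).
pub enum Op {
    DmaLoad { dst: BufId },
    Vector { dst: BufId, srcs: Vec<BufId> },
    DmaStore { src: BufId },
}

#[derive(Default)]
pub struct DirtyTracker {
    dma_written: HashSet<BufId>, // written by a DMA load, not yet synchronized
    vec_written: HashSet<BufId>, // written by a vector op, not yet synchronized
}

impl DirtyTracker {
    /// Returns true when a barrier must be emitted before `op`.
    pub fn needs_barrier(&mut self, op: &Op) -> bool {
        let hazard = match op {
            Op::DmaLoad { .. } => false,
            // DMA load wrote buffer B, then a vector op reads B.
            Op::Vector { srcs, .. } => srcs.iter().any(|b| self.dma_written.contains(b)),
            // Vector op wrote buffer B, then a DMA store reads B.
            Op::DmaStore { src } => self.vec_written.contains(src),
        };
        if hazard {
            // Assumption: the barrier synchronizes every outstanding write.
            self.dma_written.clear();
            self.vec_written.clear();
        }
        match op {
            Op::DmaLoad { dst } => {
                self.dma_written.insert(*dst);
            }
            Op::Vector { dst, .. } => {
                self.vec_written.insert(*dst);
            }
            Op::DmaStore { .. } => {}
        }
        hazard
    }
}

/// Count the barriers the tracker would emit for a whole op sequence.
pub fn count_barriers(ops: &[Op]) -> usize {
    let mut tracker = DirtyTracker::default();
    ops.iter().filter(|op| tracker.needs_barrier(*op)).count()
}
```

For a sequence of two loads, two vector ops, and one store, this model emits two barriers: one before the first vector op that reads loaded data, and one before the store.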
Priority 3: Operation Fusion (closes ~10% of gap)
Fuse sequential vector ops on the same buffer into compound operations:
- Sub(buf, x, max) → Exp(buf, buf) → single AscendC call with Sub+Exp
- Muls(buf, buf, scale) → Adds(buf, buf, bias) → MulAdd composite
- Eliminate intermediate barriers between fused ops
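Finding fusion candidates might start with a pass like the one below. It is a sketch under assumptions: `VecOp` and `fusion_groups` are hypothetical, each op is reduced to one destination and one primary source buffer, and the pass only groups candidates. Whether a composite AscendC instruction exists for a given group is a separate question that this code does not answer.

```rust
/// Simplified vector op: one destination and one primary source buffer.
/// Scalar or secondary operands are ignored in this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct VecOp {
    pub name: &'static str,
    pub dst: u32,
    pub src: u32,
}

/// Group consecutive ops where an op works in place on the buffer that
/// the previous op just wrote. Each group is a fusion candidate.
pub fn fusion_groups(ops: &[VecOp]) -> Vec<Vec<&'static str>> {
    let mut groups: Vec<Vec<&'static str>> = Vec::new();
    let mut prev_dst: Option<u32> = None;
    for op in ops {
        let in_place_chain = op.dst == op.src && prev_dst == Some(op.src);
        if in_place_chain {
            // `prev_dst` is only set after a group exists, so `last_mut` is Some here.
            if let Some(group) = groups.last_mut() {
                group.push(op.name);
            }
        } else {
            groups.push(vec![op.name]);
        }
        prev_dst = Some(op.dst);
    }
    groups
}
```

Applied to the two chains above, the pass yields the groups [Sub, Exp] and [Muls, Adds]; each group of length n is a candidate for dropping n - 1 intermediate barriers.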
I.5 Per-Category Performance Summary
| Category | Total | Equivalent | Slow 1.2x | Slow 1.5x | Slow 2x+ |
|---|---|---|---|---|---|
| ops_index | 114 | 6 | 0 | 0 | 0 |
| ops_legacy | 200 | 0 | 0 | 0 | 0 |
| ops_math | 120 | 9 | 0 | 0 | 0 |
| ops_nn | 150 | 0 | 0 | 0 | 0 |
| ops_optimizer | 82 | 0 | 0 | 0 | 0 |
| ops_reduce | 80 | 0 | 0 | 0 | 0 |
| ops_resize | 52 | 0 | 0 | 0 | 0 |
| ops_transformer | 200 | 106 | 0 | 0 | 0 |
I.6 Per-Kernel Pattern Detail
ops_index (114 kernels)
foreach_gather_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_gather_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_gather_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_scatter_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_scatter_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_scatter_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_scatter_add_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_scatter_add_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_scatter_add_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_scatter_mul_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_scatter_mul_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, 
S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_scatter_mul_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_add_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_add_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_add_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_copy_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_copy_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_copy_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_fill_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_fill_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_fill_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_select_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, 
S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_select_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_select_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_put_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_put_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_put_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_masked_fill_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_masked_fill_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_masked_fill_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_masked_select_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_masked_select_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_masked_select_int32[SLOW_1.02X]: 
barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_masked_scatter_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_masked_scatter_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_masked_scatter_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_where_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_where_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_where_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_nonzero_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_nonzero_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_nonzero_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_sort_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, 
S6_HARDCODED_GM_SIZEforeach_sort_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_sort_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_argsort_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_argsort_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_argsort_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_topk_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_topk_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_topk_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_unique_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_unique_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_unique_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, 
S6_HARDCODED_GM_SIZEforeach_searchsorted_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_searchsorted_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_searchsorted_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_bucketize_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_bucketize_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_bucketize_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_one_hot_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_one_hot_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_one_hot_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_embedding_bag_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_embedding_bag_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, 
S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_embedding_bag_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_cummax_f32[EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_cummax_f16[EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_cummax_int32[EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_cummin_f32[EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_cummin_f16[EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_cummin_int32[EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_scatter_nd_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_scatter_nd_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_scatter_nd_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_gather_nd_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — 
S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_gather_nd_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_gather_nd_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_put_accumulate_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_put_accumulate_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_put_accumulate_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_take_along_axis_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_take_along_axis_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_take_along_axis_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_put_along_axis_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_put_along_axis_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, 
S6_HARDCODED_GM_SIZEforeach_put_along_axis_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_bincount_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_bincount_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_bincount_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_scatter_max_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_scatter_max_f16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_scatter_max_int32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_scatter_min_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_scatter_min_f16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_scatter_min_int32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_gather_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, 
S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_scatter_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_select_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_where_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_sort_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_topk_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_masked_fill_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_masked_select_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_sort_int64[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_argsort_int64[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_topk_int64[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_unique_int64[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — 
S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_gather_int8[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_scatter_int8[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_scatter_add_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_scatter_mul_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_add_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_index_copy_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
ops_legacy (200 kernels)
foreach_exp_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_exp_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_exp_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_abs_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_abs_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_abs_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_neg_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_neg_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_neg_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_sqrt_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_sqrt_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_sqrt_bf16[SLOW_1.02X]: 
barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_rsqrt_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_rsqrt_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_rsqrt_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_reciprocal_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_reciprocal_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_reciprocal_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_ln_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_ln_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_ln_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_log2_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_log2_f16[SLOW_1.02X]: barriers=3, 
bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_log2_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_log10_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_log10_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_log10_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_ceil_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_ceil_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_ceil_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_floor_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_floor_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_floor_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_round_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_round_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_round_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_trunc_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_trunc_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_trunc_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sign_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sign_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sign_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_not_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_not_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_not_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_bitwise_not_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_bitwise_not_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_bitwise_not_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_logical_not_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_logical_not_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_logical_not_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_clamp_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_clamp_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_clamp_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_add_list_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_add_list_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_add_list_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sub_list_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sub_list_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sub_list_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_mul_list_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_mul_list_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_mul_list_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_div_list_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_div_list_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_div_list_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_max_list_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_max_list_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_max_list_bf16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_min_list_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_min_list_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_min_list_bf16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_pow_list_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_pow_list_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_pow_list_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_fmod_list_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_fmod_list_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_fmod_list_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_bitwise_and_list_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_bitwise_and_list_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_bitwise_and_list_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_bitwise_or_list_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_bitwise_or_list_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_bitwise_or_list_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_bitwise_xor_list_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_bitwise_xor_list_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_bitwise_xor_list_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_logical_and_list_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_logical_and_list_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_logical_and_list_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_logical_or_list_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_logical_or_list_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_logical_or_list_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_add_scalar_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_add_scalar_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_add_scalar_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sub_scalar_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sub_scalar_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sub_scalar_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_mul_scalar_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_mul_scalar_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_mul_scalar_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_div_scalar_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_div_scalar_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_div_scalar_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_max_scalar_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_max_scalar_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_max_scalar_bf16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_min_scalar_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_min_scalar_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_min_scalar_bf16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_pow_scalar_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_pow_scalar_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_pow_scalar_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_clamp_scalar_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_clamp_scalar_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_clamp_scalar_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_add_list_alpha_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_add_list_alpha_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_add_list_alpha_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sub_list_alpha_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sub_list_alpha_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sub_list_alpha_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_addcmul_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_addcdiv_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_copy_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_zero_inplace_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_lerp_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_addcmul_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_addcdiv_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_copy_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_zero_inplace_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_lerp_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_addcmul_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_addcdiv_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_copy_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_zero_inplace_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_lerp_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
zeros_like_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
ones_like_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
zeros_like_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
ones_like_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
zeros_like_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
ones_like_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
zeros_like_int32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
ones_like_int32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_abs_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_abs_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_abs_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_relu_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_relu_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_relu_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_gelu_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_gelu_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_gelu_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_silu_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_silu_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_silu_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_neg_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_neg_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_neg_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_sign_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_sign_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_sign_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_ceil_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_ceil_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_ceil_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_floor_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_floor_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise_floor_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise16b_abs_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise16b_abs_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise16b_abs_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise16b_relu_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise16b_relu_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise16b_relu_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise16b_neg_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise16b_neg_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise16b_neg_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise16b_sign_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise16b_sign_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
elementwise16b_sign_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_abs_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_neg_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sign_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_bitwise_not_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_logical_not_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_clamp_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_add_list_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sub_list_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_mul_list_int32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_max_list_int32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_abs_int8[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_neg_int8[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_bitwise_not_int8[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_clamp_int8[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_add_scalar_int32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sub_scalar_int32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_mul_scalar_int32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_div_scalar_int32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
ops_math (120 kernels)
foreach_sin_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sin_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sin_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_cos_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_cos_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_cos_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_tan_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_tan_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_tan_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_asin_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_asin_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_asin_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_acos_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_acos_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_acos_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_atan_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_atan_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_atan_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_atan2_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_atan2_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_atan2_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sinh_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sinh_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sinh_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_cosh_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_cosh_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_cosh_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_tanh_math_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_tanh_math_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_tanh_math_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_asinh_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_asinh_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_asinh_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_acosh_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_acosh_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_acosh_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_atanh_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_atanh_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_atanh_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_erf_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_erf_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_erf_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_erfc_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_erfc_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_erfc_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_erfinv_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_erfinv_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_erfinv_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_expm1_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_expm1_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_expm1_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_log1p_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_log1p_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_log1p_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_softplus_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_softplus_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_softplus_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_digamma_f32[EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_digamma_f16[EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_digamma_bf16[EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_lgamma_f32[EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_lgamma_f16[EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_lgamma_bf16[EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_i0_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_i0_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_i0_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_i1_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_i1_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_i1_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_hypot_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_hypot_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_hypot_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_fma_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_fma_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_fma_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_remainder_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_remainder_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_remainder_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_copysign_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_copysign_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_copysign_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_nextafter_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_nextafter_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_nextafter_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_ldexp_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_ldexp_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_ldexp_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_frexp_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_frexp_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_frexp_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_logaddexp_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_logaddexp_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_logaddexp_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_logaddexp2_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_logaddexp2_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_logaddexp2_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sincos_f32_910b[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sincos_f16_910b[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sincos_bf16_910b[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sincospi_f32_910b[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sincospi_f16_910b[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sincospi_bf16_910b[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_j0_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_j0_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_j0_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_j1_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_j1_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_j1_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_y0_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_y0_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_y0_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_y1_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_y1_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_y1_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_polygamma_f32[EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_polygamma_f16[EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_polygamma_bf16[EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_zeta_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_zeta_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_zeta_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
ops_nn (150 kernels)
foreach_relu_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_relu_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_relu_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_relu6_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_relu6_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_relu6_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_leaky_relu_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_leaky_relu_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_leaky_relu_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_prelu_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_prelu_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_prelu_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_elu_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_elu_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_elu_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_selu_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_selu_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_selu_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_gelu_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_gelu_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_gelu_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_fast_gelu_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_fast_gelu_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_fast_gelu_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sigmoid_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sigmoid_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sigmoid_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_hardsigmoid_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_hardsigmoid_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_hardsigmoid_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_hardswish_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_hardswish_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_hardswish_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_hardtanh_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_hardtanh_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_hardtanh_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_silu_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_silu_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_silu_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_mish_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_mish_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_mish_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_softplus_nn_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_softplus_nn_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_softplus_nn_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_softsign_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_softsign_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_softsign_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_tanh_nn_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_tanh_nn_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_tanh_nn_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_celu_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_celu_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_celu_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_glu_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_glu_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_glu_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rrelu_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rrelu_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rrelu_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_batch_norm_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_batch_norm_f16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_batch_norm_bf16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_instance_norm_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_instance_norm_f16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_instance_norm_bf16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_layer_norm_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_layer_norm_f16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_layer_norm_bf16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_group_norm_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_group_norm_f16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_group_norm_bf16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_rms_norm_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_rms_norm_f16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_rms_norm_bf16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_softmax_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_softmax_f16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_softmax_bf16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_log_softmax_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_log_softmax_f16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_log_softmax_bf16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_dropout_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_dropout_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_dropout_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_embedding_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_embedding_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_embedding_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_swish_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_swish_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_swish_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_logsigmoid_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_logsigmoid_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_logsigmoid_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_tanhshrink_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_tanhshrink_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_tanhshrink_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_softshrink_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_softshrink_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, 
S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_softshrink_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_hardshrink_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_hardshrink_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_hardshrink_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_threshold_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_threshold_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_threshold_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_cross_entropy_loss_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_cross_entropy_loss_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_cross_entropy_loss_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_mse_loss_f32[SLOW_1.02X]: barriers=3, bufs=3, 
vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_mse_loss_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_mse_loss_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_l1_loss_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_l1_loss_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_l1_loss_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_smooth_l1_loss_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_smooth_l1_loss_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_smooth_l1_loss_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_nll_loss_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_nll_loss_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, 
S6_HARDCODED_GM_SIZEforeach_nll_loss_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_avg_pool_2d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_avg_pool_2d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_avg_pool_2d_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_max_pool_2d_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_max_pool_2d_f16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_max_pool_2d_bf16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_avg_pool_1d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_avg_pool_1d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_avg_pool_1d_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_max_pool_1d_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, 
S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_max_pool_1d_f16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_max_pool_1d_bf16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_lp_pool_2d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_lp_pool_2d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_lp_pool_2d_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_bce_loss_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_bce_loss_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_bce_loss_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_bce_with_logits_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_bce_with_logits_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_bce_with_logits_bf16[SLOW_1.02X]: 
barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_hinge_loss_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_hinge_loss_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_hinge_loss_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_kl_div_loss_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_kl_div_loss_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_kl_div_loss_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_cosine_embedding_loss_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_cosine_embedding_loss_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_cosine_embedding_loss_bf16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
ops_optimizer (82 kernels)
foreach_adam_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adam_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adam_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adam_f32_wd[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adamw_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adamw_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adamw_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adamw_f32_wd[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sgd_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sgd_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sgd_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sgd_f32_wd[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sgd_momentum_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sgd_momentum_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sgd_momentum_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sgd_momentum_f32_wd[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adagrad_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adagrad_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adagrad_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adagrad_f32_wd[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adadelta_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adadelta_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adadelta_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adadelta_f32_wd[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rmsprop_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rmsprop_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rmsprop_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rmsprop_f32_wd[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_lamb_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_lamb_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_lamb_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_lamb_f32_wd[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_lars_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_lars_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_lars_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_lars_f32_wd[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_ftrl_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_ftrl_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_ftrl_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_ftrl_f32_wd[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adam_amsgrad_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adam_amsgrad_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adam_amsgrad_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adamw_amsgrad_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adamw_amsgrad_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adamw_amsgrad_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adam_fused_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adam_fused_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adam_fused_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adamw_fused_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adamw_fused_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adamw_fused_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sgd_nesterov_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sgd_nesterov_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sgd_nesterov_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_lion_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_lion_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_lion_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adafactor_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adafactor_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adafactor_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sophia_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sophia_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_sophia_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_came_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_came_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_came_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_novograd_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_novograd_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_novograd_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_prodigy_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_prodigy_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_prodigy_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_shampoo_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_shampoo_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_shampoo_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adalomo_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adalomo_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adalomo_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_galore_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_galore_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_galore_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
ops_reduce (80 kernels)
foreach_reduce_sum_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_sum_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_sum_int32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_max_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_max_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_max_int32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_min_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_min_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_min_int32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_mean_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_mean_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_mean_int32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_prod_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_prod_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_prod_int32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_any_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_any_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_any_int32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_all_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_all_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_all_int32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_argmax_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_argmax_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_argmax_int32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_argmin_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_argmin_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_argmin_int32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_cumsum_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_cumsum_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_cumsum_int32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_cumprod_f32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_cumprod_f16[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_cumprod_int32[SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_reduce_sum_bf16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_sum_f32_axis0[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_sum_f16_axis0[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_sum_f32_axis1[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_sum_f16_axis1[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_max_bf16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_max_f32_axis0[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_max_f16_axis0[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_max_f32_axis1[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_reduce_max_f16_axis1[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS,
S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_min_bf16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_min_f32_axis0[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_min_f16_axis0[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_min_f32_axis1[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_min_f16_axis1[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_mean_bf16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_mean_f32_axis0[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_mean_f16_axis0[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_mean_f32_axis1[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_mean_f16_axis1[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, 
S5_UNIFORM_BUF_SIZEforeach_reduce_prod_bf16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_prod_f32_axis0[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_prod_f16_axis0[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_prod_f32_axis1[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_prod_f16_axis1[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_l1_norm_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_l1_norm_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_l2_norm_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_l2_norm_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_logsumexp_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_logsumexp_f16[SLOW_1.05X]: barriers=4, bufs=3, 
vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_nansum_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_nansum_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_nanmean_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_nanmean_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_count_nonzero_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_count_nonzero_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_median_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_median_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_var_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_var_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, 
S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_std_f32[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_std_f16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_l1_norm_bf16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_l2_norm_bf16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_logsumexp_bf16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_reduce_nansum_bf16[SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
ops_resize (52 kernels)
foreach_upsample_nearest_2d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_upsample_nearest_2d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_upsample_nearest_3d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_upsample_nearest_3d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_upsample_bilinear_2d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_upsample_bilinear_2d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_upsample_bilinear_3d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_upsample_bilinear_3d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_upsample_bicubic_2d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_upsample_bicubic_2d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_upsample_trilinear_3d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_upsample_trilinear_3d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_interpolate_nearest_2d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_interpolate_nearest_2d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_interpolate_nearest_3d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_interpolate_nearest_3d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_interpolate_bilinear_2d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_interpolate_bilinear_2d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_interpolate_bilinear_3d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_interpolate_bilinear_3d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_interpolate_bicubic_2d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_interpolate_bicubic_2d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_resize_nearest_2d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_resize_nearest_2d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_resize_bilinear_2d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_resize_bilinear_2d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adaptive_avg_pool_2d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adaptive_avg_pool_2d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adaptive_avg_pool_3d_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adaptive_avg_pool_3d_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_adaptive_max_pool_2d_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_adaptive_max_pool_2d_f16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_adaptive_max_pool_3d_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_adaptive_max_pool_3d_f16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
foreach_upsample_bilinear_2d_align_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_upsample_bilinear_2d_align_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_upsample_bicubic_2d_align_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_upsample_bicubic_2d_align_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_interpolate_bilinear_2d_align_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_interpolate_bilinear_2d_align_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_resize_bilinear_2d_align_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_resize_bilinear_2d_align_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_grid_sample_bilinear_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_grid_sample_bilinear_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_grid_sample_nearest_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_grid_sample_nearest_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_grid_sample_bicubic_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_grid_sample_bicubic_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_pixel_shuffle_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_pixel_unshuffle_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_pixel_shuffle_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_pixel_unshuffle_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
ops_transformer (200 kernels)
foreach_attention_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_attention_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_attention_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_attention_f16_910b[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_attention_f16_310p[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_scaled_dot_product_attention_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_scaled_dot_product_attention_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_scaled_dot_product_attention_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_scaled_dot_product_attention_f16_910b[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_scaled_dot_product_attention_f16_310p[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_multi_head_attention_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_multi_head_attention_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_multi_head_attention_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_multi_head_attention_f16_910b[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_multi_head_attention_f16_310p[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_flash_attention_v1_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_flash_attention_v1_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_flash_attention_v1_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_flash_attention_v1_f16_910b[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_flash_attention_v1_f16_310p[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_flash_attention_v2_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_flash_attention_v2_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_flash_attention_v2_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_flash_attention_v2_f16_910b[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_flash_attention_v2_f16_310p[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_flash_attention_v3_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_flash_attention_v3_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_flash_attention_v3_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_flash_attention_v3_f16_910b[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_flash_attention_v3_f16_310p[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_paged_attention_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_paged_attention_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_paged_attention_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_paged_attention_f16_910b[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_paged_attention_f16_310p[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rotary_embedding_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rotary_embedding_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rotary_embedding_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rotary_embedding_f16_910b[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rotary_embedding_f16_310p[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rope_apply_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rope_apply_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rope_apply_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rope_apply_f16_910b[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_rope_apply_f16_310p[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_alibi_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_alibi_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_alibi_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_alibi_f16_910b[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_alibi_f16_310p[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_kv_cache_update_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_kv_cache_update_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_kv_cache_update_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_kv_cache_update_f16_910b[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_kv_cache_update_f16_310p[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_beam_search_score_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_beam_search_score_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_beam_search_score_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_beam_search_score_f16_910b[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_beam_search_score_f16_310p[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_matmul_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_matmul_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_matmul_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_matmul_f32_910b[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_matmul_f32_310p[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_matmul_f16_910b[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_matmul_f16_310p[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_batch_matmul_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_batch_matmul_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_batch_matmul_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_batch_matmul_f32_910b[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_batch_matmul_f32_310p[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_batch_matmul_f16_910b[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_batch_matmul_f16_310p[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_linear_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_linear_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_linear_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_linear_f32_910b[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_linear_f32_310p[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_linear_f16_910b[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_linear_f16_310p[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_gemm_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
foreach_gemm_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, 
S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_gemm_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_gemm_f32_910b[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_gemm_f32_310p[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_gemm_f16_910b[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_gemm_f16_310p[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_gemv_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_gemv_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_gemv_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_gemv_f32_910b[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_gemv_f32_310p[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_gemv_f16_910b[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, 
S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_gemv_f16_310p[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_position_encoding_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_position_encoding_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_position_encoding_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_causal_mask_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_causal_mask_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_causal_mask_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_cross_attention_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_cross_attention_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_cross_attention_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, 
S6_HARDCODED_GM_SIZEforeach_grouped_query_attention_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_grouped_query_attention_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_grouped_query_attention_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_sliding_window_attention_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_sliding_window_attention_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_sliding_window_attention_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_linear_attention_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_linear_attention_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_linear_attention_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_sparse_attention_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, 
S6_HARDCODED_GM_SIZEforeach_sparse_attention_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_sparse_attention_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_local_attention_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_local_attention_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_local_attention_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_ring_attention_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_ring_attention_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_ring_attention_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_prefix_attention_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_prefix_attention_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_prefix_attention_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, 
tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_kv_cache_quantize_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_kv_cache_quantize_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_kv_cache_quantize_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_attention_score_mod_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_attention_score_mod_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_attention_score_mod_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_rope_neox_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_rope_neox_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_rope_neox_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_rope_glm_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, 
S6_HARDCODED_GM_SIZEforeach_rope_glm_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_rope_glm_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_matmul_quant_int8_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_matmul_quant_int8_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_attention_quant_int8_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_attention_quant_int8_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_linear_quant_int8_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_linear_quant_int8_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_matmul_quant_int4_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_matmul_quant_int4_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_attention_quant_int4_f16[EQUIVALENT]: barriers=3, bufs=2, 
vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_attention_quant_int4_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_linear_quant_int4_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_linear_quant_int4_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_multi_query_attention_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_multi_query_attention_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_multi_query_attention_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_flash_decoding_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_flash_decoding_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_flash_decoding_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_speculative_decoding_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, 
S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_speculative_decoding_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_speculative_decoding_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_token_mixing_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_token_mixing_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_token_mixing_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_channel_mixing_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_channel_mixing_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_channel_mixing_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_moe_gate_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_moe_gate_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_moe_gate_bf16[SLOW_1.02X]: barriers=3, 
bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_moe_dispatch_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_moe_dispatch_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_moe_dispatch_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_moe_combine_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_moe_combine_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_moe_combine_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_swiglu_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_swiglu_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_swiglu_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_geglu_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, 
S6_HARDCODED_GM_SIZEforeach_geglu_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_geglu_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_reglu_f32[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_reglu_f16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_reglu_bf16[SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_rmsnorm_linear_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_rmsnorm_linear_f16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_rmsnorm_linear_bf16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_prenorm_attention_f32[EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_prenorm_attention_f16[EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_prenorm_attention_bf16[EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, 
S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_postnorm_attention_f32[EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_postnorm_attention_f16[EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_postnorm_attention_bf16[EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_parallel_attention_f32[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_parallel_attention_f16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_parallel_attention_bf16[EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZEforeach_sandwich_norm_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_sandwich_norm_f16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_sandwich_norm_bf16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_qk_norm_f32[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, 
S5_UNIFORM_BUF_SIZEforeach_qk_norm_f16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZEforeach_qk_norm_bf16[SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE