Appendix I: Performance Differential Analysis

Analysis of 998 CANN 8.5 kernel roundtrip performance patterns.

The ascend-rs compilation pipeline (Rust → MLIR → C++ → bisheng) introduces specific code generation patterns compared to hand-written AscendC C++. This appendix identifies those patterns, classifies their impact, and proposes generalizable optimizations.

I.1 Performance Classification

| Classification | Count | % | Description |
|---|---|---|---|
| EQUIVALENT | 121 | 12% | Generated code matches original C++ performance |
| SLOW_1.2X | 0 | 0% | ~20% slower due to medium-impact patterns |
| SLOW_1.5X | 0 | 0% | ~50% slower due to high-impact patterns (TBuf, barriers) |
| SLOW_2X+ | 0 | 0% | 2x+ slower due to multiple high-impact patterns |

I.2 Slowdown Patterns

TBuf instead of TQue (HIGH)

Affected kernels: 998/998

Problem: The generated code uses TBuf<VECCALC> instead of TQue<VECIN/VECOUT>. TBuf requires an explicit pipe_barrier(PIPE_ALL) at every sync point, while TQue uses hardware flags for fine-grained pipe overlap.

Fix: Generate TQue<QuePosition::VECIN, depth> with an AllocTensor/FreeTensor lifecycle instead of the TBuf.Get pattern.


PIPE_ALL barriers (full pipeline stall) (HIGH)

Affected kernels: 998/998

Problem: Every ascend_pipe_barrier() generates pipe_barrier(PIPE_ALL) which stalls ALL hardware pipes simultaneously. Original C++ uses per-pipe sync via TQue or selective PIPE_V/PIPE_MTE2 flags.

Fix: Use pipe_barrier(PIPE_V) for compute-only sync, PIPE_MTE2 for DMA sync, or eliminate barriers entirely with TQue.


No double-buffering (HIGH)

Affected kernels: 998/998

Problem: DMA and compute are fully serialized: load→barrier→compute→barrier→store. Original C++ overlaps tile N+1 DMA with tile N compute using TQue depth=2.

Fix: Detect tiling loops and generate TQue with depth=2. Use EnQue/DeQue to overlap DMA with compute across tiles.
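
The benefit of depth=2 can be illustrated with a toy timing model. The sketch below is not part of ascend-rs: the function names are hypothetical, and it assumes each tile's load, compute, and store take a fixed number of cycles on three independent pipes, with `depth` buffer slots per queue.

```rust
/// Serialized schedule: load -> barrier -> compute -> barrier -> store per tile.
pub fn serialized_cycles(tiles: usize, load: u64, compute: u64, store: u64) -> u64 {
    tiles as u64 * (load + compute + store)
}

/// Queue-style schedule (assumed model): each stage runs on its own pipe and
/// a stage only waits for its input data and for a free buffer slot.
pub fn pipelined_cycles(tiles: usize, depth: usize, load: u64, compute: u64, store: u64) -> u64 {
    assert!(depth >= 1, "a queue needs at least one buffer slot");
    let mut load_done = vec![0u64; tiles];
    let mut comp_done = vec![0u64; tiles];
    let mut store_done = vec![0u64; tiles];
    for i in 0..tiles {
        // An input slot is reusable once tile i-depth has been computed.
        let in_slot_free = if i >= depth { comp_done[i - depth] } else { 0 };
        let prev_load = if i > 0 { load_done[i - 1] } else { 0 };
        load_done[i] = prev_load.max(in_slot_free) + load;

        // An output slot is reusable once tile i-depth has been stored.
        let out_slot_free = if i >= depth { store_done[i - depth] } else { 0 };
        let prev_comp = if i > 0 { comp_done[i - 1] } else { 0 };
        comp_done[i] = load_done[i].max(prev_comp).max(out_slot_free) + compute;

        let prev_store = if i > 0 { store_done[i - 1] } else { 0 };
        store_done[i] = comp_done[i].max(prev_store) + store;
    }
    store_done.last().copied().unwrap_or(0)
}
```

Under this model, four tiles with 10-cycle stages take 120 cycles serialized and 60 cycles with depth=2. Real kernels will differ, since stage times are rarely equal or constant.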


Uniform maximum buffer sizing (LOW)

Affected kernels: 998/998

Problem: Every TBuf gets the same maximum size, (UB_SIZE - 8KB) / num_bufs. Original C++ sizes each buffer to its actual data needs, so uniform sizing wastes UB space when buffers have different usage.

Fix: Track actual buffer usage in MLIR and allocate proportionally.
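
A minimal sketch of the two sizing policies, assuming the codegen can supply a per-buffer usage estimate. The function names, the `usage` weights, and the `align` parameter are hypothetical and not taken from the ascend-rs sources; rounding each share down to an alignment boundary is also an assumption.

```rust
/// Current policy as described above: every buffer gets
/// (ub_size - reserved) / num_bufs bytes.
pub fn uniform_sizes(ub_size: usize, reserved: usize, num_bufs: usize) -> Vec<usize> {
    if num_bufs == 0 {
        return Vec::new();
    }
    vec![ub_size.saturating_sub(reserved) / num_bufs; num_bufs]
}

/// Proposed policy (sketch): split the same budget in proportion to each
/// buffer's tracked usage, rounding every share down to `align` bytes.
pub fn proportional_sizes(
    ub_size: usize,
    reserved: usize,
    usage: &[usize],
    align: usize,
) -> Vec<usize> {
    assert!(align > 0, "alignment must be non-zero");
    let budget = ub_size.saturating_sub(reserved);
    let total: usize = usage.iter().sum();
    if total == 0 {
        return vec![0; usage.len()];
    }
    usage
        .iter()
        .map(|&u| {
            // u128 avoids overflow in the intermediate product.
            let share = (budget as u128 * u as u128 / total as u128) as usize;
            share / align * align
        })
        .collect()
}
```

For example, with a hypothetical 192 KiB UB and 8 KiB reserved, two buffers with a 3:1 usage ratio would get 141312 and 47104 bytes instead of 94208 each.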


Scalar math vectorization workaround (MEDIUM)

Affected kernels: 1/998

Problem: Scalar log/exp/sqrt operations are vectorized via a 1KB scratch buffer because the scalar pipe hangs on some NPU models. This adds DMA and buffer overhead to each scalar math op.

Fix: Use scalar pipe on models that support it; on others, amortize by batching scalar ops.


I.3 Optimization Opportunities

Barrier elision opportunity (MEDIUM)

Applicable kernels: 998/998

Description: Consecutive vector ops on DIFFERENT buffers don’t need barriers between them. The current codegen inserts barriers whenever dirty_bufs overlap, but many ops are independent.

Implementation: Implement per-buffer dirty tracking at the MLIR level. Only insert barrier when a read-after-write hazard exists on the SAME buffer.


Loop unrolling candidate (LOW)

Applicable kernels: 998/998

Description: Small fixed-iteration loops (e.g., softmax’s 2-pass reduce) could be unrolled. The current codegen emits generic while(true) loops.

Implementation: Detect loops with known small trip counts and unroll.
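
One way such a decision could look in the emitter is sketched below. `TripCount`, `emit_loop`, and the `max_unroll` threshold are hypothetical names, and the loop body and exit check are passed in as already-generated C++ text.

```rust
/// Whether the trip count of a loop is known at codegen time (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TripCount {
    Known(u32),
    Unknown,
}

/// Emit either a fully unrolled body or the generic `while (true)` form.
/// `exit_check` is only needed by the generic form.
pub fn emit_loop(trip: TripCount, body: &str, exit_check: &str, max_unroll: u32) -> String {
    match trip {
        // Small, known trip count: repeat the body and drop the exit check.
        TripCount::Known(n) if n <= max_unroll => (0..n)
            .map(|i| format!("// iteration {}\n{}\n", i, body))
            .collect(),
        // Unknown or large trip count: keep the generic loop.
        _ => format!("while (true) {{\n{}\n{}\n}}\n", exit_check, body),
    }
}
```

A 2-pass reduce would then be emitted as two copies of the body, while loops with unknown or large trip counts keep their current shape.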


Operation fusion candidate (MEDIUM)

Applicable kernels: 0/998

Description: Sequential vector ops on the same buffer (e.g., Sub→Exp or Div→Cast) could be fused into a single vector instruction or at least share a barrier. Current codegen treats each as independent.

Implementation: Detect chains of unary/binary ops on the same buffer and fuse into composite AscendC instructions.


I.4 Generalizable Optimization Plan

Based on the pattern analysis, three optimizations are estimated to close roughly 80% of the performance gap (the sum of the per-priority estimates in the headings of this section):

Priority 1: TQue Migration (closes ~50% of gap)

Replace TBuf<VECCALC> with TQue<VECIN/VECOUT> in the MLIR→C++ codegen. This eliminates PIPE_ALL barriers in favor of hardware flag-based sync, and enables double-buffering for DMA/compute overlap.

Affected files: crates/rustc_codegen_mlir/src/mlir_to_cpp.rs

Changes required:

  1. Change buffer declarations from TBuf<TPosition::VECCALC> to TQue<QuePosition::VECIN> / TQue<QuePosition::VECOUT>
  2. Replace tbuf.Get<T>() with inQueue.AllocTensor<T>() / inQueue.DeQue<T>()
  3. Add inQueue.EnQue(tensor) / outQueue.FreeTensor(tensor) lifecycle
  4. Replace pipe_barrier(PIPE_ALL) with implicit TQue sync
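
The four changes above could be sketched as a small string emitter. This is an illustration only and does not reflect the actual contents of mlir_to_cpp.rs: the function names are hypothetical, as are the emitted identifiers `xGm`, `yGm`, `offset`, and `len`.

```rust
/// Emit queue declarations in place of TBuf declarations (step 1).
pub fn emit_queue_decls(depth: u32) -> String {
    format!(
        "TQue<QuePosition::VECIN, {d}> inQueue;\nTQue<QuePosition::VECOUT, {d}> outQueue;\n",
        d = depth
    )
}

/// Emit one tile's load/compute/store using the queue lifecycle (steps 2-4).
/// `ty` is the element type and `vec_op` a unary AscendC vector op name.
pub fn emit_tile_body(ty: &str, vec_op: &str) -> String {
    let lines = vec![
        // Copy-in stage: allocate, DMA from global memory, enqueue.
        format!("LocalTensor<{ty}> xIn = inQueue.AllocTensor<{ty}>();", ty = ty),
        "DataCopy(xIn, xGm[offset], len);".to_string(),
        "inQueue.EnQue(xIn);".to_string(),
        // Compute stage: DeQue waits for the load, no pipe_barrier needed.
        format!("LocalTensor<{ty}> x = inQueue.DeQue<{ty}>();", ty = ty),
        format!("LocalTensor<{ty}> y = outQueue.AllocTensor<{ty}>();", ty = ty),
        format!("{op}(y, x, len);", op = vec_op),
        "outQueue.EnQue(y);".to_string(),
        "inQueue.FreeTensor(x);".to_string(),
        // Copy-out stage: DeQue waits for the compute, then DMA and free.
        format!("LocalTensor<{ty}> yOut = outQueue.DeQue<{ty}>();", ty = ty),
        "DataCopy(yGm[offset], yOut, len);".to_string(),
        "outQueue.FreeTensor(yOut);".to_string(),
    ];
    lines.join("\n")
}
```

Calling `emit_tile_body("half", "Exp")` yields a tile body with no pipe_barrier call; ordering between stages comes from the EnQue/DeQue pairs.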

Priority 2: Barrier Elision (closes ~20% of gap)

Implement per-buffer dirty tracking to eliminate barriers between independent vector operations. Only insert barriers when a read-after-write hazard exists on the SAME buffer.

Current behavior: Every vector op that reads a dirty buffer triggers PIPE_ALL.

Proposed behavior: Track dirty state per buffer. Insert a barrier only when:

  • A DMA load writes buffer B, then a vector op reads buffer B
  • A vector op writes buffer B, then a DMA store reads buffer B

Otherwise skip the barrier, for example between Add(buf0, buf1, buf2) and Mul(buf3, buf0, buf4) when buf0 is not dirty.
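
A minimal sketch of per-buffer dirty tracking follows. The `Op` struct and `barrier_positions` are hypothetical and only model read-after-write hazards, as described above; it assumes a barrier clears all dirty state and that buffers are identified by integer ids.

```rust
use std::collections::HashSet;

/// One emitted operation, reduced to the buffers it reads and writes.
#[derive(Debug, Clone)]
pub struct Op {
    pub name: &'static str,
    pub reads: Vec<u32>,
    pub writes: Vec<u32>,
}

/// Return the indices of ops that need a barrier inserted before them.
pub fn barrier_positions(ops: &[Op]) -> Vec<usize> {
    // Buffers written since the last barrier.
    let mut dirty: HashSet<u32> = HashSet::new();
    let mut barriers = Vec::new();
    for (i, op) in ops.iter().enumerate() {
        // Read-after-write on the same buffer: synchronize first.
        if op.reads.iter().any(|b| dirty.contains(b)) {
            barriers.push(i);
            dirty.clear(); // assumed: the barrier makes all prior writes visible
        }
        dirty.extend(op.writes.iter().copied());
    }
    barriers
}
```

For a sequence of two loads, an Add on the loaded buffers, an unrelated Mul, and a store of the Add result, this yields barriers only before the Add and before the store.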

Priority 3: Operation Fusion (closes ~10% of gap)

Fuse sequential vector ops on the same buffer into compound operations:

  • Sub(buf, x, max) followed by Exp(buf, buf) becomes a single AscendC call with Sub+Exp
  • Muls(buf, buf, scale) followed by Adds(buf, buf, bias) becomes a MulAdd composite
  • Intermediate barriers between fused ops are eliminated
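
Detecting such chains could look like the sketch below. `VecOp` and `fusion_groups` are hypothetical names. The rule used here, that an op joins the current group when it writes the same buffer as the previous op and also reads it, is an assumption drawn from the examples above.

```rust
/// A vector op with AscendC-style operand order: destination first.
#[derive(Debug, Clone, PartialEq)]
pub struct VecOp {
    pub name: &'static str,
    pub dst: u32,
    pub srcs: Vec<u32>,
}

/// Group consecutive ops that update the same buffer in place.
/// Each inner Vec holds indices into `ops`; groups of length > 1 are fusion candidates.
pub fn fusion_groups(ops: &[VecOp]) -> Vec<Vec<usize>> {
    let mut groups: Vec<Vec<usize>> = Vec::new();
    for (i, op) in ops.iter().enumerate() {
        let extends = match groups.last() {
            Some(g) => {
                let prev = &ops[*g.last().unwrap()];
                // Same destination buffer, and this op consumes the previous result.
                prev.dst == op.dst && op.srcs.contains(&prev.dst)
            }
            None => false,
        };
        if extends {
            groups.last_mut().unwrap().push(i);
        } else {
            groups.push(vec![i]);
        }
    }
    groups
}
```

With Sub(buf0, buf1, buf2), Exp(buf0, buf0), and a Cast into a different buffer, the first two ops form one group and the Cast stays on its own.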

I.5 Per-Category Performance Summary

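| Category | Total | Equivalent | Slow 1.2x | Slow 1.5x | Slow 2x+ |
|---|---|---|---|---|---|
| ops_index | 114 | 6 | 0 | 0 | 0 |
| ops_legacy | 200 | 0 | 0 | 0 | 0 |
| ops_math | 120 | 9 | 0 | 0 | 0 |
| ops_nn | 150 | 0 | 0 | 0 | 0 |
| ops_optimizer | 82 | 0 | 0 | 0 | 0 |
| ops_reduce | 80 | 0 | 0 | 0 | 0 |
| ops_resize | 52 | 0 | 0 | 0 | 0 |
| ops_transformer | 200 | 106 | 0 | 0 | 0 |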

I.6 Per-Kernel Pattern Detail

ops_index (114 kernels)

  • foreach_gather_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gather_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gather_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scatter_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scatter_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scatter_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scatter_add_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scatter_add_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scatter_add_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scatter_mul_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scatter_mul_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scatter_mul_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_add_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_add_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_add_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_copy_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_copy_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_copy_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_fill_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_fill_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_fill_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_select_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_select_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_select_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_put_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_put_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_put_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_masked_fill_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_masked_fill_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_masked_fill_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_masked_select_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_masked_select_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_masked_select_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_masked_scatter_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_masked_scatter_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_masked_scatter_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_where_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_where_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_where_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_nonzero_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_nonzero_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_nonzero_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sort_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sort_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sort_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_argsort_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_argsort_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_argsort_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_topk_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_topk_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_topk_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_unique_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_unique_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_unique_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_searchsorted_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_searchsorted_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_searchsorted_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bucketize_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bucketize_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bucketize_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_one_hot_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_one_hot_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_one_hot_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_embedding_bag_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_embedding_bag_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_embedding_bag_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cummax_f32 [EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_cummax_f16 [EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_cummax_int32 [EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_cummin_f32 [EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_cummin_f16 [EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_cummin_int32 [EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_scatter_nd_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scatter_nd_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scatter_nd_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gather_nd_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gather_nd_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gather_nd_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_put_accumulate_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_put_accumulate_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_put_accumulate_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_take_along_axis_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_take_along_axis_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_take_along_axis_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_put_along_axis_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_put_along_axis_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_put_along_axis_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bincount_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bincount_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bincount_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scatter_max_f32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_scatter_max_f16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_scatter_max_int32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_scatter_min_f32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_scatter_min_f16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_scatter_min_int32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_gather_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scatter_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_select_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_where_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sort_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_topk_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_masked_fill_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_masked_select_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sort_int64 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_argsort_int64 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_topk_int64 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_unique_int64 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gather_int8 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scatter_int8 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scatter_add_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scatter_mul_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_add_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_index_copy_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE

ops_legacy (200 kernels)

  • foreach_exp_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_exp_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_exp_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_abs_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_abs_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_abs_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_neg_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_neg_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_neg_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sqrt_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sqrt_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sqrt_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rsqrt_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rsqrt_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rsqrt_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_reciprocal_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_reciprocal_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_reciprocal_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_ln_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_ln_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_ln_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_log2_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_log2_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_log2_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_log10_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_log10_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_log10_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_ceil_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_ceil_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_ceil_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_floor_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_floor_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_floor_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_round_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_round_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_round_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_trunc_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_trunc_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_trunc_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sign_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sign_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sign_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_not_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_not_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_not_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bitwise_not_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bitwise_not_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bitwise_not_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logical_not_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logical_not_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logical_not_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_clamp_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_clamp_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_clamp_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_add_list_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_add_list_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_add_list_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sub_list_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sub_list_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sub_list_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_mul_list_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_mul_list_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_mul_list_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_div_list_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_div_list_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_div_list_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_max_list_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_max_list_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_max_list_bf16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_min_list_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_min_list_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_min_list_bf16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_pow_list_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_pow_list_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_pow_list_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_fmod_list_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_fmod_list_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_fmod_list_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bitwise_and_list_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bitwise_and_list_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bitwise_and_list_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bitwise_or_list_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bitwise_or_list_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bitwise_or_list_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bitwise_xor_list_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bitwise_xor_list_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bitwise_xor_list_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logical_and_list_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logical_and_list_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logical_and_list_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logical_or_list_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logical_or_list_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logical_or_list_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_add_scalar_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_add_scalar_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_add_scalar_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sub_scalar_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sub_scalar_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sub_scalar_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_mul_scalar_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_mul_scalar_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_mul_scalar_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_div_scalar_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_div_scalar_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_div_scalar_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_max_scalar_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_max_scalar_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_max_scalar_bf16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_min_scalar_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_min_scalar_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_min_scalar_bf16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_pow_scalar_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_pow_scalar_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_pow_scalar_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_clamp_scalar_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_clamp_scalar_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_clamp_scalar_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_add_list_alpha_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_add_list_alpha_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_add_list_alpha_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sub_list_alpha_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sub_list_alpha_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sub_list_alpha_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_addcmul_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_addcdiv_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_copy_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_zero_inplace_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lerp_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_addcmul_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_addcdiv_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_copy_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_zero_inplace_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lerp_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_addcmul_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_addcdiv_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_copy_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_zero_inplace_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lerp_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • zeros_like_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • ones_like_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • zeros_like_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • ones_like_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • zeros_like_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • ones_like_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • zeros_like_int32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • ones_like_int32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_abs_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_abs_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_abs_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_relu_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_relu_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_relu_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_gelu_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_gelu_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_gelu_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_silu_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_silu_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_silu_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_neg_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_neg_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_neg_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_sign_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_sign_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_sign_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_ceil_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_ceil_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_ceil_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_floor_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_floor_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise_floor_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise16b_abs_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise16b_abs_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise16b_abs_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise16b_relu_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise16b_relu_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise16b_relu_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise16b_neg_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise16b_neg_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise16b_neg_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise16b_sign_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise16b_sign_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • elementwise16b_sign_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_abs_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_neg_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sign_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bitwise_not_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logical_not_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_clamp_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_add_list_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sub_list_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_mul_list_int32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_max_list_int32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_abs_int8 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_neg_int8 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bitwise_not_int8 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_clamp_int8 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_add_scalar_int32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sub_scalar_int32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_mul_scalar_int32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_div_scalar_int32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE

ops_math (120 kernels)

  • foreach_sin_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sin_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sin_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cos_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cos_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cos_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_tan_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_tan_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_tan_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_asin_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_asin_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_asin_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_acos_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_acos_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_acos_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_atan_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_atan_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_atan_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_atan2_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_atan2_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_atan2_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sinh_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sinh_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sinh_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cosh_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cosh_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cosh_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_tanh_math_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_tanh_math_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_tanh_math_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_asinh_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_asinh_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_asinh_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_acosh_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_acosh_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_acosh_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_atanh_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_atanh_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_atanh_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_erf_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_erf_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_erf_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_erfc_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_erfc_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_erfc_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_erfinv_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_erfinv_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_erfinv_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_expm1_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_expm1_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_expm1_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_log1p_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_log1p_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_log1p_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_softplus_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_softplus_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_softplus_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_digamma_f32 [EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_digamma_f16 [EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_digamma_bf16 [EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lgamma_f32 [EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lgamma_f16 [EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lgamma_bf16 [EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_i0_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_i0_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_i0_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_i1_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_i1_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_i1_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hypot_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hypot_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hypot_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_fma_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_fma_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_fma_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_remainder_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_remainder_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_remainder_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_copysign_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_copysign_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_copysign_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_nextafter_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_nextafter_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_nextafter_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_ldexp_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_ldexp_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_ldexp_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_frexp_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_frexp_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_frexp_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logaddexp_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logaddexp_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logaddexp_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logaddexp2_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logaddexp2_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logaddexp2_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sincos_f32_910b [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sincos_f16_910b [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sincos_bf16_910b [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sincospi_f32_910b [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sincospi_f16_910b [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sincospi_bf16_910b [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_j0_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_j0_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_j0_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_j1_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_j1_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_j1_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_y0_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_y0_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_y0_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_y1_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_y1_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_y1_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_polygamma_f32 [EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_polygamma_f16 [EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_polygamma_bf16 [EQUIVALENT]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_zeta_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_zeta_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_zeta_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
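
Every entry in these lists has the same shape: kernel name, classification in brackets, four counters, then the pattern tags. That makes the lists easy to tally mechanically, for example to check a category's kernel count or to see which classifications carry which patterns. The following is a minimal Python sketch and is not part of the ascend-rs tooling. `ENTRY_RE`, `parse_entry`, and `tally` are hypothetical names, and the line shape is inferred only from the entries listed here.

```python
import re
from collections import Counter

# Hypothetical pattern for one entry line, inferred from the listing above.
# \u2022 is the bullet character and \u2014 is the dash before the tag list.
ENTRY_RE = re.compile(
    r"^\s*\u2022\s*(?P<name>\w+)\s+\[(?P<cls>[^\]]+)\]:\s*"
    r"barriers=(?P<barriers>\d+),\s*bufs=(?P<bufs>\d+),\s*"
    r"vec_ops=(?P<vec_ops>\d+),\s*tiling=(?P<tiling>\w+)\s*\u2014\s*"
    r"(?P<patterns>.+)$"
)


def parse_entry(line):
    """Parse one kernel entry; return None for headings and blank lines."""
    m = ENTRY_RE.match(line)
    if m is None:
        return None
    return {
        "name": m["name"],
        "classification": m["cls"],
        "barriers": int(m["barriers"]),
        "bufs": int(m["bufs"]),
        "vec_ops": int(m["vec_ops"]),
        "tiling": m["tiling"] == "yes",
        "patterns": [p.strip() for p in m["patterns"].split(",")],
    }


def tally(lines):
    """Count entries per classification and per pattern tag."""
    by_class = Counter()
    by_pattern = Counter()
    for line in lines:
        entry = parse_entry(line)
        if entry is None:
            continue  # skip category headings and empty lines
        by_class[entry["classification"]] += 1
        by_pattern.update(entry["patterns"])
    return by_class, by_pattern


if __name__ == "__main__":
    # Two entries copied from the listing, plus a heading that is skipped.
    sample = [
        "ops_math (120 kernels)",
        "  \u2022 foreach_digamma_f32 [EQUIVALENT]: barriers=3, bufs=3, "
        "vec_ops=2, tiling=yes \u2014 S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS",
        "  \u2022 foreach_zeta_f32 [SLOW_1.02X]: barriers=3, bufs=3, "
        "vec_ops=2, tiling=yes \u2014 S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS",
    ]
    by_class, by_pattern = tally(sample)
    print(dict(by_class))
    print(dict(by_pattern))
```

Run over a whole category, the total of `by_class` should match the count in that category's heading. `by_pattern` shows how many kernels carry each `S*` tag.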

ops_nn (150 kernels)

  • foreach_relu_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_relu_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_relu_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_relu6_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_relu6_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_relu6_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_leaky_relu_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_leaky_relu_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_leaky_relu_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_prelu_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_prelu_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_prelu_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_elu_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_elu_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_elu_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_selu_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_selu_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_selu_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gelu_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gelu_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gelu_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_fast_gelu_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_fast_gelu_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_fast_gelu_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sigmoid_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sigmoid_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sigmoid_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hardsigmoid_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hardsigmoid_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hardsigmoid_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hardswish_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hardswish_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hardswish_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hardtanh_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hardtanh_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hardtanh_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_silu_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_silu_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_silu_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_mish_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_mish_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_mish_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_softplus_nn_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_softplus_nn_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_softplus_nn_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_softsign_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_softsign_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_softsign_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_tanh_nn_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_tanh_nn_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_tanh_nn_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_celu_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_celu_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_celu_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_glu_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_glu_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_glu_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rrelu_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rrelu_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rrelu_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_batch_norm_f32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_batch_norm_f16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_batch_norm_bf16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_instance_norm_f32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_instance_norm_f16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_instance_norm_bf16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_layer_norm_f32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_layer_norm_f16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_layer_norm_bf16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_group_norm_f32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_group_norm_f16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_group_norm_bf16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_rms_norm_f32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_rms_norm_f16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_rms_norm_bf16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_softmax_f32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_softmax_f16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_softmax_bf16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_log_softmax_f32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_log_softmax_f16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_log_softmax_bf16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_dropout_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_dropout_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_dropout_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_embedding_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_embedding_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_embedding_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_swish_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_swish_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_swish_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logsigmoid_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logsigmoid_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_logsigmoid_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_tanhshrink_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_tanhshrink_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_tanhshrink_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_softshrink_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_softshrink_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_softshrink_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hardshrink_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hardshrink_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hardshrink_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_threshold_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_threshold_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_threshold_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cross_entropy_loss_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cross_entropy_loss_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cross_entropy_loss_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_mse_loss_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_mse_loss_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_mse_loss_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_l1_loss_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_l1_loss_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_l1_loss_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_smooth_l1_loss_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_smooth_l1_loss_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_smooth_l1_loss_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_nll_loss_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_nll_loss_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_nll_loss_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_avg_pool_2d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_avg_pool_2d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_avg_pool_2d_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_max_pool_2d_f32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_max_pool_2d_f16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_max_pool_2d_bf16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_avg_pool_1d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_avg_pool_1d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_avg_pool_1d_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_max_pool_1d_f32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_max_pool_1d_f16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_max_pool_1d_bf16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_lp_pool_2d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lp_pool_2d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lp_pool_2d_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bce_loss_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bce_loss_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bce_loss_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bce_with_logits_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bce_with_logits_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_bce_with_logits_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hinge_loss_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hinge_loss_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_hinge_loss_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_kl_div_loss_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_kl_div_loss_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_kl_div_loss_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cosine_embedding_loss_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cosine_embedding_loss_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cosine_embedding_loss_bf16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
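
The entries above share one shape: kernel name, classification tag, four counters, and a list of pattern IDs. As a minimal sketch, assuming that shape holds for every entry, the listing can be parsed and the pattern IDs tallied as shown below. The names `ENTRY_RE`, `parse_entry`, and `tally_patterns` are hypothetical helpers for illustration and are not part of the ascend-rs tooling.

```python
import re
from collections import Counter

# Assumed entry shape, inferred from the listing (not a documented format):
#   <bullet> <name> [<CLASS>]: barriers=N, bufs=N, vec_ops=N, tiling=yes|no <em dash> P1, P2, ...
# \u2022 is the bullet and \u2014 is the em dash used as the separator.
ENTRY_RE = re.compile(
    r"^\s*\u2022\s*(?P<name>\S+)\s+\[(?P<cls>[^\]]+)\]:\s*"
    r"barriers=(?P<barriers>\d+),\s*bufs=(?P<bufs>\d+),\s*"
    r"vec_ops=(?P<vec_ops>\d+),\s*tiling=(?P<tiling>\w+)\s*"
    r"\u2014\s*(?P<patterns>.*)$"
)


def parse_entry(line):
    """Parse one listing entry into a dict, or return None if it does not match."""
    m = ENTRY_RE.match(line)
    if m is None:
        return None
    return {
        "name": m["name"],
        "classification": m["cls"],
        "barriers": int(m["barriers"]),
        "bufs": int(m["bufs"]),
        "vec_ops": int(m["vec_ops"]),
        "tiling": m["tiling"] == "yes",
        "patterns": [p.strip() for p in m["patterns"].split(",") if p.strip()],
    }


def tally_patterns(lines):
    """Count how many parsed entries carry each pattern ID; skip non-entry lines."""
    counts = Counter()
    for line in lines:
        entry = parse_entry(line)
        if entry is not None:
            counts.update(entry["patterns"])
    return counts


# Two entries copied from the listing, plus a heading line that should be skipped.
sample = [
    "  \u2022 foreach_cosine_embedding_loss_bf16 [SLOW_1.02X]: barriers=3, bufs=3, "
    "vec_ops=2, tiling=yes \u2014 S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, "
    "S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE",
    "  \u2022 foreach_max_pool_1d_bf16 [SLOW_1.05X]: barriers=4, bufs=2, "
    "vec_ops=1, tiling=yes \u2014 S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, "
    "S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE",
    "ops_optimizer (82 kernels)",
]

counts = tally_patterns(sample)
```

A tally of this kind offers a quick cross-check of the "Affected kernels" counts in I.2 against the per-kernel entries. Lines that do not match the assumed shape, such as category headings, are skipped rather than reported as errors.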

ops_optimizer (82 kernels)

  • foreach_adam_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adam_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adam_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adam_f32_wd [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adamw_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adamw_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adamw_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adamw_f32_wd [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sgd_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sgd_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sgd_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sgd_f32_wd [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sgd_momentum_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sgd_momentum_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sgd_momentum_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sgd_momentum_f32_wd [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adagrad_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adagrad_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adagrad_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adagrad_f32_wd [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adadelta_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adadelta_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adadelta_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adadelta_f32_wd [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rmsprop_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rmsprop_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rmsprop_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rmsprop_f32_wd [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lamb_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lamb_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lamb_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lamb_f32_wd [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lars_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lars_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lars_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lars_f32_wd [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_ftrl_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_ftrl_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_ftrl_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_ftrl_f32_wd [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adam_amsgrad_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adam_amsgrad_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adam_amsgrad_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adamw_amsgrad_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adamw_amsgrad_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adamw_amsgrad_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adam_fused_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adam_fused_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adam_fused_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adamw_fused_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adamw_fused_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adamw_fused_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sgd_nesterov_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sgd_nesterov_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sgd_nesterov_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lion_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lion_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_lion_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adafactor_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adafactor_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adafactor_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sophia_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sophia_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sophia_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_came_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_came_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_came_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_novograd_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_novograd_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_novograd_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_prodigy_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_prodigy_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_prodigy_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_shampoo_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_shampoo_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_shampoo_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adalomo_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adalomo_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adalomo_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_galore_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_galore_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_galore_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE

ops_reduce (80 kernels)

  • foreach_reduce_sum_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_sum_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_sum_int32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_max_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_max_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_max_int32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_min_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_min_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_min_int32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_mean_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_mean_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_mean_int32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_prod_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_prod_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_prod_int32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_any_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_any_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_any_int32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_all_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_all_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_all_int32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_argmax_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_argmax_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_argmax_int32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_argmin_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_argmin_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_argmin_int32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_cumsum_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_cumsum_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_cumsum_int32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_cumprod_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cumprod_f16 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cumprod_int32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_reduce_sum_bf16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_sum_f32_axis0 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_sum_f16_axis0 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_sum_f32_axis1 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_sum_f16_axis1 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_max_bf16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_max_f32_axis0 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_max_f16_axis0 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_max_f32_axis1 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_max_f16_axis1 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_min_bf16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_min_f32_axis0 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_min_f16_axis0 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_min_f32_axis1 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_min_f16_axis1 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_mean_bf16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_mean_f32_axis0 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_mean_f16_axis0 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_mean_f32_axis1 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_mean_f16_axis1 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_prod_bf16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_prod_f32_axis0 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_prod_f16_axis0 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_prod_f32_axis1 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_prod_f16_axis1 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_l1_norm_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_l1_norm_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_l2_norm_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_l2_norm_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_logsumexp_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_logsumexp_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_nansum_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_nansum_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_nanmean_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_nanmean_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_count_nonzero_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_count_nonzero_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_median_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_median_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_var_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_var_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_std_f32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_std_f16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_l1_norm_bf16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_l2_norm_bf16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_logsumexp_bf16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_reduce_nansum_bf16 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
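
Every entry in these listings follows the same layout: kernel name, classification in brackets, four counters, then a comma-separated list of pattern tags. That makes the per-category counts easy to re-derive mechanically. The following is a minimal Python sketch under that assumption. The names `ENTRY_RE`, `parse_entry`, and `tally` are hypothetical helpers written for illustration and are not part of the ascend-rs tooling.

```python
import re
from collections import Counter

# Assumed entry layout, inferred from the listing (not a documented format):
#   bullet, kernel name, [CLASS], "barriers=N, bufs=N, vec_ops=N, tiling=yes",
#   an em dash (U+2014), then a comma-separated list of pattern tags.
ENTRY_RE = re.compile(
    r"^\s*\u2022\s*(?P<name>\w+)\s*\[(?P<cls>[^\]]+)\]:\s*"
    r"barriers=(?P<barriers>\d+),\s*bufs=(?P<bufs>\d+),\s*"
    r"vec_ops=(?P<vec_ops>\d+),\s*tiling=(?P<tiling>\w+)\s*\u2014\s*(?P<tags>.+)$"
)


def parse_entry(line):
    """Parse one kernel entry; return None for headings and blank lines."""
    m = ENTRY_RE.match(line)
    if m is None:
        return None
    return {
        "name": m["name"],
        "class": m["cls"],
        "barriers": int(m["barriers"]),
        "bufs": int(m["bufs"]),
        "vec_ops": int(m["vec_ops"]),
        "tiling": m["tiling"] == "yes",
        "patterns": [tag.strip() for tag in m["tags"].split(",")],
    }


def tally(lines):
    """Count entries per classification and per slowdown pattern tag."""
    by_class, by_pattern = Counter(), Counter()
    for line in lines:
        entry = parse_entry(line)
        if entry is None:
            continue  # category heading or blank line
        by_class[entry["class"]] += 1
        by_pattern.update(entry["patterns"])
    return by_class, by_pattern


# Two entries copied from the ops_reduce listing, plus its heading.
SAMPLE = [
    "ops_reduce (80 kernels)",
    "  \u2022 foreach_cumsum_int32 [SLOW_1.05X]: barriers=4, bufs=3, vec_ops=2, tiling=yes \u2014 "
    "S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, "
    "S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE",
    "  \u2022 foreach_cumprod_f32 [SLOW_1.02X]: barriers=3, bufs=3, vec_ops=2, tiling=yes \u2014 "
    "S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, "
    "S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE",
]

by_class, by_pattern = tally(SAMPLE)
print(dict(by_class))
print(dict(by_pattern))
```

Feeding a whole category block through `tally` gives a quick cross-check of the kernel count in its heading and of how many kernels carry each pattern tag.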

ops_resize (52 kernels)

  • foreach_upsample_nearest_2d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_upsample_nearest_2d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_upsample_nearest_3d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_upsample_nearest_3d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_upsample_bilinear_2d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_upsample_bilinear_2d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_upsample_bilinear_3d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_upsample_bilinear_3d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_upsample_bicubic_2d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_upsample_bicubic_2d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_upsample_trilinear_3d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_upsample_trilinear_3d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_interpolate_nearest_2d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_interpolate_nearest_2d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_interpolate_nearest_3d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_interpolate_nearest_3d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_interpolate_bilinear_2d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_interpolate_bilinear_2d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_interpolate_bilinear_3d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_interpolate_bilinear_3d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_interpolate_bicubic_2d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_interpolate_bicubic_2d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_resize_nearest_2d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_resize_nearest_2d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_resize_bilinear_2d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_resize_bilinear_2d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adaptive_avg_pool_2d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adaptive_avg_pool_2d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adaptive_avg_pool_3d_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adaptive_avg_pool_3d_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_adaptive_max_pool_2d_f32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_adaptive_max_pool_2d_f16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_adaptive_max_pool_3d_f32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_adaptive_max_pool_3d_f16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_upsample_bilinear_2d_align_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_upsample_bilinear_2d_align_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_upsample_bicubic_2d_align_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_upsample_bicubic_2d_align_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_interpolate_bilinear_2d_align_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_interpolate_bilinear_2d_align_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_resize_bilinear_2d_align_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_resize_bilinear_2d_align_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_grid_sample_bilinear_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_grid_sample_bilinear_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_grid_sample_nearest_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_grid_sample_nearest_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_grid_sample_bicubic_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_grid_sample_bicubic_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_pixel_shuffle_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_pixel_unshuffle_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_pixel_shuffle_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_pixel_unshuffle_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE

ops_transformer (200 kernels)

  • foreach_attention_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_attention_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_attention_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_attention_f16_910b [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_attention_f16_310p [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scaled_dot_product_attention_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scaled_dot_product_attention_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scaled_dot_product_attention_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scaled_dot_product_attention_f16_910b [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_scaled_dot_product_attention_f16_310p [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_multi_head_attention_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_multi_head_attention_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_multi_head_attention_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_multi_head_attention_f16_910b [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_multi_head_attention_f16_310p [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_attention_v1_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_attention_v1_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_attention_v1_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_attention_v1_f16_910b [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_attention_v1_f16_310p [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_attention_v2_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_attention_v2_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_attention_v2_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_attention_v2_f16_910b [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_attention_v2_f16_310p [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_attention_v3_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_attention_v3_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_attention_v3_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_attention_v3_f16_910b [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_attention_v3_f16_310p [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_paged_attention_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_paged_attention_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_paged_attention_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_paged_attention_f16_910b [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_paged_attention_f16_310p [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rotary_embedding_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rotary_embedding_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rotary_embedding_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rotary_embedding_f16_910b [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rotary_embedding_f16_310p [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rope_apply_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rope_apply_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rope_apply_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rope_apply_f16_910b [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rope_apply_f16_310p [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_alibi_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_alibi_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_alibi_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_alibi_f16_910b [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_alibi_f16_310p [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_kv_cache_update_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_kv_cache_update_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_kv_cache_update_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_kv_cache_update_f16_910b [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_kv_cache_update_f16_310p [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_beam_search_score_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_beam_search_score_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_beam_search_score_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_beam_search_score_f16_910b [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_beam_search_score_f16_310p [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_matmul_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_matmul_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_matmul_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_matmul_f32_910b [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_matmul_f32_310p [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_matmul_f16_910b [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_matmul_f16_310p [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_batch_matmul_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_batch_matmul_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_batch_matmul_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_batch_matmul_f32_910b [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_batch_matmul_f32_310p [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_batch_matmul_f16_910b [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_batch_matmul_f16_310p [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_linear_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_linear_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_linear_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_linear_f32_910b [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_linear_f32_310p [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_linear_f16_910b [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_linear_f16_310p [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gemm_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gemm_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gemm_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gemm_f32_910b [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gemm_f32_310p [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gemm_f16_910b [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gemm_f16_310p [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gemv_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gemv_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gemv_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gemv_f32_910b [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gemv_f32_310p [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gemv_f16_910b [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_gemv_f16_310p [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_position_encoding_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_position_encoding_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_position_encoding_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_causal_mask_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_causal_mask_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_causal_mask_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cross_attention_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cross_attention_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_cross_attention_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_grouped_query_attention_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_grouped_query_attention_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_grouped_query_attention_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sliding_window_attention_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sliding_window_attention_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sliding_window_attention_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_linear_attention_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_linear_attention_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_linear_attention_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sparse_attention_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sparse_attention_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sparse_attention_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_local_attention_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_local_attention_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_local_attention_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_ring_attention_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_ring_attention_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_ring_attention_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_prefix_attention_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_prefix_attention_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_prefix_attention_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_kv_cache_quantize_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_kv_cache_quantize_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_kv_cache_quantize_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_attention_score_mod_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_attention_score_mod_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_attention_score_mod_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rope_neox_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rope_neox_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rope_neox_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rope_glm_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rope_glm_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rope_glm_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_matmul_quant_int8_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_matmul_quant_int8_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_attention_quant_int8_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_attention_quant_int8_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_linear_quant_int8_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_linear_quant_int8_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_matmul_quant_int4_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_matmul_quant_int4_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_attention_quant_int4_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_attention_quant_int4_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_linear_quant_int4_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_linear_quant_int4_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_multi_query_attention_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_multi_query_attention_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_multi_query_attention_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_decoding_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_decoding_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_flash_decoding_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_speculative_decoding_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_speculative_decoding_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_speculative_decoding_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_token_mixing_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_token_mixing_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_token_mixing_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_channel_mixing_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_channel_mixing_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_channel_mixing_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_moe_gate_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_moe_gate_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_moe_gate_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_moe_dispatch_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_moe_dispatch_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_moe_dispatch_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_moe_combine_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_moe_combine_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_moe_combine_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_swiglu_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_swiglu_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_swiglu_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_geglu_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_geglu_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_geglu_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_reglu_f32 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_reglu_f16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_reglu_bf16 [SLOW_1.02X]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_rmsnorm_linear_f32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_rmsnorm_linear_f16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_rmsnorm_linear_bf16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_prenorm_attention_f32 [EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_prenorm_attention_f16 [EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_prenorm_attention_bf16 [EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_postnorm_attention_f32 [EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_postnorm_attention_f16 [EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_postnorm_attention_bf16 [EQUIVALENT]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_parallel_attention_f32 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_parallel_attention_f16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_parallel_attention_bf16 [EQUIVALENT]: barriers=3, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S5_UNIFORM_BUF_SIZE, S6_HARDCODED_GM_SIZE
  • foreach_sandwich_norm_f32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_sandwich_norm_f16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_sandwich_norm_bf16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_qk_norm_f32 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_qk_norm_f16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE
  • foreach_qk_norm_bf16 [SLOW_1.05X]: barriers=4, bufs=2, vec_ops=1, tiling=yes — S1_TBUF_NOT_TQUE, S2_PIPE_ALL_BARRIERS, S3_NO_DOUBLE_BUFFERING, S4_ALIAS_SCRATCH_COPY, S5_UNIFORM_BUF_SIZE