Level | Task | Rank | Task & Kernel Name | Runtime (ms) | Speedup vs. Native | Speedup vs. Compile |
---|---|---|---|---|---|---|
2 | 23 | 🥇 | 23_Conv3d_GroupNorm_Mean<br>fused_ops_strided_optimized_base | 0.01 | 128.51 | 82.34 |
2 | 13 | 🥇 | 13_ConvTranspose3d_Mean_Add_Softmax_Tanh_Scaling<br>13_convtranspose3d_mean_add_softmax_tanh_scaling_optimized_edit_1 | 0.01 | 84.55 | 66.82 |
1 | 12 | 🥇 | 12_Matmul_with_diagonal_matrices_<br>stride_loop_diag_matmul_base | 0.05 | 54.40 | 55.46 |
2 | 18 | 🥇 | 18_Matmul_Sum_Max_AvgPool_LogSumExp_LogSumExp<br>vectorized_inner_loop_warp_shuffle_base | 0.01 | 14.42 | 2.64 |
1 | 95 | 🥇 | 95_CrossEntropyLoss<br>ce_loss_grid_stride_unroll_base | 0.01 | 8.97 | 2.45 |
2 | 42 | 🥇 | 42_ConvTranspose2d_GlobalAvgPool_BiasAdd_LogSumExp_Sum_Multiply<br>optimized_fused_conv_lse_min_divergence_edit_1 | 0.01 | 8.66 | 7.99 |
1 | 40 | 🥇 | 40_LayerNorm<br>optimized_layernorm_streamed_base | 0.94 | 8.60 | 0.70 |
1 | 97 | 🥇 | 97_CosineSimilarityLoss<br>blocksize_tuning_cosine_loss_base | 0.01 | 7.64 | 5.47 |
3 | 3 | 🥇 | 3_DeepNarrowMLP<br>uniform_control_flow_optimization_edit_1 | 0.05 | 7.46 | 5.98 |
2 | 66 | 🥇 | 66_Matmul_Dropout_Mean_Softmax<br>warp_divergence_optimized_matmul_base | 0.01 | 6.47 | 11.33 |
2 | 45 | 🥇 | 45_Gemm_Sigmoid_Sum_LogSumExp<br>fused_gemm_sigmoid_logsumexp_base | 0.01 | 5.98 | 2.74 |
2 | 95 | 🥇 | 95_Matmul_Add_Swish_Tanh_GELU_Hardtanh<br>warp_level_vec_ldg_opt_edit_1 | 0.01 | 5.91 | 8.85 |
1 | 88 | 🥇 | 88_MinGPTNewGelu<br>optimized_gelu_combined_edit_1 | 0.02 | 5.72 | 2.99 |
3 | 49 | 🥇 | 49_Mamba2ReturnFinalState<br>49_mamba2returnfinalstate_shared_balanced_base | 0.07 | 4.75 | 0.81 |
2 | 9 | 🥇 | 9_Matmul_Subtract_Multiply_ReLU<br>tiled_grid_stride_base | 0.01 | 4.05 | 2.63 |
2 | 40 | 🥇 | 40_Matmul_Scaling_ResidualAdd<br>warp_level_matmul_base | 0.01 | 3.92 | 3.10 |
1 | 99 | 🥇 | 99_TripletMarginLoss<br>vectorized_warp_shfl_reduction_base | 0.01 | 3.87 | 2.62 |
2 | 55 | 🥇 | 55_Matmul_MaxPool_Sum_Scale<br>tiled_matmul_pool_sync_opt_base | 0.01 | 3.73 | 2.23 |
2 | 65 | 🥇 | 65_Conv2d_AvgPool_Sigmoid_Sum<br>block512_conv_pool_sigsum_base | 0.01 | 3.67 | 3.78 |
2 | 29 | 🥇 | 29_Matmul_Mish_Mish<br>optimized_tiled_matmul_mish_base | 0.01 | 3.65 | 9.33 |
2 | 28 | 🥇 | 28_BMM_InstanceNorm_Sum_ResidualAdd_Multiply<br>28_bmm_instancenorm_ldg_aligned_base | 0.01 | 3.50 | 2.39 |
2 | 56 | 🥇 | 56_Matmul_Sigmoid_Sum<br>block_size_tuning_base_base | 0.01 | 3.31 | 2.40 |
1 | 90 | 🥇 | 90_cumprod<br>cumprod_min_sync_base_base | 0.01 | 3.09 | 3.13 |
2 | 68 | 🥇 | 68_Matmul_Min_Subtract<br>stride_loop_optimization_thread_base_base | 0.01 | 2.95 | 1.92 |
3 | 34 | 🥇 | 34_VanillaRNNHidden<br>stride_rnn_hidden_warp_base | 0.01 | 2.91 | 6.68 |
1 | 98 | 🥇 | 98_KLDivLoss<br>optimized_kl_div_cuda_base | 0.01 | 2.83 | 3.20 |
1 | 36 | 🥇 | 36_RMSNorm_<br>36_rmsnorm_even_workload_base | 0.19 | 2.81 | 2.68 |
2 | 99 | 🥇 | 99_Matmul_GELU_Softmax<br>warp_divergence_minimized_kernel_base | 0.01 | 2.78 | 2.26 |
1 | 50 | 🥇 | 50_Product_reduction_over_a_dimension<br>block_size_optimized_reduction_base | 0.01 | 2.69 | 5.41 |
2 | 14 | 🥇 | 14_Gemm_Divide_Sum_Scaling<br>balanced_workload_gemm_base | 0.01 | 2.52 | 6.31 |
1 | 30 | 🥇 | 30_Softsign<br>shared_mem_warp_opt_base_base | 0.01 | 2.47 | 4.72 |
3 | 4 | 🥇 | 4_LeNet5<br>4_LeNet5_fused_even_edit_1 | 0.05 | 2.38 | 1.47 |
3 | 24 | 🥇 | 24_EfficientNetB2<br>efficientnetb2_warp_shuffle_optimization_base_base | 0.63 | 2.37 | 1.63 |
2 | 20 | 🥇 | 20_ConvTranspose3d_Sum_ResidualAdd_Multiply_ResidualAdd<br>coalesced_vectorized_fused_kernel_base | 1.52 | 2.36 | 0.73 |
3 | 2 | 🥇 | 2_ShallowWideMLP<br>min_sync_warp_base | 0.04 | 2.29 | 4.36 |
2 | 97 | 🥇 | 97_Matmul_BatchNorm_BiasAdd_Divide_Swish<br>block_experiment_fused_bn_swish_base | 0.03 | 2.28 | 1.94 |
3 | 18 | 🥇 | 18_SqueezeNet<br>18_squeezenet_shared_memory_reduction_base | 0.52 | 2.26 | 0.59 |
1 | 89 | 🥇 | 89_cumsum<br>hybrid_aligned_cumsum_base | 0.01 | 2.21 | 2.14 |
1 | 74 | 🥇 | 74_conv_transposed_1D_dilated<br>shared_mem_tiled_74_conv_transposed_1D_dilated_edit_1 | 0.01 | 2.20 | 2.79 |
2 | 81 | 🥇 | 81_Gemm_Swish_Divide_Clamp_Tanh_Clamp<br>fused_activation_base | 0.02 | 2.19 | 1.90 |
2 | 22 | 🥇 | 22_Matmul_Scale_ResidualAdd_Clamp_LogSumExp_Mish<br>22_matmul_scale_residualadd_clamp_logsumexp_mish_syncthreads_optimized_base | 0.03 | 2.19 | 1.49 |
1 | 53 | 🥇 | 53_Min_reduction_over_a_dimension<br>min_reduce_fused_warp_base | 0.01 | 2.19 | 3.09 |
1 | 38 | 🥇 | 38_L1Norm_<br>l1norm_shared_memory_optimization_edit_1 | 0.01 | 2.04 | 4.17 |
2 | 33 | 🥇 | 33_Gemm_Scale_BatchNorm<br>fused_scale_bn_coalesced_base | 0.03 | 2.02 | 1.00 |
2 | 91 | 🥇 | 91_ConvTranspose2d_Softmax_BiasAdd_Scaling_Sigmoid<br>optimized_fused_ops_kernel_minimized_warp_divergence_edit_1 | 0.15 | 2.01 | 0.68 |
2 | 82 | 🥇 | 82_Conv2d_Tanh_Scaling_BiasAdd_Max<br>fused_conv_pool_base | 0.03 | 2.00 | 2.05 |
3 | 46 | 🥇 | 46_NetVladWithGhostClusters<br>netvlad_warp_shfl_sync_optimized_base | 0.10 | 1.99 | 0.78 |
1 | 46 | 🥇 | 46_Average_Pooling_3D<br>avgpool3d_combo_edit_1 | 0.29 | 1.96 | 3.44 |
2 | 88 | 🥇 | 88_Gemm_GroupNorm_Swish_Multiply_Swish<br>minimal_sync_88_gemm_groupnorm_swish_base | 0.02 | 1.95 | 1.98 |
1 | 45 | 🥇 | 45_Average_Pooling_2D<br>modular_avg_pool2d_base_base | 0.11 | 1.94 | 3.03 |
2 | 35 | 🥇 | 35_Conv2d_Subtract_HardSwish_MaxPool_Mish<br>manually_unrolled_kernel_base | 0.03 | 1.93 | 2.16 |
3 | 43 | 🥇 | 43_MinGPTCausalAttention<br>coalesced_causal_attention_base_base | 6.68 | 1.93 | 1.51 |
2 | 59 | 🥇 | 59_Matmul_Swish_Scaling<br>59_matmul_swish_scaling_coalesced_base | 0.02 | 1.91 | 1.94 |
1 | 43 | 🥇 | 43_Max_Pooling_3D<br>maxpool3d_unrolled_base_base | 0.25 | 1.91 | 3.59 |
2 | 58 | 🥇 | 58_ConvTranspose3d_LogSumExp_HardSwish_Subtract_Clamp_Max<br>fused_optimized_kernel_base | 4.49 | 1.90 | 0.83 |
1 | 39 | 🥇 | 39_L2Norm_<br>l2norm_strided_optimized_base_base | 0.01 | 1.89 | 5.60 |
1 | 76 | 🥇 | 76_conv_standard_1D_dilated_strided__<br>conv1d_warp_uniform_base_base | 0.01 | 1.88 | 8.37 |
3 | 42 | 🥇 | 42_GRUBidirectionalHidden<br>42_GRUBidirectionalHidden_grid_optimized_base | 51.74 | 1.87 | 1.96 |
2 | 92 | 🥇 | 92_Conv2d_GroupNorm_Tanh_HardSwish_ResidualAdd_LogSumExp<br>optimal_block_size_kernel_base | 0.06 | 1.86 | 1.06 |
1 | 51 | 🥇 | 51_Argmax_over_a_dimension<br>warp_argmax_nosm_edit_1 | 0.01 | 1.85 | 2.54 |
1 | 100 | 🥇 | 100_HingeLoss<br>100_HingeLoss | 0.01 | 1.85 | 1.74 |
2 | 49 | 🥇 | 49_ConvTranspose3d_Softmax_Sigmoid<br>adaptive_block_softmax_sigmoid_base_base | 1.57 | 1.83 | 0.95 |
2 | 31 | 🥇 | 31_Conv2d_Min_Add_Multiply<br>block_size_tuned_conv2d_base_base | 0.03 | 1.83 | 2.42 |
2 | 3 | 🥇 | 3_ConvTranspose3d_Sum_LayerNorm_AvgPool_GELU<br>3_convtranspose3d_sum_layernorm_avgpool_gelu_opt_customnorm_base | 25.20 | 1.81 | 0.40 |
1 | 35 | 🥇 | 35_GroupNorm_<br>pipelined_stream_groupnorm_optimized_base | 0.37 | 1.81 | 1.41 |
2 | 53 | 🥇 | 53_Gemm_Scaling_Hardtanh_GELU<br>modular_functions_base_edit_1 | 0.03 | 1.81 | 1.78 |
3 | 47 | 🥇 | 47_NetVladNoGhostClusters<br>netvlad_fused_streams_edit_1 | 0.07 | 1.79 | 1.14 |
2 | 17 | 🥇 | 17_Conv2d_InstanceNorm_Divide<br>unrolled_fused_conv_instnorm_base_base | 0.04 | 1.79 | 1.62 |
2 | 21 | 🥇 | 21_Conv2d_Add_Scale_Sigmoid_GroupNorm<br>shared_memory_coalesced_access_kernel_base | 0.04 | 1.79 | 1.53 |
2 | 32 | 🥇 | 32_Conv2d_Scaling_Min<br>warp_aligned_conv_scale_min_base | 0.03 | 1.77 | 1.86 |
1 | 48 | 🥇 | 48_Mean_reduction_over_a_dimension<br>evenly_distributed_mean_base | 0.01 | 1.76 | 3.62 |
2 | 87 | 🥇 | 87_Conv2d_Subtract_Subtract_Mish<br>87_conv2d_subtract_subtract_mish_templated_base_base | 0.03 | 1.75 | 2.05 |
2 | 25 | 🥇 | 25_Conv2d_Min_Tanh_Tanh<br>conv_min_tanh_optimized_base | 0.03 | 1.74 | 1.83 |
1 | 52 | 🥇 | 52_Argmin_over_a_dimension<br>52_argmin_tuned_blocks_base_base | 0.01 | 1.73 | 2.40 |
1 | 83 | 🥇 | 83_conv_depthwise_2D_square_input_asymmetric_kernel<br>hybrid_tiled_warp_depthwise_conv_edit_1 | 0.02 | 1.72 | 22.65 |
2 | 70 | 🥇 | 70_Gemm_Sigmoid_Scaling_ResidualAdd<br>optimized_sigmoid_scaling_residual_add_base | 0.03 | 1.71 | 1.74 |
2 | 4 | 🥇 | 4_Conv2d_Mish_Mish<br>conv2d_mish_warp_uniform_base_base | 0.03 | 1.71 | 2.34 |
2 | 52 | 🥇 | 52_Conv2d_Activation_BatchNorm<br>ldg_alignment_fusion_opt_base | 0.06 | 1.70 | 1.14 |
2 | 80 | 🥇 | 80_Gemm_Max_Subtract_GELU<br>warp_aligned_gemm_base_edit_1 | 0.03 | 1.70 | 1.81 |
2 | 16 | 🥇 | 16_ConvTranspose2d_Mish_Add_Hardtanh_Scaling<br>16_ConvTranspose2d_Mish_Add_Hardtanh_Scaling_coalesced_memory_base_edit_1 | 0.13 | 1.69 | 1.06 |
1 | 37 | 🥇 | 37_FrobeniusNorm_<br>modular_frobenius_norm_edit_1 | 0.20 | 1.64 | 2.53 |
2 | 67 | 🥇 | 67_Conv2d_GELU_GlobalAvgPool<br>unrolled_fused_conv_gelu_pool_base | 0.03 | 1.63 | 2.28 |
2 | 2 | 🥇 | 2_ConvTranspose2d_BiasAdd_Clamp_Scaling_Clamp_Divide<br>ldg_128bit_align_opt_base | 0.18 | 1.63 | 0.77 |
2 | 57 | 🥇 | 57_Conv2d_ReLU_HardSwish<br>balanced_workload_conv2d_base | 0.04 | 1.63 | 1.46 |
1 | 96 | 🥇 | 96_HuberLoss<br>sync_optimized_unrolled_reduction_edit_1 | 0.01 | 1.62 | 5.37 |
2 | 51 | 🥇 | 51_Gemm_Subtract_GlobalAvgPool_LogSumExp_GELU_ResidualAdd<br>fused_forward_base | 0.05 | 1.62 | 0.92 |
1 | 25 | 🥇 | 25_Swish<br>25_Swish | 0.01 | 1.56 | 8.89 |
3 | 22 | 🥇 | 22_EfficientNetB0<br>22_EfficientNetB0 | 1.60 | 1.56 | 0.69 |
2 | 94 | 🥇 | 94_Gemm_BiasAdd_Hardtanh_Mish_GroupNorm<br>fused_aligned_ldg_base | 0.03 | 1.56 | 1.55 |
2 | 26 | 🥇 | 26_ConvTranspose3d_Add_HardSwish<br>ldg_smem_vectorized_edit2_edit_1 | 3.32 | 1.53 | 1.00 |
1 | 84 | 🥇 | 84_conv_depthwise_2D_asymmetric_input_square_kernel<br>84_conv_dw2d_unroll_gridstride_shared_kernel_base | 0.01 | 1.52 | 3.22 |
2 | 75 | 🥇 | 75_Gemm_GroupNorm_Min_BiasAdd<br>fused_groupnorm_min_base | 0.02 | 1.51 | 2.10 |
1 | 49 | 🥇 | 49_Max_reduction_over_a_dimension<br>adaptive_max_reduce_base | 0.02 | 1.50 | 2.04 |
2 | 37 | 🥇 | 37_Matmul_Swish_Sum_GroupNorm<br>fused_swish_bias_groupnorm_aligned_edit_1 | 0.03 | 1.50 | 1.53 |
2 | 84 | 🥇 | 84_Gemm_BatchNorm_Scaling_Softmax<br>modular_fused_gemm_bn_softmax_edit_1 | 0.04 | 1.50 | 0.81 |
2 | 93 | 🥇 | 93_ConvTranspose2d_Add_Min_GELU_Multiply<br>warp_optimized_reduction_with_shared_memory_edit_1 | 0.17 | 1.49 | 0.80 |
3 | 19 | 🥇 | 19_MobileNetV1<br>19_MobileNetV1 | 1.13 | 1.49 | 0.79 |
1 | 47 | 🥇 | 47_Sum_reduction_over_a_dimension<br>fully_unrolled_warp_sum_reduction_base | 0.01 | 1.47 | 2.97 |
2 | 62 | 🥇 | 62_Matmul_GroupNorm_LeakyReLU_Sum<br>warp_fused_gn_lrelu_sum_base | 0.02 | 1.46 | 2.24 |
1 | 93 | 🥇 | 93_masked_cumsum<br>shared_memory_optimized_masked_cumsum_base | 0.02 | 1.45 | 0.93 |
1 | 85 | 🥇 | 85_conv_depthwise_2D_asymmetric_input_asymmetric_kernel<br>combined_conv_vectorized_edit_1 | 0.02 | 1.44 | 3.46 |
1 | 42 | 🥇 | 42_Max_Pooling_2D<br>warp_divergence_optimized_unroll_base | 0.02 | 1.43 | 3.04 |
3 | 9 | 🥇 | 9_ResNet18<br>resnet18_aligned_memory_base | 0.68 | 1.43 | 0.58 |
2 | 60 | 🥇 | 60_ConvTranspose3d_Swish_GroupNorm_HardSwish<br>warp_optimized_fused_base_base | 5.40 | 1.42 | 0.84 |
3 | 41 | 🥇 | 41_GRUBirectional<br>41_GRUBirectional | 69.30 | 1.42 | 1.45 |
3 | 15 | 🥇 | 15_DenseNet121<br>optimized_dense_net_base | 4.19 | 1.42 | 0.90 |
1 | 67 | 🥇 | 67_conv_standard_1D<br>aligned_ldg_conv1d_base | 0.01 | 1.41 | 2.75 |
2 | 7 | 🥇 | 7_Conv3d_ReLU_LeakyReLU_GELU_Sigmoid_BiasAdd<br>coalesced_memory_activation_kernel_base_base | 0.76 | 1.41 | 0.69 |
2 | 46 | 🥇 | 46_Conv2d_Subtract_Tanh_Subtract_AvgPool<br>balanced_workload_conv2d_subtract_tanh_avgpool_base | 0.04 | 1.39 | 1.48 |
3 | 50 | 🥇 | 50_ReLUSelfAttention<br>shared_memory_bias_tiling_edit_1 | 3.71 | 1.37 | 0.73 |
3 | 1 | 🥇 | | 0.03 | 1.35 | 6.59 |
2 | 61 | 🥇 | 61_ConvTranspose3d_ReLU_GroupNorm<br>fused_rg_atomic_opt_base_base | 0.18 | 1.34 | 1.14 |
2 | 27 | 🥇 | 27_Conv3d_HardSwish_ReLU_Softmax_Mean<br>ldg_aligned_fused_kernel_base | 0.83 | 1.34 | 0.72 |
3 | 10 | 🥇 | 10_ResNet101<br>resnet101_modular_functions_base_base | 23.20 | 1.33 | 1.33 |
3 | 7 | 🥇 | 7_GoogleNetInceptionV1<br>optimized_thread_block_indexing_edit_1 | 1.70 | 1.31 | 0.71 |
2 | 12 | 🥇 | 12_Gemm_Multiply_LeakyReLU<br>12_gemm_warp_primitives_base | 0.03 | 1.30 | 2.32 |
2 | 48 | 🥇 | 48_Conv3d_Scaling_Tanh_Multiply_Sigmoid<br>optimized_hybrid_conv3d_base | 0.78 | 1.29 | 0.68 |
3 | 23 | 🥇 | 23_EfficientNetB1<br>23_EfficientNetB1 | 1.09 | 1.28 | 0.68 |
2 | 54 | 🥇 | 54_Conv2d_Multiply_LeakyReLU_GELU<br>dynamic_block_size_54conv_base | 0.04 | 1.28 | 1.44 |
2 | 63 | 🥇 | 63_Gemm_ReLU_Divide<br>unrolled_tiled_gemm_base_base | 0.03 | 1.28 | 1.34 |
2 | 64 | 🥇 | 64_Gemm_LogSumExp_LeakyReLU_LeakyReLU_GELU_GELU<br>tiled_fused_optimized_kernel_edit_1 | 0.06 | 1.25 | 0.80 |
2 | 90 | 🥇 | 90_Conv3d_LeakyReLU_Sum_Clamp_GELU<br>aligned_vectorized_ldg_90_conv3d_edit_1 | 0.79 | 1.25 | 0.66 |
2 | 85 | 🥇 | 85_Conv2d_GroupNorm_Scale_MaxPool_Clamp<br>conv2d_gn_scale_pool_clamp_sync_opt_base | 0.06 | 1.24 | 1.15 |
1 | 82 | 🥇 | 82_conv_depthwise_2D_square_input_square_kernel<br>manual_unroll_depthwise_2d_kernel_edit_1 | 0.03 | 1.24 | 2.14 |
3 | 39 | 🥇 | | 27.75 | 1.24 | 1.83 |
1 | 44 | 🥇 | 44_Average_Pooling_1D<br>vectorized_4x_base | 0.01 | 1.24 | 8.08 |
2 | 44 | 🥇 | 44_ConvTranspose2d_Multiply_GlobalAvgPool_GlobalAvgPool_Mean<br>optimized_spatial_reduction_edit_1 | 0.18 | 1.21 | 0.72 |
2 | 74 | 🥇 | 74_ConvTranspose3d_LeakyReLU_Multiply_LeakyReLU_Max<br>74_ConvTranspose3d_LeakyReLU_Multiply_LeakyReLU_Max_fused_edit_1 | 1.21 | 1.21 | 0.70 |
2 | 96 | 🥇 | 96_ConvTranspose3d_Multiply_Max_GlobalAvgPool_Clamp<br>conv_transpose3d_opt_stride_loops_edit_1 | 4.39 | 1.21 | 1.22 |
2 | 8 | 🥇 | 8_Conv3d_Divide_Max_GlobalAvgPool_BiasAdd_Sum<br>fused_stride_loops_base | 0.75 | 1.21 | 0.91 |
3 | 33 | 🥇 | 33_VanillaRNN<br>fused_rnn_i2h_warp_base | 0.02 | 1.21 | 2.67 |
2 | 69 | 🥇 | 69_Conv2d_HardSwish_ReLU<br>fused_hardswish_relu_const_edit_1 | 0.04 | 1.19 | 1.58 |
2 | 11 | 🥇 | 11_ConvTranspose2d_BatchNorm_Tanh_MaxPool_GroupNorm<br>11_convtranspose_bn_fusedtanhm_pool_groupnorm_warp_optimized_base | 0.73 | 1.19 | 0.49 |
1 | 41 | 🥇 | 41_Max_Pooling_1D<br>modular_device_functions_edit_1_base | 0.01 | 1.18 | 5.01 |
2 | 71 | 🥇 | 71_Conv2d_Divide_LeakyReLU<br>aligned_memory_access_base | 0.04 | 1.18 | 1.54 |
2 | 19 | 🥇 | 19_ConvTranspose2d_GELU_GroupNorm<br>opt_convtrans_gelu_gn_even_distribution_base | 0.55 | 1.18 | 0.61 |
3 | 12 | 🥇 | 12_VGG19<br>vgg19_cudnn_optimized_base | 3.15 | 1.17 | 0.70 |
2 | 1 | 🥇 | 1_Conv2D_ReLU_BiasAdd<br>block_size_optimized_base | 0.04 | 1.17 | 1.49 |
1 | 22 | 🥇 | 22_Tanh<br>combined_tanh_kernel_edit_1 | 0.01 | 1.17 | 4.88 |
1 | 29 | 🥇 | 29_Softplus<br>warp_optimized_softplus_base | 0.01 | 1.16 | 4.88 |
2 | 36 | 🥇 | 36_ConvTranspose2d_Min_Sum_GELU_Add<br>warp_reduction_optimized_kernel_base_base | 0.18 | 1.15 | 0.74 |
2 | 89 | 🥇 | 89_ConvTranspose3d_MaxPool_Softmax_Subtract_Swish_Max<br>balanced_thread_block_distribution_base | 5.03 | 1.14 | 0.99 |
1 | 31 | 🥇 | 31_ELU<br>vec_shared_elu_base | 0.01 | 1.14 | 4.80 |
1 | 20 | 🥇 | 20_LeakyReLU<br>shared_leakyrelu_base | 0.01 | 1.13 | 4.85 |
2 | 100 | 🥇 | 100_ConvTranspose3d_Clamp_Min_Divide<br>memory_coalescing_optimization_base | 0.57 | 1.13 | 0.84 |
2 | 6 | 🥇 | 6_Conv3d_Softmax_MaxPool_MaxPool<br>strided_maxpool_base_base | 0.95 | 1.13 | 0.90 |
1 | 26 | 🥇 | 26_GELU_<br>26_gelu_ldg_vec_base | 0.01 | 1.13 | 4.96 |
1 | 28 | 🥇 | 28_HardSigmoid<br>hardsigmoid_shared_optimized_edit_1 | 0.01 | 1.12 | 4.96 |
1 | 21 | 🥇 | 21_Sigmoid<br>optimized_sigmoid_hybrid_edit_1 | 0.01 | 1.11 | 4.82 |
1 | 27 | 🥇 | 27_SELU_<br>27_selu_aligned_ldg_base | 0.01 | 1.10 | 4.96 |
2 | 5 | 🥇 | 5_ConvTranspose2d_Subtract_Tanh<br>optimized_shared_mem_tanh_base | 0.08 | 1.09 | 0.91 |
2 | 47 | 🥇 | 47_Conv3d_Mish_Tanh<br>shared_mem_mish_tanh_base_base | 0.10 | 1.09 | 0.95 |
1 | 72 | 🥇 | 72_conv_transposed_3D_asymmetric_input_asymmetric_kernel___strided_padded_grouped_<br>minimize_warp_divergence_base | 25.36 | 1.08 | 1.09 |
1 | 19 | 🥇 | 19_ReLU<br>19_ReLU | 0.01 | 1.08 | 4.75 |
3 | 44 | 🥇 | 44_MiniGPTBlock<br>block_size_optimized_transformer_base_base | 28.16 | 1.08 | 0.82 |
1 | 24 | 🥇 | 24_LogSoftmax<br>strided_logsoftmax_base_base | 0.01 | 1.07 | 3.72 |
2 | 34 | 🥇 | 34_ConvTranspose3d_LayerNorm_GELU_Scaling<br>balanced_load_kernel_base | 42.49 | 1.07 | 0.21 |
1 | 23 | 🥇 | 23_Softmax<br>reduced_sync_softmax_kernel_edit_1 | 0.01 | 1.07 | 3.43 |
1 | 32 | 🥇 | 32_HardTanh<br>32_hardtanh_aligned_128_opt_edit_1 | 0.01 | 1.06 | 5.07 |
2 | 38 | 🥇 | 38_ConvTranspose3d_AvgPool_Clamp_Softmax_Multiply<br>modular_device_functions_refactor_base | 0.64 | 1.06 | 0.95 |
2 | 43 | 🥇 | 43_Conv3d_Max_LogSumExp_ReLU<br>block_tuned_fused_kernel_base_base | 0.79 | 1.05 | 1.01 |
2 | 72 | 🥇 | 72_ConvTranspose3d_BatchNorm_AvgPool_AvgPool<br>warp_uniform_control_flow_edit_1 | 23.59 | 1.05 | 1.06 |
2 | 78 | 🥇 | 78_ConvTranspose3d_Max_Max_Sum<br>optimized_maxpool_kernel_base | 0.58 | 1.05 | 1.21 |
2 | 98 | 🥇 | 98_Matmul_AvgPool_GELU_Scale_Max<br>fused_pipeline_base | 0.03 | 1.04 | 1.50 |
1 | 94 | 🥇 | 94_MSELoss<br>mse_unrolled_optimized_edit_1 | 0.02 | 1.03 | 2.04 |
1 | 75 | 🥇 | 75_conv_transposed_2D_asymmetric_input_asymmetric_kernel_strided__grouped____padded____dilated__<br>conv_transposed_2d_tiled_shared_bias_base | 6.44 | 1.03 | 1.04 |
2 | 73 | 🥇 | 73_Conv2d_BatchNorm_Scaling<br>73_Conv2d_BatchNorm_Scaling | 0.12 | 1.02 | 1.43 |
1 | 69 | 🥇 | 69_conv_transposed_2D__asymmetric_input__asymmetric_kernel<br>69_conv_transposed_2D__asymmetric_input__asymmetric_kernel | 0.03 | 1.02 | 1.74 |
1 | 64 | 🥇 | 64_conv_transposed_1D<br>modular_64_conv_transposed_1D_base | 0.02 | 1.02 | 2.24 |
3 | 11 | 🥇 | | 3.19 | 1.02 | 0.60 |
3 | 16 | 🥇 | 16_DenseNet201<br>warp_optimized_densenet_op_base | 8.04 | 1.01 | 1.03 |
2 | 41 | 🥇 | 41_Gemm_BatchNorm_GELU_GroupNorm_Mean_ReLU<br>41_Gemm_BatchNorm_GELU_GroupNorm_Mean_ReLU | 0.06 | 1.01 | 0.66 |
1 | 91 | 🥇 | 91_cumsum_reverse<br>reverse_cumsum_block_size_tuning_edit_1 | 0.04 | 1.01 | 0.96 |
2 | 77 | 🥇 | 77_ConvTranspose3d_Scale_BatchNorm_GlobalAvgPool<br>ldg_memory_alignment_optimization_edit_1 | 0.78 | 1.01 | 0.72 |
1 | 57 | 🥇 | 57_conv_transposed_2D__square_input__square_kernel<br>stride_loop_optimized_conv_transpose2d_base | 0.15 | 1.01 | 1.18 |
1 | 58 | 🥇 | 58_conv_transposed_3D__asymmetric_input__asymmetric_kernel<br>58_conv_transposed_3D__asymmetric_input__asymmetric_kernel | 2.30 | 1.01 | 1.03 |
1 | 78 | 🥇 | 78_conv_transposed_2D_asymmetric_input_asymmetric_kernel___padded__<br>conv_trans_tuned_blocks_base_base | 0.33 | 1.00 | 1.14 |
3 | 27 | 🥇 | 27_RegNet<br>27_RegNet | 2.24 | 1.00 | 0.44 |
3 | 45 | 🥇 | 45_UNetSoftmax<br>45_UNetSoftmax | 4.94 | 1.00 | 0.88 |
1 | 62 | 🥇 | 62_conv_standard_2D__square_input__asymmetric_kernel<br>62_conv_standard_2D__square_input__asymmetric_kernel | 0.28 | 1.00 | 1.52 |
2 | 24 | 🥇 | 24_Conv3d_Min_Softmax<br>shared_memory_tiling_optimization_base | 0.77 | 1.00 | 0.69 |
3 | 8 | 🥇 | 8_ResNetBasicBlock<br>8_ResNetBasicBlock | 1.68 | 1.00 | 0.48 |
2 | 50 | 🥇 | 50_ConvTranspose3d_Scaling_AvgPool_BiasAdd_Scaling<br>50_ConvTranspose3d_Scaling_AvgPool_BiasAdd_Scaling | 5.73 | 1.00 | 0.75 |
1 | 8 | 🥇 | 8_Matmul_with_irregular_shapes_<br>cublas_handle_reuse_optimized_edit_1 | 6.22 | 1.00 | 1.02 |
3 | 25 | 🥇 | 25_ShuffleNetUnit<br>25_ShuffleNetUnit | 9.27 | 1.00 | 1.31 |
3 | 14 | 🥇 | 14_DenseNet121DenseBlock<br>14_DenseNet121DenseBlock | 7.15 | 1.00 | 0.86 |
1 | 5 | 🥇 | 5_Matrix_scalar_multiplication<br>vector_no_sync_scalar_mult_edit_1 | 0.18 | 1.00 | 2.21 |
1 | 71 | 🥇 | 71_conv_transposed_2D__asymmetric_input__square_kernel<br>atomic_optimized_transpose_base | 0.27 | 1.00 | 1.16 |
2 | 15 | 🥇 | 15_ConvTranspose3d_BatchNorm_Subtract<br>15_ConvTranspose3d_BatchNorm_Subtract | 2.02 | 1.00 | 0.53 |
3 | 6 | 🥇 | 6_GoogleNetInceptionModule<br>6_GoogleNetInceptionModule | 8.72 | 1.00 | 1.15 |
1 | 2 | 🥇 | 2_Standard_matrix_multiplication_<br>hybrid_matmul_base | 0.43 | 1.00 | 1.08 |
3 | 17 | 🥇 | 17_SqueezeNetFireModule<br>17_SqueezeNetFireModule | 0.84 | 1.00 | 0.93 |
1 | 63 | 🥇 | 63_conv_standard_2D__square_input__square_kernel<br>adaptive_conv2d_cuda_base | 0.23 | 1.00 | 1.68 |
2 | 39 | 🥇 | 39_Gemm_Scale_BatchNorm<br>gemm_scale_batchnorm_warp_divergence_optimized_edit_1 | 0.05 | 1.00 | 0.51 |
1 | 55 | 🥇 | 55_conv_standard_2D__asymmetric_input__square_kernel<br>adaptive_conv2d_blocksize_tuning_base | 0.12 | 1.00 | 1.75 |
3 | 5 | 🥇 | 5_AlexNet<br>5_AlexNet | 0.56 | 0.99 | 0.84 |
3 | 40 | 🥇 | 40_GRUHidden<br>40_GRUHidden | 36.25 | 0.97 | 1.38 |
3 | 13 | 🥇 | 13_DenseNet121TransitionLayer<br>13_DenseNet121TransitionLayer | 0.57 | 0.94 | 0.52 |
1 | 4 | 🥇 | 4_Matrix_vector_multiplication_<br>modular_device_functions_matvec_base | 0.07 | 0.94 | 2.61 |
1 | 79 | 🥇 | 79_conv_transposed_1D_asymmetric_input_square_kernel___padded____strided____dilated__<br>conv_transpose1d_shared_tile_sync_base | 0.02 | 0.93 | 2.61 |
2 | 76 | 🥇 | 76_Gemm_Add_ReLU<br>combined_warp_tile_base | 0.03 | 0.93 | 1.54 |
2 | 86 | 🥇 | 86_Matmul_Divide_GELU<br>block_size_optimized_fused_kernel_base_base | 0.03 | 0.92 | 1.63 |
2 | 30 | 🥇 | 30_Gemm_GroupNorm_Hardtanh<br>warp_divergence_minimization_base | 0.06 | 0.88 | 0.91 |
1 | 86 | 🥇 | 86_conv_depthwise_separable_2D<br>conv_dw_separable_strided_loops_edit_1 | 0.31 | 0.81 | 1.18 |
1 | 9 | 🥇 | 9_Tall_skinny_matrix_multiplication_<br>unrolled_loop_matmul_base | 0.68 | 0.78 | 0.59 |
3 | 36 | 🥇 | 36_LTSMHn<br>optimized_lstm_base | 36.44 | 0.76 | 1.59 |
1 | 14 | 🥇 | 14_Matmul_for_upper_triangular_matrices<br>coalesced_memory_access_upper_triangular_matmul_base | 3.90 | 0.72 | 0.74 |
1 | 7 | 🥇 | 7_Matmul_with_small_K_dimension_<br>modular_matmul_refactored_base | 0.99 | 0.67 | 0.58 |
1 | 13 | 🥇 | 13_Matmul_for_symmetric_matrices<br>vec_ldg_aligned_matmul_128_base_optimized_base | 4.29 | 0.64 | 0.67 |
1 | 33 | 🥇 | 33_BatchNorm<br>adaptive_blocksize_batchnorm_base | 0.90 | 0.62 | 0.39 |
3 | 37 | 🥇 | 37_LTSMCn<br>37_ltsmcn_balanced_workload_base_base | 52.09 | 0.52 | 1.10 |
3 | 35 | 🥇 | 35_LTSM<br>35_lstm_grid_stride_base_base | 72.97 | 0.44 | 0.83 |
1 | 1 | 🥇 | 1_Square_matrix_multiplication_<br>regtile_2x2_optimized_sync_edit_1 | 1.01 | 0.42 | 0.44 |
1 | 80 | 🥇 | 80_conv_standard_2D_square_input_asymmetric_kernel___dilated____padded__<br>warp_divergence_minimized_conv2d_base | 0.88 | 0.31 | 0.48 |
1 | 3 | 🥇 | 3_Batched_matrix_multiplication<br>bmm_tiled_shared_memory_optimized_edit_1 | 0.51 | 0.25 | 0.35 |
1 | 87 | 🥇 | 87_conv_pointwise_2D<br>stride_loop_conv2d_base | 0.45 | 0.25 | 0.77 |
1 | 11 | 🥇 | 11_4D_tensor_matrix_multiplication<br>ldg_optimized_shared_mem_tiled_4d_matrix_mult_base | 94.27 | 0.22 | 0.22 |
1 | 10 | 🥇 | 10_3D_tensor_matrix_multiplication<br>unrolled_tiled_kernel_base | 5.52 | 0.21 | 0.23 |
1 | 18 | 🥇 | 18_Matmul_with_transposed_both<br>optimized_matmul_transpose_base | 1.87 | 0.19 | 0.22 |
1 | 81 | 🥇 | 81_conv_transposed_2D_asymmetric_input_square_kernel___dilated____padded____strided__<br>conv_transpose2d_thread_block_map_edit_1 | 10.93 | 0.16 | 0.17 |
1 | 16 | 🥇 | 16_Matmul_with_transposed_A<br>tiled_double_output_base | 2.29 | 0.15 | 0.17 |
1 | 60 | 🥇 | 60_conv_standard_3D__square_input__asymmetric_kernel<br>pipelined_streams_conv3d_base | 37.87 | 0.14 | 0.14 |
1 | 17 | 🥇 | 17_Matmul_with_transposed_B<br>warp_matmul_optimized_v2_base | 2.53 | 0.14 | 0.16 |
1 | 56 | 🥇 | 56_conv_standard_2D__asymmetric_input__asymmetric_kernel<br>56_conv_unrolled_2d_base | 1.13 | 0.13 | 0.20 |
1 | 6 | 🥇 | 6_Matmul_with_large_K_dimension_<br>double_buffered_matmul_base | 5.11 | 0.07 | 0.11 |
1 | 65 | 🥇 | 65_conv_transposed_2D__square_input__asymmetric_kernel<br>atomic_operations_minimization_base | 3.71 | 0.05 | 0.06 |
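
Two idioms recur throughout the winning kernel names above: grid-stride loops (`grid_stride`, `stride_loop_*`) and warp-shuffle reductions (`warp_shuffle`, `warp_shfl_reduction`). The sketch below is a minimal, hypothetical illustration of both combined in a single sum-reduction kernel; it is not taken from any kernel in the table, and the kernel name and launch parameters are placeholders.

```cuda
// Minimal sketch: grid-stride loop + warp-shuffle reduction.
// Assumptions: blockDim.x is a multiple of 32 (full warps) and
// *out is zero-initialized before launch. Not from any listed kernel.
__global__ void grid_stride_warp_sum(const float* __restrict__ in,
                                     float* out, int n) {
    float local = 0.0f;
    // Grid-stride loop: each thread steps by the total thread count,
    // so one fixed-size launch covers any input length n.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        local += in[i];
    }
    // Warp-shuffle reduction: fold the 32 lane partials of a warp
    // using registers only, with no shared memory or __syncthreads().
    for (int offset = 16; offset > 0; offset >>= 1) {
        local += __shfl_down_sync(0xffffffffu, local, offset);
    }
    // Lane 0 of each warp now holds that warp's partial sum;
    // one atomicAdd per warp accumulates the global total.
    if ((threadIdx.x & 31) == 0) {
        atomicAdd(out, local);
    }
}
```

The appeal of this pattern, and plausibly why it shows up so often among the ranked kernels, is that the grid-stride loop decouples grid size from problem size while the shuffle-based reduction cuts shared-memory traffic and synchronization down to a single atomic per warp.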