The AI CUDA Engineer 👷 Archive

Kernel Leaderboard 🏆

Level | Task | Rank | Task & Kernel Name | Runtime (ms) | Speedup (Native) | Speedup (Compile)
2 | 23 | 🥇 | 23_Conv3d_GroupNorm_Mean fused_ops_strided_optimized_base | 0.01 | 128.51 | 82.34
2 | 13 | 🥇 | 13_ConvTranspose3d_Mean_Add_Softmax_Tanh_Scaling 13_convtranspose3d_mean_add_softmax_tanh_scaling_optimized_edit_1 | 0.01 | 84.55 | 66.82
1 | 12 | 🥇 | 12_Matmul_with_diagonal_matrices_ stride_loop_diag_matmul_base | 0.05 | 54.40 | 55.46
2 | 18 | 🥇 | 18_Matmul_Sum_Max_AvgPool_LogSumExp_LogSumExp vectorized_inner_loop_warp_shuffle_base | 0.01 | 14.42 | 2.64
1 | 95 | 🥇 | 95_CrossEntropyLoss ce_loss_grid_stride_unroll_base | 0.01 | 8.97 | 2.45
2 | 42 | 🥇 | 42_ConvTranspose2d_GlobalAvgPool_BiasAdd_LogSumExp_Sum_Multiply optimized_fused_conv_lse_min_divergence_edit_1 | 0.01 | 8.66 | 7.99
1 | 40 | 🥇 |  | 0.94 | 8.60 | 0.70
1 | 97 | 🥇 | 97_CosineSimilarityLoss blocksize_tuning_cosine_loss_base | 0.01 | 7.64 | 5.47
3 | 3 | 🥇 |  | 0.05 | 7.46 | 5.98
2 | 66 | 🥇 | 66_Matmul_Dropout_Mean_Softmax warp_divergence_optimized_matmul_base | 0.01 | 6.47 | 11.33
2 | 45 | 🥇 | 45_Gemm_Sigmoid_Sum_LogSumExp fused_gemm_sigmoid_logsumexp_base | 0.01 | 5.98 | 2.74
2 | 95 | 🥇 | 95_Matmul_Add_Swish_Tanh_GELU_Hardtanh warp_level_vec_ldg_opt_edit_1 | 0.01 | 5.91 | 8.85
1 | 88 | 🥇 |  | 0.02 | 5.72 | 2.99
3 | 49 | 🥇 |  | 0.07 | 4.75 | 0.81
2 | 9 | 🥇 | 9_Matmul_Subtract_Multiply_ReLU tiled_grid_stride_base | 0.01 | 4.05 | 2.63
2 | 40 | 🥇 | 40_Matmul_Scaling_ResidualAdd warp_level_matmul_base | 0.01 | 3.92 | 3.10
1 | 99 | 🥇 |  | 0.01 | 3.87 | 2.62
2 | 55 | 🥇 | 55_Matmul_MaxPool_Sum_Scale tiled_matmul_pool_sync_opt_base | 0.01 | 3.73 | 2.23
2 | 65 | 🥇 | 65_Conv2d_AvgPool_Sigmoid_Sum block512_conv_pool_sigsum_base | 0.01 | 3.67 | 3.78
2 | 29 | 🥇 | 29_Matmul_Mish_Mish optimized_tiled_matmul_mish_base | 0.01 | 3.65 | 9.33
2 | 28 | 🥇 | 28_BMM_InstanceNorm_Sum_ResidualAdd_Multiply 28_bmm_instancenorm_ldg_aligned_base | 0.01 | 3.50 | 2.39
2 | 56 | 🥇 | 56_Matmul_Sigmoid_Sum block_size_tuning_base_base | 0.01 | 3.31 | 2.40
1 | 90 | 🥇 |  | 0.01 | 3.09 | 3.13
2 | 68 | 🥇 |  | 0.01 | 2.95 | 1.92
3 | 34 | 🥇 | 34_VanillaRNNHidden stride_rnn_hidden_warp_base | 0.01 | 2.91 | 6.68
1 | 98 | 🥇 |  | 0.01 | 2.83 | 3.20
1 | 36 | 🥇 |  | 0.19 | 2.81 | 2.68
2 | 99 | 🥇 | 99_Matmul_GELU_Softmax warp_divergence_minimized_kernel_base | 0.01 | 2.78 | 2.26
1 | 50 | 🥇 | 50_Product_reduction_over_a_dimension block_size_optimized_reduction_base | 0.01 | 2.69 | 5.41
2 | 14 | 🥇 | 14_Gemm_Divide_Sum_Scaling balanced_workload_gemm_base | 0.01 | 2.52 | 6.31
1 | 30 | 🥇 |  | 0.01 | 2.47 | 4.72
3 | 4 | 🥇 |  | 0.05 | 2.38 | 1.47
3 | 24 | 🥇 |  | 0.63 | 2.37 | 1.63
2 | 20 | 🥇 | 20_ConvTranspose3d_Sum_ResidualAdd_Multiply_ResidualAdd coalesced_vectorized_fused_kernel_base | 1.52 | 2.36 | 0.73
3 | 2 | 🥇 | 2_ShallowWideMLP min_sync_warp_base | 0.04 | 2.29 | 4.36
2 | 97 | 🥇 | 97_Matmul_BatchNorm_BiasAdd_Divide_Swish block_experiment_fused_bn_swish_base | 0.03 | 2.28 | 1.94
3 | 18 | 🥇 |  | 0.52 | 2.26 | 0.59
1 | 89 | 🥇 |  | 0.01 | 2.21 | 2.14
1 | 74 | 🥇 |  | 0.01 | 2.20 | 2.79
2 | 81 | 🥇 | 81_Gemm_Swish_Divide_Clamp_Tanh_Clamp fused_activation_base | 0.02 | 2.19 | 1.90
2 | 22 | 🥇 | 22_Matmul_Scale_ResidualAdd_Clamp_LogSumExp_Mish 22_matmul_scale_residualadd_clamp_logsumexp_mish_syncthreads_optimized_base | 0.03 | 2.19 | 1.49
1 | 53 | 🥇 | 53_Min_reduction_over_a_dimension min_reduce_fused_warp_base | 0.01 | 2.19 | 3.09
1 | 38 | 🥇 |  | 0.01 | 2.04 | 4.17
2 | 33 | 🥇 | 33_Gemm_Scale_BatchNorm fused_scale_bn_coalesced_base | 0.03 | 2.02 | 1.00
2 | 91 | 🥇 | 91_ConvTranspose2d_Softmax_BiasAdd_Scaling_Sigmoid optimized_fused_ops_kernel_minimized_warp_divergence_edit_1 | 0.15 | 2.01 | 0.68
2 | 82 | 🥇 | 82_Conv2d_Tanh_Scaling_BiasAdd_Max fused_conv_pool_base | 0.03 | 2.00 | 2.05
3 | 46 | 🥇 | 46_NetVladWithGhostClusters netvlad_warp_shfl_sync_optimized_base | 0.10 | 1.99 | 0.78
1 | 46 | 🥇 | 46_Average_Pooling_3D avgpool3d_combo_edit_1 | 0.29 | 1.96 | 3.44
2 | 88 | 🥇 | 88_Gemm_GroupNorm_Swish_Multiply_Swish minimal_sync_88_gemm_groupnorm_swish_base | 0.02 | 1.95 | 1.98
1 | 45 | 🥇 | 45_Average_Pooling_2D modular_avg_pool2d_base_base | 0.11 | 1.94 | 3.03
2 | 35 | 🥇 | 35_Conv2d_Subtract_HardSwish_MaxPool_Mish manually_unrolled_kernel_base | 0.03 | 1.93 | 2.16
3 | 43 | 🥇 | 43_MinGPTCausalAttention coalesced_causal_attention_base_base | 6.68 | 1.93 | 1.51
2 | 59 | 🥇 | 59_Matmul_Swish_Scaling 59_matmul_swish_scaling_coalesced_base | 0.02 | 1.91 | 1.94
1 | 43 | 🥇 | 43_Max_Pooling_3D maxpool3d_unrolled_base_base | 0.25 | 1.91 | 3.59
2 | 58 | 🥇 | 58_ConvTranspose3d_LogSumExp_HardSwish_Subtract_Clamp_Max fused_optimized_kernel_base | 4.49 | 1.90 | 0.83
1 | 39 | 🥇 |  | 0.01 | 1.89 | 5.60
1 | 76 | 🥇 | 76_conv_standard_1D_dilated_strided__ conv1d_warp_uniform_base_base | 0.01 | 1.88 | 8.37
3 | 42 | 🥇 |  | 51.74 | 1.87 | 1.96
2 | 92 | 🥇 | 92_Conv2d_GroupNorm_Tanh_HardSwish_ResidualAdd_LogSumExp optimal_block_size_kernel_base | 0.06 | 1.86 | 1.06
1 | 51 | 🥇 | 51_Argmax_over_a_dimension warp_argmax_nosm_edit_1 | 0.01 | 1.85 | 2.54
1 | 100 | 🥇 | 100_HingeLoss 100_HingeLoss | 0.01 | 1.85 | 1.74
2 | 49 | 🥇 | 49_ConvTranspose3d_Softmax_Sigmoid adaptive_block_softmax_sigmoid_base_base | 1.57 | 1.83 | 0.95
2 | 31 | 🥇 | 31_Conv2d_Min_Add_Multiply block_size_tuned_conv2d_base_base | 0.03 | 1.83 | 2.42
2 | 3 | 🥇 | 3_ConvTranspose3d_Sum_LayerNorm_AvgPool_GELU 3_convtranspose3d_sum_layernorm_avgpool_gelu_opt_customnorm_base | 25.20 | 1.81 | 0.40
1 | 35 | 🥇 |  | 0.37 | 1.81 | 1.41
2 | 53 | 🥇 | 53_Gemm_Scaling_Hardtanh_GELU modular_functions_base_edit_1 | 0.03 | 1.81 | 1.78
3 | 47 | 🥇 | 47_NetVladNoGhostClusters netvlad_fused_streams_edit_1 | 0.07 | 1.79 | 1.14
2 | 17 | 🥇 | 17_Conv2d_InstanceNorm_Divide unrolled_fused_conv_instnorm_base_base | 0.04 | 1.79 | 1.62
2 | 21 | 🥇 | 21_Conv2d_Add_Scale_Sigmoid_GroupNorm shared_memory_coalesced_access_kernel_base | 0.04 | 1.79 | 1.53
2 | 32 | 🥇 | 32_Conv2d_Scaling_Min warp_aligned_conv_scale_min_base | 0.03 | 1.77 | 1.86
1 | 48 | 🥇 | 48_Mean_reduction_over_a_dimension evenly_distributed_mean_base | 0.01 | 1.76 | 3.62
2 | 87 | 🥇 | 87_Conv2d_Subtract_Subtract_Mish 87_conv2d_subtract_subtract_mish_templated_base_base | 0.03 | 1.75 | 2.05
2 | 25 | 🥇 | 25_Conv2d_Min_Tanh_Tanh conv_min_tanh_optimized_base | 0.03 | 1.74 | 1.83
1 | 52 | 🥇 | 52_Argmin_over_a_dimension 52_argmin_tuned_blocks_base_base | 0.01 | 1.73 | 2.40
1 | 83 | 🥇 | 83_conv_depthwise_2D_square_input_asymmetric_kernel hybrid_tiled_warp_depthwise_conv_edit_1 | 0.02 | 1.72 | 22.65
2 | 70 | 🥇 | 70_Gemm_Sigmoid_Scaling_ResidualAdd optimized_sigmoid_scaling_residual_add_base | 0.03 | 1.71 | 1.74
2 | 4 | 🥇 |  | 0.03 | 1.71 | 2.34
2 | 52 | 🥇 | 52_Conv2d_Activation_BatchNorm ldg_alignment_fusion_opt_base | 0.06 | 1.70 | 1.14
2 | 80 | 🥇 | 80_Gemm_Max_Subtract_GELU warp_aligned_gemm_base_edit_1 | 0.03 | 1.70 | 1.81
2 | 16 | 🥇 |  | 0.13 | 1.69 | 1.06
1 | 37 | 🥇 | 37_FrobeniusNorm_ modular_frobenius_norm_edit_1 | 0.20 | 1.64 | 2.53
2 | 67 | 🥇 | 67_Conv2d_GELU_GlobalAvgPool unrolled_fused_conv_gelu_pool_base | 0.03 | 1.63 | 2.28
2 | 2 | 🥇 | 2_ConvTranspose2d_BiasAdd_Clamp_Scaling_Clamp_Divide ldg_128bit_align_opt_base | 0.18 | 1.63 | 0.77
2 | 57 | 🥇 | 57_Conv2d_ReLU_HardSwish balanced_workload_conv2d_base | 0.04 | 1.63 | 1.46
1 | 96 | 🥇 |  | 0.01 | 1.62 | 5.37
2 | 51 | 🥇 | 51_Gemm_Subtract_GlobalAvgPool_LogSumExp_GELU_ResidualAdd fused_forward_base | 0.05 | 1.62 | 0.92
1 | 25 | 🥇 | 25_Swish 25_Swish | 0.01 | 1.56 | 8.89
3 | 22 | 🥇 | 22_EfficientNetB0 22_EfficientNetB0 | 1.60 | 1.56 | 0.69
2 | 94 | 🥇 | 94_Gemm_BiasAdd_Hardtanh_Mish_GroupNorm fused_aligned_ldg_base | 0.03 | 1.56 | 1.55
2 | 26 | 🥇 | 26_ConvTranspose3d_Add_HardSwish ldg_smem_vectorized_edit2_edit_1 | 3.32 | 1.53 | 1.00
1 | 84 | 🥇 | 84_conv_depthwise_2D_asymmetric_input_square_kernel 84_conv_dw2d_unroll_gridstride_shared_kernel_base | 0.01 | 1.52 | 3.22
2 | 75 | 🥇 | 75_Gemm_GroupNorm_Min_BiasAdd fused_groupnorm_min_base | 0.02 | 1.51 | 2.10
1 | 49 | 🥇 | 49_Max_reduction_over_a_dimension adaptive_max_reduce_base | 0.02 | 1.50 | 2.04
2 | 37 | 🥇 | 37_Matmul_Swish_Sum_GroupNorm fused_swish_bias_groupnorm_aligned_edit_1 | 0.03 | 1.50 | 1.53
2 | 84 | 🥇 | 84_Gemm_BatchNorm_Scaling_Softmax modular_fused_gemm_bn_softmax_edit_1 | 0.04 | 1.50 | 0.81
2 | 93 | 🥇 | 93_ConvTranspose2d_Add_Min_GELU_Multiply warp_optimized_reduction_with_shared_memory_edit_1 | 0.17 | 1.49 | 0.80
3 | 19 | 🥇 | 19_MobileNetV1 19_MobileNetV1 | 1.13 | 1.49 | 0.79
1 | 47 | 🥇 | 47_Sum_reduction_over_a_dimension fully_unrolled_warp_sum_reduction_base | 0.01 | 1.47 | 2.97
2 | 62 | 🥇 | 62_Matmul_GroupNorm_LeakyReLU_Sum warp_fused_gn_lrelu_sum_base | 0.02 | 1.46 | 2.24
1 | 93 | 🥇 |  | 0.02 | 1.45 | 0.93
1 | 85 | 🥇 | 85_conv_depthwise_2D_asymmetric_input_asymmetric_kernel combined_conv_vectorized_edit_1 | 0.02 | 1.44 | 3.46
1 | 42 | 🥇 |  | 0.02 | 1.43 | 3.04
3 | 9 | 🥇 |  | 0.68 | 1.43 | 0.58
2 | 60 | 🥇 | 60_ConvTranspose3d_Swish_GroupNorm_HardSwish warp_optimized_fused_base_base | 5.40 | 1.42 | 0.84
3 | 41 | 🥇 | 41_GRUBirectional 41_GRUBirectional | 69.30 | 1.42 | 1.45
3 | 15 | 🥇 | 15_DenseNet121 optimized_dense_net_base | 4.19 | 1.42 | 0.90
1 | 67 | 🥇 | 67_conv_standard_1D aligned_ldg_conv1d_base | 0.01 | 1.41 | 2.75
2 | 7 | 🥇 | 7_Conv3d_ReLU_LeakyReLU_GELU_Sigmoid_BiasAdd coalesced_memory_activation_kernel_base_base | 0.76 | 1.41 | 0.69
2 | 46 | 🥇 | 46_Conv2d_Subtract_Tanh_Subtract_AvgPool balanced_workload_conv2d_subtract_tanh_avgpool_base | 0.04 | 1.39 | 1.48
3 | 50 | 🥇 | 50_ReLUSelfAttention shared_memory_bias_tiling_edit_1 | 3.71 | 1.37 | 0.73
3 | 1 | 🥇 |  | 0.03 | 1.35 | 6.59
2 | 61 | 🥇 | 61_ConvTranspose3d_ReLU_GroupNorm fused_rg_atomic_opt_base_base | 0.18 | 1.34 | 1.14
2 | 27 | 🥇 | 27_Conv3d_HardSwish_ReLU_Softmax_Mean ldg_aligned_fused_kernel_base | 0.83 | 1.34 | 0.72
3 | 10 | 🥇 |  | 23.20 | 1.33 | 1.33
3 | 7 | 🥇 |  | 1.70 | 1.31 | 0.71
2 | 12 | 🥇 | 12_Gemm_Multiply_LeakyReLU 12_gemm_warp_primitives_base | 0.03 | 1.30 | 2.32
2 | 48 | 🥇 | 48_Conv3d_Scaling_Tanh_Multiply_Sigmoid optimized_hybrid_conv3d_base | 0.78 | 1.29 | 0.68
3 | 23 | 🥇 | 23_EfficientNetB1 23_EfficientNetB1 | 1.09 | 1.28 | 0.68
2 | 54 | 🥇 | 54_Conv2d_Multiply_LeakyReLU_GELU dynamic_block_size_54conv_base | 0.04 | 1.28 | 1.44
2 | 63 | 🥇 | 63_Gemm_ReLU_Divide unrolled_tiled_gemm_base_base | 0.03 | 1.28 | 1.34
2 | 64 | 🥇 | 64_Gemm_LogSumExp_LeakyReLU_LeakyReLU_GELU_GELU tiled_fused_optimized_kernel_edit_1 | 0.06 | 1.25 | 0.80
2 | 90 | 🥇 | 90_Conv3d_LeakyReLU_Sum_Clamp_GELU aligned_vectorized_ldg_90_conv3d_edit_1 | 0.79 | 1.25 | 0.66
2 | 85 | 🥇 | 85_Conv2d_GroupNorm_Scale_MaxPool_Clamp conv2d_gn_scale_pool_clamp_sync_opt_base | 0.06 | 1.24 | 1.15
1 | 82 | 🥇 | 82_conv_depthwise_2D_square_input_square_kernel manual_unroll_depthwise_2d_kernel_edit_1 | 0.03 | 1.24 | 2.14
3 | 39 | 🥇 |  | 27.75 | 1.24 | 1.83
1 | 44 | 🥇 | 44_Average_Pooling_1D vectorized_4x_base | 0.01 | 1.24 | 8.08
2 | 44 | 🥇 | 44_ConvTranspose2d_Multiply_GlobalAvgPool_GlobalAvgPool_Mean optimized_spatial_reduction_edit_1 | 0.18 | 1.21 | 0.72
2 | 74 | 🥇 | 74_ConvTranspose3d_LeakyReLU_Multiply_LeakyReLU_Max 74_ConvTranspose3d_LeakyReLU_Multiply_LeakyReLU_Max_fused_edit_1 | 1.21 | 1.21 | 0.70
2 | 96 | 🥇 | 96_ConvTranspose3d_Multiply_Max_GlobalAvgPool_Clamp conv_transpose3d_opt_stride_loops_edit_1 | 4.39 | 1.21 | 1.22
2 | 8 | 🥇 | 8_Conv3d_Divide_Max_GlobalAvgPool_BiasAdd_Sum fused_stride_loops_base | 0.75 | 1.21 | 0.91
3 | 33 | 🥇 | 33_VanillaRNN fused_rnn_i2h_warp_base | 0.02 | 1.21 | 2.67
2 | 69 | 🥇 | 69_Conv2d_HardSwish_ReLU fused_hardswish_relu_const_edit_1 | 0.04 | 1.19 | 1.58
2 | 11 | 🥇 | 11_ConvTranspose2d_BatchNorm_Tanh_MaxPool_GroupNorm 11_convtranspose_bn_fusedtanhm_pool_groupnorm_warp_optimized_base | 0.73 | 1.19 | 0.49
1 | 41 | 🥇 |  | 0.01 | 1.18 | 5.01
2 | 71 | 🥇 | 71_Conv2d_Divide_LeakyReLU aligned_memory_access_base | 0.04 | 1.18 | 1.54
2 | 19 | 🥇 | 19_ConvTranspose2d_GELU_GroupNorm opt_convtrans_gelu_gn_even_distribution_base | 0.55 | 1.18 | 0.61
3 | 12 | 🥇 |  | 3.15 | 1.17 | 0.70
2 | 1 | 🥇 | 1_Conv2D_ReLU_BiasAdd block_size_optimized_base | 0.04 | 1.17 | 1.49
1 | 22 | 🥇 |  | 0.01 | 1.17 | 4.88
1 | 29 | 🥇 |  | 0.01 | 1.16 | 4.88
2 | 36 | 🥇 | 36_ConvTranspose2d_Min_Sum_GELU_Add warp_reduction_optimized_kernel_base_base | 0.18 | 1.15 | 0.74
2 | 89 | 🥇 | 89_ConvTranspose3d_MaxPool_Softmax_Subtract_Swish_Max balanced_thread_block_distribution_base | 5.03 | 1.14 | 0.99
1 | 31 | 🥇 |  | 0.01 | 1.14 | 4.80
1 | 20 | 🥇 | 20_LeakyReLU shared_leakyrelu_base | 0.01 | 1.13 | 4.85
2 | 100 | 🥇 | 100_ConvTranspose3d_Clamp_Min_Divide memory_coalescing_optimization_base | 0.57 | 1.13 | 0.84
2 | 6 | 🥇 | 6_Conv3d_Softmax_MaxPool_MaxPool strided_maxpool_base_base | 0.95 | 1.13 | 0.90
1 | 26 | 🥇 |  | 0.01 | 1.13 | 4.96
1 | 28 | 🥇 |  | 0.01 | 1.12 | 4.96
1 | 21 | 🥇 |  | 0.01 | 1.11 | 4.82
1 | 27 | 🥇 |  | 0.01 | 1.10 | 4.96
2 | 5 | 🥇 | 5_ConvTranspose2d_Subtract_Tanh optimized_shared_mem_tanh_base | 0.08 | 1.09 | 0.91
2 | 47 | 🥇 | 47_Conv3d_Mish_Tanh shared_mem_mish_tanh_base_base | 0.10 | 1.09 | 0.95
1 | 72 | 🥇 | 72_conv_transposed_3D_asymmetric_input_asymmetric_kernel___strided_padded_grouped_ minimize_warp_divergence_base | 25.36 | 1.08 | 1.09
1 | 19 | 🥇 | 19_ReLU 19_ReLU | 0.01 | 1.08 | 4.75
3 | 44 | 🥇 |  | 28.16 | 1.08 | 0.82
1 | 24 | 🥇 |  | 0.01 | 1.07 | 3.72
2 | 34 | 🥇 | 34_ConvTranspose3d_LayerNorm_GELU_Scaling balanced_load_kernel_base | 42.49 | 1.07 | 0.21
1 | 23 | 🥇 |  | 0.01 | 1.07 | 3.43
1 | 32 | 🥇 |  | 0.01 | 1.06 | 5.07
2 | 38 | 🥇 | 38_ConvTranspose3d_AvgPool_Clamp_Softmax_Multiply modular_device_functions_refactor_base | 0.64 | 1.06 | 0.95
2 | 43 | 🥇 | 43_Conv3d_Max_LogSumExp_ReLU block_tuned_fused_kernel_base_base | 0.79 | 1.05 | 1.01
2 | 72 | 🥇 | 72_ConvTranspose3d_BatchNorm_AvgPool_AvgPool warp_uniform_control_flow_edit_1 | 23.59 | 1.05 | 1.06
2 | 78 | 🥇 | 78_ConvTranspose3d_Max_Max_Sum optimized_maxpool_kernel_base | 0.58 | 1.05 | 1.21
2 | 98 | 🥇 | 98_Matmul_AvgPool_GELU_Scale_Max fused_pipeline_base | 0.03 | 1.04 | 1.50
1 | 94 | 🥇 |  | 0.02 | 1.03 | 2.04
1 | 75 | 🥇 | 75_conv_transposed_2D_asymmetric_input_asymmetric_kernel_strided__grouped____padded____dilated__ conv_transposed_2d_tiled_shared_bias_base | 6.44 | 1.03 | 1.04
2 | 73 | 🥇 | 73_Conv2d_BatchNorm_Scaling 73_Conv2d_BatchNorm_Scaling | 0.12 | 1.02 | 1.43
1 | 69 | 🥇 | 69_conv_transposed_2D__asymmetric_input__asymmetric_kernel 69_conv_transposed_2D__asymmetric_input__asymmetric_kernel | 0.03 | 1.02 | 1.74
1 | 64 | 🥇 | 64_conv_transposed_1D modular_64_conv_transposed_1D_base | 0.02 | 1.02 | 2.24
3 | 11 | 🥇 |  | 3.19 | 1.02 | 0.60
3 | 16 | 🥇 |  | 8.04 | 1.01 | 1.03
2 | 41 | 🥇 | 41_Gemm_BatchNorm_GELU_GroupNorm_Mean_ReLU 41_Gemm_BatchNorm_GELU_GroupNorm_Mean_ReLU | 0.06 | 1.01 | 0.66
1 | 91 | 🥇 |  | 0.04 | 1.01 | 0.96
2 | 77 | 🥇 | 77_ConvTranspose3d_Scale_BatchNorm_GlobalAvgPool ldg_memory_alignment_optimization_edit_1 | 0.78 | 1.01 | 0.72
1 | 57 | 🥇 | 57_conv_transposed_2D__square_input__square_kernel stride_loop_optimized_conv_transpose2d_base | 0.15 | 1.01 | 1.18
1 | 58 | 🥇 | 58_conv_transposed_3D__asymmetric_input__asymmetric_kernel 58_conv_transposed_3D__asymmetric_input__asymmetric_kernel | 2.30 | 1.01 | 1.03
1 | 78 | 🥇 | 78_conv_transposed_2D_asymmetric_input_asymmetric_kernel___padded__ conv_trans_tuned_blocks_base_base | 0.33 | 1.00 | 1.14
3 | 27 | 🥇 | 27_RegNet 27_RegNet | 2.24 | 1.00 | 0.44
3 | 45 | 🥇 | 45_UNetSoftmax 45_UNetSoftmax | 4.94 | 1.00 | 0.88
1 | 62 | 🥇 | 62_conv_standard_2D__square_input__asymmetric_kernel 62_conv_standard_2D__square_input__asymmetric_kernel | 0.28 | 1.00 | 1.52
2 | 24 | 🥇 |  | 0.77 | 1.00 | 0.69
3 | 8 | 🥇 | 8_ResNetBasicBlock 8_ResNetBasicBlock | 1.68 | 1.00 | 0.48
2 | 50 | 🥇 | 50_ConvTranspose3d_Scaling_AvgPool_BiasAdd_Scaling 50_ConvTranspose3d_Scaling_AvgPool_BiasAdd_Scaling | 5.73 | 1.00 | 0.75
1 | 8 | 🥇 | 8_Matmul_with_irregular_shapes_ cublas_handle_reuse_optimized_edit_1 | 6.22 | 1.00 | 1.02
3 | 25 | 🥇 | 25_ShuffleNetUnit 25_ShuffleNetUnit | 9.27 | 1.00 | 1.31
3 | 14 | 🥇 | 14_DenseNet121DenseBlock 14_DenseNet121DenseBlock | 7.15 | 1.00 | 0.86
1 | 5 | 🥇 | 5_Matrix_scalar_multiplication vector_no_sync_scalar_mult_edit_1 | 0.18 | 1.00 | 2.21
1 | 71 | 🥇 | 71_conv_transposed_2D__asymmetric_input__square_kernel atomic_optimized_transpose_base | 0.27 | 1.00 | 1.16
2 | 15 | 🥇 | 15_ConvTranspose3d_BatchNorm_Subtract 15_ConvTranspose3d_BatchNorm_Subtract | 2.02 | 1.00 | 0.53
3 | 6 | 🥇 | 6_GoogleNetInceptionModule 6_GoogleNetInceptionModule | 8.72 | 1.00 | 1.15
1 | 2 | 🥇 | 2_Standard_matrix_multiplication_ hybrid_matmul_base | 0.43 | 1.00 | 1.08
3 | 17 | 🥇 | 17_SqueezeNetFireModule 17_SqueezeNetFireModule | 0.84 | 1.00 | 0.93
1 | 63 | 🥇 | 63_conv_standard_2D__square_input__square_kernel adaptive_conv2d_cuda_base | 0.23 | 1.00 | 1.68
2 | 39 | 🥇 |  | 0.05 | 1.00 | 0.51
1 | 55 | 🥇 | 55_conv_standard_2D__asymmetric_input__square_kernel adaptive_conv2d_blocksize_tuning_base | 0.12 | 1.00 | 1.75
3 | 5 | 🥇 | 5_AlexNet 5_AlexNet | 0.56 | 0.99 | 0.84
3 | 40 | 🥇 | 40_GRUHidden 40_GRUHidden | 36.25 | 0.97 | 1.38
3 | 13 | 🥇 | 13_DenseNet121TransitionLayer 13_DenseNet121TransitionLayer | 0.57 | 0.94 | 0.52
1 | 4 | 🥇 | 4_Matrix_vector_multiplication_ modular_device_functions_matvec_base | 0.07 | 0.94 | 2.61
1 | 79 | 🥇 | 79_conv_transposed_1D_asymmetric_input_square_kernel___padded____strided____dilated__ conv_transpose1d_shared_tile_sync_base | 0.02 | 0.93 | 2.61
2 | 76 | 🥇 | 76_Gemm_Add_ReLU combined_warp_tile_base | 0.03 | 0.93 | 1.54
2 | 86 | 🥇 |  | 0.03 | 0.92 | 1.63
2 | 30 | 🥇 | 30_Gemm_GroupNorm_Hardtanh warp_divergence_minimization_base | 0.06 | 0.88 | 0.91
1 | 86 | 🥇 | 86_conv_depthwise_separable_2D conv_dw_separable_strided_loops_edit_1 | 0.31 | 0.81 | 1.18
1 | 9 | 🥇 | 9_Tall_skinny_matrix_multiplication_ unrolled_loop_matmul_base | 0.68 | 0.78 | 0.59
3 | 36 | 🥇 |  | 36.44 | 0.76 | 1.59
1 | 14 | 🥇 | 14_Matmul_for_upper_triangular_matrices coalesced_memory_access_upper_triangular_matmul_base | 3.90 | 0.72 | 0.74
1 | 7 | 🥇 | 7_Matmul_with_small_K_dimension_ modular_matmul_refactored_base | 0.99 | 0.67 | 0.58
1 | 13 | 🥇 | 13_Matmul_for_symmetric_matrices vec_ldg_aligned_matmul_128_base_optimized_base | 4.29 | 0.64 | 0.67
1 | 33 | 🥇 |  | 0.90 | 0.62 | 0.39
3 | 37 | 🥇 |  | 52.09 | 0.52 | 1.10
3 | 35 | 🥇 |  | 72.97 | 0.44 | 0.83
1 | 1 | 🥇 | 1_Square_matrix_multiplication_ regtile_2x2_optimized_sync_edit_1 | 1.01 | 0.42 | 0.44
1 | 80 | 🥇 | 80_conv_standard_2D_square_input_asymmetric_kernel___dilated____padded__ warp_divergence_minimized_conv2d_base | 0.88 | 0.31 | 0.48
1 | 3 | 🥇 | 3_Batched_matrix_multiplication bmm_tiled_shared_memory_optimized_edit_1 | 0.51 | 0.25 | 0.35
1 | 87 | 🥇 | 87_conv_pointwise_2D stride_loop_conv2d_base | 0.45 | 0.25 | 0.77
1 | 11 | 🥇 | 11_4D_tensor_matrix_multiplication ldg_optimized_shared_mem_tiled_4d_matrix_mult_base | 94.27 | 0.22 | 0.22
1 | 10 | 🥇 | 10_3D_tensor_matrix_multiplication unrolled_tiled_kernel_base | 5.52 | 0.21 | 0.23
1 | 18 | 🥇 | 18_Matmul_with_transposed_both optimized_matmul_transpose_base | 1.87 | 0.19 | 0.22
1 | 81 | 🥇 | 81_conv_transposed_2D_asymmetric_input_square_kernel___dilated____padded____strided__ conv_transpose2d_thread_block_map_edit_1 | 10.93 | 0.16 | 0.17
1 | 16 | 🥇 | 16_Matmul_with_transposed_A tiled_double_output_base | 2.29 | 0.15 | 0.17
1 | 60 | 🥇 | 60_conv_standard_3D__square_input__asymmetric_kernel pipelined_streams_conv3d_base | 37.87 | 0.14 | 0.14
1 | 17 | 🥇 | 17_Matmul_with_transposed_B warp_matmul_optimized_v2_base | 2.53 | 0.14 | 0.16
1 | 56 | 🥇 | 56_conv_standard_2D__asymmetric_input__asymmetric_kernel 56_conv_unrolled_2d_base | 1.13 | 0.13 | 0.20
1 | 6 | 🥇 | 6_Matmul_with_large_K_dimension_ double_buffered_matmul_base | 5.11 | 0.07 | 0.11
1 | 65 | 🥇 | 65_conv_transposed_2D__square_input__asymmetric_kernel atomic_operations_minimization_base | 3.71 | 0.05 | 0.06