sparse_permute_and_compute on H100s according to the PR. This is the grouped_permute_and_compute part in MegaBlocks. (Looks like this is only good for H100s)
Naive matrix multiplication - 2304 thread blocks on 68 SMs

v0 - 4 thread blocks on 4 SMs

v1 - 4 thread blocks on 4 SMs

v2 - 1 thread block on 1 SM

v3 - 1 thread block on 1 SM
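
To put the launch shapes above in perspective, here is a rough sketch of how much of the GPU each version keeps busy. It assumes all versions ran on the same 68-SM device as the naive kernel, and it uses the simplification that each SM holds one thread block at a time (real kernels can often keep several blocks resident per SM, so the wave count is an upper bound). The helper name occupancy_summary is made up for this note.

```python
import math

# Thread block counts copied from the notes above.
LAUNCHES = {"naive": 2304, "v0": 4, "v1": 4, "v2": 1, "v3": 1}


def occupancy_summary(thread_blocks: int, total_sms: int = 68) -> dict:
    """Estimate SM usage for a launch (hypothetical helper).

    Assumption: one resident thread block per SM, so a launch with more
    blocks than SMs runs in ceil(blocks / SMs) sequential waves.
    """
    sms_used = min(thread_blocks, total_sms)
    waves = math.ceil(thread_blocks / total_sms)
    return {
        "sms_used": sms_used,
        "sm_fraction": sms_used / total_sms,
        "waves": waves,
    }


if __name__ == "__main__":
    for name, blocks in LAUNCHES.items():
        s = occupancy_summary(blocks)
        print(
            f"{name}: {blocks} blocks, {s['sms_used']} SMs busy "
            f"({s['sm_fraction']:.1%} of the device), {s['waves']} wave(s)"
        )
```

Under these assumptions the naive kernel fills all 68 SMs and needs many waves, while v0 through v3 finish in a single wave but leave most of the SMs idle.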