sparse_permute_and_compute
on H100s according to the PR. This is the grouped_permute_and_compute
part in MegaBlocks. (Looks like this is only good for H100s.)

Naive matrix multiplication - 2304 thread blocks on 68 SMs
v0 - 4 thread blocks on 4 SMs
v1 - 4 thread blocks on 4 SMs
v2 - 1 thread block on 1 SM
v3 - 1 thread block on 1 SM
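
To make the block and SM counts above concrete, here is a rough sketch of the arithmetic. The notes give neither the matrix shape nor the tile size, so the 1536 x 1536 output with 32 x 32 tiles is purely a hypothetical choice that happens to yield 2304 blocks, and the helper names are made up for illustration. The "waves" figure assumes one resident block per SM, which is a simplification, since real GPUs can usually keep several blocks resident on an SM.

```python
import math


def grid_blocks(m, n, tile_m, tile_n):
    """Thread blocks launched when each block computes one tile_m x tile_n output tile."""
    return math.ceil(m / tile_m) * math.ceil(n / tile_n)


def occupancy_summary(num_blocks, num_sms):
    """Return (SMs that receive at least one block, waves needed).

    Assumes at most one resident block per SM (a simplification).
    """
    busy_sms = min(num_blocks, num_sms)
    waves = math.ceil(num_blocks / num_sms)
    return busy_sms, waves


# Hypothetical shape: 1536 x 1536 output, 32 x 32 tiles -> 48 * 48 = 2304 blocks.
blocks = grid_blocks(1536, 1536, 32, 32)
print(blocks, occupancy_summary(blocks, 68))  # 2304 (68, 34)

# The v0/v1 and v2/v3 launches leave most of the 68 SMs idle.
print(occupancy_summary(4, 68))  # (4, 1)
print(occupancy_summary(1, 68))  # (1, 1)
```

Under these assumptions the naive kernel covers all 68 SMs and needs about 34 waves, while a launch of 4 blocks or 1 block can occupy at most 4 or 1 SMs, matching the counts in the list.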