CUDA

Intro

[x] https://github.com/cuda-mode/lectures
[ ] https://nanxiao.gitbooks.io/cuda-little-book
[ ] https://github.com/tgale96/grouped_gemm
- According to the PR, 15-25% faster than MegaBlocks' sparse_permute_and_compute on H100s. It corresponds to the grouped_permute_and_compute path in MegaBlocks. (The speedup appears to apply only to H100s.)
[ ] https://github.com/pytorch/pytorch/blob/main/caffe2/utils/math/elementwise.cu
[ ] https://ppc.cs.aalto.fi/ch4/cuda/
- [ ] https://github.com/matiaslindgren/cuda-memory-access-recorder/tree/master/examples

Naive matrix multiplication - 2304 thread blocks on 68 SMs
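
To make the block count above concrete, here is a minimal sketch of how a naive kernel maps one thread to one output element, emulated on the CPU in plain Python rather than run on a GPU. The 768 x 768 matrix and 16 x 16 thread blocks are assumptions chosen only because they yield 2304 blocks; the notes do not record the actual sizes. The names `grid_dim` and `naive_matmul_emulated` are hypothetical.

```python
def grid_dim(n, block):
    """Thread blocks needed along one dimension to cover n elements (ceiling division)."""
    return (n + block - 1) // block


def naive_matmul_emulated(a, b, n, block):
    """CPU emulation of a naive matmul launch: each (block, thread) pair computes one C[row][col].

    Returns the result matrix and the number of thread blocks "launched".
    """
    c = [[0.0] * n for _ in range(n)]
    g = grid_dim(n, block)
    launched = 0
    for by in range(g):              # plays the role of blockIdx.y
        for bx in range(g):          # plays the role of blockIdx.x
            launched += 1
            for ty in range(block):      # plays the role of threadIdx.y
                for tx in range(block):  # plays the role of threadIdx.x
                    row = by * block + ty
                    col = bx * block + tx
                    if row < n and col < n:  # guard threads that fall outside the matrix
                        acc = 0.0
                        for k in range(n):
                            acc += a[row][k] * b[k][col]
                        c[row][col] = acc
    return c, launched


# Assumed sizes (not from the notes): 768 x 768 output, 16 x 16 threads per block.
blocks = grid_dim(768, 16) ** 2
print(blocks)        # 2304
print(blocks / 68)   # average number of blocks each of the 68 SMs has to process
```

With 2304 blocks and only 68 SMs, each SM works through about 34 blocks on average, whereas the v0 to v3 figures below use at most as many blocks as SMs.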

v0 - 4 thread blocks on 4 SMs

v1 - 4 thread blocks on 4 SMs

v2 - 1 thread block on 1 SM

v3 - 1 thread block on 1 SM

Bible

[ ] https://docs.nvidia.com/cuda/

Diffusion

Language Model