CUDA Glossary
References

https://www.mishalaskin.com/posts/tensor_parallel

Megatron





Mixed-precision Training

https://www.youtube.com/watch?v=UvRl4ansfCg

https://arxiv.org/pdf/1710.03740.pdf


Example: Adam optimizer

ZeRO
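- Adam + mixed precision (memory in bytes per model parameter)
  - Parameter: 2 bytes (fp16)
  - Gradient: 2 bytes (fp16)
  - Optimizer state: 4 bytes (fp32) * 3 = 12 bytes
    - Parameter, master copy (fp32)
    - Momentum (fp32)
    - Variance (fp32)
  - Total: 2 + 2 + 12 = 16 bytes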
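
The byte counts in the list above can be checked with a short sketch. This is only an illustration of the arithmetic: the function names are hypothetical, and it counts model state only (fp16 weights and gradients plus fp32 optimizer state), not activations or temporary buffers.

```python
# Bytes per parameter for Adam with mixed-precision training,
# following the breakdown in the list above.
# Function names here are hypothetical, chosen for this sketch.
FP16_BYTES = 2
FP32_BYTES = 4


def adam_mixed_precision_bytes_per_param() -> int:
    weights_fp16 = FP16_BYTES
    grads_fp16 = FP16_BYTES
    # fp32 optimizer state: master parameter copy, momentum, variance
    optimizer_fp32 = 3 * FP32_BYTES
    return weights_fp16 + grads_fp16 + optimizer_fp32


def model_state_gb(num_params: int) -> float:
    """Model-state memory in GB (10^9 bytes); activations are not counted."""
    return num_params * adam_mixed_precision_bytes_per_param() / 1e9


if __name__ == "__main__":
    print(adam_mixed_precision_bytes_per_param())  # 16
    print(model_state_gb(1_000_000_000))           # 16.0
```

For a 1-billion-parameter model this gives 16 GB of model state before any activations are stored, which is the cost ZeRO aims to reduce by partitioning these states across devices.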

https://www.youtube.com/watch?v=By_O0k102PY