CUDA Glossary
References
https://www.mishalaskin.com/posts/tensor_parallel
Megatron
Mixed-precision Training
https://www.youtube.com/watch?v=UvRl4ansfCg
https://arxiv.org/pdf/1710.03740.pdf
Example: Adam optimizer
ZeRO
- Adam + mixed precision: memory per parameter, in bytes
  - Parameter: 2 (fp16)
  - Gradient: 2 (fp16)
  - Optimizer states: 4 (fp32) * 3 = 12
    - Parameter (fp32 master copy)
    - Momentum (fp32)
    - Variance (fp32)
  - Total: 2 + 2 + 12 = 16
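The per-parameter accounting above can be checked with a few lines of arithmetic. This is a minimal sketch, not code from any of the linked sources: the function names and the 1-billion-parameter model size are hypothetical, chosen only for illustration, and activations and temporary buffers are ignored.

```python
# Bytes per parameter for Adam with mixed-precision training,
# following the breakdown in the list above.
BYTES_FP16 = 2
BYTES_FP32 = 4


def adam_mixed_precision_bytes_per_param() -> int:
    """Return the model-state bytes needed per parameter."""
    params_fp16 = BYTES_FP16           # fp16 working copy of the weights
    grads_fp16 = BYTES_FP16            # fp16 gradients
    optimizer_states = 3 * BYTES_FP32  # fp32 parameter, momentum, variance
    return params_fp16 + grads_fp16 + optimizer_states


def model_state_bytes(num_params: int) -> int:
    """Return total model-state bytes for a model with num_params parameters."""
    return num_params * adam_mixed_precision_bytes_per_param()


# Hypothetical model size, for illustration only.
num_params = 1_000_000_000
total = model_state_bytes(num_params)
print(f"{adam_mixed_precision_bytes_per_param()} bytes per parameter")
print(f"{total / 1e9:.0f} GB of model states for {num_params:,} parameters")
```

Under these assumptions, the optimizer states account for 12 of the 16 bytes per parameter.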
https://www.youtube.com/watch?v=By_O0k102PY