Common


https://towardsdatascience.com/ai-accelerators-machine-learning-algorithms-and-their-co-design-and-evolution-2676efd47179


Single precision

Called "single" in contrast to double precision: an IEEE 754 single-precision (binary32) number occupies 32 bits (1 sign bit, 8 exponent bits, 23 fraction bits), half the 64 bits of a double.
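
The bit layout can be inspected with a minimal sketch using only Python's standard `struct` module. The helper names `float32_fields` and `to_float32` are made up for this note and are not part of any library.

```python
import struct


def float32_fields(x: float) -> tuple[int, int, int]:
    """Round x to IEEE 754 binary32 and return (sign, biased exponent, fraction)."""
    # Pack as a big-endian 32-bit float, then reinterpret the same 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF       # 23 bits
    return sign, exponent, fraction


def to_float32(x: float) -> float:
    """Round a Python float (a 64-bit double) to the nearest binary32 value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


print(float32_fields(1.0))     # (0, 127, 0)
print(float32_fields(-2.5))    # (1, 128, 2097152), i.e. -1.25 * 2**1
print(to_float32(0.1) == 0.1)  # False: 0.1 rounds differently in 32 and 64 bits
```

Values such as 0.5 that need few fraction bits survive the round trip unchanged; values such as 0.1 do not.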

associative_scan

https://www.youtube.com/watch?app=desktop&v=OO3o14cINbo
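
A rough pure-Python sketch of what an associative scan computes, assuming the inclusive variant and the Hillis-Steele doubling scheme. The name `inclusive_scan` is hypothetical; this is not the implementation used by any library function called `associative_scan`, only an illustration of why associativity allows the work to be split into a logarithmic number of rounds.

```python
from operator import add
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def inclusive_scan(op: Callable[[T, T], T], xs: Sequence[T]) -> list[T]:
    """Return out with out[i] = xs[0] op xs[1] op ... op xs[i].

    Hillis-Steele scheme: about log2(n) rounds. In each round every element
    combines with the element `offset` positions to its left. Updates within
    a round are independent of each other, which is what a parallel device
    can exploit. `op` must be associative but need not be commutative.
    """
    out = list(xs)
    offset = 1
    while offset < len(out):
        prev = out  # results of the previous round, read-only in this round
        out = [
            prev[i] if i < offset else op(prev[i - offset], prev[i])
            for i in range(len(prev))
        ]
        offset *= 2
    return out


print(inclusive_scan(add, [1, 2, 3, 4, 5]))       # [1, 3, 6, 10, 15]
print(inclusive_scan(add, ["a", "b", "c", "d"]))  # ['a', 'ab', 'abc', 'abcd']
```

The string example shows that the left-to-right order of operands is preserved, so non-commutative operators work as long as they are associative.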

PyTorch



GPU

1. CUDA


https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf



2. NCCL


NVIDIA Collective Communications Library

Multi-GPU and multi-node collective communication primitives, such as all-reduce, all-gather, reduce-scatter, broadcast, and reduce.
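
To make the idea of a collective concrete, here is a pure-Python simulation of a sum all-reduce using the well-known ring scheme (a reduce-scatter phase followed by an all-gather phase). It does not call NCCL and makes no claim about which algorithm NCCL picks in a given setup; the function name `ring_all_reduce` and the "one value per chunk" simplification are assumptions of this sketch.

```python
def ring_all_reduce(buffers: list[list[float]]) -> list[list[float]]:
    """Simulate a sum all-reduce over n ranks, each holding n values (one per chunk)."""
    n = len(buffers)
    assert all(len(b) == n for b in buffers), "sketch assumes one value per chunk"
    data = [list(b) for b in buffers]  # copy so the inputs stay untouched

    # Phase 1, reduce-scatter: at step s, rank r sends chunk (r - s) % n to
    # rank (r + 1) % n, which adds it into its own copy of that chunk.
    for s in range(n - 1):
        # Collect all messages first so every rank sends its pre-step value.
        sent = [(r, (r - s) % n, data[r][(r - s) % n]) for r in range(n)]
        for r, c, value in sent:
            data[(r + 1) % n][c] += value

    # Rank r now holds the complete sum for chunk (r + 1) % n.
    # Phase 2, all-gather: at step s, rank r forwards chunk (r + 1 - s) % n to
    # rank (r + 1) % n, which overwrites its copy with the finished sum.
    for s in range(n - 1):
        sent = [(r, (r + 1 - s) % n, data[r][(r + 1 - s) % n]) for r in range(n)]
        for r, c, value in sent:
            data[(r + 1) % n][c] = value
    return data


ranks = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
print(ring_all_reduce(ranks))  # [[111, 222, 333], [111, 222, 333], [111, 222, 333]]
```

Every rank ends up with the same elementwise sum, and each rank only ever talks to its ring neighbor, which is why this pattern maps well onto point-to-point links between devices.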
