FPGA: Field-Programmable Gate Array
https://towardsdatascience.com/ai-accelerators-machine-learning-algorithms-and-their-co-design-and-evolution-2676efd47179
GPU

Event
Synchronization markers that can be used to monitor the device's progress, accurately measure timing, and synchronize CUDA streams.
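
A minimal timing sketch with CUDA events; `myKernel`, `grid`, `block`, and `d_data` are placeholders, not names from the source:

```cpp
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);              // marker enqueued before the kernel
myKernel<<<grid, block>>>(d_data);   // placeholder kernel launch
cudaEventRecord(stop);               // marker enqueued after the kernel

cudaEventSynchronize(stop);          // wait until the 'stop' marker has been reached
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```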
Stream
A sequence of operations that execute in issue order on the GPU
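
A sketch of the issue-order guarantee; the kernels, launch dimensions, and buffers are placeholders:

```cpp
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// Within s1, kernelA is guaranteed to finish before kernelB starts (issue order).
kernelA<<<grid, block, 0, s1>>>(d_x);
kernelB<<<grid, block, 0, s1>>>(d_x);

// kernelC is in a different stream and may overlap with the work in s1.
kernelC<<<grid, block, 0, s2>>>(d_y);

cudaStreamSynchronize(s1);   // wait only for the work queued on s1
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
```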
CUDA Core
A single CUDA core executes one thread's instructions at a time
The cores of an SM together execute the 32 threads of a warp in parallel
Streaming Multiprocessors (SMs)
A GPU consists of several SMs
Within each SM, threads are grouped into warps (32 threads in NVIDIA GPUs)

https://medium.com/@smallfishbigsea/basic-concepts-in-gpu-computing-3388710e9239
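
An illustrative kernel that derives its warp and lane indices from the thread hierarchy; the kernel and launch configuration are examples, not from the source:

```cpp
#include <cstdio>

__global__ void whoAmI() {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    int warp = threadIdx.x / warpSize;                 // warp index within the block
    int lane = threadIdx.x % warpSize;                 // position within the warp (0..31)
    if (lane == 0)
        printf("block %d, warp %d starts at global thread %d\n",
               blockIdx.x, warp, tid);
}

// e.g. whoAmI<<<2, 128>>>();  -> 2 blocks x 128 threads = 4 warps per block
```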
synchronize (cudaDeviceSynchronize)
Waits for all kernels in all streams on a CUDA device to complete.
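
For example (the error-handling shown is one common pattern; `myKernel` and its arguments are placeholders):

```cpp
myKernel<<<grid, block>>>(d_data);          // launch is asynchronous and returns immediately
cudaError_t err = cudaDeviceSynchronize();  // block the host until all queued GPU work finishes
if (err != cudaSuccess)
    printf("CUDA error: %s\n", cudaGetErrorString(err));
```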
CUDA Concurrency

https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf
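
A sketch of the copy/compute overlap pattern discussed in the webinar, assuming a placeholder kernel `process` and a problem size `N` that divides evenly into chunks and thread blocks; asynchronous copies only overlap when the host buffer is pinned:

```cpp
const int nStreams = 2;
cudaStream_t streams[nStreams];
float *h_buf, *d_buf;
cudaMallocHost((void**)&h_buf, N * sizeof(float));  // pinned host memory, required for true async copies
cudaMalloc((void**)&d_buf, N * sizeof(float));
for (int i = 0; i < nStreams; ++i) cudaStreamCreate(&streams[i]);

int chunk = N / nStreams;
for (int i = 0; i < nStreams; ++i) {
    int offset = i * chunk;
    // copy-in, compute, copy-out of one chunk per stream; chunks in different streams can overlap
    cudaMemcpyAsync(d_buf + offset, h_buf + offset, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[i]);
    process<<<chunk / 256, 256, 0, streams[i]>>>(d_buf + offset, chunk);
    cudaMemcpyAsync(h_buf + offset, d_buf + offset, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[i]);
}
cudaDeviceSynchronize();  // wait for all streams before reading h_buf on the host
```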

Malloc
Memory allocation on the device: cudaMalloc allocates GPU memory, cudaFree releases it
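
A minimal allocate/copy/free sketch; `h_x` (a host array) and `n` are placeholders:

```cpp
float *d_x;
cudaMalloc((void**)&d_x, n * sizeof(float));                       // allocate n floats on the GPU
cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);   // copy host array h_x to the device
// ... launch kernels that use d_x ...
cudaFree(d_x);                                                     // release the device memory
```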
`__global__` is a specifier indicating that the function is a CUDA kernel that can be called from the host (CPU) and executed on the device (GPU).
restrict
Tells the compiler that a pointer is the only reference, through any path, to the object it points to for the lifetime of the pointer, which allows the compiler to generate more optimized code (e.g. loop vectorization, parallelization); in CUDA C++ it is written `__restrict__`
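
A sketch combining the two specifiers in a SAXPY-style kernel (the kernel itself is illustrative, not from the source):

```cpp
__global__ void saxpy(int n, float a,
                      const float* __restrict__ x,
                      float* __restrict__ y) {
    // __global__: callable from the host, runs on the device
    // __restrict__: x and y are promised not to alias, enabling more aggressive optimization
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
```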
blockDim
The dimensions of each block, i.e. the number of threads per block; a thread block may contain up to 1024 threads.
Grid (collection of blocks)
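
Launch-configuration sketch for the saxpy kernel above; 256 threads per block is an arbitrary example choice (within the 1024-thread limit), the grid is sized by ceiling division so all n elements are covered, and `d_x`/`d_y` are device buffers allocated as in the cudaMalloc sketch:

```cpp
int threadsPerBlock = 256;                                         // becomes blockDim.x inside the kernel
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;   // ceiling division -> gridDim.x
saxpy<<<blocksPerGrid, threadsPerBlock>>>(n, 2.0f, d_x, d_y);
```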

https://nyu-cds.github.io/python-gpu/02-cuda/

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html


Warp

https://medium.com/@smallfishbigsea/basic-concepts-in-gpu-computing-3388710e9239
Multi-node communication primitives

Rank: index of the ith GPU (process) in the communicator
AllReduce = All-to-all + reduce: every rank ends up with out[i] = sum_X(inX[i])

https://spcl.inf.ethz.ch/Teaching/2019-dphpc/lectures/lecture11-nns.pdf
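
A hedged single-process, multi-GPU sketch following the NCCL usage pattern from the docs below; `sendbuf`, `recvbuf`, `streams`, and `count` are assumed to be set up elsewhere (one device buffer and one stream per GPU), and error checking is omitted:

```cpp
#include <nccl.h>

int nDev = 4;
int devs[4] = {0, 1, 2, 3};
ncclComm_t comms[4];
ncclCommInitAll(comms, nDev, devs);        // one communicator (rank) per local GPU

// issue the per-rank calls inside a group so they don't block each other
ncclGroupStart();
for (int i = 0; i < nDev; ++i)
    ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);   // every rank ends up with the elementwise sum
ncclGroupEnd();

for (int i = 0; i < nDev; ++i) {           // wait for the collective to finish on each GPU
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
}
for (int i = 0; i < nDev; ++i) ncclCommDestroy(comms[i]);
```

In a true multi-node job each process would instead join the communicator via ncclCommInitRank with a shared ncclUniqueId; the collective calls themselves (Broadcast, Reduce, AllGather, ReduceScatter below follow the same pattern) stay the same.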
Broadcast
out[i] = inRoot[i] on every rank (the root rank's buffer is copied to all ranks)

https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html
Reduce
out[i] = sum_X(inX[i]) (result available on the root rank only)

https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html
AllGather = All-to-all + gather
out[Y * count + i] = inY[i] (count = number of elements contributed by each rank)
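For example, with 2 ranks and count = 2: in0 = [a0, a1], in1 = [b0, b1] → out = [a0, a1, b0, b1] on every rank.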

ReduceScatter = Reduce + Scatter (the reduced result is split into equal portions, one per rank)
outY[i] = sum_X(inX[Y * count + i]), where outY[i] is the output of the Yth rank at index i
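For example, with 2 ranks and count = 2: rank 0 receives out0 = [in0[0]+in1[0], in0[1]+in1[1]] and rank 1 receives out1 = [in0[2]+in1[2], in0[3]+in1[3]].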
