Efficient Transformer
https://www.tensorflow.org/text/tutorials/transformer#define_the_components
The global self-attention layer: each query sees the whole context.
The causal self-attention layer: each query sees only itself and earlier positions.
The cross-attention layer: queries from the decoder sequence attend to the encoder's output (the context).
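A minimal sketch of these three layers, assuming tf.keras.layers.MultiHeadAttention (TF 2.10+ for the use_causal_mask argument); the class names follow the linked tutorial, and num_heads / key_dim are left to the caller.

```python
import tensorflow as tf

class BaseAttention(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(**kwargs)
        self.layernorm = tf.keras.layers.LayerNormalization()
        self.add = tf.keras.layers.Add()

class GlobalSelfAttention(BaseAttention):
    def call(self, x):
        # Every query position attends to every position in x.
        attn_output = self.mha(query=x, value=x, key=x)
        return self.layernorm(self.add([x, attn_output]))

class CausalSelfAttention(BaseAttention):
    def call(self, x):
        # A causal mask blocks attention to future positions.
        attn_output = self.mha(query=x, value=x, key=x, use_causal_mask=True)
        return self.layernorm(self.add([x, attn_output]))

class CrossAttention(BaseAttention):
    def call(self, x, context):
        # Decoder queries attend to the encoder output (context).
        attn_output = self.mha(query=x, key=context, value=context)
        return self.layernorm(self.add([x, attn_output]))
```

For example, GlobalSelfAttention(num_heads=2, key_dim=64) can be called on a tensor of shape (batch, seq_len, d_model) and returns a tensor of the same shape.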
Layer normalization
- Normalizes the activations of each layer
- Subtracts the mean and divides by the standard deviation, computed across the feature dimension (see the sketch below)
- Usually placed right before the activation function
- Why?
  - Mitigates internal covariate shift
  - Avoids the vanishing gradient problem
  - Acts as regularization
  - Leads to faster convergence
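A minimal sketch of this computation, assuming TensorFlow; with the default gamma = 1, beta = 0 initialization, the built-in tf.keras.layers.LayerNormalization produces approximately the same output.

```python
import tensorflow as tf

def layer_norm(x, eps=1e-6):
    # Subtract the mean and divide by the standard deviation,
    # both computed over the feature (last) axis.
    mean = tf.reduce_mean(x, axis=-1, keepdims=True)
    var = tf.math.reduce_variance(x, axis=-1, keepdims=True)
    return (x - mean) / tf.sqrt(var + eps)

x = tf.random.normal([2, 10, 512])                      # (batch, seq_len, d_model)
manual = layer_norm(x)
builtin = tf.keras.layers.LayerNormalization(epsilon=1e-6)(x)
```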
Residual connection
- Adds the sublayer's input to its output, x + Sublayer(x), letting gradients flow directly through the skip connection
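A minimal sketch of the resulting "Add & Norm" pattern around a sublayer, assuming TensorFlow; the feed-forward sublayer and the d_model / dff sizes here are illustrative placeholders.

```python
import tensorflow as tf

class ResidualBlock(tf.keras.layers.Layer):
    def __init__(self, d_model=512, dff=2048):
        super().__init__()
        # Illustrative feed-forward sublayer; any sublayer with matching
        # output width d_model could be wrapped the same way.
        self.sublayer = tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation="relu"),
            tf.keras.layers.Dense(d_model),
        ])
        self.add = tf.keras.layers.Add()
        self.layernorm = tf.keras.layers.LayerNormalization()

    def call(self, x):
        # Skip connection: the input is added to the sublayer output,
        # then the sum is layer-normalized.
        return self.layernorm(self.add([x, self.sublayer(x)]))
```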