Efficient Transformer


https://www.tensorflow.org/text/tutorials/transformer#define_the_components
The global self-attention layer

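A minimal Keras sketch along the lines of the linked tutorial: an unmasked MultiHeadAttention layer where the sequence attends to itself, wrapped with the residual add and layer normalization discussed later in these notes. The num_heads and key_dim values are illustrative, not prescribed.

```python
import tensorflow as tf

class GlobalSelfAttention(tf.keras.layers.Layer):
    """Self-attention with no mask: every position attends to every other position."""

    def __init__(self, num_heads=8, key_dim=64):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads,
                                                      key_dim=key_dim)
        self.add = tf.keras.layers.Add()
        self.layernorm = tf.keras.layers.LayerNormalization()

    def call(self, x):
        # Query, key, and value are all the same sequence (self-attention).
        attn_output = self.mha(query=x, value=x, key=x)
        x = self.add([x, attn_output])   # residual connection
        return self.layernorm(x)         # layer normalization
```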

The causal self-attention layer

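A sketch of the causal variant, assuming TensorFlow 2.10+ (for the use_causal_mask argument). It differs from the global layer only in the look-ahead mask, so each position attends to itself and earlier positions only.

```python
import tensorflow as tf

class CausalSelfAttention(tf.keras.layers.Layer):
    """Self-attention with a causal mask: the decoder cannot look at future tokens."""

    def __init__(self, num_heads=8, key_dim=64):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads,
                                                      key_dim=key_dim)
        self.add = tf.keras.layers.Add()
        self.layernorm = tf.keras.layers.LayerNormalization()

    def call(self, x):
        # use_causal_mask=True applies the lower-triangular (look-ahead) mask.
        attn_output = self.mha(query=x, value=x, key=x, use_causal_mask=True)
        x = self.add([x, attn_output])
        return self.layernorm(x)
```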

The cross-attention layer


Each query from the target sequence attends to the whole context sequence.
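
A sketch of cross-attention following the same pattern as above: the queries come from the target sequence x, while the keys and values come from the encoder's context sequence. The class layout mirrors the linked tutorial; the hyperparameter values are illustrative.

```python
import tensorflow as tf

class CrossAttention(tf.keras.layers.Layer):
    """Queries come from the target sequence; keys and values come from the
    encoder's context sequence, so each query can see the whole context."""

    def __init__(self, num_heads=8, key_dim=64):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads,
                                                      key_dim=key_dim)
        self.add = tf.keras.layers.Add()
        self.layernorm = tf.keras.layers.LayerNormalization()

    def call(self, x, context):
        attn_output = self.mha(query=x, key=context, value=context)
        x = self.add([x, attn_output])
        return self.layernorm(x)
```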

Layer normalization
- Normalizes the activations of each layer
- Subtracts the mean and divides by the standard deviation, computed across the feature dimension (see the sketch after this list)
- Usually placed right before the activation function
- Why?
  - Mitigates internal covariate shift
  - Helps avoid the vanishing gradient problem
  - Acts as a form of regularization
  - Faster convergence
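
A minimal sketch of the computation with an illustrative tensor shape: the manual mean/standard-deviation normalization across the feature axis, next to the built-in tf.keras.layers.LayerNormalization layer, which additionally learns a scale (gamma) and offset (beta).

```python
import tensorflow as tf

x = tf.random.normal([2, 5, 16])  # (batch, sequence, features) -- toy input

# Manual computation: subtract the mean and divide by the standard deviation,
# both taken across the feature axis (per token, not across the batch).
mean = tf.reduce_mean(x, axis=-1, keepdims=True)
variance = tf.math.reduce_variance(x, axis=-1, keepdims=True)
normalized = (x - mean) / tf.sqrt(variance + 1e-6)

# The built-in layer does the same, plus learned gamma and beta parameters.
layernorm = tf.keras.layers.LayerNormalization(epsilon=1e-6)
out = layernorm(x)
```
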
Residual connection
- Allows gradients to flow directly through the skip connection, which makes deep stacks easier to train (see the sketch below)
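
A minimal sketch of the Add & Norm pattern wrapped around each transformer sublayer. The ResidualBlock wrapper and the feed-forward example are illustrative constructions, not the tutorial's exact classes; the sublayer's output dimension must match its input so the addition is valid.

```python
import tensorflow as tf

class ResidualBlock(tf.keras.layers.Layer):
    """Add & Norm wrapper: output = LayerNorm(x + sublayer(x)).
    The identity path lets gradients flow straight back to earlier layers."""

    def __init__(self, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.add = tf.keras.layers.Add()
        self.layernorm = tf.keras.layers.LayerNormalization()

    def call(self, x):
        return self.layernorm(self.add([x, self.sublayer(x)]))

# Example: wrap a position-wise feed-forward network (output dim matches input dim).
ffn = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(64),
])
block = ResidualBlock(ffn)
y = block(tf.random.normal([2, 5, 64]))  # shape preserved: (2, 5, 64)
```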




