Efficient Transformer
https://www.tensorflow.org/text/tutorials/transformer#define_the_components
The global self-attention layer: each query sees the whole context.
The causal self-attention layer: each query sees only itself and earlier positions.
The cross-attention layer: queries from the decoder sequence attend to the encoder's output (the context).
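A minimal sketch of these three layers, assuming tf.keras.layers.MultiHeadAttention (TF 2.10+ for the use_causal_mask argument); the class names follow the linked tutorial, and num_heads / key_dim are left to the caller.

```python
import tensorflow as tf

class BaseAttention(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(**kwargs)
        self.layernorm = tf.keras.layers.LayerNormalization()
        self.add = tf.keras.layers.Add()

class GlobalSelfAttention(BaseAttention):
    def call(self, x):
        # Every query position attends to every position in x.
        attn_output = self.mha(query=x, value=x, key=x)
        return self.layernorm(self.add([x, attn_output]))

class CausalSelfAttention(BaseAttention):
    def call(self, x):
        # A causal mask blocks attention to future positions.
        attn_output = self.mha(query=x, value=x, key=x, use_causal_mask=True)
        return self.layernorm(self.add([x, attn_output]))

class CrossAttention(BaseAttention):
    def call(self, x, context):
        # Decoder queries attend to the encoder output (context).
        attn_output = self.mha(query=x, key=context, value=context)
        return self.layernorm(self.add([x, attn_output]))
```

For example, GlobalSelfAttention(num_heads=2, key_dim=64) can be called on a tensor of shape (batch, seq_len, d_model) and returns a tensor of the same shape.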
Layer normalization
- Normalizes the activations of each layer
- Subtracts the mean and divides by the standard deviation, computed across the feature dimension (see the sketch below)
- Usually placed right before the activation function
- Why?
  - Mitigates internal covariate shift
  - Avoids the vanishing gradient problem
  - Acts as regularization
  - Leads to faster convergence
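A minimal sketch of this computation, assuming TensorFlow; with the default gamma = 1, beta = 0 initialization, the built-in tf.keras.layers.LayerNormalization produces approximately the same output.

```python
import tensorflow as tf

def layer_norm(x, eps=1e-6):
    # Subtract the mean and divide by the standard deviation,
    # both computed over the feature (last) axis.
    mean = tf.reduce_mean(x, axis=-1, keepdims=True)
    var = tf.math.reduce_variance(x, axis=-1, keepdims=True)
    return (x - mean) / tf.sqrt(var + eps)

x = tf.random.normal([2, 10, 512])                      # (batch, seq_len, d_model)
manual = layer_norm(x)
builtin = tf.keras.layers.LayerNormalization(epsilon=1e-6)(x)
```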
Residual connection
- Adds the sublayer's input to its output, x + Sublayer(x), letting gradients flow directly through the skip connection
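A minimal sketch of the resulting "Add & Norm" pattern around a sublayer, assuming TensorFlow; the feed-forward sublayer and the d_model / dff sizes here are illustrative placeholders.

```python
import tensorflow as tf

class ResidualBlock(tf.keras.layers.Layer):
    def __init__(self, d_model=512, dff=2048):
        super().__init__()
        # Illustrative feed-forward sublayer; any sublayer with matching
        # output width d_model could be wrapped the same way.
        self.sublayer = tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation="relu"),
            tf.keras.layers.Dense(d_model),
        ])
        self.add = tf.keras.layers.Add()
        self.layernorm = tf.keras.layers.LayerNormalization()

    def call(self, x):
        # Skip connection: the input is added to the sublayer output,
        # then the sum is layer-normalized.
        return self.layernorm(self.add([x, self.sublayer(x)]))
```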