Why, in the Transformer model from the "Attention Is All You Need" paper, is there no activation applied after the multi-head attention layer or the residual connections? It seems to me that there are multiple linear layers in a row, and I have always been under the impression that you should have an activation between linear layers.
For instance, when I look at the different flavors of ResNet, they always apply some sort of non-linearity after a linear layer. A residual block might look something like...
Input -> Conv -> BN -> Relu -> Conv -> (+ Input) -> BN -> Relu
or in the case of pre-activation...
Input -> BN -> Relu -> Conv -> BN -> Relu -> Conv -> (+ Input)
In all the ResNet flavors I have seen, they never allow two linear layers to be connected without a ReLU in between.
However, in the Transformer...
Input -> Multihead-Attn -> Add/Norm -> Feed Forward(Dense Layer -> Relu -> Dense Layer) -> Add/Norm
The multi-head attention layer performs the attention mechanism and then applies a fully connected layer to project back to the dimension of its input. However, there is no non-linearity between that projection and the feed-forward network (except perhaps the softmax used inside the attention). A model like this would make more sense to me...
Input -> Multihead-Attn -> Add/Norm -> Relu -> Feed Forward(Dense Layer -> Relu -> Dense Layer) -> Add/Norm -> Relu
or something like the pre-activation ResNet...
Input -> Relu -> Multihead-Attn -> Add/Norm -> Input2 -> Relu -> Feed Forward(Dense Layer -> Relu -> Dense Layer) -> Add/Norm(Input2)
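To make my concern concrete, here is a minimal NumPy sketch (with made-up shapes, not the actual model dimensions) showing that two dense layers with no activation in between are mathematically equivalent to a single dense layer:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(4, 8))    # batch of 4 token vectors, dim 8
W1 = rng.normal(size=(8, 8))   # e.g. the attention output projection
W2 = rng.normal(size=(8, 32))  # e.g. the first feed-forward layer

# Two linear layers in a row...
two_layers = (x @ W1) @ W2

# ...compose into one linear layer with weight W1 @ W2.
one_layer = x @ (W1 @ W2)

print(np.allclose(two_layers, one_layer))  # True
```

(In the real model there is an Add & LayerNorm step between the two sub-layers, which does break strict linearity because of the normalization, but it contains no elementwise activation like ReLU.)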
Can anyone explain why the transformer is the way it is?
I asked a similar question on another forum when I was looking at the architecture of WaveNet, but I never really got a clear answer. In that case, again, it did not make sense to me why no activation was applied to the residual connections. (https://www.reddit.com/r/MachineLearning/comments/njbjfb/d_is_there_a_point_to_having_layers_with_just_a/)
This goes back to the purpose of self-attention.
Similarity between word vectors is generally computed through cosine similarity because, in the high-dimensional spaces where word tokens live, it is highly unlikely for two words to be collinear, even when they are trained to be closer in value if they are similar. Still, two trained tokens will have higher cosine similarity when they are semantically close to each other than two completely unrelated words will.
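As a small illustration (with toy hand-picked vectors, not real trained embeddings), cosine similarity ignores magnitude and only measures the angle between vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

cat = np.array([0.9, 0.8, 0.1])    # toy "embedding"
dog = np.array([0.8, 0.9, 0.2])    # semantically close to cat
car = np.array([-0.1, 0.2, 0.95])  # unrelated

print(cosine_similarity(cat, dog))  # close to 1
print(cosine_similarity(cat, car))  # much smaller
```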
The self-attention mechanism exploits this fact: after several of these matrix multiplications, the dissimilar words will zero out or become negative due to the dot product between them, while similar words will stand out in the resulting matrix.
So, as Tom points out in the comments below, self attention can be viewed as a weighted average, where less similar words become averaged out faster (toward the zero vector, on average), thereby achieving groupings of important and unimportant words (i.e. attention). The weighting happens through the dot product. If input vectors were normalized, the weights would be exactly the cosine similarities.
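A bare-bones sketch of that weighted average (plain scaled dot-product self-attention with no learned projections, applied to toy vectors):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))  # 5 toy token vectors, dim 4

d = x.shape[-1]
weights = softmax(x @ x.T / np.sqrt(d))  # dot-product similarity -> weights
out = weights @ x  # each output row is a weighted average of the inputs

# Each row of weights is a convex combination: non-negative, summing to 1.
print(np.allclose(weights.sum(axis=1), 1.0))  # True
print((weights >= 0).all())                   # True
```

Tokens whose dot product with a given token is large receive large weights in that token's output row, so dissimilar tokens contribute little to the average.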
The important thing to take into consideration is that, within the scaled dot-product attention operation itself, there are no learned parameters; the linear projections around it are just there to capture the relationships between the different vectors by using the properties of the vectors that represent them.
Read this blog post by Peter Bloem for a more in-depth explanation of self-attention.