Attention layer: Number of parameters doesn't change when changing number of heads

14 views (last 30 days)
Changing the number-of-heads attribute of an attention layer from the MATLAB Deep Learning Toolbox doesn't seem to affect the resulting number of learnable parameters.
The following code results in 1793 total parameters:
% Number of heads for multi head attention layer
num_heads = 1;
% Number of key channels for query, key, and value
num_keys = 256;
% Number of classes
num_classes = 5;
% Define architecture
network_layers = [
sequenceInputLayer(1)
selfAttentionLayer(num_heads,num_keys)
fullyConnectedLayer(num_classes)
softmaxLayer
classificationLayer];
% Define layer graph
net = layerGraph;
net = addLayers(net,network_layers);
% Plot network structure
analyzeNetwork(net)
When changing the number of heads to, e.g., 16, the number of learnable parameters doesn't change.
% Number of heads for multi head attention layer
num_heads = 16;
Why is that?
Shouldn't the number of learnable parameters of the attention layer increase in proportion to the number of heads?
Any help is highly appreciated!

Answer (1)

Angelo Yeo, 2024-1-10
This is expected. Increasing or decreasing the number of heads of multi-head attention does not change the total number of learnable parameters, because multi-head attention divides the embedding dimensionality (or model dimensionality) by the number of heads. Multiple heads enable parallel computation; they do not change the number of parameters.
Let's assume a single-head attention model with model dimensionality d. When this model projects embeddings to a single triplet of Q, K, V tensors, each projection matrix is d-by-d, so it produces 3*d^2 parameters, excluding biases.
Let's also assume another model with multi-head attention and k attention heads. When this model projects embeddings to k triplets of d/k-dimensional Q, K, V tensors, each projection matrix is d-by-(d/k), so it produces k*3*d*(d/k) = 3*d^2 parameters, excluding biases, exactly the same total.
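To see this with concrete numbers, here is a minimal sketch (the values d = 256 and k = 16 are illustrative assumptions, chosen to match the question's num_keys and second num_heads; any values lead to the same conclusion):
% Worked check of the parameter counts above (biases excluded)
d = 256;  % model (embedding) dimensionality
k = 16;   % number of attention heads
singleHeadParams = 3*d*d        % one d-by-d projection each for Q, K and V
multiHeadParams  = k*3*d*(d/k)  % k projections of size d-by-(d/k) for Q, K and V
% Both expressions simplify to 3*d^2, so the totals are identical.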
3 comments
Angelo Yeo, 2024-7-5 (edited 2024-7-5)
Hi @铖, the code below might reproduce what you have experienced. So your question is why the learnable parameters of two models whose self-attention layers use different numbers of heads have exactly the same dimensions.
% Number of heads for multi head attention layer
num_heads1 = 1;
num_heads2 = 4;
% Number of key channels for query, key, and value
num_keys = 64;
% Number of classes
num_classes = 5;
% Define architectures
network_layers1 = [
sequenceInputLayer(1)
selfAttentionLayer(num_heads1,num_keys)
fullyConnectedLayer(num_classes)
softmaxLayer];
network_layers2 = [
sequenceInputLayer(1)
selfAttentionLayer(num_heads2,num_keys)
fullyConnectedLayer(num_classes)
softmaxLayer];
% Define layer graph
net1 = dlnetwork(network_layers1);
net2 = dlnetwork(network_layers2);
net1.Learnables
ans = 10x3 table
          Layer            Parameter            Value
    _______________    _______________    _______________
    "selfattention"    "QueryWeights"     {64x1 dlarray}
    "selfattention"    "KeyWeights"       {64x1 dlarray}
    "selfattention"    "ValueWeights"     {64x1 dlarray}
    "selfattention"    "OutputWeights"    { 1x64 dlarray}
    "selfattention"    "QueryBias"        {64x1 dlarray}
    "selfattention"    "KeyBias"          {64x1 dlarray}
    "selfattention"    "ValueBias"        {64x1 dlarray}
    "selfattention"    "OutputBias"       { 1x1 dlarray}
    "fc"               "Weights"          { 5x1 dlarray}
    "fc"               "Bias"             { 5x1 dlarray}
net2.Learnables
ans = 10x3 table
          Layer            Parameter            Value
    _______________    _______________    _______________
    "selfattention"    "QueryWeights"     {64x1 dlarray}
    "selfattention"    "KeyWeights"       {64x1 dlarray}
    "selfattention"    "ValueWeights"     {64x1 dlarray}
    "selfattention"    "OutputWeights"    { 1x64 dlarray}
    "selfattention"    "QueryBias"        {64x1 dlarray}
    "selfattention"    "KeyBias"          {64x1 dlarray}
    "selfattention"    "ValueBias"        {64x1 dlarray}
    "selfattention"    "OutputBias"       { 1x1 dlarray}
    "fc"               "Weights"          { 5x1 dlarray}
    "fc"               "Bias"             { 5x1 dlarray}
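As a quick sanity check (a minimal sketch reusing net1 and net2 from above), you can total the learnable parameters of both networks and confirm that they match:
% Sum the number of elements over every entry of the Learnables tables
countParams = @(net) sum(cellfun(@numel, net.Learnables.Value));
countParams(net1)  % total learnables of the 1-head network
countParams(net2)  % total learnables of the 4-head network: the same number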
To answer this question, it's good to understand what "multi-head attention" does. In essence, it divides the embedding dimensions and helps perform the calculations in parallel. Take a look at the multi-head attention figure in Attention Is All You Need (Vaswani et al., 2017).
In that figure, "h" indicates the number of heads, i.e., how many parts the embedding dimensions are split into. The "Concat" block shows that, after computing "Scaled Dot-Product Attention" for each head, the split results are concatenated back together.
Let's say you have the sentence "Anthony Hopkins admired Michael Bay as a great director." fed into the self-attention layer, and let's assume you have defined the embedding dimension as 512. Then the embedded sentence will be represented as a 9 x 512 matrix, where "9" is the number of tokens and "512" is the embedding dimension. If you set the number of heads to 8, there would be eight matrices of 9 x (512/8) = 9 x 64 dimensions. These eight matrices are concatenated after computing the attention scores. The example and figure are taken from K_1.3. Multi-head Attention, deep dive_EN - Deep Learning Bible - 3. Natural Language Processing - Eng. (wikidocs.net).
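As a rough illustration of that split-and-concatenate step (a sketch only, with random numbers standing in for the 9 x 512 embedded sentence from the example above):
% Split a 9-by-512 embedding into 8 heads of 64 channels, then concatenate back
numTokens = 9; embedDim = 512; numHeads = 8;
X = rand(numTokens, embedDim);                     % embedded sentence, 9-by-512
headDim = embedDim/numHeads;                       % 512/8 = 64 channels per head
heads = reshape(X, numTokens, headDim, numHeads);  % eight 9-by-64 slices
% ... scaled dot-product attention is computed on each 9-by-64 slice ...
Y = reshape(heads, numTokens, embedDim);           % "Concat" back to 9-by-512
isequal(X, Y)                                      % true: splitting and concatenating is lossless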
In conclusion, the number of heads does not affect the number of learnable parameters in the self-attention layer.
铖, 2024-7-5
Hello, thank you very much for your reply. I have understood the dimension issue of the multi-head attention mechanism. However, my actual problem is that when training with layer = selfAttentionLayer(4,64) or layer = selfAttentionLayer(1,64), they train exactly the same weights, meaning the values trained by these two different codes are exactly identical! This surprised me very much. I am using MATLAB R2023b. Thank you very much for your help!
