Attention layer: Number of parameters doesn't change when changing number of heads

14 views (last 30 days)
Changing the number-of-heads attribute of an attention layer from the MATLAB Deep Learning Toolbox doesn't seem to affect the resulting number of learnable parameters.
The following code results in 1793 total parameters:
% Number of heads for multi head attention layer
num_heads = 1;
% Number of key channels for query, key, and value
num_keys = 256;
% Number of classes
num_classes = 5;
% Define architecture
network_layers = [
sequenceInputLayer(1)
selfAttentionLayer(num_heads,num_keys)
fullyConnectedLayer(num_classes)
softmaxLayer
classificationLayer];
% Define layer graph
net = layerGraph;
net = addLayers(net,network_layers);
% Plot network structure
analyzeNetwork(net)
When changing the number of heads to, e.g., 16, the number of learnable parameters doesn't change.
% Number of heads for multi head attention layer
num_heads = 16;
Why is that?
Shouldn't the number of learnable parameters of the attention layer increase in proportion to the number of heads?
Any help is highly appreciated!

Answer (1)

Angelo Yeo, 2024-1-10
This is expected. Increasing or decreasing the number of heads of multi-head attention does not change the total number of learnable parameters, because multi-head attention divides the embedding dimensionality (or model dimensionality) by the number of heads. Multiple heads enable parallel computation; they do not change the number of parameters.
Let's assume a single-head attention model with model dimensionality d. When this model projects embeddings to a single triplet of Q, K, V tensors, each projection matrix is d-by-d, so it produces 3*d^2 parameters, excluding biases.
Let's also assume another model with multi-head attention and k attention heads. When this model projects embeddings to k triplets of d/k-dimensional Q, K, V tensors, each projection matrix is d-by-(d/k), so it produces k*3*d*(d/k) = 3*d^2 parameters, excluding biases, exactly the same total.
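To see this with concrete numbers, here is a minimal sketch (the values d = 256 and k = 16 are illustrative assumptions, chosen to match the question's num_keys and second num_heads; any values lead to the same conclusion):
% Worked check of the parameter counts above (biases excluded)
d = 256;  % model (embedding) dimensionality
k = 16;   % number of attention heads
singleHeadParams = 3*d*d        % one d-by-d projection each for Q, K and V
multiHeadParams  = k*3*d*(d/k)  % k projections of size d-by-(d/k) for Q, K and V
% Both expressions simplify to 3*d^2, so the totals are identical.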
3 comments
Angelo Yeo, 2024-7-5 (edited 2024-7-5)
Hi @铖, the code below might reproduce what you have experienced. So your question is why the learnable parameters of two models whose self-attention layers use different numbers of heads have exactly the same dimensions.
% Number of heads for multi head attention layer
num_heads1 = 1;
num_heads2 = 4;
% Number of key channels for query, key, and value
num_keys = 64;
% Number of classes
num_classes = 5;
% Define architectures
network_layers1 = [
sequenceInputLayer(1)
selfAttentionLayer(num_heads1,num_keys)
fullyConnectedLayer(num_classes)
softmaxLayer];
network_layers2 = [
sequenceInputLayer(1)
selfAttentionLayer(num_heads2,num_keys)
fullyConnectedLayer(num_classes)
softmaxLayer];
% Define layer graph
net1 = dlnetwork(network_layers1);
net2 = dlnetwork(network_layers2);
net1.Learnables
ans = 10x3 table
          Layer            Parameter            Value
    _______________    _______________    _______________
    "selfattention"    "QueryWeights"     {64x1 dlarray}
    "selfattention"    "KeyWeights"       {64x1 dlarray}
    "selfattention"    "ValueWeights"     {64x1 dlarray}
    "selfattention"    "OutputWeights"    { 1x64 dlarray}
    "selfattention"    "QueryBias"        {64x1 dlarray}
    "selfattention"    "KeyBias"          {64x1 dlarray}
    "selfattention"    "ValueBias"        {64x1 dlarray}
    "selfattention"    "OutputBias"       { 1x1 dlarray}
    "fc"               "Weights"          { 5x1 dlarray}
    "fc"               "Bias"             { 5x1 dlarray}
net2.Learnables
ans = 10x3 table
          Layer            Parameter            Value
    _______________    _______________    _______________
    "selfattention"    "QueryWeights"     {64x1 dlarray}
    "selfattention"    "KeyWeights"       {64x1 dlarray}
    "selfattention"    "ValueWeights"     {64x1 dlarray}
    "selfattention"    "OutputWeights"    { 1x64 dlarray}
    "selfattention"    "QueryBias"        {64x1 dlarray}
    "selfattention"    "KeyBias"          {64x1 dlarray}
    "selfattention"    "ValueBias"        {64x1 dlarray}
    "selfattention"    "OutputBias"       { 1x1 dlarray}
    "fc"               "Weights"          { 5x1 dlarray}
    "fc"               "Bias"             { 5x1 dlarray}
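As a quick sanity check (a minimal sketch reusing net1 and net2 from above), you can total the learnable parameters of both networks and confirm that they match:
% Sum the number of elements over every entry of the Learnables tables
countParams = @(net) sum(cellfun(@numel, net.Learnables.Value));
countParams(net1)  % total learnables of the 1-head network
countParams(net2)  % total learnables of the 4-head network: the same number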
To answer this question, it's good to understand what "multi-head attention" does. In essence, it divides the embedding dimensions and helps perform the calculations in parallel. Take a look at the multi-head attention figure in Attention Is All You Need (Vaswani et al., 2017).
In that figure, "h" indicates the number of heads, i.e., how many parts the embedding dimensions are split into. The "Concat" block shows that, after computing "Scaled Dot-Product Attention" for each head, the split results are concatenated back together.
Let's say you have the sentence "Anthony Hopkins admired Michael Bay as a great director." fed into the self-attention layer, and let's assume you have defined the embedding dimension as 512. Then the embedded sentence will be represented as a 9 x 512 matrix, where "9" is the number of tokens and "512" is the embedding dimension. If you set the number of heads to 8, there would be eight matrices of 9 x (512/8) = 9 x 64 dimensions. These eight matrices are concatenated after computing the attention scores. The example and figure are taken from K_1.3. Multi-head Attention, deep dive_EN - Deep Learning Bible - 3. Natural Language Processing - Eng. (wikidocs.net).
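As a rough illustration of that split-and-concatenate step (a sketch only, with random numbers standing in for the 9 x 512 embedded sentence from the example above):
% Split a 9-by-512 embedding into 8 heads of 64 channels, then concatenate back
numTokens = 9; embedDim = 512; numHeads = 8;
X = rand(numTokens, embedDim);                     % embedded sentence, 9-by-512
headDim = embedDim/numHeads;                       % 512/8 = 64 channels per head
heads = reshape(X, numTokens, headDim, numHeads);  % eight 9-by-64 slices
% ... scaled dot-product attention is computed on each 9-by-64 slice ...
Y = reshape(heads, numTokens, embedDim);           % "Concat" back to 9-by-512
isequal(X, Y)                                      % true: splitting and concatenating is lossless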
In conclusion, the number of heads does not affect the number of learnable parameters in the self-attention layer.
铖, 2024-7-5
Hello, thank you very much for your reply. I have understood the dimension issue of the multi-head attention mechanism. However, my actual problem is that when training with layer = selfAttentionLayer(4,64) or layer = selfAttentionLayer(1,64), they train exactly the same weights, meaning the values trained by these two different codes are exactly identical! This surprised me very much. I am using MATLAB R2023b. Thank you very much for your help!
