deep.gpu.fastAttentionAlgorithms

Disable fast attention algorithms used by deep learning operations on the GPU

Since R2026a

    Description

    previousState = deep.gpu.fastAttentionAlgorithms(newState) returns the current state of the GPU fast attention algorithms option as 1 (true) or 0 (false) before changing the state according to the input newState. The default is 1 (true). This function requires Parallel Computing Toolbox™.

    If newState is 1 (true), then subsequent calls to GPU deep learning attention operations use algorithms optimized for performance. These algorithms achieve improved performance by using reduced-precision arithmetic, that is, arithmetic that uses fewer bits than single-precision arithmetic. If newState is 0 (false), then subsequent calls to GPU deep learning attention operations use higher-precision algorithms at the cost of performance.

    state = deep.gpu.fastAttentionAlgorithms returns the current state of the GPU fast attention algorithms option as 1 (true) or 0 (false) without changing it.

    Tip

    Use this function if your training loss is NaN and normalizing your training data does not resolve the issue. For more information about normalizing training data, see Normalize Sequence Data.
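    In that situation, a typical pattern is to disable the fast algorithms for the duration of your training code and restore the previous state afterward. The sketch below uses onCleanup so that the state is restored even if the training code errors; the comment placeholder stands in for your own training loop:

    ```matlab
    % Disable fast attention algorithms and capture the previous state.
    previousState = deep.gpu.fastAttentionAlgorithms(false);

    % Restore the previous state automatically when cleanupObj goes out of
    % scope, even if an error occurs during training.
    cleanupObj = onCleanup(@() deep.gpu.fastAttentionAlgorithms(previousState));

    % ... run your training code here; GPU attention operations now use
    % higher-precision algorithms at the cost of performance ...
    ```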

    Examples

    This example shows how disabling fast attention algorithms can prevent NaN values in the output of the attention operation when the input data has a large dynamic range.

    Specify the sizes of the queries, keys, and values.

    querySize = 120;
    valueSize = 120;
    numQueries = 100;
    numValues = 80;
    numObservations = 64;

    Create random, single-precision gpuArray data containing the queries, keys, and values. For the queries, specify the dlarray format "CBT" (channel, batch, time).

    queries = dlarray(rand(querySize,numObservations,numQueries,"single","gpuArray"),"CBT");
    keys = dlarray(rand(querySize,numObservations,numValues,"single","gpuArray"));
    values = dlarray(rand(valueSize,numObservations,numValues,"single","gpuArray"));

    Specify the number of attention heads.

    numHeads = 5;

    To simulate data with significant outliers, scale up 100 randomly selected elements of the query data so that it has a large dynamic range.

    idx = randperm(numel(queries),100);
    queries(idx) = 1e5*queries(idx);

    Inspect the smallest and largest query values.

    [minQuery,maxQuery] = bounds(queries,"all")
    minQuery = 
      1(C) × 1(B) × 1(T) single gpuArray dlarray
    
      2.2615e-06
    
    
    maxQuery = 
      1(C) × 1(B) × 1(T) single gpuArray dlarray
    
      9.8323e+04
    
    

    Apply the attention operation.

    Y = attention(queries,keys,values,numHeads);

    Count the number of NaN values in the output. Because the input data has a large dynamic range and the fast attention algorithm uses reduced-precision arithmetic, the output contains NaN values.

    sum(isnan(Y),"all")
    ans = 
      1(C) × 1(B) × 1(T) gpuArray dlarray
    
            1008
    
    

    Disable fast attention algorithms, storing the previous state, then apply the attention operation again. Because the attention operation now uses higher-precision arithmetic, it runs more slowly than before.

    previousState = deep.gpu.fastAttentionAlgorithms(false);
    Y = attention(queries,keys,values,numHeads);

    Check whether the output contains any NaN values. Because the attention operation now uses single-precision arithmetic, there are no NaN values in the output.

    any(isnan(Y),"all")
    ans = 
      1(C) × 1(B) × 1(T) logical gpuArray dlarray
    
       0
    
    

    Restore the fast attention algorithms option to its original state.

    deep.gpu.fastAttentionAlgorithms(previousState);

    Input Arguments

    newState — New state of the GPU fast attention algorithms option, specified as one of the following:

    • 1 (true) — Subsequent calls to GPU deep learning attention operations use algorithms optimized for performance. These algorithms achieve improved performance by using reduced-precision arithmetic, that is, arithmetic that uses fewer bits than single-precision arithmetic.

    • 0 (false) — Subsequent calls to GPU deep learning attention operations use higher-precision algorithms at the cost of performance.

    Deep learning layers and functions that perform attention operations, such as the attention function used in the example above, use GPU fast attention algorithms by default.

    Data Types: logical

    Version History

    Introduced in R2026a