vggishPreprocess

Preprocess audio for VGGish feature extraction

Since R2021a

    Description

    features = vggishPreprocess(audioIn,fs) generates mel spectrograms from audioIn that can be fed to the VGGish pretrained network.

    features = vggishPreprocess(audioIn,fs,'OverlapPercentage',OP) specifies the overlap percentage between consecutive audio frames.

    For example, vggishPreprocess(audioIn,fs,'OverlapPercentage',75) applies a 75% overlap between consecutive frames used to generate the spectrograms.
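
    As an illustration, this sketch (using a hypothetical three-second noise signal, and assuming the Audio Toolbox model for VGGish is installed) shows that a larger OverlapPercentage yields more spectrograms from the same audio:

    fs = 16e3;
    audioIn = randn(3*fs,1);    % hypothetical 3-second test signal
    feats25 = vggishPreprocess(audioIn,fs,'OverlapPercentage',25);
    feats75 = vggishPreprocess(audioIn,fs,'OverlapPercentage',75);
    size(feats25,4)    % fewer spectrograms at 25% overlap
    size(feats75,4)    % more spectrograms at 75% overlap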

    Examples

    Download and unzip the Audio Toolbox™ model for VGGish.

    Type vggish at the Command Window. If the Audio Toolbox model for VGGish is not installed, then the function provides a link to the location of the network weights. To download the model, click the link. Unzip the file to a location on the MATLAB path.

    Alternatively, execute these commands to download and unzip the VGGish model to your temporary directory.

    % Download the network weights, unzip them, and add them to the MATLAB path.
    downloadFolder = fullfile(tempdir,'VGGishDownload');
    loc = websave(downloadFolder,'https://ssd.mathworks.com/supportfiles/audio/vggish.zip');
    VGGishLocation = tempdir;
    unzip(loc,VGGishLocation)
    addpath(fullfile(VGGishLocation,'vggish'))

    Check that the installation is successful by typing vggish at the Command Window. If the network is installed, then the function returns a SeriesNetwork (Deep Learning Toolbox) object.

    vggish
    ans = 
      SeriesNetwork with properties:
    
             Layers: [24×1 nnet.cnn.layer.Layer]
         InputNames: {'InputBatch'}
        OutputNames: {'regressionoutput'}
    
    

    Load a pretrained VGGish convolutional neural network and examine its layers.

    Use vggish to load the pretrained VGGish network. The output net is a SeriesNetwork (Deep Learning Toolbox) object.

    net = vggish
    net = 
      SeriesNetwork with properties:
    
             Layers: [24×1 nnet.cnn.layer.Layer]
         InputNames: {'InputBatch'}
        OutputNames: {'regressionoutput'}
    
    

    View the network architecture using the Layers property. The network has 24 layers. There are nine layers with learnable weights, of which six are convolutional layers and three are fully connected layers.

    net.Layers
    ans = 
      24×1 Layer array with layers:
    
         1   'InputBatch'         Image Input         96×64×1 images
         2   'conv1'              Convolution         64 3×3×1 convolutions with stride [1  1] and padding 'same'
         3   'relu'               ReLU                ReLU
         4   'pool1'              Max Pooling         2×2 max pooling with stride [2  2] and padding 'same'
         5   'conv2'              Convolution         128 3×3×64 convolutions with stride [1  1] and padding 'same'
         6   'relu2'              ReLU                ReLU
         7   'pool2'              Max Pooling         2×2 max pooling with stride [2  2] and padding 'same'
         8   'conv3_1'            Convolution         256 3×3×128 convolutions with stride [1  1] and padding 'same'
         9   'relu3_1'            ReLU                ReLU
        10   'conv3_2'            Convolution         256 3×3×256 convolutions with stride [1  1] and padding 'same'
        11   'relu3_2'            ReLU                ReLU
        12   'pool3'              Max Pooling         2×2 max pooling with stride [2  2] and padding 'same'
        13   'conv4_1'            Convolution         512 3×3×256 convolutions with stride [1  1] and padding 'same'
        14   'relu4_1'            ReLU                ReLU
        15   'conv4_2'            Convolution         512 3×3×512 convolutions with stride [1  1] and padding 'same'
        16   'relu4_2'            ReLU                ReLU
        17   'pool4'              Max Pooling         2×2 max pooling with stride [2  2] and padding 'same'
        18   'fc1_1'              Fully Connected     4096 fully connected layer
        19   'relu5_1'            ReLU                ReLU
        20   'fc1_2'              Fully Connected     4096 fully connected layer
        21   'relu5_2'            ReLU                ReLU
        22   'fc2'                Fully Connected     128 fully connected layer
        23   'EmbeddingBatch'     ReLU                ReLU
        24   'regressionoutput'   Regression Output   mean-squared-error
    

    Use analyzeNetwork (Deep Learning Toolbox) to visually explore the network.

    analyzeNetwork(net)

    Read in an audio signal.

    [audioIn,fs] = audioread('SpeechDFT-16-8-mono-5secs.wav');

    Plot and listen to the audio signal.

    T = 1/fs;
    t = 0:T:(length(audioIn)*T) - T;
    plot(t,audioIn);
    grid on
    xlabel('Time (s)')
    ylabel('Amplitude')

    soundsc(audioIn,fs)

    Use vggishPreprocess to extract mel spectrograms from the audio signal.

    melSpectVgg = vggishPreprocess(audioIn,fs);

    Create a VGGish network (this requires Deep Learning Toolbox). Call predict with the network on the preprocessed mel spectrogram images to extract feature embeddings. The feature embeddings are returned as a numFrames-by-128 matrix, where numFrames is the number of individual spectrograms and 128 is the number of elements in each feature vector.

    net = vggish;
    embeddings = predict(net,melSpectVgg);
    [numFrames,numFeatures] = size(embeddings)
    numFrames = 9
    
    numFeatures = 128
    

    Visualize the VGGish feature embeddings.

    surf(embeddings,'EdgeColor','none')
    view([90,-90])
    axis([1 numFeatures 1 numFrames])
    xlabel('Feature')
    ylabel('Frame')
    title('VGGish Audio Feature Embeddings')

    Input Arguments

    audioIn –– Input signal
    column vector | matrix

    Input signal, specified as a column vector or matrix. If you specify a matrix, vggishPreprocess treats the columns of the matrix as individual audio channels.

    Data Types: single | double

    fs –– Sample rate (Hz)
    positive scalar

    Sample rate of the input signal in Hz, specified as a positive scalar.

    Data Types: single | double

    OP –– Percentage overlap
    scalar in the range [0,100)

    Percentage overlap between consecutive mel spectrograms, specified as a scalar in the range [0,100).

    Data Types: single | double

    Output Arguments

    features –– Mel spectrograms
    96-by-64-by-1-by-K array

    Mel spectrograms generated from audioIn, returned as a 96-by-64-by-1-by-K array, where:

    • 96 –– Represents the number of 25 ms frames in each mel spectrogram.

    • 64 –– Represents the number of mel bands spanning 125 Hz to 7.5 kHz.

    • K –– Represents the number of mel spectrograms and depends on the length of audioIn, the number of channels in audioIn, and the value of OverlapPercentage.

      Note

      Each 96-by-64-by-1 patch represents a single mel spectrogram image. For multichannel inputs, mel spectrograms are stacked along the 4th dimension.

    Data Types: single
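
    As a rough guide, this sketch (using a hypothetical five-second stereo noise signal) estimates K from the signal duration, channel count, and overlap. The 0.96 s patch duration assumes the standard VGGish framing of 96 frames with a 10 ms hop, which is not stated on this page, so treat the estimate as approximate:

    fs = 16e3;
    dur = 5;                                    % signal duration in seconds
    audioIn = randn(dur*fs,2);                  % hypothetical 5-second stereo signal
    OP = 50;
    features = vggishPreprocess(audioIn,fs,'OverlapPercentage',OP);
    K = size(features,4)
    hop = 0.96*(1 - OP/100);                    % assumed stride between spectrograms (s)
    KPerChannel = floor((dur - 0.96)/hop) + 1;  % expected spectrograms per channel
    KEstimate = KPerChannel*size(audioIn,2)     % approximately equals K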

    References

    [1] Gemmeke, Jort F., et al. “Audio Set: An Ontology and Human-Labeled Dataset for Audio Events.” 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017, pp. 776–80. DOI.org (Crossref), doi:10.1109/ICASSP.2017.7952261.

    [2] Hershey, Shawn, et al. “CNN Architectures for Large-Scale Audio Classification.” 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017, pp. 131–35. DOI.org (Crossref), doi:10.1109/ICASSP.2017.7952132.

    Extended Capabilities

    C/C++ Code Generation
    Generate C and C++ code using MATLAB® Coder™.

    Version History

    Introduced in R2021a