Interface with pretrained model or third-party speech service
To use speechClient to interface with third-party speech services, you must download the extended Audio Toolbox™ functionality from File Exchange. The File Exchange submission includes a tutorial to get started with the third-party services.
Using wav2vec 2.0 requires Deep Learning Toolbox™ and installing the pretrained model.
name — Pretrained model or service name
Name of the pretrained model or speech service, specified as one of the following:
"wav2vec2.0": Use a pretrained wav2vec 2.0 model. You can use wav2vec 2.0 only for speech-to-text transcription, not for text-to-speech synthesis.
"Google": Interface with the Google® Cloud Speech-to-Text and Text-to-Speech service.
"IBM": Interface with the IBM® Watson Speech to Text and Text to Speech service.
"Microsoft": Interface with the Microsoft® Azure® Speech service.
"Amazon": Interface with the Amazon® Transcribe and Amazon Polly services.
Using the wav2vec 2.0 pretrained model requires Deep Learning Toolbox and installing the pretrained wav2vec 2.0 model. If the model is not installed, calling speechClient with "wav2vec2.0" provides a link to download and install the pretrained model.
To use any of the third-party speech services (Google, IBM, Microsoft, or Amazon), you must download the extended Audio Toolbox functionality from File Exchange. The File Exchange submission includes a tutorial to get started with the third-party services.
Segmentation — Segmentation of transcript
Segmentation of the output transcript, specified as one of the following options:
Word segmentation ("word"): speech2text returns the transcript as a table where each word is in its own row. This is the default for the wav2vec 2.0 pretrained model.
Sentence segmentation: speech2text returns the transcript as a table where each sentence is in its own row. The wav2vec 2.0 pretrained model does not support this option.
No segmentation: speech2text returns a string containing the entire transcript. This is the default for the Amazon speech service.
This property applies only to the wav2vec 2.0 pretrained model and the Amazon speech service.
TimeStamps — Include timestamps in transcript
false (default) | true
Include timestamps of transcribed speech in the transcript, specified as true or false. If you specify TimeStamps as true, speech2text includes an additional column in the transcript table that contains the timestamps. When using the wav2vec 2.0 pretrained model, the function determines the timestamps using the algorithm described in [2].
This property applies only if you set the Segmentation property to an option that returns the transcript as a table.
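As a minimal sketch of how the TimeStamps property might be used, the following code creates a wav2vec 2.0 client with timestamps enabled and transcribes the sample file used elsewhere on this page. It assumes that speechClient accepts properties as name-value arguments at construction, and that Audio Toolbox, Deep Learning Toolbox, and the pretrained model are installed.

```matlab
% Assumption: properties can be set as name-value arguments when the
% client is created. Requires the installed wav2vec 2.0 pretrained model.
transcriber = speechClient("wav2vec2.0",TimeStamps=true);

% Read the sample speech file and transcribe it.
[y,fs] = audioread("speech_dft.wav");
transcript = speech2text(transcriber,y,fs);

% With word segmentation (the wav2vec 2.0 default), the transcript is a
% table with one word per row plus a column of timestamps.
disp(transcript)
```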
TimeOut — Connection timeout
Connection timeout, specified as a nonnegative scalar in seconds. The timeout specifies the time to wait for the initial server connection to the third-party speech service.
This property applies only to the third-party speech services.
For the third-party speech services, you can configure server-specific options using the following functions. See the documentation for the specific service for option names and values.
Download wav2vec 2.0 Network
Download and install the pretrained wav2vec 2.0 model for speech-to-text transcription.
Type speechClient("wav2vec2.0") into the command line. If the pretrained model for wav2vec 2.0 is not installed, the function provides a download link. To install the model, click the link to download the file and unzip it to a location on the MATLAB path.
Alternatively, execute the following commands to download the wav2vec 2.0 model, unzip it to your temporary directory, and then add it to your MATLAB path.
downloadFile = matlab.internal.examples.downloadSupportFile("audio","wav2vec2/wav2vec2-base-960.zip");
wav2vecLocation = fullfile(tempdir,"wav2vec");
unzip(downloadFile,wav2vecLocation)
addpath(wav2vecLocation)
Check that the installation is successful by typing speechClient("wav2vec2.0") into the command line. If the model is installed, then the function returns a Wav2VecSpeechClient object.
ans = 
  Wav2VecSpeechClient with properties:

    Segmentation: 'word'
      TimeStamps: 0
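The download commands in this example can also be wrapped in a guard so that the archive is fetched and unzipped only once per location. The following is a sketch under the assumption that an existing wav2vec folder in the temporary directory means the model has already been unzipped there; it reuses the same support-file path as the commands in this example.

```matlab
% Target folder for the unzipped model (same location as in this example).
wav2vecLocation = fullfile(tempdir,"wav2vec");

% Assumption: if the folder already exists, the model was unzipped earlier,
% so the download and unzip steps are skipped.
if ~isfolder(wav2vecLocation)
    downloadFile = matlab.internal.examples.downloadSupportFile("audio","wav2vec2/wav2vec2-base-960.zip");
    unzip(downloadFile,wav2vecLocation)
end

% Make the model visible to MATLAB for the current session.
addpath(wav2vecLocation)
```

The addpath call affects only the current MATLAB session, and the temporary directory may be cleared by the operating system, so a permanent folder is a better choice for long-term use.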
Perform Speech-to-Text Transcription
Read in an audio file containing speech and listen to it.
[y,fs] = audioread("speech_dft.wav");
soundsc(y,fs)
Create a speechClient object that uses the wav2vec 2.0 pretrained network. This requires installing the pretrained network. If the network is not installed, the function provides a link with instructions to download and install the pretrained model.
transcriber = speechClient("wav2vec2.0");
Use speech2text to obtain a transcription of the audio signal.
transcript = speech2text(transcriber,y,fs)
transcript=12×2 table
    Transcript     Confidence
    ___________    __________

    "the"           0.97605
    "discreet"      0.91927
    "fourier"       0.84546
    "transform"     0.89922
    "of"            0.66676
    "a"             0.50026
    "real"          0.88796
    "valued"        0.89913
    "signal"         0.8041
    "is"            0.53891
    "conjugate"     0.98438
    "symmetric"     0.89333
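Because the word-segmented transcript is an ordinary MATLAB table, standard table and string operations apply to it. The sketch below rebuilds the first six rows of the table shown above by hand (so that it runs without the pretrained model), joins the words into one string, and selects words below a confidence threshold. The 0.7 threshold is an arbitrary value chosen for illustration.

```matlab
% Rebuild the first six rows of the transcript table shown above.
words = ["the";"discreet";"fourier";"transform";"of";"a"];
confidence = [0.97605;0.91927;0.84546;0.89922;0.66676;0.50026];
transcript = table(words,confidence,VariableNames=["Transcript","Confidence"]);

% Join the per-word rows into a single string.
sentence = join(transcript.Transcript," ");

% Select the words whose confidence is below an illustrative threshold.
threshold = 0.7;
lowConfidence = transcript(transcript.Confidence < threshold,:);

disp(sentence)
disp(lowConfidence)
```

Reviewing low-confidence rows in this way can help locate words that may need manual correction.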
[1] Baevski, Alexei, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. “Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” 2020. https://doi.org/10.48550/ARXIV.2006.11477.
[2] Kürzinger, Ludwig, Dominik Winkelbauer, Lujun Li, Tobias Watzel, and Gerhard Rigoll. “CTC-Segmentation of Large Corpora for German End-to-End Speech Recognition.” In Speech and Computer, edited by Alexey Karpov and Rodmonga Potapova, 12335:267–78. Cham: Springer International Publishing, 2020. https://doi.org/10.1007/978-3-030-60276-5_27.
Introduced in R2022b