Transcribe speech signal to text
To use speech2text with the third-party speech services, you must download the extended Audio Toolbox™ functionality from File Exchange. The File Exchange submission includes a tutorial to get started with the third-party services.
Using wav2vec 2.0 requires Deep Learning Toolbox™ and installing the pretrained model.
Download wav2vec 2.0 Network
Download and install the pretrained wav2vec 2.0 model for speech-to-text transcription.
Type speechClient("wav2vec2.0") into the command line. If the pretrained model for wav2vec 2.0 is not installed, the function provides a download link. To install the model, click the link to download the file and unzip it to a location on the MATLAB path.
Alternatively, execute the following commands to download the wav2vec 2.0 model, unzip it to your temporary directory, and then add it to your MATLAB path.
downloadFile = matlab.internal.examples.downloadSupportFile("audio","wav2vec2/wav2vec2-base-960.zip");
wav2vecLocation = fullfile(tempdir,"wav2vec");
unzip(downloadFile,wav2vecLocation)
addpath(wav2vecLocation)
Check that the installation is successful by typing speechClient("wav2vec2.0") into the command line. If the model is installed, then the function returns a Wav2VecSpeechClient object.

ans = 
  Wav2VecSpeechClient with properties:

    Segmentation: 'word'
      TimeStamps: 0
Perform Speech-to-Text Transcription
Read in an audio file containing speech and listen to it.
[y,fs] = audioread("speech_dft.wav"); soundsc(y,fs)
Create a speechClient object that uses the wav2vec 2.0 pretrained network. This requires installing the pretrained network. If the network is not installed, the function provides a link with instructions to download and install the pretrained model.
transcriber = speechClient("wav2vec2.0");
Call speech2text to obtain a transcription of the audio signal.
transcript = speech2text(transcriber,y,fs)
transcript=12×2 table
    Transcript     Confidence
    ___________    __________

    "the"           0.97605
    "discreet"      0.91927
    "fourier"       0.84546
    "transform"     0.89922
    "of"            0.66676
    "a"             0.50026
    "real"          0.88796
    "valued"        0.89913
    "signal"         0.8041
    "is"            0.53891
    "conjugate"     0.98438
    "symmetric"     0.89333
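Because the transcript is returned as a word-level table, you can post-process it with standard table operations. For example, a minimal sketch (the 0.8 threshold is an arbitrary choice for illustration) that joins the words into a sentence and keeps only high-confidence words:

% Join the recognized words into a single sentence.
sentence = join(transcript.Transcript);

% Keep only the rows whose confidence exceeds a threshold (0.8 here is arbitrary).
confidentWords = transcript(transcript.Confidence > 0.8,:);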
clientObj — Client object
Client object, specified as an object returned by
speechClient. The object is an interface to a pretrained wav2vec 2.0 model
or to a third-party speech service.
Using speech2text with wav2vec 2.0 requires Deep Learning Toolbox and installing the pretrained wav2vec 2.0 model. If the model is not installed, calling speechClient with "wav2vec2.0" provides a link to download and install the model.
To use any of the third-party speech services, you must download the extended Audio Toolbox functionality from File Exchange. The File Exchange submission includes a tutorial to get started with the third-party services.
audioIn — Audio input
Audio input signal, specified as a column vector (single channel).
fs — Sample rate (Hz)
Sample rate in Hz, specified as a positive scalar.
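Because the audio input must be a single-channel column vector, multichannel recordings need to be mixed down before transcription. A minimal sketch (the file name is a placeholder):

% Read a recording (placeholder file name) and mix down to mono if needed.
[y,fs] = audioread("myRecording.wav");
if size(y,2) > 1
    y = mean(y,2);
end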
timeout — Time to wait for server connection in seconds
Time to wait for the initial server connection in seconds, specified as a positive scalar. This value sets the timeout of the connection to the server. This argument applies only when clientObj interfaces with one of the third-party speech services.
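For a slow connection to a third-party service, you can increase the wait time. A hypothetical sketch, assuming the timeout is supplied as a name-value argument to speech2text (check the function signature in your release):

% Hypothetical: allow up to 30 seconds for the initial server connection.
transcript = speech2text(transcriber,y,fs,timeout=30);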
transcript — Speech transcript
table | string
Speech transcript of the input audio signal, returned as a table with a column containing the transcript and another column containing the associated confidence metrics. If the Segmentation property of clientObj is set to "none", then speech2text returns the transcript as a string. The returned table can have additional columns depending on the clientObj properties and server options.
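To get the transcript as a single string rather than a word-level table, the Segmentation property shown earlier can be changed from its 'word' default. A sketch, assuming Segmentation is accepted as a name-value argument when creating the client:

% Assumed usage: return the transcript as one string instead of a per-word table.
transcriber = speechClient("wav2vec2.0",Segmentation="none");
transcript = speech2text(transcriber,y,fs);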
rawOutput — Unprocessed server output
ResponseMessage | structure
Unprocessed server output, returned as a matlab.net.http.ResponseMessage object containing the HTTP response from the third-party speech service. If the third-party speech service is Amazon®, speech2text returns the server output as a structure. This output argument does not apply if clientObj interfaces with the wav2vec 2.0 pretrained model.
 Baevski, Alexei, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. “Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” 2020. https://doi.org/10.48550/ARXIV.2006.11477.
Introduced in R2022b