How can I perform speaker verification with x-vectors based on the ivectorSystem documentation?

I am trying to create a basic voice-based attendance system as a beginner project in biometric security, using MathWorks' implementation of x-vector systems. Based on the x-vector speaker verification example at https://www.mathworks.com/help/audio/ug/speaker-recognition-using-x-vectors.html, I have already trained the TDNN (x-vector) system and the PLDA scoring. I have also obtained thresholds for the PLDA and cosine similarity scoring from the detection error tradeoff figure, using the x-axis values at the EER.
Since that example states that i-vector and x-vector systems share the same classifier backend ("The x-vector system backend, or classifier, is the same as developed for i-vector systems. For details on the backend, see Speaker Verification Using i-vectors and ivectorSystem."), how would I adapt the verify() function of ivectorSystem (https://www.mathworks.com/help/audio/ref/ivectorsystem.html), as used in the speaker verification using i-vectors example, to use x-vectors instead? Presumably, in the x-vector speaker recognition example, the helper functions are wrappers around the x-vector model.

Accepted Answer

Brian Hemmat
Brian Hemmat 2024-5-6
I don't think you can reuse the verify method for your purpose, but here are the general steps you need to take:
To perform speaker verification, you need a ground truth speaker embedding. It can be an i-vector, an x-vector, etc. If you've already trained the x-vector model using the recipe in the example, you'll want to perform preprocessing and prediction using the same pipeline. Speaker Diarization Using x-vectors uses the x-vector model and walks through the preprocessing steps. Here is just a sketch of what it would look like:
% Known speaker: extract features, normalize with the statistics computed
% during training, and take the embedding from the "fc_1" layer.
x = knownspeechsignal;
features = (extract(afe,x)-globalPrecomputedMean)./globalPrecomputedSTD;
embeddingTemplate = predict(model,dlarray(features,'TCB'),Outputs="fc_1");
When you have unknown speech, you perform the same steps.
% Unknown speaker: same preprocessing and prediction pipeline.
x = unknownspeechsignal;
features = (extract(afe,x)-globalPrecomputedMean)./globalPrecomputedSTD;
embeddingUnknown = predict(model,dlarray(features,'TCB'),Outputs="fc_1");
To perform speaker verification, you score the two embeddings using either PLDA or CSS (cosine similarity scoring). Here's an example of CSS:
% Cosine similarity: higher scores indicate more similar embeddings.
css = dot(embeddingTemplate,embeddingUnknown)/(norm(embeddingTemplate)*norm(embeddingUnknown));
speakerisverified = css >= threshold
You'll need to maintain a list of template embeddings to look up when attempting to perform speaker verification.
Here's a sketch of it all together.
% Create templates for known speakers.
x = knownspeechsignal_1;
features = (extract(afe,x)-globalPrecomputedMean)./globalPrecomputedSTD;
embeddingTemplate_1 = predict(model,dlarray(features,'TCB'),Outputs="fc_1");
x = knownspeechsignal_2;
features = (extract(afe,x)-globalPrecomputedMean)./globalPrecomputedSTD;
embeddingTemplate_2 = predict(model,dlarray(features,'TCB'),Outputs="fc_1");
% Create an enrollment list. Wrap the embeddings in a cell array so that
% each key maps to one embedding vector.
enrolledSpeakers = dictionary(["speaker 1","speaker 2"],{embeddingTemplate_1,embeddingTemplate_2});
% Extract embedding from unknown speaker
x = unknownspeechsignal;
features = (extract(afe,x)-globalPrecomputedMean)./globalPrecomputedSTD;
embeddingUnknown = predict(model,dlarray(features,'TCB'),Outputs="fc_1");
% Unknown speaker purports to be speaker 1; verify that claim.
claimedidentity = "speaker 1";
embeddingTemplate = enrolledSpeakers{claimedidentity};
css = dot(embeddingTemplate,embeddingUnknown)/(norm(embeddingTemplate)*norm(embeddingUnknown));
speakerisverified = css >= threshold
The PLDA model is not currently offered standalone; you can use the internal version that ivectorSystem has at your own risk (it is not intended to be user-facing and may change at any time). To see an example of using it, step through either the x-vector training example or the diarization example. Alternatively, this example walks through the nitty-gritty of the entire i-vector system, including the G-PLDA scoring: Speaker Verification Using i-vectors.
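For orientation, here is a sketch of generic Gaussian PLDA log-likelihood-ratio scoring, not the toolbox's internal implementation. SigmaAC and SigmaWC are hypothetical names for the across-class and within-class covariance matrices estimated during PLDA training, and w1 and w2 are embedding column vectors that have been centered and projected the same way as the training data:
% Generic G-PLDA scoring sketch. Higher llr means the two embeddings are
% more likely to come from the same speaker.
SigmaTot = SigmaAC + SigmaWC;
invTot = inv(SigmaTot);
M = inv(SigmaTot - SigmaAC*invTot*SigmaAC);
Q = invTot - M;           % quadratic term for each embedding
P = invTot*SigmaAC*M;     % cross term between the two embeddings
% Log-likelihood ratio, up to an additive constant shared by all trials
% (so it does not affect where you set the threshold).
llr = w1'*Q*w1 + w2'*Q*w2 + 2*w1'*P*w2;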
Also, depending on the difficulty of your speaker verification task, you might consider using the speakerRecognition function, which returns a pretrained i-vector system.
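A minimal sketch of that route (the file names are placeholders, and the recordings should match the system's sample rate):
ivs = speakerRecognition;                       % pretrained ivectorSystem
[x,fs] = audioread("speaker1_enroll.wav");      % placeholder file name
x = resample(x,ivs.SampleRate,fs);              % match the system sample rate
enroll(ivs,{x},"speaker 1");                    % enroll a known speaker
[y,fs] = audioread("unknown.wav");              % placeholder file name
y = resample(y,ivs.SampleRate,fs);
[tf,score] = verify(ivs,y,"speaker 1","plda");  % score the claimed identity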
Please ask any clarifying questions--I'm hoping to add some examples where the whole workflow (detection error tradeoff, identification, verification) is componentized.
2 Comments
Chieh Vee
Chieh Vee 2024-5-12
Edited: Chieh Vee 2024-5-12
Hi Brian! Sorry for the late reply. I tried your general code and, with some minor modifications to the enrollment table code as well as averaging the x-vector embeddings during enrollment, your general structure worked perfectly for me. I made a small working example using a file selector for the enrollment and verification parts, and it worked well (though it did make me realise I would have to clean my audio beforehand: using my laptop microphone as the recording device introduced a lot of noise, and two different users who had recorded audio with it yielded similarity scores of over 0.85. I suppose this can be mitigated by cleaning the audio or being more stringent with the threshold). Thanks to your code, I think I can start implementing it into a record-attendance App Designer window now.
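For reference, the enrollment averaging looked roughly like this (a sketch; fileList is a placeholder, and afe, model, and the normalization statistics follow your earlier sketch):
% Average the x-vector embeddings of several enrollment recordings.
embeddings = [];
for ii = 1:numel(fileList)
    [x,fs] = audioread(fileList(ii));
    features = (extract(afe,x)-globalPrecomputedMean)./globalPrecomputedSTD;
    emb = predict(model,dlarray(features,'TCB'),Outputs="fc_1");
    embeddings = [embeddings,emb]; %#ok<AGROW>
end
embeddingTemplate = mean(embeddings,2);   % one template per speaker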
I have some questions about the detection error tradeoff; do you mind explaining the general idea of how it would work?
Thus far, based on the original "Speaker Recognition Using x-vectors" article, the following was done to make x-vector embeddings for a separate DET set, and then the DET was graphed.
% Extract x-vectors for the DET set, then apply the projection matrix.
xvecsDET = minibatchpredict(dlnet,xDET, ...
    MiniBatchSize=1, ...
    Outputs="fc_1");
xvecsDET = xvecsDET';
xvecsDETP = projMat*xvecsDET;
detTable = helperDetectionErrorTradeoff(xvecsDETP,tDET,enrollmentTable,plda)
Based on this, if I were to segregate my own dataset of user-recorded audio (around 20 recordings per user for 30 users) into, say, xvecsDETNew following the same general procedure, would it work seamlessly, assuming I reuse the same PLDA model that was trained in the article? If not, is there a different way of doing so? Also, what would be a decent manual way to obtain the FAR, FRR, EER, and thresholds for cosine similarity with this system?
Brian Hemmat
Brian Hemmat 2024-5-16
Edited: Brian Hemmat 2024-5-16
I don't follow the first question--I would say try it, and if it doesn't work, provide some code that led to the error.
Regarding the second question about a general way to obtain a DET plot and calculate the FAR, FRR, and EER, that's also done explicitly here: Speaker Verification Using i-vectors. There are different ways to calculate the DET in terms of what data you use. Often there are explicit pairs you want to score against each other (at least, that's how competitions on the subject usually work); a sketch of scoring such a trial list is below.
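Here is a sketch of scoring an explicit trial list (hypothetical variable names; embeddings holds one embedding per column, as in the exhaustive sketch that follows):
% trials is a table with columns EnrollIdx, TestIdx, and IsTarget.
scores = zeros(height(trials),1);
for ii = 1:height(trials)
    a = embeddings(:,trials.EnrollIdx(ii));
    b = embeddings(:,trials.TestIdx(ii));
    scores(ii) = dot(a,b)/(norm(a)*norm(b));   % cosine similarity per trial
end
scoreLike = scores(trials.IsTarget);       % target (same-speaker) trials
scoreUnlike = scores(~trials.IsTarget);    % nontarget trials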
I've found that just exhaustively pairing all embeddings gives about the same results. Below is a sketch of that.
Assume we have a matrix of embedding vectors output from your model.
embeddingLength = 200;
numEmbeddings = 20*30;   % 20 recordings for each of 30 users
embeddings = rand(embeddingLength,numEmbeddings);
Each embedding vector has a corresponding label, so the elements of labels correspond to the columns of embeddings.
labels = categorical(repelem(1:30,20));
Calculate scores for all pairs of embeddings--we'll throw away the repetitions later.
scoresmat = css(embeddings,embeddings);
Create a matrix that says whether each pair of labels belongs to the same or different speakers.
class_matrix = labels'==labels;
Isolate the scores that correspond to matched pairs and the scores that correspond to unmatched pairs.
n = size(scoresmat,1);
% Keep the strictly lower triangle only, so each pair is scored once and
% self-comparisons are excluded.
lower_triangular_logical = tril(ones(n,n),-1) == 1;
scoresmat(~lower_triangular_logical) = nan;
scoreLike = scoresmat(class_matrix);
scoreUnlike = scoresmat(~class_matrix);
scoreLike(isnan(scoreLike)) = [];
scoreUnlike(isnan(scoreUnlike)) = [];
Define a range of thresholds to test.
numThresholdsInSweep = 1000;
Thresholds = linspace(min(scoreUnlike),max(scoreLike),numThresholdsInSweep);
Calculate the false reject rate for each threshold in the sweep.
FRR = mean(scoreLike(:)<Thresholds(:)',1);
Calculate the false acceptance rate for each threshold in the sweep.
FAR = mean(scoreUnlike(:)>=Thresholds(:)',1);
Get the threshold where the FRR and FAR intersect. (A better version of this would interpolate between the bracketing points; see the sketch after the EER calculation below.)
[~,EERThresholdIdx] = min(abs(FRR-FAR));
EERThreshold = Thresholds(EERThresholdIdx);
Calculate the EER.
EER = mean([FAR(EERThresholdIdx),FRR(EERThresholdIdx)]);
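Here is a sketch of the interpolated version mentioned above: find where FRR - FAR changes sign, then linearly interpolate between the two bracketing thresholds.
% Locate the sign change of FRR-FAR and interpolate the crossing point.
d = FRR - FAR;
k = find(d(1:end-1).*d(2:end) <= 0,1);    % bracketing index
t = d(k)/(d(k)-d(k+1));                   % fraction between k and k+1
EERThresholdInterp = Thresholds(k) + t*(Thresholds(k+1)-Thresholds(k));
EERInterp = mean([FRR(k)+t*(FRR(k+1)-FRR(k)), FAR(k)+t*(FAR(k+1)-FAR(k))]);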
Plot the results.
figure
plot(Thresholds,FRR,"k"), hold on
plot(Thresholds,FAR,"b")
plot(EERThreshold,EER,"ro",MarkerFaceColor="r")
title(["Equal Error Rate = " + round(EER,4),"Threshold = " + round(EERThreshold,4)])
xlabel('Threshold')
ylabel('Error Rate')
legend('FRR','FAR','Equal Error Rate (EER)')
grid on
axis tight
hold off
Supporting Functions
function y = css(w1,wt)
% Vectorized cosine similarity between every column of w1 and every column
% of wt. Returns a size(w1,2)-by-size(wt,2) matrix of scores.
% Save this on your path (or at the end of your script) to use it.
y = squeeze(sum(w1.*reshape(wt,size(wt,1),1,[]),1)./(vecnorm(w1).*reshape(vecnorm(wt),1,1,[])));
end

