Cosine Similarity using BERT

Question

Nicholas Ang 2021-6-30

Direct link to this question:

https://ww2.mathworks.cn/matlabcentral/answers/868608-cosine-similarity-using-bert

Commented: Nicholas Ang 2021-6-30

Accepted Answer: Divyam Gupta

I am using BERT to calculate similarities for question answering. I have encoded my question data using

data.Tokens = encode(mdl.Tokenizer,data.Questions) which returns a cell array.

Next, I encoded new text to test its similarity against the questions already encoded in the database: testTokens = encode(mdl.Tokenizer,text)

However, I am unable to call cosineSimilarity(data.Tokens,testTokens); it fails with this error:

Input must be a matrix, a tokenizedDocument array, a bagOfWords model, a bagOfNgrams model, a string array of words, or a cell array of character vectors.

Do I need padding here or reshape of my cell vectors?

Answer 1

Divyam Gupta 2021-6-30

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/868608-cosine-similarity-using-bert#answer_736543

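Hi Nicholas, I see that you are running into an error when computing cosine similarity on the output of a text encoder. According to the documentation at https://www.mathworks.com/help/textanalytics/ref/cosinesimilarity.html#d123e8335 the cosineSimilarity function expects a numeric matrix (or one of the other input types listed in the error message) to compute the similarity between documents. A cell array of numeric token vectors, which is what encode returns, is not one of the accepted inputs.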

Since the encoded vectors for the individual questions have different lengths, constructing a matrix from them directly might be difficult. Instead, you can do a pairwise comparison between the entries of data.Tokens and testTokens to compute the similarities. This can be done with a nested loop that stores each similarity score as it is computed.
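
The nested loop described above could be sketched roughly as follows. This is only an illustration under some assumptions: dataTokens and testTokens are small made-up stand-ins for data.Tokens and testTokens (cell arrays of numeric token-code vectors), the shorter vector in each pair is padded with zeros to a common length, and the score is computed by hand with dot and norm rather than by calling cosineSimilarity.

```matlab
% Hypothetical stand-ins for data.Tokens and testTokens: cell arrays of
% numeric token-code vectors with different lengths (values are made up).
dataTokens = {[102 2055 2004 103], [102 2129 2079 2017 103], [102 7592 103]};
testTokens = {[102 2055 2004 103]};

% Longest sequence across both sets; every vector is padded to this length.
allTokens = [dataTokens(:); testTokens(:)];
maxLen = max(cellfun(@numel, allTokens));

% Assumption: zero is an acceptable padding value for this comparison.
padFcn = @(v) [double(v(:).'), zeros(1, maxLen - numel(v))];

% scores(i,j) holds the cosine similarity between the i-th stored question
% and the j-th test text.
scores = zeros(numel(dataTokens), numel(testTokens));
for i = 1:numel(dataTokens)
    a = padFcn(dataTokens{i});
    for j = 1:numel(testTokens)
        b = padFcn(testTokens{j});
        scores(i,j) = dot(a, b) / (norm(a) * norm(b));
    end
end

% Index of the stored question most similar to the first test text.
[~, bestMatch] = max(scores(:,1));
```

Once every vector has been padded to the same length, the rows could also be stacked into a numeric matrix, which is an input type cosineSimilarity accepts. Note that this compares raw token codes position by position, so the padding value and the token order both affect the score.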

Hope this helps.