How to find the similarity between two text documents
I have two text documents.
For example, a.txt contains 'Hai How R U'
and b.txt contains 'Hai How are U'.
How can I calculate the cosine similarity or Euclidean distance for these two documents (text files)?
Thanks in advance.
2 Comments
Jan
2012-12-19
The Euclidean distance requires vectors of the same size. There are different edit distances, but I do not know the cosine distance. Perhaps it is better that you explain the details than that we search Wikipedia.
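One common way to get vectors of the same size from two texts is to count words over a shared vocabulary. The following is a minimal Python sketch of that idea, not a MATLAB solution. The lowercasing, the whitespace tokenization, and all function names are my own assumptions for illustration.

```python
import math
from collections import Counter

def word_counts(text):
    """Lowercase the text and count whitespace-separated words."""
    return Counter(text.lower().split())

def to_vectors(text_a, text_b):
    """Build two count vectors over the shared, sorted vocabulary."""
    counts_a, counts_b = word_counts(text_a), word_counts(text_b)
    vocab = sorted(set(counts_a) | set(counts_b))
    vec_a = [counts_a[w] for w in vocab]  # a missing word counts as 0
    vec_b = [counts_b[w] for w in vocab]
    return vocab, vec_a, vec_b

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

# The two example texts from the question.
vocab, vec_a, vec_b = to_vectors("Hai How R U", "Hai How are U")
cos_sim = cosine_similarity(vec_a, vec_b)
euc_dist = euclidean_distance(vec_a, vec_b)
print(vocab, vec_a, vec_b)
print(cos_sim, euc_dist)
```

With this tokenization the two texts share three of their four words, so the cosine similarity comes out as 0.75 and the Euclidean distance as sqrt(2). A different tokenization or weighting (for example TF-IDF) would give different numbers.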
info info
2020-3-20
I think the best way to measure text similarity is "shingling".
Shingling is a common technique for representing documents as sets. Given a document, its k-shingles are all the possible consecutive substrings of length k found within it. An example with k = 3 (counted in words) is given below:
## $Original
## [1] "The sky is blue and the sun is bright."
##
## $Shingled
## [1] "the sky is" "sky is blue" "is blue and" "blue and the"
## [5] "and the sun" "the sun is" "sun is bright"
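The shingled output above can be reproduced with a short Python sketch. The function name `word_shingles` and the tokenization rule (lowercase, keep only letters, digits and apostrophes) are assumptions on my part, chosen so that the result matches the listing.

```python
import re

def word_shingles(text, k=3):
    """Return the k-word shingles of text, in order of first appearance."""
    # Lowercase and drop punctuation so "bright." becomes "bright".
    words = re.findall(r"[a-z0-9']+", text.lower())
    shingles = []
    for i in range(len(words) - k + 1):
        shingle = " ".join(words[i:i + k])
        if shingle not in shingles:  # keep each shingle only once
            shingles.append(shingle)
    return shingles

shingled = word_shingles("The sky is blue and the sun is bright.")
print(shingled)
```

A text with fewer than k words yields an empty list in this sketch.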
Then we check whether each shingle occurs in each of our texts:
##                doc_1 doc_2 doc_3
## the sky is         1     1     1
## sky is blue        1     0     1
## is blue and        1     0     0
## blue and the       1     0     0
## and the sun        1     0     0
## the sun is         1     0     0
## sun is bright      1     0     1
## the sun in         0     1     0
## sun in the         0     1     0
## in the sky         0     1     0
## sky is bright      0     1     0
## we can see         0     0     1
## can see sun        0     0     1
## see sun is         0     0     1
## is bright the      0     0     1
## bright the sky     0     0     1
Then calculate the similarity for each pair of documents and take the largest value.
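The comment does not say which measure to calculate. Jaccard similarity (size of the intersection divided by size of the union) is a common choice for shingle sets, so the Python sketch below assumes it. The texts for doc_2 and doc_3 are not given in the thread; the ones used here are hypothetical reconstructions that happen to produce the same shingle table.

```python
import re
from itertools import combinations

def shingle_set(text, k=3):
    """Return the set of k-word shingles of a lowercased, punctuation-free text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)

docs = {
    "doc_1": "The sky is blue and the sun is bright.",
    # Hypothetical texts, reconstructed only to match the table above.
    "doc_2": "The sun in the sky is bright.",
    "doc_3": "We can see sun is bright, the sky is blue.",
}
sets = {name: shingle_set(text) for name, text in docs.items()}

# Score every pair of documents and keep the most similar pair.
scores = {
    (m, n): jaccard(sets[m], sets[n])
    for m, n in combinations(sets, 2)
}
best_pair = max(scores, key=scores.get)
print(scores)
print(best_pair, scores[best_pair])
```

Under these assumptions doc_1 and doc_3 share 3 of 12 distinct shingles, so they form the closest pair with a score of 0.25.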
Answers (1)
Jan
2012-12-19
Searching the File Exchange (FEX) is a good place to start:
- http://www.mathworks.com/matlabcentral/fileexchange/32449-edit-distances
- http://www.mathworks.com/matlabcentral/fileexchange/39049-edit-distance-algorithm
- http://www.mathworks.com/matlabcentral/fileexchange/36981-find-nearest-matching-string-from-a-set
- http://www.mathworks.com/matlabcentral/fileexchange/213-editdist-m
- http://www.mathworks.com/matlabcentral/fileexchange/17585-calculation-of-distance-between-strings
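For orientation, here is a minimal Python sketch of the classic Levenshtein edit distance, applied to the two example strings from the question. It is not taken from any of the submissions listed above, and the normalization into a similarity score at the end is just one possible convention that I am assuming.

```python
def levenshtein(s, t):
    """Edit distance with insert, delete and substitute, each costing 1."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty string
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(
                prev[j] + 1,         # delete cs
                curr[j - 1] + 1,     # insert ct
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

a = "Hai How R U"
b = "Hai How are U"
dist = levenshtein(a, b)

# Assumed normalization: 1.0 means identical, 0.0 means nothing in common.
similarity = 1 - dist / max(len(a), len(b))
print(dist, similarity)
```

The comparison is case-sensitive, so turning 'R' into 'are' costs three edits here. Lowercasing both strings first would reduce that to two.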