How to find the similarity between two text documents

I have two text documents.
For example, the file a.txt contains 'Hai How R U'
and the file b.txt contains 'Hai How are U'.
How can I calculate the cosine similarity or the Euclidean distance between these two documents (text files)?
Thanks in advance.
  2 Comments
Jan 2012-12-19
The Euclidean distance requires vectors of the same size. There are different edit distances, but I do not know the cosine distance. Perhaps it would be better for you to explain the details than for us to search Wikipedia.
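One possible way to get vectors of the same size, sketched here in Python rather than MATLAB, is to count the words of both texts over a shared vocabulary. The helper names (`word_counts`, `to_vectors`, `cosine_similarity`, `euclidean_distance`) and the whitespace tokenization are assumptions made for illustration, not something given in the question.

```python
import math
from collections import Counter


def word_counts(text):
    # Lowercase the text and split on whitespace (a deliberately simple tokenizer).
    return Counter(text.lower().split())


def to_vectors(text_a, text_b):
    # Build two count vectors over the union of both vocabularies,
    # so that both vectors have the same length.
    counts_a, counts_b = word_counts(text_a), word_counts(text_b)
    vocab = sorted(set(counts_a) | set(counts_b))
    return [counts_a[w] for w in vocab], [counts_b[w] for w in vocab]


def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    # Return 0.0 if either vector is all zeros (for example, an empty text).
    return dot / norm if norm else 0.0


def euclidean_distance(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# The two example texts from the question.
u, v = to_vectors("Hai How R U", "Hai How are U")
cos_sim = cosine_similarity(u, v)
euc_dist = euclidean_distance(u, v)
print(cos_sim, euc_dist)
```

For the two example texts, three of the five distinct words are shared, so the cosine similarity comes out as 0.75 and the Euclidean distance as the square root of 2. To work with a.txt and b.txt, the file contents would be read into strings first.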
info info 2020-3-20
I think the best way to measure text similarity is "shingling".
Shingling is a common technique for representing documents as sets. Given a document, its k-shingles are all the possible consecutive substrings of length k found within it (in the example here, the units are words). An example with k = 3 is given below:
## $Original
## [1] "The sky is blue and the sun is bright."
##
## $Shingled
## [1] "the sky is" "sky is blue" "is blue and" "blue and the"
## [5] "and the sun" "the sun is" "sun is bright"
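A minimal Python sketch of word-level shingling that would reproduce the output above. The function name `shingles` and the tokenization rule (lowercase, keep only letters, digits and apostrophes) are assumptions, since the comment does not say how the text was cleaned.

```python
import re


def shingles(text, k=3):
    # Lowercase and extract word tokens, which drops punctuation such as the final period.
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Slide a window of k words over the token list; texts shorter than k words give [].
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


result = shingles("The sky is blue and the sun is bright.")
for s in result:
    print(s)
```

Wrapping the result in `set(...)` gives the set representation that the comparison step needs.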
Then we check which shingles occur in each of our texts:
##                doc_1 doc_2 doc_3
## the sky is         1     1     1
## sky is blue        1     0     1
## is blue and        1     0     0
## blue and the       1     0     0
## and the sun        1     0     0
## the sun is         1     0     0
## sun is bright      1     0     1
## the sun in         0     1     0
## sun in the         0     1     0
## in the sky         0     1     0
## sky is bright      0     1     0
## we can see         0     0     1
## can see sun        0     0     1
## see sun is         0     0     1
## is bright the      0     0     1
## bright the sky     0     0     1
Then calculate the similarity between each pair of documents and take the largest value.
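The comment does not name the similarity measure. A common choice for shingle sets is the Jaccard similarity (size of the intersection divided by size of the union), so the sketch below assumes that. The shingle sets are copied from the table above; the names `docs`, `jaccard`, `scores` and `best_pair` are illustrative.

```python
from itertools import combinations

# Shingle sets read off the columns of the table (rows marked 1).
docs = {
    "doc_1": {"the sky is", "sky is blue", "is blue and", "blue and the",
              "and the sun", "the sun is", "sun is bright"},
    "doc_2": {"the sky is", "the sun in", "sun in the", "in the sky",
              "sky is bright"},
    "doc_3": {"the sky is", "sky is blue", "sun is bright", "we can see",
              "can see sun", "see sun is", "is bright the", "bright the sky"},
}


def jaccard(a, b):
    # |A intersect B| / |A union B|; two empty sets are treated as similarity 0.0 here.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Score every pair of documents and pick the pair with the largest value.
scores = {(m, n): jaccard(docs[m], docs[n]) for m, n in combinations(docs, 2)}
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])
```

With these sets, doc_1 and doc_3 share 3 of 12 distinct shingles, which gives 0.25 and is the largest of the three pairwise values.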


Answers (1)

Jan 2012-12-19
