searching a string for a word

Question

Lauren Harkness 2017-10-13

Direct link to this question

https://ww2.mathworks.cn/matlabcentral/answers/361096-searching-a-string-for-a-word

Edited: Cedric 2017-10-13

I have a text file and a set of words, and I want to count how often each of those words appears in the file. I have used strfind, but if one of the words is short, say "and", it is also found inside other words such as "band", whereas I only want to count it when it stands alone. I tried searching for the word with a space before and after it (so when it stands alone), but this misses the word when it is first or last on a line in the text file. My code is below.

A = fileread(txt)
fh = fopen(txt,'r')
B = strfind(A, firstword);
line = fgetl(fh)
C = strfind(A,secondword);
vec = [length(B),length(C)];

1 Comment

Cedric 2017-10-13

Edited: Cedric 2017-10-13


Part of the code is useless. The following

A = fileread(txt)

already opens the file, reads it as text, and closes it. After it is executed, A contains the full content of the file, so there is no need to open the file again with fopen and read one line with fgetl (and then forget to close it).

As explained by Per below, STRFIND matches exact occurrences of the text that you are looking for, which makes it hard to use for matching patterns (situations a little more flexible than a fixed sequence of letters). Looking for white space before and after the word was a good first attempt, but it fails when the word is at the start or end of a line or is followed by punctuation, and there is also the upper/lower case issue.
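To make these failure modes concrete, here is a minimal sketch on a made-up sample string. The variable names and the text are illustrative assumptions, not taken from the original file.

```matlab
% Made-up sample: "and" stands alone at the start, in the middle and at the
% end, appears capitalized once, and is hidden inside "band" and "sandwich".
sampleText = 'and the band and a sandwich And and' ;

% Plain substring search: also counts the "and" inside "band" and "sandwich",
% and misses "And" because STRFIND is case-sensitive.
nSubstring = numel( strfind( sampleText, 'and' )) ;        % 5

% Space-padded search: only finds the occurrence in the middle, because the
% first and last words have no space on one side.
nPadded = numel( strfind( sampleText, ' and ' )) ;         % 1

% Word-boundary, case-insensitive pattern: counts the stand-alone words only.
nWords = numel( regexpi( sampleText, '\<and\>' )) ;        % 4
```

Neither STRFIND count matches the number of stand-alone words, which is the motivation for the regular expression approach.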

All these considerations are a good signal that you need a slightly more elaborate approach based on pattern matching with regular expressions. This is what Per develops. Note that he uses REGEXPI rather than REGEXP to make the solution case-insensitive.

Your code should look a bit like the following:

 textContent = fileread( textFile ) ;
 countWord1  = length( regexpi( textContent, ... )) ;
 countWord2  = length( regexpi( textContent, ... )) ;
 counts      = [countWord1, countWord2] ;

where ... are appropriate arguments (at least the pattern). Even better:

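 wordsToFind = {'and', 'here', 'not'} ;           % words to count
 textFile    = 'MyFile.txt' ;
 counts      = zeros( size( wordsToFind )) ;      % one count per word
 textContent = fileread( textFile ) ;             % whole file as one char array
 for wordId = 1 : numel( wordsToFind )
    % Build the whole-word pattern, e.g. '\<and\>' ('\\' yields one backslash).
    pattern = sprintf( '\\<%s\\>', wordsToFind{wordId} ) ;
    % REGEXPI returns the start index of each match; their number is the count.
    counts(wordId) = length( regexpi( textContent, pattern )) ;
 end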

where we loop over a series of words defined in a cell array, and we build the pattern proposed by Per dynamically.
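As a side note on how the pattern is built, the following sketch shows that the doubled backslashes in the SPRINTF format produce the same pattern as plain concatenation, and that the loop can also be written with CELLFUN. The sample text is made up for illustration, and the CELLFUN variant is only one possible alternative, not a replacement for the loop above.

```matlab
word = 'and' ;   % hypothetical search word

% In a SPRINTF format, '\\' produces a single backslash.
patternA = sprintf( '\\<%s\\>', word ) ;

% Concatenating single-quoted char arrays needs no escaping.
patternB = ['\<', word, '\>'] ;

samePattern = isequal( patternA, patternB ) ;   % true, both are '\<and\>'

% CELLFUN variant of the loop, applied to a made-up sample text.
sampleText  = 'Here and there, and not here.' ;
wordsToFind = {'and', 'here', 'not'} ;
counts = cellfun( @(w) numel( regexpi( sampleText, ['\<', w, '\>'] )), ...
                  wordsToFind ) ;               % [2 2 1]
```

Note that "there" does not count as an occurrence of "here", because the "h" is preceded by a word character.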


Answer 1

per isakson 2017-10-13

Direct link to this answer

https://ww2.mathworks.cn/matlabcentral/answers/361096-searching-a-string-for-a-word#answer_285576

Edited: per isakson 2017-10-13


Try

>> regexpi( 'And, and other words and_ 2and and', '(^|\W)and(\W|$)', 'start' )
ans =
     1     5    31

The pattern also matches the character before the word "and". Thus the returned start index will often point at a space rather than at the first letter of the word.

Better

>> regexpi( 'And, and other words and_ 2and and', '\<and\>', 'start' )
ans =
     1     6    32
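The difference between the two patterns can be inspected with the 'match' option, which returns the matched text instead of the start index. The sketch below reuses the string from above. The short 'and and' string is a made-up extra case, and the counts in the comments are what I would expect given that matches do not overlap.

```matlab
str = 'And, and other words and_ 2and and' ;

% The first pattern includes the neighboring separators in each match.
withSeparators = regexpi( str, '(^|\W)and(\W|$)', 'match' ) ;
% withSeparators is {'And,', ' and ', ' and'}

% The word-boundary pattern matches the word only.
wordsOnly = regexpi( str, '\<and\>', 'match' ) ;
% wordsOnly is {'And', 'and', 'and'}

% Side effect of consuming the separator: two adjacent occurrences share one
% space, the first match takes it, and the second occurrence is missed.
nConsumed = numel( regexpi( 'and and', '(^|\W)and(\W|$)' )) ;   % 1
nBoundary = numel( regexpi( 'and and', '\<and\>' )) ;           % 2
```

So besides returning shifted indices, the first pattern can undercount when the word is repeated back to back.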

Why read line by line rather than the entire text in one go?

str = fileread( filespec );
pos = regexpi( str, '\<and\>', 'start' );

Doc says:

\W Any character that is not alphabetic, numeric, or underscore. For English character sets, \W is equivalent to [^a-zA-Z_0-9]
\<expr Beginning of a word.
expr\> End of a word.
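A small sketch of what these definitions imply in practice, on a made-up sample string (an assumption for illustration only): underscores and digits count as word characters, while hyphens and punctuation act as separators.

```matlab
sample = 'and_ 2and and-or, AND!' ;   % made-up text

% For English text, \W and the explicit character class should flag the
% same positions.
posShorthand   = regexp( sample, '\W', 'start' ) ;
posExplicit    = regexp( sample, '[^a-zA-Z_0-9]', 'start' ) ;
sameSeparators = isequal( posShorthand, posExplicit ) ;

% "and_" and "2and" are not stand-alone matches, because "_" and "2" are
% word characters; the "and" before the hyphen and the "AND" before "!" are.
found = regexpi( sample, '\<and\>', 'match' ) ;
% found is {'and', 'AND'}
```

If words joined by an underscore or digit should count as separate words in your file, the word-boundary pattern alone will not do that, and the separators would have to be spelled out explicitly.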
