searching a string for a word

Lauren Harkness 2017-10-13
Edited: Cedric 2017-10-13
I have a text file and a set of words, and I want to count how often each word appears in the file. I used strfind, but a short search word such as "and" is also found inside longer words such as "band", and I only want to count it when it stands alone. I tried searching for the word with a space before and after it, but that misses the word when it is first or last on a line in the text file. My code is below.
A = fileread(txt)
fh = fopen(txt,'r')
B = strfind(A, firstword);
line = fgetl(fh)
C = strfind(A,secondword);
vec = [length(B),length(C)];
1 Comment
Cedric 2017-10-13
Edited: Cedric 2017-10-13
Part of the code is useless. The following
A = fileread(txt)
already opens the file, reads it as text, and closes it. After it is executed, A contains the full content of the file, so there is no need to open the file again and read one line (and forget to close it).
As Per explains below, STRFIND matches only exact occurrences of the text that you are looking for, which makes it hard to use for matching patterns (situations that need a little more flexibility than a literal sequence of letters). Looking for white space before and after the word was a good first attempt, but it fails in some cases, such as a word at the start or end of a line, and there is also the upper/lower case issue.
All of this is a good sign that you need a slightly more elaborate approach based on pattern matching with regular expressions. This is what Per develops. Note that he uses REGEXPI rather than REGEXP, to provide a case-insensitive solution.
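To make the difference concrete, here is a minimal sketch comparing the two functions on a made-up sentence (the text and variable names are only for illustration, not from the original file):

```matlab
% Illustrative text (made up for this sketch)
txt = 'A band and a stand. And so on.';

% strfind reports every occurrence of the letters, including inside
% other words, and it is case-sensitive
idxSub = strfind(txt, 'and');

% \< and \> anchor the match to word boundaries; regexpi ignores case
idxWord = regexpi(txt, '\<and\>', 'start');

fprintf('strfind: %d hits, regexpi: %d hits\n', numel(idxSub), numel(idxWord));
```

Here strfind picks up the "and" inside "band" and "stand" but skips the capitalized "And", while the word-boundary pattern finds only the two standalone occurrences.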
Your code should look a bit like the following:
textContent = fileread( textFile ) ;
countWord1 = length( regexpi( textContent, ... )) ;
countWord2 = length( regexpi( textContent, ... )) ;
counts = [countWord1, countWord2] ;
where ... are appropriate arguments (at least the pattern). Even better:
wordsToFind = {'and', 'here', 'not'} ;
textFile = 'MyFile.txt' ;
counts = zeros( size( wordsToFind )) ;
textContent = fileread( textFile ) ;
for wordId = 1 : numel( wordsToFind )
    % Build the whole-word pattern, e.g. \<and\>
    pattern = sprintf( '\\<%s\\>', wordsToFind{wordId} ) ;
    % REGEXPI returns the start index of each match; count them
    counts(wordId) = length( regexpi( textContent, pattern )) ;
end
where we loop over a series of words defined in a cell array, and we build the pattern proposed by Per dynamically.
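If the words to find might contain characters that have a special meaning in regular expressions (a period, for instance), one possible variation is to escape each word with regexptranslate before building the pattern. The sketch below assumes a small made-up text in place of the file content so that it runs on its own:

```matlab
% Sample text standing in for fileread( textFile ) (assumption for this sketch)
textContent = sprintf('And here we are.\nNot there, and not in a band.');
wordsToFind = {'and', 'here', 'not'};

counts = zeros(size(wordsToFind));
for wordId = 1:numel(wordsToFind)
    % Escape any regular-expression metacharacters in the word
    safeWord = regexptranslate('escape', wordsToFind{wordId});
    % Surround the escaped word with word-boundary anchors
    pattern = ['\<', safeWord, '\>'];
    counts(wordId) = numel(regexpi(textContent, pattern));
end

disp(counts)
```

For plain alphabetic words the escaping changes nothing, so the counts are the same as with the sprintf version.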


Answers (1)

per isakson 2017-10-13
Edited: per isakson 2017-10-13
Try
>> regexpi( 'And, and other words and_ 2and and', '(^|\W)and(\W|$)', 'start' )
ans =
1 5 31
This pattern includes the character before the word "and" in the match, so the returned start index will often point at a space rather than at the word itself.
Better
>> regexpi( 'And, and other words and_ 2and and', '\<and\>', 'start' )
ans =
1 6 32
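There is a second difference between the two patterns that the example string does not show. Because the first pattern consumes the delimiter after each match, it can skip a word that follows directly after another match. A small sketch with a made-up string:

```matlab
txt = 'and and and';

% Delimiter-consuming pattern: each match swallows the following space,
% so the next "and" has no \W (and no start of text) left in front of it
startsDelim = regexpi(txt, '(^|\W)and(\W|$)', 'start');

% Word-boundary anchors are zero-width, so nothing is swallowed
startsWord = regexpi(txt, '\<and\>', 'start');

fprintf('delimiter pattern: %d matches, word anchors: %d matches\n', ...
    numel(startsDelim), numel(startsWord));
```

The delimiter pattern misses the middle word, which is one more reason to prefer the \< and \> anchors when counting.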
Why read line by line rather than the entire text in one go?
str = fileread( filespec );
pos = regexpi( str, '\<and\>', 'start' );
Doc says:
  • \W Any character that is not alphabetic, numeric, or underscore. For English character sets, \W is equivalent to [^a-zA-Z_0-9]
  • \<expr Beginning of a word.
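Since a newline is not a word character, the word anchors also cover the case from the question where the word is first or last on a line. A minimal sketch with made-up multi-line text, compared against the space-padded search:

```matlab
% Three lines; "and" is first on line 1 and last on line 2 (made-up sample)
txt = sprintf('and so it begins\nover and over and\nthe band played');

% Newlines are non-word characters, so \< and \> also match at line edges
nWhole = numel(regexpi(txt, '\<and\>'));

% The space-padded search from the question misses the line-edge cases
nPadded = numel(strfind(txt, ' and '));

fprintf('word anchors: %d, space-padded strfind: %d\n', nWhole, nPadded);
```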
