Searching a string for a word
I have a text file and a set of words, and I want to count how often each of those words appears in the file. I have used strfind, but the problem is that if one of the words I am searching for is short, say "and", it is also found inside other words such as "band", and I only want to count it when it stands alone. I tried searching for the word with a space before and after it (so when it stands alone), but this misses the word when it is first or last on a line in the text file. My code is attached.
A = fileread(txt)
fh = fopen(txt,'r')
B = strfind(A, firstword);
line = fgetl(fh)
C = strfind(A,secondword);
vec = [length(B),length(C)];
1 Comment
Cedric
2017-10-13
Edited: Cedric
2017-10-13
Part of the code is useless. The following
A = fileread(txt)
already opens the file, reads it as text, and closes it. After it is executed, A contains the full content of the file. So then there is no need to open the file again and read one line (and forget to close it).
As explained by Per below, STRFIND matches only exact occurrences of the text that you are looking for, which makes it difficult to use for matching patterns (situations a little more flexible than the literal occurrence of letters). Looking for white space before and after the word was a good first attempt, but there are cases where it fails, and there is also the upper/lower case issue.
All these considerations signal that you need a somewhat more elaborate approach based on pattern matching with regular expressions. This is what Per develops. Note that he uses REGEXPI rather than REGEXP to make the solution case-insensitive.
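To make these failure cases concrete, here is a small sketch on a made-up two-line string. The string and the variable names are assumptions for illustration, not taken from the original file.

```matlab
% Hypothetical sample text: two lines, standalone "and" appears twice
% (first word of line 1, last word of line 2).
txt = sprintf('and the band\nsand and');

% Plain search: also hits the "and" inside "band" and "sand".
idxPlain = strfind(txt, 'and');      % 4 hits instead of 2

% Space-padded search: misses a word at the start or end of a line.
idxPadded = strfind(txt, ' and ');   % no hits at all here

% strfind is case-sensitive, so a capitalised word is not found.
idxCase = strfind('And', 'and');     % empty
```

On this sample the plain search overcounts and the padded search undercounts, which are the two problems described in the question.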
Your code should look a bit like the following:
textContent = fileread( textFile ) ;
countWord1 = length( regexpi( textContent, ... )) ;
countWord2 = length( regexpi( textContent, ... )) ;
counts = [countWord1, countWord2] ;
where ... are appropriate arguments (at least the pattern). Even better:
wordsToFind = {'and', 'here', 'not'} ;
textFile = 'MyFile.txt' ;
counts = zeros( size( wordsToFind )) ;
textContent = fileread( textFile ) ;
for wordId = 1 : numel( wordsToFind )
    pattern = sprintf( '\\<%s\\>', wordsToFind{wordId} ) ;
    counts(wordId) = length( regexpi( textContent, pattern )) ;
end
where we loop over a series of words defined in a cell array, and we build the pattern proposed by Per dynamically.
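One possible caveat when building the pattern dynamically, sketched here under my own assumptions: if a search word contains a character that is special in regular expressions (a dot, for instance), it is interpreted as an operator rather than as a literal. regexptranslate with the 'escape' option can neutralise such characters. The sample text and the word 'U.S' below are hypothetical.

```matlab
% Hypothetical sample text and search word (not from the original post).
textContent = 'The U.S team met the UxS team';
word        = 'U.S';

% Unescaped: the dot matches any character, so "UxS" is counted too.
rawPattern  = ['\<', word, '\>'];
countRaw    = numel(regexpi(textContent, rawPattern));

% Escaped: regexptranslate turns the dot into a literal "\.".
safePattern = ['\<', regexptranslate('escape', word), '\>'];
countSafe   = numel(regexpi(textContent, safePattern));
```

For plain alphabetic words such as 'and', 'here' and 'not', escaping changes nothing, so it may only be worth adding if the word list can contain punctuation.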
Answers (1)
per isakson
2017-10-13
Edited: per isakson
2017-10-13
Try
>> regexpi( 'And, and other words and_ 2and and', '(^|\W)and(\W|$)', 'start' )
ans =
1 5 31
The match includes the character before the word "and" (and the one after it). Thus the index returned will often point at a space rather than at the word itself.
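A further side effect, shown as a sketch on a made-up string: because this pattern consumes the delimiter characters, two occurrences separated by a single space cannot both match. The zero-width word anchors \< and \> do not consume anything, so they count both. The test string is an assumption for illustration.

```matlab
% Hypothetical test string: the same word twice, one space apart.
str = 'and and';

% The delimiter-based pattern swallows the space after the first word,
% so nothing is left to serve as the delimiter before the second one.
consumed  = regexpi(str, '(^|\W)and(\W|$)', 'match');
nDelims   = numel(regexpi(str, '(^|\W)and(\W|$)', 'start'));

% Word anchors are zero-width, so both occurrences are found.
nBoundary = numel(regexpi(str, '\<and\>', 'start'));
```

Here the first pattern reports one occurrence and the anchored pattern reports two.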
Better
>> regexpi( 'And, and other words and_ 2and and', '\<and\>', 'start' )
ans =
1 6 32
Why read line by line rather than the entire text in one go?
str = fileread( filespec );
pos = regexpi( str, '\<and\>', 'start' );
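As a minimal, self-contained sketch of the same idea, the snippet below writes a small temporary file first so that it can be run as is. The file content and the names filespec, fid and nAnd are assumptions for illustration.

```matlab
% Hypothetical sample file, created only so the sketch is self-contained.
filespec = [tempname, '.txt'];
fid = fopen(filespec, 'w');
fprintf(fid, 'And the band played on\nsand and sea\nthis and');
fclose(fid);

% Read the whole file in one go and locate the standalone word.
str  = fileread(filespec);
pos  = regexpi(str, '\<and\>', 'start');
nAnd = numel(pos);   % number of standalone occurrences

delete(filespec);    % clean up the temporary file
```

In this sample the word is counted when it is first on a line, in the middle of a line, and last in the file, while "band" and "sand" are skipped.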
Doc says:
- \W Any character that is not alphabetic, numeric, or underscore. For English character sets, \W is equivalent to [^a-zA-Z_0-9]
- \<expr Beginning of a word.