Read csv strings, keep or create surrounding whitespace
I have a list of stop words that currently exists as a comma-separated list in a .txt file. The goal is to use that list to remove those words from some target text, but only when a given word (e.g. "and") appears by itself: remove "and", but don't turn "sand" into "s". To that end, I tried manually putting spaces around all the words in the list, so "a,able,about" became " a , able , about ". However, the textscan function stripped the spaces out. Is there a way to prevent it from doing that? Alternatively, if I use the original form of the list, can I tell textscan to surround each string with spaces?
Accepted Answer
Cedric
2014-6-20
Edited: Cedric
2014-6-20
Here is an example that I can refine if you provide more information. It rewrites a few keywords in upper case.
key = {'lobster', 'and'} ;
str = 'Lobster anatomy includes the cephalothorax which fuses the head and the thorax, both of which are covered by a chitinous carapace, and the abdomen. The lobster''s head bears antennae, antennules, mandibles, the first and second maxillae, and the first, second, and third maxillipeds. Because lobsters live in a murky environment at the bottom of the ocean, they mostly use their antennae as sensors.' ;
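for kId = 1 : length( key )
   % (?<!\w): no word character just before the keyword (also true at the start of the string).
   % (?=(s |\W)): the keyword is followed by "s " or by a non-word character.
   pat = sprintf( '(?<!\\w)%s(?=(s |\\W))', key{kId} ) ;
   str = regexprep( str, pat, upper( key{kId} ), 'ignorecase' ) ;
end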
Running this, you get
>> str
str =
LOBSTER anatomy includes the cephalothorax which fuses the head AND the thorax, both of which are covered by a chitinous carapace, AND the abdomen. The LOBSTER's head bears antennae, antennules, mandibles, the first AND second maxillae, AND the first, second, AND third maxillipeds. Because LOBSTERs live in a murky environment at the bottom of the ocean, they mostly use their antennae as sensors.
The REGEXP-based approach makes it possible to encode conditions such as:
- match only if the keyword is framed by non-alphanumeric characters (e.g. a comma),
- unless the following character is an 's',
- unless the keyword is at the beginning of the string.
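To connect this back to the original question, here is a minimal sketch of the same idea applied to stop-word removal rather than upper-casing. It is not the code from the answer above: the sample list, the sample text, and the file name mentioned in the comment are assumptions made for illustration, and the pattern uses plain negative lookarounds instead of the 's' rule. The point is that the list can stay in its original "a,able,about" form, because the pattern itself, not padding spaces, decides whether a word stands alone.

```matlab
% Hypothetical stop-word list, as it might be read from the .txt file,
% e.g. with listStr = fileread('stopwords.txt') (file name is an assumption).
listStr   = 'a,able,about,and' ;
stopWords = strtrim( strsplit( listStr, ',' ) ) ;

% Hypothetical target text.
txt = 'A sand castle and a table about to fall' ;

for k = 1 : numel( stopWords )
   % (?<!\w) and (?!\w): no word character directly before or after the word,
   % so "and" is removed but "sand" and "table" are left alone.
   pat = ['(?<!\w)', regexptranslate( 'escape', stopWords{k} ), '(?!\w)'] ;
   txt = regexprep( txt, pat, '', 'ignorecase' ) ;
end

% Collapse the runs of spaces left behind by the removals.
txt = strtrim( regexprep( txt, '\s+', ' ' ) ) ;
```

With these sample inputs, txt ends up as 'sand castle table to fall'.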
21 Comments
Cedric
2014-6-23
Edited: Cedric
2014-6-23
You're welcome! Note that it could do the job in a single pass if you passed a pattern that combines all keywords with an OR operation. However, applying a simple pattern several times is often more efficient than applying one extra-long, complex pattern once. You should profile both for your specific case if you want to optimize.
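A rough sketch of the single-pass variant mentioned here, assuming the keywords are joined with '|' into one alternation. The short sample string is made up for illustration, and the replacement uses a regexprep dynamic expression so that whichever keyword matched is the one upper-cased.

```matlab
key = {'lobster', 'and'} ;
str = 'The lobster and the sand.' ;   % hypothetical sample text

% Escape each keyword, then join them into one alternation: lobster|and
esc = cellfun( @(s) regexptranslate( 'escape', s ), key, 'UniformOutput', false ) ;
alt = strjoin( esc, '|' ) ;

% Same framing conditions as a per-keyword pattern, applied once to all keywords.
pat = ['(?<!\w)(', alt, ')(?=(s |\W))'] ;

% ${upper($1)} is a dynamic expression: it upper-cases the captured keyword.
str = regexprep( str, pat, '${upper($1)}', 'ignorecase' ) ;
```

With this input, str becomes 'The LOBSTER AND the sand.'. Whether this is faster than the loop depends on the number of keywords and the text size, so timing both (e.g. with tic/toc or the profiler) is the only reliable way to tell.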
