Fastest way to replace multiple substrings with a single new string?
Hello Everyone,
I'm trying to replace 7k different substrings with the same tag in a 50-million-word dataset (a cell array of 1 million strings, averaging 50 words each). Using replace or regexprep takes a long time. I tried using strrep the same way as replace, but it gives me this error:
Error using strrep
All nonscalar inputs must be the same size.
What is the fastest and least memory-consuming way to do this?
Here is the code:
%using replace
Tag = 'IMPORTANT';
substr = {'very','much'}; % a cell array of 7k+ words
tagcell = repmat({Tag}, size(substr)); % one copy of Tag per substring
maintext = replace(maintext, substr, tagcell);
% using regexprep 
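ev = '(';
for evi = 1:numel(substr)
    ev = [ev substr{evi} '|']; % append one word at a time, not the whole cell array
end
ev = [ev(1:end-1) ')']; % drop the trailing '|' and close the group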
maintext=regexprep(maintext,ev,Tag);
Answers (1)
  Mohammad Sami
      
 2020-6-11
After some experimentation, I think that if you tokenize your sentences, you can use a hash map to look up the words to replace.
Example code follows. For case-insensitive matching, apply the function lower to both the words and the sentences.
substr = cellstr(substr);
w = containers.Map(substr,substr); % create a hash map of the substrings you want to replace
m2 = cellstr(sentences);
m5 = cell(length(m2),1);
for i = 1:length(m2)
    m3 = split(m2{i},' ');  % tokenize the sentence
    m4 = w.isKey(m3);       % look up which words to replace
    m3(m4) = {'IMPORTANT'}; % replace the words
    m5(i) = join(m3,' ');   % store the updated sentence
end
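The case-insensitive variant mentioned above could look roughly like the sketch below. The sample data and the names tag, keySet, tokens, hit, and out are made up for illustration and are not from the original posts. Only lower-cased copies of the tokens are compared, so words that are not replaced keep their original capitalization.

```matlab
% Hypothetical sample data, for illustration only
substr    = {'very','much'};
sentences = {'Thank you Very much'; 'not so much today'};
tag       = 'IMPORTANT';

% Lower-case the lookup words once; unique() avoids duplicate keys
keySet = unique(lower(substr));
w = containers.Map(keySet, keySet);

out = cell(numel(sentences), 1);
for i = 1:numel(sentences)
    tokens = strsplit(sentences{i}, ' ');  % tokenize on spaces
    hit    = isKey(w, lower(tokens));      % compare lower-cased copies only
    tokens(hit) = {tag};                   % replace the matching words
    out{i} = strjoin(tokens, ' ');         % rebuild the sentence
end
```

Note that this matches whole space-delimited tokens only. Unlike replace or regexprep with a plain alternation, it will not touch 'very' inside 'every', but it will also miss a word with punctuation attached, such as 'much.'.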