How can I speed up my script? I'm using loops and the contains function, and I have to screen 14K x 167K variables

Hello, I'm doing text mining in an attempt to organize my data. I have a table with more than 200K rows and 12 columns, and I want to extract some information from one of the columns. Specifically, I'm looking for names that match my reference table (approx. 14K names), and I'm using the contains function for that. The search uses two loops: the first selects one of the 14K names, and the second looks for that name in the 200K rows. This takes a very long time. Could you help me speed up my script? Thanks
Here is the code:
% 1st loop (reference name table)
for k = 2:14045
    clear test
    clear Genr
    test = DNP(k,10);
    virgula = ',';
    Space = ' ';
    Genr = Space + test + Space;
    % Second loop (my raw table with more than 200K rows)
    for i = 15001:16000
        clear Presence
        clear A
        clear B
        clear C
        BiolSource = DNP(i,3);
        Presence = contains(BiolSource, Genr, 'IgnoreCase', true);
        if Presence == 1
            A = DNP(i,13);
            B = DNP(k,11);
            DNP(i,13) = A + virgula + Space + test + Space + B;
            C = DNP(i,13);
            DNP(i,13) = erase(C, "0, ");
        end
    end
end
  5 Comments
Guillaume 2018-4-24
Edited: Guillaume 2018-4-24
So, to be clear, you want to identify which term in column C of Ask3 is the genus? And all possible genera are stored in column A of Ask2?
Assumption: there is always one and only one genus in column C.


Accepted Answer

Guillaume 2018-4-24
reference = readtable('MATLAB ASk2.xls', 'ReadVariableNames', false, 'Range', 'A:B');
raw = readtable('MATLAB ASk3.xls');
genera = lower(reference.Var1); %convert everything to lower case for easier comparison
matched_genera = rowfun(@(t) {genera(ismember(genera, strsplit(lower(t))))}, raw, ...
    'InputVariables', 'Text', 'ExtractCellContents', true, ...
    'OutputVariableNames', 'matched');
Each row of matched_genera is a cell array containing 0, 1 or more of the genera found in 'Text' (case insensitive). You can concatenate that with the original table if you wish:
newraw = [raw, matched_genera]
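
To make the shape of the result concrete, here is a minimal sketch of the same rowfun call on two made-up rows instead of the spreadsheets. The genus names and text values are hypothetical stand-ins, and the 'Text' variable name is assumed to match the real file. Note that strsplit splits on whitespace only, so a genus followed directly by punctuation (e.g. "penicillium,") would not be matched by this approach.

```matlab
% Hypothetical stand-ins for the reference list and the raw table.
genera = lower({'Aspergillus'; 'Penicillium'; 'Streptomyces'});
raw = table({'Isolated from Penicillium sp. and aspergillus niger'; ...
             'Marine sponge extract'}, 'VariableNames', {'Text'});

% For each row: lower-case the text, split it on whitespace, and keep
% the genera that appear as whole words.
matched_genera = rowfun(@(t) {genera(ismember(genera, strsplit(lower(t))))}, raw, ...
    'InputVariables', 'Text', 'ExtractCellContents', true, ...
    'OutputVariableNames', 'matched');

% Row 1 holds the two genera found, in the order of the reference list;
% row 2 holds an empty cell array because nothing matched.
disp(matched_genera.matched{1});
disp(isempty(matched_genera.matched{2}));
```

Each row is handled with one ismember call against the whole reference list, which is what replaces the inner loop over the 14K names.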
  2 Comments
Guillaume 2018-4-25

The easiest way to do that is to create a separate m file for the rowfun function:

In its own match_raw.m file:

function [matched_genera, family] = match_raw(raw_row, genera_lower, family)
     ismatch = ismember(genera_lower, strsplit(lower(raw_row)));
     matched_genera = {genera_lower(ismatch)};
     family = {family(ismatch)};
end

The rowfun call then becomes:

matches = rowfun(@(t) match_raw(t, genera, reference.Var2), raw, ...
    'InputVariables', 'Text', 'ExtractCellContents', true, ...
    'OutputVariableNames', {'matched_genera', 'family'})
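
As a rough end-to-end check, the two-output version can be tried on toy data. This is only a sketch: the table contents and the family labels ('FamilyA', 'FamilyB') are hypothetical, match_raw is placed as a local function at the end of a script (supported since R2016b) rather than in its own file, and 'NumOutputs' is stated explicitly for clarity.

```matlab
% Hypothetical stand-ins for the reference and raw tables.
reference = table({'Aspergillus'; 'Penicillium'}, {'FamilyA'; 'FamilyB'}, ...
    'VariableNames', {'Var1', 'Var2'});
raw = table({'Metabolite from penicillium chrysogenum'}, ...
    'VariableNames', {'Text'});
genera = lower(reference.Var1);

% One call per row; each output becomes its own table variable.
matches = rowfun(@(t) match_raw(t, genera, reference.Var2), raw, ...
    'InputVariables', 'Text', 'ExtractCellContents', true, ...
    'NumOutputs', 2, ...
    'OutputVariableNames', {'matched_genera', 'family'});

disp(matches.matched_genera{1});  % genera found in row 1
disp(matches.family{1});          % families at the same positions

% Same logic as match_raw.m, kept here so the script runs on its own.
function [matched_genera, family] = match_raw(raw_row, genera_lower, family)
    ismatch = ismember(genera_lower, strsplit(lower(raw_row)));
    matched_genera = {genera_lower(ismatch)};
    family = {family(ismatch)};
end
```

Because the same logical index ismatch selects from both lists, the genera and families stay aligned position by position.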
