Streamline for loops and indexing of large files

Question

Louise Wilson 2020-10-11

0
链接

此问题的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/610861-streamline-for-loops-and-indexing-of-large-files

回答： Seth Furman 2020-10-15

I am wondering if anyone has any advice on how I could streamline a script I am working on.

The purpose of a script is to subset a large file (e.g. 7000*72000) according to time values which occur in the first column of the large file and the first column of a second file of interest. So, I have two datasheets where some of the rows match in time, and I want to pull out the rows which match. I have 10 of these very large files and only one or two of them will contain times that match, so the first task is to find out which of the very large files I have, corresponds to the file of interest, before loading it in.

folder=('Y:\AIS data\CPA_FILT\CPA_ACOU\'); %folder containing files of interest
CPA_ACOU_files=dir(fullfile(folder,'*.csv')); %list each file in folder
count=1; %index for output
for j=1:length(CPA_ACOU_files) %for each file
    %read in 'interesting' files
    CPA=readtable(fullfile(folder,CPA_ACOU_files(j).name));
    CPA_t=datetime(CPA.Var1,'ConvertFrom','datenum'); %convert time to datetime
    CPA_times=timeofday(CPA_t(:,1)); %remove times
    CPA_datesonly=CPA_t(:,1)-CPA_times; %get dates of time vector
    %read in times of large datasheets (PSD) and find any that match interesting files
    tvec_folder=('Y:\AIS data\CPA_FILT\TVEC\'); 
    tvec_files=dir(fullfile(tvec_folder,'*.csv'));
    
    %for each tvec, do ismember, and if any times match interesting times, store name of tvec file
    %(shortcut to load PSD so we don't have to do it manually)
    for a=1:length(tvec_files) 
        tvec{a}=readtable(fullfile(tvec_folder,tvec_files(a).name)); %read in tvec
        tvec{a}=datetime(tvec{a}.Var1,'ConvertFrom','datenum'); %covnert times to datetime
        tvec_times=timeofday(tvec{a});
        tvec_datesonly=tvec{a}-tvec_times;
        [matches]=ismember(CPA_datesonly,tvec_datesonly);
        B=any(matches);
        
        if (B==1)
            
           matchingfiles(a,:)=string(tvec_files(a).name);
            
        else
            %do nothing
        end
        
    end
    matchingfiles=rmmissing(matchingfiles); 
    %tvec files with times which match interesting file times
        
    for c=1:length(matchingfiles) %read in corresponding PSD file
    
        %remove TVEC from start of filename to get filename which matches PSD file
        filename_parts=strsplit(matchingfiles(c),'_'); 
        filename_parts(:,1)=[];
        filename=strjoin(filename_parts,'_');
        %read in PSD file
        PSDfolder=('Y:\SoundTrap\Boats\Hydrophone\PSD Output\Duty cycle data\');
        PSDfiles{c}=readtable(fullfile(PSDfolder,filename));
    end
    %for each PSD, subset by rows which match interesting files
    %(if they occur over multiple PSD files, they are concatenated into one output file)
    
    for i=1:length(PSDfiles)
        PSD_t=datetime(PSDfiles{1,i}.Var1,'ConvertFrom','datenum');
        PSD_t(:,2)=PSD_t+minutes(2); %time window for filtering PSD files 
        [rows,cols]=size(CPA);
        for k=1:rows %for each hydrophone recording with CPA
            acou_CPA_file=CPA.Var1(k) %get time of file
            acou_CPA_file=datetime(acou_CPA_file,'ConvertFrom','datenum')
            for m=1:length(PSD_t) %check during which PSD file it occurs
                starttime=PSD_t(m,1); %two minute period to check between
                endtime=PSD_t(m,2);
                if (acou_CPA_file>=starttime) && (acou_CPA_file<=endtime)
                    CPA_PSD(count,:)=PSDfiles{1,i}{m,:}; %store PSD data in output
                    count=count+1;
                else
                    %do nothing
                end
            end
        end
    CPA_PSD_t=datetime(CPA_PSD(:,1),'ConvertFrom','datenum'); %check times
    %outfolder=('Y:\AIS data\CPA_FILT\CPA_PSD\');
    %writetable(CPA_PSD,strcat(outfolder,'acouCPA_',PSDfile(i).name));
    end
end

2 个评论
显示无隐藏无

Mathieu NOE 2020-10-12

hello

this is my 2 cents suggestion ; instead of having to load very large files into matlab, I'd rather try to find a pre processing method to copy the time vector only from each data file into a new data_time file , and run the matlab code on those much smaller files. Once you know the files and time index of interest, you can do the post processing of the data in matlab.

if you have to do that not very often, you can try a powerfull text editor like Textpad (https://www.textpad.com/home)

otherwise maybe a batch exe file could do the trick

Louise Wilson 2020-10-12

Hi Mathieu, thank you! I have actually already saved and stored the time vector from each .csv into a separate data_time file for each of the large files. But, how would I then access each row of interest without having to load in the entire file?

请先登录，再进行评论。

请先登录，再回答此问题。

Answer 1

Seth Furman 2020-10-15

0
链接

此回答的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/610861-streamline-for-loops-and-indexing-of-large-files#answer_515213

在 MATLAB Online 中打开

1) Consider using innerjoin or outerjoin to find the rows with matching row times between two tables.

https://www.mathworks.com/help/matlab/ref/innerjoin.html

https://www.mathworks.com/help/matlab/ref/outerjoin.html

2) Consider using tall tables for data that are too big to fit into memory.

https://www.mathworks.com/help/matlab/import_export/tall-arrays.html#bvayjn0

For example,

>> tt1 = table2timetable(tall(tabularTextDatastore('tt1.csv')))
tt1 =
  M×1 tall timetable
       Time        Data
    ___________    ____
    15-Oct-2020     1  
    16-Oct-2020     2  
    17-Oct-2020     3  
    18-Oct-2020     4  
    19-Oct-2020     5  
    20-Oct-2020     6  
    21-Oct-2020     7  
    22-Oct-2020     8  
         :          :
         :          :
>> tt2 = table2timetable(tall(tabularTextDatastore('tt2.csv')))
tt2 =
  4×1 tall timetable
       Time        Data
    ___________    ____
    16-Oct-2020     11 
    20-Oct-2020     12 
    22-Oct-2020     13 
    25-Oct-2020     14 
>> tJoin = innerjoin(tt1,tt2,'Keys','Time')
tJoin =
  M×2 tall timetable
    Time    Data_tt1    Data_tt2
    ____    ________    ________
     ?         ?           ?    
     ?         ?           ?    
     ?         ?           ?    
     :         :           :
     :         :           :
Preview deferred. Learn more.
>> tJoin = gather(tJoin)
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 7: Completed in 0.31 sec
- Pass 2 of 7: Completed in 0.17 sec
- Pass 3 of 7: Completed in 0.39 sec
- Pass 4 of 7: Completed in 0.47 sec
- Pass 5 of 7: Completed in 0.32 sec
- Pass 6 of 7: Completed in 0.49 sec
- Pass 7 of 7: Completed in 0.71 sec
Evaluation completed in 3.4 sec
tJoin =
  3×2 timetable
       Time        Data_tt1    Data_tt2
    ___________    ________    ________
    16-Oct-2020       2           11   
    20-Oct-2020       6           12   
    22-Oct-2020       8           13

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

请先登录，再进行评论。

Streamline for loops and indexing of large files

2 个评论
显示无隐藏无

采纳的回答

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

更多回答（0 个）

另请参阅

类别

标签

产品

版本

Community Treasure Hunt

Streamline for loops and indexing of large files

2 个评论 显示 无隐藏 无

采纳的回答

0 个评论 显示 -2更早的评论隐藏 -2更早的评论

更多回答（0 个）

另请参阅

类别

标签

产品

版本

Community Treasure Hunt

2 个评论
显示无隐藏无

0 个评论
显示 -2更早的评论隐藏 -2更早的评论