How do I split a single-column .txt file by line?
17 views (last 30 days)
Hey Guys,
How would I split a .txt file into smaller files by number of lines? This was simple to do in Linux, but I can't seem to do it here.
An example of a file is attached (testv2.txt)
EDIT: The .txt files I'm working with are very large, and I need to split them into files of 72,000,000 lines each. I can't split the files by size, because for some reason some files are different sizes, and the script I'm using keeps time by counting lines.
Thanks for the help guys!
4 Comments
Adam Danz
2019-8-28
After answering your question I realized that my answer only addresses half of your question. How would you like to split the data into subfiles? Are you sure you want to split up the data rather than keeping it all together?
EL
2019-8-28
Absolutely. I have 1.4 billion lines of data, and I need to split them into manageable sizes with a precise number of lines so I can perform good statistics. Ideally, I'd like it to split the .txt into new .txt files, so I'd have the original, unadulterated file (backup data) and new .txt files that are 72,000,000-line sections of the original data. I'm not too worried about the empty first column.
Accepted Answer
dpb
2019-8-28
Again, I'd suggest there's no need to actually create multiple text files to do this...several options exist in MATLAB; the simplest is probably to just process the file in chunks of whatever size you wish and calculate statistics or do whatever on each section...something like
fid=fopen('yourfile.txt','r');
NperSet=72E6;   % set number of elements to read per section
ix=0;           % initialize group index counter
while ~feof(fid)                                  % go thru the file until run out of data
  ix=ix+1;                                        % increment counter
  data=cell2mat(textscan(fid,'\t%f',NperSet));    % read the data chunk of set size, skip \t
  stats(ix,:)=[mean(data) std(data) ...];         % compute, save the stats of interest
  ...                                             % do whatever else needed w/ this dataset here
end
You'll want to preallocate the stats array to some reasonable approximation of the size expected and check for overflow, but that's the basic idea...simpler than creating and having to traverse thru a bunch of files to simply process in sequence.
The alternative is to use tall arrays, memmapfile, or other features TMW has provided for large datasets. See <Large-files-and-big-data link>
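For the datastore/tall-array route, here's a minimal sketch, assuming the single numeric column with a leading tab seen in the attached testv2.txt (the file name is a placeholder):
% Sketch only -- chunked reading via a datastore instead of manual fopen/textscan.
ds = tabularTextDatastore('yourfile.txt', 'Delimiter','\t', 'ReadVariableNames',false);
ds.ReadSize = 72e6;                          % rows returned per read() call
ix = 0;
while hasdata(ds)
    ix = ix + 1;
    T = read(ds);                            % next 72e6-row chunk as a table
    data = T{:,end};                         % the numeric column (first column may be empty)
    stats(ix,:) = [mean(data) std(data)];    % per-chunk statistics, as in the loop above
end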
29 Comments
Adam Danz
2019-8-28
Just for the record, I agree with dpb on this approach. The i loop in my answer produces the tempVec vector, which you could perform statistics on iteratively.
EL
2019-8-28
I can't do it this way. There is a script that already exists that does some crazy calculations involving fast Fourier transforms and wave flattening that I'm not going to touch. I've already looked at it for a month and I've concluded I'm too new/dumb at MATLAB to even touch it or understand what's going on. Perhaps in the future, but my job isn't to make the script, it's to get data. So I have to muscle this through.
I'm trying to write a script that incorporates that script. I need it to chop a large data file into smaller chunks, then to have that script run each data chunk in order. Then I will recombine all that data into a super graph so I can have data I can edit together.
I mean, this is the correct answer for most people, so I'll mark it as such, but not for me currently. Perhaps when I become great at MATLAB I'll venture into figuring out how the script that was given to me works, but for now I'm too stupid. I just need something that works until I can optimize it. I'm definitely writing this into my notebook.
Thank you to both of you
dpb
2019-8-28
Attach this magic script...if nothing else, just wrap it into a function and pass the same data that would otherwise be read from the file.
Or, of course, you could use the above approach and write a temporary file after each iteration if you were truly that desperate.
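For completeness, a hedged sketch of that temporary-file variant (the chunk naming and the call-out to the inherited script are placeholders, not anything from this thread):
% Sketch: same chunked read as in the answer above, but each chunk is written
% to its own temporary .txt file so an existing file-based script can consume it.
fid = fopen('yourfile.txt','r');
NperSet = 72e6;
ix = 0;
while ~feof(fid)
    ix = ix + 1;
    data = cell2mat(textscan(fid,'\t%f',NperSet));   % read one 72e6-line chunk
    if isempty(data), break, end
    tmpName = sprintf('chunk_%03d.txt',ix);          % hypothetical naming scheme
    fidOut = fopen(tmpName,'wt');
    fprintf(fidOut,'\t%.6f\n',data);                 % keep the leading-tab layout
    fclose(fidOut);
    % ...call the inherited analysis on tmpName here, then delete(tmpName)...
end
fclose(fid);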
EL
2019-8-29
Edited: dpb
2019-8-29
%% Populate filenames for LINUX command line operation
clear
close all
clc
[FileNames PathNames]=uigetfile('Y:\........\*.txt', 'Choose files to load:','MultiSelect','on'); %It opens the window for file selection
prompt = 'Enter save-name according to: file_mmddyyyy_signal ';
Filenamesave = input(prompt,'s');
Filenamesave = strcat(PathNames,Filenamesave,'.mat');
PathNames=strrep(PathNames,'L:','Data');
PathNames=strrep(PathNames,'\','/');
PathNamesSave=strcat('/',PathNames);
save(Filenamesave,'FileNames','PathNames','PathNamesSave');
clear;
clc;
close all;
%% Retrieve files
tstart=tic
prompt = 'Enter the name of a .mat file to run (e.g. file_mmddyyyy_signal.mat) ';
files = input(prompt,'s');
load(files);
%% Terminate or proceed
if isequal(FileNames{1},0)
disp('User selected Cancel') %display value of variable
else
for i=1:numel(FileNames)
clear ii
c=class(FileNames{i});
if c=='char'
FileNames{i}=cellstr(FileNames{i});
end
for ii=1:numel(FileNames{i})
disp(['User selected ', fullfile(PathNames{i}, FileNames{i}{ii})])
end
end
%% User query to run/save Figure output
%Code for commandline inputs
prompt = 'Enter file name for figures according to: mmddyyyy_bug_media_oC ';
Filenamesave = cellstr(input(prompt,'s'));
prompt = 'Enter signal type for files';
signal = input(prompt,'s');
if isempty(signal)
signal = 'DEF';
end
prompt = 'Make EPFL plots? Y/N [Y]: ';
epflplot = input(prompt,'s');
if isempty(epflplot)
epflplot = 'Y';
end
prompt = 'Make raw signal and variance plots? Y/N [Y]: ';
rawplot = input(prompt,'s');
if isempty(rawplot)
rawplot = 'Y';
end
prompt = 'Make FFT plots? Y/N [Y]: ';
fftplot = input(prompt,'s');
if isempty(fftplot)
fftplot = 'Y';
end
prompt = 'Enter the data collection rate (Hz)[20000]: ';
ft = input(prompt);
if isempty(ft)
ft = 20000;
end
% Binning inputs
prompt = 'Would you like to bin the data? [Y]/N ';
bo = input(prompt);
if isempty(bo)
bo = 'Y';
end
if bo=='Y'
% user inputs the bin time, if no input, script continues
prompt = 'Enter the bin time in minutes ';
bt = input(prompt);
end
%% Import and concatenate data segments
for i=1:length(FileNames)
y{i}=[];
for ii=1:length(FileNames{i})
Path{i}{ii}=fullfile(PathNames{i}, FileNames{i}{ii});
fileID{ii} = fopen(Path{i}{ii});
c=[];
while ~feof(fileID{ii})
C = textscan(fileID{ii},'\t%f',10000);
c=[c,C'];
end
fclose(fileID{ii});
c = c(~cellfun(@isempty, c));
gg=cell2mat(c');
yp{ii}=gg;
test=regexp(FileNames{i}{1}, 'txt','ONCE');
if isempty(test)
Names{i}=FileNames{i}{1};
else
[~,Names{i}]=fileparts(FileNames{i}{1});
end
%waitbar (i/length(FileNames),h);
end
for ii= 1:length(FileNames{i})
y{i}=[y{i};yp{ii}]; % concatenates all y data for each batch
end
end
Names=cellstr(Names);
Names2=strrep(Names,'_',' ');
%close(h);
disp('Files are loaded successfully!')
clear ans c C fileID gg yp h i ii Path question test prompt
% clear ans c C fileID gg h i Path question test prompt
%% Bin the data
% this reallocates y{i}, if not used, script still runs
if isempty(bt)
return
else
bd=bt*60*ft; % number of data points in the bin
for i=1:length(y)
f(i)=floor(length(y{i})/bd);
% yb{1}=y{1}(1:bd);
% N{1}=Names{i}; %replicate name to each bin
% N2{1}=Names2{i}; %replicate name to each bin
% if f>1 % multiple bins allocated or data cut to single bin size
for ii=1:f(i)
bds=(bd*ii)-(bd-1); % bin data start
bde=bds+(bd-1); % bin data end
yb{(sum(f(1:i-1))+ii)}=y{i}(bds:bde); % bin the data
N{(sum(f(1:i-1))+ii)}=Names{i}; %replicate name to each bin
N2{(sum(f(1:i-1))+ii)}=Names2{i}; %replicate name to each bin
end
% end
end
y=yb; % reallocate to y vector
Names=N; % update to reflect new bins
Names2=N2; % update to reflect new bins
clear yb N N2
end
%% Process the data
for i=1:length(y)
x{i}=(1:length(y{i}))';
end
% Window for flattening
n = 400000;
for ii=1:length(y)
for k=1:length(y{ii})/n
coeffs{ii}(k,:)=polyfit(x{ii}(n*(k-1)+1 : k*n),y{ii}(n*(k-1)+1 : k*n),1);
yfit{ii}(k,:)=coeffs{ii}(k,2)+coeffs{ii}(k,1)*x{ii}(n*(k-1)+1 : k*n);
end
end
for ii=1:length(y)
dd{ii}= subsref(yfit{ii}.', substruct('()', {':'})).';
dd2{ii}=y{ii}(1:length(dd{ii}))-dd{ii}';
end
for i=1:length(dd2)
t{i}=zeros(1,length(dd2{i}));
end
t{1}=1:length(dd2{1});
for i=2:length(dd2)
t{i}= t{i-1}(end)+[1:length(dd2{i})];
end
clear coeffs dd i ii k prompt ws yfit
yRaw=y; xRaw=x;
x=t; y=dd2;
clear dd2 t
% Window for variance
n = 100000;
for i=1:length(y)
for k=1:length(y{i})/n
a(k)=var(y{i}(n*(k-1)+1:k*n));
end
v{i}=a;
a=[];
end
t{1}=1:length(v{1});
for i=2:length(y)
t{i}=t{i-1}(end)+[1:length(v{i})];
end
%%%%%In between here is 500 lines of graph making. I'm omitting this for now. Afterwards, there's...
%% Rename variables
for i=1:length(t)
tm{i} = t{i}.*((n/ft)/60); % time in minutes
end
%Save variable output.
%v = variance
save(strcat(PathNamesSave{1},Filenamesave{1},'_',signal), 't', 'tm', 'v','xRaw', 'x', 'yRaw', 'y','n', 'ft','Filenamesave','FileNames','PathNames','PathNamesSave','Names','Names2','-v7.3');
disp('Done')
So from here, a .mat file is saved, which I can use to make a graph. The .mat file, though, is as large as the original file itself, say 20 GB. I think extracting data from here to build graphs is difficult.
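One hedged option (not from the thread) for that 20 GB .mat file: since it is saved with '-v7.3', matfile() can pull out individual variables for graphing without load()-ing the whole thing. The file name below is just the pattern the prompts would build:
% Sketch: partial loading from a -v7.3 MAT file.
m = matfile('file_mmddyyyy_signal_DEF.mat');   % hypothetical name from the save() above
whos(m)            % list the stored variables without loading them
v  = m.v;          % variance cells are small and load quickly
tm = m.tm;         % binned time in minutes
% yRaw, xRaw, y, x stay on disk until actually referenced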
dpb
2019-8-29
OK, I'll grant you've been handed a mess...I feel for you if you have to deal with this.
Are you actually running this over and over and having to go through all the manual input every time????
Surely there's a way to determine what it is that is needed to be done and just go do it instead?
I can't imagine trying to use something like this more than once in a lifetime.
EL
2019-8-29
The top portion lets you pick the files you want in order and saves that in a .mat file. The second portion loads that file. That's what gets analyzed.
This program was fine analyzing small experiments, say 6 hours or so, but we've moved towards 27-hour experiments, and that data is too large to run, so I pretty much have to sit behind a computer all day and pump out the data. I have an autorun version of this that autopopulates the prompts, however I've only tested it with one dataset. My intention is to make this script automatically chop up 27 hours of data into 1-hour bins, and have those files loaded in order in the same way they're loaded here. This should automate everything and conserve memory so data gets processed quickly. Then I'll have the last part of the script repopulate graphs by extracting datasets. Then hopefully delete those chopped-up data segments to save space.
Stephen23
2019-8-29
"This program was fine analyzing small experiments, say 6 hours or so, but ..."
It is no surprise that you are struggling with this, because fundamentally that code was designed and written in a way that prevents it from easily expanding to larger data, from being automated or included as a part of a larger process, or from being easily repeatable and testable. Basically you have reached the point where you need to change how the code is written. In particular:
- Write function/s instead of one script.
- Provide input parameters in one structure or as input arguments, rather than using (awful untestable inconvenient annoying) input.
Some other code tips:
- Use fullfile instead of concatenating strings to build paths.
- load into an output variable, rather than directly into the workspace.
- Align the code consistently.
- It is very unlikely that you need to call clear anywhere.
dpb
2019-8-29
"The top portion lets you pick the files you want in order and saves that in a .mat file. The second portion loads that file. That's what gets analyzed."
But, then it just goes and reassembles the files you just (apparently) split up by concatenating chunks of 10,000 records each. Whassup w/ that??? (Parenthetically, I note the author also skips the \t character on each record.)
...
%% Import and concatenate data segments
for i=1:length(FileNames)
y{i}=[];
for ii=1:length(FileNames{i})
Path{i}{ii}=fullfile(PathNames{i}, FileNames{i}{ii});
fileID{ii} = fopen(Path{i}{ii});
c=[];
while ~feof(fileID{ii})
C = textscan(fileID{ii},'\t%f',10000);
c=[c,C'];
end
fclose(fileID{ii});
c = c(~cellfun(@isempty, c));
gg=cell2mat(c');
yp{ii}=gg;
test=regexp(FileNames{i}{1}, 'txt','ONCE');
if isempty(test)
Names{i}=FileNames{i}{1};
else
[~,Names{i}]=fileparts(FileNames{i}{1});
end
%waitbar (i/length(FileNames),h);
end
for ii= 1:length(FileNames{i})
y{i}=[y{i};yp{ii}]; % concatenates all y data for each batch
end
end
...
I'm w/ Stephen -- your time would be better spent initially if this is an ongoing project as sounds as is to refactor the guts of this code into something that can be generalized going forward.
If there were a clear description of the desired results, it's possible somebody here could manage to find the time to do some of the heavy lifting even. But, it surely looks like it would not be terribly difficult to take the guts of the actual calculations and recast them to use a flow pattern as I or Adam wrote--if one had a definition of just what it is being done w/o having to try to divine that just from reading the (very convoluted w/ all the unnecessary cell array indexing) existing code.
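For reference, a minimal sketch of the point being made: the quoted inner while loop collapses to one textscan call per file, since textscan with no count argument already reads to end of file (variable names taken from the script):
% Sketch: equivalent of the 10,000-record while loop above, in a single call.
fid = fopen(Path{i}{ii});
y{i} = [y{i}; cell2mat(textscan(fid,'\t%f'))];   % same leading-tab skip, same column vector
fclose(fid);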
EL
2019-8-29
The 100,000 is a data segment chunk of 5 seconds that was deduced to be long enough to draw accurate data from. The data flattening occurs because we have small waves inside of large waves, and the large noise waves need to be removed.
I'd rather not rewrite the code until I understand it better. I know this code works, and I'd prefer to rerun this code as a function of another. Even if it's inefficient, it'll still be done in the morning.
dpb
2019-8-29
But nothing is done with the data while it is in those 100,000 point chunks--it's just put into a cell array and then that cell array is turned back into the full array without any calculations in between. Then that full set is copied into yet another cell array y. Nothing has happened to do any smoothing at all at that point -- a lot of manipulations to no obvious purpose at all.
EL
2019-8-29
On days I need to process data, it requires me to be there all day. I was basically saying I could start the data processing in the evening when I leave and have it done by morning, so I could look at it and spend my time elsewhere.
dpb
2019-8-29
I think I pretty-much follow the code with the possible exception of the following code section--is the author available to answer one or two Q?
I believe could revamp to be called repetitively pretty easily and probably simplified quite a lot at the same time if knew just one or two more details. In the following snippet:
for ii=1:length(y)
for k=1:length(y{ii})/n
coeffs{ii}(k,:)=polyfit(x{ii}(n*(k-1)+1 : k*n),y{ii}(n*(k-1)+1 : k*n),1);
yfit{ii}(k,:)=coeffs{ii}(k,2)+coeffs{ii}(k,1)*x{ii}(n*(k-1)+1 : k*n);
end
end
for ii=1:length(y)
dd{ii}= subsref(yfit{ii}.', substruct('()', {':'})).';
dd2{ii}=y{ii}(1:length(dd{ii}))-dd{ii}';
end
for i=1:length(dd2)
t{i}=zeros(1,length(dd2{i}));
end
t{1}=1:length(dd2{1});
for i=2:length(dd2)
t{i}= t{i-1}(end)+[1:length(dd2{i})];
end
the first loop fits a linear line and then evaluates it for the data (altho why didn't use polyval() for yfit is a puzzle, it's a nit).
dd is a dereferencing of all the coefficients for the sections within each time series in 2D array (why didn't just put into that form to begin with is another puzzle if that's way wanted them, but again relatively minor). It's dd2 that's the real puzzle--it is taking a subset of the y the length of which looks to be 2X the y since there are two coefficients per fit for a first order polynomial and then subtracts the value of those coefficients from the y values. That calculation doesn't seem to make any sense at all that I can see.
Nor why using that length to pick those points makes any direct sense to use to compute what appears later to be a time vector to be associated with the data.
I'll have to dig through again to follow the actual relationship of the end y after the binning operation to see if I can see the logic behind this, but at first blush it seems peculiar at best.
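As a reference point for this discussion, here's a hedged sketch of that same piecewise linear fit written with polyval() and subtracted in place; n, x and y are the script's variables, and this is meant to mirror the quoted loops, not replace them:
% Sketch: detrend each dataset ii in windows of n points.
for ii = 1:length(y)
    for k = 1:floor(length(y{ii})/n)
        idx = n*(k-1)+1 : k*n;                              % this window's indices
        p = polyfit(x{ii}(idx), y{ii}(idx), 1);             % first-order fit
        y{ii}(idx) = y{ii}(idx) - polyval(p, x{ii}(idx));   % subtract the local trend
    end
end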
EL
2019-8-30
The author isn't available, and this script was written quickly. Part of the notes reads, "written quickly without care to memory".
I'd prefer to have this script as part of a function. I know the script analyzes the data correctly. I'm sure you could rewrite it the way it probably could have been done, but I'd have no way to verify that, and I can't go to my boss with a justification like that.
I have one of the people from another team coming to the country in a few weeks, so I'll know more when that time comes. Not sure how much he'll know because he doesn't code too often
dpb
2019-8-30
Edited: dpb
2019-8-30
Well, about the only way I see given the way the script is constructed to do it repetitively and semi-automagically would be mostly to do that reconstruction.
It would be simple enough to rerun a case to demonstrate it produces the same results, altho it would be far more convenient and amenable to the forum to have a smaller dataset and to cut the size numbers down to something manageable--the logic wouldn't be any different if the averaging were over 100 instead of 400,000, just changing a few constants.
The alternative way would be to use the dataset you posted with that substitution and generate a test output from them and then compare results to those. That would give a high confidence in the two being the same and then a rerun on a real dataset to confirm with the actual constants.
You mentioned there are other sets of data that aren't nearly as long as these -- the same logic applied to larger datasets is immaterial to the size of the dataset. Providing one of those with a known result would also work...or, cut one of the 6 hour experiments back to just 1 or 2 hours...the results will be the same for the section one does have already but don't need more data than one section to prove the calculations are the same.
dpb
2019-8-30
" I know the script analyzes the data correctly."
Are you sure you actually do know that or is it just that this is the code that has always been used and the results accepted as being correct?
That section of subtracting a set of fitted coefficients from an observation surely looks peculiar at the very least. Knowing what you know of the process/measurement, can you think of what meaning that could have and rationalize doing such an operation?
Or, as above, have you just inherited the job of running this script and producing output without actually anybody questioning further what that output really is? Just because it may appear ok doesn't prove that it is.
I can't say unequivocally it isn't ok...just that the particular operation surely is suspicious-looking in form.
Not having any of the subsequent plotting code or any of the produced output, it's hard to judge what effect that might have--maybe the coefficients are small compared to the data and so the effect is immaterial even if it is a mistake.
Unfortunately, there are no comments in the attached code in the section in question other than purely superfluous ones that are self-evident from the code itself. Here, where a comment could have really helped, nary a thing. That, of course, is the way a tremendous amount of code is commented, unfortunately. It takes real discipline to do better.
EL
2019-8-30
We subtract to remove noise. Our instrument has a natural drift to it we must remove.
When I get home, I'll share the code in its entirety
dpb
2019-8-30
OK, in between I see where I got my eyes crossed and, in fact, other than that I don't yet follow the (1:length(dd{ii})) part in dd2{ii}=y{ii}(1:length(dd{ii}))-dd{ii}'; I now agree it is ok.
I had a mental faux pas when I saw the coeff on the RHS, and that got stuck in my head as having assigned the coefficients instead of the fitted result.
OK, it's not actually what I'd call noise removal but drift, so you are detrending.
I'm left at this point only wondering about the selection logic but running out of time for tonight.
EL
2019-8-30
In the end, I don't want to change the way the data is analyzed. There is simply too much variation in the machine, and the people who made the script spent a few years developing it, so what may seem odd or weird probably has a justification for it. I don't want to change stuff. I've already changed a few things, such as windows for variation, without realizing the consequences or why a window of n=100000 was chosen. I am fairly new to using the instrument, and without anyone to guide me through the minute details, I can't answer most of your questions. What I know is this is what everyone else who's used it is essentially using, and I have no reason to change it.
EL
2019-8-30
Edited: dpb
2019-8-30
%% Prototype Script
% A small script for automated analysis of samples from the prototype.
% Written quickly, with not much care about memory.
% formerly titled/ updated from: PrototypeScript_commentary_20min_v1.m
% updated to enable combining data files
....snip...
DPB attached as file and deleted long text attachment...
dpb
2019-8-30
I wasn't proposing to change the result, simply to streamline how those results are produced in order to get to where you're trying to go (and not only you, but the whole institution, apparently) -- being able to generalize this analysis tool to handle changing demands. It appears the payback could be VERY high in real labor savings, which translates directly to $$; management ought to highly appreciate any employee who can provide that to the organization!
Just out of curiosity altho has no real bearing on anything, just what is the instrument and the measurement?
dpb
2019-8-30
OK, this is how the script is kicked off--
prompt = 'Enter the name of a .mat file to run (e.g. file_mmddyyyy_signal.mat) ';
files = input(prompt,'s');
load(files);
So what is the content of this .mat file? Is it data or a list of files created elsewhere to be processed, maybe?
Can you attach a sample of it for context/orientation?
Looks to me like should be able to just start with the big file to process and instead of making a zillion files, use a formula such as what I posted or Adam's alternative and then just use that section of data to process a piece at a time without all the extra files--they're doing nothing but being temporary repositories; one can make that temporary directly from the big file just as easily as reading another file.
EL
2019-8-30
Edited: EL
2019-8-30
We use the following code to turn our multiple data files into a list that can be read, converting the paths to Linux format so they can be run remotely. This is so we can simply select which files in whatever order, instead of typing them in by hand.
It's a motion detector, we're measuring metabolism vibrations
%% Populate filenames for LINUX command line operation
clear
close all
clc
[FileNames PathNames]=uigetfile('Y:\(path)*.txt', 'Choose files to load:','MultiSelect','on'); %It opens the window for file selection
prompt = 'Enter save-name according to: file_mmddyyyy_signal ';
Filenamesave = input(prompt,'s');
Filenamesave = strcat(PathNames,Filenamesave,'.mat');
PathNames=strrep(PathNames,'L:','Data');
PathNames=strrep(PathNames,'\','/');
PathNamesSave=strcat('/',PathNames);
save(Filenamesave,'FileNames','PathNames','PathNamesSave');
dpb
2019-8-30
So it would be entirely equivalent to just process that list, correct?
Are those the full files or the broken-up ones--or, either, depending on how long an experiment was run?
Adam Danz
2019-8-31
Edited: Adam Danz
2019-8-31
I've only caught up on some of the comments above but I'd like to make a very strong point.
"...the people who made the script spent a few years developing it and what may seem odd or weird probably has a justification for it."
Years back I ran a 4-year experiment on stimulus-producing code that was developed by very smart people over the course of years and I therefore trusted the lengthy code without thorough investigation and understanding. After 4 years of data collection I realized there was a single line with a mistake and a comment dated 1999 (13 years prior). About 3/4 of the data I collected over 4 years had to be thrown out because of this mistake on 1 line of code made by a very smart person 13 years ago. But in the end, it was my fault for not being intimately familiar with the code I used five days a week for years. As dpb mentioned, the code you shared is not a masterpiece. Investing time now into understanding each line and optimizing weak areas will be a good investment.
"My intention is make this script automatically chop up 27 hours of data in 1 hour bins, and have those files loaded in order in the same way they're loaded here. "
That's exactly what my answer does if you wish to go down that route.
dpb
2019-8-31
Edited: dpb
2019-8-31
""My intention is make this script automatically chop up 27 hours of data in 1 hour bins, and have those files loaded in order in the same way they're loaded here. ""
"That's exactly what my answer does if you wish to go down the route."
Or what mine does just without actually making files but using the data from the full file one piece at a time...which is the same result w/o the intermediate step.
I had made a start towards the factorization but life has intervened and now prevents me from investing more time at this time...I'll attach the beginnings of converting the script to functions altho had just gotten to the point of considering the main calculations so nothing there to report...
There is no optimization or reduction of superfluous intermediaries in the above as yet--strictly a factoring out of the initial portions to be callable functions in an eventual script.
My vision/intent was to remove the reliance upon splitting files and having the user specify the actual experiment file(s) wanted to be analyzed and then process those piecewise by whatever amount of memory is available to read/hold the data at one time. Understanding the sequence of which files and how those files were built was the point behind the last Q? of just what that list of files initially read actually represents.
If could manage to reduce a bunch of the machinations on doubly-dimensioned cell arrays and such along the way, that would have been gravy in reducing overhead in both memory and speed.
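To make that direction concrete, here's a rough skeleton of the kind of factoring described above; everything in it is an assumption, including the invented names processExperiment and analyzeChunk and the opts fields, which would stand in for the input() prompts:
function S = processExperiment(fileList, opts)
% Hypothetical wrapper: read each experiment file in opts.NperSet-row chunks
% and hand every chunk to the (refactored) analysis, with no intermediate files.
S.stats = {};
for i = 1:numel(fileList)
    fid = fopen(fileList{i},'r');
    while ~feof(fid)
        data = cell2mat(textscan(fid,'\t%f',opts.NperSet));
        if isempty(data), break, end
        S.stats{end+1} = analyzeChunk(data, opts);   % analyzeChunk = the script's guts as a function
    end
    fclose(fid);
end
end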
Adam Danz
2019-8-31
Yeah I (still) agree that there's no need to store the segmented data in text files and that dpb's approach is the better one.
dpb
2019-8-31
On the comment about hidden and accepted bugs -- just for the record I did err in my earlier post regarding the comparison/subtraction of polynomial coefficients from observations; the code at that point indeed does correctly detrend the data for the x values selected.
I was, however, still at the point that I hadn't quite determined just why the x values were/are being selected as they are for the independent variable in the plots...it probably is ok if they have used this successfully for so long, but it still seems a peculiar way to have coded it if it is just piecing back together the time series/building a time vector from a fixed sample rate that I hadn't yet got my head around just what is behind having been done the way it is.
More Answers (1)
Adam Danz
2019-8-28
Edited: Adam Danz
2019-8-29
This solution is quite fast and uses fgetl() to read in blocks of a text file and saves those blocks to a new text file. You can set the number of rows per block and other parameters at the top of the code. See comments within the code for more detail.
% Set the max number of lines per file. The last file may have fewer rows.
nLinesPerFile = 10000;
% Set the path where the files should be saved
newFilePath = 'C:\Users\name\Documents\MATLAB\datafolder';
% Set the base filename of each new file. They will be appended with a file number.
% For example, 'data' will become 'data_1.txt', 'data_2.txt' etc.
newFileName = 'data';
% Set the file that will be read (better to include the full path)
basefile = 'testv2.txt';
% Open file for reading
fid = fopen(basefile);
fnum = 0; % file number
done = false; %flag that ends while-loop.
while ~done
% Read in the next block; this assumes the data starts
% at row 1 of the txt file. If that is not the case,
% adapt this so that the header rows are skipped.
tempVec = nan(nLinesPerFile,1);
for i = 1:nLinesPerFile
nextline = fgetl(fid);
if ~ischar(nextline) % fgetl returns -1 at end of file
done = true;
tempVec(isnan(tempVec)) = []; % drop the unused preallocated rows
break
else
tempVec(i) = str2double(nextline);
end
end
% Write the block to a new text file.
if ~isempty(tempVec)
fnum = fnum+1;
tempFilename = sprintf('%s_%d.txt',newFileName,fnum); % better to include a full path
tempFile = fullfile(newFilePath,tempFilename);
fid0 = fopen(tempFile,'wt');
fprintf(fid0,'%.6f\n',tempVec);
fclose(fid0);
% (optional) display link to folder
disp(['<a href="matlab: winopen(''',newFilePath,''') ">',tempFilename,'</a>', ' saved.'])
end
end
fclose(fid);
5 Comments
hamad javaid
2019-10-8
Dear Adam,
I have .txt files with 8 or 14 columns and thousands of continuous rows. I used this code to split the file into separate blocks of 2500 rows and it converted completely, but the output files were created with NaN and only one column (NaN written 2500 times), as I have a comma-separated file.
Any suggestions please?
Adam Danz
2019-10-8
Hi Hamad,
I would use the debug feature.
Put a breakpoint at the top of your code and step through each line, looking at the outputs. "tempVec" produces a vector of NaNs. Maybe those values are never being filled?
hamad javaid
2020-6-15
Dear Adam,
I have worked on this and it goes very well when the text file has a single column. I have attached my sample file; when I use this code, it creates the subfiles but only one column with NaN. Kindly guide me through this, as the code itself is working fine. Maybe I am having trouble with the comma ',' as delimiter?
Thank you so much
Adam Danz
2020-6-15
My answer pertains to the main question which asks about text files that have a single column of data.
In your case, check out readmatrix(). If you read the documentation for that function, you'll see optional inputs that specify the line number on which your numeric data starts, which will be useful in your case. Also check out readtable() for an alternative.
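A minimal sketch of that suggestion, assuming a comma-separated file with one header line (the file name, header count, and 2500-row block size are assumptions based on the comments above):
% Sketch: read a comma-separated, multi-column text file and re-split it by rows.
M = readmatrix('sample.txt', 'Delimiter',',', 'NumHeaderLines',1);
nRows = 2500;
for k = 1:ceil(size(M,1)/nRows)
    rows = (k-1)*nRows+1 : min(k*nRows, size(M,1));
    writematrix(M(rows,:), sprintf('block_%d.txt',k), 'Delimiter',',');
end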