How do I split a single-column .txt file by line?
17 views (last 30 days)
Hey Guys,
How would I split a .txt file into smaller files by number of lines? This was simple to do in Linux, but I can't seem to do it here.
An example of a file is attached (testv2.txt)
EDIT: The .txt files I'm working with are very large, and I need to split them into files of 72,000,000 lines each. I can't split the files by size, because for some reason some files are different sizes, and the script I'm using keeps time by counting lines.
Thanks for the help guys!
4 Comments
Adam Danz
2019-8-28
After answering your question I realized that my answer only addresses half of your question. How would you like to split the data into subfiles? Are you sure you want to split up the data rather than keeping it all together?
EL
2019-8-28
Absolutely. I have 1.4 billion lines of data, and I need to split them into manageable sizes with a precise number of lines so I can perform good statistics. Ideally, I'd like it to split the .txt into new .txt files, so I'd have the original, unadulterated file (backup data) and new .txt files that are 72,000,000-line sections of the original data. I'm not too worried about the empty first column.
Accepted Answer
dpb
2019-8-28
Again, I'd suggest there's no need to actually create multiple text files to do this...several options exist in MATLAB; the simplest is probably to just process the file in chunks of whatever size you wish and calculate statistics or do whatever on each section...something like
fid=fopen('yourfile.txt','r');
NperSet=72E6;   % set number of elements to read per section
ix=0;           % initialize group index counter
while ~feof(fid)                                  % go thru the file until run out of data
  ix=ix+1;                                        % increment counter
  data=cell2mat(textscan(fid,'\t%f',NperSet));    % read the data chunk of set size, skip \t
  stats(ix,:)=[mean(data) std(data) ...];         % compute, save the stats of interest
  ...                                             % do whatever else needed w/ this dataset here
end
You'll want to preallocate the stats array to some reasonable approximation of the size expected and check for overflow, but that's the basic idea...simpler than creating and having to traverse thru a bunch of files to simply process in sequence.
The alternative is to use tall arrays, memmapfile, or other features TMW has provided for large datasets. See <Large-files-and-big-data link>
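For the datastore/tall-array route, here's a minimal sketch, assuming the single numeric column with a leading tab seen in the attached testv2.txt (the file name is a placeholder):
% Sketch only -- chunked reading via a datastore instead of manual fopen/textscan.
ds = tabularTextDatastore('yourfile.txt', 'Delimiter','\t', 'ReadVariableNames',false);
ds.ReadSize = 72e6;                          % rows returned per read() call
ix = 0;
while hasdata(ds)
    ix = ix + 1;
    T = read(ds);                            % next 72e6-row chunk as a table
    data = T{:,end};                         % the numeric column (first column may be empty)
    stats(ix,:) = [mean(data) std(data)];    % per-chunk statistics, as in the loop above
end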
29 Comments
Adam Danz
2019-8-28
Just for the record, I agree with dpb on this approach. The i loop in my answer produces the tempVec vector, which you could perform statistics on iteratively.
EL
2019-8-28
I can't do it this way. There is a script that already exists that does some crazy calculations involving fast Fourier transforms and wave flattening that I'm not going to touch. I've already looked at it for a month and I've concluded I'm too new/dumb at MATLAB to even touch it or understand what's going on. Perhaps in the future, but my job isn't to make the script, it's to get data. So I have to muscle this through.
I'm trying to write a script that incorporates that script. I need it to chop a large data file into smaller chunks, then to have that script run each data chunk in order. Then I will recombine all that data into a super graph so I can have data I can edit together.
I mean, this is the correct answer for most people, so I'll mark it as such, but not for me currently. Perhaps when I become great at MATLAB I'll venture into figuring out how the script that was given to me works, but for now I'm too stupid. I just need something that works until I can optimize it. I'm definitely writing this into my notebook.
Thank you to both of you
dpb
2019-8-28
Attach this magic script...if nothing else, just wrap it into a function and pass the same data that would otherwise be read from the file.
Or, of course, you could use the above approach and write a temporary file after each iteration if you were truly that desperate.
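For completeness, a hedged sketch of that temporary-file variant (the chunk naming and the call-out to the inherited script are placeholders, not anything from this thread):
% Sketch: same chunked read as in the answer above, but each chunk is written
% to its own temporary .txt file so an existing file-based script can consume it.
fid = fopen('yourfile.txt','r');
NperSet = 72e6;
ix = 0;
while ~feof(fid)
    ix = ix + 1;
    data = cell2mat(textscan(fid,'\t%f',NperSet));   % read one 72e6-line chunk
    if isempty(data), break, end
    tmpName = sprintf('chunk_%03d.txt',ix);          % hypothetical naming scheme
    fidOut = fopen(tmpName,'wt');
    fprintf(fidOut,'\t%.6f\n',data);                 % keep the leading-tab layout
    fclose(fidOut);
    % ...call the inherited analysis on tmpName here, then delete(tmpName)...
end
fclose(fid);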
EL
2019-8-29
Edited: dpb
2019-8-29
%% Populate filenames for LINUX command line operation
clear
close all
clc
[FileNames PathNames]=uigetfile('Y:\........\*.txt', 'Choose files to load:','MultiSelect','on'); %It opens the window for file selection
prompt = 'Enter save-name according to: file_mmddyyyy_signal ';
Filenamesave = input(prompt,'s');
Filenamesave = strcat(PathNames,Filenamesave,'.mat');
PathNames=strrep(PathNames,'L:','Data');
PathNames=strrep(PathNames,'\','/');
PathNamesSave=strcat('/',PathNames);
save(Filenamesave,'FileNames','PathNames','PathNamesSave');
clear;
clc;
close all;
%% Retrieve files
tstart=tic
prompt = 'Enter the name of a .mat file to run (e.g. file_mmddyyyy_signal.mat) ';
files = input(prompt,'s');
load(files);
%% Terminate or proceed
if isequal(FileNames{1},0)
disp('User selected Cancel') %display value of variable
else
for i=1:numel(FileNames)
clear ii
c=class(FileNames{i});
if c=='char'
FileNames{i}=cellstr(FileNames{i});
end
for ii=1:numel(FileNames{i})
disp(['User selected ', fullfile(PathNames{i}, FileNames{i}{ii})])
end
end
%% User query to run/save Figure output
%Code for commandline inputs
prompt = 'Enter file name for figures according to: mmddyyyy_bug_media_oC ';
Filenamesave = cellstr(input(prompt,'s'));
prompt = 'Enter signal type for files';
signal = input(prompt,'s');
if isempty(signal)
signal = 'DEF';
end
prompt = 'Make EPFL plots? Y/N [Y]: ';
epflplot = input(prompt,'s');
if isempty(epflplot)
epflplot = 'Y';
end
prompt = 'Make raw signal and variance plots? Y/N [Y]: ';
rawplot = input(prompt,'s');
if isempty(rawplot)
rawplot = 'Y';
end
prompt = 'Make FFT plots? Y/N [Y]: ';
fftplot = input(prompt,'s');
if isempty(fftplot)
fftplot = 'Y';
end
prompt = 'Enter the data collection rate (Hz)[20000]: ';
ft = input(prompt);
if isempty(ft)
ft = 20000;
end
% Binning inputs
prompt = 'Would you like to bin the data? [Y]/N ';
bo = input(prompt);
if isempty(bo)
bo = 'Y';
end
if bo=='Y'
% user inputs the bin time, if no input, script continues
prompt = 'Enter the bin time in minutes ';
bt = input(prompt);
end
%% Import and concatenate data segments
for i=1:length(FileNames)
y{i}=[];
for ii=1:length(FileNames{i})
Path{i}{ii}=fullfile(PathNames{i}, FileNames{i}{ii});
fileID{ii} = fopen(Path{i}{ii});
c=[];
while ~feof(fileID{ii})
C = textscan(fileID{ii},'\t%f',10000);
c=[c,C'];
end
fclose(fileID{ii});
c = c(~cellfun(@isempty, c));
gg=cell2mat(c');
yp{ii}=gg;
test=regexp(FileNames{i}{1}, 'txt','ONCE');
if isempty(test)
Names{i}=FileNames{i}{1};
else
[~,Names{i}]=fileparts(FileNames{i}{1});
end
%waitbar (i/length(FileNames),h);
end
for ii= 1:length(FileNames{i})
y{i}=[y{i};yp{ii}]; % concatenates all y data for each batch
end
end
Names=cellstr(Names);
Names2=strrep(Names,'_',' ');
%close(h);
disp('Files are loaded successfully!')
clear ans c C fileID gg yp h i ii Path question test prompt
% clear ans c C fileID gg h i Path question test prompt
%% Bin the data
% this reallocates y{i}, if not used, script still runs
if isempty(bt)
return
else
bd=bt*60*ft; % number of data points in the bin
for i=1:length(y)
f(i)=floor(length(y{i})/bd);
% yb{1}=y{1}(1:bd);
% N{1}=Names{i}; %replicate name to each bin
% N2{1}=Names2{i}; %replicate name to each bin
% if f>1 % multiple bins allocated or data cut to single bin size
for ii=1:f(i)
bds=(bd*ii)-(bd-1); % bin data start
bde=bds+(bd-1); % bin data end
yb{(sum(f(1:i-1))+ii)}=y{i}(bds:bde); % bin the data
N{(sum(f(1:i-1))+ii)}=Names{i}; %replicate name to each bin
N2{(sum(f(1:i-1))+ii)}=Names2{i}; %replicate name to each bin
end
% end
end
y=yb; % reallocate to y vector
Names=N; % update to reflect new bins
Names2=N2; % update to reflect new bins
clear yb N N2
end
%% Process the data
for i=1:length(y)
x{i}=(1:length(y{i}))';
end
% Window for flattening
n = 400000;
for ii=1:length(y)
for k=1:length(y{ii})/n
coeffs{ii}(k,:)=polyfit(x{ii}(n*(k-1)+1 : k*n),y{ii}(n*(k-1)+1 : k*n),1);
yfit{ii}(k,:)=coeffs{ii}(k,2)+coeffs{ii}(k,1)*x{ii}(n*(k-1)+1 : k*n);
end
end
for ii=1:length(y)
dd{ii}= subsref(yfit{ii}.', substruct('()', {':'})).';
dd2{ii}=y{ii}(1:length(dd{ii}))-dd{ii}';
end
for i=1:length(dd2)
t{i}=zeros(1,length(dd2{i}));
end
t{1}=1:length(dd2{1});
for i=2:length(dd2)
t{i}= t{i-1}(end)+[1:length(dd2{i})];
end
clear coeffs dd i ii k prompt ws yfit
yRaw=y; xRaw=x;
x=t; y=dd2;
clear dd2 t
% Window for variance
n = 100000;
for i=1:length(y)
for k=1:length(y{i})/n
a(k)=var(y{i}(n*(k-1)+1:k*n));
end
v{i}=a;
a=[];
end
t{1}=1:length(v{1});
for i=2:length(y)
t{i}=t{i-1}(end)+[1:length(v{i})];
end
%%%%%In between here is 500 lines of graph making. I'm omitting this for now. Afterwards, there's...
%% Rename variables
for i=1:length(t)
tm{i} = t{i}.*((n/ft)/60); % time in minutes
end
%Save variable output.
%v = variance
save(strcat(PathNamesSave{1},Filenamesave{1},'_',signal), 't', 'tm', 'v','xRaw', 'x', 'yRaw', 'y','n', 'ft','Filenamesave','FileNames','PathNames','PathNamesSave','Names','Names2','-v7.3');
disp('Done')
So from here, a .mat file is saved, which I can use to make a graph. The .mat file, though, is as large as the original file itself, say 20 GB. I think extracting data from here to build graphs is difficult.
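One hedged option (not from the thread) for that 20 GB .mat file: since it is saved with '-v7.3', matfile() can pull out individual variables for graphing without load()-ing the whole thing. The file name below is just the pattern the prompts would build:
% Sketch: partial loading from a -v7.3 MAT file.
m = matfile('file_mmddyyyy_signal_DEF.mat');   % hypothetical name from the save() above
whos(m)            % list the stored variables without loading them
v  = m.v;          % variance cells are small and load quickly
tm = m.tm;         % binned time in minutes
% yRaw, xRaw, y, x stay on disk until actually referenced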
dpb
2019-8-29
OK, I'll grant you've been handed a mess...I feel for you if you have to deal with this.
Are you actually running this over and over and having to go through all the manual input every time????
Surely there's a way to determine what it is that is needed to be done and just go do it instead?
I can't imagine trying to use something like this more than once in a lifetime.
EL
2019-8-29
The top portion lets you pick the files you want in order and saves that in a .mat file. The second portion loads that file. That's what gets analyzed.
This program was fine analyzing small experiments, say 6 hours or so, but we've moved towards 27-hour experiments, and that data is too large to run, so I pretty much have to sit behind a computer all day and pump out the data. I have an autorun version of this that autopopulates the prompts, however I've only tested it with one dataset. My intention is to make this script automatically chop up 27 hours of data into 1-hour bins, and have those files loaded in order in the same way they're loaded here. This should automate everything and conserve memory so data gets processed quickly. Then I'll have the last part of the script repopulate graphs by extracting datasets. Then hopefully delete those chopped-up data segments to save space.
Stephen23
2019-8-29
"This program was fine analyzing small experiments, say 6 hours or so, but ..."
It is no surprise that you are struggling with this, because fundamentally that code was designed and written in a way that prevents it from easily expanding to larger data, from being automated or included as a part of a larger process, or from being easily repeatable and testable. Basically you have reached the point where you need to change how the code is written. In particular:
- Write function/s instead of one script.
- Provide input parameters in one structure or as input arguments, rather than using (awful untestable inconvenient annoying) input.
Some other code tips:
- Use fullfile instead of concatenating strings to build paths.
- load into an output variable, rather than directly into the workspace.
- Align the code consistently.
- It is very unlikely that you need to call clear anywhere.
dpb
2019-8-29
"The top portion lets you pick the files you want in order and saves that in a .mat file. The second portion loads that file. That's what gets analyzed."
But, then it just goes and reassembles the files you just (apparently) split up by concatenating chunks of 10,000 records each. Whassup w/ that??? (Parenthetically, I note the author also skips the \t character on each record.)
...
%% Import and concatenate data segments
for i=1:length(FileNames)
y{i}=[];
for ii=1:length(FileNames{i})
Path{i}{ii}=fullfile(PathNames{i}, FileNames{i}{ii});
fileID{ii} = fopen(Path{i}{ii});
c=[];
while ~feof(fileID{ii})
C = textscan(fileID{ii},'\t%f',10000);
c=[c,C'];
end
fclose(fileID{ii});
c = c(~cellfun(@isempty, c));
gg=cell2mat(c');
yp{ii}=gg;
test=regexp(FileNames{i}{1}, 'txt','ONCE');
if isempty(test)
Names{i}=FileNames{i}{1};
else
[~,Names{i}]=fileparts(FileNames{i}{1});
end
%waitbar (i/length(FileNames),h);
end
for ii= 1:length(FileNames{i})
y{i}=[y{i};yp{ii}]; % concatenates all y data for each batch
end
end
...
I'm w/ Stephen -- your time would be better spent initially if this is an ongoing project as sounds as is to refactor the guts of this code into something that can be generalized going forward.
If there were a clear description of the desired results, it's possible somebody here could manage to find the time to do some of the heavy lifting even. But, it surely looks like it would not be terribly difficult to take the guts of the actual calculations and recast them to use a flow pattern as I or Adam wrote--if one had a definition of just what it is being done w/o having to try to divine that just from reading the (very convoluted w/ all the unnecessary cell array indexing) existing code.
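For reference, a minimal sketch of the point being made: the quoted inner while loop collapses to one textscan call per file, since textscan with no count argument already reads to end of file (variable names taken from the script):
% Sketch: equivalent of the 10,000-record while loop above, in a single call.
fid = fopen(Path{i}{ii});
y{i} = [y{i}; cell2mat(textscan(fid,'\t%f'))];   % same leading-tab skip, same column vector
fclose(fid);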
EL
2019-8-29
The 100,000 is a data segment chunk of 5 seconds that was deduced to be long enough to draw accurate data from. The data flattening occurs because we have small waves inside of large waves, and the large noise waves need to be removed.
I'd rather not rewrite the code until I understand it better. I know this code works, and I'd prefer to rerun this code as a function of another. Even if it's inefficient, it'll still be done in the morning.
dpb
2019-8-29
But nothing is done with the data while it is in those 100,000 point chunks--it's just put into a cell array and then that cell array is turned back into the full array without any calculations in between. Then that full set is copied into yet another cell array y. Nothing has happened to do any smoothing at all at that point -- a lot of manipulations to no obvious purpose at all.
EL
2019-8-29
On days I need to process data, it requires me to be there all day. I was basically saying I could start the data processing in the evening when I leave and have it done by morning, so I could look at it and spend my time elsewhere.
dpb
2019-8-29
I think I pretty-much follow the code with the possible exception of the following code section--is the author available to answer one or two Q?
I believe could revamp to be called repetitively pretty easily and probably simplified quite a lot at the same time if knew just one or two more details. In the following snippet:
for ii=1:length(y)
for k=1:length(y{ii})/n
coeffs{ii}(k,:)=polyfit(x{ii}(n*(k-1)+1 : k*n),y{ii}(n*(k-1)+1 : k*n),1);
yfit{ii}(k,:)=coeffs{ii}(k,2)+coeffs{ii}(k,1)*x{ii}(n*(k-1)+1 : k*n);
end
end
for ii=1:length(y)
dd{ii}= subsref(yfit{ii}.', substruct('()', {':'})).';
dd2{ii}=y{ii}(1:length(dd{ii}))-dd{ii}';
end
for i=1:length(dd2)
t{i}=zeros(1,length(dd2{i}));
end
t{1}=1:length(dd2{1});
for i=2:length(dd2)
t{i}= t{i-1}(end)+[1:length(dd2{i})];
end
the first loop fits a linear line and then evaluates it for the data (altho why didn't use polyval() for yfit is a puzzle, it's a nit).
dd is a dereferencing of all the coefficients for the sections within each time series in 2D array (why didn't just put into that form to begin with is another puzzle if that's way wanted them, but again relatively minor). It's dd2 that's the real puzzle--it is taking a subset of the y the length of which looks to be 2X the y since there are two coefficients per fit for a first order polynomial and then subtracts the value of those coefficients from the y values. That calculation doesn't seem to make any sense at all that I can see.
Nor why using that length to pick those points makes any direct sense to use to compute what appears later to be a time vector to be associated with the data.
I'll have to dig through again to follow the actual relationship of the end y after the binning operation to see if I can see the logic behind this, but at first blush it seems peculiar at best.
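As a reference point for this discussion, here's a hedged sketch of that same piecewise linear fit written with polyval() and subtracted in place; n, x and y are the script's variables, and this is meant to mirror the quoted loops, not replace them:
% Sketch: detrend each dataset ii in windows of n points.
for ii = 1:length(y)
    for k = 1:floor(length(y{ii})/n)
        idx = n*(k-1)+1 : k*n;                              % this window's indices
        p = polyfit(x{ii}(idx), y{ii}(idx), 1);             % first-order fit
        y{ii}(idx) = y{ii}(idx) - polyval(p, x{ii}(idx));   % subtract the local trend
    end
end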
EL
2019-8-30
The author isn't available, and this script was written quickly. Part of the notes reads, "written quickly without care to memory".
I'd prefer to have this script as part of a function. I know the script analyzes the data correctly. I'm sure you could rewrite it the way it probably could have been done, but I'd have no way to verify that, and I can't go to my boss with a justification like that.
I have one of the people from another team coming to the country in a few weeks, so I'll know more when that time comes. Not sure how much he'll know because he doesn't code too often
dpb
2019-8-30
Edited: dpb
2019-8-30
Well, about the only way I see given the way the script is constructed to do it repetitively and semi-automagically would be mostly to do that reconstruction.
It would be simple enough to rerun a case to demonstrate it produces the same results, altho it would be far more convenient and amenable to the forum to have a smaller dataset and to cut the size numbers down to something manageable--the logic wouldn't be any different if the averaging were over 100 instead of 400,000, just changing a few constants.
The alternative way would be to use the dataset you posted with that substitution and generate a test output from them and then compare results to those. That would give a high confidence in the two being the same and then a rerun on a real dataset to confirm with the actual constants.
You mentioned there are other sets of data that aren't nearly as long as these -- the same logic applied to larger datasets is immaterial to the size of the dataset. Providing one of those with a known result would also work...or, cut one of the 6 hour experiments back to just 1 or 2 hours...the results will be the same for the section one does have already but don't need more data than one section to prove the calculations are the same.
dpb
2019-8-30
" I know the script analyzes the data correctly."
Are you sure you actually do know that or is it just that this is the code that has always been used and the results accepted as being correct?
That section of subtracting a set of fitted coefficients from an observation surely looks peculiar at the very least. Knowing what you know of the process/measurement, can you think of what meaning that could have and rationalize doing such an operation?
Or, as above, have you just inherited the job of running this script and producing output without actually anybody questioning further what that output really is? Just because it may appear ok doesn't prove that it is.
I can't say unequivocally it isn't ok...just that the particular operation surely is suspicious-looking in form.
Not having any of the subsequent plotting code or any of the produced output, it's hard to judge what effect that might have--maybe the coefficients are small compared to the data and so the effect is immaterial even if it is a mistake.
Unfortunately, there are no comments in the attached code in the section in question other than purely superfluous ones that are self-evident from the code itself. Here, where a comment could have really helped, nary a thing. That, of course, is the way a tremendous amount of code is commented, unfortunately. It takes real discipline to do better.
EL
2019-8-30
We subtract to remove noise. Our instrument has a natural drift to it we must remove.
When I get home, I'll share the code in its entirety
dpb
2019-8-30
OK, in between I see where I got my eyes crossed and, in fact, other than that I don't yet follow the (1:length(dd{ii})) part in dd2{ii}=y{ii}(1:length(dd{ii}))-dd{ii}'; I now agree it is ok.
I had a mental faux pas when I saw the coeff on the RHS, and that got stuck in my head as having assigned the coefficients instead of the fitted result.
OK, it's not actually what I'd call noise removal but drift, so you are detrending.
I'm left at this point only wondering about the selection logic but running out of time for tonight.
EL
2019-8-30
In the end, I don't want to change the way the data is analyzed. There is simply too much variation in the machine, and the people who made the script spent a few years developing it, so what may seem odd or weird probably has a justification for it. I don't want to change stuff. I've already changed a few things, such as windows for variation, without realizing the consequences or why a window of n=100000 was chosen. I am fairly new to using the instrument, and without anyone to guide me through the minute details, I can't answer most of your questions. What I know is this is what everyone else who's used it is essentially using, and I have no reason to change it.
EL
2019-8-30
Edited: dpb
2019-8-30
%% Prototype Script
% A small script for automated analysis of samples from the prototype.
% Written quickly, with not much care about memory.
% formerly titled/ updated from: PrototypeScript_commentary_20min_v1.m
% updated to enable combining data files
....snip...
DPB attached as file and deleted long text attachment...
dpb
2019-8-30
I wasn't proposing to change the result, simply to streamline how those results are produced in order to get to where you're trying to go (and not only you, but the whole institution, apparently) -- being able to generalize this analysis tool to handle changing demands. It appears the payback could be VERY high in real labor savings, which translates directly to $$; management ought to highly appreciate any employee who can provide that to the organization!
Just out of curiosity altho has no real bearing on anything, just what is the instrument and the measurement?
dpb
2019-8-30
OK, this is how the script is kicked off--
prompt = 'Enter the name of a .mat file to run (e.g. file_mmddyyyy_signal.mat) ';
files = input(prompt,'s');
load(files);
So what is the content of this .mat file? Is it data or a list of files created elsewhere to be processed, maybe?
Can you attach a sample of it for context/orientation?
Looks to me like should be able to just start with the big file to process and instead of making a zillion files, use a formula such as what I posted or Adam's alternative and then just use that section of data to process a piece at a time without all the extra files--they're doing nothing but being temporary repositories; one can make that temporary directly from the big file just as easily as reading another file.
EL
2019-8-30
Edited: EL
2019-8-30
We use the following code to turn our multiple data files into a list that can be read, converting the paths to Linux format so they can be run remotely. This is so we can simply select which files in whatever order, instead of typing them in by hand.
It's a motion detector, we're measuring metabolism vibrations
%% Populate filenames for LINUX command line operation
clear
close all
clc
[FileNames PathNames]=uigetfile('Y:\(path)*.txt', 'Choose files to load:','MultiSelect','on'); %It opens the window for file selection
prompt = 'Enter save-name according to: file_mmddyyyy_signal ';
Filenamesave = input(prompt,'s');
Filenamesave = strcat(PathNames,Filenamesave,'.mat');
PathNames=strrep(PathNames,'L:','Data');
PathNames=strrep(PathNames,'\','/');
PathNamesSave=strcat('/',PathNames);
save(Filenamesave,'FileNames','PathNames','PathNamesSave');
dpb
2019-8-30
So it would be entirely equivalent to just process that list, correct?
Are those the full files or the broken-up ones--or, either, depending on how long an experiment was run?
Adam Danz
2019-8-31
Edited: Adam Danz
2019-8-31
I've only caught up on some of the comments above but I'd like to make a very strong point.
"...the people who made the script spent a few years developing it and what may seem odd or weird probably has a justification for it."
Years back I ran a 4-year experiment on stimulus-producing code that was developed by very smart people over the course of years and I therefore trusted the lengthy code without thorough investigation and understanding. After 4 years of data collection I realized there was a single line with a mistake and a comment dated 1999 (13 years prior). About 3/4 of the data I collected over 4 years had to be thrown out because of this mistake on 1 line of code made by a very smart person 13 years ago. But in the end, it was my fault for not being intimately familiar with the code I used five days a week for years. As dpb mentioned, the code you shared is not a masterpiece. Investing time now into understanding each line and optimizing weak areas will be a good investment.
"My intention is make this script automatically chop up 27 hours of data in 1 hour bins, and have those files loaded in order in the same way they're loaded here. "
That's exactly what my answer does if you wish to go down that route.
dpb
2019-8-31
Edited: dpb
2019-8-31
""My intention is make this script automatically chop up 27 hours of data in 1 hour bins, and have those files loaded in order in the same way they're loaded here. ""
"That's exactly what my answer does if you wish to go down the route."
Or what mine does just without actually making files but using the data from the full file one piece at a time...which is the same result w/o the intermediate step.
I had made a start towards the factorization but life has intervened and now prevents me from investing more time at this time...I'll attach the beginnings of converting the script to functions altho had just gotten to the point of considering the main calculations so nothing there to report...
There is no optimization or reduction of superfluous intermediaries in the above as yet--strictly a factoring out of the initial portions to be callable functions in an eventual script.
My vision/intent was to remove the reliance upon splitting files and having the user specify the actual experiment file(s) wanted to be analyzed and then process those piecewise by whatever amount of memory is available to read/hold the data at one time. Understanding the sequence of which files and how those files were built was the point behind the last Q? of just what that list of files initially read actually represents.
If could manage to reduce a bunch of the machinations on doubly-dimensioned cell arrays and such along the way, that would have been gravy in reducing overhead in both memory and speed.
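To make that direction concrete, here's a rough skeleton of the kind of factoring described above; everything in it is an assumption, including the invented names processExperiment and analyzeChunk and the opts fields, which would stand in for the input() prompts:
function S = processExperiment(fileList, opts)
% Hypothetical wrapper: read each experiment file in opts.NperSet-row chunks
% and hand every chunk to the (refactored) analysis, with no intermediate files.
S.stats = {};
for i = 1:numel(fileList)
    fid = fopen(fileList{i},'r');
    while ~feof(fid)
        data = cell2mat(textscan(fid,'\t%f',opts.NperSet));
        if isempty(data), break, end
        S.stats{end+1} = analyzeChunk(data, opts);   % analyzeChunk = the script's guts as a function
    end
    fclose(fid);
end
end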
Adam Danz
2019-8-31
Yeah I (still) agree that there's no need to store the segmented data in text files and that dpb's approach is the better one.
dpb
2019-8-31
On the comment about hidden and accepted bugs -- just for the record I did err in my earlier post regarding the comparison/subtraction of polynomial coefficients from observations; the code at that point indeed does correctly detrend the data for the x values selected.
I was, however, still at the point that I hadn't quite determined just why the x values were/are being selected as they are for the independent variable in the plots...it probably is ok if they have used this successfully for so long, but it still seems a peculiar way to have coded it if it is just piecing back together the time series/building a time vector from a fixed sample rate that I hadn't yet got my head around just what is behind having been done the way it is.
More Answers (1)
Adam Danz
2019-8-28
Edited: Adam Danz
2019-8-29
This solution is quite fast and uses fgetl() to read in blocks of a text file and saves those blocks to a new text file. You can set the number of rows per block and other parameters at the top of the code. See comments within the code for more detail.
% Set the max number of lines per file. The last file may have fewer rows.
nLinesPerFile = 10000;
% Set the path where the files should be saved
newFilePath = 'C:\Users\name\Documents\MATLAB\datafolder';
% Set the base filename of each new file. They will be appended with a file number.
% For example, 'data' will become 'data_1.txt', 'data_2.txt' etc.
newFileName = 'data';
% Set the file that will be read (better to include the full path)
basefile = 'testv2.txt';
% Open file for reading
fid = fopen(basefile);
fnum = 0; % file number
done = false; %flag that ends while-loop.
while ~done
% Read in the next block; this assumes the data starts
% at row 1 of the txt file. If that is not the case,
% adapt this so that the header rows are skipped.
tempVec = nan(nLinesPerFile,1);
for i = 1:nLinesPerFile
nextline = fgetl(fid);
if ~ischar(nextline) % fgetl returns -1 at end of file
done = true;
tempVec(isnan(tempVec)) = []; % drop the unused preallocated rows
break
else
tempVec(i) = str2double(nextline);
end
end
% Write the block to a new text file.
if ~isempty(tempVec)
fnum = fnum+1;
tempFilename = sprintf('%s_%d.txt',newFileName,fnum); % better to include a full path
tempFile = fullfile(newFilePath,tempFilename);
fid0 = fopen(tempFile,'wt');
fprintf(fid0,'%.6f\n',tempVec);
fclose(fid0);
% (optional) display link to folder
disp(['<a href="matlab: winopen(''',newFilePath,''') ">',tempFilename,'</a>', ' saved.'])
end
end
fclose(fid);
5 Comments
hamad javaid
2019-10-8
Dear Adam,
I have .txt files with 8 or 14 columns and thousands of continuous rows. I used this code to split the file into separate blocks of 2500 rows and it converted completely, but the output files were created with NaN and only one column (NaN written 2500 times), as I have a comma-separated file.
Any suggestions please?
Adam Danz
2019-10-8
Hi Hamad,
I would use the debug feature.
Put a breakpoint at the top of your code and step through each line, looking at the outputs. "tempVec" produces a vector of NaNs. Maybe those values are never being filled?
hamad javaid
2020-6-15
Dear Adam,
I have worked on this and it goes very well when the text file has a single column. I have attached my sample file; when I use this code, it creates the subfiles but only one column with NaN. Kindly guide me through this, as the code itself is working fine. Maybe I am having trouble with the comma ',' as delimiter?
Thank you so much
Adam Danz
2020-6-15
My answer pertains to the main question which asks about text files that have a single column of data.
In your case, check out readmatrix(). If you read the documentation for that function, you'll see optional inputs that specify the line number on which your numeric data starts, which will be useful in your case. Also check out readtable() for an alternative.
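A minimal sketch of that suggestion, assuming a comma-separated file with one header line (the file name, header count, and 2500-row block size are assumptions based on the comments above):
% Sketch: read a comma-separated, multi-column text file and re-split it by rows.
M = readmatrix('sample.txt', 'Delimiter',',', 'NumHeaderLines',1);
nRows = 2500;
for k = 1:ceil(size(M,1)/nRows)
    rows = (k-1)*nRows+1 : min(k*nRows, size(M,1));
    writematrix(M(rows,:), sprintf('block_%d.txt',k), 'Delimiter',',');
end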