Hi, I want to read a long file (with an unknown number of lines), then split it into files of 'x' lines each and read the data from each new split file. Any help on how to do this would be much appreciated.
Thanks in advance.

8 Comments

Splitting a file into blocks of 'x' lines is easily done. However, it's not clear what you mean by "read the data for each new splited data". What does "read" actually mean? And why would you not read the whole data and do the splitting afterwards (if it's actually needed)?
Hi Guillaume, first, thanks for your contribution.
Secondly, what I'm trying to do is get 36000-line .txt files from an 800,000-line .txt file. I will then use those shorter files in my .m file, but some of them will also be required in another application.
That's why I need to actually create the files.
As you mentioned, splitting a file into blocks of 'x' lines is easily done. Can you provide the code that does this?
At this point I have:
fid = fopen("input_file.txt", "rt");
format = "%s %f %s %f %s %f %s %f %s %f %s %f %s %f %s %f %s %f";
data = textscan(fid, format);
fclose(fid);
% the important data
x = cell2mat(data(:,14));
z = cell2mat(data(:,18));
Which 36000 lines out of 800000? Is it the first 36000 lines, or lines chosen at random?
From the original 800000-line file I want new 36000-line files. Let's say:
new_file_1 = lines (1:36000) of old_800000_file
new_file_2 = lines (36001:72000) of old_800000_file
new_file_3 = lines (72001:108000) of old_800000_file
and so on until no lines of the 800000-line file remain.
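The splitting described above can be sketched with plain line-by-line file I/O, which copies each line verbatim and should also work on older releases such as R2017b. This is only a minimal sketch: the demo file, the output folder (`tempdir`) and the `new_file_%d.txt` naming are assumptions made for illustration, and the block size is reduced to 3 so the example runs on a 7-line stand-in file.

```matlab
% Demo input: a small 7-line file in the temp folder (stand-in for the real file)
infile = fullfile(tempdir, 'demo_input.txt');
fid = fopen(infile, 'wt');
fprintf(fid, 'line %d\n', 1:7);
fclose(fid);

outfolder = tempdir;   % assumption: write the pieces to the temp folder
blocksize = 3;         % lines per output file (36000 in the question)

fin = fopen(infile, 'rt');
assert(fin ~= -1, 'Could not open %s', infile);
fileindex = 0;         % number of output files created so far
linecount = 0;         % number of lines copied so far
fout = -1;             % no output file open yet
txt = fgets(fin);      % fgets keeps the newline, so lines are copied as they are
while ischar(txt)
    if mod(linecount, blocksize) == 0      % time to start a new output file
        if fout ~= -1, fclose(fout); end
        fileindex = fileindex + 1;
        outname = fullfile(outfolder, sprintf('new_file_%d.txt', fileindex));
        fout = fopen(outname, 'wt');
    end
    fprintf(fout, '%s', txt);
    linecount = linecount + 1;
    txt = fgets(fin);  % returns -1 (not char) at end of file, which ends the loop
end
if fout ~= -1, fclose(fout); end
fclose(fin);
```

For the real file, `infile`, `outfolder` and `blocksize` would be replaced with the actual path, folder and 36000. Because the lines are never parsed, any trailing `***` is carried over unchanged.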
Would you happen to be using Mac, or Linux?
No, windows. Matlab version 2017b
Do you have an example text file we can test code against?


Accepted Answer

Your file has a bit of an odd format, in particular some lines have an extra *** at the end. It's not clear if it's significant or not and if it needs to be preserved. Since you weren't reading it with your textscan format, I assume not.
It's also not clear what formatting should go in your output file. Since you're using textscan, I assume it doesn't need to be exactly identical to the input.
A very easy way to read a file in blocks of fixed size is with the datastore family of functions. It's all implemented for you. The following works on R2019b. There have been many improvements to the datastores since 2017b, so it may not work as well for you:
%create a datastore (tabulartext in this case) to read the file
%note that * is treated as a delimiter simply so that it is ignored
ds = tabularTextDatastore('input_file.TXT', 'Delimiter', {' ', '*'}, 'NumHeaderLines', 0, 'MultipleDelimitersAsOne', true);
%specify how many rows to read at once
ds.ReadSize = 36000;
%output folder, and basename (with formatting for file number)
outfolder = 'C:\somewhere\somefolder';
basename = 'split_%03d.txt'; %03d zero-pads the file number (%3d would pad with spaces)
%read blocks in a loop. Save them somewhere. do extra processing
blockindex = 0;
while hasdata(ds)
    blockindex = blockindex + 1;
    data = read(ds); %read a block (the method name is lowercase)
    outname = fullfile(outfolder, sprintf(basename, blockindex)); %construct full name of output file
    writetable(data, outname, 'WriteVariableNames', false); %write to output file
    %... some more processing
end
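As a quick sanity check on the loop above, the expected number of output files and the size of the last, shorter block can be worked out from the line counts mentioned in the question. This small sketch assumes the file has exactly 800000 lines and that every read returns a full block except the last one; the variable names are made up for illustration.

```matlab
totallines = 800000;   % approximate line count quoted in the question (assumed exact here)
blocksize  = 36000;    % same value as ds.ReadSize
numfiles   = ceil(totallines / blocksize);               % number of blocks, the last may be partial
lastblock  = totallines - (numfiles - 1) * blocksize;    % lines left for the final block
fprintf('%d files, the last one with %d lines\n', numfiles, lastblock);
```

With these numbers there are 22 full blocks (792000 lines) plus one final block of 8000 lines, so 23 files in total.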

4 Comments

Hi there again, following Guillaume's advice I've implemented code that reads all the data and then splits it. The data is then saved in a matrix with equal-length columns, where each column represents a new data vector split from the original file.
My code is:
input_file = "input_file.TXT";
lines = 36000;
fid = fopen(input_file, "rt");
format = "%s %f %s %f %s %f %s %f %s %f %s %f %s %f %s %f %s %f";
data = textscan(fid, format);
fclose(fid);
aux_x = cell2mat(data(:,14));
aux_z = cell2mat(data(:,18));
n = floor( length(cell2mat(data(:,2))) / lines);
for i = 0:n
for j = i*lines+1:(i+1)*lines
x (j,i+1) = aux_x(j);
z (j,i+1) = aux_z(j);
end
end
My problem now is that I end up with a huge, triangular-looking matrix that is mostly filled with zeros. Something like:
|1 0 0...|
|2 0 0...|
|3 0 0...|
|0 4 0...|
|0 5 0...|
|0 0 6...|
I want to remove those zeros and align all my values to row 1 of the matrix.
Any help?
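The staircase of zeros most likely comes from using the global index j as the row index: block i+1 is written to rows i*lines+1 onwards instead of rows 1 to lines. A minimal sketch of the same double loop with a separate row index is shown here on made-up data (6 values, blocks of 3); `aux_x` stands in for one column of the real data.

```matlab
aux_x = (1:6).';                   % stand-in for one column of data (hypothetical values)
lines = 3;                         % block length (36000 in the real case)
n = floor(numel(aux_x) / lines);   % number of complete blocks
x = zeros(lines, n);               % preallocate: one column per block
for i = 1:n
    for k = 1:lines
        j = (i - 1) * lines + k;   % global position in aux_x
        x(k, i) = aux_x(j);        % k (not j) is the row inside block i
    end
end
disp(x)
```

Here the first column holds values 1 to 3 and the second holds values 4 to 6, with no zero padding. Leftover values that do not fill a complete block are ignored in this sketch.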
I'm confused at what you're trying to achieve exactly. What's the ultimate goal? What is this going to be used for?
Since you only seem to care about column 14 and 18, you should change your format to:
format = "%*s %*f %*s %*f %*s %*f %*s %*f %*s %*f %*s %*f %*s %f %*s %*f %*s %f"; %only 14th and 18th fields don't have *
You also need to learn how to manipulate cell arrays. The proper way to get the content of cells 14 and 18 (with your original format; with the reduced format above they become cells 1 and 2) is:
aux_x = data{:, 14}; %use {} to get the content of a cell
aux_z = data{:, 18};
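The difference between the two kinds of indexing can be seen on a tiny cell array. The array `c` below is a made-up stand-in for a textscan result, not the real data.

```matlab
c = {[1; 2; 3], [4; 5; 6]};   % hypothetical 1x2 cell array, similar in shape to a textscan result
a = c(1);                     % parentheses return a 1x1 cell that still wraps the data
b = c{1};                     % braces return the 3x1 double stored inside the cell
disp(class(a))                % shows the class of the parenthesis result
disp(class(b))                % shows the class of the brace result
```

Since `b` is already a numeric column, no cell2mat call is needed.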
Alternatively, you could use readtable, with a fixedWidthImportOptions object:
opts = fixedWidthImportOptions('NumVariables', 18, 'VariableWidths', [5 5 5 5 5 5 5 6 5 6 5 6 5 7 5 7 5 7], 'VariableTypes', repmat({'char', 'double'}, 1, 9));
opts.ExtraColumnsRule = 'ignore'; %for the rows that have a ***
opts.VariableNames([14, 18]) = {'aux_x', 'aux_z'}; %optional
opts.SelectedVariableNames = opts.VariableNames([14, 18]);
data = readtable('input_file.TXT', opts);
Or you can also use the datastore as I've shown in my answer. As with tables, you can tell the datastore to only keep the columns of interest:
%before reading from the datastore:
ds.SelectedVariableNames = ds.VariableNames([14, 18]);
If you're trying to reshape the two columns into a matrix each, then use reshape. Of course, since all columns of a matrix must have the same height, you may need to pad the last column (with NaNs or 0s) beforehand:
%example using a table. If using the output of textscan, remove the "data." prefix
lines = 36000;
numpadding = mod(-height(data), lines); %if using textscan output use size(aux_x, 1) instead of height(data)
aux_x = reshape([data.aux_x; nan(numpadding, 1)], lines, []);
aux_z = reshape([data.aux_z; nan(numpadding, 1)], lines, []);
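To see how the padding arithmetic works, here is the same idea on a made-up vector of 7 values split into columns of 3. The names `v` and `m` are hypothetical and only used for this illustration.

```matlab
v = (1:7).';                          % hypothetical data: 7 values
lines = 3;                            % column height (36000 in the real case)
numpadding = mod(-numel(v), lines);   % values needed to reach the next multiple of lines
m = reshape([v; nan(numpadding, 1)], lines, []);   % pad with NaN, then fold into columns
disp(m)
```

With 7 values and a height of 3, two NaNs are appended, which gives a 3-by-3 matrix whose last column contains 7 followed by two NaNs. When the length is already a multiple of `lines`, `numpadding` is 0 and nothing is added.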
Carolina Sainz's comment mistakenly posted as an answer moved here:
Hi again,
I've created a function that splits the file as I want to:
function [x,z,hora] = SplitFile (input_file, lines)
fid = fopen(input_file, "rt");
format = "%*s %*f %*s %*f %*s %*f %*s %*f %*s %*f %*s %*f %*s %f %*s %*f %*s %f";
data = textscan(fid, format);
fclose(fid);
aux_x = data{:,1};
aux_z = data{:,2};
n = floor(length(aux_x)/lines); %number of complete blocks (floor applied to the quotient)
j = 1;
for i = 0:n-1
for k = 1:lines
x (k,i+1) = aux_x(j);
z (k,i+1) = aux_z(j);
j = j+1;
end
end
hora = 0:n-1;
end
But I've found a problem: sometimes the data is saved as NaN. I think it might be because of the *** at the end of some lines. As these *** are not present on every line, can you help me overcome this NaN issue?
Many thanks
"Can you help me overcome this NaN issue?"
I've already given you two ways to overcome the issue. The datastore option and the readtable option. Both of which I specifically wrote to cope with the *** issue.
I've also given you a much more efficient and faster way of producing your x and z matrices, one which doesn't need a loop and does the job in just 3 lines.
So, I'm a bit puzzled by your replies. It doesn't appear that you're taking my answers on board.
If you want to continue using textscan, the easiest way to cope with the *** is with:
data = textscan(fid, format, 'CommentStyle', '***');
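The effect of the 'CommentStyle' option can be tried without a file, since textscan also accepts a character vector as input. The two lines of text below are made up for illustration (they are not taken from the real file), and the format is shortened to four fields.

```matlab
% Two made-up lines; only the first one ends with ***
txt = sprintf('A 1.5 B 2.5 ***\nA 3.5 B 4.5\n');
% Skip the text fields, keep the two numbers, and ignore everything from *** to the end of the line
data = textscan(txt, '%*s %f %*s %f', 'CommentStyle', '***');
x = data{1};   % first numeric field of every line
z = data{2};   % second numeric field of every line
disp([x z])
```

If the option behaves as documented, both lines should be parsed the same way, giving two rows of numbers and no NaN.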
