- Don't use a text file, go binary
- Split your text file in manageable chunks beforehand.
- Use a database instead
Data is not saving to the workspace
Aaron Smith
2017-2-10
I have a large text file composed of a single row of 52480000 numbers separated by semicolons. I'm attempting to organize the data into 51250 rows of 1024 numbers and then separate this into distinct blocks of 1025 x 1024. The numbers need to stay in the same order they were in the original file (with every 1025th number being the start of a new row). I have tried using a while loop with an if statement.
R = 51250;
C = 1024;
fid = fopen( 'TEST_A.asc');
k = 0;
while ~feof(fid)
z = textscan( fid, '%d', R*C, 'EndOfLine', ';');
if ~isempty(z{1})
k = k + 1;
s = fprintf( 'TEST_A.asc', ';');
dlmwrite( s, reshape( z{1}, 1025, []), ';')
end
end
fclose(fid);
This code does not create an initial cell of 52480000 numbers, which means that none of the subsequent data sets (s & z) are created in the workspace. The problem is that if I textscan the data into Matlab before formatting it, the file creates a memory error. Does anyone notice anything that I don't about this code or have any pointers?
26 comments
José-Luis
2017-2-10
Edited: José-Luis
2017-2-10
What is the size of that file? If the numbers had been stored in a binary file in double precision, that would still be more than 400MB. A text file is bound to be much larger and despite impressive progress GB files are a pain to process.
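There are several ways of tackling this. Off the top of my head:
- Don't use a text file, go binary.
- Split your text file into manageable chunks beforehand.
- Use a database instead.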
There are other ways but I can't be more specific without knowing what you are trying to achieve.
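The 400 MB estimate follows from the number count given in the question. A quick check of the arithmetic, assuming 8 bytes per double-precision value (the variable names are illustrative only):

```matlab
nValues = 52480000;          % number count quoted in the question
bytesPerDouble = 8;          % size of one IEEE double-precision value
nBytes = nValues * bytesPerDouble;
fprintf('%d bytes = %.2f MB\n', nBytes, nBytes/1e6);
```

A text file stores each value as several digit characters plus a separator, so its size depends on how many digits the numbers have.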
Stephen23
2017-2-10
Edited: Stephen23
2017-2-10
See earlier question:
"I'm attempting to organize the data into 51250 rows of 1024 numbers and then separate this into distinct blocks of 1025 x 1024"
Why do you need this intermediate step?
My answer showed you how to simply process exactly those blocks of 1025*1024, avoiding that intermediate matrix entirely. What do you gain by creating that huge matrix that you don't even want? My code shows how you can go directly to the smaller matrices (which seems to be your aim) without having to read the whole file data into MATLAB and without needing the intermediate step of rearranging all of the data into one pointlessly huge matrix.
Why not just read the blocks you need (1025*1024) instead of wasting time and memory with that huge matrix?
"The numbers need to stay in the same order they were in in the original file (with every 1025th number being the start of a new row) "
Yes, and that is what my answer does. Change R = 51250; back to R = 1025; and this code will work too.
Aaron Smith
2017-2-10
Like I said, using your code, there is no output data. z and s do not appear in the workspace, and when I made alterations that did give s and z in the workspace, they were empty cells.
Aaron Smith
2017-2-10
The same problem occurs with that R value. I changed the values, hedging my bets but it didn't make a difference to the result
Aaron Smith
2017-2-10
z is 1 in my workspace and k is also 1. There is now an error occurring with reshape: Error using reshape Product of known dimensions, 1025, not divisible into total number of elements, 1.
Stephen23
2017-2-10
Edited: Stephen23
2017-2-10
"z is 1" z is actually a cell array, so it cannot be equal to one. What do you really mean?
textscan is not reading the data file. Possibly the format is not as expected. Do the numbers have decimal digits, or exponent notation? Please run this and tell me exactly what values out has (it will be slow):
fid = fopen('file.txt','rt');
out = [];
while ~feof(fid)
tmp = unique(fgets(fid,1e5));
out = union(out,double(tmp));
end
fclose(fid);
disp(out)
And also show exactly what this displays:
fid = fopen('file.txt','rt');
str = fgets(fid,60)
fclose(fid);
Aaron Smith
2017-2-10
>> fid = fopen( 'TEST_A.asc', 'rt');
>> out = [];
>> while ~feof(fid)
tmp = unique(fgets(fid, 1e5));
out = union(out, double(tmp));
end
>> fclose(fid);
>> tmp
tmp =
067
>> out
out =
10 48 49 50 51 52 53 54 55 56 57
The data are all integers between 0 and 1000, though some may be over 1000. I just haven't been able to spot any numbers over 800. The file does have over 50 million numbers though.
fid = fopen( 'TEST_A.asc', 'rt' );
>> str = fgets(fid, 60)
str =
1
>> fclose(fid);
Stephen23
2017-2-10
Edited: Stephen23
2017-2-10
@Aaron Smith: the file contains newline characters (char 10), which means your original description of the file format "I have a very large text file composed of, in essence one row of numbers." is incorrect. Also your original question had code where you used textscan with semicolon delimiter. But there is not one single semicolon in the whole file.
As a result, that code tells textscan to read a file with a particular format, but that is not the format the file actually has, because I wrote the code based on what you told me.
You can either experiment with textscan's options (e.g. EndOfLine, Delimiter, etc) yourself, or you can tell us exactly what format the file really has. If you want help then please upload a sample text file (the first two thousand numbers or so) in a new comment.
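The diagnostic above lists the distinct character codes found in the file. In the reported output only 10 (newline) and 48 to 57 (the digits 0 to 9) appear, and code 59 (the semicolon) is absent. A minimal sketch of the same check on a short, hypothetical sample string in the format the question describes:

```matlab
% Hypothetical sample line: semicolon-separated integers ending in a newline
str = sprintf('1;658;671\n');
codes = unique(double(str));                 % distinct character codes, sorted
disp(codes)
hasSemicolon = any(codes == double(';'));    % ';' is character code 59
hasNewline   = any(codes == 10);             % newline is character code 10
```

If `hasSemicolon` is false for the real file, a textscan call that relies on semicolons cannot split the values as intended.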
Aaron Smith
2017-2-10
The file did not unzip correctly today so the file was not correct. I downloaded it again and unzipped it again
>> fid = fopen( 'TEST_A.asc', 'rt' );
str = fgets(fid, 60)
fclose(fid);
str =
1;658;671;661;686;672;662;645;654;669;675;650;688;666;664;66
This is the other test code you wrote
>> fid = fopen( 'TEST_A.asc', 'rt');
out = [];
while ~feof(fid)
tmp = unique(fgets(fid, 1e5));
out = union(out, double(tmp));
end
fclose(fid);
tmp
out
When I tried the original code with the newly properly unzipped file
R = 1025;
C = 1024;
fid = fopen('TEST_A.asc');
k = 0;
while ~feof(fid)
z = textscan( fid, '%d', R*C, 'EndOfLine', ';');
if ~isempty(z{1})
k = k + 1;
s = sprintf( 'TEST_A.asc', ';');
dlmwrite( s, reshape( z{1}, R, []), ';')
end
end
fclose(fid);
This gave an output for z which was a 1025 x 1 cell. This cell is the first row
Aaron Smith
2017-2-10
R = 1025;
C = 1024;
opt = { 'EndofLine', ';', 'CollectOutput', true};
fid = fopen('TEST_A.asc');
k = 0;
while ~feof(fid)
z = textscan( fid, '%d', R*C, opt{:});
if ~isempty(z{1})
k = k + 1;
s = sprintf( 'TEST_A.asc', ';');
dlmwrite( s, reshape( z{1}, R, []), ';')
end
end
Error using reshape
Product of known dimensions, 1025, not divisible into total number of elements, 1.
I tried it again and got a different error
Stephen23
2017-2-10
Edited: Stephen23
2017-2-10
@Aaron Smith: What is k's value when you get that error?
You have been asked twice to upload a sample file. It will be difficult to help you further without it.
I know my code works: I tested it. I even gave you the code that I used to generate the fake data file. If there is any problem then it is because your data file does not match the expected format somehow. So we need to see it.
Could it be that the number of values in the file is not divisible by 50*1025 ? If so then you might need a special case to handle the last matrix. Again, knowing the value of k and a sample file would be helpful.
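One way to see whether a special case for the last block is needed is to take the remainder of the value count against the block size. A minimal sketch, assuming the count quoted in the question is accurate (the variable names are illustrative):

```matlab
R = 1025;                              % rows per block
C = 1024;                              % values per row
nValues  = 52480000;                   % number count quoted in the question
nBlocks  = floor(nValues / (R*C));     % number of complete blocks
leftover = rem(nValues, R*C);          % values that do not fill a block
fprintf('%d complete blocks, %d values left over\n', nBlocks, leftover);
```

A non-zero `leftover` would mean the final read returns fewer than R*C values, which is the case that makes reshape fail.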
Stephen23
2017-2-10
Edited: Stephen23
2017-2-10
@Aaron Smith: Try this. It saves each block of 1025x1024 values in its own file, and if there are any values left over at the end it saves them in one row in a new file:
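sbd = 'tempDir';   % subdirectory holding the input and output files
R = 1025;          % rows per block
C = 1024;          % values per row
opt = {'EndOfLine',';', 'CollectOutput',true};
fid = fopen(fullfile(sbd,'temp0.txt'),'rt');
k = 0;
while ~feof(fid)
    k = k+1;
    % read at most one block of R*C values
    Z = textscan(fid,'%d', R*C, opt{:});
    S = fullfile(sbd,sprintf('temp0_%02d.txt',k));
    if rem(numel(Z{1}),R)==0
        % complete block: reshape, then transpose to keep the file order
        dlmwrite(S,reshape(Z{1},[],R).',';')
    else
        % leftover values: write them without reshaping
        dlmwrite(S,Z{1},';')
    end
end
fclose(fid);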
Note that I also added a transpose to get the data in the correct order.
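The transpose matters because reshape fills a matrix column by column. A small sketch with six stand-in values (the thread uses R = 1025) shows the difference:

```matlab
v = 1:6;                   % stand-in for values read from the file, in file order
R = 2;                     % number of rows wanted in the block
A = reshape(v, R, []);     % columns are filled first:        [1 3 5; 2 4 6]
B = reshape(v, [], R).';   % each row follows the file order: [1 2 3; 4 5 6]
disp(A)
disp(B)
```

Only `B` keeps consecutive values from the file next to each other within a row.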
Aaron Smith
2017-2-13
Edited: Aaron Smith
2017-2-13
Thanks so much Stephen. I got an error on the code but I think it might be a problem with the file itself or the save path.
sbd = 'tempDir';
R = 1025;
C = 1024;
opt = {'EndOfLine', ';', 'CollectOutput', true};
fid = fopen(fullfile( sbd, 'TEST_A.asc' ), 'rt');
k = 0;
while ~feof(fid)
k = k+1;
Z = textscan(fid, '%d', R*C, opt{:});
S = fullfile( sbd, sprintf( 'TEST_A_A.asc', k ));
if rem(numel( z{1}), R)==0
dlmwrite(S, reshape( z{1}, [], R).', ';')
else
dlmwrite( S, z{1}, ';')
end
end
fclose(fid);
Error using feof
Invalid file identifier. Use fopen to generate a valid file identifier
I'm sure I'll be able to fix that. What does sbd do? Is it system build which builds the blocks or does it make fullfile create separate files for the blocks rather than build a full file from parts the way fullfile usually does or is it just the temporary name of the files?
Walter Roberson
2017-2-13
sbd is the name of the subdirectory to save the individual files into. You can set it to '' if you do not want to use a subdirectory to store them
Stephen23
2017-2-13
Edited: Stephen23
2017-2-13
sbd = 'tempDir';
is a subdirectory of the current directory. I put all of the files into this subdirectory because I did not want them cluttering up my current directory. You can make the subdir '' if you want to use the current directory, or (even better) learn to use directory paths and put your data in its own subdirectory.
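A short sketch of how the output file names are assembled, with and without a subdirectory. The counter value is hypothetical. The last line shows that a format string with no conversion such as %02d gives the same name for every k, so each block would overwrite the previous file:

```matlab
sbd = 'tempDir';                                           % subdirectory name used in the thread
k = 3;                                                     % hypothetical block counter
withSub    = fullfile(sbd, sprintf('temp0_%02d.txt', k));  % file inside the subdirectory
withoutSub = fullfile('',  sprintf('temp0_%02d.txt', k));  % '' means the current directory
fixedName  = sprintf('TEST_A_A.asc', k);                   % no conversion in the format: k is ignored
disp(withSub)
disp(withoutSub)
disp(fixedName)
```

fullfile inserts the file separator of the platform it runs on, so the same code works on Windows and Unix-like systems.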
Aaron Smith
2017-2-13
Yeah, I worked that out from reading pages on Matlab and by writing a description of the code. Thanks guys. Any idea what the problem with the file identifier might be? It came up before and seemed to just go away after a few times typing it out. That hasn't worked this time. It isn't the save path or the file name that is causing the problem as far as I know.
Stephen23
2017-2-13
@Aaron Smith: get the second output from fopen:
[fileID,errmsg] = fopen(...)
and read the error message. It always turns out to be a spelling mistake, folder permissions, or the file not being in the location that they are looking in.
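A minimal sketch of that check, using a deliberately non-existent path built with tempname (the path is an assumption for illustration only). fopen returns -1 when it fails, and the second output then holds the system's error message:

```matlab
missing = fullfile(tempname(), 'TEST_A.asc');   % a path that should not exist
[fid, errmsg] = fopen(missing, 'rt');
if fid == -1
    % fopen failed: report why instead of passing an invalid fid to feof
    fprintf('Could not open %s: %s\n', missing, errmsg);
else
    fclose(fid);
end
```

Testing `fid` straight after fopen catches the problem before the "Invalid file identifier" error appears later in the loop.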
Aaron Smith
2017-2-14
Edited: Aaron Smith
2017-2-14
When using fopen outside of the code itself, it works fine and doesn't create an error. The only thing I can think it could be is the fullfile and sbd in the fopen command. I tried taking it out, moving it but that creates errors with the code. Is there a way to put the fullfile(sbd, ...) part in a separate line?
sbd = 'tempdir';
R = 1025;
C = 1024;
opt = { 'EndOfLine', ';', 'CollectOutput', true };
>> fid = fopen(fullfile(sbd,'TEST_A.asc'),'rt');
>> k = 0;
while ~feof(fid)
k = k + 1;
Z = textscan( fid, '%d', R*C, opt{:});
S = fullfile( sbd, sprintf( 'TEST_ASA.asc', k ));
if rem( numel( Z{1}), R)==0
dlmwrite( S, reshape( Z{1}, [], R).', ';')
else
dlmwrite( S, Z{1}, ';')
end
end
Error using feof
Invalid file identifier. Use fopen to generate a valid file identifier.
>> [fid, errmsg] = fopen( 'TEST_A.asc' )
fid =
9
errmsg =
''
I was thinking, looking at the fullfile page on MathWorks: should I set up a folder to be a destination for the file?
f = fullfile('myfolder','mysubfolder','myfile.m')
I'm thinking it may be the subdirectory (sbd) that is causing the error
Stephen23
2017-2-14
@Aaron Smith: just get rid of the fullfile if you don't want it.
However, I would recommend learning to use filepaths to access data files, as it makes your code faster and more reliable (e.g. compared to cd or other buggy ideas). Note that the file path I used is relative to the current directory, and that this may be different for the command window and the code that is being called: that path needs to exist relative to where the code runs from. One simple resolution is to always specify an absolute path. The internet is full of help on understanding relative/absolute paths, but you might as well start here:
"Is there a way to put the fullfile(sbd, ...) part in a separate line" Sure, it is just a function, you can put it wherever you want to.
Aaron Smith
2017-2-15
Is there a way for me to share my data file with you so that you can try your code with the actual data? The file is approximately 200 MB.
Accepted Answer
Stephen23
2017-2-15
Edited: Stephen23
2017-2-15
Thank you for the file. What did I learn from the actual data file: that it is not "composed of a single row", but in fact there are 51200 rows in the file that I received.
Why is this important? Because computers are stupid, and they do exactly what they are told to do. Knowing how to read a file correctly requires knowing what format the file has. In this case it is also quite handy for us, because it is trivial to read and write lines without much processing.
The code below worked correctly for me, reading the 200 MB file, and creating 50 smaller files with the rows following the same order as the original file.
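sbd = 'temp';   % subdirectory holding the input and output files
f2d = fopen(fullfile(sbd,'temp_01.asc'),'wt');   % first output file
f1d = fopen(fullfile(sbd,'TEST_A.asc'),'rt');    % large input file
k = 0;
while ~feof(f1d)
    str = fgetl(f1d);   % read one line as text
    if sscanf(str,'%d')==1
        % a line count of 1 marks the start of a new block: switch output file
        k = k+1;
        fclose(f2d);
        fnm = fullfile(sbd,sprintf('temp_%02d.asc',k));
        f2d = fopen(fnm,'wt');
    end
    fprintf(f2d,'%s\n',str);   % copy the line unchanged
end
fclose(f1d);
fclose(f2d);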
Note that:
- the size of the output matrices is 1024x1025 (because there are 1025 numbers per line). This is correct because the first number of each line is simply a line count (check the files and you will see).
- the lines are exactly the same as the original file.
- MATLAB holds only one line at a time: the lines are simply read from the large file and written directly to a new file.
- as a result: no matrix, no converting from string to numeric and back to string.
- it is slow because the file is large... reading and writing 51200 lines of 1025 numbers each will take some time.
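The block boundaries in the code above come from the leading line count: sscanf with '%d' reads the first integer and stops at the first semicolon, and a value of 1 starts a new block. A small sketch on hypothetical lines (the cell array and the `blockOfLine` vector are illustrative only):

```matlab
% Hypothetical lines: the leading number is the line count within a block
sample = {'1;658;671', '2;661;686', '1;672;662', '2;645;654'};
k = 0;                                   % block counter
blockOfLine = zeros(1, numel(sample));   % block index assigned to each line
for n = 1:numel(sample)
    first = sscanf(sample{n}, '%d');     % reads the leading integer, stops at ';'
    if first == 1
        k = k + 1;                       % a line count of 1 starts a new block
    end
    blockOfLine(n) = k;
end
disp(blockOfLine)
```

This relies on the line count restarting at 1 for every block, as it does in the file described in the answer.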
7 comments
Aaron Smith
2017-2-16
Thanks Stephen. I knew about the line count number, I was just attributing it to columns rather than rows. I did think it was all one single row of data. Anyway, thanks so much for your continued help. There is an error message showing up but I'm not sure if there is a fix for it.
>> sbd = 'temp';
>> fid2 = fopen(fullfile( sbd, 'temp_01.asc'), 'w');
>> fid1 = fopen(fullfile( sbd, 'TEST_A.asc' ), 'r');
>> k = 0;
>> while ~feof(fid1)
str = fgetl(fid1);
if sscanf( str, '%d' )==1
k = k + 1;
fclose(fid2);
fnm = fullfile( sbd, sprintf( 'temp_%02d.asc', k));
fid2 = fopen( fnm, 'w');
end
fprintf(fid2, '%s\n', str);
end
Error using feof
Invalid file identifier. Use fopen to generate a valid file identifier.
[fid1, errmsg] = fopen( 'TEST_A.asc' )
fid1 =
6
errmsg =
''
>> [fid2, errmsg] = fopen( 'test_01.asc', 'w')
fid2 =
7
errmsg =
''
Stephen23
2017-2-16
Edited: Stephen23
2017-2-17
"i'm not sure if there is a fix for it."
You need to provide the correct filepath for your files. I put all of my files into one sub-directory of the current path named "temp". That worked for me. Do you see "temp" at the start of my code?
Imagine that you tell MATLAB (or any other programming language that has ever existed) to open this file: 'C:\Temp\myfile.txt'. But what should happen if there is no such file in that location? The programming language cannot read your mind: it cannot guess that you actually meant another location, e.g. 'C:\Temp\testfiles\myfile.txt', or that the file is actually called 'my_mistake.csv'. YOU are the one who has to know where your files are, and YOU have to provide the correct path to fopen (via fullfile if used).
So look at my code: I used a sub-directory named "temp". My files were all in that sub-directory. So I told MATLAB to look in that sub-directory. But when you test for those files like this:
[fid1, errmsg] = fopen( 'TEST_A.asc' )
Where is it looking?: ONLY IN THE CURRENT DIRECTORY. You did not tell fopen to look in any sub-directory, or in any other directories anywhere in your computer, or even anywhere else in the known universe. Just the current directory. Let me ask a question: is the file 'TEST_A.asc' in the current directory? If the answer is no, then why are you telling MATLAB to look for it in the current directory?
fopen failures are most commonly caused by one thing: users not giving the correct path (which includes spelling mistakes of the name).
"i'm not sure if there is a fix for it."
The fix is that you provide fopen with the correct path.
PS: [fid2, errmsg] = fopen( 'test_01.asc', 'w') is a pointless test because it just creates that file wherever you tell it to: see the "w" option? That creates a file. It does not care where.
PPS: Why did you get rid of the t option? You should keep it (unless you plan on doing strange things with EOL characters). Removing random things is not a good way of making code work.
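A sketch of checking a path before opening it, built from the subdirectory and file names used in the thread. Using `pwd` as the base folder is an assumption for illustration. In practice the base would be the known folder that holds the data:

```matlab
sbd   = 'temp';                          % subdirectory name used in the thread
fname = 'TEST_A.asc';                    % data file name used in the thread
absPath = fullfile(pwd, sbd, fname);     % full path, anchored to a known base folder
if exist(absPath, 'file') == 2
    fprintf('Found %s\n', absPath);
else
    fprintf('No file at %s\n', absPath);
end
```

Printing the full path makes it obvious which folder is actually being searched.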
Walter Roberson
2017-2-16
"fopen failures are caused by one thing: users not giving the correct path"
Well, that and permission errors. And networked file access to a server that is not accessible. And bugs in file sharing applications like DropBox. And bugs in using UNC paths. And VPN setup. And encryption certificate problems. And full disks. ...
Aaron Smith
2017-2-20
Thanks Stephen. I did eventually get the code working. The problem was with the save path. I had to specify destinations with the entire path (C\ files\ folder\ folder). You mentioned the first number on each line, the line number (1, 2, 3, 4 etc). Is there a way to remove or ignore this the way the headerlines command in the textscan function does?
Stephen23
2017-2-20
Edited: Stephen23
2017-2-21
You could use the sscanf call to get an index, e.g.:
>> str = '10;123;456;789;0;123;';
>> [row,~,~,idx] = sscanf(str,'%d')
row =
10
idx =
3
Or in my answer (untested):
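sbd = 'temp';   % subdirectory holding the input and output files
f2d = fopen(fullfile(sbd,'temp_01.asc'),'wt');
f1d = fopen(fullfile(sbd,'TEST_A.asc'),'rt');
k = 0;
while ~feof(f1d)
    str = fgetl(f1d);
    % row: leading line count, idx: index of the first unread character
    [row,~,~,idx] = sscanf(str,'%d');
    if row==1
        % a line count of 1 marks the start of a new block: switch output file
        k = k+1;
        fclose(f2d);
        fnm = fullfile(sbd,sprintf('temp_%02d.asc',k));
        f2d = fopen(fnm,'wt');
    end
    % write the line without the line count and the semicolon after it
    fprintf(f2d,'%s\n',str(idx+1:end));
end
fclose(f1d);
fclose(f2d);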
Aaron Smith
2017-2-21
Thanks Stephen, that code works as far as I can see. What may I ask are the two ~ in the code doing?
Stephen23
2017-2-21
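In MATLAB, a ~ in an output list is a placeholder that discards that output. sscanf can return four outputs (the values read, a count, an error message, and the index of the next unread character), so the two ~ skip the second and third. A minimal sketch reusing the example string from the earlier comment:

```matlab
str = '10;123;456;789;0;123;';          % example string from the earlier comment
[row, ~, ~, idx] = sscanf(str, '%d');   % discard outputs 2 and 3 (count, error message)
rest = str(idx+1:end);                  % text after the line number and its ';'
disp(row)
disp(rest)
```

Without the placeholders, the fourth output could only be reached by naming two variables that are never used.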