Convert files into one matrix

hi,
I have 177,000 files, and I have to create one matrix containing all the values from these files.
Each file is split using textscan to get
c{1}, c{2}, ...
which I then convert into a matrix, and then I combine all these matrices into a single matrix.
The problem is that the files share some values, so I have to identify the shared values and append all the other values (the rest of the row) attached to them.
I timed a run with 100 files, and found that the running time is very long even for just 100 files.
I think that if I find a function that can compare c{1} across all files, c{2} across all files, etc., that will save time. I'm facing a problem with this code:
targetdir = 'd:\social net\dataset\netflix\training_set';
targetfiles = '*.txt';
fileinfo = dir(fullfile(targetdir, targetfiles));
k=0; arr(:,:)=0; inc=0; k=0; y=1;
for i = 1:length(fileinfo)
    thisfilename = fullfile(targetdir, fileinfo(i).name);
    f=fopen(thisfilename,'r'); f1=fscanf(f,'%c'); f1(1:2)=[];
    f2=fopen(thisfilename,'w'); fprintf(f2,'%c',f1);
    f3=fopen(thisfilename,'r');
    c = textscan(f,'%f %f %s','Delimiter',',','headerLines',1);
    c1=c{1}; c2=c{2}; c3=c{3}; z=1; z1=1; z2=1; z3=0;
    for k=1+k:length(c1)+inc
        no=c1(z); arr1=arr(:,1); p=find(arr1==no);
        if isempty(p)
            j=1;
            arr(y,j)=c1(z); arr(y,j+1)=i; arr(y,j+2)=c2(z); j=j+3; y=y+1;
        else
            ind(i,z1)=p;
            L=arr(p,:); len=0;
            for h=1:length(L)
                if L(h)~=0
                    len=len+1;
                end
            end
            len;
            arr(p,len+1)=i;
            arr(p,len+2)=c2(z);
            z1=z1+1;
        end
        z=z+1;
    end
    inc=inc+length(c1);
    [u,u1] = size(arr);
end
f4=fopen('netfile.txt','w');
for i=1:u
    for j=1:u1
        fprintf(f4,'%d ',arr(i,j));
    end
    fprintf(f4,'\n');
end
fclose all;
thanks

1 Comment

Please, I need advice about the above code.
Maybe someone can add some improvements to make it run faster.
Why is it running so very slowly?
How can I improve it?
Thanks in advance


 Accepted Answer

Daniel Shub, 2011-11-9


What version of MATLAB are you using? It looks like arr is growing in your loop. Prior to R2011a (???), preallocating a variable can speed things up. If you do not know the final size, reallocating in large chunks can speed things up.
Where are the files saved (locally, network drive, flash drive, external hard drive)? A fast internal hard drive will give you the fastest read times.
Have you tried using the profiler to find the bottlenecks in the code?
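A minimal sketch (not from the thread) of the two allocation strategies mentioned above: preallocating when the final size is known, and reallocating in large chunks when it is not. The sizes and row contents here are made up for illustration.

```matlab
% 1) Known size: allocate once up front instead of growing per iteration.
n   = 1e5;
arr = zeros(n, 3);           % one allocation
for i = 1:n
    arr(i, :) = [i, 2*i, 3*i];
end

% 2) Unknown size: grow by a whole chunk whenever the buffer is full,
%    track how many rows are actually used, and trim once at the end.
chunk = 10000;
buf   = zeros(chunk, 3);
used  = 0;
for i = 1:25000              % stand-in for "while more data arrives"
    used = used + 1;
    if used > size(buf, 1)
        buf(end + chunk, end) = 0;   % extend by a whole chunk in one step
    end
    buf(used, :) = [i, 2*i, 3*i];
end
buf = buf(1:used, :);        % drop the unused tail
```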

3 Comments

Pre-allocation is extremely useful in 2011b as well.
Is this a regression from 2011a to 2011b, or are the improvements in 2011a not as great as I thought? http://blogs.mathworks.com/steve/2011/05/16/automatic-array-growth-gets-a-lot-faster-in-r2011a/
thanks Daniel,
"What version of MATLAB are you using?"
MATLAB 7.
"It looks like arr is growing in your loop."
Yes.
"Prior to R2011a (???), preallocating a variable can speed things up."
How do I preallocate and reallocate?
"Where are the files saved (locally, network drive, flash drive, external hard drive)? A fast internal hard drive will give you the fastest read times."
My files are stored on partition D:\ on my computer.
"Have you tried using the profiler to find the bottlenecks in the code?"
Please tell me how to use the profiler.
This code is very important for me.
thanks
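For reference, the profiler asked about above can be run like this (a sketch, assuming the code has been saved as a function named wwq; these are standard `profile` commands):

```matlab
profile on        % start collecting timing data
wwq;              % run the code to be measured
profile off       % stop collecting
profile viewer    % open the report: time spent per function and per line

% The same data can also be inspected programmatically:
p = profile('info');
[~, idx] = sort([p.FunctionTable.TotalTime], 'descend');
disp(p.FunctionTable(idx(1)).FunctionName)   % most expensive function
```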


More Answers (1)

Some general advice for improving the speed:
  • One command per line only; otherwise the JIT acceleration loses its power.
  • Avoid dummy statements such as "len;" - they waste time.
  • Deleting the first two bytes from the file takes a lot of time. Better: open the file, read two bytes, and call TEXTSCAN afterwards.
  • Close every file properly with fclose(fid) as soon as possible. Do not leave all the files open until the final fclose('all'). Open files consume resources.
  • Use the vectorized form of fprintf: instead of for j=1:u1, fprintf(f4,'%d ',arr(i,j)); end, prefer fprintf(f4, '%d ', arr(i, :)).
  • Counting the number of non-zero elements in L does not need a loop. Faster: len = sum(L ~= 0);
  • arr(:, :) = 0 is not useful, because it is equivalent to arr = 0. And k is defined twice.
I cannot insert a pre-allocation, because I do not know the maximal possible size of "arr". But this should already be faster:
function wwq
targetdir = 'd:\social net\dataset\netflix\training_set';
targetfiles = '*.txt';
fileinfo = dir(fullfile(targetdir, targetfiles));
arr = 0;  % Better: pre-allocate
inc = 0;
kk  = 0;
y   = 1;
for i = 1:length(fileinfo)
    thisfilename = fullfile(targetdir, fileinfo(i).name);
    f = fopen(thisfilename, 'r');
    fread(f, 2, 'uint8');  % Skip two bytes
    c = textscan(f, '%f %f %s', 'Delimiter', ',', 'headerLines', 1);
    fclose(f);
    c1 = c{1};
    c2 = c{2};
    % c3 = c{3};  % Not used
    z = 1;
    % z1 = 1;  % Not used
    % z2 = 1;  % Not used
    % z3 = 0;  % Not used
    kknew = length(c1) + inc;
    for k = (1 + kk):kknew  % Avoid k as counter *and* in loop bounds
        no = c1(z);
        p = find(arr(:, 1) == no);
        if isempty(p)
            arr(y, 1) = c1(z);
            arr(y, 2) = i;
            arr(y, 3) = c2(z);
            % j = j + 3;  % Not used
            y = y + 1;
        else
            % ind(i, z1) = p;  % Not used
            L = arr(p, :);
            len = sum(L ~= 0);
            arr(p, len + 1) = i;
            arr(p, len + 2) = c2(z);
            % z1 = z1 + 1;  % Not used
        end
        z = z + 1;
    end
    kk  = kknew;
    inc = inc + length(c1);
    u = size(arr, 1);
end
f = fopen('netfile.txt', 'w');
for i = 1:u
    fprintf(f, '%d ', arr(i, :));
    fprintf(f, '\n');
end
fclose(f);

2 Comments

thanks Jan,
I tried to run the code you wrote, but in this part:
fread(f, 2, 'uint8'); % Skip two bytes
c = textscan(f, '%f %f %s', 'Delimiter', ',', 'headerLines', 1);
c returns just the second line; I need to read from the second line to the last line.
thanks

hi Jan,
I tried your code, but the problem is the same.
I tried it on just 1000 files, and the running time is still very long - maybe 45 minutes for just 1000 files. What will happen when I run all 177,000 files?
Could a sparse matrix solve this problem?
thanks
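A sparse matrix could indeed sidestep the find-and-grow logic entirely, provided the values in c{1} are positive integer IDs (as Netflix user IDs are): use the ID directly as the row index and the file number as the column index. A hedged sketch under that assumption (the directory and textscan format are taken from the question; duplicate (ID, file) pairs are assumed not to occur, since sparse() would sum them):

```matlab
targetdir = 'd:\social net\dataset\netflix\training_set';
fileinfo  = dir(fullfile(targetdir, '*.txt'));
nFiles    = length(fileinfo);
rows = cell(nFiles, 1);   % ID per entry      (c{1})
cols = cell(nFiles, 1);   % file index per entry
vals = cell(nFiles, 1);   % value per entry   (c{2})
for i = 1:nFiles
    f = fopen(fullfile(targetdir, fileinfo(i).name), 'r');
    c = textscan(f, '%f %f %s', 'Delimiter', ',', 'HeaderLines', 1);
    fclose(f);
    rows{i} = c{1};
    cols{i} = repmat(i, length(c{1}), 1);
    vals{i} = c{2};
end
% One sparse matrix: S(id, fileIndex) = value. No per-row searching,
% and only the nonzero entries are stored.
S = sparse(vertcat(rows{:}), vertcat(cols{:}), vertcat(vals{:}));
```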

