What exactly is going on with this code? I'm having difficulty understanding it
Hello,
I'm struggling to understand the code below. I'll give my best interpretation of what's going on, and I'm hoping someone can confirm that I'm right.
for FilesToRead = 1 : length(FileNames)
    % Load one file; the vector is then wrapped in a 1x1 cell array
    y = load(fullfile(Directory, FileNames{FilesToRead}));
    y = {y};
    n = 400000; % window length for the linear fits
    % Build a sample-index vector for each cell of y
    for i = 1:length(y)
        x{i} = (1:length(y{i}))';
    end
    % Fit a 1st-degree polynomial to each n-point block and evaluate it there
    for ii = 1:length(y)
        for k = 1:length(y{ii})/n
            coeffs{ii}(k,:) = polyfit(x{ii}(n*(k-1)+1 : k*n), y{ii}(n*(k-1)+1 : k*n), 1);
            yfit{ii}(k,:) = coeffs{ii}(k,2) + coeffs{ii}(k,1)*x{ii}(n*(k-1)+1 : k*n);
        end
    end
    % Flatten the rows of yfit into one long vector and subtract it from y
    for ii = 1:length(y)
        dd{ii} = subsref(yfit{ii}.', substruct('()', {':'})).'; % same as reshaping yfit{ii}.' into a row
        dd2{ii} = y{ii}(1:length(dd{ii})) - dd{ii}';
    end
    % Build a running sample index across the cells of dd2
    for i = 1:length(dd2)
        t{i} = zeros(1, length(dd2{i}));
    end
    t{1} = 1:length(dd2{1});
    for i = 2:length(dd2)
        t{i} = t{i-1}(end) + [1:length(dd2{i})];
    end
    clear coeffs dd i ii k prompt ws yfit
    yRaw = y; xRaw = x;
    x = t; y = dd2; % the detrended data replaces y from here on
    clear dd2 t
    % Window for variance - 5 seconds = 100,000 samples at 20 kHz
    n = 100000;
    for i = 1:length(y)
        for k = 1:length(y{i})/n
            a(k) = var(y{i}(n*(k-1)+1 : k*n));
        end
        v{i} = a;
        a = [];
    end
    % One time stamp per variance window
    t{1} = 1:length(v{1});
    for i = 2:length(y)
        t{i} = t{i-1}(end) + [1:length(v{i})];
    end
    for i = 1:length(t)
        tm{i} = t{i}.*((n/20000)/60); % time in minutes
    end
end
So essentially, every 400,000 data points, a 1st-degree polynomial is fitted to those 400,000 points. Then the data is zeroed, or flattened: the value of the fitted line at each point is subtracted, so the slope of the polynomial is turned to 0 and each data point has the local fit value removed from it.
This is done to the entire data set. The detrended values are then stored, and a variance is computed over every 100,000 data points.
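If I'm right, a toy version of the same steps (my own sketch, ignoring the cell arrays, with blocks of 4 instead of 400,000) would be:
% Toy illustration of my understanding, not the original code
y = cumsum(randn(12,1)); % pretend signal, 12 samples
n = 4; % block length
x = (1:numel(y))';
dd2 = zeros(size(y));
for k = 1:numel(y)/n
    idx = n*(k-1)+1 : k*n; % this block's samples
    c = polyfit(x(idx), y(idx), 1); % 1st-degree fit to the block
    dd2(idx) = y(idx) - polyval(c, x(idx)); % flatten: subtract the fit
end
v = var(reshape(dd2, n, [])) % one variance per block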
Is this the general gist?
9 Comments
dpb
2019-9-24
...
%rootfolder = '/Data/';
% ***Don't forget to switch the slashes!
rootfolder = 'Z:Data\'; %%MAJOR ISSUE AREA. THE DIRECTORY HERE ALWAYS CHANGES! WATCH OUT!!!!
% Chopped = '/Chopped';
Chopped = '\Chopped\';
prompt = 'Enter date of experiment to be analyzed: ';
DataDate = input(prompt, 's');
Directory = strcat(rootfolder,DataDate,Chopped);
Directory2=strcat(rootfolder,DataDate,'\');
...
We talked about this issue previously too, I think...?
Use fullfile() to build qualified filenames without specifying the system-specific delimiter character; it will automagically recognize which system you are running on and do the right thing for you.
%% This selects all .txt files in the folder.
FileNames=dir(Directory);
FileNames = {FileNames.name};
FileNames = FileNames(contains(FileNames, '.txt'));
...
for FilesToRead = 1 : length(FileNames)
y = load(fullfile(Directory, FileNames{FilesToRead}));
...
is just
rootfolder = 'Z:Data';
Chopped = 'Chopped';
prompt = 'Enter date of experiment to be analyzed: ';
DataDate = input(prompt, 's');
Directory=fullfile(rootfolder,DataDate,Chopped);
FileNames=dir(fullfile(Directory,'*.txt'));
...
for FilesToRead=1:length(FileNames)
y=load(fullfile(Directory,FileNames(FilesToRead).name));
...
and will include only the files with the .txt extension, with the system dependency handled automatically.
I don't think there's any point in then turning y into a cell array instead of just operating on the resulting vector as native double; that alone should speed things up noticeably as well as reduce the memory footprint.
Then, as G has noted (and I did in our previous conversations as well), the loop and the initial x are completely superfluous and a waste of time and memory.
Also as I noted before, one could then compute the statistics on N elements of the vector piecewise by using reshape() to make an NxM array, then rely on the built-in MATLAB convention of operating over columns for vectorized routines like mean and var, or iterate over the M columns for those which haven't been vectorized, such as polyfit. At that point you may find it convenient to store the coefficients in a cell array since there are two per column, but it isn't clear there's any need to save them at all rather than just processing each column (section) and moving on to the next. Doing it this way, whether you save the intermediates or not, would eliminate almost all the convoluted indexing that makes so much of the code so difficult to parse. A sketch of the idea follows.
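A minimal sketch of that reshape approach (my variable names; it assumes numel(y) is an exact multiple of N):
N = 400000; % section length
Y = reshape(y, N, []); % one section per column
X = (1:N)'; % the same x vector works for every column
M = size(Y, 2);
coeffs = zeros(M, 2);
resid = zeros(size(Y));
for m = 1:M % polyfit isn't vectorized, so loop over the columns
    coeffs(m,:) = polyfit(X, Y(:,m), 1);
    resid(:,m) = Y(:,m) - polyval(coeffs(m,:), X);
end
v = var(reshape(resid, 100000, [])); % var operates down columns by default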
Once you have done that, working through the rest of the machinations should be relatively straightforward with the same idea, although I still contend it would be far simpler to start with a much smaller dataset, N=400 or somesuch, so you don't get overwhelmed by the size, can see results quickly and on screen, and aren't inundated by the sheer mass of numbers that obfuscates the underlying problem. Once you do that, you can go back and verify your revised code produces the same results for a real data set over that time length.
Again, it would also be wise to recode so that these magic numbers aren't buried in the bowels of the code but are easily set and modifiable, so the code could be used generically if conditions change in future.
Accepted Answer
Guillaume
2019-9-24
Edited: Guillaume
2019-9-24
Removing all the unnecessary cruft, this is probably what the code reduces to. I've gone with the fact that your files are 400,000 elements long to remove the badly coded split into 400,000-element chunks.
I haven't tested the code, so I may have got it wrong. I've kept the meaningless variable names; if I had written it, I would have used much better ones (seriously, a as a variable name?).
%inputs:
y = load(fullfile(Directory, FileNames{FilesToRead})); %since the file is text, it would be better to use csvread or similar instead of load
x = (1:numel(y))';
coeffs = polyfit(x, y(:), 1);
yfit = polyval(coeffs, x); %much simpler than coding the polynomial yourself
dd2 = y - yfit;
n = 100000;
a = var(reshape(dd2, n, [])); %numel(dd2) must be an exact multiple of n
tm = (1:numel(a)) * (n/20000) / 60; %time in minutes, one value per variance window
I've not kept the t variable that is recreated again and again, as it doesn't do anything useful. If you still need it, the first one is the same as x.
It would be simple to add the split into 400,000 elements if it's really needed.
Edit: as discussed below, I missed that dd2 gets renamed to y, replacing the original y. Seriously, who does that? What was wrong with keeping it as dd2 (other than both names being meaningless)?
18 Comments
dpb
2019-9-29
"Not using ybt anymore, so we don’t need it."
That doesn't answer the underlying question: can you hold the full vector in memory on the systems you're running on? What you use for the variable name is immaterial. If so, then the full file can be operated on at once and there's no need to loop through the file at all. If not, then you need the loop to read segments and process them...but you don't need to (and shouldn't) separate them out into individual files to do that.
"data date and the signal type is enough to differentiate between data"
Yeah, you've said that, but you're going at it in what appear to be very awkward ways... If you would illustrate the naming pattern used, I'm certain a wildcard pattern could be built from the inputs you've asked the user for (which will have to come from somewhere, no matter how you actually end up running the script, and which should undoubtedly be turned into a function for data integrity and scoping), so that dir returns only the files of interest directly.
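For example, a guess at the pattern (the names and the pattern itself are assumptions, since we haven't seen the actual file names):
% Hypothetical: assumes the file names embed the date and signal type,
% e.g. 2019-09-24_pressure_01.txt; adjust the wildcard to the real scheme
DataDate = input('Enter date of experiment to be analyzed: ', 's');
SignalType = input('Enter signal type: ', 's');
FileNames = dir(fullfile(Directory, [DataDate '*' SignalType '*.txt'])); % only matching files returned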
Anyway, details of interface and some minor efficiencies aside, it looks like you've got a handle on how to proceed from here; I'll reiterate the idea of playing at the command line with a small dataset so you can understand what the reshape stuff is doing and how.
Think I'll bow out now...good luck.
More Answers (1)
dpb
2019-9-25
Edited: dpb
2019-9-25
"If you set the directory to the three fiels linked, you'll process them in about 30 seconds."
I did set breakpoint in debugger and ran the first of the three to the point of computing dd2
As we already knew, the result of
y=load('x000.txt');
ypr=detrend(y); % use the Signal Processing TB detrend function
resulted in
K>> dd2{1}(1:10)
ans =
-0.0011
0.0005
0.0041
-0.0033
0.0033
-0.0001
0.0023
0.0001
-0.0003
0.0022
K>> ypr(1:10)
ans =
-0.0011
0.0005
0.0041
-0.0033
0.0033
-0.0001
0.0023
0.0001
-0.0003
0.0022
K>>
which is the same result, in one line of code instead of the 50 or so in your function, and it runs in a fraction of a second. And, in fact,
K>> max(abs(dd2{1}-ypr))
ans =
2.8449e-16
K>>
shows the difference between the two computation methods is within machine precision of a double-precision variable, so we've verified that's what the gosh-awful mess actually does.
So, all you now need is to compute the variance of the detrended time series over the time length of choice -- as G shows and I recommended, the easiest way to do this is to reshape() the vector to an Nx[] array and let MATLAB's column-wise operation on arrays compute the values for each column in one call.
It appears the only thing left is the computation of the tm array and then plotting the results.
All the calculations could be in a loop of 20 lines at the absolute outside; more like 10 or even fewer seems probable. A sketch of what that could look like follows below.
Wrap that in an outer loop that opens the big file and reads the desired number of records (400K or whatever was chosen) and computes each section.
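Untested, but per file the whole computation could be on the order of this (the 20 kHz rate and window sizes are taken from your earlier code; the names are mine):
fs = 20000; % samples per second
nVar = 100000; % variance window: 5 seconds
y = load(fullfile(Directory, FileNames(FilesToRead).name));
ypr = detrend(y); % remove the linear trend
v = var(reshape(ypr, nVar, [])); % one variance per window; numel(ypr) must divide by nVar
tm = (1:numel(v)) * (nVar/fs) / 60; % window time stamps in minutes
plot(tm, v)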
4 Comments
Guillaume
2019-9-25
@EL, were you the one asking the same question on discord?
If so, then start a new question (here on Answers) for just that, explaining in a lot more detail what you want to save, to what kind of file (mat, text, excel, binary, whatever), and why it is incremental.
Note that if your input data files are huge, then using datastore and tall arrays would probably be a good idea. Again, this requires discarding the current code.
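A minimal sketch of that approach (untested; it assumes plain one-column text files with no header row):
ds = datastore(fullfile(Directory, '*.txt'), 'Type', 'tabulartext', 'ReadVariableNames', false);
t = tall(ds); % a tall table backed by the datastore
y = t.Var1; % the single data column (Var1 is the default name)
m = gather(mean(y)); % nothing is read until gather() runs the deferred computation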