Largest portion of largest correlation coefficient
I have n measurements as follows:
(x1,y1), (x2,y2),...,(xn,yn).
Now I want the largest portion that gives a correlation coefficient of more than a prespecified value.
How can I do that?
2 Comments
Torsten
2023-9-22
You want to extract m <= n point pairs (xi,yi), with m as large as possible, such that their correlation coefficient exceeds a given value? Is this the correct interpretation of your question?
Accepted Answer
Bruno Luong
2023-9-22
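% dummy data
x = rand(1,500);
x = x + 0.01*randn(size(x));
x = sort(x);
y = x.^2 + 0.01*randn(size(x));
cthreshold = 0.99;
n = length(x);
% all contiguous index ranges [start stop] with start < stop
int_se = nchoosek(1:n,2);
% length of each range, sorted so the longest ranges are tried first
l = int_se(:,2)-int_se(:,1)+1;
[l,is] = sort(l,'descend');
int_se = int_se(is,:);
for k = 1:size(int_se,1)
    subidx = int_se(k,1):int_se(k,2);
    xs = x(subidx);
    ys = y(subidx);
    R = corrcoef(xs,ys);
    rxy = R(1,2);
    % stop at the first (hence longest) range exceeding the threshold
    if rxy > cthreshold
        break
    end
end
l = length(xs) % number of points in the selected range (displayed)
plot(xs,ys,'.')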
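The search above calls corrcoef once per index range, which is roughly n^2/2 calls. A possible speed-up, offered only as a sketch and not as part of the original answer, is to precompute cumulative sums so that all ranges of a given length can be scored at once. The variable names (Sx, starts, best, and so on) are my own, and among ranges of equal length this version picks the one with the highest correlation rather than the first one found. The sum-based formula can lose precision on very short or nearly constant ranges, so treat the result as approximate.

```matlab
% Same kind of dummy data as in the answer above
x = sort(rand(1,500));
y = x.^2 + 0.01*randn(size(x));
cthreshold = 0.99;
n = numel(x);

% Prefix sums with a leading zero: sum over a..b equals S(b+1)-S(a)
Sx  = [0 cumsum(x)];     Sy  = [0 cumsum(y)];
Sxx = [0 cumsum(x.^2)];  Syy = [0 cumsum(y.^2)];
Sxy = [0 cumsum(x.*y)];

best = [];                          % [start stop] of the selected range
for len = n:-1:2                    % longest ranges first
    starts = 1:(n-len+1);           % every possible start index
    ends1  = starts + len;          % stop index + 1, into the prefix sums
    sx  = Sx(ends1)-Sx(starts);    sy  = Sy(ends1)-Sy(starts);
    sxx = Sxx(ends1)-Sxx(starts);  syy = Syy(ends1)-Syy(starts);
    sxy = Sxy(ends1)-Sxy(starts);
    vx = len*sxx - sx.^2;           % len^2 times the variance of x
    vy = len*syy - sy.^2;           % len^2 times the variance of y
    r = (len*sxy - sx.*sy) ./ sqrt(max(vx,0).*max(vy,0));
    r(~isfinite(r)) = -Inf;         % degenerate ranges never qualify
    [rmax,k] = max(r);
    if rmax > cthreshold
        best = [starts(k), starts(k)+len-1];
        break
    end
end
```

If best is non-empty, x(best(1):best(2)) and y(best(1):best(2)) are the selected points, and it is worth confirming the value with corrcoef on that range.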
0 Comments
More Answers (1)
William Rose
2023-9-22
I will suggest an approach, but first, why do you want to do this? It sounds suspicious: like you are selecting a subset of the data to get a correlation that is high. Why could this ever be justified?
Compute the correlation for all your data. If the correlation is below the threshold, then find the biggest outlier, by checking how much each point deviates from the regression line. Eliminate that point, and recalculate the regression without that point. Repeat this process, eliminating the biggest remaining outlier each time, until the correlation reaches the desired level.
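A minimal sketch of the procedure just described, assuming made-up data (a linear trend plus noise), a goal of 0.9, and a floor of three points so the loop cannot shrink to a trivial two-point fit. The names corrGoal and keep, and the floor itself, are illustrative choices of mine. Because the data are random, the number of points kept will differ from run to run.

```matlab
% Hypothetical example data: linear trend plus noise
x = 1:50;
y = x + 10*randn(size(x));
corrGoal = 0.9;                     % assumed threshold

keep = true(size(x));               % mask of points still in use
R = corrcoef(x, y);
rho = R(1,2);
while abs(rho) < corrGoal && nnz(keep) > 3
    p = polyfit(x(keep), y(keep), 1);    % refit on the remaining points
    resid = abs(y - polyval(p, x));      % deviation from the current line
    resid(~keep) = -Inf;                 % ignore points already removed
    [~, worst] = max(resid);
    keep(worst) = false;                 % eliminate the biggest outlier
    R = corrcoef(x(keep), y(keep));
    rho = R(1,2);
end
fprintf('Kept %d of %d points, corr = %.3f\n', nnz(keep), numel(x), rho);
```

If the loop stops because only three points remain, the goal was not reached in any meaningful sense, which is itself a sign that the threshold is too demanding for the data.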
6 Comments
William Rose
2023-9-22
Edited: William Rose
2023-9-22
If I understand your response correctly, then the approach you must use is already prescribed: you drop the first or last element of the vector each time until the correlation reaches the desired level.
I do not understand the rationale for this approach, but I am sure there is one. It is not obvious to me that the approach you have described will work very well. But maybe it is good for data with certain typical error properties.
If my understanding is correct, then this seems pretty straightforward. If you are new to Matlab, or new to programming in general, then maybe it is not obvious how to do it.
corrGoal=0.9;
noiseAmpl=10;
N=50; x0=1:N;
y0=x0+noiseAmpl*randn(1,N); % data to analyze
% Next line creates data with noise that is largest at the
% beginning and the end
% y0=x0+noiseAmpl*(2/N)*(-(N-1)/2:(N-1)/2).*randn(1,N);
x=x0; y=y0;
rhoMtx=corrcoef(x,y); % 2x2 matrix of correlation coefficients
rho=rhoMtx(1,2); % correlation between x and y
p=polyfit(x,y,1);
yfit=polyval(p,x);
yresid=y-yfit;
fprintf('Initial correlation (N=%d): %.3f\n',N,rho)
while abs(rho)<corrGoal
    % compare magnitudes, so the end point farther from the fit is dropped
    if abs(yresid(1))>abs(yresid(N))
        x=x(2:N); y=y(2:N); % discard initial point
    else
        x=x(1:N-1); y=y(1:N-1); % discard last point
    end
    N=N-1; % decrement N
    p=polyfit(x,y,1);
    yfit=polyval(p,x);
    yresid=y-yfit;
    rhoMtx=corrcoef(x,y);
    rho=rhoMtx(1,2);
end
fprintf('Corr=%.3f, slope=%.2f, intercept=%.2f, N=%d.\n',rho,p(1),p(2),N);
plot(x0,y0,'b+',x,y,'ro',x,yfit,'-r'); % plot results
xlabel('X'); ylabel('Y'); axis equal; grid on
legend('original','final','final regression','Location','southeast')
On each pass, the code above eliminates the first or last point, whichever deviates more from the regression line.
The script runs without error. With noiseAmpl=10, the desired correlation is attained when there are two points remaining, at which point abs(corr)=1.00, of course. The two-point fit does not reflect the overall relationship between the original vectors x0 and y0. Therefore it is not very interesting or satisfying.
If the original y values had larger noise at the start and end than in the middle, I expect you would get a more pleasing result, i.e. a result in which only a few points at the start and end would be eliminated. I included a commented-out line in the script, which does this. You can un-comment it, and see what happens.
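On the two-point result mentioned above: any two distinct points lie exactly on a straight line, so their correlation is +1 or -1 regardless of the rest of the data. A tiny check with made-up numbers (the values and the names R2 and r2 are arbitrary):

```matlab
% Two points with increasing x and decreasing y
R2 = corrcoef([1 2], [5 3]);
r2 = R2(1,2);                 % magnitude is 1 up to rounding
fprintf('Two-point correlation: %.2f\n', r2);
```

This is why a loop that only tests abs(rho) against the goal will always terminate by the time two points remain, and why reaching the goal that way says nothing about the original data.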