Largest portion of largest correlation coefficient

I have n measurements as follows:
(x1,y1), (x2,y2), ..., (xn,yn).
Now I want the largest portion of the data that gives a correlation coefficient greater than a prespecified value.
How can I do that?
2 Comments
Torsten 2023-9-22
You want to extract m <= n point pairs (xi,yi), with m as large as possible, such that their correlation coefficient exceeds a given specified value? Is this the correct interpretation of your question?
Roohollah 2023-9-22
Yes it is.
But the extracted points must be in the same order as the original observations.
For example, you cannot omit the 5th, 19th and 25th observations and then say it is OK because the correlation is higher than the minimum. You can only omit data from the beginning and the end of the observations.


Accepted Answer

Bruno Luong 2023-9-22
% dummy data
x = rand(1,500);
x = x + 0.01*randn(size(x));
x = sort(x);
y = x.^2 + 0.01*randn(size(x));
cthreshold = 0.99;
n = length(x);
% all (start,end) index pairs with start < end
int_se = nchoosek(1:n,2);
% segment lengths, sorted so the longest segments come first
l = int_se(:,2)-int_se(:,1)+1;
[l,is] = sort(l,'descend');
int_se = int_se(is,:);
% stop at the first (i.e. longest) contiguous segment above the threshold
for k = 1:size(int_se,1)
    subidx = int_se(k,1):int_se(k,2);
    xs = x(subidx);
    ys = y(subidx);
    R = corrcoef(xs,ys);
    rxy = R(1,2);
    if rxy > cthreshold
        break
    end
end
l = length(xs)
l = 337
plot(xs,ys,'.')
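
The search above calls corrcoef once per candidate segment. As a possible speed-up, here is a minimal sketch that gets the correlation of every contiguous segment from cumulative sums, so each segment length is handled in one vectorized step. It is not part of the original answer. The deterministic test data, the variable names, and the lower bound minLen are assumptions made for illustration only.

```matlab
% Deterministic stand-in data (an assumption, so the run is repeatable)
x = linspace(0, 1, 200);
y = x.^2 + 0.01*sin(40*x);
cthreshold = 0.99;
minLen = 3;                          % hypothetical minimum segment length

n = numel(x);
xc = x - mean(x);                    % centering leaves the correlation unchanged
yc = y - mean(y);                    % and reduces round-off in the sums below

% Prefix sums with a leading zero: the sum over s..e equals S(e+1) - S(s)
Sx  = [0 cumsum(xc)];
Sy  = [0 cumsum(yc)];
Sxx = [0 cumsum(xc.^2)];
Syy = [0 cumsum(yc.^2)];
Sxy = [0 cumsum(xc.*yc)];

best = [];                           % [start end] of the segment found
for len = n:-1:minLen                % try the longest segments first
    s = 1:(n-len+1);                 % every possible start index
    e = s + len - 1;                 % matching end index
    sx  = Sx(e+1)  - Sx(s);
    sy  = Sy(e+1)  - Sy(s);
    sxx = Sxx(e+1) - Sxx(s);
    syy = Syy(e+1) - Syy(s);
    sxy = Sxy(e+1) - Sxy(s);
    num = len*sxy - sx.*sy;
    den = sqrt(max((len*sxx - sx.^2) .* (len*syy - sy.^2), 0));
    r = num ./ den;                  % Pearson correlation of each segment
    r(~isfinite(r)) = NaN;           % ignore degenerate (zero-variance) segments
    [rmax, k] = max(r);              % max skips NaN entries
    if rmax > cthreshold
        best = [s(k) e(k)];
        break
    end
end

if isempty(best)
    disp('No segment exceeds the threshold.');
else
    fprintf('Segment %d..%d (length %d), corr = %.4f\n', ...
        best(1), best(2), best(2)-best(1)+1, rmax);
end
```

Among segments of the maximal length, this sketch keeps the one with the highest correlation, whereas the loop above stops at the first one it meets. Both respect the requirement that points may only be dropped from the beginning and the end.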

More Answers (1)

William Rose 2023-9-22
I will suggest an approach, but first, why do you want to do this? It sounds suspicious: like you are selecting a subset of the data to get a correlation that is high. Why could this ever be justified?
Compute the correlation using all your data. If it is below the threshold, find the biggest outlier by checking how much each point deviates from the regression line. Eliminate that point and recalculate the regression without it. Repeat this process, eliminating the biggest remaining outlier each time, until the correlation reaches the desired level.
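
The outlier-removal procedure just described could be sketched as below. This is only an illustration: the data, the three injected outliers, and the safety floor of three points are assumptions. Note that it removes interior points, which the asker's comments under the question do not allow.

```matlab
% Made-up data with three injected outliers (assumption, for illustration)
x = 1:30;
y = 2*x + 2*sin(7*x);
y([5 19 25]) = y([5 19 25]) + [40 -35 30];
corrGoal = 0.99;

keep = true(size(x));                % logical mask of points still in use
R = corrcoef(x, y);
rho = R(1,2);
while abs(rho) < corrGoal && nnz(keep) > 3   % floor of 3 points is hypothetical
    idx = find(keep);
    p = polyfit(x(idx), y(idx), 1);          % regression line on kept points
    resid = abs(y(idx) - polyval(p, x(idx)));
    [~, k] = max(resid);                     % biggest remaining outlier
    keep(idx(k)) = false;                    % eliminate it
    R = corrcoef(x(keep), y(keep));
    rho = R(1,2);
end
fprintf('Kept %d of %d points, corr = %.3f\n', nnz(keep), numel(x), rho);
```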
6 Comments
Roohollah 2023-9-22
No, that is not the case. If such a thing happens in your observations, it means that there is definitely something wrong in your experiment and you have to do it again.
William Rose 2023-9-22
Edited: William Rose 2023-9-22
If I understand your response correctly, then the approach you must use is already prescribed: you drop the first or last element of the vector each time until the correlation reaches the desired level.
I do not understand the rationale for this approach, but I am sure there is one. It is not obvious to me that the approach you have described will work very well. But maybe it is good for data with certain typical error properties.
If my understanding is correct, then this seems pretty straightforward. If you are new to Matlab, or new to programming in general, then maybe it is not obvious how to do it.
corrGoal=0.9;
noiseAmpl=10;
N=50; x0=1:N;
y0=x0+noiseAmpl*randn(1,N); % data to analyze
% Next line creates data with noise that is largest at the
% beginning and the end
% y0=x0+noiseAmpl*(2/N)*(-(N-1)/2:(N-1)/2).*randn(1,N);
x=x0; y=y0;
rhoMtx=corrcoef(x,y); % 2x2 matrix of correlation coefficients
rho=rhoMtx(1,2); % correlation between x and y
p=polyfit(x,y,1);
yfit=polyval(p,x);
yresid=y-yfit;
fprintf('Initial correlation (N=%d): %.3f\n',N,rho)
Initial correlation (N=50): 0.870
while abs(rho)<corrGoal
    % compare absolute residuals, so that "deviates more" ignores the sign
    if abs(yresid(1))>abs(yresid(N))
        x=x(2:N); y=y(2:N); % discard initial point
    else
        x=x(1:N-1); y=y(1:N-1); % discard last point
    end
    N=N-1; % decrement N
    p=polyfit(x,y,1); % refit the line to the remaining points
    yfit=polyval(p,x);
    yresid=y-yfit;
    rhoMtx=corrcoef(x,y);
    rho=rhoMtx(1,2);
end
fprintf('Corr=%.3f, slope=%.2f, intercept=%.2f, N=%d.\n',rho,p(1),p(2),N);
Corr=-1.000, slope=-4.53, intercept=102.94, N=2.
plot(x0,y0,'b+',x,y,'ro',x,yfit,'-r'); % plot results
xlabel('X'); ylabel('Y'); axis equal; grid on
legend('original','final','final regression','Location','southeast')
On each pass, the code above eliminates the first or last point, whichever deviates more from the regression line.
The script runs without error. In the run shown, with noiseAmpl=10, the desired correlation is attained only when two points remain, at which point abs(rho)=1.00, of course. The two-point fit does not reflect the overall relationship between the original vectors x0 and y0, so the result is not very interesting or satisfying.
If the original y values had larger noise at the start and end than in the middle, I expect you would get a more pleasing result, i.e. a result in which only a few points at the start and end would be eliminated. I included a commented-out line in the script, which does this. You can un-comment it, and see what happens.
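
As a minimal sketch of that experiment, the variant below uses the end-heavy noise model and adds a guard so the trimming cannot shrink the data to a trivial two-point fit. The floor minN and the choice to compare absolute residuals are assumptions added here, not part of the script above. The noise is random, so results differ between runs.

```matlab
corrGoal = 0.9;
noiseAmpl = 10;
minN = 10;                             % hypothetical minimum number of points
N = 50;
x = 1:N;
w = (2/N)*(-(N-1)/2:(N-1)/2);          % weights near 0 mid-record, near +/-1 at the ends
y = x + noiseAmpl*w.*randn(1,N);       % noise largest at the beginning and the end

R = corrcoef(x, y);
rho = R(1,2);
while abs(rho) < corrGoal && numel(x) > minN
    p = polyfit(x, y, 1);
    resid = abs(y - polyval(p, x));    % absolute deviation from the fitted line
    if resid(1) > resid(end)
        x = x(2:end);   y = y(2:end);      % discard the first point
    else
        x = x(1:end-1); y = y(1:end-1);    % discard the last point
    end
    R = corrcoef(x, y);
    rho = R(1,2);
end

if abs(rho) < corrGoal
    fprintf('Goal not reached with at least %d points (corr = %.3f).\n', minN, rho);
else
    fprintf('Corr = %.3f with N = %d points kept.\n', rho, numel(x));
end
```

Stopping at minN and reporting failure seems preferable to returning a two-point segment whose correlation is 1 by construction.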


Release

R2023a
