Largest portion of largest correlation coefficient

Question

Roohollah 2023-9-22

Direct link to this question:

https://ww2.mathworks.cn/matlabcentral/answers/2024477-largest-portion-of-largest-correlation-coefficient

Edited: William Rose 2023-9-22

Accepted answer: Bruno Luong

I have n measurements as follows:

(x1,y1), (x2,y2), ..., (xn,yn).

Now I want the largest portion of the data whose correlation coefficient exceeds a prespecified value.

How can I do that?

2 Comments

Torsten 2023-9-22

You want to extract m <= n point pairs (xi,yi), with m as large as possible, such that their correlation coefficient exceeds a given specified value? Is this the correct interpretation of your question?

Roohollah 2023-9-22

Yes it is.

But the extracted points must be in the same order as original observation.

For example, you cannot omit the 5th, 19th and 25th observations and then say it is OK because the correlation is higher than the minimum. You can only omit data from the beginning and the end of the observations.
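Read this way, the problem can be written down as a search over contiguous index ranges. The following is only one possible formalization of the comments above; the symbols i, j, c and r_{i,j} are notation introduced here and do not appear in the original posts.

```latex
\max_{1 \le i < j \le n} \; (j - i + 1)
\quad \text{subject to} \quad r_{i,j} > c,
\qquad
r_{i,j} =
\frac{\sum_{k=i}^{j} (x_k - \bar{x}_{i,j})(y_k - \bar{y}_{i,j})}
     {\sqrt{\sum_{k=i}^{j} (x_k - \bar{x}_{i,j})^2 \; \sum_{k=i}^{j} (y_k - \bar{y}_{i,j})^2}}
```

Here \bar{x}_{i,j} and \bar{y}_{i,j} are the means of x and y over the window i..j, and c is the prespecified threshold.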


Answer 1

Bruno Luong 2023-9-22

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/2024477-largest-portion-of-largest-correlation-coefficient#answer_1316267


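% dummy data
x = rand(1,500);
x = x + 0.01*randn(size(x));
x = sort(x);
y = x.^2 + 0.01*randn(size(x));

cthreshold = 0.99;

% enumerate every interval [start, end] with start < end
n = length(x);
int_se = nchoosek(1:n,2);

% sort the intervals by length, longest first
l = int_se(:,2) - int_se(:,1) + 1;
[l,is] = sort(l,'descend');
int_se = int_se(is,:);

% test the intervals from longest to shortest and stop at the
% first one whose correlation exceeds the threshold
for k = 1:size(int_se,1)
    subidx = int_se(k,1):int_se(k,2);
    xs = x(subidx);
    ys = y(subidx);
    R = corrcoef(xs,ys);
    rxy = R(1,2);
    if rxy > cthreshold
        break
    end
end

l = length(xs)

l = 337

plot(xs,ys,'.')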
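The search above calls corrcoef once per interval, and there are n(n-1)/2 intervals. A possible speed-up, which is not part of the original answer, is to precompute cumulative sums so that the correlation of any window can be formed from differences of those sums. The sketch below assumes the data are well enough scaled for this to be numerically adequate. The test data and all variable names (Sx, Sxy, i0, i1, rbest and so on) are illustrative. Among windows of the maximal length it reports the one with the highest correlation, which may differ from the first window found by the loop above, although the length should agree.

```matlab
% Sketch (not from the original answer): longest contiguous window whose
% correlation exceeds cthreshold, using prefix sums.
% Illustrative deterministic data; replace with your own x and y.
x = linspace(0,1,200);
y = x.^2 + 0.02*sin(50*x);
cthreshold = 0.99;

n  = numel(x);
xc = x(:) - mean(x);            % centering limits cancellation error
yc = y(:) - mean(y);
Sx  = [0; cumsum(xc)];     Sy  = [0; cumsum(yc)];
Sxx = [0; cumsum(xc.^2)];  Syy = [0; cumsum(yc.^2)];
Sxy = [0; cumsum(xc.*yc)];

i0 = []; i1 = []; rbest = NaN;  % stay empty/NaN if no window qualifies
for m = n:-1:2                  % window length, longest first
    s = (1:n-m+1)';             % all start indices for this length
    e = s + m;                  % prefix-array index just past the window
    sx  = Sx(e)  - Sx(s);   sy  = Sy(e)  - Sy(s);
    sxx = Sxx(e) - Sxx(s);  syy = Syy(e) - Syy(s);
    sxy = Sxy(e) - Sxy(s);
    den = sqrt(max(m*sxx - sx.^2, 0) .* max(m*syy - sy.^2, 0));
    rho = (m*sxy - sx.*sy) ./ den;
    rho(~isfinite(rho)) = NaN;  % constant windows have no defined correlation
    [r, k] = max(rho);          % max ignores NaN entries
    if r > cthreshold
        i0 = s(k); i1 = s(k) + m - 1; rbest = r;
        break
    end
end
fprintf('window %d:%d (length %d), r = %.4f\n', i0, i1, i1-i0+1, rbest);
```

Each window length is handled with vectorized array operations, so the total work is on the order of n^2 rather than n^3. For noisy data with large offsets, checking the reported window against corrcoef is a reasonable sanity test.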


Answer 2

William Rose 2023-9-22

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/2024477-largest-portion-of-largest-correlation-coefficient#answer_1316112

@Roohollah,

I will suggest an approach, but first, why do you want to do this? It sounds suspicious: like you are selecting a subset of the data to get a correlation that is high. Why could this ever be justified?

Compute the correlation over all your data. If the correlation is below the threshold, then find the biggest outlier by checking how much each point deviates from the regression line. Eliminate that point, and recalculate the regression without it. Repeat this process, eliminating the biggest remaining outlier each time, until the correlation reaches the desired level.
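A minimal sketch of that procedure might look like the following. It is an illustration only: the test data, the two artificial outliers, and the names keep, resid and worst are assumptions made here, not part of the original post. Note that it removes arbitrary points, so it does not respect the requirement, stated in the comments on the question, that only points at the beginning and end may be dropped.

```matlab
% Sketch of the procedure described above: repeatedly drop the point with
% the largest absolute residual from the least-squares line until the
% correlation reaches the goal. Data and variable names are illustrative.
x = 1:30;
y = x + 3*sin(7*x);             % deterministic stand-in for noisy data
y([5 19]) = y([5 19]) + 40;     % two artificial outliers
corrGoal = 0.95;

keep = true(size(x));           % logical mask of points still in use
R = corrcoef(x(keep), y(keep));
rho = R(1,2);
while abs(rho) < corrGoal && nnz(keep) > 2
    p = polyfit(x(keep), y(keep), 1);   % regression line through kept points
    resid = abs(y - polyval(p, x));     % deviation of every point from the line
    resid(~keep) = -Inf;                % ignore already-removed points
    [~, worst] = max(resid);
    keep(worst) = false;                % eliminate the biggest remaining outlier
    R = corrcoef(x(keep), y(keep));
    rho = R(1,2);
end
fprintf('Kept %d of %d points, corr = %.3f\n', nnz(keep), numel(x), rho);
```

The mask keeps the original indexing intact, so find(~keep) lists which observations were discarded.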

6 Comments (4 older comments not shown)

Roohollah 2023-9-22

No man. This is not the case. If such a thing happens in your observations, it means that there is definitely something wrong in your experiment and you have to do it again.

William Rose 2023-9-22

Edited: William Rose 2023-9-22

@Roohollah,

If I understand your response correctly, then the approach you must use is already prescribed: you drop the first or last element of the vector each time until the correlation reaches the desired level.

I do not understand the rationale for this approach, but I am sure there is one. It is not obvious to me that the approach you have described will work very well. But maybe it is good for data with certain typical error properties.

If my understanding is correct, then this seems pretty straightforward. If you are new to MATLAB, or new to programming in general, then maybe it is not obvious how to do it.

corrGoal=0.9;
noiseAmpl=10;
N=50; x0=1:N;
y0=x0+noiseAmpl*randn(1,N); % data to analyze
% Next line creates data with noise that is largest at the
% beginning and the end
% y0=x0+noiseAmpl*(2/N)*(-(N-1)/2:(N-1)/2).*randn(1,N);
x=x0; y=y0;
rhoMtx=corrcoef(x,y); % 2x2 matrix of correlation coefficients
rho=rhoMtx(1,2); % correlation between x and y
p=polyfit(x,y,1);
yfit=polyval(p,x);
yresid=y-yfit;
fprintf('Initial correlation (N=%d): %.3f\n',N,rho)

Initial correlation (N=50): 0.870

while abs(rho)<corrGoal
    % Note: this compares signed residuals; abs(yresid(1))>abs(yresid(N))
    % would compare the magnitudes of the two deviations instead.
    if yresid(1)>yresid(N)
        x=x(2:N); y=y(2:N); % discard initial point
    else
        x=x(1:N-1); y=y(1:N-1); % discard last point
    end
    N=N-1; % decrement N
    p=polyfit(x,y,1);
    yfit=polyval(p,x);
    yresid=y-yfit;
    rhoMtx=corrcoef(x,y);
    rho=rhoMtx(1,2);
end
fprintf('Corr=%.3f, slope=%.2f, intercept=%.2f, N=%d.\n',rho,p(1),p(2),N);

Corr=-1.000, slope=-4.53, intercept=102.94, N=2.

plot(x0,y0,'b+',x,y,'ro',x,yfit,'-r'); % plot results
xlabel('X'); ylabel('Y'); axis equal; grid on
legend('original','final','final regression','Location','southeast')

On each pass, the code above eliminates the first or last point, whichever deviates more from the regression line.

The script runs without error. With noiseAmpl=10, the desired correlation is attained when there are two points remaining, at which point abs(corr)=1.00, of course. The two-point fit does not reflect the overall relationship between the original vectors x0 and y0. Therefore it is not very interesting or satisfying.

If the original y values had larger noise at the start and end than in the middle, I expect you would get a more pleasing result, i.e. a result in which only a few points at the start and end would be eliminated. I included a commented-out line in the script, which does this. You can un-comment it, and see what happens.
