Linear regression on data with asymmetric measurement error
I am looking to perform a linear regression on measured data that takes into account an asymmetric error in the data. I've created some dummy data to illustrate what I mean:
The blue curve represents the measured data, while the red curve is the lower bound and is notably closer to the measured data than the orange curve, which represents the upper bound.
Snippet of code to create dummy data:
xdata = linspace(0,10, 20);
ydata = 2*xdata+1.5*rand(1,length(xdata));
y_err_low = 0.3*xdata+1.5*rand(1,length(xdata));
y_err_high = 0.6*xdata+1.5*rand(1,length(xdata));
ylowbnd = ydata - y_err_low;
yupbnd = ydata + y_err_high;
plot(xdata, ydata,'o-', 'LineWidth', 2, 'DisplayName', 'measured data')
hold on
plot(xdata, ylowbnd, 'x--', 'LineWidth', 2, 'DisplayName', 'lower bound')
plot(xdata, yupbnd, 's--', 'LineWidth', 2, 'DisplayName', 'upper bound')
xlabel('x')
ylabel('y')
legend('Location','northwest')
I have linear regression approaches that rely on the error in y being symmetric about the measured datapoint, but am struggling to find a way to weight my regression based on an asymmetric error.
Things I've been digging into:
- fmincon (for both fmincon and lsqcurvefit, the bounds, equalities, and inequalities do not appear to accept a bound/constraint supplied as a vector, i.e., a separate value for each data point)
- lsqcurvefit
- Method of Maximum Likelihood (the examples I've been seeing rely on a Gaussian distribution around each ydata point, so they are not asymmetric)
I would appreciate any help on how to give the fit more (or less) freedom to roam, in line with the asymmetric error associated with each data point.
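One possible direction, sketched under the assumption that the dummy data above (xdata, ydata, y_err_low, y_err_high) is in the workspace: replace the symmetric Gaussian in the maximum-likelihood approach with a split (two-piece) normal, so residuals where the model lies below a measurement are scaled by y_err_low and residuals above it by y_err_high. The per-point normalization constants do not depend on the fit parameters, so only the weighted sum of squares needs to be minimized. The helper name nll_split and the linear model y = p(1)*x + p(2) are illustrative choices, not established code.
p0 = polyfit(xdata, ydata, 1);                % ordinary least-squares starting point
phat = fminsearch(@(p) nll_split(p, xdata, ydata, y_err_low, y_err_high), p0);
yfit = phat(1)*xdata + phat(2);               % asymmetric-error fit
plot(xdata, yfit, 'k-', 'LineWidth', 2, 'DisplayName', 'split-normal fit')
function v = nll_split(p, x, y, sLow, sHigh)
% Split-normal negative log-likelihood (up to constants): residuals where the
% model lies above the data are scaled by the upper error, residuals below it
% by the lower error.
r = p(1)*x + p(2) - y;                        % model minus data
s = sLow;                                     % model below the data -> lower error
s(r > 0) = sHigh(r > 0);                      % model above the data -> upper error
v = sum(0.5*(r./s).^2);                       % weighted sum of squared residuals
end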
Thanks!
Answers (2)
Mathieu NOE
2023-11-10
hello Katrina
maybe this?
you can force the mean curve to get closer to either the upper or the lower bound by adjusting the a coefficient
a = 0.7; % a = 1 is equivalent to standard linear averaging (no weighting)
% a<1 shift the mean towards the lower bound, a>1 towards the upper bound
full code (dummy data slightly different from your version, sorry!)
% "true" data
x2 = (0:30);
y2 = 2*x2+1.5*rand(1,length(x2));
dx = mean(diff(x2));
% upper bound
x1 = x2 + dx/3;
y1 = 2.6*x1+1.5*rand(1,length(x1));
% lower bound
x3 = x2 + dx*2/3;
y3 = 1.7*x3-1.5*rand(1,length(x3));
% measurement = all data (concatenated)
x = [x1 x2 x3];
[x,ind] = sort(x);
y = [y1 y2 y3];
y = y(ind);
%%%% main loop %%%%
n = 15; % buffer size
a = 0.7; % a = 1 is equivalent to standard linear averaging (no weighting)
% a<1 shift the mean towards the lower bound, a>1 towards the upper bound
yy = myspecialavg(y, n ,a);
plot(x2, y2,'b',x, y,'*-c',x,yy,'r', 'LineWidth', 2, 'DisplayName', 'measured data')
legend('"true data"','noisy data','my solution');
xlabel('x')
ylabel('y')
legend('Location','northwest')
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function out = myspecialavg(in, N, a)
% OUTPUT_ARRAY = MYSPECIALAVG(INPUT_ARRAY, N, A)
%
% The function 'myspecialavg' implements a one-dimensional weighted sliding-window filter. For each point, the values inside the window are sorted and
% combined with weights that ramp linearly from 1/A (on the smallest value) to A (on the largest value): A = 1 reduces to a plain sliding average,
% A < 1 pulls the result towards the lower values in the window, and A > 1 towards the higher values. When the sliding window exceeds the lower or upper
% boundary of INPUT_ARRAY, the average is computed over the available points only. Indicating with nx the length of the input sequence, note that for
% values of N larger than or equal to 2*(nx - 1), every value of the output data array is identical and equal to mean(in).
%
% * The input argument INPUT_ARRAY is the numerical data array to be processed.
% * The input argument N is the number of neighbouring data points to average over for each point of INPUT_ARRAY.
% * The input argument A is the asymmetry coefficient described above.
%
% * The output argument OUTPUT_ARRAY is the output data array.
if isempty(in) || (N<=0) % If the input array is empty or N is non-positive,
fprintf('SlidingAvg: (Error) empty input data or N null.\n'); % an error is reported to the standard output,
out = []; % an empty output is returned, and the
return; % execution of the routine is stopped.
end % if
if (N==1) % If the number of neighbouring points over which the sliding
out = in; % average will be performed is '1', then no average actually occur and
return; % OUTPUT_ARRAY will be the copy of INPUT_ARRAY and the execution of the routine
end % if % is stopped.
nx = length(in); % The length of the input data structure is acquired to later evaluate the 'mean' over the appropriate boundaries.
if (N>=(2*(nx-1))) % If the number of neighbouring points over which the sliding
out = mean(in)*ones(size(in)); % average will be performed is large enough, then the average actually covers all the points
return; % of INPUT_ARRAY, for each index of OUTPUT_ARRAY and some CPU time can be gained by such an approach.
end % if % The execution of the routine is stopped.
out = zeros(size(in)); % In all the other situations, the initialization of the output data structure is performed.
if rem(N,2)~=1 % When N is even, then we proceed in taking the half of it:
m = N/2; % m = N / 2.
else % Otherwise (N >= 3, N odd), N-1 is even ( N-1 >= 2) and we proceed taking the half of it:
m = (N-1)/2; % m = (N-1) / 2.
end % if
for i=1:nx, % For each element (i-th) contained in the input numerical array, a check must be performed:
dist2start = i-1; % index distance from current index to start index (1)
dist2end = nx-i; % index distance from current index to end index (nx)
if dist2start<m || dist2end<m % if we are close to start / end of data, reduce the mean calculation on centered data vector reduced to available samples
dd = min(dist2start,dist2end); % min of the two distance (start or end)
else
dd = m;
end % if
tmp = sort(in(i-dd:i+dd)); % buffered data , reduced to available samples at both ends of the data vector
win = linspace(1/a,a,numel(tmp)); % linearly ramped weights: 1/a on the smallest sorted value, a on the largest
win = win/sum(win); % normalize the weights so they sum to 1
out(i) = sum(win.*tmp); % mean of weighted data , reduced to available samples at both ends of the data vector
end % for i
end
4 Comments
Mathieu NOE
2023-11-14
hello Katrina
sorry but for the time being I have no other solution to suggest
Jeff Miller
2023-11-14
If you have separate measures of the lower and upper directional error associated with each X value (either empirical or derived from some model), then you can probably use least-squares.
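For concreteness, here is one way that suggestion might be turned into a directionally weighted least-squares fit. This is only a sketch of one interpretation, assuming the dummy data from the question (xdata, ydata, y_err_low, y_err_high) is in the workspace: starting from an ordinary fit, each point is reweighted by 1/sigma^2, with sigma taken from the upper error where the current fit lies above the data and from the lower error where it lies below, and the weighted fit is repeated until the weights settle.
p = polyfit(xdata, ydata, 1);                 % unweighted starting fit, p = [slope intercept]
A = [xdata(:), ones(numel(xdata), 1)];        % design matrix for y = slope*x + intercept
for k = 1:20                                  % a few reweighting passes are usually enough
    r = polyval(p, xdata) - ydata;            % current fit minus data
    s = y_err_low;                            % fit below the data -> lower error
    s(r > 0) = y_err_high(r > 0);             % fit above the data -> upper error
    w = 1 ./ s.^2;                            % inverse-variance weights
    W = diag(w);                              % weighted normal equations
    p = ((A' * W * A) \ (A' * W * ydata(:))).';   % back to a polyval-style row vector
end
yfit = polyval(p, xdata);                     % directionally weighted fit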
0 Comments