Apply copulas for estimating a single missing marginal, is it possible?

Let's consider this example from the MATLAB documentation (with minor changes):
load stockreturns
x = stocks(:,1);
y = stocks(:,2);
z = stocks(:,3);
u = ksdensity(x,x,'function','cdf');
v = ksdensity(y,y,'function','cdf');
w = ksdensity(z,z,'function','cdf');
[Rho,nu] = copulafit('t',[u v w],'Method','ApproximateML')
Rho = 3×3
    1.0000    0.7220    0.3652
    0.7220    1.0000    0.3659
    0.3652    0.3659    1.0000
nu = 1.2692e+08
Now, assume that Rho and nu are known. Let's consider (only for simplicity):
v(50)
ans = 0.6546
And
y(50)
ans = 0.3170
And assume that y has a missing observation:
v(50) = NaN;
y(50) = NaN;
How can I estimate the missing marginal v(50), and from it the missing observation y(50), knowing Rho, nu, x, y, z and u, v, w? In other words: how can I impute the value of a missing observation given the other marginals?
Thank you in advance for your help.

Answers (1)

Paras Gupta 2023-12-17
Hi Barbab,
I understand that you want to impute the value of a missing observation knowing other marginals.
To estimate the missing value, we can use the conditional distribution of the t-copula given the known marginals. The following code illustrates one way to do this.
load stockreturns
x = stocks(:,1);
y = stocks(:,2);
z = stocks(:,3);
u = ksdensity(x,x,'function','cdf');
v = ksdensity(y,y,'function','cdf');
w = ksdensity(z,z,'function','cdf');
[Rho,nu] = copulafit('t',[u v w],'Method','ApproximateML');
% Assuming Rho, nu, x, y, z, u, v, w are known and v(50) and y(50) are missing
% Set the missing values to NaN
v(50) = NaN;
y(50) = NaN;
% Find indices of the non-missing data
nonMissingIdx = ~isnan(y);
% Estimate the CDF values for the non-missing y data
v_nonMissing = ksdensity(y(nonMissingIdx), y(nonMissingIdx), 'function', 'cdf');
% Fit the t-copula to the non-missing data
[Rho_nonMissing, nu_nonMissing] = copulafit('t', [u(nonMissingIdx) v_nonMissing w(nonMissingIdx)], 'Method', 'ApproximateML');
% For the missing observation, use the known values of x and z
known_x = x(50);
known_z = z(50);
% Calculate the CDF values of the known x and z
u_known = ksdensity(x, known_x, 'function', 'cdf');
w_known = ksdensity(z, known_z, 'function', 'cdf');
% Approximate the conditional distribution of y given x and z with the fitted t-copula's joint CDF, viewed as a function of v
conditionalCdf = @(v) copulacdf('t', [u_known v w_known], Rho_nonMissing, nu_nonMissing);
% Find the quantile function (inverse CDF) for the non-missing y data
inv_v_nonMissing = @(p) ksdensity(y(nonMissingIdx), p, 'function', 'icdf');
% Use fminbnd to find the v value that makes the conditional CDF equal to 0.5
% This is a median estimate under the conditional distribution
v_estimate = fminbnd(@(v) abs(conditionalCdf(v) - 0.5), 0, 1);
% Convert the v_estimate to the corresponding y value using the inverse CDF
y_estimate = inv_v_nonMissing(v_estimate);
Please note that this is a simplified approach and assumes that the median of the conditional distribution is a reasonable estimate for the missing value. In practice, you may want to use more sophisticated imputation methods, or account for the uncertainty in the estimate by sampling from the conditional distribution multiple times.
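As a rough illustration of sampling from the conditional distribution, here is a minimal sketch that uses the closed-form conditional of a multivariate Student t rather than a numerical search. Note that `copulacdf` returns the joint copula CDF, so this sketch works on the t scale instead: with t scores `tk = tinv([u; w], nu)` for the two known margins, the score of the missing margin is Student t with `nu + 2` degrees of freedom, location `mu = R21 * inv(R11) * tk` and scale `sigma = sqrt((nu + tk' * inv(R11) * tk) / (nu + 2) * (1 - R21 * inv(R11) * R21'))`. The variable names (`miss`, `obs`, `tk`, `R11`, `R21`, `mu`, `sigma`, `vSample`, `ySample`) are my own, and I am assuming that `tinv`, `tcdf` and `trnd` behave well for the very large `nu` fitted here.

```matlab
% Sketch only: conditional sampling of v given u and w under a fitted t-copula.
load stockreturns                        % example data from the Statistics and Machine Learning Toolbox
x = stocks(:,1); y = stocks(:,2); z = stocks(:,3);
miss = 50;                               % index of the observation treated as missing (assumption)
obs = true(size(y)); obs(miss) = false;  % rows where y is observed

% Marginal CDF values; the y margin uses only the observed rows
u = ksdensity(x, x, 'function', 'cdf');
w = ksdensity(z, z, 'function', 'cdf');
vObs = ksdensity(y(obs), y(obs), 'function', 'cdf');
[Rho, nu] = copulafit('t', [u(obs) vObs w(obs)], 'Method', 'ApproximateML');

% Map the known margins of the missing row to t scores
tk  = tinv([u(miss); w(miss)], nu);      % 2-by-1 vector of known scores
R11 = Rho([1 3], [1 3]);                 % correlation among the known margins
R21 = Rho(2, [1 3]);                     % correlation of the missing margin with the known ones

% Parameters of the conditional Student t (nu + 2 degrees of freedom)
mu    = R21 * (R11 \ tk);
d     = tk' * (R11 \ tk);
sigma = sqrt((nu + d) / (nu + 2) * (1 - R21 * (R11 \ R21')));

% Draw conditional samples, map back to the copula scale, then to the y scale
rng('default')
numSamples = 1000;
tSample = mu + sigma * trnd(nu + 2, numSamples, 1);
vSample = tcdf(tSample, nu);
ySample = ksdensity(y(obs), vSample, 'function', 'icdf');

% Conditional median on the copula scale and on the y scale
vMedian = tcdf(mu, nu);
yMedian = ksdensity(y(obs), vMedian, 'function', 'icdf');
```

The spread of `ySample` (for example `quantile(ySample, [0.05 0.95])`) gives a sense of how uncertain the imputed value is. With pairwise correlations around 0.37 to 0.72, a fairly wide interval would not be surprising.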
You can refer to the documentation links below for more information on the code above.
Hope this helps.
  1 Comment
Barbab 2023-12-18
Edited: Barbab 2023-12-18
Thanks for your answer, that was exactly what I was looking for.
If I understand correctly, the function conditionalCdf ensures that the dependence structure estimated on the non-missing data is taken into account, so that the y_estimate value is consistent with the Rho_nonMissing matrix?
When you mention sampling from the conditional distribution, do you mean something like this?
rng('default')
% Number of samples to draw from the conditional distribution
numSamples = 1000;
% Preallocate array to store sampled y values
sampled_y_values = NaN(numSamples, 1);
% Perform multiple samples from the conditional distribution
for i = 1:numSamples
    % Sample from the conditional distribution
    v_sample = fminbnd(@(v) abs(conditionalCdf(v) - rand), 0, 1);
    % Convert the sampled v value to the corresponding y value
    y_sample = inv_v_nonMissing(v_sample);
    % Store the sampled y value
    sampled_y_values(i) = y_sample;
end
% Calculate statistics or analyze the sampled y values as needed
y_estimate = mean(sampled_y_values);
% or
y_estimate = median(sampled_y_values);
Why is y_estimate so different from its "true" value, even though the Rho_nonMissing values are relatively high?
  • y(50), the true value is 0.3170
  • assuming the median (your code) gives 1.2224
  • using numerical simulations (my code) gives -0.1749 (mean) or -0.0908 (median)


Release

R2022b
