How to know if PCA worked?

Question

Emil Ås 2019-10-30

Direct link to this question

https://ww2.mathworks.cn/matlabcentral/answers/488272-how-to-know-if-pca-worked

Answered: Emil Ås 2019-12-6

Hi. This question comes from a financial econometrics assignment at university.

I have calculated the principal components of a large data set. The data consists of thirty columns, one per bond, with maturities ranging from 1 to 30 years. Values are recorded for every day the market was open, going back to about the 1960s, so the data contains 8461 rows.

So, this is what I have done so far after importing the data:

T1TT = table2timetable(T1); % convert the data table T1 to a timetable
cr = corr(T1TT{:,:}); % correlation matrix of the thirty bond series
% compute the eigenvectors and eigenvalues of the correlation matrix
[eigenVectors,eigenValues] = eig(cr);
% eig returns the eigenvalues as a diagonal matrix (zeros elsewhere), so extract the diagonal
eigenValues = diag(eigenValues);
% sort the eigenvalues in descending order and reorder the eigenvector columns to match
[eigenValues,sortIndex] = sort(eigenValues, 'descend');
eigenVectors = eigenVectors(:,sortIndex);

Our job now is to check whether the principal component analysis worked. This text is taken from the assignment paper:

Finally, you want to make sure that your principal component analysis worked and it really transformed the correlated explanatory variables into uncorrelated principal components. To standardize each of the thirty time series, subtract from each observation the mean of the time series and divide the result by the standard deviation of the time series. Next, multiply the matrix containing all the standardized time series with the matrix of eigenvectors to compute the time series of the thirty principal components. Calculate all possible correlations between the thirty principal components - but do not report it in your solution paper! Instead describe the pattern the correlation matrix shows. Did your principal component analysis work?

This is what I did to try to answer the question above:

mean = -0.0061; % the mean of the eigenVectors (seen from workspace)
std = std(eigenVectors); % std of the eigenVectors 
standardized = (eigenVectors - mean) / std; % standardizing the time series 
multiplied = standardized .* eigenVectors; % multiplying 
multiplied_corr = corr(multiplied(:,:)); % finding the correlation matrix (they should now be uncorrelated)

However, the correlation matrix returned in "multiplied_corr" looks strange to me. I think the principal components should be uncorrelated, but according to this matrix they are still correlated in some way.

Does anyone know another way to solve this and check whether the PC analysis worked?

Answer 1

Ridwan Alam 2019-11-21

Direct link to this answer

https://ww2.mathworks.cn/matlabcentral/answers/488272-how-to-know-if-pca-worked#answer_402623

Edited: Ridwan Alam 2019-11-21

Assuming your data is in a table T1 of size 8461x30 (excluding the day/time index).

To standardize each of the thirty time series, subtract from each observation the mean of the time series and divide the result by the standard deviation of the time series.

There are two ways to do this:

standardized_T1 = zscore(table2array(T1));

or,

standardized_T1 = table2array(T1);
standardized_T1 = (standardized_T1 - mean(standardized_T1))./std(standardized_T1);
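If you want to convince yourself that the two ways give the same result, here is a minimal sketch on made-up data. The matrix X is only a synthetic stand-in for table2array(T1), and the variable names are my own. It assumes R2016b or later (for implicit expansion) and the Statistics and Machine Learning Toolbox (for zscore).

```matlab
% Synthetic stand-in for table2array(T1); the real data is 8461x30.
rng(0);                               % fixed seed so the run is repeatable
X = randn(200, 5) * randn(5, 5);      % 200 observations of 5 correlated series

Z1 = zscore(X);                       % way 1: built-in standardization
Z2 = (X - mean(X)) ./ std(X);         % way 2: subtract column means, divide by column stds

max_diff  = max(abs(Z1(:) - Z2(:)));  % largest elementwise difference between the two
col_means = mean(Z1);                 % each column mean should be ~0
col_stds  = std(Z1);                  % each column std should be ~1
```

Both max_diff and col_means should come out at floating-point rounding level, and col_stds should be 1 in every column.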

Next, multiply the matrix containing all the standardized time series with the matrix of eigenvectors to compute the time series of the thirty principal components.

[eigen_vector,eigen_values] = eig(cov(standardized_T1),'vector'); % since you standardized, corr() and cov() will give same output
[eigen_values,descending_index] = sort(eigen_values,'descend'); % sorting is optional for your task, I guess
eigen_vector = eigen_vector(:,descending_index); % sorting is performed to make the first PC capture max variance
pca_scores = standardized_T1*eigen_vector; % size should be 8461x30

Calculate all possible correlations between the thirty principal components - but do not report it in your solution paper! Instead describe the pattern the correlation matrix shows. Did your principal component analysis work?

pca_corr = corr(pca_scores);
% it should be (numerically) the identity matrix: ones on the diagonal,
% off-diagonal entries zero up to floating-point rounding,
% which means all the 30 PCs are orthogonal, i.e. "uncorrelated"
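Rather than eyeballing a 30x30 matrix, you could check the pattern numerically. The sketch below repeats the same steps on synthetic data (X is a made-up stand-in for your bond series) and measures the largest off-diagonal correlation. The tolerance 1e-8 is my own assumption; with real yield data, where some eigenvalues can be very close to zero, you may need a looser one.

```matlab
rng(1);                                % fixed seed so the run is repeatable
X = randn(500, 6) * randn(6, 6);       % synthetic correlated data (stand-in for T1)
Z = zscore(X);                         % standardize each column

[V, d] = eig(cov(Z), 'vector');        % eigenvectors and eigenvalues of the correlation matrix
[d, idx] = sort(d, 'descend');         % largest eigenvalue first
V = V(:, idx);                         % keep eigenvectors aligned with eigenvalues
scores = Z * V;                        % principal component time series

R = corr(scores);                      % correlations between the PCs
off_diag = R - eye(size(R));           % remove the ones on the diagonal
max_off_diag = max(abs(off_diag(:)));  % largest remaining correlation
is_uncorrelated = max_off_diag < 1e-8; % assumed tolerance, adjust as needed

score_var = var(scores);               % variance of each PC, should equal its eigenvalue
```

If is_uncorrelated is true, the PCA "worked" in the sense the assignment asks about. As an extra check, score_var should match the sorted eigenvalues d.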

Hope this helps!!

%% sanity check

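Using MATLAB's built-in function pca() (Statistics and Machine Learning Toolbox):

[pca_coeff_m,pca_scores_m] = pca(standardized_T1);
% pca_scores_m should match the pca_scores above,
% except that individual columns may have the opposite sign;
% the correlation matrix then gives the same outcome :)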
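One caveat when comparing against pca(): an eigenvector is only defined up to sign, so a column of the manual scores can come out as the negative of the matching pca() column. Here is a rough sketch of a comparison that allows for this, again on synthetic data with variable names of my own choosing. It assumes the eigenvalues are distinct, so that the components line up one to one.

```matlab
rng(2);                                % fixed seed so the run is repeatable
X = randn(500, 6) * randn(6, 6);       % synthetic correlated data (stand-in for T1)
Z = zscore(X);                         % standardize each column

% manual PCA, as in the steps above
[V, d] = eig(cov(Z), 'vector');
[d, idx] = sort(d, 'descend');
V = V(:, idx);
scores = Z * V;

% built-in PCA on the same standardized data
[coeff_m, scores_m, latent_m] = pca(Z);

% +1 or -1 per component, depending on whether the two eigenvectors point the same way
signs = sign(sum(V .* coeff_m, 1));
score_diff = scores .* signs - scores_m;    % flip columns where needed, then compare
max_score_diff  = max(abs(score_diff(:)));  % should be at rounding level
max_latent_diff = max(abs(d - latent_m));   % eigenvalues vs. pca() variances
```

After the sign alignment both differences should be tiny. Flipping the sign of a column does not change whether the PCs are uncorrelated with each other, so the conclusion about the correlation matrix is the same either way.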