Draw first principal component line and regression line on the same scatterplot

14 views (last 30 days)
Given the data below
WireLength DieHeight
1 125
4 110
5 287
6 200
8 350
10 280
12 400
14 370
13 480
18 420
19 540
22 518
where WireLength is on the x-axis and DieHeight is on the y-axis,
draw a regression line and the first principal component line on the scatter plot.
Below, I standardized the data and then drew the scatter plot.
After that, I drew the regression line using the code below:
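For reference, the standardization step (not shown in the post) might look like this sketch, assuming zscore from the Statistics and Machine Learning Toolbox and the variable names WireLength_Z / DieHeight_Z used below:

```matlab
% Sketch of the (assumed) standardization step.
WireLength = [1 4 5 6 8 10 12 14 13 18 19 22]';
DieHeight  = [125 110 287 200 350 280 400 370 480 420 540 518]';
WireLength_Z = zscore(WireLength); % (x - mean(x)) / std(x)
DieHeight_Z  = zscore(DieHeight);
scatter(WireLength_Z, DieHeight_Z, 'filled');
grid on;
```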
% Regression best-fit line
model = fitlm(WireLength_Z, DieHeight_Z); % linear model stats (not used for plotting below)
p = polyfit(WireLength_Z, DieHeight_Z, 1); % slope and intercept
f = polyval(p, WireLength_Z);              % fitted values
h = plot(WireLength_Z, DieHeight_Z, 'ok', WireLength_Z, f, '-', 'LineWidth', 2, 'MarkerSize', 5);
% Fill the markers with the line colors
set(h, {'MarkerFaceColor'}, get(h, 'Color'));
% Set axis limits
xlim([-3.5 3.5]); % Set x-axis limits
ylim([-3.5 3.5]); % Set y-axis limits
% Add x and y axes
ax = gca; % Get current axes
ax.XAxisLocation = 'origin'; % Set x-axis to the origin
ax.YAxisLocation = 'origin'; % Set y-axis to the origin
ax.XColor = 'k'; % Set x-axis color
ax.YColor = 'k'; % Set y-axis color
% Add a grid for better visualization
grid on;
Using this code, my scatter plot now looks like the figure below.
On the same axes I also want to draw only the first principal component line, so that the plot looks something like the second figure.
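One way to get that overlay is sketched below (an assumption, not code from the post): on standardized data the first PC line passes through the origin, so only its slope is needed. This assumes the standardized variables above and pca from the Statistics and Machine Learning Toolbox.

```matlab
% Sketch: overlay the first principal component line on the standardized data.
Z = [WireLength_Z, DieHeight_Z];   % 12x2 standardized data matrix (assumed names)
coeff = pca(Z);                    % columns of coeff are the PC directions
slope = coeff(2,1) / coeff(1,1);   % rise over run of PC1
t = [-3.5 3.5];                    % span the axis limits used above
hold on;
plot(t, slope * t, 'r--', 'LineWidth', 2); % PC1 passes through the origin here
```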

Answers (1)

Image Analyst, 2024-8-28
It looks like you're doing a simple linear fit. I'm not seeing how you're using pca. It sounds like homework and they want you to compute the principal components, not simply do a fit with polyfit.
help pca
PCA  Principal Component Analysis (PCA) on raw data.

COEFF = PCA(X) returns the principal component coefficients for the N by P data matrix X. Rows of X correspond to observations and columns to variables. Each column of COEFF contains coefficients for one principal component. The columns are in descending order in terms of component variance (LATENT). PCA, by default, centers the data and uses the singular value decomposition algorithm. For the non-default options, use the name/value pair arguments.

[COEFF, SCORE] = PCA(X) returns the principal component score, which is the representation of X in the principal component space. Rows of SCORE correspond to observations, columns to components. The centered data can be reconstructed by SCORE*COEFF'.

[COEFF, SCORE, LATENT] = PCA(X) returns the principal component variances, i.e., the eigenvalues of the covariance matrix of X, in LATENT.

[COEFF, SCORE, LATENT, TSQUARED] = PCA(X) returns Hotelling's T-squared statistic for each observation in X. PCA uses all principal components to compute TSQUARED (computes in the full space) even when fewer components are requested (see the 'NumComponents' option below). For TSQUARED in the reduced space, use MAHAL(SCORE,SCORE).

[COEFF, SCORE, LATENT, TSQUARED, EXPLAINED] = PCA(X) returns a vector containing the percentage of the total variance explained by each principal component.

[COEFF, SCORE, LATENT, TSQUARED, EXPLAINED, MU] = PCA(X) returns the estimated mean, MU, when 'Centered' is set to true; and all zeros when set to false.

[...] = PCA(..., 'PARAM1',val1, 'PARAM2',val2, ...) specifies optional parameter name/value pairs to control the computation and handling of special data types. Parameters are:

'Algorithm' - Algorithm that PCA uses to perform the principal component analysis. Choices are:
    'svd' - Singular Value Decomposition of X (the default).
    'eig' - Eigenvalue Decomposition of the covariance matrix. It is faster than SVD when N is greater than P, but less accurate because the condition number of the covariance is the square of the condition number of X.
    'als' - Alternating Least Squares (ALS) algorithm, which finds the best rank-K approximation by factoring X into an N-by-K left factor matrix and a P-by-K right factor matrix, where K is the number of principal components. The factorization uses an iterative method starting with random initial values. The ALS algorithm is designed to better handle missing values. It deals with missing values without listwise deletion (see {'Rows', 'complete'}).

'Centered' - Indicator for centering the columns of X. Choices are:
    true  - The default. PCA centers X by subtracting off column means before computing SVD or EIG. If X contains NaN missing values, NANMEAN is used to find the mean with any data available.
    false - PCA does not center the data. In this case, the original data X can be reconstructed by X = SCORE*COEFF'.

'Economy' - Indicator for economy size output, when D, the degrees of freedom, is smaller than P. D is equal to M-1 if the data is centered and M otherwise, where M is the number of rows without any NaNs if you use 'Rows', 'complete'; or the number of rows without any NaNs in the column pair that has the maximum number of rows without NaNs if you use 'Rows', 'pairwise'. When D < P, SCORE(:,D+1:P) and LATENT(D+1:P) are necessarily zero, and the columns of COEFF(:,D+1:P) define directions that are orthogonal to X. Choices are:
    true  - This is the default. PCA returns only the first D elements of LATENT and the corresponding columns of COEFF and SCORE. This can be significantly faster when P is much larger than D. NOTE: PCA always returns economy size outputs if the 'als' algorithm is specified.
    false - PCA returns all elements of LATENT. Columns of COEFF and SCORE corresponding to zero elements in LATENT are zeros.

'NumComponents' - The number of components desired, specified as a scalar integer K satisfying 0 < K <= P. When specified, PCA returns the first K columns of COEFF and SCORE.

'Rows' - Action to take when the data matrix X contains NaN values. If the 'Algorithm' option is set to 'als', this option is ignored, as the ALS algorithm deals with missing values without removing them. Choices are:
    'complete' - The default action. Observations with NaN values are removed before calculation. Rows of NaNs are inserted back into SCORE at the corresponding location.
    'pairwise' - If specified, PCA switches 'Algorithm' to 'eig'. This option only applies when the 'eig' method is used. The (I,J) element of the covariance matrix is computed using rows with no NaN values in columns I or J of X. Please note that the resulting covariance matrix may not be positive definite. In that case, PCA terminates with an error message.
    'all' - X is expected to have no missing values. All data are used, and execution will be terminated if NaN is found.

'Weights' - Observation weights, a vector of length N containing all positive elements.

'VariableWeights' - Variable weights. Choices are:
    - a vector of length P containing all positive elements.
    - the string 'variance'. The variable weights are the inverse of sample variance. If 'Centered' is set true at the same time, the data matrix X is centered and standardized. In this case, PCA returns the principal components based on the correlation matrix.

The following parameter name/value pairs specify additional options when the alternating least squares ('als') algorithm is used:

'Coeff0' - Initial value for COEFF, a P-by-K matrix. The default is a random matrix.
'Score0' - Initial value for SCORE, an N-by-K matrix. The default is a matrix of random values.
'Options' - An options structure as created by the STATSET function. PCA uses the following fields:
    'Display' - Level of display output. Choices are 'off' (the default), 'final', and 'iter'.
    'MaxIter' - Maximum number of steps allowed. The default is 1000. Unlike in optimization settings, reaching MaxIter is regarded as convergence.
    'TolFun'  - Positive number giving the termination tolerance for the cost function. The default is 1e-6.
    'TolX'    - Positive number giving the convergence threshold for relative change in the elements of L and R. The default is 1e-6.

Example:
    load hald;
    [coeff, score, latent, tsquared, explained] = pca(ingredients);

See also PPCA, PCACOV, PCARES, BIPLOT, BARTTEST, CANONCORR, FACTORAN, ROTATEFACTORS.

Documentation for pca: doc pca
Other uses of pca: gpuArray/pca, tall/pca
This is how I'd start it:
clc; % Clear the command window.
close all; % Close all figures (except those of imtool.)
clear; % Erase all existing variables. Or clearvars if you want.
workspace; % Make sure the workspace panel is showing.
format short g;
format compact;
fontSize = 16;
markerSize = 30;
% Define experimental/sample data.
% [WireLength, DieHeight]
data = [...
1 125
4 110
5 287
6 200
8 350
10 280
12 400
14 370
13 480
18 420
19 540
22 518]
% Plot the original data, which has 12 "observations".
figure('Name', 'PCA Demo by Image Analyst', 'NumberTitle', 'off')
subplot(2, 1, 1);
plot(data(:, 1), data(:, 2), 'b.', 'MarkerSize', markerSize)
title('Data in Original Coordinate Space', 'FontSize', fontSize)
xlabel('Wire Length', 'FontSize', fontSize)
ylabel('Die Height', 'FontSize', fontSize)
grid
% You can tell by looking at the plot that PC1 would go along
% the data from lower left to upper right. PC2 would then be
% the distance of the data perpendicularly from that line.
%--------------------------------------------------------
% Do a regression to minimize the squared vertical distance from the fitted line.
x = data(:, 1);
y = data(:, 2);
coefficients = polyfit(x, y, 1);
fittedx = [min(x), max(x)];
fittedy = polyval(coefficients, fittedx);
% Label points
for k = 1 : numel(x)
    str = sprintf(' %d', k);
    xt = x(k);
    yt = y(k);
    text(xt, yt, str, 'Color', 'b', 'FontSize', fontSize);
end
% Plot the regression over the original data
hold on;
plot(fittedx, fittedy, 'r-', 'LineWidth', 2);
legend('Original Data', 'Fitted Regression Line', 'Location', 'northwest')
drawnow;
%--------------------------------------------------------
% Do principal components analysis.
[coeff,score,latent,tsquared,explained,mu] = pca(data)
%--------------------------------------------------------
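One possible continuation (a sketch only, not the attached demo): use the first column of coeff and the returned mean mu to draw the PC1 line through the centroid of the original data.

```matlab
% Sketch: draw the first PC line through the mean of the original data.
slope1 = coeff(2,1) / coeff(1,1);          % PC1 slope in original units
xLine = [min(data(:,1)), max(data(:,1))];
yLine = mu(2) + slope1 * (xLine - mu(1));  % line through the centroid (mu)
hold on;
plot(xLine, yLine, 'g-', 'LineWidth', 2);
```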
For a demo beyond that, see attached demo m-file, where I draw the axes on the original data.
