Should I use PCA to order the data points in order to find the mode of the data points?
Question 1: I roughly know that PCA is used for dimension reduction. It reduces the dimension of the features, not the number of data points (observations), right?
Question 2: for example, I have 30 data points randomly created and plotted below. At this moment, each data point is represented with 2 dimensions - x value and y value.
x=100*rand(1,30);
y=100*rand(1,30);
scatter(x,y,'ro','filled')
Now I would like to find a line (shown in black in the picture below) so that all the data points can be projected onto it. After projection, each data point can be represented by a single value (let's call it z) instead of the two values (x,y) in the original axis plane. See below. I am guessing this is a simple example of using PCA for dimension reduction. However, I don't know how to do this in MATLAB.
My requirements:
1) I would like to obtain all the 1-dimensional z values if possible, each of which represents one original data point;
2) Once I project all the data points onto this line, how do I find out which point on the line corresponds to which original data point?
3) In order to find the mode of the points on the projection line, do I just sort the z values and then find their mode? Once I have the mode of the z values, how do I relate that z value to the corresponding parent data point?
At the moment they are all a bit unclear and I would like to seek some help in order to help me fully understand the basics.
1 Comment
John D'Errico
2019-2-26
I'm not sure you really understand PCA.
Yes, it is true that PCA is not used to reduce the number of data points. That would serve no purpose.
The data you show does not seem to have one dimension you can intelligently reduce to, i.e., one dimension that seems to encode most of the information in this data. Yes, you can arbitrarily use PCA to reduce it to one variable, and you will still have n data points.
You cannot recover the original data point, merely from a point along the line though.
I'm also not sure why you want to find the mode of the z values, that is, of the projected points along the line. Odds are, since those points are projected, they will be just a bunch of distinct real numbers. So there will be no most frequent value.
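To illustrate that last point, here is a minimal sketch. The data, the variable names, and the choice of 6 bins are assumptions made only for this example. It checks that a set of continuous values has no repeated value, and then shows one crude alternative: take the centre of the fullest histogram bin as a rough "mode" and look up the observation nearest to it.

```matlab
rng(1);                                   % fixed seed so the run is repeatable
z = randn(30,1);                          % stand-in for 30 projected values (made-up data)

nUnique = numel(unique(z));               % continuous values are almost surely all distinct
zmodeNaive = mode(z);                     % with no repeats, mode returns the smallest value

nbins = 6;                                % arbitrary choice; the estimate depends on it
[counts, edges] = histcounts(z, nbins);   % bin the values
[~, k] = max(counts);                     % index of the fullest bin
zmode = (edges(k) + edges(k+1))/2;        % bin centre as a crude mode estimate

[~, inearest] = min(abs(z - zmode));      % index of the observation closest to that estimate
```

Because z keeps the same ordering as the original observations, inearest is also the row index of the corresponding original data point. The estimate changes with the number of bins, so treat it as a rough indication only.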
Answers (1)
John D'Errico
2019-2-26
Edited: John D'Errico
2019-2-26
Let me try to explain a bit, although I think you would be best served by doing some reading online, or in a textbook about PCA. I think you are trying to teach yourself about PCA by playing around with some made-up data in MATLAB. At the same time, by playing around with only 2 variables, you are not really understanding what you see.
xy = [1 2] + randn(100,2)*rand(2,2);   % 100 correlated points centred near (1,2)
plot(xy(:,1),xy(:,2),'o')
So there is clearly a relation between these two variables. In fact, you might decide to reduce that relationship to a single variable using PCA. You can do the PCA using the function pca, or you can use svd.
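xybar = mean(xy,1);                              % column means (centre of the data)
[U,S,V] = svd(xy - xybar);                       % columns of V are the principal directions
Projline = xybar + linspace(-3,3,100)'*V(:,1)';  % points along the first principal direction
hold on
plot(Projline(:,1),Projline(:,2),'-')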
So we have the subspace that the data would be projected onto, shown as the green line.
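As a sketch of how the 1-dimensional values asked about in the question could be computed from the same decomposition: the names z, v1, xyproj, idx and resid below are introduced only for this example, and the seed is fixed just to make the run repeatable. Each z(i) is the signed distance of point i along the line, measured from the mean.

```matlab
rng(0);                                   % fixed seed so the run is repeatable
xy = [1 2] + randn(100,2)*rand(2,2);      % same kind of data as above
xybar = mean(xy,1);                       % centre of the data
[~,~,V] = svd(xy - xybar,'econ');         % economy-size SVD of the centred data
v1 = V(:,1);                              % unit vector along the fitted line

z = (xy - xybar)*v1;                      % 100x1 scores; z(i) belongs to xy(i,:)
xyproj = xybar + z*v1';                   % the projected points, back in the (x,y) plane

[zsorted, idx] = sort(z);                 % idx(k) is the row of xy that produced zsorted(k)

resid = xy - xyproj;                      % the part of each point that z does not retain
```

The row order is never changed, so row i of xy, z(i) and xyproj(i,:) all refer to the same observation, and idx recovers that link after sorting. The nonzero resid is why the original point cannot be recovered from z alone. If the Statistics and Machine Learning Toolbox is available, [coeff,score] = pca(xy) should give the same direction and scores in coeff(:,1) and score(:,1), possibly with the sign flipped.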
Better still would be to do some reading about PCA. I would suggest the book by Ted Jackson (actually J.E. Jackson) as a good read.