how to find feature distribution in kmeans clustering

Question

Dhruvin Naik 2022-2-10


Direct link to this question

https://ww2.mathworks.cn/matlabcentral/answers/1647290-how-to-find-feature-distribution-in-kmeans-clustering

Commented: Image Analyst 2022-2-15

I am trying to do k-means clustering on my data. The data consist of information for each student (56 students in total), with features such as scores for each subject and other metrics such as a performance parameter. There are 39 features per student, so the data matrix is 56-by-39. I used k-means clustering to group the students into two clusters. I have attached the clustering result in the figure below, with the data plotted along the principal components. I want to know how the features are distributed across these clusters. For example, score1 is high (above a certain value) in cluster 1 and low in cluster 2, while score2 is low in cluster 1 and high in cluster 2. Is there a way to know how the features are distributed in these two clusters? I want to find the features that contribute to each k-means cluster.

I used the idx = kmeans(X,k) function in MATLAB.


Answer 1

Image Analyst 2022-2-10


Direct link to this answer

https://ww2.mathworks.cn/matlabcentral/answers/1647290-how-to-find-feature-distribution-in-kmeans-clustering#answer_893435

Edited: Image Analyst 2022-2-10

You can call pca() to get the loadings (coefficients) and the scores. The loadings matrix has one column per PC: the first column represents PC1, and the 39 values in that column are the weights of the 39 original features in that PC. You can also ask pca() for the percentage of the total variance explained by each PC (the "explained" output), like PC1 explains 60% of the variation and PC2 explains 30%. To see which original features (a score, a performance metric like days of class missed, or whatever) drive a given PC, look for the features with the largest-magnitude weights in that PC's column.
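A minimal sketch of the pca() call described above. The matrix X here is random placeholder data standing in for the real 56-by-39 student matrix, and the variable names are assumptions for illustration only. It shows where the loadings, scores, and per-PC explained variance come from, and how to list the original features that weigh most heavily in PC1.

```matlab
% Placeholder data standing in for the 56-by-39 student matrix (assumption).
rng(0);                          % reproducible random numbers
X = randn(56, 39);

% coeff:     39-by-39 loadings, one column per principal component.
% score:     56-by-39 coordinates of each student in PC space.
% explained: percent of total variance captured by each PC.
[coeff, score, ~, ~, explained] = pca(X);

% Original features ranked by the magnitude of their weight in PC1.
[~, order] = sort(abs(coeff(:, 1)), 'descend');
top5 = table(order(1:5), coeff(order(1:5), 1), ...
    'VariableNames', {'FeatureIndex', 'PC1Weight'});
disp(top5)

% Cumulative percent of variance explained by the first five PCs.
disp(cumsum(explained(1:5))')
```

With real data, replace X with your own matrix. pca() centers the columns by default but does not rescale them, so features measured in very different units may call for zscore(X) first.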

I'm not sure why you're doing kmeans on PCs in the first place. Seems weird to me. I mean, all the PCs are uncorrelated by construction, so plotting any one of them vs. another would just look like a random shotgun blast, kind of like yours does. There is only very weak correlation, as expected. So why do clustering on them? If anything, you'd do kmeans on the original data, not the principal components.
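One possible way to act on that suggestion and answer the original question directly is to cluster the original features and then compare the cluster centroids feature by feature. The sketch below uses random placeholder data with one feature deliberately shifted, and standardizes with zscore() first; the standardization step, the 'Replicates' setting, and all variable names are assumptions, not something stated in the thread.

```matlab
% Placeholder data: 56 students, 39 features (assumption).
rng(1);
X = randn(56, 39);
X(1:28, 1) = X(1:28, 1) + 3;     % make feature 1 differ between two groups

% Standardize so features on different scales contribute equally.
Z = zscore(X);

% Cluster the standardized original features, not the PCs.
k = 2;
[idx, C] = kmeans(Z, k, 'Replicates', 10);

% C is k-by-39: row j holds the mean z-score of every feature in cluster j.
% Features with the largest gap between the two centroids separate the
% clusters most strongly.
gap = C(1, :) - C(2, :);
[~, order] = sort(abs(gap), 'descend');

% Per-cluster means in original units for the five top-ranked features.
top = order(1:5);
means1 = mean(X(idx == 1, top), 1);
means2 = mean(X(idx == 2, top), 1);
disp(table(top', means1', means2', gap(top)', ...
    'VariableNames', {'Feature', 'MeanCluster1', 'MeanCluster2', 'ZGap'}))
```

A positive ZGap means the feature is higher in cluster 1 than in cluster 2, which is the "score1 is high in cluster 1 and low in cluster 2" kind of statement the question asks for.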

6 Comments

Image Analyst 2022-2-11

Think of PCs as being like a rotation of the axes. Let's say you had a dumbbell-shaped collection of points that was slanted along 45 degrees if you plotted the points as feature2 value vs. feature1 value. PC1 would go at a 45 degree angle along the axis of the dumbbell. PC2 would go perpendicular to that. So now each point has a new coordinate in the PC2 vs. PC1 coordinate system. Each point was classified in feature space. The fact that it now has additional coordinates in a new tilted coordinate system does not mean the points would be classified differently, so the distribution would be the same. Some points will be in one range of PC1 values (like the left ball of the dumbbell), and the other class's points will be in a different range of PC1 values (like the right ball of the dumbbell). I guess I'm not sure what you mean when you say you want to "know how the features are distributed". You can colorize the classes and plot them in PC space if you want, so that the major axes of the scatterplot will now go along the x, y, and z axes (PC1, PC2, and PC3), whereas before they might not have (they might have been slanted when plotted vs. feature1 value, feature2 value, and feature3 value).
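Colorizing the classes in PC space could look something like the sketch below. It assumes the clustering was done in (standardized) feature space and only the plotting is done on the PCs; the placeholder data and variable names are assumptions.

```matlab
rng(2);
X = randn(56, 39);                      % placeholder data (assumption)
Z = zscore(X);
idx = kmeans(Z, 2, 'Replicates', 10);   % cluster labels from feature space

[~, score] = pca(Z);                    % coordinates of each student on the PCs

% Same cluster labels, shown in the rotated PC1/PC2 coordinate system.
figure;
gscatter(score(:, 1), score(:, 2), idx);
xlabel('PC1'); ylabel('PC2');
legend('Cluster 1', 'Cluster 2');
title('k-means clusters plotted in PC space');
```

The labels in idx do not change when the points are re-expressed in PC coordinates; only the axes they are drawn against change.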

Dhruvin Naik 2022-2-15

I did the PCA on the two clusters and got the principal components for both clusters. Can you please tell me how I should compare the principal components from the two clusters and map them back to the original features, so that I can tell whether a given feature is more dominant in cluster one or cluster two?

Image Analyst 2022-2-15

The coefficients (first returned variable from pca()) give you that - they give you the relative weights of the original variables that are used when making the PC from the original variable values.
