Feature selection using clustering

Kamil 2011-4-28
Answered: arushi 2024-8-22
I have to select features using a clustering method, Ward's algorithm.
Short description of the dataset: 16000 records with 5400 float features each. I work on a subset because running on the full set causes out-of-memory errors.
Reading the MATLAB docs, the clustering itself is quite easy:
X = load('subset.data');
Y = pdist(X);
Z = linkage(Y,'ward');
T = cluster(Z,'maxclust',2); % I set 2 clusters because my dataset has 2 classes of objects, but I'm not sure if that is right.
% PCA visualization
[W, pc] = princomp(X); % princomp is superseded by pca in newer MATLAB releases
scatter(pc(:,1),pc(:,2),10,T,'filled')
And now I don't know what to do next. How can I select features? I now think that instead of Y = pdist(X) it should be Y = pdist(X'), because I want clusters of features and then to select some of them, right? But Y = pdist(X') also causes out of memory. I would be grateful for an answer on whether my way of thinking is correct.
Thank you in advance!

Answers (1)

arushi 2024-8-22
Hi Kamil,
When using clustering methods like Ward's algorithm for feature selection, the goal is to group similar features together and then select representative features from each cluster. You're correct in thinking that you need to cluster the features rather than the records, which means you should transpose your dataset. However, as you've noticed, computing the pairwise distances for such a large number of features can be memory-intensive.
Here are some strategies to handle this problem and proceed with feature selection:
Strategies for Clustering Features
Dimensionality Reduction Before Clustering:
  • Consider applying a dimensionality reduction technique, like Principal Component Analysis (PCA), to reduce the number of features before clustering. This can help alleviate memory issues.
  • You can use the top principal components as a lower-dimensional representation of your features.
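As a minimal sketch of this idea (assumptions: the `pca` function from the Statistics and Machine Learning Toolbox, R2012b or newer; `k = 20` components and `50` feature clusters are arbitrary tuning choices), each feature can be represented by its loadings on the top principal components, which makes the pairwise distances cheap to compute:

```matlab
% Sketch (assumption): represent each feature by its loadings on the
% top k principal components, then cluster those short vectors instead
% of the full 16000-element feature columns.
X = load('subset.data');              % records x features
k = 20;                               % number of components (tuning choice)
W = pca(X, 'NumComponents', k);       % W is features-by-k loadings
                                      % (princomp(X) on older releases)
Y = pdist(W);                         % distances between features in k dims
Z = linkage(Y, 'ward');
T = cluster(Z, 'maxclust', 50);       % e.g. 50 feature clusters (assumption)
```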
Sample a Subset of Features:
  • Randomly sample a subset of features to perform the clustering. Once you have identified clusters, you can evaluate the importance of features within those clusters on the full dataset.
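For instance (a sketch only; the subset size of 1000 features and the 50-cluster count are assumptions, not recommendations):

```matlab
% Sketch: cluster a random subset of features to keep pdist tractable.
rng(0);                                % fix the seed for reproducibility
nFeat = size(X, 2);
idx = randperm(nFeat, 1000);           % sample 1000 of the 5400 features
Xs = X(:, idx);                        % records x sampled features
Y = pdist(Xs');                        % distances between sampled features
Z = linkage(Y, 'ward');
T = cluster(Z, 'maxclust', 50);        % labels refer back to idx
```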
Incremental or Batch Processing:
  • Process the data in smaller batches. Although this can be complex to implement for clustering, it might be necessary if memory constraints are severe.
Use Efficient Data Structures:
  • Ensure that your data is stored in a memory-efficient format. Consider using MATLAB's tall arrays or other memory-efficient data structures.
Reduce Precision:
  • If possible, reduce the precision of your data (e.g., using single instead of double) to save memory.
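A one-line sketch of the precision change (`pdist` accepts single-precision input, so the intermediate distance vector is also stored in single, halving memory use):

```matlab
% Sketch: halve memory by casting the data to single precision.
X = single(load('subset.data'));
Y = pdist(X');    % distances between features, computed and stored as single
```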
Correcting Your Approach
Given your goal, here's how you can adjust your approach:
Transpose the Data:
  • Use X' to transpose the data, so you are clustering the features instead of the records.
Compute Pairwise Distances:
  • Compute the pairwise distances between features. If pdist(X') causes memory issues, consider reducing the number of features first.
Linkage and Clustering:
  • Use the linkage function to perform hierarchical clustering on the features.
Select Features:
  • After clustering, select representative features from each cluster. You can choose features that are closest to the centroid of each cluster or use domain knowledge to select features.
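Putting the four steps above together, a minimal sketch of the whole selection pipeline (the cluster count `nClust` and the use of the per-cluster mean as a "centroid" are assumptions; implicit expansion in the distance line needs R2016b or newer, use `bsxfun` on older releases):

```matlab
% Sketch: hierarchically cluster the features, then keep the feature
% nearest the mean of each cluster as its representative.
X = load('subset.data');              % records x features
nClust = 50;                          % number of feature clusters (assumption)
Y = pdist(X');                        % distances between features (columns)
Z = linkage(Y, 'ward');
T = cluster(Z, 'maxclust', nClust);   % one cluster label per feature

selected = zeros(1, nClust);
for c = 1:nClust
    members = find(T == c);                     % features in cluster c
    centroid = mean(X(:, members), 2);          % mean profile of the cluster
    d = sum((X(:, members) - centroid).^2, 1);  % squared distance to centroid
    [~, best] = min(d);
    selected(c) = members(best);                % keep the most central feature
end
Xsel = X(:, selected);                % reduced dataset for later modelling
```

The same `selected` indices can then be applied to the full 16000-record dataset, so only the clustering itself needs to run on a subset.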
Hope this helps.
