Feature selection using clustering

Kamil 2011-4-28
Answered: arushi 2024-8-22
I have to select features using a clustering method, Ward's algorithm.
Short description of the dataset: 16000 records with 5400 float features each. I work on a subset because running on the full set causes out-of-memory errors.
Reading the MATLAB docs, the clustering itself is quite easy:
X = load('subset.data');
Y = pdist(X);
Z = linkage(Y,'ward');
T = cluster(Z,'maxclust',2); % I set 2 clusters because my dataset has 2 classes of objects, but I'm not sure if that is right.
% PCA visualization
[W, pc] = princomp(X); % princomp is superseded by pca in newer MATLAB releases
scatter(pc(:,1),pc(:,2),10,T,'filled')
And now I don't know what to do next. How can I select features? I now think that instead of Y = pdist(X) it should be Y = pdist(X'), because I want clusters of features and then to select some of them, right? But Y = pdist(X') also causes out of memory. I would be grateful for an answer on whether my way of thinking is correct.
Thank you in advance!

Answers (1)

arushi 2024-8-22
Hi Kamil,
When using clustering methods like Ward's algorithm for feature selection, the goal is to group similar features together and then select representative features from each cluster. You're correct in thinking that you need to cluster the features rather than the records, which means you should transpose your dataset. However, as you've noticed, computing the pairwise distances for such a large number of features can be memory-intensive.
Here are some strategies to handle this problem and proceed with feature selection:
Strategies for Clustering Features
Dimensionality Reduction Before Clustering:
  • Consider applying a dimensionality reduction technique, like Principal Component Analysis (PCA), to reduce the number of features before clustering. This can help alleviate memory issues.
  • You can use the top principal components as a lower-dimensional representation of your features.
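As a minimal sketch of this idea (assumptions: the `pca` function from the Statistics and Machine Learning Toolbox, R2012b or newer; `k = 20` components and `50` feature clusters are arbitrary tuning choices), each feature can be represented by its loadings on the top principal components, which makes the pairwise distances cheap to compute:

```matlab
% Sketch (assumption): represent each feature by its loadings on the
% top k principal components, then cluster those short vectors instead
% of the full 16000-element feature columns.
X = load('subset.data');              % records x features
k = 20;                               % number of components (tuning choice)
W = pca(X, 'NumComponents', k);       % W is features-by-k loadings
                                      % (princomp(X) on older releases)
Y = pdist(W);                         % distances between features in k dims
Z = linkage(Y, 'ward');
T = cluster(Z, 'maxclust', 50);       % e.g. 50 feature clusters (assumption)
```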
Sample a Subset of Features:
  • Randomly sample a subset of features to perform the clustering. Once you have identified clusters, you can evaluate the importance of features within those clusters on the full dataset.
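For instance (a sketch only; the subset size of 1000 features and the 50-cluster count are assumptions, not recommendations):

```matlab
% Sketch: cluster a random subset of features to keep pdist tractable.
rng(0);                                % fix the seed for reproducibility
nFeat = size(X, 2);
idx = randperm(nFeat, 1000);           % sample 1000 of the 5400 features
Xs = X(:, idx);                        % records x sampled features
Y = pdist(Xs');                        % distances between sampled features
Z = linkage(Y, 'ward');
T = cluster(Z, 'maxclust', 50);        % labels refer back to idx
```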
Incremental or Batch Processing:
  • Process the data in smaller batches. Although this can be complex to implement for clustering, it might be necessary if memory constraints are severe.
Use Efficient Data Structures:
  • Ensure that your data is stored in a memory-efficient format. Consider using MATLAB's tall arrays or other memory-efficient data structures.
Reduce Precision:
  • If possible, reduce the precision of your data (e.g., using single instead of double) to save memory.
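A one-line sketch of the precision change (`pdist` accepts single-precision input, so the intermediate distance vector is also stored in single, halving memory use):

```matlab
% Sketch: halve memory by casting the data to single precision.
X = single(load('subset.data'));
Y = pdist(X');    % distances between features, computed and stored as single
```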
Correcting Your Approach
Given your goal, here's how you can adjust your approach:
Transpose the Data:
  • Use X' to transpose the data, so you are clustering the features instead of the records.
Compute Pairwise Distances:
  • Compute the pairwise distances between features. If pdist(X') causes memory issues, consider reducing the number of features first.
Linkage and Clustering:
  • Use the linkage function to perform hierarchical clustering on the features.
Select Features:
  • After clustering, select representative features from each cluster. You can choose features that are closest to the centroid of each cluster or use domain knowledge to select features.
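Putting the four steps above together, a minimal sketch of the whole selection pipeline (the cluster count `nClust` and the use of the per-cluster mean as a "centroid" are assumptions; implicit expansion in the distance line needs R2016b or newer, use `bsxfun` on older releases):

```matlab
% Sketch: hierarchically cluster the features, then keep the feature
% nearest the mean of each cluster as its representative.
X = load('subset.data');              % records x features
nClust = 50;                          % number of feature clusters (assumption)
Y = pdist(X');                        % distances between features (columns)
Z = linkage(Y, 'ward');
T = cluster(Z, 'maxclust', nClust);   % one cluster label per feature

selected = zeros(1, nClust);
for c = 1:nClust
    members = find(T == c);                     % features in cluster c
    centroid = mean(X(:, members), 2);          % mean profile of the cluster
    d = sum((X(:, members) - centroid).^2, 1);  % squared distance to centroid
    [~, best] = min(d);
    selected(c) = members(best);                % keep the most central feature
end
Xsel = X(:, selected);                % reduced dataset for later modelling
```

The same `selected` indices can then be applied to the full 16000-record dataset, so only the clustering itself needs to run on a subset.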
Hope this helps.
