When applying the Simple Filter Approach (t-test) for feature selection, if all features have p-values of 0, does it mean that all features have strong discrimination power?

Question

Hussein 2024-4-20


Commented: Hussein on 2024-5-9

Hello all,

I have a frequency response function (FRF) dataset related to pipeline SHM, stored in a 6500x4000 matrix (6500 samples (signals) with 4000 features each). The dataset corresponds to 11 groups or class labels (pipeline conditions): 1500 samples labeled 'Fault-free', and 500 samples for each of 'BL_C1', 'BL_C2', 'BL_C3', 'BL_C4', 'SD_C1', 'SD_C2', 'SD_C3', 'SC_C1', 'SC_C2', and 'SC_C3'.

I used this code for feature selection using Simple Filter Approach (t-test):

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

% applying t.test for feature selection

% Define the class labels and sample counts
class_labels = {'Fault-free', 'BL_C1', 'BL_C2', 'BL_C3', 'BL_C4', 'SD_C1', 'SD_C2', 'SD_C3', 'SC_C1', 'SC_C2', 'SC_C3'};
sample_counts = [1500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500];
%sample_counts = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]; % using the average signal for each scenario

% Construct the 'groups' variable vector based on the class labels and sample counts
num_samples = sum(sample_counts);
groups = zeros(num_samples, 1);
start_idx = 1;
for i = 1:length(class_labels)
    end_idx = start_idx + sample_counts(i) - 1;
    groups(start_idx:end_idx) = i;
    start_idx = end_idx + 1;
end

% Applying the Simple Filter Approach (t-test)
t_scores = zeros(1, size(data, 2));
p_values = zeros(1, size(data, 2));
alpha = 0.05;
for feature = 1:size(data, 2)
    [h, p, ci, stats] = ttest2(data(:, feature), groups, 'Vartype', 'unequal');
    t_scores(feature) = stats.tstat;
    p_values(feature) = p;
end

% Select features based on p-values below the significance level
selected_features = find(p_values < alpha);
ecdf(p);
xlabel('P value');
ylabel('CDF value')

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The code returned a p-value of 0 for all features. I also tried using the average signal for each scenario, which reduces the dataset from 6500x4000 to 11x4000 (11 samples (signals) representing the 11 conditions, again with 4000 features each), but p-values of 0 were still returned.

Is this acceptable?

Does it mean that all features have strong discrimination power? I doubt it, to be honest!

Can anyone clear up the doubt, correct the code if I'm wrong somewhere, or help me with better code for a technique that works well with my dataset?

Thank you very much in advance!


Answer 1

Ayush Aniket 2024-5-7


Hi Hussein,

The t-test is traditionally used to compare the means between two groups. Your dataset involves 11 groups, which suggests that a one-way ANOVA (Analysis of Variance) might be more appropriate for comparing means across multiple groups.

Additionally, the 'ttest2' function is designed for comparing the means of two independent samples. In your code, you're comparing 'data(:, feature)' against 'groups', which is conceptually incorrect because 'groups' is not a dataset but a vector of class labels. For feature selection in a multi-class scenario, you would typically compare features across pairs of groups or use techniques designed for multi-class discrimination.
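To illustrate the point, here is a minimal sketch on synthetic data (not the FRF dataset). It assumes the class sizes from the question and uses a hypothetical pure-noise feature, `noise_feature`, that carries no class information at all. Passing the label vector as the second sample compares the feature's mean with the mean of the label codes 1 to 11, so the p-value can collapse toward 0 regardless of how well the feature separates the classes.

```matlab
rng(0);                                       % reproducible synthetic data
sample_counts = [1500, 500*ones(1, 10)];      % class sizes taken from the question
groups = repelem((1:11)', sample_counts(:));  % 6500x1 vector of label codes
noise_feature = randn(numel(groups), 1);      % hypothetical feature with no class information

% Mistaken call: the label vector is treated as the second sample
[~, p_wrong] = ttest2(noise_feature, groups, 'Vartype', 'unequal');

% Intended kind of call: the same feature, split by class (classes 1 and 2)
[~, p_pair] = ttest2(noise_feature(groups == 1), noise_feature(groups == 2), 'Vartype', 'unequal');

fprintf('feature vs. labels:   p = %g\n', p_wrong);
fprintf('class 1 vs. class 2:  p = %g\n', p_pair);
```

In this sketch the first p-value is tiny even though the feature is noise, which is likely what happened with the real data as well.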

The correct approach for comparing two different 'groups' is as follows:

% Define the two groups based on your binary class labels
group1_idx = groups == 1; % Indices for class 1
group2_idx = groups == 2; % Indices for class 2
% Preallocate arrays for t-scores and p-values
t_scores = zeros(1, size(data, 2));
p_values = zeros(1, size(data, 2));
% Loop through each feature to perform t-test
for feature = 1:size(data, 2)
    [h, p, ci, stats] = ttest2(data(group1_idx, feature), data(group2_idx, feature), 'Vartype', 'unequal');
    t_scores(feature) = stats.tstat;
    p_values(feature) = p;
end
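For all 11 classes at once, a per-feature one-way ANOVA could look like the sketch below. It is only an illustration under stated assumptions: the data are synthetic, `num_features` is a small stand-in for the 4000 real features, and feature 1 is deliberately made to depend on the class so that the ranking has something to find. It uses `anova1` from Statistics and Machine Learning Toolbox with the `'off'` display option.

```matlab
rng(1);                                        % reproducible synthetic data
sample_counts = [1500, 500*ones(1, 10)];       % class sizes taken from the question
groups = repelem((1:11)', sample_counts(:));   % 6500x1 vector of label codes
num_features = 5;                              % small stand-in for the 4000 features
data = randn(numel(groups), num_features);     % hypothetical data, pure noise
data(:, 1) = data(:, 1) + 0.5 * groups;        % make feature 1 depend on the class

p_values = zeros(1, num_features);
f_stats = zeros(1, num_features);
for feature = 1:num_features
    % 'off' suppresses the ANOVA table and box plot figures
    [p, tbl] = anova1(data(:, feature), groups, 'off');
    p_values(feature) = p;
    f_stats(feature) = tbl{2, 5};              % F statistic in the 'Groups' row
end

[~, ranking] = sort(f_stats, 'descend');       % most discriminative feature first
disp(ranking);
```

With thousands of samples, many features may fall below a fixed significance level, so ranking features by F statistic and keeping the top few is often more informative than thresholding the p-values.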

You may refer to the MathWorks documentation to read more about the arguments of the 'ttest2' function and about one-way ANOVA, which should be more suitable for your analysis.

Hope it helps.