When applying the Simple Filter Approach (t-test) for feature selection, if all features have p-values of 0, does it mean that all features have strong discrimination power?

Hello all,
I have a frequency response function (FRF) dataset related to pipeline SHM stored in a 6500x4000 matrix (6500 samples (signals) with 4000 features each). The dataset corresponds to 11 groups or class labels (pipeline conditions): 1500 samples labeled 'Fault-free', and 500 samples each labeled 'BL_C1', 'BL_C2', 'BL_C3', 'BL_C4', 'SD_C1', 'SD_C2', 'SD_C3', 'SC_C1', 'SC_C2', and 'SC_C3'.
I used this code for feature selection using Simple Filter Approach (t-test):
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
% applying t.test for feature selection
% Define the class labels and sample counts
class_labels = {'Fault-free', 'BL_C1', 'BL_C2', 'BL_C3', 'BL_C4', 'SD_C1', 'SD_C2', 'SD_C3', 'SC_C1', 'SC_C2', 'SC_C3'};
sample_counts = [1500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500];
%sample_counts = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]; % using the average signal for each scenario
% Construct the 'groups' variable vector based on the class labels and sample counts
num_samples = sum(sample_counts);
groups = zeros(num_samples, 1);
start_idx = 1;
for i = 1:length(class_labels)
    end_idx = start_idx + sample_counts(i) - 1;
    groups(start_idx:end_idx) = i;
    start_idx = end_idx + 1;
end
% Applying the Simple Filter Approach (t-test)
t_scores = zeros(1, size(data, 2));
p_values = zeros(1, size(data, 2));
alpha = 0.05;
for feature = 1:size(data, 2)
    [h, p, ci, stats] = ttest2(data(:, feature), groups, 'Vartype', 'unequal');
    t_scores(feature) = stats.tstat;
    p_values(feature) = p;
end
% Select features based on p-values below the significance level
selected_features = find(p_values < alpha);
ecdf(p);
xlabel('P value');
ylabel('CDF value')
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
The code returned a p-value of 0 for all features. I also tried using the average signal for each scenario, which reduces the dataset from 6500x4000 to 11x4000 (11 signals representing the 11 conditions, again with 4000 features each), but p-values of 0 were still returned.
Is this acceptable?
Does it mean that all features have strong discrimination power? I doubt it, to be honest!
Can anyone clear up this doubt, correct the code if I'm wrong somewhere, or help me with code for a better technique that works well with my dataset?
Thank you very much in advance!

Accepted Answer

Ayush Aniket 2024-5-7
Hi Hussein,
The t-test is traditionally used to compare the means between two groups. Your dataset involves 11 groups, which suggests that a one-way ANOVA (Analysis of Variance) might be more appropriate for comparing means across multiple groups.
Additionally, the 'ttest2' function is designed for comparing the means of two independent samples. In your code, you're comparing 'data(:, feature)' against 'groups', which is conceptually incorrect because 'groups' is not a dataset but a vector of class labels. For feature selection in a multi-class scenario, you would typically compare features across pairs of groups or use techniques designed for multi-class discrimination.
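To see why every p-value came out as 0, it may help to reproduce the effect on synthetic data. The sketch below is only an illustration: the class sizes, the noise scale, and the names 'feature', 'p_wrong', and 'p_right' are assumptions, not taken from your dataset. It builds a feature that carries no class information at all and tests it both ways.

```matlab
% Hypothetical synthetic data: one feature that is pure noise
rng(0);                                    % reproducible random numbers
groups  = repelem((1:11)', 100);           % labels 1..11, 100 samples each (assumed sizes)
feature = 0.01 * randn(numel(groups), 1);  % unrelated to the class labels

% What the original loop does: compares feature values with the label codes
[~, p_wrong] = ttest2(feature, groups, 'Vartype', 'unequal');

% A two-class t-test: compares the same feature between class 1 and class 2
[~, p_right] = ttest2(feature(groups == 1), feature(groups == 2), 'Vartype', 'unequal');

fprintf('feature vs. label codes: p = %g\n', p_wrong);
fprintf('class 1 vs. class 2:     p = %g\n', p_right);
```

In the first call the test only asks whether the mean of the feature differs from the mean of the integers 1 to 11, which is almost always true for any feature on a different scale, so the p-value collapses toward 0 regardless of how well the feature separates the classes. A p-value of 0 from that call therefore says nothing about discrimination power.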
The correct approach for comparing two different 'groups' is as follows:
% Define the two groups based on your binary class labels
group1_idx = groups == 1; % Indices for class 1
group2_idx = groups == 2; % Indices for class 2
% Preallocate arrays for t-scores and p-values
t_scores = zeros(1, size(data, 2));
p_values = zeros(1, size(data, 2));
% Loop through each feature to perform t-test
for feature = 1:size(data, 2)
    [h, p, ci, stats] = ttest2(data(group1_idx, feature), data(group2_idx, feature), 'Vartype', 'unequal');
    t_scores(feature) = stats.tstat;
    p_values(feature) = p;
end
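If the loop is slow for 4000 features, the same two-class comparison can probably be written without it, because 'ttest2' runs a separate test along each column when both inputs are matrices with the same number of columns. The following is a minimal sketch on made-up data; the matrix sizes and the shifted feature 3 are assumptions used only to make the example self-contained.

```matlab
% Hypothetical stand-in data: 2 classes, 200 samples each, 10 features
rng(2);
groups = repelem((1:2)', 200);
data   = randn(numel(groups), 10);
data(groups == 2, 3) = data(groups == 2, 3) + 2;   % only feature 3 differs between classes

group1_idx = groups == 1;   % logical index for class 1
group2_idx = groups == 2;   % logical index for class 2

% One call tests every column; p_values and stats.tstat are 1-by-10 row vectors
[~, p_values, ~, stats] = ttest2(data(group1_idx, :), data(group2_idx, :), 'Vartype', 'unequal');
t_scores = stats.tstat;

% Keep the features whose p-value falls below the significance level
alpha = 0.05;
selected_features = find(p_values < alpha);
```

When plotting the distribution afterwards, pass the whole vector, i.e. ecdf(p_values); the original script called ecdf(p), which only sees the p-value of the last feature in the loop.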
You may refer to the documentation of the 'ttest2' function to read more about its arguments, and to the documentation on one-way ANOVA, which should be more suitable for your analysis.
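For the 11-class case, a per-feature one-way ANOVA could look roughly like the sketch below. It uses 'anova1' with the display switched off and runs on synthetic stand-in data, so the sizes, the shifted feature 1, and the Bonferroni threshold are assumptions for illustration rather than a recommendation tuned to your FRF data.

```matlab
% Hypothetical stand-in data; replace with the real 6500x4000 matrix and its labels
rng(1);
num_classes  = 11;
n_per_class  = 50;
num_features = 20;
groups = repelem((1:num_classes)', n_per_class);   % numeric class label per sample
data   = randn(numel(groups), num_features);
data(:, 1) = data(:, 1) + groups;                  % make feature 1 depend on the class

% One-way ANOVA per feature: does the feature mean differ across the classes?
p_values = zeros(1, num_features);
for feature = 1:num_features
    p_values(feature) = anova1(data(:, feature), groups, 'off');   % 'off' suppresses the figures
end

% Select features with a Bonferroni-corrected threshold (one possible choice)
alpha = 0.05;
selected_features = find(p_values < alpha / num_features);

% Rank features from smallest to largest p-value
[~, ranking] = sort(p_values);
```

With thousands of samples, many features may still reach very small p-values, so ranking the features and keeping the top few is often more informative than a fixed threshold. Note also that ANOVA only tests whether at least one class mean differs; it does not say which classes a feature separates.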
Hope it helps.

Release

R2021b
