rankfeatures

Rank key features by class separability criteria

Description

IDX = rankfeatures(X,GROUP) ranks the features in X using an independent evaluation criterion for binary classification. X is a matrix where every column is an observed vector and the number of rows corresponds to the original number of features. GROUP contains the class labels. IDX is a list of indices to the rows of X with the most significant features.

IDX = rankfeatures(X,GROUP,Name=Value) uses additional options specified by one or more name-value arguments.

[IDX,Z] = rankfeatures(X,GROUP,___) also returns a list of absolute values of the criterion used for every feature.

Examples

Find a reduced set of genes that is sufficient for differentiating breast cancer cells from all other types of cancer in the t-matrix NCI60 data set.

Load sample data.

load NCI60tmatrix

Get a logical index vector to the breast cancer cells.

BC = GROUP == 8;

Select features.

I = rankfeatures(X,BC,NumberOfIndices=12);

Test features with a linear discriminant classifier.

C = classify(X(I,:)',X(I,:)',double(BC));
cp = classperf(BC,C);
cp.CorrectRate
ans = 
1

Use cross-correlation weighting to further reduce the required number of genes.

I = rankfeatures(X,BC,'CCWeighting',0.7,'NumberOfIndices',8);
C = classify(X(I,:)',X(I,:)',double(BC));
cp = classperf(BC,C);
cp.CorrectRate 
ans = 
1

Find the discriminant peaks of two groups of signals with Gaussian pulses modulated by two different sources.

Load data.

load GaussianPulses

Specify the regional information used to outweigh the Z-value of features as a function handle, and set the number of output indices to 5.

f = rankfeatures(y',grp,NWeighting=@(x) x/10+5,NumberOfIndices=5);
plot(t,y(grp==1,:),'b',t,y(grp==2,:),'g',t(f),1.35,'vr');

[Figure: the two groups of signals plotted as lines, with markers at the selected feature locations.]

Input Arguments

X: Sample data, specified as a numeric matrix. Each column is an observed vector, and each row is a feature.

Data Types: double

GROUP: Class labels, specified as a numeric vector, string vector, or cell array of character vectors. numel(GROUP) must equal the number of columns in X, and GROUP must have only two unique values. If GROUP contains any NaN values, the function ignores the corresponding observation vectors in X.

Data Types: double | string | cell

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: [idx,x] = rankfeatures(x,groups,Criterion="entropy",NWeighting=0.2) uses the relative entropy as the criterion for assessing feature significance and a regional information value of 0.2 to outweigh the Z-value of potential features.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: [idx,x] = rankfeatures(x,groups,'Criterion',"entropy",'NWeighting',0.2)

Criterion: Criterion to assess the significance of each feature for separating two labeled groups, specified as one of the following:

  • "ttest" — Absolute value two-sample t-test with pooled variance estimate.

  • "entropy" — Relative entropy, also known as Kullback-Leibler distance or divergence.

  • "bhattacharyya" — Minimum attainable classification error or Chernoff bound.

  • "roc" — Area between the empirical receiver operating characteristic (ROC) curve and the random classifier slope.

  • "wilcoxon" — Absolute value of the standardized u-statistic of a two-sample unpaired Wilcoxon test, also known as Mann-Whitney.

Note

"ttest", "entropy", and "bhattacharyya" assume normally distributed classes, while "roc" and "wilcoxon" are nonparametric tests. All tests are feature independent.

Data Types: char | string
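As a rough illustration of what the "ttest" criterion measures, the following sketch computes the absolute two-sample t-statistic with a pooled variance estimate for each row of a small hand-made matrix, using only base MATLAB functions. The variable names (Xs, g, tstat, order) and the data are hypothetical, and this is not the internal implementation of rankfeatures.

```matlab
% Hypothetical data: 2 features (rows) x 6 observations (columns).
Xs = [1 2 3  4  5  6;
      1 2 3 11 12 13];
g  = [true true true false false false];   % two classes, three observations each

% Per-feature class sizes, means, and variances (dimension 2 = across observations).
n1 = sum(g);             n2 = sum(~g);
m1 = mean(Xs(:,g),2);    m2 = mean(Xs(:,~g),2);
v1 = var(Xs(:,g),0,2);   v2 = var(Xs(:,~g),0,2);

% Pooled variance estimate and absolute t-statistic for every feature.
sp2   = ((n1-1)*v1 + (n2-1)*v2) / (n1+n2-2);
tstat = abs(m1 - m2) ./ sqrt(sp2 * (1/n1 + 1/n2));

% Order the features from most to least separable.
[~,order] = sort(tstat,'descend');
```

In this toy data the second feature has the larger gap between class means with the same within-class spread, so it is ordered first.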

CCWeighting: Correlation information to outweigh the Z-value of potential features, specified as a numeric scalar between 0 and 1.

The function uses Z × (1 − α × ρ) to calculate the weight, where ρ is the average of the absolute values of the cross-correlation coefficients between the candidate feature and all previously selected features, and α is the CCWeighting value that sets the weighting factor.

By default, α is 0, and the function does not weight the potential features. A large value of ρ (close to 1) outweighs the significance statistic, meaning that features that are highly correlated with the features already picked are less likely to be included in the output list.

Data Types: double
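The correlation weighting described above can be sketched for two candidate features as follows. This is a minimal illustration under assumed data: the matrix Xs, the criterion values z, and the choice of which feature is already selected are all hypothetical, and corrcoef from base MATLAB stands in for whatever correlation routine rankfeatures uses internally.

```matlab
% Hypothetical data: rows are features, columns are observations.
Xs = [1  2 3  4  5;      % feature 1 (assumed already selected)
      2  4 6  8 10;      % feature 2, perfectly correlated with feature 1
      1 -1 0 -1  1];     % feature 3, uncorrelated with feature 1
selected = 1;            % indices of previously selected features
cand     = [2 3];        % candidate features
z        = [4 4];        % illustrative criterion values for the candidates
alpha    = 0.7;          % CCWeighting value

w = zeros(size(cand));
for k = 1:numel(cand)
    % rho: average absolute correlation with all previously selected features.
    r = zeros(size(selected));
    for s = 1:numel(selected)
        c    = corrcoef(Xs(cand(k),:), Xs(selected(s),:));
        r(s) = abs(c(1,2));
    end
    rho  = mean(r);
    w(k) = z(k) * (1 - alpha*rho);   % weighted significance Z*(1 - alpha*rho)
end
```

Both candidates start with the same criterion value, but the one that duplicates the selected feature is reduced to 30% of it, while the uncorrelated one keeps its full value.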

NWeighting: Regional information to outweigh the Z-value of potential features, specified as a nonnegative scalar or function handle.

The function uses Z × (1 − e^(−(D/β)^2)) to calculate the weight, where D is the distance (in rows) between the candidate feature and previously selected features, and β is the NWeighting value that sets the weighting factor. β must be greater than or equal to 0.

By default, β is 0, and the function does not weight the potential features. A small value of D (close to 0) outweighs the significance statistics of only close features. This means that features that are close to already picked features are less likely to be included in the output list. This option is useful for extracting features from time series with temporal correlation.

β can also be a function of the feature location, specified using @ or an anonymous function. In both cases rankfeatures passes the row position of the feature to the specified function and expects back a value greater than or equal to 0.
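To get a feel for the regional weighting, the sketch below evaluates Z × (1 − e^(−(D/β)^2)) for a few distances, first with a scalar β and then with β given by a function of the row position. The criterion value z, the distances D, and the row position are hypothetical; the function handle has the same form as the one used in the Gaussian pulses example.

```matlab
z    = 4;                 % illustrative criterion value of a candidate feature
D    = [0 1 5 20];        % distances (in rows) to a previously selected feature
beta = 5;                 % NWeighting value given as a scalar

% Weighted significance: close to 0 near a selected feature, close to z far away.
w = z * (1 - exp(-(D/beta).^2));

% beta given as a function of the row position of the feature.
betaFcn = @(x) x/10 + 5;
row     = 50;             % hypothetical row position of the candidate
wRow    = z * (1 - exp(-(D/betaFcn(row)).^2));
```

A larger β widens the neighborhood around each selected feature in which candidates are penalized, which is why the row-dependent β above (10 at row 50) suppresses the same distances more strongly than the scalar value of 5.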

Note

You can use CCWeighting and NWeighting together.

Data Types: double | function_handle

NumberOfIndices: Number of output indices in IDX, specified as a positive scalar.

When α and β are 0 or there are fewer than 20 features, the default value is the same as the number of features. Otherwise, the default value is 20.

Data Types: double

CrossNorm: Method for independent normalization across observations for every feature, specified as one of the following:

  • "none" (default) — No normalization.

  • "meanvar": Xnew = (X − μ)/σ

  • "softmax": Xnew = 1/(1 + e^((μ − X)/σ))

  • "minmax": Xnew = (X − Xmin)/(Xmax − Xmin)

In these equations, μ = mean(X), σ = std(X), Xmin = min(X), and Xmax = max(X).

Cross-normalization ensures comparability among different features although it is not always necessary because the selected criterion might already account for this.
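The three normalization formulas can be written out in base MATLAB as in the following sketch, which normalizes each feature (row) across its observations (columns). The matrix Xs is hypothetical, and the code relies on implicit expansion (R2016b or later); it illustrates the formulas only and is not the code that rankfeatures runs.

```matlab
% Hypothetical data: 2 features (rows) x 5 observations (columns).
% The second feature is the first one on a 10x larger scale.
Xs = [ 1  2  3  4  5;
      10 20 30 40 50];

% Per-feature statistics across observations (dimension 2).
mu    = mean(Xs,2);
sigma = std(Xs,0,2);
xmin  = min(Xs,[],2);
xmax  = max(Xs,[],2);

Xmeanvar = (Xs - mu) ./ sigma;                   % "meanvar"
Xsoftmax = 1 ./ (1 + exp((mu - Xs) ./ sigma));   % "softmax"
Xminmax  = (Xs - xmin) ./ (xmax - xmin);         % "minmax"
```

After any of the three normalizations the two rows coincide, which is the sense in which cross-normalization makes features on different scales comparable.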

Data Types: char | string

Output Arguments

IDX: List of indices to the rows of X with the most significant features, returned as a numeric vector.

Z: List of absolute values of the Criterion used for the features, returned as a numeric vector.
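A short usage sketch that requests both outputs, reusing the NCI60 data from the examples. It assumes the criterion output has one entry per row of X, as the description above suggests, so that indexing it with the returned indices gives the criterion values of the selected features; the variable names idx, z, and zSelected are illustrative.

```matlab
load NCI60tmatrix                 % provides X and GROUP, as in the examples
BC = GROUP == 8;                  % logical index of the breast cancer cells

% Rank features with the nonparametric "roc" criterion and keep five indices.
[idx,z] = rankfeatures(X,BC,Criterion="roc",NumberOfIndices=5);

% Criterion values of the selected features (assumes z is indexed by row of X).
zSelected = z(idx);
```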

Version History

Introduced before R2006a