Feature Extraction
What Is Feature Extraction?
Feature extraction is a set of methods that map input features to new output features. Many feature extraction methods use unsupervised learning. Unlike some feature extraction methods such as PCA and NNMF, the methods described in this section can increase dimensionality as well as decrease it. Internally, the methods involve optimizing nonlinear objective functions. For details, see Sparse Filtering Algorithm or Reconstruction ICA Algorithm.
One typical use of feature extraction is finding features in images. Using these features can lead to improved classification accuracy. For an example, see Feature Extraction Workflow. Another typical use is extracting individual signals from superpositions, which is often termed blind source separation. For an example, see Extract Mixed Signals.
There are two feature extraction functions: rica and sparsefilt. Associated with these functions are the objects that they create: ReconstructionICA and SparseFiltering.
Sparse Filtering Algorithm
The sparse filtering algorithm begins with a data matrix X that has n rows and p columns. Each row represents one observation and each column represents one measurement. The columns are also called the features or predictors. The algorithm then takes either an initial random p-by-q weight matrix W or uses the weight matrix passed in the InitialTransformWeights name-value pair. q is the requested number of features that sparsefilt computes.
The algorithm attempts to minimize the Sparse Filtering Objective Function by using a standard limited memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) quasi-Newton optimizer. See Nocedal and Wright [2]. This optimizer takes up to IterationLimit iterations. It stops iterating earlier when it takes a step whose norm is less than StepTolerance, or when it computes that the norm of the gradient at the current point is less than GradientTolerance times a scalar τ, where

\tau = \max\left(1, \min\left(|f|, \|g_0\|_\infty\right)\right),

|f| is the norm of the objective function, and ‖g_0‖_∞ is the infinity norm of the initial gradient.
The objective function attempts to simultaneously obtain few nonzero features for each data point, and for each resulting feature to have nearly equal weight. To understand how the objective function attempts to achieve these goals, see Ngiam, Koh, Chen, Bhaskar, and Ng [1].
Frequently, you obtain good features by setting a relatively small value of IterationLimit, from as low as 5 to a few hundred. Allowing the optimizer to continue can result in overtraining, where the extracted features do not generalize well to new data.
After constructing a SparseFiltering object, use the transform method to map input data to the new output features.
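For example, the following sketch fits a sparse filtering model and then maps the data to the new features; the data matrix, number of features, and iteration limit here are illustrative choices, not recommendations.

X = randn(1000, 50);                          % n-by-p data matrix (synthetic, for illustration)
q = 20;                                       % requested number of output features
Mdl = sparsefilt(X, q, 'IterationLimit', 10); % returns a SparseFiltering object; small limit as noted above
Z = transform(Mdl, X);                        % Z is the n-by-q matrix of extracted features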
Sparse Filtering Objective Function
To compute an objective function, the sparse filtering algorithm uses the following steps. The objective function depends on the n-by-p data matrix X and a weight matrix W that the optimizer varies. The weight matrix W has dimensions p-by-q, where p is the number of original features and q is the number of requested features.
1. Compute the n-by-q matrix X*W. Apply the approximate absolute value function

\phi(u) = \sqrt{u^{2} + \varepsilon},

where ε is a small positive constant, to each element of X*W to obtain the matrix F. ϕ is a smooth nonnegative symmetric function that closely approximates the absolute value function.

2. Normalize the columns of F by the approximate L2 norm. In other words, define the normalized matrix F̃ by

\tilde{F}_{ij} = \frac{F_{ij}}{\sqrt{\sum_{i=1}^{n} F_{ij}^{2} + \varepsilon}}.

3. Normalize the rows of F̃ by the approximate L2 norm. In other words, define the normalized matrix F̂ by

\hat{F}_{ij} = \frac{\tilde{F}_{ij}}{\sqrt{\sum_{j=1}^{q} \tilde{F}_{ij}^{2} + \varepsilon}}.

The matrix F̂ is the matrix of converted features in X. Once sparsefilt finds the weights W that minimize the objective function h (see below), which the function stores in the output object Mdl in the Mdl.TransformWeights property, the transform function can follow the same transformation steps to convert new data to output features.

4. Compute the objective function h(W) as the 1-norm of the matrix F̂, meaning the sum of all the elements in the matrix (which are nonnegative by construction):

h(W) = \sum_{j=1}^{q} \sum_{i=1}^{n} \hat{F}_{ij}.

If you set the Lambda name-value pair to a strictly positive value, sparsefilt uses the following modified objective function:

h(W) = \sum_{j=1}^{q} \sum_{i=1}^{n} \hat{F}_{ij} + \lambda \sum_{j=1}^{q} w_j^{T} w_j.

Here, w_j is the jth column of the matrix W and λ is the value of Lambda. The effect of this term is to shrink the weights W. If you plot the columns of W as images, with positive Lambda these images appear smooth compared to the same images with zero Lambda.
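As an informal illustration of these steps (not the library implementation), the following sketch applies the same transformation to a fixed weight matrix and evaluates the unregularized objective; the data, the weights, and the value of the small constant ε are placeholders.

X = randn(100, 20);                                   % n-by-p data matrix (synthetic)
q = 5;                                                % number of requested features
W = randn(20, q);                                     % p-by-q weights (stand-in for Mdl.TransformWeights)
epsilon = 1e-8;                                       % small positive constant (assumed value)

F      = sqrt((X*W).^2 + epsilon);                    % step 1: phi applied elementwise to X*W
Ftilde = F ./ sqrt(sum(F.^2, 1) + epsilon);           % step 2: normalize columns by the approximate L2 norm
Fhat   = Ftilde ./ sqrt(sum(Ftilde.^2, 2) + epsilon); % step 3: normalize rows by the approximate L2 norm
h      = sum(Fhat(:));                                % step 4: objective value, the sum of all elements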
Reconstruction ICA Algorithm
The Reconstruction Independent Component Analysis (RICA) algorithm is based on minimizing an objective function. The algorithm maps input data to output features.
The ICA source model is the following. Each observation x is generated by a random vector s according to

x = \mu + A s.

- x is a column vector of length p.
- μ is a column vector of length p representing a constant term.
- s is a column vector of length q whose elements are zero mean, unit variance random variables that are statistically independent of each other.
- A is a mixing matrix of size p-by-q.
You can use this model in rica to estimate A from observations of x. See Extract Mixed Signals.
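As a rough sketch of this source model, the following code generates observations from independent zero-mean, unit-variance sources and then fits a rica model to them; the sizes, source distribution, and mixing matrix are arbitrary illustrative choices, and the transformed features recover the sources only up to ordering and scaling.

n = 1000;  p = 4;  q = 2;                     % sizes are illustrative
S  = rand(n, q)*2*sqrt(3) - sqrt(3);          % independent sources: zero mean, unit variance (uniform)
A  = randn(p, q);                             % mixing matrix
mu = randn(1, p);                             % constant term
X  = S*A' + mu;                               % each row of X is one observation (mu + A*s)'

Mdl  = rica(X, q);                            % estimate the model with q features
Shat = transform(Mdl, X);                     % recovered source estimates (up to order and scale)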
The RICA algorithm begins with a data matrix X that has n rows and p columns consisting of the observations x_i:

X = \begin{bmatrix} x_1^{T} \\ x_2^{T} \\ \vdots \\ x_n^{T} \end{bmatrix}.

Each row represents one observation and each column represents one measurement. The columns are also called the features or predictors. The algorithm then takes either an initial random p-by-q weight matrix W or uses the weight matrix passed in the InitialTransformWeights name-value pair. q is the requested number of features that rica computes. The weight matrix W is composed of columns w_i of size p-by-1:

W = \begin{bmatrix} w_1 & w_2 & \cdots & w_q \end{bmatrix}.
The algorithm attempts to minimize the Reconstruction ICA Objective Function by using a standard limited memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) quasi-Newton optimizer. See Nocedal and Wright [2]. This optimizer takes up to IterationLimit iterations. It stops iterating when it takes a step whose norm is less than StepTolerance, or when it computes that the norm of the gradient at the current point is less than GradientTolerance times a scalar τ, where

\tau = \max\left(1, \min\left(|f|, \|g_0\|_\infty\right)\right),

|f| is the norm of the objective function, and ‖g_0‖_∞ is the infinity norm of the initial gradient.
The objective function attempts to obtain a nearly orthonormal weight matrix that minimizes the sum of elements of g(XW), where g is a function (described below) that is applied elementwise to XW. To understand how the objective function attempts to achieve these goals, see Le, Karpenko, Ngiam, and Ng [3].
After constructing a ReconstructionICA object, use the transform method to map input data to the new output features.
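For example, a minimal sketch along the lines of the sparsefilt example above; the data and parameter values are illustrative.

X = randn(1000, 50);                          % n-by-p data matrix (synthetic, for illustration)
q = 20;                                       % requested number of output features
Mdl = rica(X, q, 'IterationLimit', 100);      % returns a ReconstructionICA object
Z = transform(Mdl, X);                        % Z is the n-by-q matrix of extracted features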
Reconstruction ICA Objective Function
The objective function uses a contrast function, which you specify by using the ContrastFcn name-value pair. The contrast function is a smooth convex function that is similar to an absolute value. By default, the contrast function is

g(x) = \tfrac{1}{2}\log\left(\cosh(2x)\right).

For other available contrast functions, see ContrastFcn.
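As a quick check of this behavior (assuming the default contrast function written above), you can compare g with the absolute value at a few points.

g = @(x) 0.5*log(cosh(2*x));                  % assumed default contrast function (see above)
x = linspace(-3, 3, 7);
disp([abs(x); g(x)])                          % g is smooth at 0 and approaches |x| - log(2)/2 for large |x|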
For an n-by-p data matrix X and q output features, with a regularization parameter λ as the value of the Lambda name-value pair, the objective function in terms of the p-by-q matrix W is

h = \frac{\lambda}{n} \sum_{i=1}^{n} \left\| W W^{T} x_i - x_i \right\|_2^{2} + \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{q} \sigma_j \, g\left( w_j^{T} x_i \right).
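The following sketch evaluates this objective for a fixed weight matrix, assuming the default contrast function given above; the data, weights, λ, and σ_j values are placeholders.

X = randn(100, 20);                               % n-by-p data matrix (synthetic)
[n, p] = size(X);
q = 5;                                            % number of output features
W = randn(p, q);                                  % p-by-q weights (stand-in for Mdl.TransformWeights)
lambda = 1;                                       % value of the Lambda name-value pair
sigma = ones(1, q);                               % sigma_j values: +1 (super Gaussian) or -1 (sub Gaussian)
g = @(x) 0.5*log(cosh(2*x));                      % assumed default contrast function

recon   = lambda/n * sum(sum((X*W*W' - X).^2));   % reconstruction term: sum over i of ||W*W'*x_i - x_i||^2
penalty = 1/n * sum(sum(g(X*W) .* sigma));        % non-Gaussianity term: sum over i,j of sigma_j * g(w_j'*x_i)
h = recon + penalty;                              % objective value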
The σ_j are known constants that are ±1. When σ_j = +1, minimizing the objective function h encourages the histogram of w_j^T x_i to be sharply peaked at 0 (super Gaussian). When σ_j = –1, minimizing the objective function h encourages the histogram of w_j^T x_i to be flatter near 0 (sub Gaussian). Specify the σ_j values using the rica NonGaussianityIndicator name-value pair.
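For instance, a sketch of requesting five features with a mix of indicators; the particular split between +1 and –1 values is arbitrary here.

X = randn(100, 20);                                           % synthetic n-by-p data
Mdl = rica(X, 5, 'NonGaussianityIndicator', [1 1 1 -1 -1]);   % first three features super Gaussian, last two sub Gaussian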
The objective function h can have a spurious minimum of zero when λ is zero. Therefore, rica minimizes h over W that are normalized to 1. In other words, each column w_j of W is defined in terms of a column vector v_j by

w_j = \frac{v_j}{\sqrt{v_j^{T} v_j}}.
rica minimizes over the v_j. The resulting minimal matrix W provides the transformation from input data X to output features XW.
References
[1] Ngiam, Jiquan, Zhenghao Chen, Sonia A. Bhaskar, Pang W. Koh, and Andrew Y. Ng. “Sparse Filtering.” Advances in Neural Information Processing Systems. Vol. 24, 2011, pp. 1125–1133. https://papers.nips.cc/paper/4334-sparse-filtering.pdf.
[2] Nocedal, J. and S. J. Wright. Numerical Optimization, Second Edition. Springer Series in Operations Research, Springer Verlag, 2006.
[3] Le, Quoc V., Alexandre Karpenko, Jiquan Ngiam, and Andrew Y. Ng. “ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning.” Advances in Neural Information Processing Systems. Vol. 24, 2011, pp. 1017–1025. https://papers.nips.cc/paper/4467-ica-with-reconstruction-cost-for-efficient-overcomplete-feature-learning.pdf.
See Also
rica | sparsefilt | ReconstructionICA | SparseFiltering