Regularize Wide Data in Parallel
This example shows how to regularize a model with many more predictors than observations. Wide data is data with more predictors than observations. Typically, with wide data you want to identify important predictors. Use lassoglm as an exploratory or screening tool to select a smaller set of variables to prioritize your modeling and research. Use parallel computing to speed up cross-validation.
Load the ovariancancer data. This data has 216 observations and 4000 predictors in the obs workspace variable. The responses are binary, either 'Cancer' or 'Normal', in the grp workspace variable. Convert the responses to a logical vector for use in lassoglm.
load ovariancancer
y = strcmp(grp,'Cancer');
Set options to use parallel computing. Prepare to compute in parallel using parpool.
opt = statset('UseParallel',true);
parpool()
Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 6).

ans =

 ProcessPool with properties:

            Connected: true
           NumWorkers: 6
              Cluster: local
        AttachedFiles: {}
    AutoAddClientPath: true
          IdleTimeout: 30 minutes (30 minutes remaining)
          SpmdEnabled: true
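If you rerun the example, you can avoid the startup cost of a new pool by reusing one that is already open. The following is an optional sketch, not part of the original example; gcp('nocreate') returns the current pool without creating a new one.

if isempty(gcp('nocreate')) % start a pool only if none is running
    parpool();
end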
Fit a cross-validated set of regularized models. Use the Alpha parameter to favor retaining groups of highly correlated predictors, as opposed to eliminating all but one member of the group. Commonly, you use a relatively large value of Alpha.
rng('default') % For reproducibility
tic
[B,S] = lassoglm(obs,y,'binomial','NumLambda',100, ...
    'Alpha',0.9,'LambdaRatio',1e-4,'CV',10,'Options',opt);
toc
Elapsed time is 90.892114 seconds.
Examine the cross-validation plot.
lassoPlot(B,S,'PlotType','CV');
legend('show') % Show legend
Examine the trace plot.
lassoPlot(B,S,'PlotType','Lambda','XScale','log')
The right (green) vertical dashed line represents the Lambda providing the smallest cross-validated deviance. The left (blue) dashed line has the minimal deviance plus no more than one standard deviation. This blue line has many fewer predictors:
[S.DF(S.Index1SE) S.DF(S.IndexMinDeviance)]
ans = 1×2

    50    89
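To see which predictors the sparser model retains, you can extract the nonzero coefficients from B directly. This is a minimal sketch, not part of the original example; it assumes B and S from the lassoglm call above.

idx1SE = find(B(:,S.Index1SE) ~= 0); % indices of the retained predictors
numel(idx1SE) % matches S.DF(S.Index1SE), which is 50 here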
You asked lassoglm to fit using 100 different Lambda values. How many did it use?
size(B)
ans = 1×2

        4000          84
lassoglm stopped after 84 Lambda values because the deviance was too small for smaller Lambda values. To avoid overfitting, lassoglm halts when the deviance of the fitted model is too small compared to the null deviance, that is, the deviance of the binary responses alone, ignoring the predictor variables.
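To see this behavior yourself, you can plot the deviance against Lambda. A minimal sketch, assuming S from the earlier fit; when you cross-validate, the S.Deviance field holds the estimated deviance for each Lambda value.

semilogx(S.Lambda,S.Deviance,'-o') % deviance versus Lambda on a log scale
xlabel('Lambda')
ylabel('Cross-validated deviance')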
You can force lassoglm to include more terms by using the 'Lambda' name-value pair argument. For example, define a set of Lambda values that additionally includes three values smaller than the values in S.Lambda.
minLambda = min(S.Lambda);
explicitLambda = [minLambda*[.1 .01 .001] S.Lambda];
Specify 'Lambda',explicitLambda when you call the lassoglm function. lassoglm halts when the deviance of the fitted model is too small, even though you explicitly provide a set of Lambda values.
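For reference, such a call might look like the following sketch. The output names Bx and Sx are placeholders, and cross-validation is omitted to keep the fit quick.

[Bx,Sx] = lassoglm(obs,y,'binomial','Lambda',explicitLambda, ...
    'Alpha',0.9,'Options',opt);
size(Bx) % lassoglm can still stop before using every supplied Lambda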
To save time, you can use:

- Fewer Lambda values, meaning fewer fits
- Fewer cross-validation folds
- A larger value for LambdaRatio
Use serial computation and all three of these time-saving methods:
tic
[Bquick,Squick] = lassoglm(obs,y,'binomial','NumLambda',25, ...
    'LambdaRatio',1e-2,'CV',5);
toc
Elapsed time is 16.517331 seconds.
Graphically compare the new results to the first results.
lassoPlot(Bquick,Squick,'PlotType','CV');
legend('show') % Show legend
lassoPlot(Bquick,Squick,'PlotType','Lambda','XScale','log')
The number of nonzero coefficients in the minimal deviance plus one standard deviation model is around 50, similar to the first computation.
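You can also compare the two fits numerically. A minimal sketch, assuming both fits are still in the workspace; the last line, which shuts down the parallel pool, is optional cleanup and not part of the original example.

[S.DF(S.Index1SE) Squick.DF(Squick.Index1SE)] % nonzero coefficients in each one-standard-deviation model
delete(gcp('nocreate')) % optionally shut down the parallel pool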