Feature Screening with `screenpredictors`

This example shows how to perform predictor screening using screenpredictors and then set predictor thresholds using the Threshold Predictors live task. Predictor screening using screenpredictors is a type of univariate analysis performed as an early step in the Credit Scorecard Modeling Workflow. Predictor screening is an important preprocessing step when you work with credit scorecards, as data sets can be prohibitively large and have dozens or hundreds of potential predictors.

The goal of screening predictors is to pare down the set of predictors to a subset that is more useful in predicting the response variable based on the calculated metrics. You can set predictor thresholds using the Threshold Predictors live task to select the top predictors as ranked by a given metric to train your credit scorecards.

Load Data

The credit card data table contains a customer ID (CustID), nine predictors, and the response variable (status). Some of the risk factors are more useful in predicting the probability of a loan default, whereas others are less useful. The screening process helps you select the best subset of predictors.

Although the data set in this example contains only a few predictors, credit scorecard data sets in practice can contain dozens or hundreds of them, which is when predictor screening matters most.

% Load credit card data.
load CreditCardData.mat

% Use the dataMissing data set, which contains some missing values.
data = dataMissing;

% Identify the ID and response variables.
idvar = 'CustID';
responsevar = 'status';

% Examine the structure of the table.
disp(head(data));

    CustID    CustAge    TmAtAddress     ResStatus     EmpStatus    CustIncome    TmWBank    OtherCC    AMBalance    UtilRate    status
    ______    _______    ___________    ___________    _________    __________    _______    _______    _________    ________    ______

      1          53          62         <undefined>    Unknown        50000         55         Yes       1055.9        0.22        0   
      2          61          22         Home Owner     Employed       52000         25         Yes       1161.6        0.24        0   
      3          47          30         Tenant         Employed       37000         61         No        877.23        0.29        0   
      4         NaN          75         Home Owner     Employed       53000         20         Yes       157.37        0.08        0   
      5          68          56         Home Owner     Employed       53000         14         Yes       561.84        0.11        0   
      6          65          13         Home Owner     Employed       48000         59         Yes       968.18        0.15        0   
      7          34          32         Home Owner     Unknown        32000         26         Yes       717.82        0.02        1   
      8          50          57         Other          Employed       51000         33         No        3041.2        0.13        0
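The preview above already shows missing entries (NaN in CustAge and <undefined> in ResStatus). As a rough sketch, the fraction of missing values per column can be computed with ismissing. The small table below is a stand-in built from the first four preview rows of two columns; the variable names preview and fracMissing are illustrative only.

```matlab
% First four rows of two columns, copied from the preview above.
CustAge   = [53; 61; 47; NaN];
% An empty character vector becomes <undefined> in a categorical array.
ResStatus = categorical({''; 'Home Owner'; 'Tenant'; 'Home Owner'});
preview   = table(CustAge,ResStatus);

% ismissing returns a logical matrix with one column per table variable.
% The column-wise mean is the fraction of missing entries per variable.
fracMissing = mean(ismissing(preview));
disp(fracMissing)
```

The same call on the full data table would give one value per column, assuming data is in the workspace.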

Add Additional Derived Predictors

Often, derived predictors can capture additional information or produce better metric results. Examples include the ratio of two predictors, or a transformation of a predictor x such as x^2 or log(x). To demonstrate this, create two derived predictors and add them to the data set.

data.BalanceUtilRatio = data.AMBalance ./ data.UtilRate;
data.BalanceIncomeRatio = data.AMBalance ./ data.CustIncome;
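The other transformations mentioned above (x^2 and log(x)) follow the same pattern. The sketch below uses a toy table with hypothetical values, not the example data. It also illustrates one possible way to handle a zero denominator: in MATLAB, dividing a positive number by zero returns Inf, and 0/0 returns NaN. Replacing non-finite ratios with NaN is an assumption made here for illustration, not a step from the original example.

```matlab
% Toy table with hypothetical values (not taken from CreditCardData.mat).
toy = table([1055.9; 157.37; 561.84],[0.22; 0; 0.11], ...
    'VariableNames',{'AMBalance','UtilRate'});

% Power and log transformations of a single predictor.
toy.AMBalanceSq  = toy.AMBalance.^2;
toy.LogAMBalance = log(toy.AMBalance);

% Ratio of two predictors. The second row divides by zero and yields Inf,
% so mark non-finite results as missing (NaN) instead.
ratio = toy.AMBalance ./ toy.UtilRate;
ratio(~isfinite(ratio)) = NaN;
toy.BalanceUtilRatio = ratio;

disp(toy)
```

Whether non-finite ratios should be treated as missing, capped, or left as-is depends on the modeling context.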

Compute Metrics

Use screenpredictors to compute several measures of risk factor predictiveness. Each column of the output table contains the values of one metric for all predictors. The table is sorted by information value (InfoValue) in descending order.

T = screenpredictors(data,'IDVar',idvar,'ResponseVar',responsevar)

T=11×7 table
                          InfoValue    AccuracyRatio     AUROC     Entropy     Gini      Chi2PValue    PercentMissing
                          _________    _____________    _______    _______    _______    __________    ______________

    CustAge                 0.17698        0.1672        0.5836    0.88795    0.42645     0.0020599          0.025   
    TmWBank                 0.15719       0.13612       0.56806    0.89167    0.42864     0.0054591              0   
    CustIncome              0.15572       0.17758       0.58879      0.891    0.42731     0.0018428              0   
    BalanceIncomeRatio     0.097073        0.1278        0.5639    0.90024    0.43303       0.11966              0   
    TmAtAddress            0.094574      0.010421       0.50521    0.90089    0.43377         0.182              0   
    UtilRate               0.075086      0.035914       0.51796    0.90405    0.43575       0.45546              0   
    AMBalance               0.07159      0.087142       0.54357    0.90446    0.43592       0.48528              0   
    BalanceUtilRatio       0.068955      0.026538       0.51327    0.90486    0.43614       0.52517              0   
    EmpStatus              0.048038       0.10886       0.55443    0.90814     0.4381    0.00037823              0   
    OtherCC                0.014301      0.044459       0.52223    0.91347    0.44132      0.047616              0   
    ResStatus             0.0095558      0.049855       0.52493    0.91446    0.44198       0.29879       0.033333
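To rank predictors by a metric other than InfoValue, one option is sortrows. The sketch below rebuilds the first four rows of three columns from the output above (Tsub and Tsorted are illustrative names) and re-ranks them by accuracy ratio. It also checks the usual relationship AccuracyRatio = 2*AUROC - 1, which the printed values appear to satisfy up to rounding.

```matlab
% Rebuild part of the metrics table from the values printed above.
names         = {'CustAge';'TmWBank';'CustIncome';'BalanceIncomeRatio'};
InfoValue     = [0.17698; 0.15719; 0.15572; 0.097073];
AccuracyRatio = [0.1672; 0.13612; 0.17758; 0.1278];
AUROC         = [0.5836; 0.56806; 0.58879; 0.5639];
Tsub = table(InfoValue,AccuracyRatio,AUROC,'RowNames',names);

% Re-rank the predictors by accuracy ratio, largest first.
Tsorted = sortrows(Tsub,'AccuracyRatio','descend');
disp(Tsorted)

% Largest gap between AccuracyRatio and 2*AUROC - 1 (rounding error only).
maxGap = max(abs(Tsub.AccuracyRatio - (2*Tsub.AUROC - 1)));
disp(maxGap)
```

With the full table T in the workspace, the same sortrows call would apply to all 11 predictors.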

Set Threshold Metrics

Set thresholds for the predictors based on one or more metrics. Use the Threshold Predictors live task to interactively select thresholds for one or more predictors. In the plot displayed for Predictors, green bars indicate predictors that pass the threshold and red bars indicate predictors that do not pass the threshold. You can omit predictors that do not "pass" the threshold from the final data set.

Use the Threshold Predictors live task to select predictors based on their information value (InfoValue) and accuracy ratio (AccuracyRatio). Additional thresholds can be set by adding the desired metric using the Select threshold metrics drop-down control.

labelTable=11×2 table
                          InfoValue    AccuracyRatio
                          _________    _____________

    CustAge                 Pass           Pass     
    TmWBank                 Pass           Pass     
    CustIncome              Pass           Pass     
    BalanceIncomeRatio      Pass           Pass     
    TmAtAddress             Pass           Fail     
    UtilRate                Fail           Fail     
    AMBalance               Fail           Pass     
    BalanceUtilRatio        Fail           Fail     
    EmpStatus               Fail           Pass     
    OtherCC                 Fail           Fail     
    ResStatus               Fail           Fail
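The threshold values chosen in the live task are not listed here. As a sketch of the same idea in code, the cutoffs below (ivMin = 0.09 and arMin = 0.08) are assumptions, picked only because they reproduce the Pass and Fail pattern shown above. The comparison direction (greater than or equal) is also an assumption.

```matlab
% Metric values copied from the screenpredictors output above.
names = {'CustAge';'TmWBank';'CustIncome';'BalanceIncomeRatio'; ...
    'TmAtAddress';'UtilRate';'AMBalance';'BalanceUtilRatio'; ...
    'EmpStatus';'OtherCC';'ResStatus'};
InfoValue = [0.17698; 0.15719; 0.15572; 0.097073; 0.094574; 0.075086; ...
    0.07159; 0.068955; 0.048038; 0.014301; 0.0095558];
AccuracyRatio = [0.1672; 0.13612; 0.17758; 0.1278; 0.010421; 0.035914; ...
    0.087142; 0.026538; 0.10886; 0.044459; 0.049855];

% Hypothetical cutoffs (assumptions; the live task settings are not shown).
ivMin = 0.09;
arMin = 0.08;

% Start with "Fail" everywhere, then mark predictors at or above the cutoff.
ivLabel = repmat("Fail",numel(names),1);
ivLabel(InfoValue >= ivMin) = "Pass";
arLabel = repmat("Fail",numel(names),1);
arLabel(AccuracyRatio >= arMin) = "Pass";

checkTable = table(ivLabel,arLabel, ...
    'VariableNames',{'InfoValue','AccuracyRatio'},'RowNames',names);
disp(checkTable)
```

Any cutoff between the lowest passing value and the highest failing value of each metric would give the same labels.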

Screening Summary

Summarize the thresholding results in table form. The labelTable output from the live task indicates which of the predictors passed each of the threshold tests.

disp(labelTable)

                          InfoValue    AccuracyRatio
                          _________    _____________

    CustAge                 Pass           Pass     
    TmWBank                 Pass           Pass     
    CustIncome              Pass           Pass     
    BalanceIncomeRatio      Pass           Pass     
    TmAtAddress             Pass           Fail     
    UtilRate                Fail           Fail     
    AMBalance               Fail           Pass     
    BalanceUtilRatio        Fail           Fail     
    EmpStatus               Fail           Pass     
    OtherCC                 Fail           Fail     
    ResStatus               Fail           Fail

Reduce Table

Select only the predictors that pass both threshold tests, and create a reduced data set that contains the ID variable, those predictors, and the response.

% Select predictors that pass at least 2 metric threshold tests.
all_passes = labelTable.Variables == "Pass";
pass_both_idx = 2 <= sum(all_passes,2);
selected_predictors = T.Row(pass_both_idx);

% Trim the data table to contain only the ID, passing predictors, and
% response.
top_predictor_table = data(:,[idvar; selected_predictors; responsevar]);
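The selection above applies a logical index computed from labelTable to the row names of T, which relies on both tables listing the predictors in the same order. A variation, sketched below, reads the names from labelTable itself. It assumes labelTable has row names and string-compatible labels, as its display suggests; the three-row table is a stand-in with values copied from the output above, and keepVars is an illustrative name.

```matlab
% Small stand-in for the live task output (labels copied from the table above).
labelTable = table(["Pass";"Pass";"Fail"],["Pass";"Fail";"Pass"], ...
    'VariableNames',{'InfoValue','AccuracyRatio'}, ...
    'RowNames',{'CustAge';'TmAtAddress';'EmpStatus'});

% Keep predictors that pass every threshold test. The names come from
% labelTable itself, so the result does not depend on the row order of T.
passAll  = all(labelTable.Variables == "Pass",2);
selected = labelTable.Properties.RowNames(passAll);

% Column names for the reduced table: ID, passing predictors, response.
keepVars = [{'CustID'}; selected; {'status'}];
disp(keepVars)
```

With the example data in the workspace, data(:,keepVars) would then produce the reduced table.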

Use creditscorecard to create a creditscorecard object using the reduced data set.

% Create the credit scorecard using the screened predictors.
sc = creditscorecard(top_predictor_table,'IDVar',idvar,'ResponseVar',responsevar,...
    'BinMissingData', true)

sc = 
  creditscorecard with properties:

                GoodLabel: 0
              ResponseVar: 'status'
               WeightsVar: ''
                 VarNames: {'CustID'  'CustAge'  'TmWBank'  'CustIncome'  'BalanceIncomeRatio'  'status'}
        NumericPredictors: {'CustAge'  'TmWBank'  'CustIncome'  'BalanceIncomeRatio'}
    CategoricalPredictors: {1x0 cell}
           BinMissingData: 1
                    IDVar: 'CustID'
            PredictorVars: {'CustAge'  'TmWBank'  'CustIncome'  'BalanceIncomeRatio'}
                     Data: [1200x6 table]

For more information on developing credit scorecards, see Create Credit Scorecards.
