Feature Screening with screenpredictors
This example shows how to perform predictor screening using screenpredictors and then set predictor thresholds using the Threshold Predictors live task. Predictor screening using screenpredictors is a type of univariate analysis performed as an early step in the Credit Scorecard Modeling Workflow. Predictor screening is an important preprocessing step when you work with credit scorecards, as data sets can be prohibitively large and have dozens or hundreds of potential predictors.
The goal of screening predictors is to pare down the set of predictors to a subset that is more useful in predicting the response variable based on the calculated metrics. You can set predictor thresholds using the Threshold Predictors live task to select the top predictors as ranked by a given metric to train your credit scorecards.
Load Data
The credit card data table contains a customer ID (CustID), nine predictors, and the response variable (status). Some of the risk factors are more useful in predicting the probability of a loan default, whereas others are less useful. The screening process helps you select the best subset of predictors.
Although the data set in this example contains only a few predictors, in practice, credit scorecard data sets can be very large. The predictor screening process is important as data sets grow to contain dozens or hundreds of predictors.
% Load credit card data.
load CreditCardData.mat

% Use the dataMissing data set, which contains some missing values.
data = dataMissing;

% Identify the ID and response variables.
idvar = 'CustID';
responsevar = 'status';

% Examine the structure of the table.
disp(head(data));
    CustID    CustAge    TmAtAddress    ResStatus      EmpStatus    CustIncome    TmWBank    OtherCC    AMBalance    UtilRate    status
    ______    _______    ___________    ___________    _________    __________    _______    _______    _________    ________    ______
    1         53         62             <undefined>    Unknown      50000         55         Yes        1055.9       0.22        0
    2         61         22             Home Owner     Employed     52000         25         Yes        1161.6       0.24        0
    3         47         30             Tenant         Employed     37000         61         No         877.23       0.29        0
    4         NaN        75             Home Owner     Employed     53000         20         Yes        157.37       0.08        0
    5         68         56             Home Owner     Employed     53000         14         Yes        561.84       0.11        0
    6         65         13             Home Owner     Employed     48000         59         Yes        968.18       0.15        0
    7         34         32             Home Owner     Unknown      32000         26         Yes        717.82       0.02        1
    8         50         57             Other          Employed     51000         33         No         3041.2       0.13        0
Add Additional Derived Predictors
Often, derived predictors capture additional information or produce better metric values. Examples include the ratio of two predictors or a transformation of a predictor x, such as x^2 or log(x). To demonstrate this, create two derived predictors and add them to the data set.
data.BalanceUtilRatio = data.AMBalance ./ data.UtilRate;
data.BalanceIncomeRatio = data.AMBalance ./ data.CustIncome;
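A ratio predictor is undefined wherever its denominator is zero: MATLAB returns Inf for x/0 and NaN for 0/0. The following is a minimal sketch of one way to handle this, marking such entries as missing before screening. The toy vectors balance and utilRate are hypothetical stand-ins for data.AMBalance and data.UtilRate, and whether your data needs this step is an assumption to check, not something this example states.

```matlab
% Hypothetical toy vectors standing in for data.AMBalance and data.UtilRate.
balance  = [1055.9; 157.37; 0; 717.82];
utilRate = [0.22; 0; 0; 0.02];

% Element-wise ratio: x/0 gives Inf and 0/0 gives NaN.
ratio = balance ./ utilRate;

% Treat every undefined ratio as a missing value.
ratio(~isfinite(ratio)) = NaN;

disp(ratio)
```

Marking these entries as NaN counts them as missing data instead of leaving infinite values in the predictor.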
Compute Metrics
Use screenpredictors to compute several measures of risk factor predictiveness. The columns of the output table contain the metrics values for the predictors. The table is sorted by the information value (InfoValue).
T = screenpredictors(data,'IDVar',idvar,'ResponseVar',responsevar)
T=11×7 table
                          InfoValue    AccuracyRatio    AUROC      Entropy    Gini       Chi2PValue    PercentMissing
                          _________    _____________    _______    _______    _______    __________    ______________
    CustAge               0.17698      0.1672           0.5836     0.88795    0.42645    0.0020599     0.025
    TmWBank               0.15719      0.13612          0.56806    0.89167    0.42864    0.0054591     0
    CustIncome            0.15572      0.17758          0.58879    0.891      0.42731    0.0018428     0
    BalanceIncomeRatio    0.097073     0.1278           0.5639     0.90024    0.43303    0.11966       0
    TmAtAddress           0.094574     0.010421         0.50521    0.90089    0.43377    0.182         0
    UtilRate              0.075086     0.035914         0.51796    0.90405    0.43575    0.45546       0
    AMBalance             0.07159      0.087142         0.54357    0.90446    0.43592    0.48528       0
    BalanceUtilRatio      0.068955     0.026538         0.51327    0.90486    0.43614    0.52517       0
    EmpStatus             0.048038     0.10886          0.55443    0.90814    0.4381     0.00037823    0
    OtherCC               0.014301     0.044459         0.52223    0.91347    0.44132    0.047616      0
    ResStatus             0.0095558    0.049855         0.52493    0.91446    0.44198    0.29879       0.033333
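For reference, two of the reported metrics have standard closed forms in credit scoring. The notation is introduced here for illustration only: for a predictor split into n bins, g_i is the share of all "good" observations that fall in bin i and b_i is the share of all "bad" observations that fall in bin i. This example does not describe how screenpredictors bins each predictor before computing the metrics, so treat the formulas as textbook definitions, not a statement about the function's internals.

```latex
\mathrm{IV} = \sum_{i=1}^{n} \left( g_i - b_i \right) \ln\!\frac{g_i}{b_i},
\qquad
\mathrm{AR} = 2\,\mathrm{AUROC} - 1
```

The second relation can be checked against the table above. For CustAge, 2(0.5836) - 1 = 0.1672, which matches the AccuracyRatio column.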
Set Threshold Metrics
Set thresholds for the predictors based on one or more metrics. Use the Threshold Predictors live task to interactively select thresholds for one or more predictors. In the plot displayed for Predictors, green bars indicate predictors that pass the threshold and red bars indicate predictors that do not pass the threshold. You can omit predictors that do not "pass" the threshold from the final data set.
Use the Threshold Predictors live task to select predictors based on their information value (InfoValue) and accuracy ratio (AccuracyRatio). Additional thresholds can be set by adding the desired metric using the Select threshold metrics drop-down control.
labelTable=11×2 table
                          InfoValue    AccuracyRatio
                          _________    _____________
    CustAge               Pass         Pass
    TmWBank               Pass         Pass
    CustIncome            Pass         Pass
    BalanceIncomeRatio    Pass         Pass
    TmAtAddress           Pass         Fail
    UtilRate              Fail         Fail
    AMBalance             Fail         Pass
    BalanceUtilRatio      Fail         Fail
    EmpStatus             Fail         Pass
    OtherCC               Fail         Fail
    ResStatus             Fail         Fail
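The labeling can also be sketched in plain code, which makes the pass/fail logic explicit. The sketch below is an illustration, not the code that the live task generates. The cutoffs ivCutoff and arCutoff are hypothetical values, chosen only because they reproduce the labels displayed above; the thresholds actually set in the live task are not stated in this example. The metric values are copied from the screenpredictors output, and the >= comparison is likewise an assumption.

```matlab
% Metric values copied from the screenpredictors output table.
predictors = {'CustAge';'TmWBank';'CustIncome';'BalanceIncomeRatio'; ...
    'TmAtAddress';'UtilRate';'AMBalance';'BalanceUtilRatio'; ...
    'EmpStatus';'OtherCC';'ResStatus'};
infoValue = [0.17698; 0.15719; 0.15572; 0.097073; 0.094574; 0.075086; ...
    0.07159; 0.068955; 0.048038; 0.014301; 0.0095558];
accuracyRatio = [0.1672; 0.13612; 0.17758; 0.1278; 0.010421; 0.035914; ...
    0.087142; 0.026538; 0.10886; 0.044459; 0.049855];

% Hypothetical cutoffs (assumptions, not taken from the live task).
ivCutoff = 0.09;
arCutoff = 0.08;

% Start every predictor at "Fail", then mark those at or above the cutoff.
ivLabel = repmat("Fail", numel(predictors), 1);
ivLabel(infoValue >= ivCutoff) = "Pass";
arLabel = repmat("Fail", numel(predictors), 1);
arLabel(accuracyRatio >= arCutoff) = "Pass";

% Assemble a table with the same layout as labelTable.
sketchLabels = table(ivLabel, arLabel, ...
    'VariableNames', {'InfoValue','AccuracyRatio'}, ...
    'RowNames', predictors);
disp(sketchLabels)
```

The result is stored in sketchLabels so that it does not overwrite the labelTable output of the live task.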
Screening Summary
Summarize the thresholding results in table form. The labelTable output from the live task indicates which of the predictors passed each of the threshold tests.
disp(labelTable)
                          InfoValue    AccuracyRatio
                          _________    _____________
    CustAge               Pass         Pass
    TmWBank               Pass         Pass
    CustIncome            Pass         Pass
    BalanceIncomeRatio    Pass         Pass
    TmAtAddress           Pass         Fail
    UtilRate              Fail         Fail
    AMBalance             Fail         Pass
    BalanceUtilRatio      Fail         Fail
    EmpStatus             Fail         Pass
    OtherCC               Fail         Fail
    ResStatus             Fail         Fail
Reduce Table
Create a reduced data set that contains only the predictors that pass both threshold tests.
% Select predictors that pass at least 2 metric threshold tests.
all_passes = labelTable.Variables == "Pass";
pass_both_idx = 2 <= sum(all_passes,2);
selected_predictors = T.Row(pass_both_idx);

% Trim the data table to contain only the ID, passing predictors, and
% response.
top_predictor_table = data(:,[idvar; selected_predictors; responsevar]);
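With exactly two threshold metrics, "at least 2 passes" and "passes every test" select the same predictors. The two rules diverge once a third metric is added in the live task. The sketch below contrasts alternative selection rules on a small hypothetical label matrix; the names labels, passAll, and passAny are illustrative and do not appear in the example.

```matlab
% Hypothetical labels for four predictors (rows) and two metrics (columns).
labels = ["Pass" "Pass"; "Pass" "Fail"; "Fail" "Pass"; "Fail" "Fail"];
passes = labels == "Pass";

% Strict rule: keep a predictor only if it passes every metric.
passAll = all(passes, 2);

% Looser rule: keep a predictor if it passes at least one metric.
passAny = any(passes, 2);

disp(table(passAll, passAny))
```

Using all(passes,2) keeps the "pass every test" meaning regardless of how many metrics are thresholded, whereas a fixed count such as 2 needs updating when the number of metrics changes.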
Use creditscorecard to create a creditscorecard object using the reduced data set.
% Create the credit scorecard using the screened predictors.
sc = creditscorecard(top_predictor_table,'IDVar',idvar,'ResponseVar',responsevar,...
    'BinMissingData', true)
sc = 
  creditscorecard with properties:

                GoodLabel: 0
              ResponseVar: 'status'
               WeightsVar: ''
                 VarNames: {'CustID' 'CustAge' 'TmWBank' 'CustIncome' 'BalanceIncomeRatio' 'status'}
        NumericPredictors: {'CustAge' 'TmWBank' 'CustIncome' 'BalanceIncomeRatio'}
    CategoricalPredictors: {1x0 cell}
           BinMissingData: 1
                    IDVar: 'CustID'
            PredictorVars: {'CustAge' 'TmWBank' 'CustIncome' 'BalanceIncomeRatio'}
                     Data: [1200x6 table]
For more information on developing credit scorecards, see Create Credit Scorecards.
See Also
creditscorecard | screenpredictors | autobinning | bininfo | predictorinfo | modifypredictor | modifybins | bindata | plotbins | fitmodel | displaypoints | formatpoints | score | setmodel | probdefault | validatemodel
Related Examples
- Common Binning Explorer Tasks
- Credit Scorecard Modeling with Missing Values
- Troubleshooting Credit Scorecard Results
- Credit Rating by Bagging Decision Trees
- Stress Testing of Consumer Credit Default Probabilities Using Panel Data
More About
- Overview of Binning Explorer
- About Credit Scorecards
- Credit Scorecard Modeling Workflow
- Monotone Adjacent Pooling Algorithm (MAPA)
- Credit Scorecard Modeling Using Observation Weights