screenpredictors

Screen credit scorecard predictors for predictive value

Syntax

metric_table = screenpredictors(data)

metric_table = screenpredictors(___,Name,Value)

Description

metric_table = screenpredictors(data) returns metric_table, a MATLAB® table containing the calculated values of several measures of predictive power for each predictor variable in data.

Use the screenpredictors function as a preprocessing step in the Credit Scorecard Modeling Workflow to reduce the number of predictor variables before you create the credit scorecard using the creditscorecard function from Financial Toolbox™. In addition, you can use Threshold Predictors from Risk Management Toolbox™ to interactively set credit scorecard predictor thresholds using the output from screenpredictors before you create the credit scorecard using creditscorecard.

metric_table = screenpredictors(___,Name,Value) specifies options using one or more name-value pair arguments in addition to the input arguments in the previous syntax.

Examples

Screen Predictors for a `creditscorecard` Object

Reduce the number of predictor variables by screening predictors before you create a credit scorecard.

Use the CreditCardData.mat file to load the data (using a dataset from Refaat 2011).

load CreditCardData.mat

Define 'IDVar' and 'ResponseVar'.

idvar = 'CustID';
responsevar = 'status';

Use screenpredictors to calculate the predictor screening metrics. The function returns a table containing the metric values. Each table row corresponds to a predictor from the input table data.

metric_table = screenpredictors(data,'IDVar', idvar,'ResponseVar', responsevar)

metric_table=9×7 table
                   InfoValue    AccuracyRatio     AUROC     Entropy     Gini      Chi2PValue    PercentMissing
                   _________    _____________    _______    _______    _______    __________    ______________

    CustAge          0.18863       0.17095       0.58547    0.88729    0.42626    0.00074524          0       
    TmWBank          0.15719       0.13612       0.56806    0.89167    0.42864     0.0054591          0       
    CustIncome       0.15572       0.17758       0.58879      0.891    0.42731     0.0018428          0       
    TmAtAddress     0.094574      0.010421       0.50521    0.90089    0.43377         0.182          0       
    UtilRate        0.075086      0.035914       0.51796    0.90405    0.43575       0.45546          0       
    AMBalance        0.07159      0.087142       0.54357    0.90446    0.43592       0.48528          0       
    EmpStatus       0.048038       0.10886       0.55443    0.90814     0.4381    0.00037823          0       
    OtherCC         0.014301      0.044459       0.52223    0.91347    0.44132      0.047616          0       
    ResStatus      0.0097738       0.05039        0.5252    0.91422    0.44182       0.27875          0

metric_table = sortrows(metric_table,'AccuracyRatio','descend')

metric_table=9×7 table
                   InfoValue    AccuracyRatio     AUROC     Entropy     Gini      Chi2PValue    PercentMissing
                   _________    _____________    _______    _______    _______    __________    ______________

    CustIncome       0.15572       0.17758       0.58879      0.891    0.42731     0.0018428          0       
    CustAge          0.18863       0.17095       0.58547    0.88729    0.42626    0.00074524          0       
    TmWBank          0.15719       0.13612       0.56806    0.89167    0.42864     0.0054591          0       
    EmpStatus       0.048038       0.10886       0.55443    0.90814     0.4381    0.00037823          0       
    AMBalance        0.07159      0.087142       0.54357    0.90446    0.43592       0.48528          0       
    ResStatus      0.0097738       0.05039        0.5252    0.91422    0.44182       0.27875          0       
    OtherCC         0.014301      0.044459       0.52223    0.91347    0.44132      0.047616          0       
    UtilRate        0.075086      0.035914       0.51796    0.90405    0.43575       0.45546          0       
    TmAtAddress     0.094574      0.010421       0.50521    0.90089    0.43377         0.182          0

Based on the AccuracyRatio metric, select the top predictors to use when you create the creditscorecard object.

varlist = metric_table.Row(metric_table.AccuracyRatio > 0.09)

varlist = 4x1 cell
    {'CustIncome'}
    {'CustAge'   }
    {'TmWBank'   }
    {'EmpStatus' }

Use creditscorecard to create a creditscorecard object based on only the "screened" predictors.

sc = creditscorecard(data,'IDVar', idvar,'ResponseVar', responsevar, 'PredictorVars', varlist)

sc = 
  creditscorecard with properties:

                GoodLabel: 0
              ResponseVar: 'status'
               WeightsVar: ''
                 VarNames: {'CustID'  'CustAge'  'TmAtAddress'  'ResStatus'  'EmpStatus'  'CustIncome'  'TmWBank'  'OtherCC'  'AMBalance'  'UtilRate'  'status'}
        NumericPredictors: {'CustAge'  'CustIncome'  'TmWBank'}
    CategoricalPredictors: {'EmpStatus'}
           BinMissingData: 0
                    IDVar: 'CustID'
            PredictorVars: {'CustAge'  'EmpStatus'  'CustIncome'  'TmWBank'}
                     Data: [1200x11 table]
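
As a variation on the screening call in this example, the optional name-value arguments can be passed in the same call. The following is a minimal sketch, assuming the same CreditCardData.mat file is on the path; the value 10 for 'NumBins' is an illustrative choice, not a recommendation, and the resulting metric values are not shown because they depend on the binning.

```matlab
% Load the example data; this creates the table 'data' in the workspace.
load CreditCardData.mat

% Screen with coarser binning for numeric predictors
% (10 equal-frequency bins instead of the default 20).
metric_table = screenpredictors(data, ...
    'IDVar','CustID', ...
    'ResponseVar','status', ...
    'NumBins',10);

% Rank the predictors by information value, highest first.
metric_table = sortrows(metric_table,'InfoValue','descend');
disp(metric_table)
```

Comparing this table with the default-binning table can indicate how sensitive a predictor's ranking is to the bin count.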

Input Arguments

`data` — Data for `creditscorecard` object
table | tall table | tall timetable

Data for the creditscorecard object, specified as a MATLAB table, tall table, or tall timetable, where each column of data can be any one of the following data types:

Numeric
Logical
Cell array of character vectors
Character array
Categorical
String

Data Types: table

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: metric_table = screenpredictors(data,'IDVar','CustID','ResponseVar','status','PredictorVars',{'CustAge','CustIncome'})

`IDVar` — Name of identifier variable
`''` (default) | character vector

Name of identifier variable, specified as the comma-separated pair consisting of 'IDVar' and a case-sensitive character vector. The 'IDVar' data can be, for example, ordinal numbers or Social Security numbers. Specifying 'IDVar' makes it easy to exclude the identifier variable from the predictor variables.

Data Types: char

`ResponseVar` — Response variable name
last column of the `data` input (default) | character vector

Response variable name, specified as the comma-separated pair consisting of 'ResponseVar' and a case-sensitive character vector. The response variable data must be binary, the "Good" or "Bad" indicator.

If not specified, 'ResponseVar' is set to the last column of the input data by default.

Data Types: char

`PredictorVars` — Names of predictor variables
set difference between `VarNames` and `{IDVar,ResponseVar}` (default) | cell array of character vectors | string array

Names of predictor variables, specified as the comma-separated pair consisting of 'PredictorVars' and a case-sensitive cell array of character vectors or string array. By default, when you create a creditscorecard object, all variables are predictors except for IDVar and ResponseVar. Any name you specify using 'PredictorVars' must differ from the IDVar and ResponseVar names.

Data Types: cell | string

`WeightsVar` — Name of weights variable
`''` (default) | character vector

Name of weights variable, specified as the comma-separated pair consisting of 'WeightsVar' and a case-sensitive character vector to indicate which column name in the data table contains the row weights.

If you do not specify 'WeightsVar' when you create a creditscorecard object, then the function uses the unit weights as the observation weights.

Data Types: char

`NumBins` — Number of (equal frequency) bins for numeric predictors
`20` (default) | scalar numeric

Number of (equal frequency) bins for numeric predictors, specified as the comma-separated pair consisting of 'NumBins' and a scalar numeric.

Data Types: double
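
To illustrate what equal-frequency binning means for a numeric predictor, the following minimal sketch assigns observations to bins of roughly equal size by rank. It uses only base MATLAB and made-up data; the variable names (x, numBins, binIdx) are hypothetical, and the function's internal binning, including its handling of ties and missing values, may differ.

```matlab
% Hypothetical numeric predictor with 12 observations (made-up values).
x = [23 45 31 52 38 27 64 41 36 58 29 47];
numBins = 4;                       % illustrative; the documented default is 20

% Rank each observation (1 = smallest). Ties are not handled specially here.
[~, order] = sort(x);
rnk = zeros(size(x));
rnk(order) = 1:numel(x);

% Map ranks to bins so that each bin holds about numel(x)/numBins observations.
binIdx = ceil(rnk * numBins / numel(x));

% Count the observations that fall in each bin.
counts = accumarray(binIdx(:), 1)';
disp(counts)
```

With 12 distinct values and 4 bins, each bin receives 3 observations. Fewer bins give more observations per bin and therefore more stable, but coarser, metric estimates.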

`FrequencyShift` — Indicates small shift in frequency tables that contain zero entries
`0.5` (default) | scalar numeric between `0` and `1`

Small shift in frequency tables that contain zero entries, specified as the comma-separated pair consisting of 'FrequencyShift' and a scalar numeric with a value between 0 and 1.

If the frequency table of a predictor contains any "pure" bins (containing all goods or all bads) after you bin the data using autobinning, then the function adds the 'FrequencyShift' value to all bins in the table. To avoid any perturbation, set 'FrequencyShift' to 0.

Data Types: double
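
The reason for the shift can be seen with a small frequency table. The sketch below is an illustration under stated assumptions, not the function's implementation: the table freq is made up, the shift is added to every cell of the table when any count is zero (one reading of the description above), and weight of evidence is computed with the usual log-ratio definition.

```matlab
% Hypothetical frequency table: rows are bins, columns are [Good Bad] counts.
% The second bin is "pure": it contains no bads.
freq = [40 10;
        25  0;
        35 15];

shift = 0.5;                       % same value as the documented default

% Apply the shift to every bin only if some bin has a zero count.
if any(freq(:) == 0)
    freqShifted = freq + shift;
else
    freqShifted = freq;
end

% Weight of evidence per bin: log of (share of goods / share of bads).
woe = @(f) log((f(:,1) ./ sum(f(:,1))) ./ (f(:,2) ./ sum(f(:,2))));

woeRaw     = woe(freq);            % second entry is Inf because of the zero count
woeShifted = woe(freqShifted);     % all entries are finite

disp([woeRaw woeShifted])
```

Without the shift, the pure bin produces an infinite weight of evidence, which would also make a metric such as information value infinite. The shift trades a small perturbation of the counts for finite values.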

Output Arguments

`metric_table` — Calculated values for predictor screening metrics
table

Calculated values for the predictor screening metrics, returned as a table. Each table row corresponds to a predictor from the input table data. The table columns contain calculated values for the following metrics:

'InfoValue' — Information value. This metric measures the strength of a predictor in the fitting model by determining the deviation between the distributions of "Goods" and "Bads".
'AccuracyRatio' — Accuracy ratio.
'AUROC' — Area under the ROC curve.
'Entropy' — Entropy. This metric measures the level of unpredictability in the bins. You can use the entropy metric to validate a risk model.
'Gini' — Gini. This metric measures the statistical dispersion or inequality within a sample of data.
'Chi2PValue' — Chi-square p-value. This metric is computed from the chi-square metric and is a measure of the statistical difference and independence between groups.
'PercentMissing' — Percentage of missing values in the predictor. This metric is expressed in decimal form.
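
To make the metrics above concrete, the following sketch computes several of them from a small binned frequency table using standard textbook definitions. It is an assumption that these definitions match the function's internal calculations in every detail; the table freq and all variable names are hypothetical. The relation AccuracyRatio = 2*AUROC - 1 used here is consistent with the values shown in the example output on this page.

```matlab
% Hypothetical frequency table: rows are bins, columns are [Good Bad] counts.
freq = [50 10;
        30 20;
        20 30];

nBin  = sum(freq, 2);                  % observations per bin
nTot  = sum(nBin);                     % total observations
pGood = freq(:,1) / sum(freq(:,1));    % distribution of goods across bins
pBad  = freq(:,2) / sum(freq(:,2));    % distribution of bads across bins

% Information value: sum over bins of (pGood - pBad) * log(pGood / pBad).
infoValue = sum((pGood - pBad) .* log(pGood ./ pBad));

% Within-bin class proportions, used for the impurity measures.
p = freq ./ nBin;

% Gini impurity and base-2 entropy per bin, averaged with bin-size weights.
% The entropy line assumes that no bin has a zero count.
giniBin    = 1 - sum(p.^2, 2);
entropyBin = -sum(p .* log2(p), 2);
gini    = sum(nBin .* giniBin) / nTot;
entropy = sum(nBin .* entropyBin) / nTot;

% AUROC: probability that a random bad falls in a riskier bin than a
% random good, counting ties within a bin as one half.
[~, order] = sort(p(:,2));             % bins ordered by increasing bad rate
g = pGood(order);
b = pBad(order);
cumGoodBefore = cumsum(g) - g;         % goods in strictly less risky bins
auroc = sum(b .* (cumGoodBefore + 0.5 * g));
accuracyRatio = 2 * auroc - 1;

fprintf('IV=%.4f  AR=%.4f  AUROC=%.4f  Entropy=%.4f  Gini=%.4f\n', ...
    infoValue, accuracyRatio, auroc, entropy, gini);
```

Under these definitions, a stronger predictor has higher InfoValue, AccuracyRatio, and AUROC, and lower Entropy and Gini, because its bins separate goods from bads more cleanly.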

Extended Capabilities

Tall Arrays
Calculate with arrays that have more rows than fit in memory.

This function supports input data that is specified as a tall column vector, a tall table, or a tall timetable. Note that the output for numeric predictors might be slightly different when using a tall array. Categorical predictors return the same results for tables and tall arrays. For more information, see tall and Tall Arrays.

Version History

Introduced in R2019a
