Main Content

This example shows how to work with consumer (retail) credit panel data to visualize observed default rates at different levels. It also shows how to fit a model to predict probabilities of default and perform a stress-testing analysis.

The panel data set of consumer loans enables you to identify default rate patterns for loans of different ages, or years on books. You can use information about a score group to distinguish default rates for different score levels. In addition, you can use macroeconomic information to assess how the state of the economy affects consumer loan default rates.

A standard logistic regression model, a type of generalized linear model, is fitted to the retail credit panel data with and without macroeconomic predictors. The example describes how to fit a more advanced model to account for panel data effects, a generalized linear mixed effects model. However, the panel effects are negligible for the data set in this example and the standard logistic model is preferred for efficiency.

The standard logistic regression model predicts probabilities of default for all score levels, years on books, and macroeconomic variable scenarios. When the standard logistic regression model is used for a stress-testing analysis, the model predicts probabilities of default for a given baseline, as well as default probabilities for adverse and severely adverse macroeconomic scenarios.

For additional information, refer to the example Modeling Probabilities of Default with Cox Proportional Hazards, which follows the same workflow but uses Cox regression instead of logistic regression, and also has additional information on the computation of lifetime PD and lifetime Expected Credit Loss (ECL).

The main data set (`data`

) contains the following variables:

`ID`

: Loan identifier.`ScoreGroup`

: Credit score at the beginning of the loan, discretized into three groups:`High Risk`

,`Medium Risk`

, and`Low Risk`

.`YOB`

: Years on books.`Default`

: Default indicator. This is the response variable.`Year`

: Calendar year.

There is also a small data set (`dataMacro`

) with macroeconomic data for the corresponding calendar years:

`Year`

: Calendar year.`GDP`

: Gross domestic product growth (year over year).`Market`

: Market return (year over year).

The variables `YOB`

, `Year`

, `GDP`

, and `Market`

are observed at the end of the corresponding calendar year. The score group is a discretization of the original credit score when the loan started. A value of `1`

for `Default`

means that the loan defaulted in the corresponding calendar year.

There is also a third data set (`dataMacroStress`

) with baseline, adverse, and severely adverse scenarios for the macroeconomic variables. This table is used for the stress-testing analysis.

This example uses simulated data, but the same approach has been successfully applied to real data sets.

Load the data and view the first 10 and last 10 rows of the table. The panel data is stacked, in the sense that observations for the same ID are stored in contiguous rows, creating a tall, thin table. The panel is unbalanced, because not all IDs have the same number of observations.

load RetailCreditPanelData.mat fprintf('\nFirst ten rows:\n')

First ten rows:

disp(data(1:10,:))

ID ScoreGroup YOB Default Year __ ___________ ___ _______ ____ 1 Low Risk 1 0 1997 1 Low Risk 2 0 1998 1 Low Risk 3 0 1999 1 Low Risk 4 0 2000 1 Low Risk 5 0 2001 1 Low Risk 6 0 2002 1 Low Risk 7 0 2003 1 Low Risk 8 0 2004 2 Medium Risk 1 0 1997 2 Medium Risk 2 0 1998

`fprintf('Last ten rows:\n')`

Last ten rows:

disp(data(end-9:end,:))

ID ScoreGroup YOB Default Year _____ ___________ ___ _______ ____ 96819 High Risk 6 0 2003 96819 High Risk 7 0 2004 96820 Medium Risk 1 0 1997 96820 Medium Risk 2 0 1998 96820 Medium Risk 3 0 1999 96820 Medium Risk 4 0 2000 96820 Medium Risk 5 0 2001 96820 Medium Risk 6 0 2002 96820 Medium Risk 7 0 2003 96820 Medium Risk 8 0 2004

```
nRows = height(data);
UniqueIDs = unique(data.ID);
nIDs = length(UniqueIDs);
fprintf('Total number of IDs: %d\n',nIDs)
```

Total number of IDs: 96820

`fprintf('Total number of rows: %d\n',nRows)`

Total number of rows: 646724

Use the credit score group as a grouping variable to compute the observed default rate for each score group. For this, use the `groupsummary`

function to compute the mean of the `Default`

variable, grouping by the `ScoreGroup`

variable. Plot the results on a bar chart. As expected, the default rate goes down as the credit quality improves.

DefRateByScore = groupsummary(data,'ScoreGroup','mean','Default'); NumScoreGroups = height(DefRateByScore); disp(DefRateByScore)

ScoreGroup GroupCount mean_Default ___________ __________ ____________ High Risk 2.0999e+05 0.017167 Medium Risk 2.1743e+05 0.0086006 Low Risk 2.193e+05 0.0046784

figure; bar(double(DefRateByScore.ScoreGroup),DefRateByScore.mean_Default*100) set(gca,'XTickLabel',categories(data.ScoreGroup)) title('Default Rate vs. Score Group') xlabel('Score Group') ylabel('Observed Default Rate (%)') grid on

Next, compute default rates grouping by years on books (represented by the `YOB`

variable). The resulting rates are conditional one-year default rates. For example, the default rate for the third year on books is the proportion of loans defaulting in the third year, relative to the number of loans that are in the portfolio past the second year. In other words, the default rate for the third year is the number of rows with `YOB`

= `3`

and `Default`

= 1, divided by the number of rows with `YOB`

= `3`

.

Plot the results. There is a clear downward trend, with default rates going down as the number of years on books increases. Years three and four have similar default rates. However, it is unclear from this plot whether this is a characteristic of the loan product or an effect of the macroeconomic environment.

DefRateByYOB = groupsummary(data,'YOB','mean','Default'); NumYOB = height(DefRateByYOB); disp(DefRateByYOB)

YOB GroupCount mean_Default ___ __________ ____________ 1 96820 0.017507 2 94535 0.012704 3 92497 0.011168 4 91068 0.010728 5 89588 0.0085949 6 88570 0.006413 7 61689 0.0033231 8 31957 0.0016272

figure; plot(double(DefRateByYOB.YOB),DefRateByYOB.mean_Default*100,'-*') title('Default Rate vs. Years on Books') xlabel('Years on Books') ylabel('Observed Default Rate (%)') grid on

Now, group both by the score group and number of years on books and then plot the results. The plot shows that all score groups behave similarly as time progresses, with a general downward trend. Years three and four are an exception to the downward trend: the rates flatten for the `High Risk`

group, and go up in year three for the `Low Risk`

group.

DefRateByScoreYOB = groupsummary(data,{'ScoreGroup','YOB'},'mean','Default'); % Display output table to show the way it is structured % Display only the first 10 rows, for brevity disp(DefRateByScoreYOB(1:10,:))

ScoreGroup YOB GroupCount mean_Default ___________ ___ __________ ____________ High Risk 1 32601 0.029692 High Risk 2 31338 0.021252 High Risk 3 30138 0.018448 High Risk 4 29438 0.018276 High Risk 5 28661 0.014794 High Risk 6 28117 0.011168 High Risk 7 19606 0.0056615 High Risk 8 10094 0.0027739 Medium Risk 1 32373 0.014302 Medium Risk 2 31775 0.011676

`disp(' ...')`

...

DefRateByScoreYOB2 = reshape(DefRateByScoreYOB.mean_Default,... NumYOB,NumScoreGroups); figure; plot(DefRateByScoreYOB2*100,'-*') title('Default Rate vs. Years on Books') xlabel('Years on Books') ylabel('Observed Default Rate (%)') legend(categories(data.ScoreGroup)) grid on

The data contains three cohorts, or vintages: loans started in 1997, 1998, and 1999. No loan in the panel data started after 1999.

This section shows how to visualize the default rate for each cohort separately. The default rates for all cohorts are plotted, both against the number of years on books and against the calendar year. Patterns in the years on books suggest the loan product characteristics. Patterns in the calendar years suggest the influence of the macroeconomic environment.

From years two through four on books, the curves show different patterns for the three cohorts. When plotted against the calendar year, however, the three cohorts show similar behavior from 2000 through 2002. The curves flatten during that period.

% Get IDs of 1997, 1998, and 1999 cohorts IDs1997 = data.ID(data.YOB==1&data.Year==1997); IDs1998 = data.ID(data.YOB==1&data.Year==1998); IDs1999 = data.ID(data.YOB==1&data.Year==1999); % IDs2000AndUp is unused, it is only computed to show that this is empty, % no loans started after 1999 IDs2000AndUp = data.ID(data.YOB==1&data.Year>1999); % Get default rates for each cohort separately ObsDefRate1997 = groupsummary(data(ismember(data.ID,IDs1997),:),... 'YOB','mean','Default'); ObsDefRate1998 = groupsummary(data(ismember(data.ID,IDs1998),:),... 'YOB','mean','Default'); ObsDefRate1999 = groupsummary(data(ismember(data.ID,IDs1999),:),... 'YOB','mean','Default'); % Plot against the years on books figure; plot(ObsDefRate1997.YOB,ObsDefRate1997.mean_Default*100,'-*') hold on plot(ObsDefRate1998.YOB,ObsDefRate1998.mean_Default*100,'-*') plot(ObsDefRate1999.YOB,ObsDefRate1999.mean_Default*100,'-*') hold off title('Default Rate vs. Years on Books') xlabel('Years on Books') ylabel('Default Rate (%)') legend('Cohort 97','Cohort 98','Cohort 99') grid on

% Plot against the calendar year Year = unique(data.Year); figure; plot(Year,ObsDefRate1997.mean_Default*100,'-*') hold on plot(Year(2:end),ObsDefRate1998.mean_Default*100,'-*') plot(Year(3:end),ObsDefRate1999.mean_Default*100,'-*') hold off title('Default Rate vs. Calendar Year') xlabel('Calendar Year') ylabel('Default Rate (%)') legend('Cohort 97','Cohort 98','Cohort 99') grid on

After you visualize the data, you can build predictive models for the default rates.

Split the panel data into training and testing sets, defining these sets based on ID numbers.

```
NumTraining = floor(0.6*nIDs);
rng('default');
TrainIDInd = randsample(nIDs,NumTraining);
TrainDataInd = ismember(data.ID,UniqueIDs(TrainIDInd));
TestDataInd = ~TrainDataInd;
```

The first model uses only score group and number of years on books as predictors of the default rate *p*. The odds of defaulting are defined as *p/(1-p)*. The logistic model relates the logarithm of the odds, or *log odds*, to the predictors as follows:

$$\mathrm{log}\left(\frac{p}{1-p}\right)={a}_{H}+{a}_{M}{1}_{M}+{a}_{L}{1}_{L}+{b}_{YOB}YOB+\u03f5$$

*1M* is an indicator with a value `1`

for `Medium Risk`

loans and `0`

otherwise, and similarly for *1L* for `Low Risk`

loans. This is a standard way of handling a categorical predictor such as `ScoreGroup`

. There is effectively a different constant for each risk level: *aH* for `High Risk`

, *aH+aM* for `Medium Risk`

, and *aH+aL* for `Low Risk`

.

To calibrate the model, call the `fitglm`

function from Statistics and Machine Learning Toolbox™. The formula above is expressed as

`Default ~ 1 + ScoreGroup + YOB`

The `1 + ScoreGroup`

terms account for the baseline constant and the adjustments for risk level. Set the optional argument `Distribution`

to `binomial`

to indicate that a logistic model is desired (that is, a model with log odds on the left side).

ModelNoMacro = fitglm(data(TrainDataInd,:),... 'Default ~ 1 + ScoreGroup + YOB',... 'Distribution','binomial'); disp(ModelNoMacro)

Generalized linear regression model: logit(Default) ~ 1 + ScoreGroup + YOB Distribution = Binomial Estimated Coefficients: Estimate SE tStat pValue ________ ________ _______ ___________ (Intercept) -3.2453 0.033768 -96.106 0 ScoreGroup_Medium Risk -0.7058 0.037103 -19.023 1.1014e-80 ScoreGroup_Low Risk -1.2893 0.045635 -28.253 1.3076e-175 YOB -0.22693 0.008437 -26.897 2.3578e-159 388018 observations, 388014 error degrees of freedom Dispersion: 1 Chi^2-statistic vs. constant model: 1.83e+03, p-value = 0

For any row in the data, the value of *p* is not observed, only a `0`

or `1`

default indicator is observed. The calibration finds model coefficients, and the predicted values of *p* for individual rows can be recovered with the `predict`

function.

The `Intercept`

coefficient is the constant for the `High Risk`

level (the *aH* term), and the `ScoreGroup_Medium Risk`

and `ScoreGroup_Low Risk`

coefficients are the adjustments for `Medium Risk`

and `Low Risk`

levels (the *aM* and *aL* terms).

The default probability *p* and the log odds (the left side of the model) move in the same direction when the predictors change. Therefore, because the adjustments for `Medium Risk`

and `Low Risk`

are negative, the default rates are lower for better risk levels, as expected. The coefficient for number of years on books is also negative, consistent with the overall downward trend for number of years on books observed in the data.

To account for panel data effects, a more advanced model using mixed effects can be fitted using the `fitglm`

function from Statistics and Machine Learning Toolbox™. Although this model is not fitted in this example, the code is very similar:

`ModelNoMacro = fitglme(data(TrainDataInd,:),'Default ~ 1 + ScoreGroup + YOB + (1|ID)','Distribution','binomial');`

The `(1|ID)`

term in the formula adds a *random effect* to the model. This effect is a predictor whose values are not given in the data, but calibrated together with the model coefficients. A random value is calibrated for each ID. This additional calibration requirement substantially increases the computational time to fit the model in this case, because of the very large number of IDs. For the panel data set in this example, the random term has a negligible effect. The variance of the random effects is very small and the model coefficients barely change when the random effect is introduced. The simpler logistic regression model is preferred, because it is faster to calibrate and to predict, and the default rates predicted with both models are essentially the same.

Predict the probability of default for training and testing data.

data.PDNoMacro = zeros(height(data),1); % Predict in-sample data.PDNoMacro(TrainDataInd) = predict(ModelNoMacro,data(TrainDataInd,:)); % Predict out-of-sample data.PDNoMacro(TestDataInd) = predict(ModelNoMacro,data(TestDataInd,:));

Visualize the in-sample fit.

PredPDTrainYOB = groupsummary(data(TrainDataInd,:),'YOB','mean',... {'Default','PDNoMacro'}); figure; scatter(PredPDTrainYOB.YOB,PredPDTrainYOB.mean_Default*100,'*'); hold on plot(PredPDTrainYOB.YOB,PredPDTrainYOB.mean_PDNoMacro*100); hold off xlabel('Years on Books') ylabel('Default Rate (%)') legend('Observed','Predicted') title('Model Fit (Training Data)') grid on

Visualize the out-of-sample fit.

PredPDTestYOB = groupsummary(data(TestDataInd,:),'YOB','mean',... {'Default','PDNoMacro'}); figure; scatter(PredPDTestYOB.YOB,PredPDTestYOB.mean_Default*100,'*'); hold on plot(PredPDTestYOB.YOB,PredPDTestYOB.mean_PDNoMacro*100); hold off xlabel('Years on Books') ylabel('Default Rate (%)') legend('Observed','Predicted') title('Model Fit (Testing Data)') grid on

Visualize the in-sample fit for all score groups. The out-of-sample fit can be computed and visualized in a similar way.

PredPDTrainScoreYOB = groupsummary(data(TrainDataInd,:),... {'ScoreGroup','YOB'},'mean',{'Default','PDNoMacro'}); figure; hs = gscatter(PredPDTrainScoreYOB.YOB,... PredPDTrainScoreYOB.mean_Default*100,... PredPDTrainScoreYOB.ScoreGroup,'rbmgk','*'); mean_PDNoMacroMat = reshape(PredPDTrainScoreYOB.mean_PDNoMacro,... NumYOB,NumScoreGroups); hold on hp = plot(mean_PDNoMacroMat*100); for ii=1:NumScoreGroups hp(ii).Color = hs(ii).Color; end hold off xlabel('Years on Books') ylabel('Observed Default Rate (%)') legend(categories(data.ScoreGroup)) title('Model Fit by Score Group (Training Data)') grid on

The trend predicted with the previous model, as a function of years on books, has a very regular decreasing pattern. The data, however, shows some deviations from that trend. To try to account for those deviations, add the gross domestic product annual growth (represented by the `GDP`

variable) and stock market annual returns (represented by the `Market`

variable) to the model.

$$\mathrm{log}\left(\frac{p}{1-p}\right)={a}_{H}+{a}_{M}{1}_{M}+{a}_{L}{1}_{L}+{b}_{YOB}YOB+{b}_{GDP}GDP+{b}_{Market}Market+\u03f5$$

Expand the data set to add one column for `GDP`

and one for `Market`

, using the data from the `dataMacro`

table.

data.GDP = dataMacro.GDP(data.Year-1996); data.Market = dataMacro.Market(data.Year-1996); disp(data(1:10,:))

ID ScoreGroup YOB Default Year PDNoMacro GDP Market __ ___________ ___ _______ ____ _________ _____ ______ 1 Low Risk 1 0 1997 0.0084797 2.72 7.61 1 Low Risk 2 0 1998 0.0067697 3.57 26.24 1 Low Risk 3 0 1999 0.0054027 2.86 18.1 1 Low Risk 4 0 2000 0.0043105 2.43 3.19 1 Low Risk 5 0 2001 0.0034384 1.26 -10.51 1 Low Risk 6 0 2002 0.0027422 -0.59 -22.95 1 Low Risk 7 0 2003 0.0021867 0.63 2.78 1 Low Risk 8 0 2004 0.0017435 1.85 9.48 2 Medium Risk 1 0 1997 0.015097 2.72 7.61 2 Medium Risk 2 0 1998 0.012069 3.57 26.24

Fit the model with the macroeconomic variables by expanding the model formula to include the `GDP`

and the `Market`

variables.

ModelMacro = fitglm(data(TrainDataInd,:),... 'Default ~ 1 + ScoreGroup + YOB + GDP + Market',... 'Distribution','binomial'); disp(ModelMacro)

Generalized linear regression model: logit(Default) ~ 1 + ScoreGroup + YOB + GDP + Market Distribution = Binomial Estimated Coefficients: Estimate SE tStat pValue __________ _________ _______ ___________ (Intercept) -2.667 0.10146 -26.287 2.6919e-152 ScoreGroup_Medium Risk -0.70751 0.037108 -19.066 4.8223e-81 ScoreGroup_Low Risk -1.2895 0.045639 -28.253 1.2892e-175 YOB -0.32082 0.013636 -23.528 2.0867e-122 GDP -0.12295 0.039725 -3.095 0.0019681 Market -0.0071812 0.0028298 -2.5377 0.011159 388018 observations, 388012 error degrees of freedom Dispersion: 1 Chi^2-statistic vs. constant model: 1.97e+03, p-value = 0

Both macroeconomic variables show a negative coefficient, consistent with the intuition that higher economic growth reduces default rates.

Predict the probability of default for the training and testing data.

data.PDMacro = zeros(height(data),1); % Predict in-sample data.PDMacro(TrainDataInd) = predict(ModelMacro,data(TrainDataInd,:)); % Predict out-of-sample data.PDMacro(TestDataInd) = predict(ModelMacro,data(TestDataInd,:));

Visualize the in-sample fit. As desired, the model including macroeconomic variables, or macro model, deviates from the smooth trend predicted by the previous model. The rates predicted with the macro model match more closely with the observed default rates.

PredPDTrainYOBMacro = groupsummary(data(TrainDataInd,:),'YOB','mean',... {'Default','PDMacro'}); figure; scatter(PredPDTrainYOBMacro.YOB,PredPDTrainYOBMacro.mean_Default*100,'*'); hold on plot(PredPDTrainYOB.YOB,PredPDTrainYOB.mean_PDNoMacro*100); % No Macro plot(PredPDTrainYOBMacro.YOB,PredPDTrainYOBMacro.mean_PDMacro*100); % Macro hold off xlabel('Years on Books') ylabel('Default Rate (%)') legend('Observed','No Macro', 'Macro') title('Macro Model Fit (Training Data)') grid on

Visualize the out-of-sample fit.

PredPDTestYOBMacro = groupsummary(data(TestDataInd,:),'YOB','mean',... {'Default','PDMacro'}); figure; scatter(PredPDTestYOBMacro.YOB,PredPDTestYOBMacro.mean_Default*100,'*'); hold on plot(PredPDTestYOB.YOB,PredPDTestYOB.mean_PDNoMacro*100); % No Macro plot(PredPDTestYOBMacro.YOB,PredPDTestYOBMacro.mean_PDMacro*100); % Macro hold off xlabel('Years on Books') ylabel('Default Rate (%)') legend('Observed','No Macro', 'Macro') title('Macro Model Fit (Testing Data)') grid on

Visualize the in-sample fit for all score groups.

PredPDTrainScoreYOBMacro = groupsummary(data(TrainDataInd,:),... {'ScoreGroup','YOB'},'mean',{'Default','PDMacro'}); figure; hs = gscatter(PredPDTrainScoreYOBMacro.YOB,... PredPDTrainScoreYOBMacro.mean_Default*100,... PredPDTrainScoreYOBMacro.ScoreGroup,'rbmgk','*'); mean_PDMacroMat = reshape(PredPDTrainScoreYOBMacro.mean_PDMacro,... NumYOB,NumScoreGroups); hold on hp = plot(mean_PDMacroMat*100); for ii=1:NumScoreGroups hp(ii).Color = hs(ii).Color; end hold off xlabel('Years on Books') ylabel('Observed Default Rate (%)') legend(categories(data.ScoreGroup)) title('Macro Model Fit by Score Group (Training Data)') grid on

Use the fitted macro model to stress-test the predicted probabilities of default.

Assume the following are stress scenarios for the macroeconomic variables provided, for example, by a regulator.

disp(dataMacroStress)

GDP Market _____ ______ Baseline 2.27 15.02 Adverse 1.31 4.56 Severe -0.22 -5.64

Set up a basic data table for predicting the probabilities of default. This is a dummy data table, with one row for each combination of score group and number of years on books.

dataBaseline = table; [ScoreGroup,YOB]=meshgrid(1:NumScoreGroups,1:NumYOB); dataBaseline.ScoreGroup = categorical(ScoreGroup(:),1:NumScoreGroups,... categories(data.ScoreGroup),'Ordinal',true); dataBaseline.YOB = YOB(:); dataBaseline.ID = ones(height(dataBaseline),1); dataBaseline.GDP = zeros(height(dataBaseline),1); dataBaseline.Market = zeros(height(dataBaseline),1);

To make the predictions, set the same macroeconomic conditions (baseline, adverse, or severely adverse) for all combinations of score groups and number of years on books.

% Predict baseline the probabilities of default dataBaseline.GDP(:) = dataMacroStress.GDP('Baseline'); dataBaseline.Market(:) = dataMacroStress.Market('Baseline'); dataBaseline.PD = predict(ModelMacro,dataBaseline); % Predict the probabilities of default in the adverse scenario dataAdverse = dataBaseline; dataAdverse.GDP(:) = dataMacroStress.GDP('Adverse'); dataAdverse.Market(:) = dataMacroStress.Market('Adverse'); dataAdverse.PD = predict(ModelMacro,dataAdverse); % Predict the probabilities of default in the severely adverse scenario dataSevere = dataBaseline; dataSevere.GDP(:) = dataMacroStress.GDP('Severe'); dataSevere.Market(:) = dataMacroStress.Market('Severe'); dataSevere.PD = predict(ModelMacro,dataSevere);

Visualize the average predicted probability of default across score groups under the three alternative regulatory scenarios. Here, all score groups are implicitly weighted equally. However, predictions can also be made at a loan level for any given portfolio to make the predicted default rates consistent with the actual distribution of loans in the portfolio. The same visualization can be produced for each score group separately.

PredPDYOB = zeros(NumYOB,3); PredPDYOB(:,1) = mean(reshape(dataBaseline.PD,NumYOB,NumScoreGroups),2); PredPDYOB(:,2) = mean(reshape(dataAdverse.PD,NumYOB,NumScoreGroups),2); PredPDYOB(:,3) = mean(reshape(dataSevere.PD,NumYOB,NumScoreGroups),2); figure; bar(PredPDYOB*100); xlabel('Years on Books') ylabel('Predicted Default Rate (%)') legend('Baseline','Adverse','Severe') title('Stress Test, Probability of Default') grid on

Generalized Linear Models documentation: https://www.mathworks.com/help/stats/generalized-linear-regression.html

Generalized Linear Mixed Effects Models documentation: https://www.mathworks.com/help/stats/generalized-linear-mixed-effects-models.html

Federal Reserve, Comprehensive Capital Analysis and Review (CCAR): https://www.federalreserve.gov/bankinforeg/ccar.htm

Bank of England, Stress Testing: https://www.bankofengland.co.uk/financial-stability

European Banking Authority, EU-Wide Stress Testing: https://www.eba.europa.eu/risk-analysis-and-data/eu-wide-stress-testing