Regularized Estimates of Model Parameters

What Is Regularization?

Regularization is the technique for specifying constraints on the flexibility of a model, thereby reducing uncertainty in the estimated parameter values.

Model parameters are obtained by fitting measured data to the predicted model response, such as a transfer function with three poles or a second-order state-space model. The model order is a measure of its flexibility — higher the order, the greater the flexibility. For example, a model with three poles is more flexible than one with two poles. Increasing the order causes the model to fit the observed data with increasing accuracy. However, the increased flexibility comes with the price of higher uncertainty in the estimates, measured by a higher value of random or variance error. On the other hand, choosing a model with too low an order leads to larger systematic errors. Such errors cannot be attributed to measurement noise and are also known as bias error.

Ideally, the parameters of a good model should minimize the mean square error (MSE), given by a sum of systematic error (bias) and random error (variance):

MSE = |Bias|² + Variance

The minimization is thus a tradeoff in constraining the model. A flexible (high order) model gives small bias and large variance, whereas a simpler (low order) model results in larger bias and smaller variance errors. Typically, you can investigate this tradeoff between bias and variance errors by cross-validation tests on a set of models of increasing flexibility. However, such tests do not always give full control in managing the parameter estimation behavior. For example:

You cannot use the known (a priori) information about the model to influence the quality of the fits.
In grey-box and other structured models, the order is fixed by the underlying ODEs and cannot be changed. If the data is not rich enough to capture the full range of dynamic behavior, this typically leads to high uncertainty in the estimated values.
Varying the model order does not let you explicitly shape the variance of the underlying parameters.

Regularization gives you a better control over the bias versus variance tradeoff by introducing an additional term in the minimization criterion that penalizes the model flexibility. Without regularization, for a model with one output and no weighting, the parameter estimates are obtained by minimizing a weighted quadratic norm of the prediction errors ε(t,θ):

$\begin{array}{l} V_{N} (θ) = \frac{1}{N} \sum_{t = 1}^{N} ε^{2} (t, θ) \end{array}$

where t is the time variable, N is the number of data samples, and ε(t,θ) is the predicted error computed as the difference between the observed output and the predicted output of the model.

Regularization modifies the cost function by adding a term proportional to the square of the norm of the parameter vector θ, so that the parameters θ are obtained by minimizing:

${\hat{V}}_{N} (θ) = \frac{1}{N} \sum_{t = 1}^{N} ε^{2} (t, θ) + \frac{1}{N} λ {‖ θ ‖}^{2}$

where λ is a positive constant that has the effect of trading variance error in V_N(θ) for bias error — the larger the value of λ, the higher the bias and lower the variance of θ. The added term penalizes the parameter values with the effect of keeping their values small during estimation. In statistics, this type of regularization is called ridge regression. For more information, see Ridge Regression (Statistics and Machine Learning Toolbox).

Note

Another choice for the norm of θ vector is the L₁-norm, known as lasso regularization. However, System Identification Toolbox™ supports only the 2-norm based penalty, known as L₂ regularization, as shown in the previous equation.

The penalty term is made more effective by using a positive definite matrix R, which allows weighting and/or rotation of the parameter vector:

${\hat{V}}_{N} (θ) = \frac{1}{N} \sum_{t = 1}^{N} ε^{2} (t, θ) + \frac{1}{N} λ θ^{T} R θ$

The square matrix R gives additional freedom for:

Shaping the penalty term to meet the required constraints, such as keeping the model stable
Adding known information about the model parameters, such as reliability of the individual parameters in the θ vector

For structured models such as grey-box models, you may want to keep the estimated parameters close to their guess values to maintain the physical validity of the estimated model. This can be achieved by generalizing the penalty term to $λ {(θ - θ^{*})}^{T} R (θ - θ^{*})$ , such that the cost function becomes:

${\hat{V}}_{N} (θ) = \frac{1}{N} \sum_{t = 1}^{N} ε^{2} (t, θ) + \frac{1}{N} λ {(θ - θ^{*})}^{T} R (θ - θ^{*})$

Minimizing this cost function has the effect of estimating θ such that their values remain close to initial guesses θ*.

In regularization:

θ* represents prior knowledge about the unknown parameters.
λ*R represents the confidence in the prior knowledge of the unknown parameters. This implies that the larger the value, the higher the confidence.

A formal interpretation in a Bayesian setting is that θ has a prior distribution that is Gaussian with mean θ* and covariance matrix $σ^{2} / λ R^{- 1}$ , where σ² is the variance of ε(t). The use of regularization can therefore be linked to some prior information about the system. This could be quite soft, such as the system is stable.

You can use the regularization variables λ and R as tools to find a good model that balances complexity and provides the best tradeoff between bias and variance. You can obtain regularized estimates of parameters for transfer function, state-space, polynomial, grey-box, process, and nonlinear black-box models. The three terms defining the penalty term, λ, R and θ*, are represented by regularization options Lambda, R, and Nominal, respectively in the toolbox. You can specify their values in the estimation option sets for both linear and nonlinear models. In the System Identification app, click Regularization in the linear model estimation dialog box or Estimation Options in the Nonlinear Models dialog box.

When to Use Regularization

Use regularization for:

Identifying overparameterized models.
Imposing a priori knowledge of model parameters in structured models.
Incorporating knowledge of system behavior in ARX and FIR models.

Identifying Overparameterized Models

Over-parameterized models are rich in parameters. Their estimation typically yields parameter values with a high level of uncertainty. Over-parameterization is common for nonlinear ARX (idnlarx) models and can also be for linear state-space models using free parameterization.

In such cases, regularization improves the numerical conditioning of the estimation. You can explore the bias-vs.-variance tradeoff using various values of the regularization constant Lambda. Typically, the Nominal option is its default value of 0, and R is an identity matrix such that the following cost function is minimized:

${\hat{V}}_{N} (θ) = \frac{1}{N} \sum_{t = 1}^{N} ε^{2} (t, θ) + \frac{1}{N} λ {‖ θ ‖}^{2}$

In the following example, a nonlinear ARX model estimation using a large number of neurons leads to an ill-conditioned estimation problem.

% Load estimation data.
load regularizationExampleData.mat nldata
% Estimate model without regularization.
Orders = [1 2 1];
NL = idSigmoidNetwork('NumberOfUnits',30);
sys = nlarx(nldata,Orders,NL);
compare(nldata,sys)

Applying even a small regularizing penalty produces a good fit for the model to the data.

% Estimate model using regularization constant λ = 1e-8.
opt = nlarxOptions;
opt.Regularization.Lambda = 1e-8;
sysr = nlarx(nldata,Orders,NL,opt);
compare(nldata,sysr)

Imposing A Priori Knowledge of Model Parameters in Structured Models

In models derived from differential equations, the parameters have physical significance. You may have a good guess for typical values of those parameters even if the reliability of the guess may be different for each parameter. Because the model structure is fixed in such cases, you cannot simplify the structure to reduce variance errors.

Using the regularization constant Nominal, you can keep the estimated values close to their initial guesses. You can also design R to reflect the confidence in the initial guesses of the parameters. For example, if θ is a 2-element vector and you can guess the value of the first element with more confidence than the second one, set R to be a diagonal matrix of size 2-by-2 such that R(1,1) >> R(2,2).

In the following example, a model of a DC motor is parameterized by static gain G and time constant τ. From prior knowledge, suppose you know that G is about 4 and τ is about 1. Also, assume that you have more confidence in the value of τ than G and would like to guide the estimation to remain close to the initial guess.

% Load estimation data.
load regularizationExampleData.mat motorData
% Create idgrey model for DC motor dynamics. 
mi = idgrey(@DCMotorODE,{'G',4;'Tau',1},'cd',{}, 0);
mi = setpar(mi,'label','default'); 
% Configure Regularization options.
opt = greyestOptions;
opt.Regularization.Lambda = 100;
% Specify that the second parameter better known than the first.
opt.Regularization.R = [1, 1000]; 
% Specify initial  guess as Nominal. 
opt.Regularization.Nominal = 'model';
% Estimate model.
sys = greyest(motorData,mi,opt)
getpar(sys)

Incorporating Knowledge of System Behavior in ARX and FIR Models

In many situations, you may know the shape of the system impulse response from impact tests. For example, it is quite common for stable systems to have an impulse response that is smooth and exponentially decaying. You can use such prior knowledge of system behavior to derive good values of regularization constants for linear-in-parameter models such as ARX and FIR structure models using the arxRegul command.

For black-box models of arbitrary structure, it is often difficult to determine the optimal values of Lambda and R that yield the best bias-vs.-variance tradeoff. Therefore, it is recommended that you start by obtaining the regularized estimate of an ARX or FIR structure model. Then, convert the model to a state-space, transfer function or polynomial model using the idtf, idss, or idpoly commands, followed by order reduction if required.

In the following example, direct estimation of a 15th order continuous-time transfer function model fails due to numerical ill-conditioning.

% Load estimation data.
load dryer2
Dryer = iddata(y2,u2,0.08);
Dryerd = detrend(Dryer,0);
Dryerde = Dryerd(1:500);
xe = Dryerd(1:500);
ze = Dryerd(1:500);
zv = Dryerd(501:end);
% Estimate model without regularization.
sys1 = tfest(ze,15);

Therefore, use regularized ARX estimation and then convert the model to transfer function structure.

% Specify regularization constants.
[L, R] = arxRegul(ze,[15 15 1]);
optARX = arxOptions;
optARX.Regularization.Lambda = L;
optARX.Regularization.R = R;
% Estimate ARX model.
sysARX = arx(ze,[15 15 1],optARX);
% Convert model to continuous time. 
sysc = d2c(sysARX);
% Convert model to transfer function.
sys2 = idtf(sysc);
% Validate the models sys1 and sys2.
compare(zv,sys1,sys2)

Choosing Regularization Constants

A guideline for selecting the regularization constants λ and R is in the Bayesian interpretation. The added penalty term is an assumption that the parameter vector θ is a Gaussian random vector with mean θ* and covariance matrix $σ^{2} / λ R^{- 1}$ .

You can relate naturally to such an assumption for a grey-box model, where the parameters are of known physical interpretation. In other cases, this may be more difficult. Then, you have to use ridge regression (R = 1; θ* = 0) and tune λ by trial and error.

Use the following techniques for determining λ and R values:

Incorporate prior information using tunable kernels.
Perform cross-validation tests.

Incorporate Prior Information Using Tunable Kernels

Tuning the regularization constants for ARX models in arxRegul is based on simple assumptions about the properties of the true impulse responses.

In the case of an FIR model, the parameter vector contains the impulse response coefficients b_k for the system. From prior knowledge of the system, it is often known that the impulse response is smooth and exponentially decaying:

$E {[b_{k}]}^{2} = C μ^{k}, c o r r {b_{k} b_{k - 1}} = ρ$

where corr means correlation. The equation is a parameterization of the regularization constants in terms of coefficients C, μ, and ρ and the chosen shape (decaying polynomial) is called a kernel. The kernel thus contains information about parameterization of the prior covariance of the impulse response coefficients.

You can estimate the parameters of the kernel by adjusting them to the measured data using the RegularizationKernel input of the arxRegul command. For example, the DC kernel estimates all three parameters while the TC kernel links $ρ = \sqrt{μ}$ . This technique of tuning kernels applies to all linear-in-parameter models such as ARX and FIR models.

Perform Cross-Validation Tests

A general way to test and evaluate any regularization parameters is to estimate a model based on certain parameters on an estimation data set, and evaluate the model fit for another validation data set. This is known as cross-validation.

Cross-validation is entirely analogous to the method for selecting model order:

Generate a list of candidate λ and R values to be tested.
Estimate a model for each candidate regularization constant set.
Compare the model fit to the validation data.
Use the constants that give the best fit to the validation data.

For example:

% Create estimation and validation data sets.
ze = z(1:N/2);
zv = z(N/2:end);
% Specify regularization options and estimate models.
opt = ssestOptions;
for tests = 1:M
opt.Regularization.Lambda = Lvalue(test);
opt.Regularization.R = Rvalue(test);
m{test} = ssest(ze,order,opt);
end
% Compare models with validation data for model fit.
[~,fit] = compare(zv,m{:))

References

[1] L. Ljung. “Some Classical and Some New Ideas for Identification of Linear Systems.” Journal of Control, Automation and Electrical Systems. April 2013, Volume 24, Issue 1-2, pp 3-10.

[2] L. Ljung, and T. Chen. “What can regularization offer for estimation of dynamical systems?” In Proceedings of IFAC International Workshop on Adaptation and Learning in Control and Signal Processing, ALCOSP13, Caen, France, July 2013.

[3] L. Ljung, and T. Chen. “Convexity issues in system identification.” In Proceedings of the 10th IEEE International Conference on Control & Automation, ICCA 2013, Hangzhou, China, June 2013.