Building Diagnostic Models in the Processing Industries with Machine Learning
Chris Aldrich, WA School of Mines
Overview
Machine learning models are often seen as black box systems that can capture complex nonlinear relationships between variables, but that are difficult to interpret otherwise. However, this is not necessarily the case, and in this presentation, the use of random forests and neural networks to build diagnostic models in the process industries will be reviewed. Both these modelling methodologies are highly versatile tools and their implementation in MATLAB will be demonstrated based on several case studies in the mining and mineral processing industries.
About the Presenter
Professor Chris Aldrich holds a Chair in Process Systems Engineering and is Deputy Head of the WA School of Mines. He is also an Extraordinary Professor at Stellenbosch University in South Africa and an Honorary Professor at The University of Queensland. He was previously the Founding Director of the Anglo American Platinum Centre for Process Monitoring at Stellenbosch University.
He holds a PhD and a higher doctoral degree from Stellenbosch University, is a Fellow of the South African Academy of Engineering, a Fellow of the Australasian Institute of Mining and Metallurgy and serves on the editorial boards of the International Journal of Mining, Minerals and Metals, Minerals Engineering, Water SA and Control Engineering Practice. His research interests include artificial intelligence, machine learning, mineral processing and extractive metallurgy.
Recorded: 11 Aug 2020
Ladies and gentlemen, as you can see, my presentation is entitled "Building Diagnostic Models in the Process Industries with Machine Learning." The outline of my presentation is as follows. In the first part, I'd like to spend some time on an overview of the machine learning tools that we're going to look at. Those would be multilayer perceptrons specifically, and then also decision trees and random forests.
I'll also spend some time on the diagnostic approaches that one would use with these machine learning models. That would include variable importance analysis and the different measures used for it. Then, in the second part, we're going to look at a practical case study involving an autogenous grinding circuit, and I'll conclude with some discussion and conclusions.
So just by way of background, insight into the underlying physical phenomena in process systems is really important to the development of reliable process models that are used, or can be used, in advanced process control systems, for decision support of plant operators, and in general for more efficient operation of the plants themselves. With the availability of large amounts of plant data becoming much more commonplace, there's also a growing interest in the use of these data to gain deeper insights into plant operations. Now, linear models, of course, have been used for this for decades. The problem is just that these models are limited when it comes to dealing with nonlinear or complex systems, unless we understand the system well in the first place, which is a bit of a chicken and egg situation. Therefore, machine learning has attracted very significant interest as a means to unlock the sort of knowledge or insights that could potentially be captured in the plant data.
This slide shows the papers that have been published in Minerals Engineering, one of the prominent journals in the field, that are related to the use of multiple linear regression, neural networks, and decision trees or random forests in mineral processing operations. So this is a proxy for the use or application of these methods, not just for diagnostic purposes but in general in the field. We can see that while multiple linear regression is still used quite a lot, neural networks are a close second, and decision trees and random forests perhaps not so much, although the more recent trend is that the use of these decision tree-based models is growing more rapidly than the other two, more established approaches. So this is also something of a proxy for what's happening in industry.
OK, so what are we looking at when we are looking at neural networks? Basically, a neural network is composed of nodes, or neurons, or artificial neurons. In essence, a node takes its inputs, the variables x1, x2, x3, and multiplies each by a weight associated with that node. That weighted sum is then further processed by a simple function, and the result is either the output of the network or is passed on and distributed to the other nodes in the network, where the process is repeated.
In the network itself, which in this case is a multilayer perceptron, the nodes are arranged in distinct layers: an input layer, one or more hidden layers, one being the most common configuration, and then an output node.
The input layer really only weights the inputs and distributes them to the next layer, the hidden layer, where the actual processing happens. Each hidden node takes the weighted input x1, the weighted input x2, sums them, applies some simple transformation, and passes the result on to the next node. So a multilayer perceptron, if you want to express it as an equation, is a nonlinear function g of a linear combination of nonlinear functions, which are in turn applied to linear combinations of the inputs. So yes, there is the equation, and you can interpret it any way you like.
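As a rough sketch of that equation for a single hidden layer, with the weights w and biases b written generically (the exact notation on the slide is not reproduced here), the prediction is

    \hat{y} = g\Big( b^{(2)} + \sum_j w_j^{(2)} \, h\big( b_j^{(1)} + \sum_i w_{ji}^{(1)} x_i \big) \Big)

where h is the hidden-layer activation function and g the output activation.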
The activation functions, just a quick word on those. People starting out in neural networks often ask, "What sort of activation function should I use?" Well, sigmoidal activation functions are used quite extensively, along with a few other functions. The reason is that they are easy to differentiate, as we'll see shortly, and they are bounded regardless of the input. That's quite important, because otherwise you could have an unstable network when you try to train it.
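For reference, the logistic sigmoid is a typical example of such a function:

    \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr)

It is bounded between 0 and 1 for any input, and its derivative has that simple closed form, which is what makes it so convenient during training.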
So, training of the networks, just very briefly. We start off with a set of weights, initialized randomly to small values, typically between minus 0.1 and 0.1. We split the data into a training and a test data set, and we pass the first record of the training data through the network.
It goes through the input layer and the hidden layer, and the network generates an output that's typically way off the actual target output. Because it's supervised training, the training set gives us examples of these are the inputs, and these are the associated outputs.
That error is then propagated back through the network, and as that's done, each weight is changed slightly, incrementally, so that if that same sample were passed through the network again, the error would be slightly smaller. So just a small change in the error each time, and we can adjust the weights based on that error to ensure that that's the case.
This is repeated with all the samples, many times over, until we have a set of weights in the network that can predict the data to an arbitrary degree of accuracy. Of course, we make sure that we don't overfit the data; we don't want that. So every once in a while during training we do an independent check with the so-called test set data, just to make sure that we're not overfitting. The moment we start to overfit, we stop training, and that would be our network.
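In its simplest, plain gradient descent form, each of those small weight adjustments looks something like

    w \leftarrow w - \eta \, \frac{\partial E}{\partial w}

where E is the prediction error and \eta a small step size or learning rate; the more advanced training algorithms mentioned next refine this basic idea.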
So that's in essence the training process. Again, as a user, it's not something that we need to be too concerned about. It was pretty much established some 20 to 50 years ago, when we had to be a little bit more careful. But now we can use advanced algorithms like the Levenberg-Marquardt algorithm, which is the default in MATLAB.
Let's look at tree-based models. With tree-based models, it's important to understand the idea or concept of entropy, the entropy of a system. So let's consider the system on the right, showing two predictor variables, x1 and x2, and two classes.
So it's a classification problem, a balanced problem. We have 25 samples of each class. Let's say that blue would be perhaps the process being in control, and the red, the process out of control, just sort of as an example.
The entropy of this entire system, the system defined by what's inside that circle, is defined in terms of the probabilities of these events, an event being the drawing of either a blue or a red dot. If we sample blindly within the circle, and there are equal numbers of each, the probability of each event is simply 50%, or 25 out of 50. The entropy of the entire system is then calculated as the negative of the sum of the products of those event probabilities and the logs of those probabilities.
We can take the log to any base, but base 2 is convenient. So in this case, the entropy of this specific system would be, for the blue dots, minus 25 out of 50 times the log to base 2 of 25 out of 50, or one half, plus exactly the same term for the red dots. And we get an entropy of 1, a unit entropy. So this is a system with unit entropy when the log is taken to base 2.
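Written out, that calculation is

    H = -\sum_i p_i \log_2 p_i = -\tfrac{25}{50}\log_2\tfrac{25}{50} - \tfrac{25}{50}\log_2\tfrac{25}{50} = 1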
OK, so if we then split this system into two subsystems, the entropy of the two subsystems taken collectively is calculated from the entropies of each of the subsystems, weighted by the number of samples in each. You can see from the calculations, starting with the top subsystem with the blue dots, that again we have two possible events, red and blue, but red, or out of control, simply does not occur in that top subsystem.
So the calculation is again the same, the product of the probability and the log of the probability. That is 25 out of 25, for the blue dots, times the log of 25 over 25, plus, for the red dots, nothing there, so that term is 0. And we get a zero entropy for the top subsystem, and with the same calculation a zero entropy for the bottom subsystem with the red dots. So this is an example of a zero-entropy system.
So you might say, well, how do we use this to actually generate decision trees? It works as follows. The basic principle is illustrated with the system on the left hand side: we have two predictor variables, and it's a classification problem.
So the way that we go about it is to search for the best split. In other words, if we look at that previous system, the entropy went down from 1 to 0. So in this case, we are looking at splitting the system along a specific variable at this specific point to maximize the reduction in the entropy of the system. So we start with a system. In this case, again, it would be a unit entropy system. And we search for that split that would best reduce or maximally reduce the entropy of the system.
So let's say, in this case, it's along x1. We search over all possible split points along x1, and we find that at w2 we get a maximum reduction: the subsystem to the right of w2 has zero entropy.
We then calculate the entropy of the other subsystem, weight the two by the number of samples in each, and get a new entropy for the split system. We continue that search for the second variable along all its possible split points, v1, v2, and so on. But we find, after we've searched all the variables at all possible split points, that we get the maximum reduction when splitting at x1 equal to w2.
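In symbols, for a candidate split that sends n_L of the n samples to the left subsystem and n_R to the right, the entropy after the split is the sample-weighted sum

    H_{split} = \frac{n_L}{n} H_L + \frac{n_R}{n} H_R

and we choose the variable and split point that maximize the reduction H_{parent} - H_{split}, the information gain.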
So then we start the tree from the top with x1, the variable, and we have a split point, and it's a binary split. We won't bother splitting the pure part, the right hand side of w2, any further; of course, there's no point in splitting that.
But we'll then split the part to the left of w2 on x1, searching in exactly the same way over all possible split points along all possible variables. In this case, there are two; if there were 50 variables in the system, we'd do that for all 50 variables. And we find the best split point, which in this case would be on x2, let's say at the value v2.
So once we've split there, we have three subsystems. Again, we won't bother splitting further what's above v2 on the x2 scale. And so we continue until we have a completely pure system, and we can express that as a decision tree, a classification tree in this case, with variables and split points.
Now, if it's a regression problem, it works exactly the same way, except that we're not looking for the maximum reduction in entropy but for the maximum reduction in the variance of the output within a subsystem. So this approach allows us to generate decision trees, classification or regression, efficiently and rapidly. We can work with large data sets, and we get a tree that can also be converted to a set of if-then rules. So it's interpretable, it's a nonlinear model, and it's also a very popular way of dealing with many different types of problems.
Let's look at random forest models. Now, first of all, single trees, like the ones we've just considered, are not state-of-the-art predictive modeling approaches. One could say that's fine because they are really easy to interpret; we can express them as a set of if-then rules.
But actually, that's also not necessarily the case. If we have many variables, for example, interpretation becomes an issue: if x1 is so-and-so, and x2, and x3, and x4, we tend to lose track. And besides, if the variables are correlated, we end up with a really complex tree that's very difficult or almost impossible to interpret. For that reason, and because these trees are also what you'd refer to as brittle, and because we can keep splitting but need to make sure that we don't overfit, the optimization of these models is not necessarily an easy problem either.
So if we look at a random forest model, then as the name suggests, these models are really ensembles or combinations of trees. It's a set of decision trees that we use collectively. And as the name suggests, with a random forest model we generate each tree in the forest randomly, both by selecting a random subset of candidate variables at each split and by selecting a random subset of samples from the training data.
So each tree would then be a weak model. It won't be a really good model. But collectively, these trees are actually seeing all the data, all the variables. We don't have to worry about overfitting anymore, and we actually have, as it turns out, a model that's really powerful, that can capture nonlinear behavior very well.
And well, we can say, what about the interpretability? We'll get to that shortly. But it can also, like with the trees, deal with regression problems, classification problems. In the case of regression, we'll simply look at the average of each tree's output. If it's a classification problem, the trees will simply vote, and we look at the majority vote.
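In other words, for an ensemble of B trees T_1, ..., T_B, the forest's prediction for a regression problem is simply the average

    \hat{y}(x) = \frac{1}{B} \sum_{b=1}^{B} T_b(x)

and for a classification problem it is the majority vote over the individual tree predictions T_b(x).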
OK, so these are the typical default hyperparameters for random forests, the so-called hyperparameters that we need to specify as a user. But as it turns out, the method is actually very robust to them. Certainly, looking at the second one from the bottom, the number of trees in the forest needs to be specified. But as we'll see, as long as we have some minimum number of trees, the performance of the forest is not really affected very much by that number.
The other parameter that one should perhaps have a look at is the number of candidate variables that's drawn, or considered, at each split. Remember, we said we're selecting from the variables as well as from the samples. There are default values for all of these parameters, and those values work pretty well.
But in some cases, if the forest is not performing that well, it could well be that changing that parameter, maybe increasing it a little from the default value, yields better results. Otherwise, one can stay with the default parameters. It's certainly not critical to change the number of nodes or the splitting criterion; that doesn't really affect the models in most cases.
OK, now on to implementation in MATLAB. MATLAB supports a very wide range of machine learning models in its toolboxes. With multilayer perceptrons, for example, first of all you have to load the data into the workspace, or it depends on how you use it; of course, the data needs to be available. So let's say, in this case, we load it into the workspace.
This is just an example of some data. The input data would be x and the output data y, whether that's a regression value or a classification value. Just a note for those who may be using the toolbox for the first time: the MATLAB Neural Network Toolbox interprets columns as samples, so we just need to transpose the data when we use it. And you don't have to, but you can specify the variable labels as well for this sort of application.
You can see, the steps are as follows. First, you need to set up your multilayer perceptron model by specifying the number of nodes in the hidden layer; this is a single-hidden-layer model. You can simply set the number of hidden nodes equal to 6 or 12 or whatever number you want to use. Preferably lower is better, well, not too low, but nonetheless.
Then you set up the net and configure it with the data that's available, as configure(net, x, y). Here, net is the name of our model; you can give it any other name.
So that sets up the model, and then you train the network.
Now, this is sort of the minimum command. In MATLAB, you have sort of control over a range of different training algorithms. You have many other parameters you can set. I'm just showing the simplest sort of approach here. You can see it's fairly simple because it's just the setup.
And this will actually generate this graphical user interface, where you can track the training and see the network. You can see the performance. So very conveniently implemented in MATLAB. And then you can, well, see the actual output statistics there as well. So I've just sort of done some calculations here as well.
And you can also, afterwards, access all the parameters in the network. For example, the input-to-hidden weights would in this case be stored in net.IW, and likewise the hidden-to-output weights (in net.LW). So it's a very convenient implementation in MATLAB. We simply need to load our data, and then we can run the network with just a few very basic commands.
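As a minimal sketch of those steps, assuming x and y are already in the workspace with rows as samples, and with 6 hidden nodes purely as an example:

    % transpose, because the toolbox expects columns as samples
    xt = x';  yt = y';
    net = fitnet(6);                 % single hidden layer with 6 nodes
    net = configure(net, xt, yt);    % initialize the network for this data
    [net, tr] = train(net, xt, yt);  % trains with Levenberg-Marquardt by default
    yhat = net(xt);                  % model predictions
    Wih  = net.IW{1};                % input-to-hidden weights
    Who  = net.LW{2,1};              % hidden-to-output weights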
So if we're looking at the implementation of random forests in MATLAB, similarly, we'll have to load the data into the workspace first. Let's say we have the input data in there as x. This x would be a data matrix where each column is a variable and the rows are the samples. And y would be the output, either a numeric value for regression or, typically, a class number for classification; it can be otherwise as well, if you specify that. And again, the variable labels can be specified optionally.
So what would a random forest implementation in MATLAB look like? Well, as a minimum, we need to specify the number of trees in the forest. So that's the minimal command. In MATLAB, the function is TreeBagger; if you type "help TreeBagger", you'll get a lot of information on random forests in MATLAB.
So in this case, I call the model b. We can call it anything, My Random Forest, any name you would like to give it. It's equal to TreeBagger. So the number of trees, we've specified 50, 100, 200, 1,000. x and y, the inputs and outputs, and that's it. We have a model. We can go with the model.
In a slightly more verbose form, we have to specify the method: by default, MATLAB will do classification, so if we're going to do regression, we have to specify it as such, with the additional pair 'Method', 'regression'. We could also type 'classification', but that's the default anyway. (The ellipsis in the command simply means it continues on the same line.)
Now, very conveniently, we can also set the so-called out-of-bag prediction to on, and the out-of-bag predictor importance to on. And if you want, the number of predictors to sample can be specified too; that has a default value, but it may be good to do a few spot checks to optimize it. And if you like, you can also specify the minimum leaf size, or you can just leave it out.
This "OOBPrediction, on" means the following. Remember, at each split point the forest is looking at just a random selection of the data and of the variable set; we call that the in-bag sample. But it's also automatically testing on the data it has not seen, so it's doing an out-of-bag prediction, which is actually a very good indication of the ability of the random forest to generalize to data it has not seen. So it gives you a good idea of how well this random forest model would work when it is deployed.
And if we want to use it for diagnostic purposes, then we have to set the out-of-bag predictor importance to on as well, and I'll get to that a little later on. OK, so that is perhaps the implementation that you'd want. And you can see, it's really convenient: a single line that calculates quite a lot of output.
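Putting those options together, a sketch of the call might look like this (the number of trees and the optional settings shown here are illustrative, not prescriptive):

    % X: predictors (rows = samples, columns = variables), Y: response
    b = TreeBagger(100, X, Y, ...
        'Method', 'regression', ...
        'OOBPrediction', 'on', ...
        'OOBPredictorImportance', 'on', ...
        'NumPredictorsToSample', 4, ...   % optional: candidate variables per split
        'MinLeafSize', 5);                % optional: minimum samples per leaf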
Right, so the convenience now is that once OOBPrediction is set to on, we can plot the out-of-bag error as a function of the number of trees by simply calling plot(oobError(b)), because that error is being tracked when you switch it on. That gives us the mean square error against the number of trees. And we can see here, we specified 100 trees; maybe we could go to 200, but it's pretty much leveling off, so 100 would be fine. With 1,000 trees, we'd just get that tail extending further to the right.
Right, so that's just a check. With classification, it gives us the misclassified fraction instead. So if it's a 50-50 problem and the curve levels off at 0.5, then you know the model is not really doing anything; if it's down at 0.05, then you know that's the classification error.
OK, so if we want to use the random forest, we can deploy it with predict: we simply pass the model and the data that we want to predict on, and it gives us the results.
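In code, those two checks might look like this (Xnew here is just a placeholder for whatever new data you want to score):

    plot(oobError(b))                  % out-of-bag error versus number of grown trees
    xlabel('Number of grown trees'); ylabel('Out-of-bag error');
    yhat = predict(b, Xnew);           % deploy the trained forest on new data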
Right, so how do we interpret these models? How do we use them diagnostically? We do that-- well, broadly speaking, if we are sort of opening up the black box, so to speak, or looking at these machine learning models, we can't look at the neural network as such. We can't inspect the weights. Even if we have the weights, the knowledge or the relationships are distributed among the weights of the network.
And if we retrain it, we'll get a different set of weights. So we can't look at the weights as such of the network. So how do we actually interpret the network? That's one.
And the same with the random forest: yes, we did see that we can tell it to calculate the predictor importance, but how does that actually work? Well, there are broadly three approaches that one could follow. One could look at the structure of the model, so inspect the inner logic of the trained model; like we said, we can't really do that directly.
Well, I'll get back to that. With the random forest, we can do that to some extent and actually, with a multilayer perceptron as well, but not sort of directly. We can look at perturbation analysis or sensitivity analysis, where we sort of change the variables one at a time and then see how that affects the output. And then of course, we also look at surrogate models, where we have a simple localized model, perhaps a linear model.
Think of a convoluted decision surface, and then fitting a plane or a line through a very localized area: a simple model with which we can approximate the behavior of the overall black box model in that region. We're not going to look at that today, but we will look at the first two approaches.
All right, permutation variable importance; the equation is just there for reference purposes. In essence, once the model is trained, we permute the variables one at a time. We shuffle the column of data associated with a variable, so we destroy the relationship between that variable and the output.
Once that variable has been shuffled in the data, we pass the data through the network or the random forest again and look at the difference. If it doesn't make much of a difference, then that variable is not very important. If it makes a big difference, in other words, the mean square error of the model increases quite a lot, or the R squared value, the correlation coefficient, decreases a lot, then we know that's probably quite an important variable.
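A rough, model-agnostic sketch of that idea for a regression model; predictFcn, X and y are placeholders for the trained model's prediction function and the data:

    % e.g. predictFcn = @(Xq) predict(b, Xq) for the TreeBagger model above
    baseMSE = mean((y - predictFcn(X)).^2);       % error with the intact data
    p = size(X, 2);
    permImp = zeros(1, p);
    for j = 1:p
        Xp = X;
        Xp(:, j) = X(randperm(size(X, 1)), j);    % shuffle column j only
        permImp(j) = mean((y - predictFcn(Xp)).^2) - baseMSE;   % increase in MSE
    end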
If we look at the Gini variable importance measure that's associated with random forests, we look at the split points in each of the trees in the forest and calculate the decrease in entropy (or impurity), weighted by the number of samples at that split point, for the variable used at that split. We can then accumulate, for each variable, those decreases across all the trees in the forest, and that gives us an indication of the importance of that specific variable in the random forest.
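Schematically, with n_s the number of samples reaching split s, n the total number of samples, and \Delta i_s the impurity decrease at that split, the measure for variable j is something like

    VI_j = \sum_{t=1}^{T} \; \sum_{\substack{s \in t \\ \text{split on } j}} \frac{n_s}{n}\,\Delta i_s

summed over all T trees in the forest.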
OK, we can also look at the neural network or multilayer perceptron. We said, well, we can use a permutation, but we can also look at the weights. Well, we said we couldn't look at the weights, but OK, let me qualify that. We can actually look at the sum of the products of the weights. So we can look at a certain input variable, and we can look at the path that it follows to the output.
So if we multiply the weights along a path, we get the product for that path, because that input goes to a hidden node and then to the output node; and then there's a second path through the next hidden node, and a third, and so on. If we sum those products, that gives us a fairly good idea of the importance of that input variable. Whether the products are positive or negative doesn't matter; it's the magnitude that counts.
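A minimal sketch of that connection-weights calculation for the single-hidden-layer net trained earlier (one simple variant of the approach, reusing the net object from before):

    Wih = net.IW{1};                       % hidden-by-input weight matrix
    Who = net.LW{2,1};                     % output-by-hidden weight matrix (single output)
    % for each input, sum over the hidden nodes the magnitude of the weight products
    connImp = sum(abs(Wih .* Who'), 1);    % 1-by-(number of inputs)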
And finally, we can look at the so-called Shapley variable importance measure. This is an interesting one, because the other variable importance measures are what we call empirical approaches to variable importance, whereas the Shapley variable importance is an axiomatic variable importance measure. In other words, it is constructed so that all the variable contributions add up to the overall R squared value if it's a regression problem.
That's very nice, because we have an overall measure: say the model can explain about 90% of the variance in the data or the response variable. If we split that up, the individual contributions will add up to that overall value. So that's good.
We also have the property that if a variable does not add any marginal value, it gets a credit portion of 0. That's what we want: if it's not contributing, it has to be 0. And if two variables add the same marginal value, they will have identical credit portions. So these are very attractive properties; I think you'll agree that that's a good way to look at the importance of variables.
This is done by actually fitting all possible combinations of models to the data. In other words, if we have, say, three variables, we'll look at the single-variable models, just x1 and the output, just x2 and the output, just x3, then the combinations x1 and x2, x1 and x3, x2 and x3, and then x1, x2, and x3 all together. And we look at the contribution of each variable in each of those models. It's a little bit like sport: say you want to pick a doubles pair in tennis, and there are maybe four players in the team.
You look at all the combinations of doubles players, player A with B, A with C, A with D, and so on, and you track their performance. In the end, the two best doubles players would be the ones that have won most often across their combinations with different partners, so to speak. So that's, very loosely, a description of the Shapley variable importance. But I think the properties are actually quite important; they're key.
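For reference (this general formula isn't shown on the slides), the Shapley value credits variable i with

    \phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N| - |S| - 1)!}{|N|!}\,\bigl[v(S \cup \{i\}) - v(S)\bigr]

where N is the set of all variables and v(S) is the performance, for example the R squared, of a model built on the subset S.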
Right, so let's look very briefly at an autogenous grinding circuit, a real-world case study. This is the circuit; it's a PGM concentrator plant. There are actually three mills in the circuit, a primary mill, a secondary mill, and a tertiary mill. We're just going to look at the primary mill, and the idea is to look at the operational variables that contribute to the power consumption of the mill.
The operational variables would be the total feed, the coarse-to-fine ratio, the fines in the feed, the load of the mill, the level of the sump, the flow conditions and the density of the flow, the makeup water going into the sump, the flow from the sump, the density in the sump, the cyclone pressure (there's a cyclone in the circuit as well), and the screen current, the screen amps, so to speak. I won't go into too much detail in this case; it's just to illustrate the general approach.
Here, I've shown some of the code I wanted to apply to this. In this case, I've set it up a little more customized, and I'm not a good coder by any stretch of the imagination. But with MATLAB, it's very easy to generate these script files that you can run, and you can even have a little menu like this and so on. With this code, I could then run multilayer perceptrons, random forests, and also both of those in a Shapley regression module, where it would run those models, the multilayer perceptrons and the random forests, over all the combinations of the variables.
OK, so here, we see some of the results. This is the variable importance analysis with a multilayer perceptron using the, let's say, connection weights approach to assess the importance of the variables, like I've explained. I've also added random variables to the data here, as you can see, R1 to R12, which are really sort of permutations of the actual variables themselves.
This is similar to the so-called Boruta algorithm that's used with random forests, and it gives us an indication of the significance of the contributions of the variables themselves as well. Otherwise, we only have a relative sense of the contributions of the variables; we also want some absolute reference, if you like, for the actual contributions. And with these box plots with notches, we also get some indication of confidence in the contributions, so we can compare the variables a bit more rigorously than just with nominal values.
OK, so we see that in this case, the fines and the load are certainly the sort of most important variables, and most of the other variables have also contributed, maybe a couple of them that may be very similar to random variables, but nonetheless. So that's the one approach.
OK, so this is still using the multilayer perceptron, but looking at the permutation variable importance analysis. In this case, the fines and the load are still at the top of the list, but the makeup water and perhaps the sump level are also more important than the other variables. I haven't included the random variables in this case, but the results were similar to what we got with the connection weights approach in terms of the comparison with the random variables. On the left, you can see the results generated by MATLAB on the training data, the validation data, the test data, and so on, which, as we've seen previously, are automatically generated when we execute those specific functions or code that I've shown.
Moving on to the random forest models, this is the Gini variable importance measure. Again, you can see the fines and the load, and also the sump level, being flagged as quite important. Also included there is an indication of the correlation structure of the variables in terms of the so-called variance inflation factors: up to 4, roughly speaking, so moderate correlation among the variables but not too serious. These results are therefore not impacted too much by the correlation structure of the variables.
OK, so again with the random forest, this time using the permutation variable importance measure, a model-agnostic approach that we can use with any type of model. And again, the load comes out quite strongly at the top, then the coarse-to-fine ratio and perhaps the fines, with a few other variables less easily discernible, at least in terms of ranking.
But again, I've included the random variables here, and we can see clearly that, according to this approach, all the variables contribute to the model, although, of course, some contributions would be relatively marginal. In the right hand corner, you can see the mean square error as a function of the number of trees in the forest, just showing the first 100. You can see it levels off quite rapidly after 20 or 30 trees; essentially, the model is quite stable.
And just as a matter of interest, I've included an example of one of the trees in the random forest, also generated by MATLAB. Again, one line of code, and you can inspect any of the trees that you like. And then this is the Shapley variable importance measure with the MLP, the multilayer perceptron, and you get fairly similar results with the random forest.
Now, the other approaches are empirical, so they are influenced by the correlation structure of the variables. With the Shapley variable importance, the correlation structure does not play a role in the same way. Because remember, we know that if a variable does not contribute to the model, its credit will be essentially 0, and if two variables contribute the same, like the fines and the sump level here, they will come out the same.
And we can see there a contribution of about 0.15 to the overall R squared value, so we have a firm basis for interpreting that value as well. The Shapley variable importance measure is therefore very attractive, comparatively speaking, when you look at variable importance analysis. So you could ask the question: why don't we just always use the Shapley variable importance measure?
Yes, we certainly could, but there's one problem, the elephant in the room, so to speak, and that is that the Shapley variable importance is computationally very costly. The computation time roughly doubles with every variable that you add. So let's say it takes you five minutes to calculate the Shapley variable importance with 10 variables; with 20 variables, it will take you roughly 1,000 times longer. So the Shapley variable importance measure is, at this stage, limited in practice to relatively small systems, maybe between 10 and 20 variables. But otherwise, it's certainly a variable importance measure that would be very attractive to use.
So it's all good and well to identify the most important variables, but we are definitely also interested in the functional relationship between the input and output variables. For that, we can make use of partial dependence plots of the predictor variables, using the model that we've fitted to the data. Essentially, we are looking at the marginal contribution of each of the variables. So if we look at, say, the effect of the fines on the power, the relationship that we see is the average contribution, if you like, taking the variation in all the other variables into account. And we can generate that directly from the fitted model.
In this case, again, you can see the code on the top left. In MATLAB, generating these plots is essentially a one-liner again: plotPartialDependence(b, i, x), where b is the model name, i is the index variable in the loop used to generate the specific configuration, a six-by-two arrangement of subplots, and x is the actual data that we used with the model.
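A sketch of that loop, with the six-by-two layout as described (the 12 predictors and the names b and X carry over from the earlier model; treat the details as illustrative):

    figure
    for i = 1:12
        subplot(6, 2, i)
        plotPartialDependence(b, i, X)   % partial dependence of the power on predictor i
    end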
OK, so from this we can see, for the load, for example, that there's a sharp increase in the power consumption with an increase in the load, but it levels off when the load is very high. And when the load is quite low, there's also not much of an effect. And so on.
So we can look at each of these variables individually. We can even combine them, look at sort of bivariate effects and so on. I'm not showing these here. But again, this is a very important part of sort of closing out the analysis, if you like, or at least sort of getting a better understanding of the effects of different variables on the process.
And here's another example of MATLAB's functionality in terms of visualizing data. With a simple single-line command, scatter3, plus maybe a colorbar as well, we can generate a very informative picture of the data. In this case, it's the grinding circuit again, with the three most influential variables, the sump level, the load, and the fines, on the axes and the power mapped to the color bar. We can see that the peak power is sitting somewhere in that cluster on the top left.
And we also get a very good idea, and we can rotate this. We can interact with this in MATLAB to get a better understanding of the data. So visualization and modeling, as I've said, really goes hand in hand.
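As a sketch of that command (the variable names here are placeholders for the corresponding columns of the data):

    % sump level, load and fines on the axes; power mapped to the colour scale
    scatter3(sumpLevel, millLoad, fines, 25, power, 'filled')
    xlabel('Sump level'); ylabel('Load'); zlabel('Fines');
    colorbar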
So, just some final comments. The interpretation of machine learning models is part of the emerging field of explainable artificial intelligence, which is driven by developments that demand transparency in certain fields; you simply won't be allowed to use a model if it's not transparent. It is also driven by the ability of these models to facilitate the interpretation of complex data, as we've seen here in a process systems context.
Now, we've also looked at the variable importance measures, and sometimes, if a variable is really dominant and important, it should show up at the top of each of these measures. But often we get different or apparently conflicting results. It's good to bear in mind that these variable importance measures actually measure different things, or they embody different interpretations of what the importance of a variable actually is; it's not exactly defined, necessarily.
And then the correlation of the variables also has an effect. Even if you use the same variable importance measure on a set of data that's correlated versus one that's not correlated, you'll see that for essentially the same model, you'll get different results. So that's also something that needs to be borne in mind. And then finally, I think the take-home message is also that MATLAB provides a very rich environment for the implementation of these models, and it's really easy to use, even if coding is not your forte. Thank you, ladies and gentlemen. I'm open to questions.