Predictive Maintenance of a Steam Turbine
Boitumelo Mantji, Opti-Num Solutions
Peter Randall, Sasol
The accumulation of salt deposits on turbine blades can significantly impair turbine efficiency, resulting in a process bottleneck that increases operational costs. Hear how Sasol and Opti-Num Solutions engineers collaborated to determine the optimal wash time to prevent bottlenecking, and how they implemented an end-to-end predictive maintenance workflow using MATLAB®. Learn how they estimated the remaining useful life of a turbine and deployed the application so that decision makers can access up-to-date data to define optimal wash schedules.
Published: 9 Jun 2021
Good morning, everyone, and thanks for attending this webinar. Welcome. This will be a webinar on the optimization of steam turbine maintenance and scheduling using a predictive maintenance workflow. I'm one of your hosts and presenters today. I'm Peter Randall, a rotating equipment engineer here at SASOL, and I'll be co-presenting with Boitumelo Mantji from Opti-Num Solutions.
Good morning. My name is Boitumelo Mantji and it is a pleasure to be here. I'm an electrical and biomedical engineer and I work as a solutions engineer at Opti-Num Solutions. As part of my role I get to work quite closely with a lot of customers within the mining and manufacturing space, helping or supporting them in solving their various problems or even introducing smart technologies into their operations. And this presentation will show an instance of just that. Thanks Peter.
Thanks Boitumelo. So moving on to the agenda. We'll cover a bit of background, some content on fouling and why that's important to us, the fundamentals of steam turbine operation, and then we'll run you through the predictive maintenance workflow, looking at the preprocessing, the analysis, the model development, and then the deployment, and we'll conclude with some comments.
So just to start off with a little bit about SASOL. SASOL is a global chemicals and energy company. We have operations in 30 countries, we are one of the largest producers of synthetic fuels in the world, and we also have the world's broadest integrated alcohol and surfactants portfolio. I'm one of their rotating equipment engineers located at our Secunda facility, which is quite a large facility covering a geographical area of roughly 20 square kilometers.
So now I'm going to move on to the equipment that we're dealing with in today's project. It comprises a series of seven compressor-turbine trains, each one with a rating of about 13.5 megawatts. It's a condensing steam turbine running at about 7,000 rpm with a speed controller.
And then additionally we also have a pressure controller, which controls our wheel chamber pressure when it reaches 2,550 kPa. It will basically cut back the steam flow to limit the pressure at that point. So then the problem statement is that these particular steam turbines suffer from fouling, and that fouling builds up over a period of normally about eight months to a year.
And we essentially want to bring more planning and predictability to this, because when we start cutting back on that wheel chamber pressure we start to limit the steam coming through the machine, and that causes a bottleneck on the compressor side of things.
So then I'm going to go through a little bit about fouling. Here you can see a photograph of fouling and how it appears inside the turbine. It basically occurs when the steam goes from a dry condition to a wet condition, so from superheated to saturated conditions. That is the point at which the salts are knocked out of the steam, and they build up on the turbine blades as a result.
So what fouling then does inside the machine is it increases the resistance, which directly affects the wheel chamber pressure. It also changes or modifies the blade profiles, which decreases the isentropic efficiency. And a decrease in isentropic efficiency results in a slightly increased steam flow. I'll explain the theory behind this a bit more in a moment.
Then, with an increase in steam flow, we will potentially also see an increased outlet pressure. And with the increased wheel chamber pressure we have a larger pressure differential across the whole machine, which will potentially result in a larger axial force, and that needs to be absorbed by the thrust bearings.
So going forward to a Mollier diagram, or enthalpy-entropy diagram. Here we can see two cases: the isentropic expansion, shown in black, and the real expansion process, shown in red. What this shows is that in a real steam expansion process, going from some superheated condition into a saturated condition, we get an increase in entropy.
And this is also visually represented here as a slightly increased exhaust entropy, and as you can see, the energy extracted is the difference between those two. So the more you increase your entropy, the less energy is extracted from the steam, and so you would need more steam to do the same amount of work. This is sometimes a little bit easier to see on a temperature-entropy diagram.
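Written in symbols, this is a small sketch using standard steam turbine notation (the state labels are generic, not taken from the presentation): a larger entropy rise means a smaller actual enthalpy drop, and therefore more steam for the same shaft power.

```latex
% Isentropic efficiency: actual enthalpy drop over the ideal (isentropic) drop
\eta_{\mathrm{isen}} = \frac{h_{\mathrm{in}} - h_{\mathrm{out}}}{h_{\mathrm{in}} - h_{\mathrm{out},s}}

% Steam mass flow needed to deliver a fixed shaft power P
\dot{m} = \frac{P}{h_{\mathrm{in}} - h_{\mathrm{out}}}
        = \frac{P}{\eta_{\mathrm{isen}}\,\left(h_{\mathrm{in}} - h_{\mathrm{out},s}\right)}
```

So as fouling lowers the isentropic efficiency, the steam mass flow has to rise to deliver the same power, which is exactly the increased flow Peter describes next.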
I've plotted three different cases. One is the isentropic expansion as you saw before, then the real expansion to the same isobar, which is a constant-pressure line, and then the real expansion onto an increased isobar as a result of the fouling condition. So you can see that we've moved further to the right, to even more entropy, but we've also climbed up onto a different isobar.
And this is just exaggerated in this case for illustration purposes. But effectively that is the transition that you get, and that is why, as a result of the fouling, you have, number one, an increased flow, but also these other changes that we see. So then, going into the project itself, we had a budget of about 100 consulting hours for use with Opti-Num.
And the aim of this project was to analyze the past performance and the impact of our maintenance interventions, here being the turbine washes, to predict when these interventions will be required in the future, and to deliver a model that would be able to predict these interventions and then allow us to deploy it within our environment. So here I'm going to hand over to Boitumelo to take you through the predictive maintenance workflow.
Thanks Peter. So at this stage, we have a clear view of what the problem is and what we're trying to achieve, and so now we're going to move on to the solution we implemented. The solution is comprised of various stages, which are shown in this workflow, and as I go through each stage in more detail, you will uncover the various challenges that we faced in building out this solution.
The first stage entailed accessing and preprocessing the data, so essentially getting the data into a format that's suitable for building a predictive model. Once the data was clean, we performed some analysis on it to extract features descriptive of turbine performance, and this, together with Peter's domain knowledge, enabled us to define the operating bounds of each turbine.
Once we had characterized what healthy turbine behavior looks like we moved on to building our model. And then finally, the deployment stage, which entails integrating the functionality into SASOL systems. So effectively making the technology accessible to the operators and decision makers, which Peter will cover in more detail later on in the presentation.
Now we're going to look more closely at the data access and preprocessing stage. SASOL provided us with process data for seven of their turbines. This was six years of data collected from 2012 to 2018 at 20-minute intervals. This project was actually completed early in 2019, and so this was the most recent data that we had at the time. Initially, four variables of interest were identified, these being variables that could characterize turbine behavior, namely wheel chamber pressure, speed, steam throughput, and vacuum pressure.
But through leveraging Peter's domain knowledge of turbine systems we ended up ignoring the last variable, the vacuum pressure. The reason for that is that vacuum pressure has a strong correlation to steam flow, which is one of the variables we were already looking at. And knowledge of this saved us a lot of time because it meant we didn't need to look at possibly applying dimensionality reduction techniques to the data.
We were also made aware of the operating conditions of the turbine, where wheel chamber pressure has to be in the range 1,600 to 2,550 kilopascals and speed has to be above 6,700 rotations per minute. So if we take a look at the graph on the right, which is a plot of the speed against time for a single turbine over that six-year period, it is the operation in this region that gives us information about the performance or the efficiency of the turbine.
And remember, what we're trying to do is predict the remaining useful life of the turbine; in other words, when we will next need to maintain the turbine. And that deviation from normal operation can only be seen within this region. So for us to estimate the remaining useful life of the turbine, we needed to analyze the life cycle data of each turbine. And we can think of the life cycle as the period of time between two consecutive washes.
But one of the challenges we faced is that there weren't any records or logs of when previous washes occurred over the six years. And so we had to work quite closely with Peter to see if it would be possible to tell when washing occurred based on the data. Because of this, we analyzed the low speed data, which is the data in the green region.
Low speed data in general is representative of ramp-up and ramp-down time, overhauls, and washes. And we chose to look at the low speed data in isolation, so this scatter plot is only the data below the 6,700 rpm threshold. We did this because we knew what we were looking for would be in that region, and by removing the high speed data, it just meant we had less data to process, which saved us a lot of time. At this point, the aim was to look through the low speed data and see if we could detect when washes occurred, so we could analyze the turbine cycles more closely.
After even more conversations with Peter (poor guy), he pointed out the turbine speed behavior when it's undergoing a wash. And in order to show that behavior visually, we plotted the scatter plot as a line graph. This red line represents a time when a wash could have occurred, and I'm simply going to zoom into the circled area to give a better view of the turbine behavior.
So what Peter explained is that when you wash a turbine the machine undergoes a series of ramp ups and downs. You can see here it's turned off, then it's ramped up to run at about 800 rpm, then it's run down again. And this process takes about eight to 12 hours. And this pattern is actually characteristic of washing a turbine and it's something we never would have known without bothering Peter again.
So we found that working really closely with Peter, coupling our modeling and data science experience with Peter's experience as the domain expert, worked really well in terms of informing a lot of the design decisions that were made. From this, we were able to develop an algorithm to detect that pattern throughout the data set, and this enabled us to have a log or record of when past washes occurred.
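The detection logic itself wasn't shown in the webinar, so purely as an illustration, a minimal MATLAB sketch of this kind of pattern detection might look like the following. The variable names, the 800 rpm wash band, and the duration heuristic are assumptions taken from Peter's description, not the deployed code.

```matlab
% Hypothetical sketch: detect candidate wash events from a speed timetable.
% Assumes tt is a timetable with row times tt.Time and a Speed variable
% sampled every 20 minutes.
lowSpeed = tt.Speed < 6700;                    % below normal operating speed
washBand = tt.Speed > 400 & tt.Speed < 1200;   % roughly the ~800 rpm wash speed

% Label consecutive runs of samples with the same low/high state
episode   = cumsum([true; diff(lowSpeed) ~= 0]);
washDates = NaT(0, 1);
for k = unique(episode(lowSpeed))'
    idx = (episode == k) & lowSpeed;
    durHours = hours(tt.Time(find(idx, 1, 'last')) - tt.Time(find(idx, 1, 'first')));
    % A wash shows up as roughly 8 to 12 hours of ramping up and down near 800 rpm
    if durHours >= 6 && durHours <= 16 && any(washBand(idx))
        washDates(end+1, 1) = tt.Time(find(idx, 1, 'first')); %#ok<AGROW>
    end
end
```

A real implementation would also have to distinguish washes from overhauls and ordinary trips, which is exactly where Peter's domain input came in.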
Once we extracted those dates, we discarded the low speed data, and this allowed us to focus on the data that spoke to the operation of the turbine, so the high speed data. At this stage, we have cleaned our data and we know when previous washes occurred. So the next step was to analyze the turbine life cycles. To be able to visualize and understand how the behavior of a turbine changes as it reaches the end of its life, we used the wash logs to effectively segment the data into a series of life cycles for that specific turbine.
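Continuing the hypothetical sketch above, once you have a list of wash dates, the segmentation into life cycles is essentially a binning exercise. This is again an illustration with assumed variable names (ttHigh holding the high speed data with Speed, WheelChamberPressure, and SteamFlow variables), not the project code.

```matlab
% Assign each high speed sample to a life cycle bounded by consecutive washes.
% washDates is assumed sorted; ttHigh is a timetable of the high speed data.
edges   = [ttHigh.Time(1); washDates(:); ttHigh.Time(end) + minutes(20)];
cycleId = discretize(ttHigh.Time, edges);   % life cycle 1, 2, ... for every row
ttHigh.Cycle = cycleId;

% For example, pull out a single life cycle for plotting or feature extraction
cycle3 = ttHigh(ttHigh.Cycle == 3, :);
```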
This graph shows the data collected within a single life cycle of a turbine. Earlier I mentioned that we had three variables of interest for characterizing the turbine behavior, which were speed, wheel chamber pressure, and steam throughput. We looked at various visualizations of the data, so 3D and 4D plots of different sensor combinations and ratios, trying to identify the key features. One of these was looking at speed against the wheel chamber pressure and how this varied over time, which is shown by this color bar.
So the blue region would indicate the time right after a wash, and the yellow would indicate the time just before a wash needs to occur. And what we noticed was that as we approach the next wash, as the turbine gets more fouled, the wheel chamber pressure rises and the speed drops. And this is consistent with the theory. And this kind of plateauing behavior that we see in this region is the result of some of the control measures that SASOL has in place to prevent the pressure from rising past a certain point.
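For reference, this kind of time-colored view is straightforward to reproduce; a minimal sketch continuing with the assumed cycle3 timetable from earlier might be:

```matlab
% Color each operating point by its age within the life cycle
% (blue = just washed, yellow = due for a wash), mirroring the plot described.
ageDays = days(cycle3.Time - cycle3.Time(1));
scatter(cycle3.WheelChamberPressure, cycle3.Speed, 12, ageDays, 'filled');
xlabel('Wheel chamber pressure (kPa)'); ylabel('Speed (rpm)');
c = colorbar; c.Label.String = 'Days since last wash';
```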
Then we looked at the other life cycles of the same turbine. So for example, this turbine was washed five times within that six-year period, and so five life cycles can be observed. We used this view to see if the features mentioned previously, which were identified as being indicative of fouling, were in fact consistent across the other life cycles.
So now we're taking a closer look at one life cycle and bringing in that third variable, the steam throughput. The reason for that is that bringing in steam throughput provided another level of separation of the data points, particularly when looking at the color gradient, which speaks to the period of time between two washes.
In this view, what we see is that the steam throughput is rising. From a practical point of view, the reason this happens is that as salts continue to accumulate on the blades of the turbine, more and more steam is required to do the same amount of work. But eventually a point is reached where the expected throughput can no longer be maintained, and the throughput gets pulled down by the pressure limit, which is limiting the steam flow.
And it is at this point where we know our system is operating inefficiently. From this we could determine what the operating threshold needed to be, which is this red star. So this is where we don't want to be operating, or anywhere too close to this region. And we can see that point corresponds with the wheel chamber pressure threshold, which is chosen by the operator, and the point where we start seeing the decrease in speed.
So once we are within certain bounds of the operating threshold, this is when we typically want to wash the turbines. So we can start operating in the blue region again. So just to recap, at this stage, we have identified our operating threshold. In other words, the point we don't want to reach. But we know that as time progresses, so as the turbine reaches the end of its life cycle, we get closer and closer to that threshold.
Meaning that the distance between the data points and the threshold will decrease over time. So we started by calculating the distance between each data point and the threshold. And that distance between the data points and the threshold is what this graph represents, where the y-axis is the log of the distance, the x-axis is time, and the color bar in this instance represents the wheel chamber pressure.
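The webinar doesn't show the exact distance metric, so as an illustration only, here is one plausible way to compute and plot it in MATLAB. The threshold point, the normalization, and the maxFlow value are assumptions, and the variables continue from the earlier hypothetical sketches.

```matlab
% Hypothetical sketch: distance of each operating point from the operating
% threshold (the red star), computed in normalized feature space.
X  = [cycle3.WheelChamberPressure, cycle3.Speed, cycle3.SteamFlow];
mu = mean(X);  sg = std(X);
Xn  = (X - mu) ./ sg;                          % z-score each variable
thr = ([2550, 6700, maxFlow] - mu) ./ sg;      % assumed threshold point; maxFlow is a placeholder
d   = vecnorm(Xn - thr, 2, 2);                 % Euclidean distance per sample

plot(cycle3.Time, log(d), '.');
xlabel('Time'); ylabel('log(distance to threshold)');
```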
Each graph is a separate life cycle, and we can see that in general each one gradually tends to zero over time, because remember, zero on this plot essentially represents that operating threshold, which we are approaching. The final step required fitting a linear function to the distance data. In this process, we made use of a moving window, where the window size is a certain period of time that gets determined by the operator.
And as the window moves across the data, at each step a linear function is fitted to the data. You then get this collection of linear models. And the reason we use a moving window is to ensure that the model adjusts predictions according to the latest information. And you can see how the model keeps adjusting itself by looking at these decreasing gradients as we move through time. And this is exactly what we expected to see.
We then take an average of the models to get the general trend of the overall data. Then finally we extrapolate that average model to the intercept, in other words to the operating threshold, which then tells us what the remaining useful life of the turbine is. At this stage, we were very happy, because we now had a predictive model that we could verify.
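Again as an illustration of the idea rather than the delivered code, a moving-window fit and extrapolation to the threshold could look roughly like this in MATLAB; the window length, step size, threshold value, and variable names are all assumptions, continuing from the earlier sketches.

```matlab
% Hypothetical sketch: fit a line to log-distance in a sliding window, average
% the fits, and extrapolate to the operating threshold to estimate the RUL.
t   = days(cycle3.Time - cycle3.Time(1));    % elapsed time in days
y   = log(d);                                % log of distance to threshold
win = 30;                                    % assumed window length, days

coeffs = [];
for t0 = 0:7:(max(t) - win)                  % slide the window a week at a time
    in = t >= t0 & t < t0 + win;
    if nnz(in) > 10
        coeffs(end+1, :) = polyfit(t(in), y(in), 1); %#ok<AGROW> % slope, intercept
    end
end

avgModel   = mean(coeffs, 1);                % average slope and intercept
yThreshold = log(0.1);                       % assumed "too close" distance
tCross     = (yThreshold - avgModel(2)) / avgModel(1);
rulDays    = tCross - max(t);                % remaining useful life estimate
```

In the actual project, as Boitumelo notes, the window length was configurable by the operator and the extrapolation was to the intercept with the operating threshold.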
So we verified the performance of the model by applying it to historic, unseen data. These red lines you see here serve as markers of when the model predicted the next wash would need to be. So if we look at the first graph, we can see that the turbine was washed around August or September, but according to the model's prediction, the washing could have happened a few months later.
But from the gap in the data, which is about a month long, we know that the turbine was actually overhauled, which explains why it appears to have been washed too early. We also see that shortly after the overhaul, the efficiency of the turbine dropped quite rapidly, and in this instance Peter and the team established that it was due to poor steam quality at that time.
So really the results of the model are meant to serve as supporting material for the operators and the decision makers. And this is important because the intention of the predictive model was not for it to be used in isolation or independently of the operator, but to serve as an additional tool or assistive technology for defining a more holistic wash schedule.
For the second life cycle, a wash was performed around September, but the wheel chamber pressure was still quite low; we can see that it was still in the blue-green region, so it was washed too early. And we can see that the prediction had it at a few months later. Not only that, we noticed that shortly after the wash in September, which marks the beginning of the third life cycle, the wheel chamber pressure was already starting to rise again; we're already moving into that yellow region.
And this just highlighted that the quality of the wash was not so great. So from an operational point of view, this allows operators to not only predict the future but also have a retrospective view of why certain decisions were made and how they can be avoided in the future.
Once the model was built, the next step was looking at how all of this technology could be made accessible to the relevant people within SASOL, and Peter will give more information around that.
Thanks Boitumelo. So now we've got a model that has been built; it's in our hands, and from the SASOL side of things we now need to integrate it into our workflows and our operational environment. That had to factor in a few different things. One is how we are going to actually interface with this model and app, and then how it integrates into our operational databases and things like that, so the back-end side.
And then I want to just touch on some of my learnings from that, because I think the audience might find some use in that information. On the right-hand side, you can see an example of what the GUI interface looked like. On the left-hand side, we've got a summary section which covers the operation of each machine since the last wash. So you can see these breaks in operation, but it was never washed in that particular period.
And yeah, this is essentially just a high-level look at each machine, and then you can deep dive into a specific unit, so in this case unit 7, and look at what that information looked like for the last wash or the last number of washes in the time frame that you've requested. So you can actually have that retrospective look to see: did we actually improve, or are we moving further away from that high wheel chamber pressure, getting these nice dark blues and things like that?
So that retrospective analysis is done visually, and just as a secondary point to the actual washing procedure itself. Then other interface considerations: the GUI interface is essentially for people like planners and maintenance managers; they're not necessarily looking to dive into the details. They want to know the information off the bat, nice and quickly, in a summarized and concise form.
So here we can see the landing page of the app: you essentially select your historian data, you import it, you run the model, and then it'll provide you with a table. And here you can actually see cases where these machines might want to, or need to, come off very close to one another, and then you can shift their schedules around, maybe bring one a little bit earlier based on production availability and things like that.
So it allows you to plan that, and to just have that visual way, or the tabulated data, to make that decision. Something to note as well on web apps: typically the app will just queue the next action to be done. So if you click two buttons in quick succession, it will pull the data, but then in the next processing window it will run the second action, which may not be when you've actually gotten the data back.
So in cases like this, it's good to use something like a processing dialog to prevent the user from giving any more inputs to the app while it's busy.
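One way to do that in a MATLAB App Designer app, sketched here with hypothetical app component and function names, is a progress dialog that stays open until the long-running call returns:

```matlab
% Inside a (hypothetical) button callback of an App Designer / web app:
function RunModelButtonPushed(app, event)
    dlg = uiprogressdlg(app.UIFigure, 'Title', 'Please wait', ...
        'Message', 'Pulling historian data and running the model...', ...
        'Indeterminate', 'on');               % blocks interaction with the figure
    cleanup = onCleanup(@() close(dlg));      % always close, even on error

    results = runWashModel(app.SelectedTags); % hypothetical model function
    app.UITable.Data = results;               % populate the summary table
end
```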
Then in terms of the deployment on the back end, we have quite an interesting situation where we have two different types of historians, essentially backups of one another. One is a Honeywell historian and the other is an OSIsoft PI historian, and both of those historian instances also have OPC server deployments on top. So we had some luxury in terms of how we could integrate this app into our operational facility. I started off with using the OPC Toolbox, which allowed me a graphical way of exploring the server and seeing how I was going to get the data, and then you can also create functions to go and pull data or utilize any OPC functions. That just helps with creating code if you don't specifically know the architecture.
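For context, a minimal sketch of pulling historical data through the OPC Toolbox historical data access client might look like this. The host, server ID, and tag names are placeholders, and the call signatures are from memory, so check them against the OPC Toolbox documentation (older releases may expect date numbers rather than datetimes).

```matlab
% Hypothetical sketch using the OPC Toolbox HDA client.
hda  = opchda('localhost', 'HistorianServer.ID');   % placeholder host and server ID
connect(hda);
tags = {'WheelChamberPressure.PV', 'Speed.PV', 'SteamFlow.PV'};  % placeholder tags
data = readRaw(hda, tags, datetime(2018,1,1), datetime(2018,7,1));
disconnect(hda);
```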
And then the other method is actually using either the DLLs or the APIs natively within MATLAB, which you can do via NET.addAssembly or the loadlibrary call. This is actually a slightly better method because the APIs carry a little bit less overhead, and they remove some of the inherent limitations that might exist on the OPC deployments.
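A skeleton of the native .NET route looks roughly like this; the assembly path, class names, and method names are placeholders rather than the actual Honeywell or OSIsoft API names.

```matlab
% Hypothetical sketch: load a vendor .NET API directly into MATLAB.
asm = NET.addAssembly('C:\Vendor\HistorianAPI.dll');   % placeholder path
% asm.Classes lists what the assembly exposes, which helps when the
% documentation is thin.

% From here you construct the vendor objects directly, for example
% (class and method names are illustrative only):
% srv  = VendorNamespace.HistorianServer('hostname');
% vals = srv.FetchData('TagName', startTime, endTime);

% For a plain C API, loadlibrary/calllib is the equivalent route:
% loadlibrary('vendorapi.dll', 'vendorapi.h');
% out = calllib('vendorapi', 'GetTagData', ...);
```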
And then lastly, I was struggling to get the Honeywell API natively into MATLAB, but I was able to pull it into Python. And because of some other work that we're doing within Python, we actually decided to just use Python as the basis for that.
Having made that decision, we then needed to ensure that we have the correct MATLAB wrappers to essentially take whatever is in the Python or .NET scripts and make sure that it's compatible with the data coming into MATLAB. And then the last step is, once you've got all that data, you process it, you figure out some of the downsides of the GUI, maybe, and you optimize that to make sure it's quite user-friendly.
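The MATLAB-to-Python hand-off is largely a matter of calling the wrapper module and converting the returned containers; a minimal sketch with a hypothetical module, function, and return shape (a tuple of numeric lists) is:

```matlab
% Hypothetical sketch: call a Python wrapper around the historian API and
% convert the result to MATLAB doubles.
mod = py.importlib.import_module('historian_wrapper');       % placeholder module
out = cell(mod.fetch_tag('WheelChamberPressure.PV', ...
                         '2018-01-01', '2018-07-01'));        % tuple -> cell array

% Assuming the first element is a list/array of floats:
vals = double(py.array.array('d', out{1}));
```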
And in that instance, we found that the information was taking quite a while to process, and as Boitumelo mentioned, there are multiple linear models being fitted because you've got that moving window. Essentially that takes quite a lot of computation, and what we are doing here is making sure that we can leverage the multiple CPUs on the machine by using a parallel for loop.
But that in itself mandated a little bit of a rewrite of the code, just in terms of how the function communicates with the for loop, because you still need to keep track of which machine you're dealing with. That you can do quite easily; you can look in the MATLAB documentation, but you would essentially use something like a data queue to make sure that you can get the values back.
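The restructuring is mostly about making each iteration self-contained; an illustrative shape of it (not the production code, and with hypothetical helper functions) is:

```matlab
% Hypothetical sketch: evaluate the wash model for each machine in parallel.
machines = {'Unit1','Unit2','Unit3','Unit4','Unit5','Unit6','Unit7'};
results  = cell(size(machines));              % sliced output, indexed by machine

parfor m = 1:numel(machines)
    data       = loadHistorianData(machines{m});   % hypothetical data helper
    results{m} = fitWashModel(data);               % hypothetical moving-window fit
end
% results{m} is matched back to machines{m} by the loop index, which is the
% bookkeeping Peter refers to.
```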
The last thing is that when you use something like a parallel for loop, it's going to start a parallel pool with however many workers you've got set up, and that takes some time. So this is not necessarily advised for situations where the overhead is more than the cost of the processing.
Then some of the interesting points on the historians that I'd like to mention: they look very similar in how you would call and get data from the historians. You've got these tag constructs and you've got functions that are built onto the tag. But what's interesting here is that on the Honeywell side of things I've got a tag construct and I then have a server construct, and the fetch-data request is then on the server construct.
Whereas if you look at the PI side, you've got the same tag construct but then the tag construct has something called a summary or a way to fetch data based on some calculation or something like that. You don't have to put in a calculation, you could just ask for the raw data or something like that. And then here's an example of, say, the current value.
So the important thing here is that each point actually sends a request, and not necessarily each server, and that is just the difference in how those connections are handled. On the Honeywell side of things, the connection is handled per server request, and the timestamps and the requested tags are aggregated in that server instance, so you don't need to make sure they're there in each call.
Whereas on the PI side it's actually handled as part of these functions, so that's handled internally within that summary function or method, for example. Something to note is that, because of the way this occurs, when you want to run through multiple tags serially, it's actually slightly faster to use what OSIsoft has built into the API, and that's a point list. That's essentially a similar way of doing it to the server construct, in that you've got a list of points that you want to then request the data for.
So that makes things a lot faster, because you can actually parallelize the call on the server side and you don't have these multiple waits that you get when you serially request the data. So please feel free to visit the support documentation; I only have the OSIsoft support website documented here. If you have a Honeywell server, you should be able to get hold of the documentation for that in PDF form.
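To make the serial-versus-bulk point concrete, here is purely illustrative MATLAB pseudocode; fetchTagSummary, buildPointList, and fetchPointList are hypothetical stand-ins for the vendor calls, not real API names.

```matlab
% Illustrative only: per-tag requests pay a round-trip wait for every tag...
t0 = datetime(2018,1,1);  t1 = datetime(2018,7,1);
vals = cell(size(tags));
for k = 1:numel(tags)
    vals{k} = fetchTagSummary(tags{k}, t0, t1);   % hypothetical per-tag call
end

% ...whereas a point list sends one request for all tags and lets the server
% parallelize the work on its side.
plist = buildPointList(tags);                     % hypothetical
vals  = fetchPointList(plist, t0, t1);            % hypothetical bulk call
```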
And going on to actually getting the data across: so now we've got this functionality in MATLAB and in Python, but we still have an issue, which is that these APIs are actually .NET APIs, so they're using System-type variables. And firstly, because you're using Python for this, you need to get them into Python data types, but preferentially you choose the types that are analogous to the MATLAB data types. So in this case, things like doubles, integers, and singles are very easy to move between Python and MATLAB, so that is fairly trivial.
But moving them from a System data type to a Python data type might not be. There are various ways to skin this proverbial cat. One of them is to use for loops, looping through the array constructs and then converting each individual data point. That is typically quite time consuming, especially when you compare it to the next option, which is actually from a GitHub post. I put it in there for interest and you're welcome to go and look it up. This essentially does a memory move of certain data types that are identical to Python data types, and that makes it very quick to convert them to Python data types. In our case, all of the data was actually doubles, and most process historians store values as either doubles or integers, or maybe Booleans.
I've come across very few that are actually using string data types, so that approach works for most or all of the data that I was dealing with. The last thing that's a bit tricky is the date-time. That was coming through as a System DateTime, which you could store as a string value, but then passing the string and making sure that your date is the same on either end is quite time consuming. Passing strings is obviously not just a single action per data point; you've got however many characters you have in there.
So obviously passing a character string is quite time consuming because you have multiple characters that you need to pass. The other option, though, is that when we make this call to the database, we can actually request the date-time as a UTC date number, which means it's very quick to essentially change that over to a datetime in MATLAB. It's essentially just an addition to a zero date number and getting that back, and that is just float64 or double data points, so it makes that conversion quite easy.
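The exact epoch depends on what the historian call returns, so the following is a sketch of two common cases rather than the project's specific formula; utcSeconds and serialDays are placeholder variables.

```matlab
% If the historian returns seconds since the Unix epoch (1 Jan 1970 UTC):
t = datetime(utcSeconds, 'ConvertFrom', 'posixtime', 'TimeZone', 'UTC');

% If it returns a MATLAB-style serial date number (days since year 0):
t = datetime(serialDays, 'ConvertFrom', 'datenum');

% Either way the raw values are plain doubles, so the transfer from the API
% into MATLAB is a single numeric copy rather than character-by-character parsing.
```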
And I would advise, if you're going to do something like this, to use that method, or at least to convert to a known date number and then use that format going forward. And then lastly, now that we've got this model, we've actually been able to deploy it, and now we can use it. So essentially we managed to create a model within the 100 hours of available time. It provides us with retrospective analysis, it's repeatable and comparable, and it does exactly what we needed it to do.
In the case of efficiencies, we weren't specifically requiring that we look at the efficiency of the turbine as such; we were looking more for the point of operation where we start to bottleneck the facility, because that is a much bigger loss for us, and that was our main goal. And so we've now got this model, we've integrated it, and we've used it.
So lastly, in terms of a workflow like this, there are just some details that are handy to remember. If you can understand your problem from first principles, physics, or some design basis, then I would urge you to do that, because it might highlight points where you can take a physical and analytical approach to either reduce the dimensionality of the data or understand relationships between certain data points, so as not to duplicate a relationship.
And then also understand your deployment requirements. In our case, we wanted it to be a GUI-type deployment so that we could have the retrospective analysis. You could also, if you have a specific goal, write this into a script that runs in the background and then just delivers, say, a report or a single value or whatever the case might be on the back end. But like I mentioned, we wanted the retrospective side of things.
And then also understand and leverage the expertise that's available to you. In our case, I'm not a data scientist by trade, and so the data science portion of it would have taken me much longer to do and implement. My expertise lies in the machines and how I can bring some knowledge into the model based on that. So it's important to collaborate on things like this.
This whole process as we've detailed was a massive collaboration between the two parties, making sure that my goals as a technical person were achieved and also bringing that technical knowledge into the data science and Opti-Num or Boitumelo being able to provide some visual evaluations and things to essentially see is this model doing the right thing. And that was a really good way to work just leveraging those two things.
So I would advise that as well to whoever wants to approach a problem like this. And then for plant personnel who are working and living in these operational environments: just establish whether you have any of these problems and whether you can find a better or more efficient way of dealing with the information that you deal with on a repeated and frequent basis. Question that, and see whether you can improve and expand on what you're doing, moving things away from a manual action into something that is a little bit more automated. That gives you a bit more time to do these interesting examples and workflows, and to optimize your time a bit better.
But in order to do that, you need to get to know your systems, understand what the limitations are in terms of data flow, what data is available, and what you could actually achieve. In doing that you might also identify some future hurdles and projects that need to be done in order to make this a reality.
So yes. Thank you for your time. I hope you guys enjoyed it and I hope you found value in it. We'll now take some time to answer some of your questions. So feel free to contact us if you have any specific questions around the content.
Thank you everybody.