Nomenclature
- AF
activation function
- ATW
adaptable time window
- C-MAPSS
commercial modular aero-propulsion system simulation
- CNN
convolutional neural network
- DAG
directed acyclic graph
- LB
lower boundary
- LR
learning rate
- LSTM
long short-term memory
- PHM
Prognostics and Health Management
- ReLU
rectified linear unit
- RMSE
root mean square error
- RUL
remaining useful life
- SGD
stochastic gradient descent
- TW
time window
- UB
upper boundary
1.0 Introduction
The research area of PHM focuses, amongst other things, on developing accurate models and methods to estimate the RUL of complex systems and components. Accurate estimation gives rise to several potential benefits, including lower maintenance costs and increased availability. However, while academic and industrial examples of successful PHM applications are on the rise, several challenges remain [Reference Scott, Verhagen, Bieber and Marzocca2], including (1) a small number of failure events, which complicates the development of data-driven PHM models in particular; (2) the use of sensors which are not specifically targeted to support PHM; (3) a lack of publicly available data to generate new and/or better-performing models; (4) a comparative lack of model validation on real-life datasets and across multiple components, with many available models developed for specific applications but not tested more broadly; and (5) consistent interpretation, explainability and reliability of PHM models and their output in a safety-oriented industry where mistakes may lead to major accidents.
Historically, one way to address the first three challenges identified above has been to use synthetic datasets. The academic state of the art on PHM has focused to a substantial degree on various prognostic datasets provided by NASA, with the NASA Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) dataset on engine failures being the most notable and widely used dataset for PHM model development and testing. While various categories of approaches have been investigated [Reference Ramasso and Saxena3], including data-driven, hybrid and physical model-based methods [Reference Sikorska, Hodkiewicz and Ma4], in recent years deep learning methods have become increasingly popular. For instance, Babu et al. developed and applied a Convolutional Neural Network (CNN) to the C-MAPSS data [Reference Babu, Zhao, Li, Navathe, Wu, Shehkar, Du, Wang and Xiong5]. The main strength of a CNN is that it extracts features from the data which cannot be defined using standard pre-processing methods, such as statistics. It is especially practical when the data can be spatially ordered (in time or position) [Reference Jiao, Zhao, Lin and Liang6]. As an alternative, Long Short-Term Memory (LSTM) algorithms have been applied in the literature [Reference Shi and Chehade7, Reference Zheng, Ristovski, Farahat and Gupta8]. An LSTM network is able to maintain information from previous input values by combining a short-term and a long-term cell state. This allows the network to use the complete data stream and adjust predictions based on earlier inputs. The main strength of this technique is that later predictions become increasingly accurate over time. Improved RUL prediction can be achieved since aircraft component degradation is a continuous process, and an LSTM network is highly applicable in situations where sequential prediction is required. A current trend is to combine different neural network types to achieve higher prediction accuracy by compensating for the disadvantages of the contributing types [Reference Al-Dulaimi, Zabihi, Asif and Mohammadi9]; such approaches are known as Ensemble Learning Methods (ELM). In the context of the C-MAPSS dataset, the combination of CNNs and LSTMs in ensemble approaches has shown better accuracy than earlier techniques [Reference Li, Li and He1, Reference Zheng, Ristovski, Farahat and Gupta8, Reference Al-Dulaimi, Zabihi, Asif and Mohammadi9], with the work by Li et al. [Reference Li, Li and He1] showing notable prediction performance. A CNN-LSTM is best used in situations where both spatial and sequential input is available, which is the case for the C-MAPSS dataset when certain pre-processing measures are used. However, the research in this area has some limitations, for instance in terms of reproducibility, but also in terms of more detailed consideration of degradation evolution over time [Reference Zio10]. In terms of the latter, the current state of the art lacks methods or adaptations which enable RUL predictions at early stages of the lifecycle. Furthermore, most methods are trained to predict entire degradation trajectories, rather than using information on degradation states along the way.
To address these shortcomings, this paper aims to contribute to the academic state of the art by proposing and testing two adaptations to the CNN-LSTM network presented by Li et al. [Reference Li, Li and He1], which is one of the highest-performing ensemble methods in the state of the art. The two adaptations are:
• Adaptable time window (ATW): with current time window operations, a RUL prediction cannot be made at algorithm initiation. Therefore, an adaptable time window is applied so that a RUL prediction can already be made in the early stages of a unit's life. This method also aims to improve the prediction accuracy at later stages, when more data is available.
• Sub-network learning: Most techniques aim to predict accurate results with the same network and settings for every data point. Others aim to predict the point where degradation starts [Reference Al-Dulaimi, Zabihi, Asif and Mohammadi9, Reference Jayasinghe, Samarasinghe, Yuen, Ni Low and Sam Ge11]. In this paper, a sub-network learning method is described that first identifies the stage of degradation and then uses a network trained specifically for that stage, with the goal of improving the overall prediction results.
The structure of the paper is as follows. First, in the Method section a brief description is given of the ‘baseline’ model used by Li et al. [Reference Li, Li and He1], followed by a description of the two proposed adaptations. Subsequently, the Results section describes the C-MAPSS dataset and the pre-processing approach, while providing the general network hyperparameters, training and testing settings. The main results for the two provided adaptations are presented and discussed in detail. Finally, the Conclusions section wraps up the work while considering its main assumptions and limitations, pointing the way to future research.
2.0 Method
The model provided by Li et al. [Reference Li, Li and He1] is used as a baseline configuration, which is subsequently extended to include the two primary adaptations mentioned earlier, i.e., the adaptable time window and sub-network adaptations. The baseline model comprises an ensemble network in the shape of a directed acyclic graph (DAG) based on the simultaneous implementation of a 2D CNN and an LSTM network. The outputs of both networks are subsequently combined in a second LSTM network. A visual representation of this network can be seen in Fig. 1 and is briefly explained below; for a more detailed description of the network architecture, the reader is referred to Li et al. [Reference Li, Li and He1].
To feed the DAG, feature input data is selected and pre-processed as described in Section 3.1, involving feature selection, normalisation and setting up a piece-wise linear target function. A sliding time window (TW) is passed over each feature (n in total) to extract training data for a certain TW length. This data is then fed into two parallel paths: the LSTM path and the CNN path. The LSTM path involves flattening the input data, since an LSTM network cannot directly use higher-dimensional data. The resulting LSTM network contains u_1 nodes and one layer, where u_1 denotes the number of nodes adopted in this first LSTM network employed in the overall DAG network. The other path is based on a CNN. First, a convolutional operation is applied to the input data, followed by pooling and flattening operations to ensure both paths can be combined afterwards. The output size of the CNN path is the same as that of the LSTM path. To obtain a final outcome, the CNN and LSTM paths are combined: the outputs of both paths are summed element-wise and the result is fed into a second LSTM network, which consists of u_2 nodes. Only the last output of this network is passed to a fully connected layer, which has a single output for each prediction: the final estimated RUL.
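A minimal sketch of such a DAG ensemble is given below to make the data flow concrete. It is an illustration only: the layer sizes, kernel size, pooling and the per-time-step mapping used to match the CNN output to the LSTM output are assumptions made here, not the exact configuration of Li et al. [Reference Li, Li and He1].

```python
# Minimal sketch of the DAG ensemble described above (not the authors' exact code).
# Layer sizes (u1, u2), kernel/pooling choices and the per-time-step CNN mapping
# are assumptions made for illustration.
import torch
import torch.nn as nn

class DAGNetwork(nn.Module):
    def __init__(self, n_features=14, tw=30, u1=32, u2=32):
        super().__init__()
        # LSTM path: processes the (tw, n_features) window as a sequence
        self.lstm1 = nn.LSTM(input_size=n_features, hidden_size=u1, batch_first=True)
        # CNN path: 2D convolution over the (tw, n_features) window
        self.conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=(1, 2))
        # Map the pooled CNN features to the same per-time-step size as the LSTM path
        self.cnn_fc = nn.Linear(8 * (n_features // 2), u1)
        # Second LSTM combines the element-wise sum of both paths
        self.lstm2 = nn.LSTM(input_size=u1, hidden_size=u2, batch_first=True)
        self.out = nn.Linear(u2, 1)  # single RUL estimate

    def forward(self, x):              # x: (batch, tw, n_features)
        lstm_seq, _ = self.lstm1(x)    # (batch, tw, u1)
        c = self.conv(x.unsqueeze(1))  # (batch, 8, tw, n_features)
        c = self.pool(c)               # (batch, 8, tw, n_features // 2)
        c = c.permute(0, 2, 1, 3).flatten(2)  # (batch, tw, 8 * n_features // 2)
        cnn_seq = self.cnn_fc(c)       # (batch, tw, u1)
        merged = lstm_seq + cnn_seq    # element-wise sum of both paths
        out_seq, _ = self.lstm2(merged)
        return self.out(out_seq[:, -1, :]).squeeze(-1)  # last time step -> RUL

rul = DAGNetwork()(torch.randn(4, 30, 14))  # example forward pass
```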
2.1 Adaptable time windows
Most approaches that use a TW for prediction apply it at the cost of having no prediction during the first number of cycles [Reference Li, Li and He1, Reference Zheng, Ristovski, Farahat and Gupta8]. This has no implication when only predicting the last cycle of each unit in the testing set. However, it is undesirable for real-time application, hence an ATW approach is proposed here. In addition to the improved capability of generating early predictions, it is hypothesised that ATW may lead to better performance by including additional samples. In the ATW approach, the network is trained for different TW lengths using all available samples. This is applied in increments of 3 (e.g., 3, 6, 9, etc.), since the convolutional operations are applicable for these increments due to the size of the slices used in this research. Other step increments could be achieved when the size of the slices is altered. The weights of each training run are stored and later used in the application of the model. Longer TW lengths lead to fewer available samples, since more data points are required to create a sample of that size. The output of the model can then be taken at any time cycle by taking the weights for the largest possible TW (e.g., for time cycles 6, 7 and 8 one can use a TW of 6), as sketched below. An example of ATW for different TW lengths leading to different sample sizes can be observed in Fig. 2. In this research, the ATW process is repeated for TW lengths up to 45.
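The selection rule at prediction time can be illustrated with a few lines of code. This is a sketch of the rule described above (the largest trained TW that fits the available history); the set of trained TW lengths follows the increments of 3 up to 45 used in this research, and the function name is illustrative.

```python
# Illustrative sketch of the ATW selection rule: networks are trained for TW
# lengths 3, 6, ..., 45, and at prediction time the largest trained TW that
# fits the observed history is used.
TRAINED_TW_LENGTHS = list(range(3, 46, 3))  # 3, 6, ..., 45

def select_time_window(current_cycle, trained_lengths=TRAINED_TW_LENGTHS):
    """Return the largest trained TW length that does not exceed the number of
    cycles observed so far (e.g. cycles 6, 7 and 8 -> TW of 6)."""
    usable = [tw for tw in trained_lengths if tw <= current_cycle]
    if not usable:
        return None  # fewer than 3 cycles observed: no prediction yet
    return max(usable)

assert select_time_window(7) == 6
assert select_time_window(45) == 45
assert select_time_window(2) is None
```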
2.2 Sub-network learning
The second adaptation to the baseline network is based on sub-networks. First, a prediction is made by the original trained network (the primary network). Afterwards, a second network (a sub-network) can be applied, which is trained only on samples from the corresponding health stage. The RUL prediction of the sub-networks is used for the final prediction, resulting in a double regression model. An overview of the approach is given in Fig. 3 and further discussed below.
In short, Fig. 3 highlights that the pre-processed training data is first used to train a primary network. A primary network may provide less accurate predictions, since all different inputs should result in the correct outcome. To alleviate this, the primary network is used to identify health states and subsequently to group the original pre-processed training data into these health states. This grouped data is then used to train a sub-network for each health state. The final step in the sub-network approach is to combine the individual sub-network predictions into an overall RUL prediction. The trained sub-networks and associated health states are applied to a separate testing dataset to generate RUL trajectories for the C-MAPSS dataset, a process further detailed in Section 3.1.1. Pre-processing of data is applied equally across the training and testing datasets and is further detailed in Section 3.1.2.
A critical aspect of the proposed approach is to identify health states. In the proposed approach, the original data samples are divided into three stages, as shown in Fig. 4: a healthy state, a degradation state and a critical state. These states are constructed by setting two boundaries. The boundary that divides the healthy and the degradation state is named the upper boundary (UB) and the boundary between the degradation state and the critical state is denoted the lower boundary (LB). The locations of these boundaries can be varied for optimal accuracy. To reflect the potential inaccuracy of the primary network predictions, a margin is introduced around the UB and the LB. When a prediction of the primary network falls within a margin, no sub-network is applied and the primary prediction is used as the outcome for that sample. This method aims to reduce the effect of incorrect classification, and different margin sizes can be applied to test for optimal accuracy; a sketch of this assignment rule is given below.
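The assignment rule, including the margin, can be sketched as follows. The boundary and margin values shown are placeholders for illustration; the bounds actually evaluated are reported in Section 3.2.2.

```python
# Sketch of the health-state assignment and margin rule described above.
# Boundary and margin values are placeholders; in the experiments the bounds
# are expressed in RUL cycles (e.g. LB = 30, UB = 100, see Section 3.2.2).
def assign_health_state(primary_rul, ub=100.0, lb=30.0, margin=5.0):
    """Map a primary-network RUL prediction to a health state, or fall back to
    the primary prediction when it lies inside a boundary margin."""
    if abs(primary_rul - ub) <= margin or abs(primary_rul - lb) <= margin:
        return "use_primary"      # too close to a boundary: keep the primary RUL
    if primary_rul > ub:
        return "healthy"
    if primary_rul < lb:
        return "critical"
    return "degradation"
```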
By training on specific states, the accuracy of the network can be increased. However, when the primary network assigns a data sample to the wrong sub-network (incorrect classification), an error is introduced. For the sub-network approach to lead to increased accuracy, this introduced error must be smaller than the accuracy gain provided by the individual sub-networks.
3.0 Results
In the following section, the results of applying the adaptations to the baseline DAG network are provided. First, details are given as to how the DAG network and its adaptations have been implemented and prepared for application towards a case study involving the C-MAPSS FD001 dataset. Subsequently, results are provided and discussed for both adaptations.
3.1 Implementation towards C-MAPSS case study
3.1.1 Dataset characteristics
The baseline network and proposed adaptations are applied to the C-MAPSS dataset, which comprises simulated engine degradation and failure data. The dataset has been created by simulating engine degradation and providing data for prognostic research [Reference Saxena and Goebel12]. The C-MAPSS dataset is divided into four sub-datasets (FD001-004). Each sub-dataset contains three files: a training file, a testing file and a file with the actual RUL value for the final testing data points. The training and testing data both contain a number of engines. Each engine has a different initial health stage and different operating conditions, and thus provides a different number of operational cycles until failure. Each cycle of each engine contains operational settings and 21 sensor data values. The last cycle of each engine in the training set indicates that failure occurs immediately afterwards, whereas the last data point of each engine in the testing set has a RUL equal to the value given in the file with the actual RUL. Each sub-dataset has a different number of operating conditions and failure modes, and the sub-datasets become increasingly complex: FD001 has only one operating condition and one fault mode, whilst FD004 has six operating conditions and two fault modes. The FD001 dataset is applied in this research, since application and optimisation for all datasets would be time-consuming and is not the main topic of this research. An overview of the different (sub-)datasets is provided in Table 1.
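For reference, a minimal loading sketch is given below, assuming the standard whitespace-separated layout of the NASA distribution (unit number, cycle, three operational settings and 21 sensor readings per row); the column names are illustrative.

```python
# Minimal loading sketch for the FD001 files, assuming the standard
# whitespace-separated layout (unit id, cycle, 3 operational settings,
# 21 sensor readings); file names follow the NASA distribution.
import pandas as pd

cols = (["unit", "cycle", "op_setting_1", "op_setting_2", "op_setting_3"]
        + [f"sensor_{i}" for i in range(1, 22)])

train = pd.read_csv("train_FD001.txt", sep=r"\s+", header=None, names=cols)
test = pd.read_csv("test_FD001.txt", sep=r"\s+", header=None, names=cols)
rul_final = pd.read_csv("RUL_FD001.txt", header=None, names=["rul"])

# Training RUL label: cycles remaining until each unit's last recorded cycle
train["rul"] = train.groupby("unit")["cycle"].transform("max") - train["cycle"]
```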
3.1.2 Pre-processing
Before feeding the data to the network, several pre-processing steps are required.
First, the data needs to be arranged per engine and the correct input features need to be selected. The selected sensors for this research are obtained by the procedure described by Zheng et al. and Li et al. [Reference Li, Li and He1, Reference Zheng, Liu, Chen, Gao, Cheng, Yang, Zhang, Li, Huang and Peng13]. Sensors that show no positive or negative trend over time (irregular sensors) are excluded from the model. This results in a total of 14 sensors being used.
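The selection idea can be illustrated as follows, using the dataframe from the loading sketch above. The use of a per-unit Spearman correlation with the cycle number and the threshold value are assumptions made purely for illustration; the actual selection follows the cited procedure.

```python
# Illustrative check for the trend-based selection idea: sensors whose readings
# show no monotonic trend over the life of a unit are dropped. The threshold and
# the use of Spearman correlation are assumptions; the selection actually used
# follows Zheng et al. and Li et al. and results in 14 sensors.
def trending_sensors(train, threshold=0.2):
    sensor_cols = [c for c in train.columns if c.startswith("sensor_")]
    keep = []
    for col in sensor_cols:
        # average per-unit rank correlation between sensor value and cycle number
        corr = train.groupby("unit").apply(
            lambda g: g[col].corr(g["cycle"], method="spearman")).mean()
        if abs(corr) >= threshold:   # sensor shows a consistent trend
            keep.append(col)
    return keep
```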
The next step is to normalise the data, since neural networks require this to work optimally and to avoid exploding gradients [Reference Bengio, Goodfellow and Courville14]. The chosen methods of normalisation are Z-score normalisation and min-max normalisation. Z-score normalisation handles outliers well, in contrast to min-max normalisation; however, Z-score normalisation does not maintain an exact scale, while min-max normalisation does.
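Both options can be expressed as simple per-sensor transformations, fitted on the training data and re-used unchanged on the test data; the min-max target range of [-1, 1] is an assumption made for this sketch.

```python
# Sketch of the two normalisation options mentioned above (per sensor, fitted
# on the training data and applied to both training and test data). Works on
# pandas Series or numpy arrays.
def z_score(train_col, test_col):
    mu, sigma = train_col.mean(), train_col.std()
    return (train_col - mu) / sigma, (test_col - mu) / sigma

def min_max(train_col, test_col, low=-1.0, high=1.0):
    # target range [low, high] is an assumption for illustration
    mn, mx = train_col.min(), train_col.max()
    scale = (high - low) / (mx - mn)
    return low + (train_col - mn) * scale, low + (test_col - mn) * scale
```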
Thirdly, an often-used pre-processing technique is to apply a piece-wise linear RUL target function instead of a linear RUL function. This was introduced by Heimes [Reference Heimes15]. With this technique the maximum target RUL is limited, which allows the model to represent real-life degradation more accurately. It is based on the principle that degradation only becomes visible after a certain amount of time, while during normal operation no sign of failure is apparent.
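As a small worked example of the piece-wise linear target, assuming a cap of 125 cycles (a value commonly used for FD001 in the literature, not necessarily the exact value adopted here):

```python
# Piece-wise linear target: the true RUL is capped at a maximum value, so all
# early, healthy cycles share the same target. The cap of 125 cycles is an
# assumption taken from common practice for FD001.
def piecewise_rul(linear_rul, max_rul=125):
    """Cap the linearly decreasing RUL at max_rul cycles."""
    return min(linear_rul, max_rul)

assert piecewise_rul(200) == 125   # healthy phase: constant target
assert piecewise_rul(60) == 60     # degradation phase: linearly decreasing target
```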
The final pre-processing technique to be applied originates from Zhao et al. [Reference Zhao, Liang, Wang and Lu16] and informs one of the contributions of this work. It involves the application of a TW over the data points. The window is created by sliding over the training data for a certain TW length with a step size of one, the label being the RUL at the last cycle of every window. This allows the use of a CNN and the ability to implement a representation over time. A TW of 30 is applied for the FD001 dataset, as described by Li et al. for the baseline case [Reference Li, Li and He1]. An adaptable TW technique has been described in the Method section, which allows for early prediction and enhanced prediction for longer-lasting components; the results of the ATW application are described in Section 3.2.
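A sketch of the window extraction for a single engine is given below; the array shapes and the random input are illustrative only.

```python
# Sliding-window extraction sketch: for one engine, windows of length `tw`
# slide over the sensor matrix with step size one, and each window is
# labelled with the (capped) RUL of its last cycle.
import numpy as np

def sliding_windows(sensors, rul, tw=30):
    """sensors: (n_cycles, n_features) array; rul: (n_cycles,) array."""
    windows, labels = [], []
    for end in range(tw, len(sensors) + 1):
        windows.append(sensors[end - tw:end])   # one TW-length slice
        labels.append(rul[end - 1])             # label = RUL at last cycle
    return np.stack(windows), np.array(labels)

X, y = sliding_windows(np.random.rand(192, 14), np.arange(191, -1, -1), tw=30)
print(X.shape, y.shape)   # (163, 30, 14) (163,)
```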
An overview of the pre-processing parameters is given in Table 2, with several discussed above. The values of these parameters have been set in accordance with the baseline case [Reference Li, Li and He1], where additional detail on other parameter settings (such as the number of slices and training epochs) can also be found.
3.1.3 Training approach
The baseline DAG network and the introduced adaptations all require training to be able to predict an accurate RUL for a given input. This training is performed over a number of iterations, known as epochs. The required number of epochs depends on the type of dataset, the learning rate and the applied batch size. The selection of these parameters is important for the prevention of model overfitting. Table 2 provides an overview of the pre-processing and training characteristics, which are kept as close as possible to Li et al. [Reference Li, Li and He1] to enable comparison of performance. The training data is first divided into mini-batches of 100 samples, as also applied by Li et al. [Reference Li, Li and He1]. This allows the network to be updated after every 100 data frames. During each epoch, the network is fed with all the different mini-batches, and after each mini-batch the network is updated with the given learning rate. This process prevents overfitting by only providing the network with a small number of frames at a time, and it outperforms other training methods such as updating after the complete set or after every single input (batch gradient descent and stochastic gradient descent, respectively) [Reference Ruder17]. Mini-batch training allows the network to be trained with longer sets of data without overfitting on long-term data relations. However, applying mini-batches is computationally more costly than batch gradient descent, since the weights need to be updated more often.
After each mini-batch, the weights need to be tuned. For this, the type of activation function (AF), the loss function, the learning rate and the optimiser are important. The AF used is the rectified linear unit (ReLU). This is a commonly used activation function, which is computationally faster than the original sigmoid AF and more consistent than leaky ReLU, parametric ReLU or swish, although some of these might outperform ReLU in certain prediction scenarios [Reference Ruder17]. The loss function applied is the smooth L1 loss function (Huber loss). This type of loss is mostly applied for regression problems and is suitable in most cases; it is less prone to exploding gradients and less sensitive to outliers than the mean squared error loss [Reference Jha18]. The learning rate (LR) indicates by how much each weight is updated after each mini-batch. A low LR can locate an optimum more precisely than a higher LR, but the training might converge to a local minimum. A higher LR converges faster towards an optimum but might not be able to find the optimal point. The LR is varied after the suitable normalisation type and optimiser are chosen. After the learning rate is chosen, a suitable stopping point (the number of epochs to train) is also allocated. When trained for too long, the network only recognises the training data itself due to overfitting; when too few training cycles are applied, underfitting occurs. An optimal number of epochs is therefore required.
A suitable optimiser is required for good training. An optimiser determines the direction in which each weight is altered after each update. The most commonly used optimisers are stochastic gradient descent (SGD) with momentum, RMSProp and Adam. SGD with momentum is able to find flatter local minima, but the tuning of its settings is more critical. RMSProp adapts the LR automatically when converging to a minimum. Adam is a combination of both techniques and also possesses an adaptable LR. RMSProp and Adam are therefore selected and both tested for the best accuracy [Reference Ruder17].
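A compact training-loop sketch reflecting these choices (mini-batches of 100, smooth L1/Huber loss, the Adam optimiser) is shown below; the learning rate and number of epochs are placeholders rather than the tuned values of Table 2.

```python
# Training-loop sketch matching the settings discussed above: mini-batches of
# 100 windows, smooth L1 (Huber) loss and the Adam optimiser. The learning rate
# and number of epochs are placeholders, not the tuned values of Table 2.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_network(model, X, y, epochs=50, batch_size=100, lr=1e-3):
    loader = DataLoader(TensorDataset(torch.as_tensor(X, dtype=torch.float32),
                                      torch.as_tensor(y, dtype=torch.float32)),
                        batch_size=batch_size, shuffle=True)
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.SmoothL1Loss()          # Huber loss
    for epoch in range(epochs):
        for xb, yb in loader:            # weights updated after every mini-batch
            optimiser.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            optimiser.step()
    return model
```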
3.1.4 Performance metrics
To enable the evaluation of the results, it is necessary to apply performance metrics. The selected metrics are the root mean square error (RMSE) and a scoring function first introduced at the 2008 International Conference on Prognostics and Health Management (PHM08). These are considered benchmark metrics for the comparison of prognostic methods, though they do not necessarily align with operational considerations for end users.
Performance can be calculated based on the complete test set, as well as based solely on the final value, which is given for the C-MAPSS test dataset. The final value is the RUL prediction of the last available data point for each individual engine unit. Most authors use the latter to evaluate the effectiveness of their algorithm, since this was the main goal of the PHM08 challenge [Reference Li, Li and He1, Reference Al-Dulaimi, Zabihi, Asif and Mohammadi9, Reference Mathew, Toby, Singh, Maheswar Rao and Goutham Kumar19]. Nonetheless, prediction accuracy over the whole dataset can be evaluated to assess overall effectiveness and is more representative of real-time application.
In the following section, both RMSE and scoring function values are provided. In addition, a distinction is made between performance evaluated over the complete set and over the final RUL only. This results in a total of four accuracy metrics, referred to in the remainder as the RMSE, final RMSE, score and final score.
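For clarity, the two underlying metrics can be written out as follows; the asymmetric scoring function is the one defined for the PHM08 data challenge, which penalises late predictions (over-estimated RUL) more heavily than early ones.

```python
# The two metrics used in this work: RMSE and the asymmetric PHM08 scoring
# function, with d = predicted RUL - true RUL (positive d = late prediction).
import numpy as np

def rmse(rul_pred, rul_true):
    return float(np.sqrt(np.mean((np.asarray(rul_pred) - np.asarray(rul_true)) ** 2)))

def phm08_score(rul_pred, rul_true):
    d = np.asarray(rul_pred) - np.asarray(rul_true)
    # early predictions (d < 0) are penalised less than late ones (d >= 0)
    return float(np.sum(np.where(d < 0, np.exp(-d / 13.0) - 1.0,
                                 np.exp(d / 10.0) - 1.0)))
```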
3.2 Results for adaptable time window and sub-network learning adaptations
The results provided are based on 10 iterations, to reduce training and testing variability between network instances. The accuracy of the median iteration, based on the testing set, is presented. The results of the baseline DAG network application are given as part of the discussion of the two adaptations and are referred to as the ‘reference' case. It must be noted that the baseline network results are in line with and close to those provided by Li et al. [Reference Li, Li and He1], though minor deviations exist, as not all information required to fully reproduce the research outcomes was present in Ref. [Reference Li, Li and He1].
3.2.1 Adaptable time window results
First, the results of the ATW extension are discussed. In Table 3 estimation performance results are shown for three different types of ATW.
It should be noted that the given numbers are based on a total of 10 iterations for each TW length tested, to reduce training and testing variability between network instances. The network used to represent the results for each TW length is the one for which the given accuracy metric is the median of all training iterations; this takes into account the inherent randomness of neural network training. In Table 3, three different setups are compared to the baseline DAG results, referred to as ‘reference' in the table. The first (simply labelled ATW) takes the highest possible TW length for each time cycle (e.g., time cycles 12, 13 and 14 use a TW length of 12). This results in an improvement across all accuracy metrics. Two other variants are introduced, based on the observation that a TW of 3 is optimal for the early time cycles: ATW 11 and ATW 14 apply a TW of 3 for time cycles up to 11 and 14, respectively. The accuracy increases even further, since early time cycles are predicted more accurately with a TW length of 3. Overall, the ATW approach provides more accurate RUL estimations for the C-MAPSS FD001 dataset. This is put into further context by considering results from prior research, where it is evident that the ATW variant of the DAG proposed in this work outperforms prior work (see Table 4, where the final entries (ATW DAG) represent the outputs of the current study for different time window lengths, as discussed previously and presented in Table 3).
3.2.2 Sub-network learning results
Next, the results for the sub-network learning adaptation are given and discussed. Table 5 gives the results for sub-network learning using a lower boundary of 30 cycles and an upper boundary of 100 cycles. The accuracy for each health stage is shown for two different models: the baseline (reference) DAG network applied to each health stage, and the sub-network training. The reference case uses the results of the primary network and does not calculate a new value with a sub-network. The accuracies of the sub-network learning are based on the health stage to which each sample is assigned by the primary network.
It can be observed from the results that the network using all the health stage sub-networks hardly improves the results in terms of RMSE, while the other accuracy metrics show inferior results. This is likely because the primary network assigns some samples to the incorrect health stage.
Table 6 gives an overview of four different types of misclassification. A represents samples that should be placed in the healthy stage but are placed in the degradation stage; B and D represent samples that should be placed in the degradation stage but are placed in the healthy or critical stage, respectively; and C represents samples that should be placed in the critical stage but are placed in the degradation stage.
The sub-networks trained on the different health stages show increased prediction accuracy for the degradation stage. However, the other health stages perform less accurately with respect to the reference case. The distribution of errors shows that most of the errors occur in B and D. This would indicate that increasing the UB and LB should reduce these errors. However, when these boundaries are increased, the accuracy metrics remain worse compared to the reference case; for example, a UB of 110 and an LB of 50 provide a testing RMSE and score of 13.45 and 571.56, respectively. Another approach which has been applied is to create a margin around each boundary. The effect of different margins is shown in Table 7; however, an increase in the margin does not improve the prediction accuracy.
4.0 Conclusion and future work
This paper reproduced the CNN-LSTM ensemble method proposed by Li et al. [Reference Li, Li and He1], including its application to the C-MAPSS FD001 dataset. Two adaptations were proposed: ATW and sub-network learning.
• Application of the ATW adaptation has led to an improvement in RUL estimation accuracy for the evaluated dataset, surpassing the performance of other ensemble methods applied in the state of the art. When applying the ATW technique, the prediction accuracy increases to an RMSE of 11.09 and a score of 176.69. This is because later predictions are made more accurately and because early predictions, which are naturally easier to predict, are added. The technique makes a RUL prediction available from the third time cycle of a unit/component onwards, so the method could be applied in real-life settings, with accuracy increasing as more time cycles become available. The results of this model could be used for planning maintenance in advance.
• The sub-network learning approach has been applied but its results do not (yet) show improvement over the reference DAG network. This is likely due to incorrect health state classification. In addition, applying a margin around the boundaries of each sub-network did not improve the accuracy for this approach.
In terms of future work, the ATW approach needs to be applied to real-life datasets rather than solely to synthetic data, where data is available for the complete operational profile (not only a single value per sensor per cycle). For the sub-network learning approach, the data could be re-normalised after the samples are divided by the primary network, so that differences between samples within each health stage become larger, which might yield better results. An optimisation of the different boundaries can also be applied, potentially involving the definition and testing of one or more techniques to detect incorrectly assigned samples and relocate them to the correct health stage sub-network. A combination of these techniques might result in an improvement with respect to the original DAG network. Beyond these particular points, real-life datasets have some of the challenging characteristics set out in the introduction, in particular regarding limited failure event data and limited sensor data ‘fitness for purpose', and may furthermore reveal issues with the generalisability, interpretation and application of approaches such as those proposed in this work. Therefore, the ensemble method extended in this work by the proposed adaptations should be applied to real-life complex datasets, setting the stage for the evaluation of RUL prognostics in real-life applications.