PCA for fault detection in Tennessee Eastman Process

In this notebook, the application of monitoring techniques is demonstrated, using data from the Tennessee Eastman Process benchmark.

import bibmon
import pandas as pd
import matplotlib.pyplot as plt

Process description

The Tennessee Eastman Process (TEP) is a realistic simulation of a process from Eastman Chemical Company, introduced as a benchmark for process control and monitoring applications (Downs and Vogel, 1993). The academic community has broadly embraced the TEP model for modeling studies, and it is now considered the most widely used benchmark for evaluating process monitoring techniques (MELO et al., 2022).

The TEP flowsheet is shown below. The objective of the process is to produce two products (G and H) from four reactants (A, C, D, and E), with the presence of an inert substance (B) and a by-product (F) within the process streams. The reactants A, C, D, and E, along with the inert B, are introduced into a stirred tank reactor, where the desired products G and H, as well as the by-product F, are generated. The resulting product stream then goes through a condenser followed by a vapor-liquid separator. Components that do not condense are recycled using a centrifugal compressor, while the condensate moves through a stripper, with the final products being collected from the bottom stream. Vapor is purged from the vapor-liquid separator to eliminate the inert substance and the by-product.

The original proposed simulation presumes the existence of 41 measured variables and 12 manipulated variables, with sampling intervals set at 3, 6, and 15 minutes. Below is a list of these variables.

Manipulated variables

Variable

Description

XMV(1)

D Feed Flow (stream 2) (Corrected Order)

XMV(2)

E Feed Flow (stream 3) (Corrected Order)

XMV(3)

A Feed Flow (stream 1) (Corrected Order)

XMV(4)

A and C Feed Flow (stream 4)

XMV(5)

Compressor Recycle Valve

XMV(6)

Purge Valve (stream 9)

XMV(7)

Separator Pot Liquid Flow (stream 10)

XMV(8)

Stripper Liquid Product Flow (stream 11)

XMV(9)

Stripper Steam Valve

XMV(10)

Reactor Cooling Water Flow

XMV(11)

Condenser Cooling Water Flow

XMV(12)

Agitator Speed

Continuous process measurement

Variable

Description

unit

XMEAS(1)

A Feed (stream 1)

kscmh

XMEAS(2)

D Feed (stream 2)

kg/hr

XMEAS(3)

E Feed (stream 3)

kg/hr

XMEAS(4)

A and C Feed (stream 4)

kscmh

XMEAS(5)

Recycle Flow (stream 8)

kscmh

XMEAS(6)

Reactor Feed Rate (stream 6)

kscmh

XMEAS(7)

Reactor Pressure

kPa gauge

XMEAS(8)

Reactor Level

%

XMEAS(9)

Reactor Temperature

Deg C

XMEAS(10)

Purge Rate (stream 9)

kscmh

XMEAS(11)

Product Sep Temp

Deg C

XMEAS(12)

Product Sep Level

%

XMEAS(13)

Prod Sep Pressure

kPa gauge

XMEAS(14)

Prod Sep Underflow (stream 10)

m3/hr

XMEAS(15)

Stripper Level

%

XMEAS(16)

Stripper Pressure

kPa gauge

XMEAS(17)

Stripper Underflow (stream 11)

m3/hr

XMEAS(18)

Stripper Temperature

Deg C

XMEAS(19)

Stripper Steam Flow

kg/hr

XMEAS(20)

Compressor Work

kW

XMEAS(21)

Reactor Cooling Water Outlet Temp

Deg C

XMEAS(22)

Separator Cooling Water Outlet Temp

Deg C

Sampled process measurement

  • Reactor Feed Analysis (Stream 6)

    • Sampling Frequency = 0.1 hr

    • Dead Time = 0.1 hr

    • Mole %

Variable

Description

XMEAS(23)

Component A

XMEAS(24)

Component B

XMEAS(25)

Component C

XMEAS(26)

Component D

XMEAS(27)

Component E

XMEAS(28)

Component F

  • Purge Gas Analysis (Stream 9)

    • Sampling Frequency = 0.1 hr

    • Dead Time = 0.1 hr

    • Mole %

Variable

Description

XMEAS(29)

Component A

XMEAS(30)

Component B

XMEAS(31)

Component C

XMEAS(32)

Component D

XMEAS(33)

Component E

XMEAS(34)

Component F

XMEAS(35)

Component G

XMEAS(36)

Component H

  • Product Analysis (Stream 11)

    • Sampling Frequency = 0.25 hr

    • Dead Time = 0.25 hr

    • Mole %

Variable

Description

XMEAS(37)

Component D

XMEAS(38)

Component E

XMEAS(39)

Component F

XMEAS(40)

Component G

XMEAS(41)

Component H

Process faults

The simulation provides 21 pre-programmed disturbances, which include the introduction of step disturbances, modification of measurement variability, occurrence of valve sticking, drift in reaction kinetics and the presence of unknown disturbances. Below is a list of these disturbances.

Disturbance

Description

IDV(1)

A/C Feed Ratio, B Composition Constant (Stream 4) Step

IDV(2)

B Composition, A/C Ratio Constant (Stream 4) Step

IDV(3)

D Feed Temperature (Stream 2) Step

IDV(4)

Reactor Cooling Water Inlet Temperature Step

IDV(5)

Condenser Cooling Water Inlet Temperature Step

IDV(6)

A Feed Loss (Stream 1) Step

IDV(7)

C Header Pressure Loss - Reduced Availability (Stream 4) Step

IDV(8)

A, B, C Feed Composition (Stream 4) Random Variation

IDV(9)

D Feed Temperature (Stream 2) Random Variation

IDV(10)

C Feed Temperature (Stream 4) Random Variation

IDV(11)

Reactor Cooling Water Inlet Temperature Random Variation

IDV(12)

Condenser Cooling Water Inlet Temperature Random Variation

IDV(13)

Reaction Kinetics Slow Drift

IDV(14)

Reactor Cooling Water Valve Sticking

IDV(15)

Condenser Cooling Water Valve Sticking

IDV(16)

Unknown

IDV(17)

Unknown

IDV(18)

Unknown

IDV(19)

Unknown

IDV(20)

Unknown

Typically, when the TEP benchmark is used for process monitoring applications, the goal is to develop a method for real-time detection and diagnosis of these disturbances.

For more details on the TEP process, please refer to the following article: https://doi.org/10.1016/j.compchemeng.2022.107964.

Importing data

The TEP simulation data originally provided by Chiang et al. (2000) can be imported into the library using the load_tennessee_eastman function. This function requires an argument to specify the nature of data: use 0 for normal operation data and values from 1 to 20 for various process disturbances. Below, we demonstrate how to import normal operation data for training purposes and data corresponding to disturbance event IDV(1) for testing. In this dataset, disturbances are introduced after 8 hours of operation. By convention, normal datasets start on 2020-01-01 00:00:00 and faulty datasets start on 2020-02-01 00:00:00.

df_train, df_test = bibmon.load_tennessee_eastman(train_id = 0, test_id = 1)

fault_start = '2020-02-01 08:00:00'

Let’s visualize the datasets to gain insights into the types of behaviors we are modeling (normal data) and attempting to detect (faulty data):

def plot_tep_fault(df_train, df_test):
            
    fig, ax = plt.subplots(22, 3, figsize=(15, 30))

    for i in range(22):
        ax[i,0].plot(df_train.iloc[:,i].values,c='k',linewidth=0.8)
        ax[i,0].plot(df_test.iloc[:,i].values,c='r',linewidth=0.8)
        ax[i,0].axvline(160, ls='--')
        ax[i,0].set_yticks([])
        ax[i,0].set_ylabel(i+1,rotation=0,fontsize=14)
        for j in range(3):
            ax[i,j].spines["top"].set_visible(False)
            ax[i,j].spines["right"].set_visible(False)
            ax[i,j].spines["left"].set_visible(False)
        if i<(22-1):
            ax[i,0].set_xticks([])
            ax[i,0].spines["bottom"].set_visible(False)

    for i in range(19):
        ax[i,1].plot(df_train.iloc[:,i+22].values,c='k',linewidth=0.8)
        ax[i,1].plot(df_test.iloc[:,i+22].values,c='r',linewidth=0.8)
        ax[i,1].axvline(160, ls='--')
        ax[i,1].set_yticks([])
        ax[i,1].set_ylabel(i+23,rotation=0,fontsize=14)
        if i!=(19-1):
            ax[i,1].set_xticks([])
            ax[i,1].spines["bottom"].set_visible(False)

    for i in range(11):
        ax[i,2].plot(df_train.iloc[:,i+41].values,c='k',linewidth=0.8)
        ax[i,2].plot(df_test.iloc[:,i+41].values,c='r',linewidth=0.8)
        ax[i,2].axvline(160, ls='--')
        ax[i,2].set_yticks([])
        ax[i,2].set_ylabel(i+42,rotation=0,fontsize=14)
        if i!=(11-1):
            ax[i,2].set_xticks([])
            ax[i,2].spines["bottom"].set_visible(False)

    for i in range(19,22):
        ax[i,1].axis('off')

    for i in range(11,22):
        ax[i,2].axis('off')

    ax[0,0].set_title('Continuous process variables',fontsize=14)
    ax[0,1].set_title('Sampled composition variables',fontsize=14)
    ax[0,2].set_title('Continuous manipulated variables',fontsize=14);

plot_tep_fault(df_train, df_test)
_images/b54eaba8c3d59af812bf57d29f310678bb98e89910711228060472efbe4278ed.png

In the plot above, normal data is represented by black lines, while faulty data is indicated by red lines. The onset of the event, marked by a step change in the A/C feed ratio, is shown as a vertical blue dashed line. Observe that following the onset of the event, the variable values deviate from their normal ranges, exhibit an oscillatory pattern due to the automatic process control loop, and eventually stabilize. For some variables, stabilization occurs within the same range, whereas for others, it occurs in different ranges.

Modeling and process monitoring

The PCA model

The model discussed in this tutorial is Principal Component Analysis (PCA), a widely used technique in fault detection and diagnosis for process systems engineering applications. PCA generates a set of orthogonal linear combinations of the original variables, aiming to select a subset of these combinations that effectively captures and summarizes the data’s variability. It achieves this by ordering the principal components based on the amount of variability they explain. The first principal component represents the direction with the greatest data variation, while the second principal component is orthogonal to the first, and this pattern continues for subsequent components. A subset of these principal components is then chosen to adequately summarize the data’s variability, a process known as dimensionality reduction.

The following figure illustrates this procedure using a bidimensional example dataset. In the first plot, the red arrows represent the two principal components, with the larger arrow indicating the first principal component, which is associated with the direction of higher data variability. The second plot demonstrates the linear transformation applied by PCA, which is simply a rotation to position the first principal component in the direction of the x-axis. The third plot shows the dimensionality reduction step, where the bidimensional data is transformed into a unidimensional representation, eliminating the second dimension. Finally, the last plot depicts the reconstruction process, in which the data is projected back into the original bidimensional space, but now with variation only along the direction of the first principal component.

PCA Visualization

When PCA is applied to fault detection, the principal components are selected using the training data. These principal components are expected to explain the variability of the test data as well. If the test data’s variability is not adequately explained by the principal components identified during training, the process is considered to be faulty and alarms are trigged.

For more details on the PCA model, please refer to the following article: https://doi.org/10.3390/pr12020251.

Training the model

The following command instantiates a BibMon’s PCA model:

model = bibmon.PCA()

To train the model, we use the fit method:

model.fit(df_train)

The fault detection index used by BibMon is the Squared Prediction Error (SPE). This index is based on the difference between the measured data, \(\mathbf{X}\), and the data reconstructed by the model, \(\mathbf{\hat{X}}\):

\[ \text{SPE}_i = \sum_{j} (X_{ij} - \hat{X}_{ij})^2 \]

The higher the SPE values, the greater the likelihood that the captured PCA model does not adequately explain the variability in the data, indicating that the process is not in its normal state.

Let’s create a control chart of SPE values for the training data:

model.plot_SPE(train_or_test = 'train')
_images/9fbb056e859ddac37bfc4470d05df64247a42d06f6d01680b4341fc19a3e70cb.png

The confidence limit is established to ensure that 99% of the training SPE points fall below it. If new data points exceed this limit, it indicates a deviation from normal operation, triggering alarms.

Overfitting to the training data often leads to an underestimation of the confidence limit. To mitigate this, we can reserve a portion of the data at the end of the time series exclusively for setting the limit, rather than using it for training. This practice generally raises the limit and reduces false alarm rates. To implement this, we set the redefine_limit parameter to True, and the fraction of data reserved is specified by the frac_val parameter:

model.fit(df_train, redefine_limit = True, frac_val = 0.1)

Another useful visualization is the explained variance identified by the model:

model.plot_cumulative_variance()
plt.gca().axvline(model.n,ls='--');
_images/dec3d40df9d3613866916e767292826f8ca07cb9753c1f797cc4e8307583fb4d.png

In the above plot, bars represent the data variance captured by each component, while the continuous line shows the cumulative variance. The vertical dashed line indicates the number of principal components retained by the model. By default, this number is set to ensure that these components together account for 90% of the variability in the data.

Applying the model to test data

To apply the model to a test dataset, we use the function predict:

model.predict(df_test)

Let’s plot the SPE chart for the test:

model.plot_SPE(train_or_test = 'test')
plt.gca().axvline(fault_start);
plt.title(f'FDR: {model.alarmOutlier[fault_start:].mean()}');
_images/58a10847a1f1cf8d1f42cd7569aa0d2f1cb3302deeb3bc168e38b044e269f006.png

Observe how the SPE points rapidly increase as the faulty event begins, surpassing the confidence limit and triggering alarms, which are represented by the orange points. The fault detection rate (FDR) exceeds 99%, demonstrating the high capability of PCA in detecting this type of fault.

Now let’s plot the SPE contribution heatmap for the test:

model.SPE_contrib_test
XMEAS(1) XMEAS(2) XMEAS(3) XMEAS(4) XMEAS(5) XMEAS(6) XMEAS(7) XMEAS(8) XMEAS(9) XMEAS(10) ... XMV(2) XMV(3) XMV(4) XMV(5) XMV(6) XMV(7) XMV(8) XMV(9) XMV(10) XMV(11)
2020-02-01 00:00:00 0.001014 0.064975 0.037878 0.246770 0.181036 0.004409 0.033778 0.009796 0.304754 0.004927 ... 0.059832 0.002372 0.103276 0.008692 0.000253 0.099966 0.037930 0.011821 0.309943 0.002010
2020-02-01 00:03:00 0.002600 0.003200 0.017056 0.005200 0.075389 0.537488 0.047553 0.004073 0.161714 0.000592 ... 0.107089 0.000121 0.002754 0.034813 0.002286 0.010254 0.036163 0.015417 0.051729 0.000530
2020-02-01 00:06:00 0.003463 0.335738 0.087237 0.448906 0.082531 0.034783 0.034252 0.036815 0.604998 0.028912 ... 0.095856 0.003716 0.084730 0.006780 0.002523 0.019461 0.011476 0.042128 0.103812 0.010718
2020-02-01 00:09:00 0.005135 0.050067 0.336866 0.083280 0.025785 0.022209 0.000098 0.043427 0.198501 0.016713 ... 0.466475 0.003634 0.702181 0.000092 0.010344 0.024591 0.007360 0.013008 0.068126 0.010793
2020-02-01 00:12:00 0.106140 0.387329 0.008186 2.278590 0.096295 0.041390 0.088964 0.080601 0.366359 0.069751 ... 0.101740 0.196166 1.729925 0.007866 0.020661 0.014950 0.009739 0.007056 0.216235 0.010852
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2020-02-02 23:45:00 30.408410 0.031451 0.503471 22.163046 1.072180 0.001327 0.122422 0.840029 0.108932 2.289084 ... 0.184355 29.940564 9.990535 0.148251 1.219644 0.031778 0.040662 2.282846 0.068765 0.359752
2020-02-02 23:48:00 31.467704 0.272619 0.911882 11.481591 0.603022 0.027659 0.382506 0.029068 0.024013 2.581854 ... 0.276890 31.187571 3.705808 0.848143 0.961667 0.016849 0.144884 2.416294 1.281990 0.273778
2020-02-02 23:51:00 30.315880 0.455588 0.153864 15.412005 0.317337 0.211359 0.150044 0.168905 0.011646 2.625218 ... 0.025977 30.445067 6.954743 3.053719 0.848270 0.130417 0.349548 2.305401 0.106318 0.106629
2020-02-02 23:54:00 31.400179 0.331269 0.090643 7.593507 1.265269 0.034129 0.064121 0.157971 0.144850 0.235651 ... 0.150828 31.839818 3.874597 0.507650 0.034667 0.384894 0.086569 1.222012 0.090505 0.322083
2020-02-02 23:57:00 33.666126 0.362147 0.431882 10.233358 1.200758 0.146887 0.000917 0.036429 0.143768 0.039342 ... 0.015368 32.429301 4.799023 0.922214 0.262240 0.009751 0.056689 0.646388 0.132035 0.550161

960 rows × 52 columns

model.plot_SPE_contributions()
plt.gca().axhline(y=model.X_test.index.get_loc(fault_start),ls='--',c='r'); 
_images/8c5a401d80e1d60aea7849ec1bbbc785d62cc6f71c268419d501a549ce173044.png

This map illustrates the contribution of each variable, at every time step, to the squared prediction error. It provides insights into the relative connection of each variable to the event, whether as a causal factor or a consequential outcome. Compare this plot with the time series visualization we examined earlier!

Analyzing the effect of preprocessing techniques

In BibMon, applying preprocessing techniques is straightforward: simply define lists with the names of the functions in the desired order of application, and provide them to the fit and predict methods.

In the TEP process, specifically in the IDV-11 case, the fault detection rate improves with the application of noise filtering techniques and the addition of dynamics through lags.

df_train, df_test = bibmon.load_tennessee_eastman(train_id = 0, test_id = 11)

Using the default pre-processing pipeline

The default preprocessing pipeline is as follows:

  • training: ['remove_empty_variables', 'ffill_nan', 'remove_frozen_variables', 'normalize'];

  • test: ['replace_nan_with_values', 'normalize'].

model = bibmon.PCA()

model.fit(df_train, redefine_limit = True, frac_val = 0.1)

model.predict(df_test)

model.plot_SPE(train_or_test = 'test')
plt.gca().axvline(fault_start);
plt.title(f'FDR: {model.alarmOutlier[fault_start:].mean()}');
_images/501db5a58fcd711df2416381b61702dee37e0d03da45f2c13b48bcae3c913699.png

Applying lag

model = bibmon.PCA()

preproc = ['apply_lag', 'normalize']

model.fit(df_train, f_pp = preproc, f_pp_test = preproc, redefine_limit = True, frac_val = 0.1)

model.predict(df_test)

model.plot_SPE(train_or_test = 'test')
plt.gca().axvline(fault_start);
plt.title(f'FDR: {model.alarmOutlier[fault_start:].mean()}');
_images/bfb5a28d976f6c2ebbc10ec083078be0303c58e7924c38b5a77fcd6d8879d41a.png

Notice how the normalize method was added to the pipeline, as it is essential prior to applying PCA modeling.

Applying a moving average filter

model = bibmon.PCA()

preproc = ['moving_average_filter', 'normalize']

model.fit(df_train, f_pp = preproc, f_pp_test = preproc, redefine_limit = True, frac_val = 0.1)

model.predict(df_test)

model.plot_SPE(train_or_test = 'test')
plt.gca().axvline(fault_start);
plt.title(f'FDR: {model.alarmOutlier[fault_start:].mean()}');
_images/39f309b4a995cbffaad588ca1c3eeccf904325374e1a78950ca31a83110e85af.png

Applying filter and lag, modifying parameters using the args_preproc argument

model = bibmon.PCA()

preproc = ['apply_lag', 'moving_average_filter', 'normalize']
args_preproc = {'moving_average_filter__WS': 5, 'apply_lag__lag': 5}

model.fit(df_train, f_pp = preproc, a_pp = args_preproc,
          f_pp_test = preproc, a_pp_test = args_preproc,
          redefine_limit = True, frac_val = 0.1)

model.predict(df_test)

model.plot_SPE(train_or_test = 'test')
plt.gca().axvline(fault_start);
plt.title(f'FDR: {model.alarmOutlier[fault_start:].mean()}');
_images/924d5bfafbe171241a1579af237cd50cebe1b5a0ad070987b050d5ffe6457d12.png

By employing various preprocessing techniques, we can boost fault detection rates from a baseline of 50.5% to 93%.

Final considerations

This notebook provides an example of an offline monitoring study utilizing the PCA model.

For a demonstration of BibMon’s high-level functions, which allow for the rapid execution of a complete process monitoring pipeline for analysis and methodology comparison, please refer to the tutorial_real_process.ipynb notebook.