Predicting mine water inflow volumes using a decomposition-optimization algorithm-machine learning approach

Change characteristics of water inflow

Figure 4a presents the measured water inflow volumes at the Gaojiabao mine from 2020 to 2023. The series shows a gradually increasing trend punctuated by localized abrupt changes. The minimum inflow, 3506 m³/h, was recorded on June 4, 2020, and the maximum, 7956 m³/h, on September 1, 2022; the average over the period was 5390.83 m³/h. The maximum is roughly 2.3 times the minimum, and pronounced fluctuations occur at specific points in the record.

Box plots were used to analyze the distribution of mine water inflow volumes on a monthly scale. As depicted in Fig. 4b, the within-month variation exceeded 1000 m³/h in March and April of 2020, in March, April, September, and October of 2022, and in January and February of 2023. The largest disparity, 2000 m³/h, was observed in September 2022. This analysis highlights numerous abrupt changes and turning points in mine water inflow over short periods. From March 2022 to March 2023, mine water inflow displayed a trough-like pattern with a skewness of 0.376. This right-skewed, long-tailed distribution suggests a higher likelihood of exceptionally high inflows as mining progresses. The underlying causes are likely related to pervasive joint fractures within the Luohe formation, significant static reserves, and the formation of conductive pathways when mining-induced fractures intersect with pre-existing joints.
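The two summary statistics used above, the within-month range and the skewness, can be sketched as follows. This is a minimal illustration, not the authors' code: the section does not say which skewness estimator was used, so the population (Fisher-Pearson) form is assumed, and the function names and sample values are hypothetical.

```python
from datetime import date

import numpy as np


def skewness(values):
    """Population (Fisher-Pearson) skewness: m3 / m2**1.5 (assumed estimator)."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)   # second central moment (variance)
    m3 = np.mean(centered ** 3)   # third central moment
    return float(m3 / m2 ** 1.5)


def monthly_range(dates, values):
    """Map (year, month) to max minus min inflow observed within that month."""
    groups = {}
    for day, value in zip(dates, values):
        groups.setdefault((day.year, day.month), []).append(value)
    return {month: max(v) - min(v) for month, v in groups.items()}


# Hypothetical readings in m3/h, for illustration only.
sample_dates = [date(2022, 9, 1), date(2022, 9, 15), date(2022, 10, 1)]
sample_values = [7956.0, 5956.0, 6000.0]
print(monthly_range(sample_dates, sample_values))
print(skewness(sample_values))
```

A positive skewness means the upper tail is longer than the lower one, which is the property the text links to a higher chance of exceptionally high inflows.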

Comparison of single prediction models

For the time series prediction, 80% of the historical water inflow data from Fig. 4a was allocated to the training set and the remaining 20% to the validation set⁵⁰. Similar results were observed with the other models, so the LSTM model is used here as a representative example. Tables 2 and 3 report the evaluation metrics for the LSTM model under different numbers of sliding input steps and maximum epochs. As the number of sliding input steps increases from 1 to 10 days, the NSE coefficient first rises and then falls, peaking at 3 days; beyond 10 days, the NSE continues to decline gradually. RMSE and MAE show the opposite trend, decreasing at first and then increasing. Adding more sliding input steps therefore does not by itself improve the predictions: performance is best at a 3-day lag, and with longer inputs the model may learn unnecessary and potentially noisy patterns that distort its output. In other words, looking further back in time does not translate directly into higher forecast accuracy, because an excessively long input window forces the model to process irrelevant information. Selecting a time step that matches the model’s capabilities is thus crucial. Because mine water inflow prediction here concerns short-term fluctuations rather than long-term trends or the cumulative impact of historical data, the choice of time step is particularly important.
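The sliding input window and the 80/20 split described above can be sketched as below. This is an illustration under stated assumptions: the section does not say whether the split is applied before or after windowing, nor that it is chronological, so a chronological split of the windowed samples is assumed. The names `make_windows` and `chronological_split` are hypothetical.

```python
import numpy as np


def make_windows(series, n_steps):
    """Build (samples, n_steps) inputs and next-value targets from a 1-D series."""
    x = np.asarray(series, dtype=float)
    if len(x) <= n_steps:
        raise ValueError("series must be longer than n_steps")
    # Each row holds n_steps consecutive values; the target is the value after them.
    inputs = np.stack([x[i:i + n_steps] for i in range(len(x) - n_steps)])
    targets = x[n_steps:]
    return inputs, targets


def chronological_split(inputs, targets, train_fraction=0.8):
    """Keep time order: the first part trains, the remainder validates."""
    cut = int(len(inputs) * train_fraction)
    return (inputs[:cut], targets[:cut]), (inputs[cut:], targets[cut:])


# A 3-step window, the lag the text reports as best; the series is a placeholder.
inputs, targets = make_windows(np.arange(10.0), n_steps=3)
(train_x, train_y), (val_x, val_y) = chronological_split(inputs, targets)
print(train_x.shape, val_x.shape)
```

With `n_steps=3` each sample sees only the three preceding days, which matches the text's observation that longer windows mostly add irrelevant history.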

Table 2 The Impact of LSTM Time Lag Length on Predicting Water Inrush.

Table 3 Impact of the Maximum Number of Epochs in LSTM on Water Inrush Prediction.

In contrast, increasing the maximum number of epochs has only a minor effect on the prediction outcomes. The NSE coefficient initially increases and then decreases, while RMSE displays the opposite trend, with both reaching their best values at 400 epochs. MAE is best at 200 epochs and MAPE at 300 epochs. Weighing all four metrics, a maximum of 400 training epochs gives the best overall performance. Therefore, to improve the accuracy of mine water inflow predictions, we recommend 3 sliding input steps and a maximum of 400 epochs.

Figure 5 displays the predictive performance of six single models: Autoregressive Integrated Moving Average (ARIMA), Back Propagation (BP), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Gated Recurrent Unit (GRU), and LSTM. The computational efficiency of the models is similar. However, all models other than LSTM generate relatively conservative predictions and do not match its accuracy. Specifically, LSTM achieves an MAE of 186.468, a MAPE of 2.811%, an RMSE of 252.378, and an NSE coefficient of 0.724. These outcomes confirm LSTM’s ability to uncover complex nonlinear relationships within time series data. However, LSTM’s robustness in sequences with frequent abrupt changes can be compromised by vanishing gradients, leading to slight lags as the model struggles to fully capture the peaks of sudden changes, as observed in Fig. 6. This lag reduces the value of the predictions for early warning of inrush events, indicating that accurate prediction of water inflow volumes without comprehensive data preprocessing remains a significant challenge.
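The four metrics quoted above are not defined in this section. Assuming their standard definitions (with NSE as the Nash-Sutcliffe efficiency, one minus the ratio of squared error to the variance of the observations), they can be computed as in the sketch below. The function name `evaluate` and the sample values are hypothetical.

```python
import numpy as np


def evaluate(observed, predicted):
    """Return MAE, MAPE (percent), RMSE and NSE under their standard definitions."""
    obs = np.asarray(observed, dtype=float)
    pred = np.asarray(predicted, dtype=float)
    err = pred - obs
    return {
        "MAE": float(np.mean(np.abs(err))),
        "MAPE": float(np.mean(np.abs(err / obs)) * 100.0),  # assumes no zero observations
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        # NSE = 1 is a perfect fit; NSE = 0 is no better than predicting the mean.
        "NSE": float(1.0 - np.sum(err ** 2) / np.sum((obs - obs.mean()) ** 2)),
    }


# Placeholder values, not data from the study.
print(evaluate([100.0, 200.0, 300.0], [110.0, 190.0, 300.0]))
```

MAE and RMSE carry the units of the inflow (m³/h), MAPE is relative, and NSE is dimensionless, which is why the text reads NSE in the opposite direction from the other three.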

Decomposition-prediction coupled model

Time series decomposition efficiently extracts local characteristics and hidden information from data, separating the fluctuations and variations in mine water inflow from the baseline signal. In this study, we applied three decomposition methods to the mine water inflow time series: EMD, EEMD, and CEEMDAN, as shown in Fig. 7. Because these data-driven methods are adaptive, the number of components differs between them: EMD identified 7 IMFs and one residual component R, EEMD identified 8 IMFs and one residual component R, and CEEMDAN produced 9 IMFs and one residual component R. From IMF1 to the residual component, the amplitude of the oscillations decreases progressively and the wavelength increases, reflecting a transition from high to low frequencies and depicting the periodic variation of each component at its own time scale. Together, the IMFs and the residual component R allow complete reconstruction of the original signal, so no information is lost during decomposition.
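Because the components sum back to the original series, a forecast of the series can be assembled by forecasting each component separately and adding the results. The sketch below shows only that bookkeeping. It is not EMD, EEMD, or CEEMDAN: a moving-average split stands in for the decomposition and a repeat-the-last-value rule stands in for the LSTM, and all names and values are hypothetical.

```python
import numpy as np


def stand_in_decompose(series, windows=(3, 9)):
    """Hypothetical additive split (NOT EMD): detail layers plus a smooth residual.

    Window sizes must be odd so that each smoothed layer keeps the input length.
    """
    remaining = np.asarray(series, dtype=float)
    components = []
    for w in windows:
        padded = np.pad(remaining, w // 2, mode="edge")
        smooth = np.convolve(padded, np.ones(w) / w, mode="valid")
        components.append(remaining - smooth)  # higher-frequency detail
        remaining = smooth
    components.append(remaining)               # low-frequency residual
    return components


def persistence_forecast(component):
    """Placeholder one-step forecaster; a trained LSTM would take its place."""
    return float(component[-1])


def decompose_and_forecast(series, decompose=stand_in_decompose,
                           forecast=persistence_forecast):
    """Forecast each component on its own, then sum the component forecasts."""
    return sum(forecast(c) for c in decompose(series))


series = 5400.0 + 300.0 * np.sin(np.linspace(0.0, 6.0, 40))  # synthetic inflow
parts = stand_in_decompose(series)
print(np.allclose(sum(parts), series))   # components reconstruct the series
print(decompose_and_forecast(series))
```

Swapping in a real decomposition only requires a function that returns a list of equal-length components whose sum is the input.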

Integrating EMD, EEMD, and CEEMDAN with LSTM yielded the EMD-LSTM, EEMD-LSTM, and CEEMDAN-LSTM coupled predictive models. We identified the most accurate of the three by comparing their performance, as detailed in Fig. 8 and Table 4. The EMD-LSTM model exhibited an MAE of 143.306, a MAPE of 2.143%, an RMSE of 199.948, and an NSE coefficient of 0.821. Relative to EMD-LSTM, EEMD-LSTM reduced MAE by 8.185%, MAPE by 6.337%, and RMSE by 12.273%, and improved NSE by 4.580%. The CEEMDAN-LSTM model in turn surpassed EEMD-LSTM, with further reductions of 2.082% in MAE, 2.880% in MAPE, and 10.107% in RMSE, and an increase of 5.903% in NSE. CEEMDAN-LSTM thus delivered the largest performance improvement of the three coupled models, while their computation times were similar. Although the CEEMDAN-LSTM model addressed the lag issue associated with LSTM, it was less effective at predicting abrupt changes in trend, showing a notable discrepancy between predicted and actual peak values.
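The relative changes above can be chained to recover approximate absolute metrics for EEMD-LSTM and CEEMDAN-LSTM. The sketch assumes each percentage is measured against the model named as its baseline in the text. The resulting numbers are implied values, not figures read from Table 4.

```python
# Metrics quoted in the text for EMD-LSTM.
emd_lstm = {"MAE": 143.306, "MAPE": 2.143, "RMSE": 199.948, "NSE": 0.821}

# Relative changes in percent, as quoted; negative means a reduction.
eemd_vs_emd = {"MAE": -8.185, "MAPE": -6.337, "RMSE": -12.273, "NSE": 4.580}
ceemdan_vs_eemd = {"MAE": -2.082, "MAPE": -2.880, "RMSE": -10.107, "NSE": 5.903}


def apply_relative_change(base, percent_change):
    """Scale every metric in `base` by its percentage change."""
    return {name: base[name] * (1.0 + percent_change[name] / 100.0) for name in base}


eemd_lstm = apply_relative_change(emd_lstm, eemd_vs_emd)
ceemdan_lstm = apply_relative_change(eemd_lstm, ceemdan_vs_eemd)

for label, metrics in (("EEMD-LSTM", eemd_lstm), ("CEEMDAN-LSTM", ceemdan_lstm)):
    print(label, {name: round(value, 3) for name, value in metrics.items()})
```

Under this reading the implied NSE for CEEMDAN-LSTM is about 0.909.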

Table 4 Metrics MAE, MAPE, RMSE, and NSE for Decomposition-Prediction and Decomposition-Optimization-Prediction models.

Decomposition-optimization-prediction coupled model

The NGO algorithm offers several advantages in addressing complex optimization problems. Empirical evaluations on benchmark tests and real-world engineering design problems have shown that NGO outperforms established algorithms such as PSO, GA, and GWO in convergence speed and optimization accuracy. These outcomes highlight NGO’s capability in handling complex, high-dimensional optimization tasks. Moreover, the NGO algorithm exhibits robust adaptability and stability across a range of problem environments, affirming its efficiency and reliability as an optimization tool⁴³. Accordingly, as illustrated in Fig. 9 and Table 4, integrating the NGO algorithm into the CEEMDAN-LSTM model significantly enhanced the training performance of the coupled model. The optimized model accurately captured and predicted abrupt changes in water inflow volumes, increasing its overall reliability in real-world scenarios. Relative to CEEMDAN-LSTM, it reduced MAE by 25.039%, MAPE by 24.525%, and RMSE by 22.536%, and improved the NSE coefficient by 5.415%. These metrics substantiate the feasibility and practical value of the CEEMDAN-NGO-LSTM model for enhancing prediction accuracy.
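For orientation, the sketch below follows the two-phase update commonly given for Northern Goshawk Optimization: a prey-identification step toward or away from a randomly chosen population member, then a local chase step whose radius shrinks over the iterations. It is a simplified illustration and not the authors' implementation. The function name, default settings, and toy objective are assumptions. In the coupled model the objective would presumably be a validation error as a function of LSTM hyperparameters, which this section does not list.

```python
import numpy as np


def ngo_minimize(objective, lower, upper, pop_size=20, iterations=100, seed=0):
    """Minimize `objective` inside box bounds with a basic NGO-style search."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    dim = lower.size
    pop = rng.uniform(lower, upper, size=(pop_size, dim))
    fit = np.array([objective(x) for x in pop])

    def try_move(i, candidate):
        """Greedy acceptance: keep the candidate only if it improves member i."""
        candidate = np.clip(candidate, lower, upper)
        value = objective(candidate)
        if value < fit[i]:
            pop[i], fit[i] = candidate, value

    for t in range(1, iterations + 1):
        for i in range(pop_size):
            # Phase 1 (exploration): another member, chosen at random, is the prey.
            k = rng.choice([j for j in range(pop_size) if j != i])
            prey = pop[k].copy()
            r = rng.random(dim)
            step = rng.integers(1, 3, size=dim)        # each entry is 1 or 2
            if fit[k] < fit[i]:
                candidate = pop[i] + r * (prey - step * pop[i])
            else:
                candidate = pop[i] + r * (pop[i] - prey)
            try_move(i, candidate)
            # Phase 2 (exploitation): small move within a shrinking radius.
            radius = 0.02 * (1.0 - t / iterations)
            candidate = pop[i] + radius * (2.0 * rng.random(dim) - 1.0) * pop[i]
            try_move(i, candidate)

    best = int(np.argmin(fit))
    return pop[best].copy(), float(fit[best])


def sphere(x):
    """Toy objective with its minimum at the origin."""
    return float(np.sum(np.asarray(x) ** 2))


best_x, best_f = ngo_minimize(sphere, [-5.0, -5.0], [5.0, 5.0], iterations=50)
print(best_x, best_f)
```

Because every move is accepted only when it lowers the objective, the best value in the population never gets worse from one iteration to the next.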

To evaluate the model’s applicability in various scenarios, the CEEMDAN-NGO-LSTM was used to predict the water inflow of a single working face in a mining area, utilizing data from November 4, 2021, to July 5, 2023. The division of the training and test sets is depicted in Figure S1, with the results presented in Fig. 10 and Table 4. The model achieved an NSE of 0.906, a MAPE of 4.060%, an MAE of 87.760, and an RMSE of 117.410. The accuracy and capability of the CEEMDAN-NGO-LSTM to capture abrupt changes were also validated, confirming its effectiveness in practical applications.

Short-term forecasting

For this analysis, we selected the top-performing model from each of the three categories: the single prediction model (LSTM), the decomposition-prediction model (CEEMDAN-LSTM), and the decomposition-optimization-prediction model (CEEMDAN-NGO-LSTM). We applied linear fitting to the predicted versus actual values during the training and validation phases, as illustrated in Fig. 11.

The linear equations during the training phase are as follows:

$$\text{LSTM training phase: } Y = 1.002X - 10.913 \quad \left( R^2 = 0.990 \right)$$

$$\text{CEEMDAN-LSTM training phase: } Y = 0.990X - 60.955 \quad \left( R^2 = 0.997 \right)$$

$$\text{CEEMDAN-NGO-LSTM training phase: } Y = 0.998X + 22.376 \quad \left( R^2 = 0.999 \right)$$

During the validation phase, the equations shifted:

$$\text{LSTM: } Y = 0.935X + 405.915 \quad \left( R^2 = 0.723 \right)$$

$$\text{CEEMDAN-LSTM: } Y = 1.034X - 292.777 \quad \left( R^2 = 0.909 \right)$$

$$\text{CEEMDAN-NGO-LSTM: } Y = 1.020X - 206.682 \quad \left( R^2 = 0.958 \right)$$

These equations demonstrate a strong correlation between the predicted and actual values, with improvement noted in the order presented. However, the scatter plots for the first two models still show some outliers. One potential explanation is that the abrupt changes are intrinsic rather than acquired characteristics, while another posits that the series exhibits a high degree of autocorrelation. These findings support the notion that optimization algorithms can significantly enhance feature extraction and learning within decomposition models, thereby reducing the impact of autocorrelation in the series.
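Fits of this form can be obtained with an ordinary least-squares line. The sketch assumes X is the actual inflow and Y the predicted inflow, following the phrase "predicted versus actual"; the function name and the synthetic data are hypothetical.

```python
import numpy as np


def fit_line(actual, predicted):
    """Least-squares line predicted = slope * actual + intercept, plus R^2."""
    x = np.asarray(actual, dtype=float)
    y = np.asarray(predicted, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = slope * x + intercept
    ss_res = np.sum((y - fitted) ** 2)      # scatter left around the line
    ss_tot = np.sum((y - y.mean()) ** 2)    # total scatter of the predictions
    return float(slope), float(intercept), float(1.0 - ss_res / ss_tot)


# Synthetic example: predictions that follow the actual values with some noise.
rng = np.random.default_rng(0)
actual = rng.uniform(3500.0, 8000.0, size=200)
predicted = actual + rng.normal(0.0, 150.0, size=200)
print(fit_line(actual, predicted))
```

A slope near 1 with an intercept near 0 indicates little systematic bias, while R² measures how tightly the points cluster around the fitted line.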

We used the LSTM, CEEMDAN-LSTM, and CEEMDAN-NGO-LSTM models to forecast mine water inflow over horizons of 1, 3, 5, and 7 days, as depicted in Fig. 12. Because the forecast windows are short, conventional indices such as the NSE are not suitable for assessing accuracy, so we used RMSE and MAPE instead. All values below are listed in the order LSTM, CEEMDAN-LSTM, CEEMDAN-NGO-LSTM. In single-step forecasts, the MAPE values were 0.0650, 0.0460, and 0.0430, with RMSEs of 412.110, 288.230, and 271.430; the CEEMDAN-NGO-LSTM model yielded predictions closest to the actual values. For three-step forecasts, the MAPE values improved to 0.0290, 0.0160, and 0.0230, with RMSEs of 247.040, 128.170, and 155.437, indicating enhanced accuracy across all models. Still, only the CEEMDAN-NGO-LSTM model accurately captured the actual trend. In the five-step forecasts, the actual inflow rose initially and then fluctuated stably. The LSTM model predicted a gentle ascending trend, culminating near the interval’s maximum value. The CEEMDAN-LSTM model showed an ascending-descending-ascending pattern, but with notable fluctuations and some predictions that did not align with the actual trend. In contrast, the CEEMDAN-NGO-LSTM model closely matched the actual trend, with predictions slightly higher than the actual values, providing a solid foundation for mine water hazard prevention. The MAPE values of the three models were 0.0270, 0.0290, and 0.0160, with RMSEs of 214.060, 198.020, and 122.790. For seven-step forecasts, the MAPE values were 0.0510, 0.0460, and 0.0460, with RMSEs of 410.100, 394.370, and 382.370. Again, the CEEMDAN-NGO-LSTM model outperformed the others. However, its predictions for the last two days deviated significantly from the actual trend, impairing forecast accuracy.

In conclusion, the CEEMDAN-NGO-LSTM model, which uses the Northern Goshawk Optimization algorithm, achieves its best accuracy and trend prediction in five-step forecasts. The model effectively handles the complexities of nonlinear, non-stationary signals such as mine water inflow. Additionally, the predicted water inflow trend for the next five days at the working face is highly consistent with the actual trend (Fig. 13), with RMSE and MAPE values of 236.580 and 9.836%, respectively. These results further substantiate that optimization algorithms can enhance the extraction and learning of features within decomposition models, mitigating the effects of series autocorrelation.
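One way to see why NSE is a poor fit for such short windows: its denominator is the variance of the observations inside the window, which can be very small over a few days, so a modest bias produces an extreme score. The sketch below illustrates this with hypothetical three-day values. This is an interpretation of the text's statement, not an explanation given by the authors.

```python
import numpy as np


def nse(observed, predicted):
    """Nash-Sutcliffe efficiency: 1 - squared error / variance of observations."""
    obs = np.asarray(observed, dtype=float)
    pred = np.asarray(predicted, dtype=float)
    return float(1.0 - np.sum((pred - obs) ** 2) / np.sum((obs - obs.mean()) ** 2))


def mape(observed, predicted):
    """Mean absolute percentage error, returned as a fraction."""
    obs = np.asarray(observed, dtype=float)
    pred = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((pred - obs) / obs)))


def rmse(observed, predicted):
    """Root mean squared error in the units of the data."""
    obs = np.asarray(observed, dtype=float)
    pred = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((pred - obs) ** 2)))


# Hypothetical three-day window (m3/h): nearly flat observations, constant +50 bias.
observed = np.array([5400.0, 5410.0, 5405.0])
predicted = observed + 50.0
print(nse(observed, predicted), mape(observed, predicted), rmse(observed, predicted))
```

Here the forecast is off by less than 1% on every day, yet NSE comes out at -149, whereas RMSE and MAPE still describe the error on a meaningful scale.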