Enhancing groundwater level prediction with a hybrid deep learning model in Jinan City, China

admin

6 months ago

Enhancing groundwater level prediction with a hybrid deep learning model in Jinan City, China

The complete workflow of the methodology in this study was divided into three critical stages: data preprocessing, model construction, and model optimization and evaluation (Fig. 1). During the data preprocessing stage, a variety of techniques, including data cleaning and feature engineering, were employed to reconstruct raw data to meet the requirements for model training. The data were also reasonably partitioned to ensure their quality and usability. The modeling phase developed an architecture specifically designed to capture both the temporal dependencies and spatial correlations inherent in GWL dynamics, establishing a robust foundation for prediction tasks. During the model optimization and validation stage, hyperparameters were iteratively optimized using a loss function and an optimizer until convergence, resulting in the optimal model performance. A comprehensive performance validation was conducted through a multi-indicator, multi-dimensional evaluation system. The experimental design included performance comparison, ablation experiments, and model interpretability analysis to ensure the accuracy and reliability of the model.

Table of Contents

Toggle

Study area and datasets

Study area

This study focused on the administrative region of Jinan City, which is located in the eastern part of North China Plain, in the middle and western part of Shandong Province, East China (Fig. 2). The geographical location ranges from 36°02’ N to 37°54’ N and from 116°21’ E to 117°59’ E, with a total area of 8,154 km².The study area featured a warm-temperate continental monsoon climate situated in the mid-latitude inland area. This climatic characteristic results in significant seasonal differences in precipitation, with the majority of rainfall concentrated in the summer months (June to August), accounting for about 70% of the annual precipitation. Additionally, there is considerable inter-annual variability in precipitation. It is worth noting that atmospheric precipitation serves as the dominant recharge source for the karst aquifer system.

Jinan is situated in a transition zone between the low-mountain hills of central-southern Shandong and the alluvial plain of northwestern Shandong. The topography is higher in the south and lower in the north. The southern consists of Ordovician limestone karst aquifers, while the northern is characterized by igneous aquitards. This geological structure creates a natural hydrogeological unit with “southern recharge-northern barrier” that drives regional groundwater flow along a predominant south-to-north gradient⁴⁴. The karst water in the central piedmont plain serves as the main water supply for Jinan, with an average GWL of 45.68 m. In contrast, the southern hilly areas are primarily composed of fracture water, with an average GWL of 227.36 m. Both karst and fracture water levels exhibit significant fluctuations.

Datasets

In this study, we utilized GWL observation data as the target variable and integrated multiple input variables to construct the dataset for predictive modeling. We collected static attribute data from 27 monitoring wells located within the study area (Fig. 2c), including geographical coordinates (longitude and latitude), wellhead elevation, and aquifer type. Concurrently, we obtained GWL time series recorded at 7-day intervals from January 2018 to October 2023, providing absolute elevation values (meters) relative to the national vertical datum.

To capture the complex dynamics of groundwater fluctuations, this study comprehensively considered the lag effect and driving mechanism, and incorporated three key driving factors: (1) Historical GWL data (τ time-lagged terms): characterizing temporal autocorrelation in groundwater systems; (2) Meteorological variables (precipitation, temperature, evapotranspiration): representing external climatic forcing; (3) Spatial factors (water levels of adjacent monitoring wells): quantifying the spatial correlation. Among them, the lag effect was captured through historical GWL time-lags, while the external driving mechanisms were represented by meteorological and spatial factors. These meteorological data were sourced from the National Earth System Science Data Center ( which provides a 1 km resolution monthly dataset for the Chinese region (1901–2023), with each product containing 12 monthly bands. Data from 2018 to 2023 were selected to ensure temporal consistency with the GWL observation data. Through data preprocessing methods, the gridded meteorological data were precisely matched with the locations of each monitoring well, constructing a spatiotemporally consistent multivariate analysis dataset.

Data preprocessing

Data preprocessing constitutes a critical and indispensable step in the construction of deep learning models, playing a decisive role in enhancing model performance³⁴. Our systematic preprocessing pipeline comprised four key phases: (1) data cleaning; (2) multi-source data fusion; (3) feature engineering; (4) dataset partitioning. The workflow of data processing is illustrated in Fig. 3.

Data cleaning

Given the susceptibility of groundwater level (GWL) observations to sensor errors and environmental disturbances, this study implemented rigorous noise reduction protocols to enhance signal-to-noise ratios. To address heterogeneity of multi-source data, we systematically unified the sampling frequency and measurement units across all monitoring wells, thereby eliminating potential biases from data inconsistencies²³. Specifically: (1) depth-to-water measurements were converted to elevation head values using wellhead benchmarks, (2) high-frequency daily data were resampled to 7-day resolution using arithmetic averaging to maintain temporal consistency, and (3) a stringent quality control filter was applied to select wells with a missing rate of less than 30% and no gaps exceeding one consecutive month during 2018–2023. Missing values were then imputed using seasonal-trend decomposition (STL) to preserve the statistical properties of hydrological time series⁴⁵.

Multi-source data fusion

This study employed a systematic data fusion approach to achieve spatio-temporal synchronization between meteorological variables and GWL observations. Utilizing the ArcGIS 10.8 platform, we first extracted monthly bands (2018–2023) from raster datasets for each meteorological element (precipitation, temperature, and potential evapotranspiration) through raster processing. The “Extract Values to Points” spatial analyst tool was then applied to derive precise time-series of meteorological elements at all 27 monitoring well locations. Temporal alignment was rigorously enforced by establishing unified timestamp indices that synchronize GWL records with corresponding meteorological measurements, ultimately generating a spatiotemporally coherent multivariate dataset. This fusion process ensured rigorous spatiotemporal alignment of multi-source datasets, establishing a robust foundation for subsequent spatiotemporal modeling.

Feature engineering

Temporal feature engineering: Beyond fundamental meteorological variables (precipitation, temperature, and evapotranspiration), we leveraged the time lag of GWLs to generate lagged GWL features. Autocorrelation function (ACF) analysis of GWL time series across monitoring wells revealed that 20 wells had a significant lag step of 2, while the remaining 7 wells demonstrated a significant lag step of 3. To optimize the trade-off between model complexity and feature representation capacity, this study adopted the maximal consensus lag order (lag 2) across all monitoring wells. Consequently, we constructed the GWL lag features for 7-day lagged values (GWL_lag1) and 14-day lagged values (GWL_lag2) as model inputs. Table 1 presents comprehensive statistics (Mean, Min, Max, STDEV, etc.) for both input and target variables (2018–2023), providing quantitative characterization of aquifer system dynamics and data basis for prediction model training.

Table 1 Statistical description of the input and output variable.

Spatial feature engineering: The spatial dataset delineated the geographical coordinates (latitude and longitude) and static attributes (elevation referenced to national geodetic datum, aquifer type classification) for all monitoring wells. Subsequently, we can precisely compute the hydraulic connectivity metrics between each monitoring well based on Euclidean distances. Table 2 provides representative examples of spatial dataset for some monitoring wells.

Normalization: Given the well-documented sensitivity of deep neural networks to input feature scales, this study implemented rigorous normalization using Scikit-learn machine learning library in the Python environment. This preprocessing step effectively eliminated the dimensional differences between features, ensuring the stability and convergence efficiency of model training and laying the data foundation for subsequent modeling.

Table 2 Representative examples of Geospatial characteristics and static attributes of monitoring wells.

Dataset partitioning

To ensure systematic and reliable evaluation of the model, this study adopted the following data partitioning strategy: Initially, the time-series data of two monitoring points randomly selected from the 27 monitoring points were reserved as an independent unseen-well test set to evaluate model performance on unseen monitoring wells. The data of the remaining 25 wells were divided according to the time series, with the data of 2023 year serving as the conventional test set for final validation of prediction accuracy. Data from 2018 to 2022 were used as the model development set, which was strictly divided into a training set (80%) and a validation set (20%) in chronological order. Here, the training set facilitated the learning and optimization of model parameters, while the validation set was used to monitor the generalization ability in real time during the training process and to prevent over-fitting. After partitioning, the training, validation, conventional test, and unseen-well test sets contained approximately 5,220, 1,305, 1, 075, and 608 samples, respectively.

Ultimately, model performance was assessed through dual evaluation levels: The conventional test set was used to evaluate temporal extrapolation capability, that is, the predictive accuracy for future time points on known monitoring wells; The unseen-well test set was used to assess spatial extrapolation performance, that is, the predictive adaptability to new monitoring wells. This dual testing strategy evaluated the model performance across both temporal and spatial dimensions, ensuring the comprehensiveness and reliability of the model evaluation.

Model construction

Inherently, GWL prediction is a complex systems problem with significant spatiotemporal coupling characteristics, where dynamic variations are simultaneously influenced by temporal evolution and spatial interactions. In the temporal dimension, GWL exhibits sequential dependence through continuous evolution, with new observations dynamically correlated to their historical states. Spatially, fluctuations in GWL at adjacent monitoring wells show significant hydraulic interdependencies. In response to these characteristics, this study designed a hybrid GWL prediction model integrating spatio-temporal features (STGPM), whose core architecture was organically composed of three key modules: spatial feature extraction, multi-scale temporal feature extraction, and spatio-temporal feature fusion. The structure of the overall model was shown as Fig. 4, which fully incorporated the spatio-temporal coupling mechanisms of the groundwater system, providing a scientifically rigorous modeling paradigm for accurate GWL prediction.

Construction of the K-nearest neighbor graph

To effectively capture hydraulic connectivity between monitoring wells, this study constructed a K-nearest neighbor (KNN) graph based on the geographical coordinates of monitoring wells, where each monitoring well was regarded as a node of the undirected graph. This graph structure could effectively capture the local spatial correlation among monitoring wells, providing neighborhood information for subsequent spatial feature aggregation. The specific steps were as follows:

Coordinate extraction: Extracted latitude and longitude coordinates of each monitoring well from their spatial information to form an N$\:\times\:$2 coordinate matrix (where N is the number of monitoring points, N = 25).

K-Nearest neighbors calculation: Utilized the nearest neighbor algorithm to calculate the K nearest neighbors for each monitoring well and obtained the Euclidean distances to these nearest neighbors.

Edge construction: Traversed each node and established undirected edges between it and its nearest neighbors, with edge weights set as the inverse of Euclidean distance. This weighting scheme ensured stronger connections between geographically closer nodes, thereby more accurately reflecting the spatial relationships between wells.

The final undirected graph $\:\mathcal{G}\left(\mathcal{V},\mathcal{E}\right)$ completely characterized the spatial topology structure of the monitoring well network, where $\:\mathcal{V}=\left\{{v}_{1},{v}_{2},\dots\:,{v}_{m}\right\}$ represented the monitoring wells and $\:\mathcal{E}=\left\{{e}_{\text{1,2}},{e}_{i,j},\dots\:,{v}_{m,n}\right\}$ described the strength of spatial connections between them. This undirected graph served as input to the GraphSAGE model, providing accurate neighborhood information for subsequent spatial feature extraction.

Spatial feature extraction

This study utilized the GraphSAGE model to learn spatial feature representations of monitoring wells. The model effectively captured spatial dependencies between nodes by leveraging both the feature and structural information of nodes through neighbor sampling and feature aggregation mechanisms.

Node sampling: For each target node $\:\nu\:\in\:\mathcal{V}$, we employed a hierarchical sampling strategy to determine its multi-hop neighbor set $\:\mathcal{N}\left(v\right)$. The sampling process primarily focused on two parameters: the number of sampling layers $\:\mathcal{D}$ and the sampling size per layer. $\:\mathcal{D}$ represented the maximum hop count for neighbor aggregation. Experimental results demonstrated that the model achieved optimal performance when $\:\mathcal{D}=2$.

Node aggregation: The GraphSAGE model provided three aggregation functions: mean aggregation, LSTM aggregation, and pooling aggregation. Comparative experiments indicated that while both LSTM aggregation and pooling aggregation delivered good performance, the former exhibited significant computational inefficiency. Therefore, this study selected the pooling aggregation function, which operated by first applying a nonlinear transformation to the embedding of each neighbor node via a fully connected network, followed by the integration of neighborhood information to generate the target node embedding using max or mean pooling operations. The mathematical formulation was as follows:

$$\:{\text{AGGREGATE}}_{d}^{pool}=max\left(\left\{\sigma\:\left({\text{W}}_{pool}{h}_{{u}_{i}}^{d}+b\right),\forall\:{u}_{i}\in\:\mathcal{N}\left(\nu\:\right)\right\}\right).$$

(1)

Building upon these two processes, we first initialized the feature vector representation $\:{h}_{v}$ for each node. For each node $\:\nu\:\in\:\mathcal{V}$, its neighbor nodes $\:\mathcal{N}\left(v\right)$ were obtained through node sampling. Subsequently, the aggregation function (Eq. 1) was employed to integrate feature information from neighboring nodes. Finally, the aggregated neighborhood features were combined with the node’s own features through a nonlinear transformation to generate the updated node embedding representation, formulated as follows:

$$\:{h}_{v\:}^{d}=\:\sigma\:\left({\text{W}}^{d}\cdot\:\text{CONCAT}\left({h}_{v\:}^{d-1},{h}_{\mathcal{N}\left(\nu\:\right)\:}^{d}\right)\right).$$

(2)

Multi-scale temporal feature extraction

In the process of GWL prediction, the representation ability of temporal features is a critical factor influencing model accuracy. Inspired by Chen, et al. ⁴³, this study employed a multi-branch GRU architecture that processed time-series data at different temporal scales in parallel, enabling joint modeling of both short-term fluctuations and long-term trends.

Considering the hydrological response characteristics of the karst aquifer system in Jinan, we defined three distinct sliding window lengths: short-term (one month), medium-term (6 months), and long-term (12 months) windows. The short-term window focused on recent GWL fluctuations. The medium-term window covered semi-annual hydrological cycles to model seasonal variation patterns, while the long-term window was dedicated to learning interannual trends. The original time series was partitioned into multiple subsequences according to these different window lengths. For example, considering a time series $\:{T}_{\mathcal{w}}=\left\{{x}_{1},{x}_{2},\dots\:,{x}_{n}\right\}$ composed of the GWL observations from monitoring well $\:\mathcal{w}$, to predict the GWL value at time $\:\mathcal{t}$, if the sliding window was set to 3, the GWL values from the three preceding time steps were extracted, forming the input sequence $\:\left\{{x}_{\mathcal{t}-3},{x}_{\mathcal{t}-2},{x}_{\mathcal{t}-1}\right\}$. These sub-sequences from the three distinct sliding windows were fed into three separate GRU branches, with each branch specifically processing the sub-sequence in a specific temporal scale. This parallel architecture enabled comprehensive modeling of both short-term perturbations (e.g., rainfall responses) and long-term evolutionary trends (e.g., seasonal cycles) in groundwater dynamics.

To further optimize feature fusion, this study introduced an attention mechanism to adaptively integrate multi-scale features. Let $\:{h}_{1},{h}_{2},\:$and $\:{h}_{3}$ denote the output feature vectors from the short-term, medium-term, and long-term GRU branches, respectively. The importance weights $\:{\beta\:}_{i}$ of each branch were calculated through the attention mechanism. The feature fusion process based on attention weights can be expressed as:

$$\:{h}_{t}=\sum\limits_{i=1}^{3}{\beta}_{i}\cdot\:{h}_{i}$$

(3)

This mechanism dynamically adjusted the contribution weights of features across different temporal scales, generating more discriminative spatio-temporal feature representations. The design not only preserved scale-specific information but also enhanced predictive capability through synergistic feature interactions.

Spatio-temporal feature fusion

The core of spatiotemporal feature fusion lies in establishing coupled representations of spatial and temporal features. This study employed a cross-attention mechanism⁴⁶ to integrate temporal and spatial features, enabling more comprehensive feature representation. Through the aforementioned spatial and temporal feature extraction processes, supposed we obtain two feature sequences $\:{h}_{s}$ and $\:{h}_{t}$, where $\:{h}_{s}$ was the spatial feature sequence and $\:{h}_{t}$ was the temporal feature sequence. The spatio-temporal cross-attention mechanism allowed one sequence (spatial features) to serve as Query, while the other sequence (temporal features) acted as both Key and Value. The Query, Key, and Value can be expressed as:

$$\:Q={h}_{s}{W}_{q},\:\:K={h}_{t}{W}_{k},\:\:V={h}_{t}{W}_{v},$$

(4)

where $\:{W}_{q}$, $\:{W}_{k}$, and $\:{W}_{v}$ represented the projection matrices for Query, Key, and Value, respectively.

The cross-attention scores between spatial nodes and temporal steps were obtained by computing the similarity between Query and Key:

$$\:A=softmax\left(\frac{Q{K}^{T}}{\sqrt{{d}_{k}}}\right)$$

(5)

where $\:{d}_{k}$ was the dimension of the Key, serving as a scaling factor for the dot product to prevent gradient vanishing. Each element $\:A\left(i,j\right)$ in the attention matrix quantified the dependency strength between the $\:i$-th monitoring well and the $\:j$-th timestep.

Finally, temporal features were aggregated to spatial nodes through a weighted sum:

$$\:\text{Z}=A\cdot\:V$$

(6)

Model optimization and evaluation

Experimental setup

The hardware and software environment configurations employed for model optimization and evaluation were detailed in Table 3.

Table 3 Experiment environment.

To ensure the reproducibility of our proposed STGPM model, this subsection provided a comprehensive description of the specific architectural configurations used for each component. The final architecture was summarized in Table 4.

Table 4 Detailed architecture of the STGPM.

Hyperparameter optimization

To identify the optimal hyperparameter configuration for STGPM, a systematic grid search strategy was employed. This exhaustive method was selected due to the discrete and limited nature of the hyperparameter space, ensuring a comprehensive evaluation of all possible combinations to achieve globally optimal performance within the defined search domain, rather than settling for a computationally efficient but potentially local optimum.

Specifically, the grid search examined three critical parameters: learning rate, batch size, and the number of sampled neighbor nodes. The learning rate varied within the range of $\:{10}^{-4}$ to $\:{10}^{-2}$, the batch size was tested at values of [16, 32, 64, 128] to balance computational efficiency and training stability, and generalization performance. The number of samples per hop was set to 1, 3, and 5 to determine the optimal amount of neighborhood information to aggregate for spatial feature extraction. The optimization objective was trained to minimize the Mean Squared Error (MSE) between its predictions and the truth groundwater level values. Each parameter combination underwent 100 training evaluations with early stopping patience to prevent overfitting. The training and validation loss curves were meticulously monitored to ensure convergence and assess generalization performance. This optimal configuration was subsequently used to train the final model on the combined training and validation sets for all subsequent performance evaluations reported in this study.

Similarly, we adopted a rigorous approach where each model underwent an independent hyperparameter optimization process using the same grid search strategy for each baseline model. The search space for each model included key architectural parameters: the number of layers [1, 2] and the number of hidden units [32, 64, 128]. Final optimized hyperparameter configurations for all compared models were shown in Table 5.

Table 5 Final optimized hyperparameter configurations for all compared models.

Evaluation metrics

To comprehensively evaluate model performance, this study employed a multi-dimensional metric system for quantitative analysis. The evaluation framework included the Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Coefficient of Determination (R²). Each metric provided distinct insights: MAE measured the absolute deviation between predicted and observed values, RMSE quantified the dispersion degree of prediction errors, and R² assessed goodness-of-fit. The combination of these three indicators can objectively assess the prediction accuracy and model stability from different perspectives, providing a reliable quantitative basis for model comparison.

The MAE is the average absolute difference between predicted and observed values, quantifying the absolute magnitude of prediction errors. As the most intuitive metric, MAE is less sensitive to outliers due to the use of absolute values. Its formulation is given by:

$$\:\text{M}\text{A}\text{E}=\frac{1}{n}\sum\:_{i=1}^{n}\left|{y}_{i}-\widehat{{y}_{i}}\right|,$$

(7)

where $\:{y}_{i}$ and $\:\widehat{{y}_{i}}\:$denote observed and predicted values, respectively, and $\:n$ is the sample size.

The RMSE, calculated as the square root of the mean squared errors, provides greater sensitivity to prediction variability and extreme errors. The calculation method for RMSE is as follows:

$$\:\text{R}\text{M}\text{S}\text{E}=\sqrt{\frac{1}{n}\sum\:_{i=1}^{n}{\left({y}_{i}-\widehat{{y}_{i}}\right)}^{2}}.$$

(8)

R² quantifies the proportion of variability in the target variable explained by the model from a statistical perspective, serving as an important indicator of goodness-of-fit. An R² value closer to 1 indicates a better fit, while an R² close to 0 or negative suggests a poor model fit. The formula for calculating R² is as follows:

$$\:{\text{R}}^{2}=1-\frac{{\sum\:}_{i=1}^{n}{\left({y}_{i}-\widehat{{y}_{i}}\right)}^{2}}{{\sum\:}_{i=1}^{n}{\left({y}_{i}-\stackrel{-}{y}\right)}^{2}},$$

(9)

where $\:\stackrel{-}{y}$ is the mean of the observed values.

It is important to note that for model evaluation, the predictions were inverse-transformed back to the original scale (meters) before calculating the MAE, RMSE, and R² metrics to ensure their physical interpretability.

Model interpretability

Despite the superior predictive performance of machine learning and deep learning models in groundwater level prediction, their inherent “black box” nature limits the interpretability of the model decision-making process. Model interpretability aims to uncover the underlying mechanisms between input features (such as rainfall, evaporation, and groundwater extraction) and prediction outcomes, providing a scientific basis for water resource management decisions.

This study employed the SHapley Additive exPlanations (SHAP) framework, rooted in cooperative game theory, to quantify feature contributions to the model’s prediction results by calculating the SHAP values of each feature. The advantage of this approach was that SHAP values can simultaneously reveal both the polarity (positive/negative influence) and relative importance of each feature’s impact on predictions. This method transformed opaque model behavior into interpretable, physically consistent logic, thereby comprehensively assessing model behavior.

link