Decoupling and predicting natural gas deviation factor using machine learning methods
Basic dataset
The foundational Z-factor data for developing the machine learning models were sourced from gas wells of the Jurassic Shaximiao Formation gas reservoir in the Sichuan Basin, China. The Shaximiao Formation is a thick red-bed formation deposited in the Middle Jurassic of the Sichuan Basin, with a thickness ranging from 1000 to 2000 m. It consists primarily of dark purplish-red mudstone interbedded with multiple sets of sandstone. The reservoir pressure ranges from 19.1 to 26.5 MPa, and the pressure coefficient varies between 0.845 and 1.155. The reservoir temperature ranges from 323.86 to 356.75 K, with an average of 338.23 K. Natural gas migrates from the underlying Xujiahe Formation source rocks through large faults30. Influenced by gas accumulation, trapping conditions, and gas migration, the composition of natural gas differs significantly between locations. Among the sampled gas wells, 88% produce wet gas and 12% produce condensate gas. We collected a total of 6914 sets of Z-factors with different compositions at different temperatures and pressures. All Z-factors were obtained through constant volume depletion (CVD) experiments.
The statistical summary of natural gas components and properties in the dataset is presented in Table 1 and Fig. 1. The hydrocarbon composition of the natural gas is predominantly methane, with methane content ranging from 73.391 to 93.728% and ethane content from 4.512 to 10.4%. The dryness coefficient ranges from 0.858 to 0.944, indicating predominantly wet gas30. Additionally, the natural gas contains around 1% nitrogen, approximately 0.08% carbon dioxide, and around 0.3% C7+ components.

Boxplot of components and properties of natural gas.
Machine learning models
In this section, we briefly introduce the main principles of the machine learning algorithms used in this study. This article primarily employs five regression algorithms: SVM, XGBoost, LightGBM, ANN and BiLSTM. Implementations of these models are available in most machine learning libraries, such as PyTorch, Keras, TensorFlow, and MATLAB31. All models in this paper are implemented with MATLAB 2023a (https://www.mathworks.com)32.
SVM
SVM is a supervised learning algorithm whose regression variant is Support Vector Regression (SVR). It demonstrates good robustness and adaptability in addressing non-linear regression problems. In regression problems, the goal is to fit the data and find a function that minimizes the error between predicted and actual values; in conventional regression the loss is zero only when the predicted value f(x) and the actual value y are exactly the same. In SVR, by contrast, the objective is to find a hyperplane that minimizes the distance between the training data points and this hyperplane within a certain tolerance \(\epsilon\); a loss is incurred only when \(|f(x)-y|>\epsilon\). This is equivalent to constructing a band of width \(2\epsilon\) centered around \(f(x)=\omega \cdot x+b\), as illustrated in Fig. 2a. If a training sample falls within the interior of this band, its prediction is considered correct, resulting in no loss33.
SVR finds the optimal hyperplane by minimizing a loss function with a tolerance threshold and regularization term. Its optimization objective can be formulated as a minimization problem with a penalty term. Typically, a loss function with L1 or L2 regularization is employed to control the model’s complexity34. The minimized objective function can be expressed as:
$$\begin{aligned} SVM\_Loss=\min _{\omega , b} \frac{1}{2}\Vert \omega \Vert ^2+C \sum _{i=1}^n L_{\varepsilon }\left( y_i, \omega \cdot x_i+b\right) \end{aligned}$$
(1)
where \(\omega\) is the normal vector of the hyperplane; i is the data index; n is the number of training samples; C is the regularization parameter; \(L_{\varepsilon }\) is the \(\varepsilon\)-insensitive loss function; \(y_i\) is the true label; b is the bias term of the hyperplane. The term \(\frac{1}{2}\Vert \omega \Vert ^2\) serves as the penalty on model complexity.
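As a concrete illustration of Eq. (1), the following sketch evaluates the \(\varepsilon\)-insensitive objective for a linear SVR in Python/NumPy (an illustrative toy, not the paper's MATLAB implementation; all names are ours) and shows that samples inside the \(2\epsilon\) band contribute no loss:

```python
import numpy as np

def svr_objective(w, b, X, y, C=1.0, eps=0.1):
    """Eq. (1): 0.5*||w||^2 + C * sum of epsilon-insensitive losses."""
    residual = np.abs(y - (X @ w + b))       # |f(x_i) - y_i|
    loss = np.maximum(0.0, residual - eps)   # zero inside the eps-band
    return 0.5 * np.dot(w, w) + C * loss.sum()

# Points lying exactly on f(x) = 2x + 1 incur only the complexity penalty.
X = np.array([[0.0], [1.0], [2.0]])
y = 2.0 * X[:, 0] + 1.0
w = np.array([2.0])
print(svr_objective(w, 1.0, X, y))  # only 0.5*||w||^2 = 2.0 remains
```

Shifting all targets by 0.5 puts every sample 0.4 beyond the band, so each contributes a loss of 0.4.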
XGBoost
XGBoost is an ensemble learning method based on the gradient boosting framework. Its base learner is a decision tree, and by combining multiple base learners, each iteration corrects the errors of the previous round's model, thus constructing a robust model. The difference between XGBoost and random forest lies in how the models are trained: in each step, XGBoost adds a tree on top of the previous step, and the newly added tree is designed to address the deficiencies of the preceding ones. Its basic structure is illustrated in Fig. 2b. Because XGBoost incorporates the complexity of the tree models into its regularization term, it also excels at preventing overfitting and enhancing generalization performance35.
XGBoost trains the model by minimizing the sum of weighted residuals. The loss function includes terms for prediction errors and regularization. The objective function of XGBoost is to minimize the loss function, and its general form is typically expressed as:
$$\begin{aligned} XGBoost\_Loss=\sum _{i=1}^n L\left( y_i, \hat{y}_i\right) +\sum _{k=1}^K \Omega \left( f_k\right) \end{aligned}$$
(2)
where n is the number of training samples; \(\hat{y}_i\) is the predicted value; \(L\left( y_i, \hat{y}_i\right)\) is typically the Mean Squared Error (MSE) loss function; \(\Omega (f_k)\) is the regularization term for decision tree \(f_k\), used to control model complexity and prevent overfitting; K is the number of trees.
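The additive training behind Eq. (2) can be sketched as follows (a toy NumPy gradient-boosting loop with depth-1 stumps and a squared-error loss, not the actual XGBoost implementation; the regularization term \(\Omega\) is reduced here to simple shrinkage via the learning rate):

```python
import numpy as np

def fit_stump(x, r):
    """Find the threshold split on x that best fits residuals r (squared error)."""
    best = None
    for s in np.unique(x)[:-1]:
        left, right = r[x <= s], r[x > s]
        sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if best is None or sse < best[0]:
            best = (sse, s, left.mean(), right.mean())
    _, s, lv, rv = best
    return lambda xq: np.where(xq <= s, lv, rv)

def boost(x, y, n_trees=30, lr=0.5):
    pred = np.full_like(y, y.mean())        # initial constant model
    for _ in range(n_trees):
        r = y - pred                        # residuals = negative gradient of MSE
        tree = fit_stump(x, r)              # new tree targets previous errors
        pred = pred + lr * tree(x)          # additive update, shrunk by lr
    return pred

x = np.linspace(0, 1, 64)
y = np.sin(2 * np.pi * x)
pred = boost(x, y)
```

Each round adds a tree that fits the current residuals, which is the "correct the deficiencies of the preceding one" behavior described above.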
LightGBM
LightGBM is similar to XGBoost, both being machine learning models based on the gradient boosting mechanism, as illustrated in Fig. 2. While XGBoost can precisely find split points, it incurs significant computation and memory overhead and is prone to overfitting. LightGBM utilizes a learning method optimized with histograms, adopts a leaf-wise growth strategy, and supports feature parallelism, reducing the number of candidate split points and memory usage, enhancing computational speed, and effectively preventing overfitting36.
During training, the LightGBM algorithm differs from traditional Gradient Boosting Decision Tree (GBDT) algorithms in that it does not require multiple passes through the entire dataset. Instead, it works on residuals, using the output of the previous round of training as the input to the next. LightGBM splits decision trees using histograms: feature values are divided into discrete intervals (bins), and histograms are built over these bins to evaluate candidate splits. It also employs a leaf-wise growth strategy with depth limitations, replacing the level-wise growth strategy typically used in GBDT algorithms. Compared with XGBoost's level-wise strategy, LightGBM's leaf-wise strategy reduces the computational cost of finding optimal split points, lowering the overall computational burden. This efficiency makes LightGBM more memory-efficient and mitigates the risk of overfitting37.
Feature parallelism can effectively handle high-dimensional data, with each processing unit focusing on only a subset of features. LightGBM follows a parallel computing strategy, dividing the dataset into several nodes, where different nodes process features in parallel, and then model parameters are updated by merging the results from each node. Additionally, LightGBM supports cross-validation to evaluate model performance and provides early stopping functionality, which halts training when the performance on the validation set no longer improves to prevent overfitting. In practical use, LightGBM iteratively optimizes the tree structure and requires careful adjustment of hyperparameters based on the dataset and learning objectives to achieve optimal performance.
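The histogram-based split search described above can be illustrated as follows (a minimal NumPy sketch, not LightGBM's actual implementation; the bin count and the variance-gain surrogate are simplified):

```python
import numpy as np

def histogram_best_split(x, g, n_bins=16):
    """Bin a feature, accumulate gradient sums per bin, and pick the bin
    boundary that maximizes the variance-gain surrogate sum^2/count."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    bins = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
    grad_sum = np.bincount(bins, weights=g, minlength=n_bins)
    count = np.bincount(bins, minlength=n_bins).astype(float)

    best_gain, best_edge = -np.inf, None
    gl, cl = 0.0, 0.0
    for b in range(n_bins - 1):              # candidate split after bin b
        gl += grad_sum[b]; cl += count[b]
        gr, cr = grad_sum.sum() - gl, count.sum() - cl
        if cl == 0 or cr == 0:
            continue
        gain = gl**2 / cl + gr**2 / cr       # larger = better separation
        if gain > best_gain:
            best_gain, best_edge = gain, edges[b + 1]
    return best_edge

# A step in the gradients at x = 0.5 should be recovered as the split point.
x = np.linspace(0, 1, 200)
g = np.where(x < 0.5, -1.0, 1.0)
split = histogram_best_split(x, g)
```

Because only `n_bins` candidate boundaries are scanned instead of every distinct feature value, the cost and memory of split finding drop sharply, which is the efficiency argument above.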
ANN
ANNs are suitable for handling complex nonlinear regression problems and can be adapted to specific tasks by adjusting the network structure and parameters. An artificial neural network consists of three main parts: the input layer, the hidden layers, and the output layer5. Each layer’s neurons are connected by weights. The neurons in the input layer are used to receive input data from external sources. The hidden layers, located between the input and output layers, can consist of one or more layers. Each hidden layer is composed of multiple neurons, which are connected by weights. The main role of the hidden layers is to extract features. The output layer is the final layer of the network and is used to produce the final output variable in regression tasks. ANNs utilize multiple hidden layers and activation functions to capture the nonlinear relationships in the data, and they prevent overfitting through standardization and validation sets. This study employs a neural network with three hidden layers: the first layer has 12 neurons, the second layer has 10 neurons, and the third layer has 8 neurons, as shown in Fig. 2d. All hidden layers use the ReLU activation function, which is very effective in handling nonlinear problems and can help mitigate the vanishing gradient problem. The ReLU activation function is defined as follows:
$$\begin{aligned} ReLU(x)=\max (0,x) \end{aligned}$$
(3)
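The forward pass of the 12-10-8 ReLU network described above can be sketched as follows (a NumPy illustration with random weights, assuming four inputs and one output; it is not the trained MATLAB network):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)                   # Eq. (3)

def forward(x, weights, biases):
    """Propagate input through ReLU hidden layers and a linear output layer."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(a @ W + b)                     # hidden layers extract features
    return a @ weights[-1] + biases[-1]         # linear output for regression

rng = np.random.default_rng(0)
sizes = [4, 12, 10, 8, 1]                       # inputs -> 3 hidden layers -> Z
weights = [rng.normal(0, 0.5, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

z_hat = forward(rng.normal(size=(5, 4)), weights, biases)
```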
BiLSTM
LSTM and BiLSTM are effective in capturing long-term dependencies in sequential data. Originally designed for time-series prediction problems, both are frequently applied to regression problems due to their powerful feature extraction capabilities. Derived from recurrent neural networks (RNNs), LSTM retains the traditional RNN architecture of input, hidden, and output layers but improves the structure of the hidden layer. It introduces gating mechanisms to control the information flow, selectively retaining or discarding information through these gates. This alleviates the gradient vanishing and exploding problems caused by repeated derivative multiplication, effectively enhancing the memory capacity of deep learning networks38. The basic architecture of a single LSTM hidden layer includes a forget gate, an input gate, and an output gate, as depicted in Fig. 2e.
BiLSTM is an improvement of LSTM, and its hidden layer is composed of both forward LSTM and backward LSTM, considering both past and future information of the data. Its structure is shown in Fig. 2f, and the BiLSTM memory network has two-directional transmission layers. The forward layer follows the forward training time series, the backward layer follows the backward training time series, and both the forward and backward layers are connected to the output layer39.
LSTM controls the selection or forgetting of data information through the forget gate, deciding which effective information in the training data is used for training. The amount of information to be stored in the memory unit is determined by the input gate. The input gate is used to control the amount of current input data \(x_t\) flowing into the memory unit. The output gate controls the influence of the memory unit \(c_t\) on the current output value \(h_t\)40. Their formulas are:
$$\begin{aligned}&f_t=\sigma \left( \varvec{W}_f\left[ h_{t-1}, x_t\right] +b_f\right) \end{aligned}$$
(4)
$$\begin{aligned}&i_t=\sigma \left( \varvec{W}_i\left[ h_{t-1}, x_t\right] +b_i\right) \end{aligned}$$
(5)
$$\begin{aligned}&o_t=\sigma \left( \varvec{W}_o\left[ h_{t-1}, x_t\right] +b_o\right) \end{aligned}$$
(6)
$$\begin{aligned}&h_t=o_t \tanh \left( c_t\right) \end{aligned}$$
(7)
(7)
where \(f_t\) is the state of the forget gate (the closer its value is to 0, the more information is forgotten); \(\sigma\) is the Sigmoid function; \(\varvec{W}\) and b are the weight matrices and bias terms of the forget, input, and output gates; \(h_{t-1}\) is the output of the previous step; \(x_t\) is the input at the current time t; \(i_t\) is the state of the input gate (the higher its value, the more important the new information, while unimportant information is discarded); \(o_t\) is the state of the output gate (the closer its value is to 1, the more information the external state \(h_t\) obtains from the memory unit \(c_t\)).
BiLSTM consists of two LSTM networks running in opposite directions. The input is passed to both LSTM networks; the forward LSTM layer outputs \(\overrightarrow{h}_t\) and the backward LSTM layer outputs \(\overleftarrow{h}_t\). The output \(y_t\) is jointly determined by the outputs of the LSTMs in both directions, and its update formulas are as follows:
$$\begin{aligned} \left\{ \begin{array}{l} \overrightarrow{h}_t=\text {LSTM}\left( x_t, \overrightarrow{h}_{t-1}\right) \\ \overleftarrow{h}_t=\text {LSTM}\left( x_t, \overleftarrow{h}_{t-1}\right) \end{array}\right. \end{aligned}$$
(8)
$$\begin{aligned} y_t=\overrightarrow{\varvec{W}} \overrightarrow{h}_t+\overleftarrow{\varvec{W}} \overleftarrow{h}_t+b_y \end{aligned}$$
(9)
where \(\overrightarrow{\varvec{W}}\) and \(\overleftarrow{\varvec{W}}\) are the weight matrices from the forward and backward LSTM layers to the output layer, respectively; \(b_y\) represents the bias of the output layer.
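Equations (4)–(9) can be made concrete with a NumPy sketch of a single LSTM step run in both directions (random weights, purely illustrative; the cell-state update, which the equations above omit, uses the standard form \(c_t=f_t c_{t-1}+i_t \tanh (\varvec{W}_c[h_{t-1},x_t]+b_c)\)):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step, Eqs. (4)-(7); W maps [h_{t-1}, x_t] to the 4 gate blocks."""
    z = np.concatenate([h_prev, x_t]) @ W + b
    H = h_prev.size
    f_t = sigmoid(z[:H])                          # forget gate, Eq. (4)
    i_t = sigmoid(z[H:2*H])                       # input gate, Eq. (5)
    o_t = sigmoid(z[2*H:3*H])                     # output gate, Eq. (6)
    c_t = f_t * c_prev + i_t * np.tanh(z[3*H:])   # standard cell update
    h_t = o_t * np.tanh(c_t)                      # Eq. (7)
    return h_t, c_t

rng = np.random.default_rng(1)
H, D = 3, 4
W = rng.normal(0, 0.3, (H + D, 4 * H))
b = np.zeros(4 * H)
x_seq = rng.normal(size=(6, D))

# BiLSTM: run the step forward and backward over the sequence (Eq. 8),
# then combine both final hidden states into the output (Eq. 9).
hf = hb = np.zeros(H); cf = cb = np.zeros(H)
for x in x_seq:
    hf, cf = lstm_step(x, hf, cf, W, b)
for x in x_seq[::-1]:
    hb, cb = lstm_step(x, hb, cb, W, b)
Wf, Wb, by = rng.normal(size=(H,)), rng.normal(size=(H,)), 0.0
y_t = Wf @ hf + Wb @ hb + by
```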

Structure of models used in this study.
Decoupling methods
The method of decoupling Z-factors mainly employs signal decomposition algorithms. The primary purpose of signal decomposition algorithms is to decompose complex signals into different components for a better understanding of the structure, nature, and characteristics of the data. Using signal decomposition algorithms to decouple Z-factors helps extract useful information from the Z-factors, remove noise, reveal patterns hidden in the data, and provide a foundation for further analysis and processing. In the process of decoupling Z-factors, we mainly use three signal decomposition methods: VMD41, EFD42, and EEMD43. In subsequent studies, we compared the performance of the three hybrid models VMD+SVM, EFD+SVM and EEMD+SVM on the testing set, and VMD was selected for its best performance. Therefore, the usage and principle of VMD are introduced here; the other two algorithms, which performed worse, are used only for comparison and are not described in detail. The workflows of the other signal decomposition methods are similar, and more detailed explanations of all decomposition methods can be found in the references cited above.
The specific implementation steps of using VMD to decouple Z-factors are as follows: (1) Establish a variational model. The Hilbert transform is used to decompose the original data f(t) into K components, each with a finite bandwidth around a center frequency. The data are then shifted to baseband by mixing with the center frequency to estimate each mode's bandwidth, such that the sum of the estimated bandwidths of all modes is minimized while the sum of all components equals the original signal. The constructed variational model is as follows:
$$\begin{aligned} \left\{ \begin{array}{c} \min _{\left\{ u_k\right\} ,\left\{ \omega _k\right\} }\left\{ \sum _{k=1}^K\left\| \partial _t\left[ \left( \delta (t)+\frac{j}{\pi t}\right) * u_k(t)\right] e^{-j \omega _k t}\right\| _2^2\right\} \\ \text {s.t.} \sum _{k=1}^K u_k=f(t) \end{array}\right. \end{aligned}$$
(10)
where \(u_k\) and \(\omega _k\) are the k-th modal component and its center frequency after decomposition, respectively; \(\partial _t\) is the partial derivative with respect to time t; \(\delta (t)\) is the Dirac function; j is the imaginary unit; \(u_k(t)\) is the modal function.
(2) Solve the variational model. A Lagrangian multiplier \(\lambda (t)\) and a quadratic penalty parameter \(\alpha\) are introduced, transforming Eq. (10) into the augmented Lagrangian function:
$$\begin{aligned} L\left( \left\{ u_k\right\} ,\left\{ \omega _k\right\} , \lambda \right) =\alpha \sum _{k=1}^K\left\| \partial _t\left[ \left( \delta (t)+\frac{j}{\pi t}\right) * u_k(t)\right] e^{-j \omega _k t}\right\| _2^2 +\left\| f(t)-\sum _{k=1}^K u_k(t)\right\| _2^2 +\left\langle \lambda (t), f(t)-\sum _{k=1}^K u_k(t)\right\rangle \end{aligned}$$
(11)
where \(\alpha\) is the penalty parameter; \(\lambda (t)\) is the Lagrange multiplier.
(3) The alternating direction method of multipliers (ADMM) is used to update \(u_k^{n+1}\), \(\omega _k^{n+1}\) and \(\lambda ^{n+1}\):
$$\begin{aligned}&u_k^{n+1}(\omega )=\frac{f(\omega )-\sum _{i \ne k} u_i(\omega )+\frac{\lambda (\omega )}{2}}{1+2 \alpha \left( \omega -\omega _k\right) ^2} \end{aligned}$$
(12)
$$\begin{aligned}&\omega _k^{n+1}=\frac{\int _0^{\infty } \omega \left| u_k(\omega )\right| ^2 d \omega }{\int _0^{\infty }\left| u_k(\omega )\right| ^2 d \omega } \end{aligned}$$
(13)
$$\begin{aligned}&\lambda ^{n+1}(\omega )=\lambda ^n(\omega )+\tau \left( f(\omega )-\sum _k u_k^{n+1}(\omega )\right) \end{aligned}$$
(14)
where n is the iteration index; \(\tau\) is the time step of the dual ascent.
(4) Repeat step (3) until the solution satisfies the convergence condition, yielding K components of limited bandwidth:
$$\begin{aligned} \sum _{k=1}^K\left\| u_k^{n+1}-u_k^n\right\| _2^2 /\left\| u_k^n\right\| _2^2<\varepsilon \end{aligned}$$
(15)
where \(\varepsilon\) is the convergence condition.
For a Z-factor decoupled into K modal components, the original Z-factor is recovered by summing the components.
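Steps (1)–(4) can be sketched in NumPy as follows (a simplified frequency-domain VMD loop on a two-tone toy signal; it uses |ω| in the Wiener filter to keep spectra conjugate-symmetric and sets τ = 0, so it illustrates the updates of Eqs. (12)–(14) rather than being a full VMD implementation):

```python
import numpy as np

N, K, alpha = 1024, 2, 2000.0
t = np.arange(N) / N
f = np.cos(2 * np.pi * 5 * t) + np.cos(2 * np.pi * 50 * t)   # two modes

freqs = np.fft.fftfreq(N, d=1.0 / N)        # frequency grid in Hz
f_hat = np.fft.fft(f)
u_hat = np.zeros((K, N), dtype=complex)
omega = np.array([10.0, 60.0])              # initial center-frequency guesses
pos = freqs > 0

for _ in range(50):
    for k in range(K):
        others = u_hat.sum(axis=0) - u_hat[k]
        # Eq. (12): Wiener-filter update of the k-th mode (lambda term dropped)
        u_hat[k] = (f_hat - others) / (1.0 + 2.0 * alpha
                                       * (np.abs(freqs) - omega[k]) ** 2)
        # Eq. (13): center frequency = power centroid over positive frequencies
        power = np.abs(u_hat[k][pos]) ** 2
        omega[k] = (freqs[pos] * power).sum() / power.sum()

modes = np.real(np.fft.ifft(u_hat, axis=1))
recon = modes.sum(axis=0)                   # summing modes approximates f
```

The recovered center frequencies settle near the true 5 Hz and 50 Hz tones, and summing the modes reconstructs the signal, mirroring the component-summation step above.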
Snow ablation optimizer
When decoupling the Z-factor into multiple components using signal decomposition algorithms, an appropriate number of decompositions K must be set. At the same time, influenced by the decomposition algorithm, a large value of \(\alpha\) may cause the loss of frequency-band information, while a small value may lead to information redundancy. Therefore, the optimal parameter combination [K, \(\alpha\)] must be determined. There are two methods to determine the number of Z-factor decompositions. One is manual parameter tuning, trying different K values from small to large; this method is arbitrary and time-consuming44,45. The other is automatic optimization with metaheuristic algorithms, which was adopted to select the optimal [K, \(\alpha\)] combination in this study. In the outer layer of the entire machine learning model, metaheuristic optimization algorithms, such as genetic algorithms and particle swarm optimization, can serve as model parameter optimizers46. Here, we use the Snow Ablation Optimizer (SAO). The SAO algorithm is a heuristic optimization algorithm that simulates the snow-melting process to solve optimization problems: it adaptively adjusts temperature and melting speed during the search and gradually finds the optimal solution through iterative optimization. Its advantages over other optimization algorithms lie in its unique dual-population mechanism, efficient exploration and exploitation strategies, and flexible position-update equations. These features give it better balance, search efficiency, and adaptability on complex optimization problems, especially multi-peak and high-dimensional ones47.
The performance of the hybrid model in this study is mainly affected by the number of decompositions K, and each K corresponds to an \(\alpha\). \(\alpha\) has little impact on model performance and is therefore ignored during optimization. The goal of SAO optimization is to find the K with the smallest RMSE within a defined number of iterations, taking K as the main optimization variable with a search range of [1, 20]. The results section below discusses the impact of the number of decompositions on model performance. It was found that once the number of decompositions reaches 11, the performance of all models hardly increases, so K = 11 is used in this study.
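As a stand-in for the SAO loop (which is too involved for a short sketch), the following snippet shows the shape of this outer optimization: evaluate candidate K values over the search range [1, 20] against a validation RMSE and keep the smallest-RMSE K. The `rmse_for_k` callable is a hypothetical placeholder for "decompose with K modes, train, evaluate":

```python
def select_k(rmse_for_k, k_range=range(1, 21)):
    """Pick the decomposition number K minimizing validation RMSE."""
    best_k, best_rmse = None, float("inf")
    for k in k_range:
        rmse = rmse_for_k(k)
        if rmse < best_rmse:
            best_k, best_rmse = k, rmse
    return best_k, best_rmse

# Toy objective whose RMSE plateaus after K = 11, mimicking the paper's finding.
toy_rmse = lambda k: 0.05 + 0.5 / k if k < 11 else 0.05 + 0.5 / 11
best_k, best_rmse = select_k(toy_rmse)
```

SAO replaces this exhaustive scan with a population-based search, which matters when each evaluation (decompose + train k models) is expensive.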
Evaluation metrics
Root Mean Square Error (RMSE) and the adjusted coefficient of determination (\(R^2\)) are primarily employed to evaluate the model performance in this paper. A smaller RMSE indicates more accurate predictions of Z-factors by the model. A higher \(R^2\) signifies better model performance48. Their formulas are as follows:
$$\begin{aligned}&RMSE=\sqrt{\frac{1}{N}\sum _{i=1}^N(Z_i-Z_{pred})^2} \end{aligned}$$
(16)
$$\begin{aligned}&r^2 = 1-\frac{\sum _{i=1}^N(Z_i-Z_{pred})^2}{\sum _{i=1}^N(Z_i-\bar{Z})^2}\end{aligned}$$
(17)
$$\begin{aligned}&R^2 = 1-\frac{(1-r^2)(N-1)}{N-m-1} \end{aligned}$$
(18)
where N represents the number of Z-factor samples, \(Z_i\) represents the true Z-factor, \(Z_{pred}\) represents the predicted Z-factor, \(\bar{Z}\) represents the sample mean Z-factor, and m represents the number of features influencing the Z-factor.
Additionally, some models are also evaluated with the Mean Absolute Percentage Error (MAPE), where a lower MAPE indicates better model performance. The formula is expressed as follows:
$$\begin{aligned} MAPE=\frac{100 \%}{N} \sum _{i=1}^N\left| \frac{Z_{pred}-Z_i}{Z_i}\right| \end{aligned}$$
(19)
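Equations (16)–(19) translate directly to NumPy (a straightforward sketch; the variable names are ours):

```python
import numpy as np

def rmse(z, z_pred):
    return np.sqrt(np.mean((z - z_pred) ** 2))                    # Eq. (16)

def adjusted_r2(z, z_pred, m):
    n = len(z)
    r2 = 1.0 - np.sum((z - z_pred) ** 2) / np.sum((z - z.mean()) ** 2)  # Eq. (17)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - m - 1)               # Eq. (18)

def mape(z, z_pred):
    return 100.0 * np.mean(np.abs((z_pred - z) / z))              # Eq. (19), in %

z = np.array([0.85, 0.90, 0.95, 1.00])
z_pred = np.array([0.84, 0.91, 0.95, 0.98])
```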
The hybrid model

Proposed hybrid model for decoupling and predicting natural gas Z-factors.
The conventional Z-factor correlation predicts the Z-factor by establishing a mathematical expression between the Z-factor and the pseudoreduced pressure and temperature of natural gas. Pseudoreduced pressure and temperature are defined as:
$$\begin{aligned}&P_{pr}=\frac{p}{P_{pc}}\end{aligned}$$
(20)
$$\begin{aligned}&T_{pr}=\frac{T}{T_{pc}} \end{aligned}$$
(21)
where \(P_{pc}\) and \(T_{pc}\) are the pseudocritical pressure and temperature of natural gas, respectively.
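For example, Eqs. (20)–(21) in Python (the pseudocritical values below are illustrative assumptions, not values from the paper's dataset):

```python
def pseudoreduced(p, T, p_pc, T_pc):
    """Eqs. (20)-(21): pseudoreduced pressure and temperature."""
    return p / p_pc, T / T_pc

# Assumed values: p = 23 MPa, T = 340 K, P_pc = 4.6 MPa, T_pc = 200 K.
p_pr, t_pr = pseudoreduced(p=23.0, T=340.0, p_pc=4.6, T_pc=200.0)
```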
This approach simplifies obtaining a Z-factor correlation to some extent. However, it has a drawback: natural gases with different compositions may have equal \(P_{pr}\) and \(T_{pr}\) at different pressures and temperatures, yet their Z-factors may not be equal. Therefore, a Z-factor correlation obtained this way is only applicable to natural gas of constant composition; when the composition changes, a new correlation must be established. Machine learning models overcome this difficulty to some extent: they can link multiple factors to the Z-factor by taking as inputs the gas composition, temperature, pressure, and all other factors influencing the Z-factor9. Previous machine learning models also used \(P_{pr}\) and \(T_{pr}\) as input variables to establish Z-factor prediction models7,22, an approach that likewise faces the problem of data overlap. Considering the difficulty of obtaining the gas composition for all wells, and that \(P_{pc}\) and \(T_{pc}\) already partially encode information about different gas compositions, in this paper we employ \(P_{pc}\), \(T_{pc}\), P, and T as input variables to predict the Z-factor. The overall architecture of the proposed Z-factor prediction model is shown in Fig. 3. The workflow of predicting the Z-factor has four main steps:
Step 1: Decouple the Z-factor into several components using a decomposition algorithm. For the original undecoupled dataset:
$$\begin{aligned} D_{\text {original}}=\left\{ \left[ p_r, T_r, p, T\right] _{Input},[Z]_{Output}\right\} \end{aligned}$$
(22)
After the Z-factor is decomposed by the VMD, EFD or EEMD algorithms, k Z-factor components are obtained. The decoupled components are combined with the original data to form k data sequences, which can be expressed as:
$$\begin{aligned} \left\{ \begin{array}{l} D_1=\left\{ \left[ p_r, T_r, p, T\right] _{Input},[Z_1]_{Output}\right\} \\ D_2=\left\{ \left[ p_r, T_r, p, T\right] _{Input},[Z_2]_{Output}\right\} \\ \cdots \\ D_k=\left\{ \left[ p_r, T_r, p, T\right] _{Input},[Z_k]_{Output}\right\} \end{array}\right. \end{aligned}$$
(23)
Step 2: Train the hybrid model. The k data sequences are divided into training and testing sets, and machine learning algorithms (SVM, XGBoost, LightGBM, ANN and BiLSTM) are employed to train the models. For the k data sequences, k models are trained respectively:
$$\begin{aligned} \text {Mdl}[i]=\text {train}\left( \left[ p_r, T_r, p, T\right] _{Input}, [Z_i]_{Output}\right) , i=1,2, \ldots , k \end{aligned}$$
(24)
Step 3: Evaluate the performance of the hybrid models using Eqs. (16) and (18). In this step, the hybrid model with the best performance and the optimal number of decompositions k are selected. When k changes, steps 1 and 2 are repeated until the hybrid model reaches the optimal RMSE and \(R^2\) within the defined iterations. The number of decompositions k can therefore be adjusted manually by the user or automatically by SAO to achieve the optimal RMSE at this step.
Step 4: Predict the Z-factor under various conditions using the well-trained hybrid model. After all models are well-trained and achieve satisfactory performance on the testing set, the k trained models are used to predict the k decoupled Z-factor components:
$$\begin{aligned} Z[i]=\text {predict}\left( \text {Mdl}[i],\left[ p_r, T_r, p, T\right] _{Input\ for\ Prediction}\right) \end{aligned}$$
(25)
The final predicted Z-factor is obtained by adding k decoupled Z-factors:
$$\begin{aligned} Z=\sum _{i=1}^k Z[i] \end{aligned}$$
(26)
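Steps 1–4 can be condensed into a runnable sketch (using a trivial moving-average decomposition and linear least-squares learners as stand-ins for VMD and the five ML algorithms; purely illustrative of the decompose-train-predict-sum pipeline, with synthetic data):

```python
import numpy as np

def decompose(z, k=3, width=5):
    """Stand-in for VMD: split z into k components that sum back to z."""
    comps, residual = [], z.copy()
    kernel = np.ones(width) / width
    for _ in range(k - 1):
        smooth = np.convolve(residual, kernel, mode="same")
        comps.append(smooth)
        residual = residual - smooth
    comps.append(residual)                      # components sum exactly to z
    return comps

rng = np.random.default_rng(2)
X = rng.uniform(size=(200, 4))                  # stand-in [P_pc, T_pc, P, T] inputs
z = X @ np.array([0.1, -0.2, 0.3, 0.05]) + 0.9  # synthetic Z-factor

# Step 1: decouple; Step 2: train one learner per component (lstsq here).
A = np.hstack([X, np.ones((200, 1))])
models = [np.linalg.lstsq(A, comp, rcond=None)[0] for comp in decompose(z)]

# Steps 3-4: predict each component and sum them (Eqs. 25-26).
z_pred = sum(A @ beta for beta in models)
```

In the actual framework each `lstsq` learner is replaced by a trained SVM/XGBoost/LightGBM/ANN/BiLSTM model, and `decompose` by SAO-tuned VMD.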
The above briefly describes the entire prediction framework; the workflow is easier to understand in combination with the code. The source code of this study is available at
link
