Estimation of state of health for lithium-ion batteries using advanced data-driven techniques


The suggested method for estimating the SOH of batteries comprises three stages, as illustrated in Fig. 7: data pre-processing, model training, and performance assessment. During the data preparation step, missing and anomalous values were eliminated from the raw battery data. Once the data was cleaned, the battery features were extracted2. A training dataset and a test dataset were created from the feature data. During the model training phase, the training dataset was further divided into training and validation data.

Adaboost, Xgboost, Ridge, DT, RF, ANN, and LSTM models were trained using the training data, and the architectures' hyper-parameters were tuned using the validation results. During the performance evaluation stage, the trained models were tested on the test dataset, and the R2 score, mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE) were used to assess model performance.

Data preprocessing

The data was pre-processed to make it fit for a ML model. Because there were anomalous and missing values in the raw data, data cleaning was necessary. After data cleaning, features were extracted from the raw time-series data. Equation (10) represents the feature format. Min-max normalisation2 was used to normalise the SOH dataset as follows:

$$z_{{nm}}^{k}=\frac{{x_{{nm}}^{k} - {\text{min}}\left( {{X_n}} \right)}}{{{\text{max}}\left( {{X_n}} \right) - {\text{min}}\left( {{X_n}} \right)}}\quad n \in \left\{ {1, \ldots ,5} \right\},\;m \in \left\{ {1, \ldots ,s} \right\}$$

(10)

Here, m is the m-th sampling point, k denotes the cycle number, and \(\:{X}_{n}\) is the array formed by the n-th feature row over all charging cycles. Min-max normalisation was likewise applied to normalise the capacity in the following manner:

Fig. 7

Overview of the proposed process for Li-ion battery SOH estimation.

$${c^k}=\frac{{{C^k} - {\text{min}}\left( C \right)}}{{{\text{max}}\left( C \right) - {\text{min}}\left( C \right)}}$$

(11)

Where C is the set of capacities over all charging cycles and k is the cycle number. After normalisation, the data was divided into two sets: the training dataset was used to fit the model parameters, and the test dataset was used to evaluate the final models. In SOH estimation for batteries using machine learning algorithms, data pre-processing is essential. Data re-sampling addresses class imbalances or irregular distributions by either increasing the number of samples in underrepresented classes or reducing them in overrepresented ones. Outlier detection, performed here with box plots, identifies and handles anomalies that could skew results. Data normalisation scales features to a uniform range or distribution, improving model convergence and performance. Techniques like min-max normalisation ensure that features contribute equally, leading to more accurate and reliable estimations. These pre-processing steps are crucial for enhancing model performance and achieving precise SOH estimates.
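As a concrete sketch, the min-max scaling of Eqs. (10) and (11) takes only a few lines of NumPy; the capacity values below are hypothetical and serve only to illustrate the arithmetic:

```python
import numpy as np

def min_max_normalize(x):
    """Scale a 1-D array to [0, 1], as in Eqs. (10)-(11)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Hypothetical capacity sequence (Ah) over charging cycles
capacity = np.array([2.0, 1.9, 1.8, 1.6, 1.5])
c_norm = min_max_normalize(capacity)   # e.g. 1.9 -> (1.9-1.5)/(2.0-1.5) = 0.8
```

The same function applies per feature row \(X_n\) for Eq. (10) and to the capacity vector C for Eq. (11).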

Model training

The training dataset was divided into training and validation data for the model training phase. Seven distinct machine learning models were applied to the dataset: Adaboost, Xgboost, Ridge, DT, RF, ANN, and LSTM architectures.

Ridge regression

Ridge regression, shown in Fig. 8, is a linear regression technique used to examine the relationship between input factors and a continuous target variable.

It is particularly useful when the input variables are correlated or when there is multicollinearity present in the data28. Ridge regression helps to regularise the model and lessen the effect of multicollinearity by adding a penalty term to the usual linear regression objective function.

Fig. 8

Ridge regression for SOH estimation.

The mathematical relationship between cell voltage, cell current, cell internal resistance, temperature, and the SOH estimate using Ridge regression can be represented by Eqs. (12) and (13):

$$\widehat {{SoH}}=~{\beta _0}+~{\beta _1}.{V_{cell}}+{\beta _2}.I+{\beta _3}.{R_{cell}}+{\beta _4}.T$$

(12)

Where \(\widehat {{SoH}}\) is the predicted State of Health; \({\beta _0},{\beta _1},{\beta _2},{\beta _3},{\beta _4}\) are the coefficients (weights) of the Ridge regression model; \({V_{cell}}\) is the cell voltage; I is the cell current; \({R_{cell}}\) is the internal resistance of the cell; and T is the temperature.

In Ridge regression, the coefficients\(~{\beta _0},~{\beta _1},~{\beta _2}\),\(~{\beta _3},{\beta _4}\) are estimated by minimizing the following objective function:

Minimize

$$\mathop \sum \limits_{{i=1}}^{N} {\left( {{y_i} - \left( {{\beta _0}+{\beta _1} \cdot {V_{cell}}+{\beta _2} \cdot I+{\beta _3} \cdot {R_{cell}}+{\beta _4} \cdot T} \right)} \right)^2}+\lambda \mathop \sum \limits_{{j=1}}^{p} \beta _{j}^{2}$$

(13)

Where \(N\) represents the number of samples, \(p\) represents the number of features, \({y_i}\) represents the observed SOH for sample i, \(\lambda\) represents the regularization parameter, and \(\mathop \sum \limits_{{j=1}}^{p} \beta _{j}^{2}\) represents the penalty term that penalizes large coefficients.

The objective is to find the values of \({\beta _0},{\beta _1},{\beta _2},{\beta _3},{\beta _4}\) that minimize the sum of squared errors plus the Ridge penalty. Ridge regression mitigates the effects of multicollinearity and stabilizes the model by shrinking the coefficients towards zero.
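One minimal way to illustrate the objective in Eq. (13) is its closed-form solution, \(\beta = (X^{T}X + \lambda I)^{-1}X^{T}y\). The sketch below implements that formula with NumPy on synthetic data; the feature values and coefficients are hypothetical stand-ins, not fitted battery parameters:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge solution: beta = (X^T X + lam*I)^-1 X^T y.
    A column of ones is prepended so beta[0] is the intercept beta_0;
    the intercept is conventionally left unpenalized."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    penalty = lam * np.eye(Xb.shape[1])
    penalty[0, 0] = 0.0                      # do not penalize the intercept
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)

def ridge_predict(beta, X):
    return beta[0] + X @ beta[1:]

# Hypothetical features per row: [V_cell, I, R_cell, T]
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
true_beta = np.array([0.9, -0.05, 0.02, -0.3, 0.01])
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.01, size=50)
beta = ridge_fit(X, y, lam=0.1)              # recovers beta_0 close to 0.9
```

With low noise and mild regularization, the recovered coefficients stay close to the generating ones, while larger λ shrinks them towards zero.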

Decision tree

In this method, the feature space is divided into regions, and a basic model is fitted within each region. To support decisions and their possible outcomes, a DT uses a hierarchical model based on chance events, resource costs, and utility29. The tree structure is made up of leaf, internal, branch, and root nodes arranged in a tree-like hierarchy, as shown in Fig. 9. The DT working principle is presented in mathematical form in Eq. (14).

Fig. 9

Pictorial representation of decision tree algorithms SOH estimation.

The mathematical relationship between cell voltage, cell current, cell internal resistance, temperature, and SOH estimation using decision tree regression can be represented by the following equation:

$$\widehat {{SoH}}=~\mathop \sum \limits_{{i=1}}^{N} {w_i}.{y_i}~$$

(14)

Where,\(\widehat {{SoH}}\) is the predicted State of Health, N represents the number of samples in the region to which the new data point belongs, \({w_i}\) represents the weight assigned to each training sample i, and \({y_i}~\) represents the SOH value of training sample i.

In order to forecast the target variable in DT regression, the model learns a series of if-then-else decision rules from the training set. The goal of the model’s training is to reduce the target variable’s variation within each leaf node.

The DT model equation for SOH estimation does not have a fixed form like linear regression, as it depends on the specific structure of the DT learned from the training data. The series of choices the DT makes, culminating in the ultimate estimation at the leaf node that corresponds to the new data point, represents the relationship between the input features and SOH.
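Because the learned tree structure is data-dependent, the region-averaging of Eq. (14) is easiest to see in a depth-1 tree (a stump), where each region predicts the mean of its training samples, i.e. Eq. (14) with uniform weights inside the region. The resistance and SOH values below are hypothetical:

```python
import numpy as np

def fit_stump(x, y):
    """Depth-1 regression tree: choose the split on one feature that
    minimizes the within-region sum of squared errors; each region then
    predicts its mean y (Eq. (14) with w_i = 1/N inside the region)."""
    best = None
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]                       # (threshold, left_mean, right_mean)

def stump_predict(model, x):
    t, lo, hi = model
    return np.where(x <= t, lo, hi)

# Hypothetical data: SOH drops once internal resistance passes a threshold
r_cell = np.array([0.01, 0.02, 0.03, 0.08, 0.09, 0.10])
soh    = np.array([0.98, 0.97, 0.96, 0.80, 0.79, 0.78])
model = fit_stump(r_cell, soh)            # splits at r_cell = 0.03
```

A full DT simply applies this split recursively inside each region until a stopping criterion is met.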

Adaboost

A machine learning technique called Adaboost is applied to regression and classification problems. It is an ensemble learning technique that builds a strong learner by combining several weak learners30, shown in Fig. 10. Using the same dataset, Adaboost trains weak learners iteratively and modifies the weights of the training examples in response to the past weak learners’ performance. This allows Adaboost to focus more on the instances that are difficult to classify or predict.

Fig. 10

Pictorial representation of Adaboost algorithms SOH estimation.

Adaboost regression is applied to regression tasks in which the objective is to predict a continuous target variable. The mathematical relationship between cell voltage, cell current, cell internal resistance, temperature, and the SOH estimate using Adaboost can be represented by Eq. (15):

$$\widehat {{SoH}}=~\mathop \sum \limits_{{t=1}}^{T} {\alpha _t}{h_t}~\left( {{V_{cell}},I,{R_{cell}},T} \right)$$

(15)

Where \(\widehat {{SoH}}\) represents the predicted State of Health, T represents the number of weak learners (base models), \({\alpha _t}\) represents the weight assigned to the t-th weak learner, and \({h_t}\) represents the t-th weak learner that predicts SOH based on the input features.

The Adaboost algorithm sequentially adds weak learners to the ensemble, with each new weak learner focusing on the instances misclassified or poorly predicted by the previous weak learners. The weights \({\alpha _t}\) ​are computed based on the performance of the weak learners, giving more weight to the more accurate learners. A weighted summation of the estimations made by each of the ensemble’s weak learners makes up the final SOH forecast, where the weights are determined by the Adaboost algorithm. This allows Adaboost to create a strong learner that can effectively predict SOH based on the input features.
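The combination step of Eq. (15) can be sketched directly as a weighted sum of weak-learner outputs. Note that this is a simplified sketch using a normalized weighted average, whereas the standard AdaBoost.R2 variant combines weak predictions with a weighted median; the two rule-based learners and their weights below are hypothetical:

```python
import numpy as np

def adaboost_predict(weak_learners, alphas, X):
    """Eq. (15): weighted combination of weak-learner outputs.
    Weights are normalized so the result stays on the SOH scale."""
    alphas = np.asarray(alphas, dtype=float)
    preds = np.array([h(X) for h in weak_learners])   # shape (T, n_samples)
    return (alphas[:, None] * preds).sum(axis=0) / alphas.sum()

# Hypothetical weak learners: crude rules on rows of [V_cell, I, R_cell, T]
h1 = lambda X: np.where(X[:, 2] > 0.05, 0.80, 0.95)  # internal-resistance rule
h2 = lambda X: np.where(X[:, 3] > 40.0, 0.82, 0.94)  # temperature rule
X = np.array([[3.7, 1.0, 0.02, 25.0],
              [3.5, 1.0, 0.09, 45.0]])
soh_hat = adaboost_predict([h1, h2], [0.7, 0.3], X)
```

In training, the \(\alpha_t\) values and the weak learners themselves would be produced iteratively, with sample weights increased on poorly predicted instances.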

Xgboost

Xgboost is a widely used MLA, popular for both regression and classification tasks. It uses an ensemble learning method to create a robust predictive model by sequentially combining multiple weak models, typically DTs, as shown in Fig. 11.

Fig. 11

Pictorial representation of Xgboost algorithms SOH estimation.

The Xgboost regression technique adds DTs to the ensemble iteratively, with each new tree correcting the mistakes made by the preceding trees30. The final forecast is the weighted sum of the estimations made by each tree in the ensemble. Equation (16) represents the mathematical link between cell voltage, cell current, internal resistance, temperature, and the SOH estimate using Xgboost:

$$\widehat {{SoH}}=f\left( {{V_{cell}},I,{R_{cell}},T} \right)~$$

(16)

Where, \(\widehat {{SoH}}~\)represents the predicted State of Health, \({V_{cell}}~\) represents the cell voltage, I represents the cell current,\({R_{cell}}~\)represents the cell’s internal resistance, T represents the temperature, f represents the function learned by the Xgboost model.

The exact form of the function f is learned during the training of the Xgboost model using a dataset that includes measurements of SOH, cell voltage, cell current, cell internal resistance, and temperature. By optimizing the weights given to each feature and the parameters of the ensemble's DTs, Xgboost discovers the correlation between these input features and SOH. It is important to note that the Xgboost model learns a complex, non-linear relationship, which enables it to capture the subtleties of the input data and their effect on the SOH estimate.
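Since the function f in Eq. (16) is the whole learned ensemble, the underlying mechanism is easiest to see in a hand-rolled gradient-boosting loop for squared error, where each round fits a stump to the current residuals. This is a simplified sketch of the boosting principle, not the Xgboost library itself, and the cycle/SOH data are hypothetical:

```python
import numpy as np

def fit_mean_stump(x, y):
    """Depth-1 tree: best variance-reducing split on a single feature."""
    best = None
    for t in np.unique(x)[:-1]:
        l, r = y[x <= t], y[x > t]
        sse = ((l - l.mean()) ** 2).sum() + ((r - r.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, l.mean(), r.mean())
    return best[1:]

def boost(x, y, n_rounds=20, lr=0.3):
    """Gradient boosting for squared error: each round fits a stump to the
    current residuals and adds a shrunken copy to the ensemble."""
    pred = np.full_like(y, y.mean(), dtype=float)
    trees = []
    for _ in range(n_rounds):
        t, lo, hi = fit_mean_stump(x, y - pred)
        pred += lr * np.where(x <= t, lo, hi)
        trees.append((t, lo, hi))
    return trees, pred

# Hypothetical noiseless SOH fade as a function of cycle count
cycles = np.arange(10.0)
soh = 1.0 - 0.02 * cycles
trees, fitted = boost(cycles, soh)        # residuals shrink round by round
```

Xgboost adds regularization, second-order gradients, and many engineering optimizations on top of this residual-fitting idea.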

Random forest algorithm

Random forest regression is an MLA that combines a collection of DTs to perform regression tasks. Each DT in the RF is trained on a random subset of the training data and a random subset of the features, which reduces overfitting and enhances the model's generalization capabilities31. The final estimation of the RF is the average of the estimations of all the individual trees. The mathematical relationship between cell voltage, cell current, cell internal resistance, temperature, and the SOH estimate using RF regression can be represented by Eq. (17):

$$\widehat {{SoH~}}=~\frac{1}{N}~\mathop \sum \limits_{{i=1}}^{N} \widehat {{So{H_i}}}~$$

(17)

Where \(\widehat {{SoH~}}~\) represents the predicted SOH, N represents the number of decision trees in the RF, \(\widehat {{So{H_i}}}\) ​ represents the predicted SOH from the i-th decision tree.

Each DT in RF regression is trained using a random selection of both the features and the training data. This unpredictability aids in lowering overfitting and enhances the model’s overall effectiveness. The average of each individual tree’s forecasts makes up the RF’s final estimation, which serves to smooth out the estimations and raise the model’s accuracy, as shown in Fig. 12. As a result, complicated correlations in the data can be captured using RF regression, enabling precise estimations for the target variable.

Fig. 12

Pictorial representation of random forest algorithms SOH estimation.
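The ensemble-averaging step of Eq. (17) can be sketched as follows. To keep the focus on bootstrap sampling and averaging, each "tree" is reduced to a nearest-neighbour lookup on one feature (a fully grown single-feature tree behaves as a piecewise-constant nearest-neighbour rule), and the data are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

def bootstrap_tree_predictions(x, y, x_new, n_trees=50):
    """Sketch of Eq. (17): train each 'tree' on a bootstrap resample and
    average the per-tree predictions. Each tree here is a 1-nearest-
    neighbour lookup so the averaging step stays visible."""
    preds = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(x), size=len(x))     # bootstrap resample
        xb, yb = x[idx], y[idx]
        nearest = np.abs(xb[:, None] - x_new[None, :]).argmin(axis=0)
        preds.append(yb[nearest])
    return np.mean(preds, axis=0)                      # the Eq. (17) average

# Hypothetical linear SOH fade over cycles
cycles = np.linspace(0, 100, 21)
soh = 1.0 - 0.002 * cycles
soh_hat = bootstrap_tree_predictions(cycles, soh, np.array([50.0]))
```

Averaging many decorrelated predictors smooths out the individual trees' variance, which is the mechanism behind the RF's stability.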

Artificial neural network (ANN)

A class of ML techniques called ANNs is inspired by the structure and operation of the human brain. They are applied to many different tasks, such as pattern recognition, clustering, regression, and classification32. After being received by the input layer, the data is processed in one or more hidden layers before being sent to the output layer, which makes the estimation, as shown in Fig. 13. Equation (18), for a basic feedforward neural network with one hidden layer, represents the mathematical link between cell voltage, cell current, cell internal resistance, temperature, and the SOH estimate using an ANN:

$$\widehat {{SoH}}=f\left( {\mathop \sum \limits_{{i=1}}^{n} w_{i}^{{(2)}} \cdot g\left( {\mathop \sum \limits_{{j=1}}^{m} w_{{ij}}^{{(1)}} \cdot {x_j}+b_{i}^{{(1)}}} \right)+{b^{(2)}}} \right)$$

(18)

Where \(\widehat {{SoH}}\) represents the predicted State of Health; g is the ReLU activation function of the hidden layer and f is the output-layer activation; \({w_{ij}}\) represents the weights connecting neurons in adjacent layers; \({x_j}\) represents the input features (e.g., cell voltage, cell current, cell internal resistance, temperature); \({b_j}\) represents the bias terms; \(n\) represents the number of neurons in the hidden layer; and \(m\) represents the number of input features.

The network learns to map the input features to the target variable by modifying the weights and biases to reduce the discrepancy between the predicted and actual SOH in the training data. ANNs are well suited to this task because of their reputation for capturing intricate non-linear relationships in data.

Fig. 13

Artificial neural network SOH estimation.
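Eq. (18) corresponds to a single forward pass, which the following NumPy sketch makes explicit; all weights, biases, and inputs are hypothetical values chosen so the arithmetic can be checked by hand:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def ann_forward(x, W1, b1, W2, b2):
    """One-hidden-layer forward pass of Eq. (18):
    h = g(W1 x + b1) with ReLU g, output = W2 . h + b2
    (an identity output activation, common for regression)."""
    h = relu(W1 @ x + b1)        # hidden layer, n neurons
    return W2 @ h + b2           # scalar SOH estimate

# Hypothetical tiny network: 4 inputs [V_cell, I, R_cell, T], 3 hidden units
W1 = np.array([[0.2, -0.1,  1.0, 0.00],
               [0.0,  0.0, -2.0, 0.01],
               [0.1,  0.1,  0.0, 0.00]])
b1 = np.array([0.0, 0.1, -0.2])
W2 = np.array([-0.5, 0.3, 0.2])
b2 = 0.9
x = np.array([3.7, 1.0, 0.05, 25.0])
soh_hat = ann_forward(x, W1, b1, W2, b2)
```

Training would adjust W1, b1, W2, b2 by gradient descent on the prediction error, as described above.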

Long short-term memory (LSTM)

The LSTM model is a type of RNN that captures long-term dependencies in sequential data, as shown in Fig. 14. For battery SOH estimation32, it analyzes historical data (voltage, current, cycles, etc.) to predict battery health and remaining life, ensuring optimal performance and longevity, as represented in Eqs. (19–23):

$${i_t}=~\sigma ~\left( {{W_{ix}}.{x_t}+~{W_{ih}}.{h_{t – 1}}+~{W_{ic}}.{c_{t – 1}}+{b_i}} \right)~$$

(19)

$${f_t}=\sigma \left( {{W_{fx}}.{x_t}+~{W_{fh}}.{h_{t – 1}}+{W_{fc}}.{c_{t – 1}}+~{b_f}} \right)$$

(20)

$$~{c_t}=~{f_t}.{c_{t – 1}}+{i_t}.\tanh \left( {{W_{cx}}.{x_t}+~{W_{ch}}~.~{h_{t – 1}}+~{b_c}} \right)~~~$$

(21)

$${o_t}=\sigma \left( {{W_{ox}} \cdot {x_t}+{W_{oh}} \cdot {h_{t - 1}}+{W_{oc}} \cdot {c_t}+{b_o}} \right)$$

(22)

$${h_t}=~{o_t}.\tanh \left( {{c_t}} \right)~$$

(23)

Where, \({i_t},{f_t},~~{o_t}~\) represent the input, forget, and output gates, \({c_t}\) represents the state of the cell, \({h_t}~\) represents the hidden state, \({x_t}~\) represents the input at time step t (e.g., cell voltage, cell current, cell internal resistance, temperature), W represents the weight matrices, \(b~\) represents the bias terms, \(\sigma\) represents the sigmoid activation function, \(tanh\) represents the hyperbolic tangent activation function. The LSTM cell processes the input sequence \({x_1},{x_2}, \ldots \ldots .{x_t}\) sequentially, updating its cell state and hidden state at each time step. The cell state retains information over long sequences, allowing the LSTM to capture long-term dependencies in the data. The final estimation for the SOH is made based on the hidden state \({h_t}\) or by passing it through additional layers of the LSTM or other neural network architectures.

Fig. 14

Long short-term memory SOH estimation.
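A single step of Eqs. (19)–(23), including the peephole terms \(W_{ic}\), \(W_{fc}\), \(W_{oc}\) that appear in those equations, can be sketched in NumPy as follows; the toy dimensions and zero-initialized weights are hypothetical, chosen so the gate values are easy to trace:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step implementing Eqs. (19)-(23). Peephole weights
    (Wic, Wfc, Woc) are diagonal, so they act elementwise on the cell state.
    p is a dict of weights sized for this toy example."""
    i = sigmoid(p["Wix"] @ x_t + p["Wih"] @ h_prev + p["Wic"] * c_prev + p["bi"])  # Eq. (19)
    f = sigmoid(p["Wfx"] @ x_t + p["Wfh"] @ h_prev + p["Wfc"] * c_prev + p["bf"])  # Eq. (20)
    c = f * c_prev + i * np.tanh(p["Wcx"] @ x_t + p["Wch"] @ h_prev + p["bc"])     # Eq. (21)
    o = sigmoid(p["Wox"] @ x_t + p["Woh"] @ h_prev + p["Woc"] * c + p["bo"])       # Eq. (22)
    h = o * np.tanh(c)                                                             # Eq. (23)
    return h, c

# Toy sizes: 4 inputs [V_cell, I, R_cell, T], 2 hidden units; all-zero
# weights except the cell bias, so every gate evaluates to sigmoid(0) = 0.5.
n_in, n_h = 4, 2
zeros = lambda *s: np.zeros(s)
p = {k: zeros(n_h, n_in) for k in ("Wix", "Wfx", "Wcx", "Wox")}
p.update({k: zeros(n_h, n_h) for k in ("Wih", "Wfh", "Wch", "Woh")})
p.update({k: zeros(n_h) for k in ("Wic", "Wfc", "Woc", "bi", "bf", "bc", "bo")})
p["bc"] = np.ones(n_h)
h, c = lstm_step(np.array([3.7, 1.0, 0.05, 25.0]), zeros(n_h), zeros(n_h), p)
```

Iterating this step over the sequence \(x_1, \ldots, x_t\) yields the hidden state \(h_t\) from which the SOH estimate is read out.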

Model testing

Model testing involves evaluating the previously constructed ML and LSTM models using the test data. The models are tested using one to seven input parameters according to the test plan33. To facilitate comparison of the estimation results, only the last 66,302 data points are taken after the data is converted into a series for the subsequent step.
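The series-conversion and tail-selection steps described above can be sketched as follows; the window length and the stand-in signal are hypothetical (the actual study keeps the last 66,302 points):

```python
import numpy as np

def make_series(values, window):
    """Convert a 1-D signal into (window -> next value) supervised pairs,
    then the most recent pairs can be held out for testing, mirroring the
    'last N points' test protocol described above."""
    X = np.array([values[i:i + window] for i in range(len(values) - window)])
    y = values[window:]
    return X, y

signal = np.arange(100, dtype=float)        # hypothetical stand-in series
X, y = make_series(signal, window=5)
X_test, y_test = X[-10:], y[-10:]           # chronological tail as test set
```

Splitting chronologically (rather than randomly) avoids leaking future cycles into the training set.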

Performance evaluation

The MSE and RMSE were computed to assess the models' performance on the test data set. Furthermore, the MAE and R2 score were calculated for the performance evaluation. The outcomes of the model test are used to compute performance using the following metrics:

Mean squared error (MSE)

MSE computes the average squared difference between the actual and predicted values2. MSE is sensitive to outliers, since squaring the differences penalises greater errors more than smaller ones; it is represented in Eq. (24). Because its estimations are closer to the actual values, a model with a lower MSE has superior predictive accuracy.

$$MSE=~\frac{1}{N}\mathop \sum \limits_{{i=1}}^{N} {\left( {{C_i} – \widehat {{{C_i}}}} \right)^2}$$

(24)

Where \({C_i}\), represents the actual capacity, and \({\hat {C}_i}\) represents the estimated capacity, and N represents the number of datasets.

In the SOH estimation for Li-ion batteries, MSE is used to assess how accurately a model predicts the SOH.

Root Mean Squared Error (RMSE)

RMSE evaluates the precision of the estimation results by taking the square root of the mean squared error between the expected and actual values24. The RMSE value can be found using Eq. (25):

$$RMSE=\sqrt {\frac{1}{N}\mathop \sum \limits_{{i=1}}^{N} {{\left( {{C_i} – \widehat {{{C_i}}}} \right)}^2}~~}$$

(25)

Where \({C_i}\), represents the actual capacity,\({\hat {C}_i}\) represents the estimated capacity, and N represents the number of datasets.

Mean absolute error (MAE)

The MAE measures the absolute error between the actual data and the predicted data, regardless of whether the error is positive or negative26. The formula used to obtain the MAE value is shown in Eq. (26):

$$MAE=~\frac{1}{N}\mathop \sum \limits_{{i=1}}^{N} \left| {\left( {{C_i} – \widehat {{{C_i}}}} \right)} \right|~$$

(26)

Where \({C_i}\) represents the actual capacity, \({\hat {C}_i}\) represents the estimated capacity, and N represents the number of datasets.

R-Squared (R2)

The coefficient of determination (R2) examines the relationship between the actual and predicted data. The value of R2 ranges from -∞ to 1, with values closer to 1 indicating a better fit of the model to the dataset11,24. The formula used to determine R2 is shown in Eq. (27):

$${R^2}=1 – ~\frac{{\mathop \sum \nolimits_{{i=1}}^{N} {{\left( {{C_i} – {{\hat {C}}_i}} \right)}^2}}}{{\mathop \sum \nolimits_{{i=1}}^{N} {{\left( {{C_i} – \bar {C}} \right)}^2}}}$$

(27)

Where \({C_i}\) represents the actual capacity, \({\hat {C}_i}\) represents the estimated capacity, \(\bar {C}\) represents the mean of the actual capacity, and N represents the number of datasets.
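The four metrics of Eqs. (24)–(27) can be computed together in a few lines; the actual and estimated capacity values below are hypothetical:

```python
import numpy as np

def evaluate(c_true, c_pred):
    """Compute the four metrics of Eqs. (24)-(27)."""
    c_true, c_pred = np.asarray(c_true, float), np.asarray(c_pred, float)
    err = c_true - c_pred
    mse = np.mean(err ** 2)                                              # Eq. (24)
    rmse = np.sqrt(mse)                                                  # Eq. (25)
    mae = np.mean(np.abs(err))                                           # Eq. (26)
    r2 = 1.0 - (err ** 2).sum() / ((c_true - c_true.mean()) ** 2).sum()  # Eq. (27)
    return mse, rmse, mae, r2

# Hypothetical actual vs. estimated capacities
mse, rmse, mae, r2 = evaluate([2.0, 1.9, 1.8, 1.7], [2.0, 1.85, 1.85, 1.7])
```

Note that MSE and RMSE weight large errors more heavily than MAE, while R2 normalizes the residual error by the variance of the actual capacities.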
