Hybrid bagging and boosting with SHAP based feature selection for enhanced predictive modeling in intrusion detection systems


The results section demonstrates the proposed model's efficacy, showing reasonably accurate predictions across various test scenarios. Key performance metrics, including ROC-AUC, precision, and SHAP-based feature importance, reveal much about the model's behavior and predictive power.

Confusion matrix with SMOTE

This section presents the performance of binary and multiclass classification models applied to the CIC-IDS2017 dataset, with the Synthetic Minority Oversampling Technique (SMOTE) applied to address class imbalance.
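SMOTE balances the classes by interpolating synthetic minority samples between existing ones. The sketch below shows that interpolation step on toy data; a real pipeline would use `imblearn.over_sampling.SMOTE`, which restricts interpolation to k-nearest neighbours rather than arbitrary pairs as this simplified version does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced set: 100 "normal" rows vs. 10 "attack" rows (3 features).
X_major = rng.normal(0.0, 1.0, size=(100, 3))
X_minor = rng.normal(3.0, 1.0, size=(10, 3))

def smote_like(X, n_new, rng):
    """Interpolate between a minority sample and a random minority
    neighbour, as SMOTE does (simplified: any neighbour, not k-nearest)."""
    i = rng.integers(0, len(X), size=n_new)
    j = rng.integers(0, len(X), size=n_new)
    gap = rng.random((n_new, 1))               # position along the segment
    return X[i] + gap * (X[j] - X[i])

X_synth = smote_like(X_minor, len(X_major) - len(X_minor), rng)
X_minor_balanced = np.vstack([X_minor, X_synth])   # now 100 vs. 100
```

Because every synthetic point lies on a segment between two real minority points, the oversampled class stays inside the minority region rather than drifting into majority territory.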

Figure 5 displays the proposed model's confusion matrix and receiver operating characteristic (ROC) curve. The confusion matrix reveals high accuracy, with 127,981 TN and 132,024 TP, alongside 4,014 FP and no FN. The corresponding ROC curve shows an area under the curve (AUC) of 1.0000, indicating perfect class discrimination. Figure 6 similarly presents the model's confusion matrix and ROC curve under different conditions, possibly another model iteration. In this case, the confusion matrix records 126,506 TN, 130,786 TP, 5,489 FP, and 1,238 FN. This setup's ROC curve shows a slightly lower AUC of 0.9790, signifying a minor decline in classification performance.
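The TN/TP/FP/FN counts and AUC values quoted throughout this section come from two standard primitives, sketched here with scikit-learn on illustrative labels and scores (not the paper's actual predictions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Illustrative labels and scores, not the paper's actual predictions.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.10, 0.20, 0.80, 0.30, 0.90, 0.70, 0.60, 0.95])
y_pred = (y_score >= 0.5).astype(int)

# ravel() flattens the 2x2 matrix into the four counts quoted in the text.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
auc = roc_auc_score(y_true, y_score)           # threshold-free AUC
```

Note that the confusion matrix depends on the 0.5 threshold, while the AUC summarizes ranking quality over all thresholds, which is why the two can tell slightly different stories in Figs. 5 and 6.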

Fig. 5

Confusion matrices and ROC curves with SMOTE.

Fig. 6

Confusion matrices and ROC curves with SMOTE.

Figure 7 shows the confusion matrix and ROC curves for the various classes in the dataset. The confusion matrix indicates the model's ability to distinguish between multiple classes, including 'BENIGN', 'DoS GoldenEye', 'DoS Hulk', 'DoS Slowhttptest', 'DoS slowloris', and 'Heartbleed'. The matrix highlights both correct predictions and misclassifications, such as 121,133 TN for 'BENIGN' and 127,774 TP for 'DoS GoldenEye', alongside a notable number of misclassified instances across the other classes. The ROC curves display the model's performance for each class, with AUC values ranging from 0.82 to 0.99, suggesting varying degrees of classification accuracy among the different types of network traffic.

Fig. 7

Confusion matrices and ROC curves with SMOTE.

Figure 8 similarly depicts the confusion matrix and ROC curves, reflecting another evaluation of the same multiclass classification model. The confusion matrix demonstrates the distribution of correctly and incorrectly classified instances, with 124,214 TN for 'BENIGN' and 121,061 TP for 'DoS GoldenEye'. Misclassifications are also noted, particularly within 'DoS Hulk' and other attack types. The ROC curves for this evaluation indicate AUC values between 0.85 and 1.00, underscoring solid classification performance overall, although some classes exhibit lower discriminative power.

Fig. 8

Confusion matrices and ROC curves with SMOTE.

Confusion matrix without SMOTE

This section presents the performance of binary and multiclass classification models applied to the CIC-IDS2017 dataset without applying SMOTE to address class imbalance.

Figure 9 showcases the confusion matrix and ROC curve for the model. The confusion matrix shows 125,166 TN and 3,074 TP, with 6,844 FP and 14 FN. The ROC curve exhibits an AUC of 0.9718, indicating overall solid performance despite the absence of SMOTE. Figure 10 also presents the confusion matrix and ROC curve for the model. The confusion matrix reports 129,537 TN and 2,649 TP, alongside 2,511 FP and 401 FN. The corresponding ROC curve has a slightly higher AUC of 0.9887, demonstrating improved discriminative ability.

Fig. 9

Confusion matrices and ROC curves without SMOTE.

Fig. 10

Confusion matrices and ROC curves without SMOTE.

Figure 11 reveals that while the model robustly classifies 'BENIGN' traffic, evidenced by 130,010 TP, there are notable misclassifications, particularly with 'DoS Hulk' attacks, which show many FN (42,060). The corresponding ROC curves reflect the model's varying ability to distinguish between different attack types: the AUC for 'BENIGN' traffic is 0.950, indicating high discriminative power, whereas 'DoS Slowloris' and 'DoS Slowhttptest' have lower AUC values of 0.31 and 0.26, respectively, indicating poorer performance for these classes.

Fig. 11

Confusion matrices and ROC curves without SMOTE.

Figure 12 shows improved detection for underrepresented attack types, such as 'DoS Hulk', with an increased TP count of 29,180 and reduced FN (42,353). The ROC curves corroborate this improvement, showing enhanced AUC values across most classes, notably 'BENIGN' (0.930) and 'DoS Hulk' (0.930). However, some attack types still exhibit classification challenges, as indicated by 'DoS Slowhttptest' (AUC of 0.91) and 'Heartbleed' (AUC of 0.860).

Fig. 12

Confusion matrices and ROC curves without SMOTE.

Despite the robust performance, the confusion matrices reveal misclassifications, including normal instances predicted as attacks. These FPs and FNs indicate areas for improvement in the model's predictions for both binary and multiclass classification.

A high peak in accuracy and F1 score is promising; however, for real-world IDPS these metrics require more careful interpretation. High accuracy means the model generalizes well, classifying most normal and attack instances correctly. However, for real-world IDPS, accuracy alone is not an appropriate measure because of the class imbalance problem: a high number of normal-traffic instances can skew the results, so precision and recall are the more essential metrics to focus on.

In practice, the performance measure of greatest interest for IDPS is the F1 score. A higher F1 indicates that the model detects most attacks (high recall) while generating few false alarms (high precision), which is very important for real-world applications: a high false-positive rate results in unnecessary interventions, increasing system overhead and operational cost, whereas false negatives may allow hidden attacks to compromise network security.

The high F1 score therefore reflects the model's potential to be reliable for intrusion detection while maintaining operational efficiency, making it more applicable to wide-scale real-world deployment. In other words, while the metrics look promising, their interpretation underlines that the model can balance detection and false-alarm rates, which means more realistic, efficient, effective, and practical intrusion detection.
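The accuracy-vs-F1 distinction above is easy to demonstrate numerically. The sketch below uses hypothetical traffic (95 normal flows, 5 attacks) and a degenerate model that predicts "normal" for everything: accuracy still reads 95%, while recall and F1 correctly collapse to zero.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical imbalanced traffic: 95 normal flows (0), 5 attacks (1).
# A degenerate model predicting "normal" for everything still reaches
# 95% accuracy while detecting zero attacks.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)                 # misleadingly high
rec = recall_score(y_true, y_pred, zero_division=0)  # no attacks caught
f1 = f1_score(y_true, y_pred, zero_division=0)       # exposes the failure
```

This is exactly why the section argues that precision, recall, and F1, not raw accuracy, should drive IDPS evaluation under class imbalance.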

Extending the HBB-RE model to multiclass classification required several significant modifications to remain robust while managing the added complexity of the multiclass setting. While initially designed for binary classification, HBB-RE was extended using a hierarchical strategy: two successive classification stages refine class discrimination iteratively. The model first performs binary classification between normal traffic and suspicious activity, followed by several one-vs-rest classifications that identify specific attack types.
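The two-stage hierarchy described above can be sketched as follows. This is a minimal illustration on synthetic data with stock random forests standing in for the HBB-RE ensemble; the paper's actual base learners, residual boosting, and SHAP-selected features are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(1)

# Toy stand-in for flow data: label 0 = normal, 1..3 = attack types.
X = rng.normal(size=(400, 5))
y = rng.integers(0, 4, size=400)
X[y > 0] += 1.5 * y[y > 0, None]          # separate the classes a little

# Stage 1: binary screen, normal vs. any suspicious traffic.
stage1 = RandomForestClassifier(n_estimators=50, random_state=0)
stage1.fit(X, (y > 0).astype(int))

# Stage 2: one-vs-rest over the attack types, trained on attack rows only.
stage2 = OneVsRestClassifier(RandomForestClassifier(n_estimators=50, random_state=0))
stage2.fit(X[y > 0], y[y > 0])

def hierarchical_predict(X_new):
    out = np.zeros(len(X_new), dtype=int)          # default: normal (0)
    flagged = stage1.predict(X_new).astype(bool)   # only flagged rows go on
    if flagged.any():
        out[flagged] = stage2.predict(X_new[flagged])
    return out

preds = hierarchical_predict(X)
```

The structure makes the error-propagation concern discussed below concrete: any flow the stage-1 screen misses never reaches the attack-type classifier.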

Several ensemble components had to be carefully tuned to implement this multiclass strategy, paying particular attention to residual boosting so that residual errors from one class would not cascade or excessively bias the identification of other classes. Applying SHAP for feature selection at each classification stage highlights the most relevant features, supporting the model's decision-making and interpretability across the hierarchy.

The main challenges of this extension were handling class imbalance, since some attack types had few instances, making it very difficult to train a classifier for them, and error propagation, since early-stage misclassifications affect downstream multiclass predictions. These were mitigated by integrating resampling techniques and fine-tuning hyperparameters for each class.

This extension has shown that HBB-RE can generalize across different attack types. It also provides insight into adapting ensemble models for multiclass intrusion detection, from which further research can extend scalability and precision in real-world IDPS.

The results of the binary and multiclass classification models on the CIC-IDS2017 dataset, with and without SMOTE, show high accuracy (AUC) and robust classification metric scores. These results show that the models are discriminative, especially across different attack types. However, the confusion matrices show that although the models are very effective, there is still room for improvement: refining them to reduce errors such as FPs and FNs would further strengthen their predictive ability. These models have the potential to contribute to the early detection and prevention of network intruders, thereby providing a safer and more reliable network environment.

Learning curve analysis

Figure 13 illustrates learning curves comparing binary classification performance with and without SMOTE. Panels (a) and (b) show the results with SMOTE applied, while panels (c) and (d) represent models trained without SMOTE. Panel (a) features high training accuracy and an upward trend in cross-validation accuracy as the sample size increases, indicating better model generalization after SMOTE. The minor fluctuations in the cross-validation score in panel (b) may hint at the model's sensitivity to changes in the data and suggest directions for further optimization. In panel (c), the mid-range cross-validation accuracy indicates that specific tuning would be required to handle the original distribution of the data. Panel (d), on the other hand, reflects a consistent trend between training and cross-validation scores, suggesting that class weighting alone may be adequate to support model generalization for some problems. These observations highlight how SMOTE and class weighting can help the stability and generalization of models across different data-handling approaches.

Fig. 13

Learning curve for binary class with/without SMOTE.
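Curves like those in Figs. 13 and 14 can be generated with scikit-learn's `learning_curve`, which refits the model on growing subsets and cross-validates each one. The sketch below uses a synthetic imbalanced stand-in for the flow data and a logistic-regression placeholder rather than the paper's ensemble:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Imbalanced synthetic stand-in for the flow data (80/20 split).
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.8, 0.2], random_state=0)

# Each panel plots these two curves: training accuracy and
# cross-validation accuracy versus number of training samples.
sizes, train_scores, cv_scores = learning_curve(
    LogisticRegression(max_iter=1000, class_weight="balanced"),
    X, y, train_sizes=np.linspace(0.2, 1.0, 4), cv=5, scoring="accuracy")

train_mean = train_scores.mean(axis=1)   # average over the 5 folds
cv_mean = cv_scores.mean(axis=1)
```

A persistent gap between `train_mean` and `cv_mean` signals overfitting, while two converging curves, as in the SMOTE panels, signal good generalization.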

Figure 14 presents learning curves for multiclass classification with SMOTE and with class weighting. Panels (a) and (b) show the results with SMOTE applied: in panel (a), the training and cross-validation accuracies steadily rise as the number of training samples increases and converge to approximately 0.92, indicating that good generalization is achieved with SMOTE for multiclass classification. Panel (b) shows both measures increasing monotonically to near convergence at 0.91, indicating that SMOTE allows stable learning across classes. Panels (c) and (d) present the results without SMOTE but with class weighting: in panel (c), accuracy stabilizes around 0.86, with minor differences between training and cross-validation scores, suggesting balanced generalization even without SMOTE. Panel (d), on the other hand, shows lower and spikier performance, eventually stabilizing at around 0.82, which may indicate problems with capturing minority classes without SMOTE. Overall, these results suggest that SMOTE improves the model's performance and generalization on these multiclass problems, while class weighting is moderately successful but more sensitive to data imbalance.

Fig. 14

Learning curve for multiclass with/without SMOTE.

Explainable AI analysis

Explainable AI comprises the methodologies and techniques that make machine learning behavior and predictions understandable to humans37. Traditional artificial intelligence models can be highly performant but are often considered black boxes: it is hard to understand the reasoning behind their decisions38. This opacity is a significant problem in domains where trust and accountability are crucial, such as cybersecurity39,40. Explainable AI helps by exposing the essential reasoning behind those models and indicating which features or patterns led to which conclusions. Such transparency is crucial for verifying a model's reliability, diagnosing possible biases, and ensuring compliance with the domain knowledge and requirements to which it will be applied41.

LIME visualization allows for an in-depth analysis of how the model processes individual data points42, which is crucial for model validation and debugging, and in cases where model interpretability is necessary, such as in domains with regulatory requirements that must be explained to stakeholders43.

Limitations of existing methods

A critical motivation for integrating Explainable AI techniques, such as SHAP-based feature selection, into cybersecurity is that traditional intrusion detection methods tend to be "black boxes." The lack of insight into their decision-making process presents challenges for understandability, trust, and compliance in environments that require accountability.

To address these shortcomings of existing systems, the proposed approach incorporates SHAP feature selection to shed light on which features are most responsible for detecting cyber threats. This enhances model interpretability and allows security analysts to understand which indicators of malicious activity are most relevant. Moreover, it builds a better working relationship in which analysts can place more trust in system decisions, monitor more effectively, and respond faster.

Beyond that, applying SHAP also benefits model complexity: selecting only the most impactful features improves computational efficiency without sacrificing performance. This is particularly important for intrusion detection systems, where real-time speed is essential. Explainability also enables compliance with regulations that require transparency in automated decision-making.

In other words, integrating Explainable AI techniques such as SHAP directly tackles the shortcomings of current cybersecurity approaches by improving intrusion detection systems' transparency, feature selection, and interpretability.

SHAP analysis

The SHAP analysis shows the impact of features on binary and multiclass classification performance, using SHAP values for the Bagging and Boosting on Residuals models with and without SMOTE.

SHAP summary plot

Figure 15 highlights each feature's contribution to the final result for a particular instance via the SHAP Summary Plot (Fig. 15a) and SHAP Waterfall Plot (Fig. 15b). The colored dots in the summary plot represent the SHAP value of each feature, which determines how the feature influences the model's output. Features are sorted in descending order of relevance, and the color bar for each feature identifies whether its value is high or low. For instance, the feature 'Flow IAT Max' significantly affects the model prediction when its value is high.
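What a summary plot actually ranks is the mean absolute SHAP value per feature. For a linear model with independent features, SHAP values have a simple closed form, which makes the mechanics easy to verify without the `shap` package; this is a didactic sketch, not the paper's tree-ensemble explainer.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Fit a simple linear surrogate; feature 0 is constructed to dominate.
X = rng.normal(size=(200, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=200)
model = LinearRegression().fit(X, y)

# For a linear model with independent features, SHAP values have a closed
# form: phi_j(x) = w_j * (x_j - E[x_j]); the base value is the mean prediction.
base = model.predict(X).mean()
phi = model.coef_ * (X - X.mean(axis=0))   # one row of SHAP values per sample

# A summary plot ranks features by mean |phi|; additivity guarantees that
# base + sum_j phi_j(x) reproduces the model's prediction for every sample.
ranking = np.argsort(-np.abs(phi).mean(axis=0))
reconstructed = base + phi.sum(axis=1)
```

The additivity property (`reconstructed` equals the model output exactly) is the same one the waterfall and force plots later in this section rely on.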

Fig. 15

SHAP with SMOTE indicating robust bagging and boosting binary classification performance.

Figure 16 gives insight into the SHAP feature values. The SHAP Summary Plot (Fig. 16a) represents the features in descending order of importance, with their SHAP values and a bar indicating each feature value's contribution to the model output. In this model, 'Flow IAT Max' has the most significant impact on attack classification, with SHAP values ranging from approximately -0.6 to 0.6, a wider range than any other feature; higher feature values (red) contribute to higher predictions. Other significant features that contribute both positively and negatively to classification are 'Destination Port', 'min seg size forward', and 'Init Win bytes forward'.

Fig. 16

SHAP with SMOTE, indicating robust boosting of residuals’ binary classification performance.

Figure 17a lists the features on the left side of the chart, with each dot representing one sample's contribution. For instance, 'Flow IAT Max' is an essential feature: its SHAP values range from about -0.6 to 0.6, so it is significant for the model's prediction. A high value of this feature (red) increases the prediction, while a low value (blue) decreases it. Another example is 'Init Win bytes forward': the blue dots on the left indicate that low values push the prediction down, while the red portion suggests that very high values of 'Init Win bytes forward' affect the model's decision differently.

Fig. 17

SHAP without SMOTE, indicating robust bagging and boosting binary classification performance.

The SHAP Summary Plot in Fig. 18a describes the distribution of SHAP values along the x-axis for each feature, quantifying how strongly the feature influences the model's predictions with positive or negative values. The spread shows the extremes that strongly push predictions toward the positive or negative class, as can be observed for 'Bwd Packet Length Std', 'Flow IAT Min', and 'Bwd IAT Mean', which take positive SHAP values when high and reach the widest spread. 'Idle Max' and 'Active Std' have a more balanced distribution around zero, reflecting a minor or inconclusive influence on the model.

Fig. 18

SHAP without SMOTE indicating robust boosting on residuals’ binary classification performance.

The SHAP Summary Plot (Fig. 19a) shows the SHAP values of each feature per sample over the entire dataset. This plot emphasizes the features' contribution to the decision-boundary prediction. The three most dominant features are 'Bwd Packet Length Std', 'Bwd Packet Length Mean', and 'Flow IAT Std'. The dashed line at the zero value indicates the decision boundary: below this line an observation is scored as an attack flag, while above it as a legitimate flag. The lower the value of a particular feature, the lower the chance the model assigns a legitimate score. The color bar represents the general distribution of the feature set: red corresponds to high and blue to low feature values.

Fig. 19

SHAP with SMOTE indicating robust bagging and boosting multiclass classification performance.

In Fig. 20a, features like 'Bwd Packet Length Std', 'Idle Min', and 'Bwd Packet Length Max' are the most influential, as they have the highest SHAP values, with 'Bwd Packet Length Std' showing an almost purely positive score. The color gradient from blue to red signifies how varying feature values impact the model's prediction, providing a visual understanding of how changes in feature values contribute to the final classification. High values of select features bring about considerable shifts in the model's prediction score, demonstrating their importance in classification.

Fig. 20

SHAP without SMOTE indicating robust Bagging and Boosting multiclass classification performance.

The SHAP Summary Plot in Fig. 21a depicts the distribution of SHAP values for each feature with respect to the model's output. The visualization shows that the top three features, 'Active Min', 'Bwd Packet Length Std', and 'Bwd Packet Length Mean', hold strong correlations with the model's predicted output. For the 'Active Min' feature specifically, the SHAP values range from -0.2 to +0.4. A color gradient represents the feature values, from red (high values, which the model minimizes) to blue (low values).

Fig. 21

The figure shows the SHAP Summary Plot and SHAP Waterfall Plot. This visualization provides an overview of the impact of features on model prediction without SMOTE, indicating robust Boosting on Residuals multiclass classification performance.

Figure 22a shows that the high variance of features like 'Destination Port', 'Flow Duration', and 'Total Fwd Packets', together with the low variance of features like 'Bwd IAT Mean' and 'Bwd IAT Std', indicates that the former group is more important for the classification task than the latter, less critical group; the classifier learns a joint interaction of the six features represented by salient positive feature importance.

Fig. 22

SHAP with SMOTE indicating robust boosting on residuals’ multiclass classification performance.

SHAP waterfall plot

The Waterfall Plot (Fig. 15b) depicts these feature impacts cumulatively for a single instance: starting from an initial prediction, the SHAP value of the feature with the most significant impact is added (or subtracted), and so forth, until the final prediction is attained. For example, 'Fwd Packet Length Std' and 'Flow IAT Max' have a significant positive impact, while 'Idle Std' has a negative effect.
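The accumulation a waterfall plot draws is just a running sum over SHAP values ordered by magnitude. The sketch below uses feature names from the text with invented magnitudes (purely hypothetical, not read from Fig. 15b):

```python
import numpy as np

# Hypothetical SHAP values for one instance (feature names from the text,
# magnitudes invented for illustration).
base = 0.023                                   # E[f(x)], the average output
names = np.array(["Fwd Packet Length Std", "Flow IAT Max",
                  "Idle Std", "ACK Flag Count"])
phi = np.array([0.30, 0.18, -0.12, 0.05])      # per-feature SHAP values

order = np.argsort(-np.abs(phi))               # biggest impact drawn first
steps = base + np.cumsum(phi[order])           # running total after each bar
prediction = steps[-1]                         # equals base + sum(phi)
for name, p, s in zip(names[order], phi[order], steps):
    print(f"{name:>24s} {p:+.2f} -> {s:.3f}")
```

However the bars are ordered, the final value is always `base + sum(phi)`, which is the additivity property that makes the per-figure walkthroughs below internally consistent.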

The Waterfall Plot (Fig. 16b) lists feature impacts as cumulative increments for the given instance, revealing the evolution of the score step by step. For example, 'Flow IAT Max' reduces the prediction by 0.25, 'Packet Length Variance' and 'Init Win bytes forward' reduce it by 0.25 and 0.11, respectively, while 'URG Flag Count' and 'ACK Flag Count' increase it by 0.09 and 0.05, respectively.

The SHAP Waterfall Plot in Fig. 17b illustrates the importance of features: features such as 'min seg size forward' negatively impact the model's decision at high values and positively at low values.

The Waterfall Plot in Fig. 17b also offers a cumulative (from most positive to most negative) perspective on feature contributions to a specific prediction. It starts from a baseline and then shows how each single feature shifted the final prediction value. Here, 'Fwd Header Length' decreases the prediction by 0.14, 'Min Packet Length' and 'Packet Length Std' add further decreases of 0.14 and 0.13, respectively, while 'Flow IAT Max' has a small positive effect, increasing the prediction by 0.01. Such a visualization lets us understand how each feature pushes the model's prediction from the base value to the target.

The SHAP Waterfall Plot in Fig. 18b shows how specific feature values contribute to a single prediction. It starts with the base value E[f(x)] of 0.023, which represents the average model output, and then shows how individual features increase or reduce that base value to obtain the final prediction. For instance, 'Bwd Packets' and 'CWE Flag Count' reduce it by 0.01 each, while 'Fwd Avg Packets Bulk' adds 0.01. Other features, such as 'Bwd Packet Length Max' and 'Init Win bytes forward', are responsible for more minor variations. The visualization thus shows how multiple features interact to produce the model's final output.

The SHAP Waterfall Plot in Fig. 19b depicts the contribution of individual features to the prediction. 'Flow IAT Std' has a considerable downward effect (-0.15), whereas 'Init Win bytes forward' and 'Init Win bytes backward' positively impact the prediction (+0.08 and +0.06, respectively). The plot elucidates how each feature pushes or pulls the prediction up or down from the base value. Such visual representations make it easier to comprehend the predictor's decision-making process.

Figure 22b is a SHAP Waterfall Plot. For this particular case, 'Total Fwd Packets' and 'Flow Duration' have the highest absolute SHAP values, +0.14 and -0.11, making them the largest contributors to increasing and decreasing the prediction, respectively. Two other variables, 'Fwd Packet Length Min' and 'Total Length of Bwd Packets', also contribute positively to increasing the prediction.

The SHAP waterfall plot in Fig. 20b details the impact of selected features on a specific prediction where the predicted class is positive (f(x) = 1). Each bar indicates how a feature contributes to the prediction score, either adding to or subtracting from the final prediction. The features 'Subflow Bwd Bytes' and 'Destination Port' contribute strongly positively, while 'Fwd Packet Length Max' contributes strongly negatively. In comparison, Fig. 23b shows that 'Subflow Bwd Bytes', with a SHAP value of +0.14, pulls the prediction towards class 1, and 'Flow IAT Mean', with +0.09, is another positively contributing feature; 'Fwd Packet Length Max' (-0.14) and 'Flow Bytes/s' (-0.02) pull the outcome towards class 0. This breakdown provides crucial insight into the impact of each feature on the final prediction.

Fig. 23

SHAP with SMOTE indicating robust bagging and boosting multiclass classification performance.

The SHAP Waterfall Plot in Fig. 21b describes the reasoning behind a class 1 prediction: 'Subflow Fwd Bytes' contributes +0.20 and 'Packet Length Mean' contributes +0.15, while 'Flow IAT Mean' contributes -0.02, pulling the prediction towards class 0. Importantly, this series of observations demonstrates the additive and cumulative effect of features on the model's decision.

SHAP force plot

Figure 24 depicts a SHAP Force Plot of how specific features influence a single prediction. Starting from the base value, each feature, such as 'Total Fwd Packets' and 'Fwd IAT Max', moves the prediction higher or lower. This plot exemplifies the interactions between features and the cumulative effect of all features on the final prediction.

Fig. 24

SHAP with SMOTE indicating robust bagging and boosting binary classification performance.

Figure 25 shows how the feature values contribute to a prediction. For the prediction shown, the values of 'min seg size forward', 'Packet Length Mean', and 'Destination Port' lift the base value toward the predicted value. The value of 'Bwd IAT Max' is responsible for the largest increase and 'Bwd IAT Std' for the smallest, but still positive, lift. The combined nature of the increases and decreases can be seen in the alternating pattern of these features with high and low values.

Fig. 25

SHAP with SMOTE, indicating robust boosting of residuals’ binary classification performance.

Figure 26 shows how specific features contribute to a single prediction. Starting from a baseline value, features like 'Fwd Header Length' drive the prediction down, while 'Min Packet Length' pushes it lower still. The combination of 'Packet Length Std' and 'Bwd IAT Total' contributes minimally, but one feature, 'Flow IAT Max', has a strongly positive impact on the prediction. The plot shows the additive effect of these interacting features, leading to the final prediction.

Fig. 26

SHAP without SMOTE indicating robust bagging and boosting binary classification performance.

Figure 27 illustrates how each feature propels the prediction up or down from the default score. Notably, the feature that lowers the prediction the most relative to the baseline is 'Flow Packets/s = 127.4372372'. In contrast, features such as 'Bwd Packet Length Std = 0.0', 'Bwd IAT Mean = 3.0', 'Flow IAT Mean = 10462.66667', and 'Flow IAT Min = 3.0' push the prediction up, indicating that Boosting on Residuals performs robustly on binary classification.

Fig. 27

SHAP without SMOTE indicating robust boosting on residuals’ binary classification performance.

In Fig. 23, features such as 'Fwd Packet Length Max' (maximum forward packet length), 'Flow Duration', 'RSSI', and 'Fwd Short Packets' have a negative influence, driving the prediction towards zero. On the other hand, 'Total Backward Packets' and 'Fwd Packet Length Std' (standard deviation of forward packet length) push the prediction upwards, though to a lesser extent. The contributions are quantified, demonstrating how much each feature shifts the model's prediction, and the model finally arrives at an output prediction value of 0.04. Figure 23 thus plays a pivotal role in understanding how features contribute to the output value and, ultimately, the decision, providing insights into feature interactions in network traffic classification.

Figure 28 illustrates the cumulative effect of the features moving the prediction from the baseline to the final value of 0.64. Features like 'Flow Duration' push the prediction lower, while features such as 'Fwd Packet Length Max' and 'Total Length of Bwd Packets' push it higher.

Fig. 28

SHAP with SMOTE indicating robust boosting on residuals’ multiclass classification performance.

As seen in Fig. 29, the SHAP Force Plot decomposes the prediction into each feature's contribution, indicating each feature's relative importance, magnitude, and directional effect. 'Destination Port' is a crucial feature that strongly pulls the prediction higher, whereas 'Flow Duration' strongly pushes it down. Other features, such as the packet-length and flag-count features, add finer adjustments around the base value, leading to the final classification output of 0.16.

Fig. 29

SHAP without SMOTE indicating robust Bagging and Boosting multiclass classification performance.

Figure 30 shows how much the individual features contribute to the model's output in terms of magnitude and direction. The features 'Fwd Packet Length Std' and 'Flow Duration' on the left side of the plot have red markers and contribute negatively to the prediction. 'Fwd Packet Length Std', with a value of 0.0, pushes the prediction to the left; likewise, 'Flow Duration', with a value of 70832.0, pushes the prediction to the left, decreasing the model's output.

Fig. 30

The figure shows the SHAP Force Plot. This visualization provides an overview of the impact of features on model prediction without SMOTE, indicating robust Boosting on Residuals multiclass classification performance.

Then, on the right, features like 'Destination Port' and 'Fwd Packet Length Min' are shown in blue because they increase the prediction, a positive effect. These features have values of 53.0 and 34.0, pushing the model output toward the right. The vertical grey line marks the median model output, alongside the mean values for this specific prediction.

LIME analysis

The LIME analysis shows the model's predicted probability for classifying between "normal" and "attack" classes in the binary and multiclass classification tasks, with and without SMOTE to handle class imbalance. Features calculated to increase the predicted probability of the normal class (class 0) are indicated in blue, while features calculated to increase the predicted probability of the attack class (class 1) are displayed in orange.
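Underneath these plots, LIME fits a weighted linear surrogate around the instance being explained; the bars are that surrogate's coefficients. The miniature below implements the idea from scratch on a stand-in black box (a fixed logistic surface, not the trained IDS model); a real analysis would use `lime.lime_tabular.LimeTabularExplainer`.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Black-box stand-in for the trained detector (a fixed logistic surface).
def black_box(X):
    return 1.0 / (1.0 + np.exp(-(3.0 * X[:, 0] - 2.0 * X[:, 1])))

x0 = np.array([0.5, 0.2, -0.1])                # instance to explain

# LIME in miniature: perturb around x0, weight the perturbations by their
# proximity to x0, and fit a weighted linear surrogate; its coefficients
# form the local explanation.
Z = x0 + 0.3 * rng.normal(size=(500, 3))
weights = np.exp(-np.sum((Z - x0) ** 2, axis=1) / 0.25)
surrogate = Ridge(alpha=1e-3).fit(Z, black_box(Z), sample_weight=weights)
local_importance = surrogate.coef_             # per-feature local effect
```

Positive coefficients play the role of the orange ("attack-leaning") bars and negative coefficients the blue ("normal-leaning") bars in the figures that follow.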

In Fig. 31, for instance, 'Bwd Packet Length Std' with a value of 0.0 and 'Init Win bytes forward' with a value of -1.0 both have a negative impact on the 'attack' class prediction, edging the decision in favour of the 'normal' class. 'Destination Port' with a value of 53.0 and 'min seg size forward' with a value of 20.0 have a positive impact on the 'attack' class, but their effect is dominated by the features pulling the decision toward the 'normal' class.

Fig. 31
figure 31

LIME for bagging and boosting binary classification performance with SMOTE.

Figure 32 begins the classification by testing the “Fwd IAT Mean” feature against the threshold 4.00: if “Fwd IAT Mean” is less than or equal to 4.00, the model predicts “normal”, with a probability of 1.00 for the “normal” class and 0.00 for the “attack” class.

Fig. 32
figure 32

LIME for boosting on residuals’ binary classification performance with SMOTE.

On the other hand, if “Fwd IAT Mean” is greater than 4.00, the model evaluates several further features, including “Idle Max”, “Fwd IAT Max”, “Fwd IAT Min” and “Flow Duration”, to refine its prediction. In that case, for instance, “Idle Max” being less than or equal to 0.00 and “Fwd IAT Max” being less than or equal to 4.00 would both support a prediction of “attack”. Meanwhile, “Fwd IAT Min” can be 3.00, “Flow Duration” can be 30908.50, and several other flag counts (FIN, ECE, ACK, and RST) can weigh variously in deciding between an “attack” and a “normal” instance. These features, along with the particular values used in the decision path, are listed on the right side of Fig. 32: “Fwd IAT Mean” (0.00), “Idle Max” (0.00), “Fwd IAT Max” (0.00), “Fwd IAT Min” (0.00), “Flow Duration” (41723.00), “FIN Flag Count” (0.00), “ECE Flag Count” (0.00), “Fwd IAT Total” (0.00), “ACK Flag Count” (0.00), and “RST Flag Count” (0.00).

Figure 33 initiates the classification by evaluating the “Bwd IAT Std” feature. If “Bwd IAT Std” is less than or equal to 0.00, the model predicts a “normal” instance with a probability of 1.00 for the “normal” class and 0.00 for the “attack” class. If “Bwd IAT Std” exceeds 0.00, the model assesses additional features to refine its prediction.

Fig. 33
figure 33

LIME for bagging and boosting binary classification performance without SMOTE.

The subsequent decision nodes include features like “RST Flag Count,” “Fwd IAT Total,” “min seg size forward,” and “FIN Flag Count.” For instance, if “RST Flag Count” is less than or equal to 0.00 and “Fwd IAT Total” is greater than 1.00 but less than or equal to 3.00, the prediction pathway continues through other features such as “Init Win bytes forward” and “Max Packet Length”. The right side of Fig. 33 lists the specific features and their values used in the decision pathway. These include “Bwd IAT Std” (0.00), “RST Flag Count” (0.00), “Fwd IAT Total” (3.00), “min seg size forward” (20.00), “FIN Flag Count” (0.00), “ECE Flag Count” (0.00), “Init Win bytes forward” (60.00), “URG Flag Count” (0.00), “Fwd IAT Min” (3.00), and “Max Packet Length” (6.00).

Meanwhile, Fig. 34 shows that the key features leading the model to predict ‘attack’ are ‘Flow Duration’ with a value of 64818700.00, ‘Fwd IAT Total’ with a value of 6840000.00, and ‘Fwd IAT Std’ with a value of 35445.13, all shown in orange and driving the model toward ‘attack’. The features ‘Fwd IAT Max’ with a value of 333000000.00 and ‘Idle Min’ with a value of 819200.00, shown in blue, point toward the ‘normal’ class, but they contribute less than the features arguing for ‘attack’.

Fig. 34
figure 34

LIME for boosting on residuals’ binary classification performance without SMOTE.

The LIME plot in Fig. 35 lists the features, with their corresponding values, that were essential to the prediction. For example, ’Bwd Packet Length Mean’ is 0.00, ’Packet Length Variance’ is 0.00, and ’Flow IAT Std’ is 100062.00. These values matter because they make the feature selection process more transparent, showing a human user how each feature affects the model’s decision-making. Features such as ’Active Min’ (10110048.00) and ’Idle Min’ (501097.00) take high values, meaning the network activity spans a wide range and duration, which can be crucial for distinguishing between classes in a multiclass classification problem.

Fig. 35
figure 35

LIME for bagging and boosting multiclass classification performance with SMOTE.

Figure 36 shows the model predicting the instance as ’BENIGN’ with 100% confidence. The middle section of the figure shows the score each feature receives for this instance and how strongly each feature points to ’BENIGN’ or ’DoS GoldenEye’; the probability of ’BENIGN’ is very high in this case. The figure lists the top features impacting the prediction, along with their contributions: ’Bwd Packet Length Std’ contributed most, followed by ’Destination Port’ and ’Average Packet Size’. The value depicted beside each feature name is the value that contributed to the final prediction; a positive contribution biases the decision toward that class, while a negative one biases it away. This visualization shows how Boosting on Residuals with SMOTE works and how specific features such as ’Bwd Packet Length Std’ and ’Destination Port’ are critical to the model’s prediction.

Fig. 36
figure 36

LIME for boosting on residuals’ multiclass classification performance with SMOTE.

Figure 37 also shows the prediction probabilities representing the model’s confidence: a 100% probability that the instance belongs to class ’0’, as the instance is not class ’1’. The feature ’Bwd Packet Length Std’ contributes most strongly to the prediction; this feature alone can confidently classify an instance as class ’0’ when its value is high. Other features, such as ’Destination Port’ with a value of 53, ’Subflow Bwd Bytes’ with a value of 353, and ’Flow IAT Mean’ with a value of 1178, also contribute positively. Whereas ’Bwd Packet Length Std’ makes the largest contribution, ’Destination Port’ and ’Subflow Bwd Bytes’ make the next largest. The feature values highlight their impact, and the positive contributions drive the final prediction.

Fig. 37
figure 37

LIME for Bagging and Boosting multiclass classification performance without SMOTE.

Figure 37 is a simple representation that underscores the model’s capability to learn from a small number of critical features, such as ’Bwd Packet Length Std’ and ’Destination Port’, and deliver robust classification performance without using SMOTE.

Figure 38 illustrates the model’s predictive probabilities for the Boosting on Residuals multiclass classifier without SMOTE on ’DoS GoldenEye’ attacks. The model predicts ’BENIGN’ with a confidence of 1.00 and ’DoS GoldenEye’ with a probability of 0.00. Several features account for this outcome, including ’Bwd Packet Length Std’ (0.00), ’Destination Port’ (34600.00), ’Average Packet Size’ (2.00), ’Init_Win_bytes_backward’ (0.00), ’Bwd Packets’ (65.00), ’Flow IAT Min’ (49.00), ’URG Flag Count’ (0.00), ’Idle Min’ (0.00), and ’Bwd Packet Length Min’ (0.00).

Fig. 38
figure 38

This figure shows the model’s predictive probabilities, indicating robust Boosting on Residuals multiclass classification performance without SMOTE.

In Fig. 38, ’Bwd Packet Length Std’ (value 0.00) carries the greatest importance for this instance, followed by ’Destination Port’ (34600.00) and ’Average Packet Size’ (2.00). Considering each feature’s importance is crucial to understanding how any single feature influences the model, thus enhancing the transparency of a machine learning model’s contribution to cybersecurity results.

To summarise, the model clearly distinguishes benign traffic from ’DoS GoldenEye’ attacks, with a predictive probability of 1.00 for the ’BENIGN’ class. According to the feature importance analysis, the model is driven chiefly by three features: ’Bwd Packet Length Std’, ’Destination Port’, and ’Average Packet Size’. Given the typical opacity of machine learning models, such insights are indispensable for making machine learning in cybersecurity more interpretable, and therefore more valuable and reliable for decision-making about cyber-attacks from the training stage onward.

Discussion

This section discusses the proposed model’s performance results. It explains how the model performed with SHAP feature selection, with and without balancing of the CIC-IDS2017 dataset.
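Balancing here refers to SMOTE, which synthesizes minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours. A minimal numpy sketch on hypothetical toy data (illustrating the algorithm, not the imbalanced-learn implementation the paper presumably uses):

```python
import numpy as np

rng = np.random.default_rng(2)

def smote_minority(X_min, n_new, k=5):
    """Minimal SMOTE sketch: synthesize minority samples by interpolating
    between a minority sample and one of its k nearest minority neighbours."""
    out = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = X_min[rng.integers(len(X_min))]
        d = ((X_min - a) ** 2).sum(axis=1)
        nbrs = np.argsort(d)[1:k + 1]        # skip the point itself
        b = X_min[rng.choice(nbrs)]
        out[i] = a + rng.random() * (b - a)  # point on the segment a -> b
    return out

# Imbalanced toy data: 200 'normal' flows vs 20 'attack' flows
X_norm = rng.normal(0, 1, size=(200, 4))
X_att = rng.normal(3, 1, size=(20, 4))
X_new = smote_minority(X_att, n_new=180)     # bring the classes to parity
assert X_new.shape == (180, 4)
# Synthetic points stay inside the minority region, near its mean
assert np.linalg.norm(X_new.mean(axis=0) - X_att.mean(axis=0)) < 1.0
```

Because synthetic points lie on segments between real minority samples, SMOTE densifies the minority region rather than simply duplicating instances.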

Performance with SMOTE

Figure 39 presents a 3D bar chart of the proposed Bagging and Boosting classifiers for binary and multiclass classification of the CIC-IDS2017 dataset, validated against training performance with hybrid machine-learning algorithms. The model achieves high accuracies of 98.47% and 92.75% across all measures in the binary and multiclass settings (P-BB-B and P-BB-M), respectively, demonstrating its ability to learn the proposed Bagging and Boosting algorithms. Baseline classifiers such as RF, XGB, and LGBM also demonstrated impressive performance, with nearly 99% accuracy in both cases.

Fig. 39
figure 39

Proposed HBB-RE model performance using SMOTE.

The proposed Boosting on Residuals model for binary and multiclass classification (P-BR-B and P-BR-M) achieves near-perfect accuracy of above 97.45% and over 90.99%, respectively. AdaBoost performed strongly in both the binary and multiclass tasks, as did gradient boosting (GBM). DT, ADB, RF, XGB, and LGB all performed well across the hybrid binary and multiclass settings on these datasets.
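The “Boosting on Residuals” idea underlying P-BR can be sketched in a few lines: each weak learner is fitted to the residuals the current ensemble still leaves, and its damped prediction is added back. The following numpy sketch uses regression stumps on toy data; it illustrates the principle, not the authors’ exact HBB-RE implementation:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

def fit_stump(x, r):
    """Best single-split regression stump fitted to residuals r."""
    best = (np.inf, None)
    for s in np.quantile(x, np.linspace(0.05, 0.95, 19)):
        left, right = x <= s, x > s
        pred = np.where(left, r[left].mean(), r[right].mean())
        sse = ((r - pred) ** 2).sum()
        if sse < best[0]:
            best = (sse, (s, r[left].mean(), r[right].mean()))
    return best[1]

# Boosting on residuals: each stump fits what the ensemble still misses
pred, stumps, lr = np.zeros(len(y)), [], 0.3
for _ in range(100):
    r = y - pred                            # current residuals
    s, lv, rv = fit_stump(X[:, 0], r)
    pred += lr * np.where(X[:, 0] <= s, lv, rv)
    stumps.append((s, lv, rv))

# The boosted ensemble beats the trivial mean predictor by a wide margin
assert ((y - pred) ** 2).mean() < ((y - y.mean()) ** 2).mean()
```

The learning rate `lr` damps each stump’s correction, trading per-round progress for smoother convergence and less overfitting.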

Performance without SMOTE

Figure 40 shows accuracies of over 94.92% and 84.52% across all measures: the Bagging and Boosting model performed well in binary and multiclass (P-BB-B and P-BB-M) classification on the CIC-IDS2017 dataset. Baseline classifiers such as RF, DT, and ADB demonstrated impressive performance with nearly 99% accuracy, while other models, such as XGB and LGB, also scored well in multiclass classification with SHAP feature selection and without SMOTE.

The Boosting on Residuals model (P-BR-B and P-BR-M) led the binary task with 97.84% accuracy and the multiclass task with over 80.01% accuracy. AdaBoost and XGB also performed well on multiclass tasks, with accuracies near 99%. DT, RF, and ADB excelled in both the binary and multiclass settings.

Fig. 40
figure 40

Proposed HBB-RE model performance without SMOTE.

Performance analysis

The performance of the proposed models demonstrates significant advancements in predictive accuracy and reliability across various configurations.

Figure 41 evaluates the performance of the proposed models using four key metrics: Accuracy, Precision, Recall, and F1 Score. The models include with-SMOTE (S) and without-SMOTE (WS) variants of Bagging and Boosting (BB) and Boosting on Residuals (BR), where (B) denotes binary and (M) multiclass classification. The BB-B-S model exhibits high performance, with all metrics around 98.47% to 98.48%, indicating consistent and reliable predictions. The BR-B-S model follows closely, with slightly lower scores of 97.45% across all metrics. The BB-M-S model drops significantly, with metrics between 92.75% and 93.21%, indicating less dependable performance, and the BR-M-S model declines further still, with metrics between 90.99% and 92.50%.
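All four metrics can be reproduced directly from confusion counts. As a quick check in plain Python, the binary with-SMOTE counts reported in Fig. 5 (presumably the BB-B-S configuration) recover the ≈98.48% accuracy quoted here:

```python
# Binary with-SMOTE confusion counts reported for the proposed model
# in Fig. 5: true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = 127_981, 4_014, 0, 132_024

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)          # no false negatives, so recall is 1.0
f1 = 2 * precision * recall / (precision + recall)

print(f"acc={accuracy:.4f} prec={precision:.4f} rec={recall:.4f} f1={f1:.4f}")
```

This yields an accuracy of about 0.9848, matching the BB-B-S figure, with perfect recall because the matrix reports zero false negatives.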

Fig. 41
figure 41

Combine HBB-RE model performance with and without SMOTE.

The BB-B-WS model scores between 94.92% and 96.19% across all four metrics, showing that the BB model remains effective without SMOTE. The BR-B-WS model, with metrics between 97.84% and 98.60%, demonstrates the model’s robustness. The BB-M-WS model, however, is inconsistent, with metrics ranging between 84.52% and 95.83%. Finally, the BR-M-WS model has the lowest scores of all the proposed models, with metrics between 80.01% and 83.49%, producing the least accurate predictions on all accounts.

The data show that the binary configurations hold up well even without SMOTE, with BR-B-WS performing best among the without-SMOTE variants. The multiclass models (BB-M and BR-M) provide good insight into the sources of performance variability and possible points of improvement.

Comparative study

This section compares methods and feature-selection algorithms to verify the performance of the proposed HBB-RE model using the top 100 SHAP-selected features. It also compares the proposed HBB-RE IDS with state-of-the-art intrusion detection methods to demonstrate its advantages.

Comparison with classical methods

Classical methods such as decision trees, k-nearest neighbors (k-NN), and support vector machines (SVM) scale poorly to high-dimensional data, limiting their practical usefulness. The SHAP-based method effectively identifies the relevant features and improves model accuracy and robustness. A rich literature shows that SHAP-based methods help interpret models and can either improve existing models or suggest new approaches across diverse domains44. A general trade-off holds between model interpretability and predictive performance.
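Once per-sample SHAP values are available from any explainer, the selection step itself is simple: rank features by mean absolute SHAP and keep the top k (the paper keeps 100). A numpy sketch on a synthetic SHAP matrix, where the signal/noise split is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
n_samples, n_features, k = 1000, 200, 100

# Stand-in for a (samples x features) SHAP value matrix from an explainer;
# here, only the first 100 features carry real signal.
shap_values = rng.normal(scale=0.01, size=(n_samples, n_features))
shap_values[:, :100] += rng.normal(scale=1.0, size=(n_samples, 100))

def select_top_k(shap_values, k):
    """Rank features by mean |SHAP| and keep the k most influential."""
    importance = np.abs(shap_values).mean(axis=0)
    return np.argsort(importance)[::-1][:k]

keep = select_top_k(shap_values, k)
assert set(keep) == set(range(100))  # the signal features are recovered
```

Mean absolute SHAP is the standard global-importance summary; the retained column indices would then index into the original feature matrix before model training.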

On the other hand, advances such as neural networks and ensemble methods like random forests and gradient boosting have led to impressive improvements in accuracy. However, these models often have limited interpretability and cannot explain why they produce a particular output for a specific input5. The SHAP-based method demonstrates the importance and contribution of each feature in a form that is more interpretable for non-experts.

Classical methods often break down on imbalanced datasets45. Combining SHAP-based feature selection with modern resampling techniques can improve performance on imbalanced datasets while maintaining interpretability. Overfitting is another obstacle for classical machine learning methods46. Hybrid ensemble learning techniques combined with SHAP-based feature selection mitigate overfitting, improve generalization, and stabilize performance.

Moreover, classical methods often report high accuracy yet perform poorly on more meaningful measures such as precision, recall, and F1 score47. Using SHAP-based feature selection, the proposed model achieves higher accuracy and F1 scores than classical methods. The proposed method demonstrates its significance for developing robust and accurate IDPS and is, to our knowledge, the first to use SHAP-based feature selection in this setting. This line of work is expected to aid in creating more reliable, robust, and explainable IDPS across scientific fields.

Comparison with state-of-the-art methods

Table 4 summarizes the compared studies and proposed methods in terms of feature selection, algorithm, number of features, accuracy, and F1 score in predictive modelling. Studies 1 to 11 differ in their feature selection methods and algorithms. Feature selection methods include NTLEBO, HFS, RFE, Weka-ML, F1, and PSO-FO-GO-GA. Algorithms range from traditional ones such as Logistic Regression (LR) and Decision Trees (DT) to more advanced ones such as Deep Neural Networks combined with Ant Colony Optimization (DNN+ACO), LGBM, XGBoost, Light Gradient Boosting Machine, Random Forest, and Artificial Neural Networks (ANN). Abbreviations used in this study include P (Proposed), B (Binary), M (Multi), S (SMOTE), and W (Without).

The number of features selected in the compared studies ranges from 5 (Study 8)48 to 30 (Study 11)49, suggesting that feature selection and dimensionality reduction were performed differently across studies, using various ad hoc strategies to optimize model performance. Overall accuracy ranges from 75.66% (Study 9)50 to 98.25% (Study 2)51, while F1 scores – which balance precision and recall – range from 77.28% (Study 10)52 to 97% (Study 1)53.

Table 4 Comparison of Feature Selection and Algorithms.

These state-of-the-art studies are compared with the proposed methods – Proposed-B-SMOTE, Proposed-M-SMOTE, and Proposed-B-W/SMOTE – which use SHAP for feature selection and HBB-RE as the algorithm, all with 100 features. With SMOTE, Proposed-B-SMOTE reaches 98.47% accuracy and a 98.47% F1 score, a very high predictive performance. Proposed-M-SMOTE ranges from 84.52% to 92.75% accuracy and from 84.34% to 92.81% F1, varying with the configuration. Proposed-B-W/SMOTE reaches 94.92% accuracy and a 96.19% F1 score, also a strong result.

The proposed methods outperform the state-of-the-art algorithms in accuracy and F1 score. Table 4 shows that Proposed-B-SMOTE has the highest accuracy, 98.47%, slightly above the best existing result of 98.25% (Study 2). Its F1 score of 98.47% likewise exceeds those of the existing studies, and Proposed-B-W/SMOTE’s F1 score of 96.19% is higher than that of most studies. The improvement stems largely from standardizing on 100 features and using SHAP for feature selection, which makes the methods more effective than the state-of-the-art. By combining advanced feature selection with resampling techniques, the proposed methods surpass many state-of-the-art methods in overall accuracy and in balancing precision and recall.
