Machine learning approach for identifying and forecasting streamflow droughts in data limited basins of South Korea using threshold levels
Streamflow deficit characteristic based on threshold level method

Sensitivity in Mean Drought Duration, Deficit Volume, and Intensity with Threshold levels (Q70, Q80, Q90, Q95, Q99) for each Dam: Soyanggang Dam (Dam 1), Andong Dam (Dam 2), Juam Dam (Dam 3), and Yongdam Dam (Dam 4).
The hydrological drought characteristics were analyzed using the historical inflow data of Soyanggang Dam, Andong Dam, Juam Dam, and Yongdam Dam. The threshold level method was employed for each dam inflow, which allows for defining the start and end dates of drought events. Fig. 1 represents the relative drought characteristics, which are drought duration, deficit volume, and intensity, for the threshold levels (Q70, Q80, Q90, Q95, and Q99) at each dam. Four types of time resolutions for thresholds were utilized: daily, monthly, seasonal, and fixed thresholds.
Figure 1a, d, g, and j illustrate changes in the relative annual mean drought duration according to variations in the threshold levels. Using the duration for Q70, which showed the longest drought duration, as a baseline, the relative changes in drought duration for other threshold levels are presented. A linear decrease in drought duration was observed as the threshold level decreased from Q70 to Q99, and this trend was not affected by changes in the time resolution of the thresholds (daily, monthly, seasonal, and fixed). This tendency was more pronounced in Dam 1 and Dam 2, which had relatively longer data records. Additionally, a comparison of the results for each dam revealed consistent ratios for drought durations at different threshold levels relative to Q70: Q80 was approximately 61% of Q70, Q90 was about 25–0% of Q70, and Q95 was around 10–15% of Q70.
Figure 1b, e, h, and k show changes in the relative annual mean deficit volume according to variations in the threshold levels. Similar to the previous results, the relative deficit volume exhibited consistent values regardless of changes in the threshold time resolution or the specific dam. While the relationship between the threshold level and deficit volume did not show a linear trend, a consistent ratio was observed across threshold levels when comparing the results for each dam. For intensity, defined as the ratio of deficit volume to duration, a linear trend was identified. However, compared to the other two characteristics, no distinct tendencies were observed for intensity.
As shown in Fig. 1, applying the TLM to the inflow data of different dams revealed consistent patterns in drought characteristics across variations in threshold levels. Additionally, the use of indices such as Q70-Q95 aligns with established practices in hydrology, providing a reliable and practical approach to identifying and managing hydrological droughts. Accordingly, the threshold levels for each time resolution were used as the 70th percentile, 90th percentile, and 95th percentile flow values. The hydrological drought events identified using the above method were classified into occurrence and non-occurrence of drought, thereby generating a time series. This time series was used as the target variable to train the model.
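The threshold level method described above can be sketched as follows. This is a minimal illustration for a fixed threshold only, not the procedure used in the study: the function names, the interpolation rule, and the toy inflow values are assumptions, and Q70 is taken here as the flow equalled or exceeded 70% of the time (the 30th percentile of the sorted flows). Event pooling and the removal of minor events are not considered.

```python
from typing import Dict, List


def exceedance_threshold(flows: List[float], exceed_pct: float) -> float:
    """Flow equalled or exceeded `exceed_pct` percent of the time (e.g. Q70).

    Assumption: the value is read from the sorted flows at the
    (100 - exceed_pct)th percentile using linear interpolation.
    """
    ordered = sorted(flows)
    rank = (100.0 - exceed_pct) * (len(ordered) - 1) / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (rank - lo) * (ordered[hi] - ordered[lo])


def drought_events(flows: List[float], threshold: float) -> List[Dict[str, float]]:
    """Group consecutive days with flow below the threshold into events."""
    events: List[Dict[str, float]] = []
    current = None
    for day, q in enumerate(flows):
        if q < threshold:
            if current is None:
                current = {"start": day, "duration": 0, "deficit": 0.0}
            current["duration"] += 1
            current["deficit"] += threshold - q  # daily deficit volume
        elif current is not None:
            events.append(current)
            current = None
    if current is not None:
        events.append(current)
    for event in events:
        # Intensity is the ratio of deficit volume to duration.
        event["intensity"] = event["deficit"] / event["duration"]
    return events


def drought_target(flows: List[float], threshold: float) -> List[int]:
    """Binary occurrence (1) / non-occurrence (0) series used as a target."""
    return [1 if q < threshold else 0 for q in flows]


# Hypothetical daily inflow values, for illustration only.
inflow = [10.0, 8.0, 3.0, 2.0, 4.0, 9.0, 12.0, 1.0, 2.0, 11.0, 13.0]
q70 = exceedance_threshold(inflow, 70.0)
events = drought_events(inflow, q70)
target = drought_target(inflow, q70)
```

A daily, monthly, or seasonal varying threshold would apply the same steps after computing a separate threshold for each calendar day, month, or season.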

Historical drought duration and volume with daily varying threshold (Q70, Q90, and Q95): (a) Soyanggang Dam (Dam 1), (b) Andong Dam (Dam 2), (c) Juam Dam (Dam 3), and (d) Yongdam Dam (Dam 4).
Figure 2 represents the findings on the streamflow deficit using the daily varying threshold level for each dam. Each figure includes a series of bars and lines representing the annual duration and volume of droughts, respectively, with different threshold levels (Q70, Q90, and Q95). A similar drought pattern was observed across all dams in the mid-1990s, 2009, and the mid-2010s. Additionally, for Dams 1 and 2, there has been an increasing frequency of severe droughts, which aligns with meteorological drought patterns.
For each time resolution and threshold, the mean, coefficient of variation (CV), maximum, and minimum values were calculated. As the threshold flow decreased (from Q70 to Q95), the annual average drought duration and deficit volume decreased, and the intensity, defined as the ratio of volume to duration, also diminished. Conversely, as the threshold flow decreased, the CV of the annual average drought volume increased, demonstrating greater variability in the volume of deeper droughts. However, the CV of the annual average duration was highest at Q90 for most dams, except for Dam 2. The CV of the annual average intensity did not show a consistent pattern with respect to the threshold levels.
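As a small illustration of the summary statistics named above, the sketch below computes the mean, CV, maximum, and minimum of a series of annual values. The input numbers are hypothetical, and the use of the population standard deviation in the CV is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, pstdev
from typing import Dict, List


def summarize(annual_values: List[float]) -> Dict[str, float]:
    """Mean, coefficient of variation, maximum, and minimum of annual values."""
    m = mean(annual_values)
    # CV = standard deviation / mean (population form, an assumption).
    cv = pstdev(annual_values) / m if m else float("nan")
    return {"mean": m, "cv": cv, "max": max(annual_values), "min": min(annual_values)}


# Hypothetical annual drought durations in days.
annual_duration_days = [40.0, 60.0, 20.0, 80.0]
stats = summarize(annual_duration_days)
```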
Table 1 summarizes the hydrological drought characteristics identified using monthly varying, seasonal varying, and fixed thresholds, which are not illustrated in Fig. 2. The statistical properties of the duration, volume, and intensity were calculated for each specific condition. Similar to the results obtained using the daily varying threshold, a decrease in the threshold flow reduced the annual averages of duration and volume. In contrast, the CV of the volume generally increased. Additionally, the CV of the duration often showed its highest value at Q90.
Comparing the annual averages of drought volume and intensity for each dam reveals differences in magnitude. However, when the same time resolution and threshold level are applied, the annual average drought duration shows similar values across the dams. This suggests that, despite the differences in watershed size between the dams, the flow regimes of the watersheds respond similarly to climatic variations.
Drought identification model with XGBoost
This section describes the training and performance evaluation of the machine-learning model for identifying droughts. The target variable for the ML model was created using the threshold level method, applying different time resolutions and flow levels to define droughts, which were then categorized into occurrence and non-occurrence. To ensure robust performance evaluation, we applied stratified K-fold cross-validation. This method divides the dataset into k folds while maintaining the proportion of each class in every fold. It addresses class imbalance issues and provides a more reliable estimate of the model’s performance than traditional K-fold cross-validation. In this study, the datasets were split into five folds.
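The idea of stratification can be illustrated with a short standard-library sketch. It is not the implementation used in the study: the function name, the shuffling with a fixed seed, and the toy labels are assumptions. Indices of each class are dealt round-robin into k folds so that every fold keeps roughly the same share of drought and non-drought days.

```python
import random
from typing import List, Tuple


def stratified_kfold(labels: List[int], k: int = 5, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs with class shares preserved."""
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)  # shuffling is an assumption of this sketch
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal each class evenly across folds
    splits = []
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        splits.append((train, test))
    return splits


# Hypothetical imbalanced target: 20 drought days out of 100.
labels = [1] * 20 + [0] * 80
splits = stratified_kfold(labels, k=5)
```

With these toy labels, each test fold holds 4 drought days and 16 non-drought days, matching the 20% drought share of the full series.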

Stratified K-fold Validation of the XGBoost Model for Drought Identification Using Past (a) 5 Days, (b) 10 Days, (c) 20 Days, and (d) 30 Days of Meteorological and Inflow Data at Soyanggang dam (Dam 1).
The input data for the drought model comprised seven meteorological variables: precipitation, PET (potential evapotranspiration), daily minimum and maximum temperatures, wind velocity, atmospheric vapor pressure, and atmospheric pressure, along with daily inflow data. To assess the model’s applicability to semi-gauged watersheds, we evaluated its performance under three scenarios: using all eight variables, only meteorological variables (excluding inflow data), and only inflow data. The accumulation period is crucial since droughts occur gradually over time due to cumulative meteorological conditions. Therefore, we also investigated the model’s performance variation with different input data lengths, specifically 5, 10, 20, and 30 days.
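One way to assemble inputs from the previous n days is sketched below. The layout is an assumption: the daily values of each variable over the window are simply concatenated into one row, whereas the study may have aggregated or arranged them differently. The variable names and values are hypothetical.

```python
from typing import Dict, List, Tuple


def make_windows(series_by_var: Dict[str, List[float]], window: int) -> Tuple[List[str], List[List[float]]]:
    """Build one feature row per day from the last `window` days of each variable."""
    names = sorted(series_by_var)
    n = len(series_by_var[names[0]])
    rows = []
    for t in range(window - 1, n):
        row: List[float] = []
        for name in names:
            # Values from day t - window + 1 up to and including day t.
            row.extend(series_by_var[name][t - window + 1 : t + 1])
        rows.append(row)
    return names, rows


# Hypothetical series; a real input would hold all eight variables.
data = {
    "inflow": [5.0, 4.0, 3.0, 2.0, 1.0, 6.0],
    "precip": [0.0, 0.0, 1.0, 0.0, 9.0, 2.0],
}
names, X = make_windows(data, window=5)

# The inflow-only scenario uses the same routine on a subset of variables.
_, X_inflow = make_windows({"inflow": data["inflow"]}, window=5)
```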
Figure 3 demonstrates the performance of the XGBoost model for detecting droughts at the Soyanggang Dam using different time windows of meteorological and inflow data. The performance was measured using the F1 score across various time resolutions: daily, monthly, seasonal, and fixed. The results are presented for four scenarios: using the previous 5 days, 10 days, 20 days, and 30 days of data. The blue lines represent the validation results of the model trained using only inflow data for detecting droughts. The red lines indicate the results when only meteorological variables were used, and the black lines show the validation results when both inflow and meteorological variables were utilized.
As the length of the input data increases, the performance of the drought model improves. However, the performance improvement becomes marginal beyond 10 days up to 30 days. The model performed best under fixed threshold conditions for each scenario when comparing the performance across different time resolutions (daily, monthly, seasonal, and fixed). Additionally, as the threshold level increased (from Q95 to Q70), the model’s performance in predicting streamflow droughts improved. The finer the time resolution of the variable threshold, the greater the variability in thresholds over time, which leads to a decrease in the model’s ability to classify drought occurrences accurately.
When comparing the drought model results using only meteorological variables with those using only inflow data, the model using only inflow data performed better under fixed threshold conditions. This suggests that inflow data has a more direct relationship with streamflow deficits under fixed thresholds, making inflow a dominant factor in the drought model. However, when using variable thresholds, the model’s performance using only meteorological variables improved with longer input data periods, surpassing the performance of the model using only inflow data. This indicates that the influence of inflow data on the streamflow drought model is less significant than that of meteorological data under variable thresholds. Moreover, as the length of input data increases, the importance of meteorological variables in predicting streamflow drought characteristics also increases.
The first case, illustrated in Fig. 3a, shows the F1 score when the model is trained using the previous 5 days of data. The F1 scores consistently improve across all three quantiles (Q70, Q90, and Q95) as the threshold time resolution becomes coarser, from daily to fixed. The fixed threshold yields the highest F1 scores, with Q70 achieving an F1 score of 0.922, followed by Q90 and Q95 with scores of 0.841 and 0.788, respectively.
Figure 3b represents the model using the previous 10 days of data. As in the previous case, the F1 scores increase as the threshold time resolution becomes coarser. The fixed threshold again shows the best performance, with Q70 reaching an F1 score of 0.926. Q90 and Q95 attain F1 scores of 0.841 and 0.791, respectively. Figure 3c presents the results for the model trained with the previous 20 days of data. The trend of increasing F1 scores with coarser time resolutions continues. With the fixed threshold, the F1 score for Q70 reaches 0.925, while Q90 and Q95 achieve scores of 0.834 and 0.764, respectively.
The final case, shown in Fig. 3d, uses the previous 30 days of data. This case provides the highest F1 scores across all time resolutions and quantiles. The fixed threshold exhibits F1 scores of 0.928, 0.846, and 0.789 for Q70, Q90, and Q95, respectively. This demonstrates that utilizing a more extended period of historical data significantly enhances the model’s ability to detect droughts accurately.
Table 2 presents the model validation results as F1 scores for all threshold conditions and input data lengths. The same patterns observed in Soyanggang Dam (Dam 1) in Fig. 3 are also evident in the other dams. Specifically, as the time resolution of the threshold becomes coarser (from daily to fixed) and the threshold level increases (from Q95 to Q70), the performance of the drought identification model improves. When the time resolution becomes finer, it indicates that the threshold level varies more frequently within shorter time intervals. Consequently, training the drought identification model requires a greater number of classification trees. Since this study fixed the depth of the trees, the model using fixed thresholds outperformed those using varying thresholds.
In general, the performance of the model improves with longer input data lengths. On average, the F1 score was 0.688 for 5-day data, 0.697 for 10-day data, 0.709 for 20-day data, and 0.719 for 30-day data. The F1 score increased by approximately 0.27% per day in the 5-10 day range, 0.16% per day in the 10-20 day range, and 0.14% per day in the 20-30 day range. While longer input data lengths provided more information, the marginal utility of additional data for drought identification tended to diminish. Furthermore, depending on the dam watershed and threshold conditions, there were instances where the model using 20-day input data outperformed the model using 30-day input data. This variation highlighted the importance of considering the watershed characteristics and the specific threshold conditions when selecting the optimal input data length for drought modeling.

Comparison of Observed and Modeled Drought Events by Day of Year with Q70 Threshold Level; (a) Dam 1: daily, (b) Dam 1: fixed, (c) Dam 2: daily, (d) Dam 2: fixed, (e) Dam 3: daily, (f) Dam 3: fixed, (g) Dam 4: daily, (h) Dam 4: fixed.
Based on the validation results of the drought identification model discussed earlier, Fig. 4 presents the detection results for droughts using the Q70 threshold as heatmaps. The figure illustrates the model results for drought identification for each dam using the best-performing fixed threshold and daily-varying thresholds on a daily time series. In each sub-figure, the upper heatmap shows the historical droughts defined by the time resolution and Q70 threshold in red, while the lower heatmap represents the model-identified droughts. In the model-identified heatmap, red indicates droughts in the training set, and blue indicates droughts in the validation set. The y-axis represents the day of the year, and the x-axis represents the years.
When comparing the daily varying droughts on the left with the fixed threshold-defined droughts on the right, it is evident that the former makes it challenging to analyze drought patterns, especially for Dam 1, where the distribution of droughts appears almost random. This randomness seems to be one of the factors that decrease the performance of the drought identification model. The results of the fixed threshold droughts reproduce the pattern of low flows in autumn and winter. Notably, it was observed that droughts tend to occur in spring, and the frequency of spring droughts has increased since the mid-2010s.
Table 3 presents the results of cross-validation conducted to verify whether the drought identification model could generalize to different regions. The first column of the table indicates the dam basins used for model training, while the second column represents the basins used for testing. Although validation was performed for various scenarios, this study describes the results for two representative cases. The case where the model was trained using the data from Dam 2, Dam 3, and Dam 4 basins and tested on the Dam 1 basin yielded the poorest performance. In contrast, the model demonstrated the best performance when trained on data from Dam 1, Dam 2, and Dam 3 basins and tested for drought identification in the Dam 4 basin.
When examining the results of cross-validation across various cases, it was observed that the performance of validation was related to the length of the data used for training and testing. Specifically, the model demonstrated superior performance when trained on longer datasets and evaluated on regions with shorter datasets, compared to the opposite scenario. The dataset for Dam 1 spans 50 years (1974–2023), Dam 2 spans 36 years (1988–2023), Dam 3 spans 33 years (1991–2023), and Dam 4 spans 23 years (2001–2023). Among the various cross-validation scenarios, the evaluation of Dam 4, which utilized the longest dataset for training and the shortest dataset for testing, as shown in Table 3, exhibited the best performance. In contrast, the evaluation of Dam 1, which involved testing the region with the longest dataset, showed the lowest performance.
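The record lengths above make the contrast between the two representative cases easy to check. The sketch below enumerates hold-one-basin-out splits and totals the training and testing years for each. Treating the scenarios as hold-one-out splits is an assumption drawn from the two cases described, and the function and key names are illustrative.

```python
from typing import Dict, List, Tuple

# Record periods as stated in the text (first year, last year).
record_years: Dict[str, Tuple[int, int]] = {
    "Dam 1": (1974, 2023),
    "Dam 2": (1988, 2023),
    "Dam 3": (1991, 2023),
    "Dam 4": (2001, 2023),
}


def n_years(span: Tuple[int, int]) -> int:
    """Number of calendar years in an inclusive (start, end) span."""
    return span[1] - span[0] + 1


def leave_one_basin_out(records: Dict[str, Tuple[int, int]]) -> List[dict]:
    """One split per basin: test on that basin, train on all the others."""
    rows = []
    for held_out in records:
        train = [b for b in records if b != held_out]
        rows.append({
            "train": train,
            "test": held_out,
            "train_years": sum(n_years(records[b]) for b in train),
            "test_years": n_years(records[held_out]),
        })
    return rows


splits = leave_one_basin_out(record_years)
by_test = {row["test"]: row for row in splits}
```

Testing on Dam 4 pairs 119 basin-years of training data with 23 years of testing data, whereas testing on Dam 1 pairs 92 basin-years of training data with 50 years of testing data.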
When comparing performance across threshold levels, the highest performance was observed at Q70, consistent with the previous validation evaluations. In particular, when the model was trained using data from Dam 1, Dam 2, and Dam 3, the Fixed Q70 threshold evaluation yielded performance equivalent to 95.7% of the original Dam 4 evaluation results (Table 2). Additionally, the cross-validation results for the Q70 daily varying threshold showed an average performance of 91.9% relative to the original evaluation, while the monthly varying threshold achieved 89.2%, and the seasonal varying threshold reached 84.7% of the relative performance. Furthermore, the Fixed Q90 threshold resulted in an F1 score equivalent to 89.4% of the original performance. These findings suggest that when the model is trained on datasets from regions with relatively sufficient data lengths, the use of a fixed Q70 threshold for hydrological droughts allows the model to generalize effectively. This indicates the potential applicability of the model to semi-gauged basins.
Drought occurrence forecasting
In this section, we evaluate the performance of a model that forecasts drought occurrence three days in advance using meteorological variables and inflow data. The threshold time resolution is categorized into daily, monthly, seasonal, and fixed, with threshold levels of Q70, Q90, and Q95, consistent with the conditions used in the previous drought identification model. The length of the data used for training is also the same as the drought identification model, including periods of 5, 10, 20, and 30 days. The results are compared with those from the drought identification model under the same conditions.
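The difference between identification and forecasting lies in how inputs and targets are aligned. A minimal sketch, assuming features are already arranged as one row per day, pairs the features available on day t with the drought flag on day t + 3. The function name and toy data are hypothetical.

```python
from typing import List, Tuple


def forecast_pairs(features: List[List[float]], drought_flags: List[int], lead: int = 3) -> Tuple[List[List[float]], List[int]]:
    """Pair the feature row of day t with the drought flag of day t + lead."""
    n = len(drought_flags)
    X = features[: n - lead]   # the last `lead` days have no future label yet
    y = drought_flags[lead:]   # labels shifted forward by `lead` days
    return X, y


# Hypothetical one-column features (the day index) and drought flags.
features = [[float(day)] for day in range(6)]
flags = [0, 0, 1, 1, 0, 1]
X, y = forecast_pairs(features, flags, lead=3)
```

Setting `lead=0` in this sketch reproduces the alignment of the identification model, where features and flags refer to the same day.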

Comparison of Validation Results from Drought Identification and Forecasting Models Using Previous (a) 5 days, (b) 10 days, (c) 20 days, and (d) 30 days at Soyanggang dam.
Figure 5 compares the performance of the drought identification model trained using meteorological data and inflow from Soyanggang Dam with the model that forecasts droughts three days in advance, using the F1 score. The black color represents the drought identification model, while the red color indicates the scores of the three-day drought forecasting model. Similar to the drought identification model, the performance of the three-day drought forecasting model improves as the threshold time resolution becomes coarser and the threshold level increases.

Comparison of Observed and Forecast Events by Day of Year with Q70 and Monthly-Varying Threshold Level; (a) Dam 1, (b) Dam 2, (c) Dam 3, and (d) Dam 4.
Figure 6 shows the results of predicting droughts three days in advance for each dam, defined using the Q70 and monthly-varying threshold. The red represents historical drought occurrences, while the blue represents the model’s predictions. Like the drought identification model, the drought forecasting model was evaluated using stratified K-fold cross-validation. The results shown in the figure are from the fold with the highest performance, predicting the entire period. The average F1 scores obtained through K-fold cross-validation for all threshold time resolutions and threshold levels are presented in Table 4.
It can be observed that the performance of the model predicting drought three days in advance is lower than that of the drought identification model under the same threshold conditions. However, when using the Q70 threshold level, the average F1 score across all time resolutions was above 0.7. Moreover, the performance of the drought forecasting model using the Q70 threshold was higher than that of the drought identification models for other threshold levels.
Table 5 summarizes the performance of the drought forecasting model on the validation set, evaluated using precision and recall. The evaluation metrics were derived from the results of a fixed threshold approach, which demonstrated strong performance in terms of the F1 score. Precision is the proportion of the model’s drought predictions that correspond to actual drought occurrences, with a maximum possible value of 1. Recall, on the other hand, is the proportion of actual drought events that the model correctly predicted as droughts, also with a maximum value of 1. Furthermore, bootstrap analysis was conducted to compute 95% confidence intervals for the evaluation metrics in each case, providing an assessment of their variability.
Precision for Q70 drought events averaged 0.796, for Q90 it was 0.674, and for Q95 it was 0.552, while recall showed values of 0.879 for Q70, 0.783 for Q90, and 0.669 for Q95. Except for the Q95 case of Dam 4, the recall was higher than the precision across all other cases. This indicated that the model is more effective at identifying actual positive cases (e.g., drought occurrences) than at ensuring the accuracy of its positive predictions. In other words, the model successfully captures a larger proportion of the true positive cases, but this comes at the cost of potentially including more false positives (e.g., cases where the model predicted a drought, but the prediction was incorrect).
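The relationship between precision, recall, and the F1 score can be made concrete with a short sketch. The labels below are hypothetical and chosen so that recall exceeds precision, mirroring the pattern reported above.

```python
from typing import List, Tuple


def precision_recall_f1(y_true: List[int], y_pred: List[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for the positive (drought = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels: 3 hits, 1 missed drought day, 2 false alarms.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Here the model catches three of four drought days (recall 0.75) but two of its five drought predictions are false alarms (precision 0.60).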
Bootstrap analysis was used to calculate the 95% confidence intervals for precision and recall, allowing an assessment of prediction uncertainty. On average, the 95% confidence interval for precision ranged from 0.766 to 0.826 for Q70 droughts, 0.599 to 0.747 for Q90, and 0.393 to 0.717 for Q95. Similarly, the 95% confidence interval for recall ranged from 0.854 to 0.904 for Q70 droughts, 0.709 to 0.852 for Q90, and 0.486 to 0.842 for Q95. The confidence interval ranges were observed to widen as the threshold level shifted from Q70 to Q95. This suggests that prediction uncertainty increases for more severe droughts, indicating greater variability in the model’s performance under extreme conditions.
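A percentile bootstrap of this kind can be sketched as follows. The resampling unit (individual days, drawn independently), the number of resamples, and the toy labels are all assumptions. The text does not describe the bootstrap design, and independent resampling of days ignores the serial correlation of drought series.

```python
import random
from typing import Callable, List, Tuple


def recall_score(y_true: List[int], y_pred: List[int]) -> float:
    """Share of actual drought days (1) that were predicted as drought."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if tp + fn else 0.0


def bootstrap_ci(y_true: List[int], y_pred: List[int],
                 metric: Callable[[List[int], List[int]], float],
                 n_boot: int = 1000, alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for a classification metric."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample days with replacement
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[round((alpha / 2) * n_boot)]
    upper = stats[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical validation labels: 40 drought days, 32 of them detected.
y_true = [1] * 40 + [0] * 160
y_pred = [1] * 32 + [0] * 8 + [1] * 10 + [0] * 150
point = recall_score(y_true, y_pred)
lower, upper = bootstrap_ci(y_true, y_pred, recall_score)
```

Fewer positive cases, as at Q95, leave fewer drought days in each resample, which is one reason the intervals widen for more severe thresholds.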
Since the drought forecasting model in this study predicts droughts three days ahead using only past data, the meteorological variables for those three days are not reflected in the model, which may account for its slightly lower performance than the drought identification model. Nevertheless, the model demonstrated relatively high scores depending on the threshold conditions and the data length used. This indicates the potential applicability of drought forecasting models, suggesting that higher performance can be expected when using meteorological forecast data.
Feature importance analysis for drought model
To interpret the internal mechanisms of the XGBoost-based drought identification model, we employed SHAP (SHapley Additive exPlanations), a widely used method for explaining the output of machine learning models. SHAP values quantify the contribution of each input variable to the model’s predictions by computing the marginal impact of a feature over all possible combinations of features, grounded in cooperative game theory. Specifically, SHAP values explain the model’s raw prediction function, which in this study is the probability output from XGBoost; they therefore express the contribution of each input variable to the predicted probability of drought occurrence. The final binary classification was obtained by applying a classification threshold optimized using the F1 score.
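The definition of a Shapley value as a weighted average of marginal contributions over all feature subsets can be illustrated by brute force on a toy model. This is not the algorithm the SHAP library uses for tree ensembles, and it is not the study's model: the linear `toy_model`, its coefficients, the baseline used for "absent" features, and the input values are all hypothetical.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(model: Callable[[Sequence[float]], float],
                   x: Sequence[float], baseline: Sequence[float]) -> List[float]:
    """Exact Shapley values; absent features are set to their baseline value."""
    n = len(x)

    def value(subset: set) -> float:
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(z: Sequence[float]) -> float:
    """Hypothetical drought score from inflow, precipitation, and PET."""
    inflow, precip, pet = z
    return 0.5 - 0.04 * inflow - 0.02 * precip + 0.03 * pet


x = [2.0, 0.0, 4.0]          # low inflow, no rain, slightly higher PET
baseline = [10.0, 5.0, 3.0]  # reference conditions
phi = shapley_values(toy_model, x, baseline)
```

The contributions sum to the difference between the prediction for `x` and the prediction for the baseline, which is the additivity property that makes per-variable contributions comparable.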
In this study, SHAP analysis was conducted to evaluate the relative importance of inflow and meteorological variables across different threshold levels (Q70, Q90, and Q95) and threshold time resolutions (daily, monthly, seasonal, and fixed). We focused on the Soyanggang Dam (Dam 1) case and applied the analysis using 5 days to 30 days of accumulated input data. The mean absolute SHAP values were calculated for each variable to assess their influence on the drought classification output.
Figure 7 presents the results of the feature importance analysis for Dam 1 under each threshold condition. The variables used in this analysis include inflow (represented by the blue bars) and meteorological variables. The meteorological variables, in the order displayed in the bar plots, include potential evapotranspiration, daily minimum temperature, daily maximum temperature, actual vapor pressure, atmospheric pressure, wind speed, and precipitation. The y-axis of the figure indicates the mean absolute SHAP value for each variable, while the x-axis represents the length of input data (in days) used in the drought model. Each row in Fig. 7 corresponds to a different threshold time resolution, and each column corresponds to a different threshold level.

Feature Importance Analysis of Input Variables in Drought Model (Dam 1, Soyanggang) for each Threshold; (a) Daily and Q70, (b) Daily and Q90, (c) Daily and Q95, (d) Monthly and Q70, (e) Monthly and Q90, (f) Monthly and Q95, (g) Seasonal and Q70, (h) Seasonal and Q90, (i) Seasonal and Q95, (j) Fixed and Q70, (k) Fixed and Q90, (l) Fixed and Q95.
Across all threshold conditions (comprising four time resolutions and three levels), it was observed that as the input time window (i.e., the length of data used in the drought model) increased, the relative contribution of inflow to model predictions decreased, while the contribution of meteorological variables increased. On average, the contribution of meteorological variables rose progressively with longer input windows, increasing from 45% with a 5-day window to 48% for 10 days, 51% for 20 days, and 53% for 30 days.
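One plausible way to express such contributions as percentages is to take the mean absolute SHAP value of each variable and divide the meteorological total by the overall total. The sketch below follows that reading, which is an assumption because the text does not state how the percentages were derived. The SHAP values shown are hypothetical.

```python
from typing import Dict, List


def mean_abs_by_feature(shap_rows: List[Dict[str, float]]) -> Dict[str, float]:
    """Mean absolute SHAP value of each feature across samples."""
    names = shap_rows[0].keys()
    return {name: sum(abs(row[name]) for row in shap_rows) / len(shap_rows) for name in names}


def meteorological_share(mean_abs_shap: Dict[str, float], inflow_name: str = "inflow") -> float:
    """Fraction of total mean |SHAP| attributed to the non-inflow variables."""
    total = sum(mean_abs_shap.values())
    met = sum(v for name, v in mean_abs_shap.items() if name != inflow_name)
    return met / total


# Hypothetical per-sample SHAP values for three variables.
shap_rows = [
    {"inflow": 0.30, "precip": -0.10, "pet": 0.05},
    {"inflow": -0.20, "precip": 0.20, "pet": -0.05},
]
importance = mean_abs_by_feature(shap_rows)
share = meteorological_share(importance)
```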
In contrast to the previous findings, it was observed that as the threshold time resolution became coarser (from daily to fixed thresholds), the contribution of meteorological variables generally decreased. Specifically, the contribution of meteorological variables was approximately 60% under daily thresholds, 56% for monthly thresholds, 46% for seasonal thresholds, and 35% under fixed thresholds. This indicates that as the threshold resolution becomes more sensitive (i.e., from fixed to daily), the importance of meteorological variables in the drought model increases.
On average, no consistent pattern was found in the contributions of inflow and meteorological variables to the drought model predictions across different threshold levels. The contribution of meteorological variables was approximately 50% for Q70, 48% for Q90, and 50% for Q95. Compared to the influence of input time window and threshold time resolution on feature importance, the variation in contributions due to threshold levels was minimal and did not exhibit a clear trend. However, the observation that the contribution of meteorological variables increased with longer input time windows, and that the contribution of inflow increased under thresholds based on larger temporal scales (from daily to fixed), suggests that the model structure is hydrologically reasonable and appropriately constructed.
