Forecasting invasive mosquito abundance in the Basque Country, Spain using machine learning techniques | Parasites & Vectors

Entomological and meteorological data

Data on Aedes mosquito egg counts from 2013 to 2023 in the Basque Country were obtained using ovitraps as described in [7, 30]. Following the European Centre for Disease Prevention and Control (ECDC) recommended guidelines [1], the ovitraps were distributed across the three provinces, covering 63 municipalities, as shown in Fig. 1b.

Fig. 1

a The Basque Country region in Spain (location on the European map). b Locations of meteorological stations and ovitraps in the Basque Country provinces during the period in which the entomological and meteorological datasets overlap (2016–2023)

The number of ovitraps varies by municipality, with two sampling areas selected in most cases. Each sampling area typically contains five ovitraps, which are positioned in sheltered spots away from direct sunlight and wind, often hidden within vegetation. Therefore, most municipalities had up to ten ovitraps. Each ovitrap contains water and a wooden stick (or tablex) that serves as a substrate for mosquito egg-laying. Every 14 days (on average), these sticks are removed and replaced with new ones. Thus, each municipality and area is sampled roughly 10–12 times per year, from June through November [7].

Meteorological data for the Basque Country were collected from the Basque Meteorological Agency (Euskalmet) across several weather stations (see Fig. 1b), covering the period from 2016 to 2023. The data, obtained from the OpenData Euskadi website [31], include precipitation [recorded as cumulative precipitation in millimeters (mm) or liters per square meter (l/m²)], temperature [measured in degrees Celsius (°C)], and humidity [relative air humidity as a percentage (%)]. Weather observations were recorded every 10 min at each station. For this study, we calculated daily averages of temperature and humidity and daily cumulative precipitation for each meteorological station.
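The daily aggregation described above can be illustrated with a minimal Python sketch (the study itself used R). The function name `daily_summary`, the record layout, and the toy observations are assumptions made for illustration only; they are not taken from the Euskalmet data format.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def daily_summary(records):
    """Collapse 10-min observations into one row per day.

    records: iterable of (timestamp, temp_c, humidity_pct, precip_mm).
    Returns {date: (mean temperature, mean humidity, total precipitation)}.
    """
    by_day = defaultdict(list)
    for ts, temp, hum, precip in records:
        by_day[ts.date()].append((temp, hum, precip))

    summary = {}
    for day, rows in sorted(by_day.items()):
        temps, hums, precs = zip(*rows)
        # Temperature and humidity are averaged; precipitation is summed.
        summary[day] = (mean(temps), mean(hums), sum(precs))
    return summary


# Hypothetical observations for a single station.
obs = [
    (datetime(2023, 7, 1, 0, 0), 18.0, 80.0, 0.0),
    (datetime(2023, 7, 1, 0, 10), 20.0, 70.0, 0.5),
    (datetime(2023, 7, 2, 0, 0), 22.0, 60.0, 1.0),
]
daily = daily_summary(obs)
```

Averaging the state variables while summing precipitation keeps each daily value in the same unit as the underlying observations.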

Study area and data per provinces

The Basque Country, located in northern Spain, is divided into three administrative provinces: Araba (Álava), Bizkaia (Biscay), and Gipuzkoa (see Fig. 1). With a total area of 7234 km² and a population of approximately 2.18 million [32], the region is characterized by diverse landscapes and a maritime climate, with temperate conditions and high annual precipitation, particularly in the coastal areas. Araba, the southernmost province, has a more continental influence in its climate, with drier and slightly colder conditions than the coastal provinces of Bizkaia and Gipuzkoa. Bizkaia and Gipuzkoa, bordered by the Cantabrian Sea, experience milder temperatures and higher humidity. These climatic differences across the provinces influence the mosquito abundance patterns, which this study aims to capture and analyze through the environmental data collected.

For this study, we analyzed ovitrap mosquito egg counts collected in various locations across all three provinces. The data were preprocessed by averaging the 20 highest egg counts per province over a 14-day interval, considering that each municipality had a maximum of 10 ovitraps. This approach was necessary to address inconsistencies in the number of monitored ovitraps over the studied period and to avoid skewing the results with prevalent zero counts. By selecting the 20 largest egg counts, the data reflect meaningful mosquito activity (in at least two distinct locations), effectively filtering out areas with consistently low or zero activity.
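As a minimal Python sketch of this top-20 averaging, the function below takes all egg counts pooled for one province in one 14-day interval. The name `top_k_mean`, the behavior when fewer than 20 counts are available, and the toy counts are assumptions for illustration; the source does not specify how such cases were handled.

```python
import heapq
from statistics import mean


def top_k_mean(counts, k=20):
    """Mean of the k largest egg counts in one province and interval.

    Assumption: if fewer than k counts exist, all of them are averaged,
    and an empty interval yields 0.0.
    """
    top = heapq.nlargest(k, counts)
    return mean(top) if top else 0.0


# Hypothetical interval: 15 empty traps, 10 traps with 5 eggs, 10 with 40 eggs.
counts = [0] * 15 + [5] * 10 + [40] * 10

top20_average = top_k_mean(counts)           # ten 40s and ten 5s -> 22.5
plain_average = sum(counts) / len(counts)    # pulled down by the zero counts
```

In this toy case the plain mean is about 12.9, while the top-20 mean is 22.5, which shows how the selection limits the influence of prevalent zero counts.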

Meteorological data, specifically daily precipitation [cumulative precipitation in millimeters (mm)], air temperature [in degrees Celsius (°C)], and relative humidity [percentage (%)], were obtained by averaging daily values from all available meteorological stations in each province. These features were then aggregated over the previous 14 days to maintain consistent time intervals between the entomological and meteorological datasets. The average annual temperature and accumulated precipitation are approximately 11.5 °C and 878 mm in Araba, 13.8 °C and 1278 mm in Bizkaia, and 13.4 °C and 1610 mm in Gipuzkoa [7]. These values are consistent with the survival thresholds discussed in the literature for A. albopictus [1, 12].
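The 14-day alignment can be sketched in Python as follows. Two points are assumptions rather than details given in the source: the window is taken as the 14 days ending the day before the ovitrap collection date, and temperature and humidity are averaged while precipitation is summed (in line with the quantities shown in Fig. 2). The function name and toy data are hypothetical.

```python
from datetime import date, timedelta
from statistics import mean


def window_features(daily, sample_day, days=14):
    """Aggregate daily provincial weather over the `days` before sample_day.

    daily: {date: (temp_c, humidity_pct, precip_mm)}.
    Returns (mean temperature, mean humidity, cumulative precipitation).
    """
    window = []
    for back in range(1, days + 1):
        day = sample_day - timedelta(days=back)
        if day in daily:
            window.append(daily[day])
    if not window:
        raise ValueError("no daily records in the window")
    temps, hums, precs = zip(*window)
    return mean(temps), mean(hums), sum(precs)


# Hypothetical daily series: temperature rises 1 °C per day, 1 mm rain per day.
start = date(2023, 6, 1)
daily = {start + timedelta(days=i): (20.0 + i, 80.0, 1.0) for i in range(14)}

features = window_features(daily, sample_day=date(2023, 6, 15))
```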

The time series of the average egg counts, temperature, humidity, and cumulative precipitation for each province in the Basque Country are shown in Fig. 2.

Fig. 2

Number of mosquito eggs collected (a, c, e) and average temperature (°C), relative air humidity (%), and cumulative precipitation (mm) (b, d, f). Data were gathered biweekly for Gipuzkoa (a, b), Bizkaia (c, d), and Araba (e, f)

Mosquito eggs are typically found during the summer months, from June to October, when the combination of higher temperatures and favorable humidity conditions promotes mosquito activity and reproduction. As shown in Fig. 2a, the egg count in the entire Gipuzkoa province has significantly increased over the last years of collected data, although this trend may vary between municipalities. For example, the city of Irun (Supplementary Material S3), the second most populated city in Gipuzkoa, located on the border with France, shows variability without a clear increasing trend.

In Gipuzkoa, temperature exhibited a clear seasonal annual pattern, while accumulated rainfall showed no apparent trend. Humidity, however, decreased during the winter and followed a quasi-periodic structure (see Fig. 2b). The winter of 2019, right after the expected period of higher egg presence, was exceptionally rainy compared with other winters in the province. Combined with low humidity (below 75%), this may have contributed to the lower egg counts observed in the following summer season (2020). In contrast, the dry summer of 2022, accompanied by higher humidity levels (above 75%), may explain the increased egg counts observed that year.

In Bizkaia, the time series of egg counts has displayed a consistent upward trend over the years, with positive egg traps first recorded in 2017 (Fig. 2c). Notably, the average mosquito egg count in Bilbao, the capital of Bizkaia, during 2023 serves as a good proxy for the province-wide average (Fig. S4c in Supplementary Material S3).

The temperature in Bizkaia followed a clear seasonal pattern, while accumulated rainfall showed no apparent trend, with significant cumulative precipitation occurring later in 2021. In contrast to Gipuzkoa, however, humidity in Bizkaia exhibited periodic increases approximately every 2 years, with higher levels typically observed during winter months (Fig. 2d).

Moreover, average precipitation in Bizkaia was slightly lower than in Gipuzkoa. Temperature fluctuations in Bizkaia were more pronounced, as indicated by the steeper slope of its temperature curve compared with Gipuzkoa, potentially explaining the lower average egg counts in the region. In addition, the dry summer of 2021, followed by a rainy winter, may have contributed to the consistent egg count trend observed.

Furthermore, although ovitraps have been distributed and data collected in the province of Araba since 2013, positive egg traps were not recorded until 2018, with no positive ovitraps observed in 2019 or 2020 (Fig. 2e). In Laudio, the second most populated municipality in Araba, positive ovitraps were only recorded in 2021 (Fig. S4e in Supplementary Material S3).

The average temperature in Araba exhibits annual seasonality, while precipitation lacks a clear trend, though cumulative rainfall is typically higher during winter. Humidity tends to increase alongside precipitation (Fig. 2f). The lower average temperature in this province may contribute to the reduced presence of mosquito eggs.

Given the dispersed nature of data in Araba, with many zero values in egg counts (Fig. 2e), there is insufficient information to develop a reliable training dataset for model fitting. Therefore, this province is excluded from further analysis. Smaller spatial units, such as individual municipalities, are similarly excluded, with the focus of this study being the two Basque Country provinces, Gipuzkoa and Bizkaia. Nonetheless, descriptive statistics and detailed analyses at the municipal level for Irun and Bilbao, which have adequate data, are provided in Supplementary Material S3.1 and S3.2, respectively.

Methodological approach

Data processing

After gathering data, preprocessing is a crucial initial step before model training, forecasting, and evaluation. In this study, data preprocessing included the following steps. First, we ensured a consistent interval for both the independent and dependent variables, selecting a biweekly interval for the entomological data based on the average 14-day period in which egg counts were collected.

Next, we addressed missing values through imputation, filling gaps with zero values. This choice is scientifically justified within the context of this dataset, as institutional data indicated that, for months without data collection, ovitrap counts would have likely been zero [7]. This assumption was based on data from four sentinel points (two in Gipuzkoa and two in Bizkaia) monitored over a year to determine the start and end of Aedes mosquito activity in regions with recorded presence in the previous year.
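A minimal Python sketch of this zero imputation on a regular biweekly grid is given below. It assumes, for illustration only, that observed collection dates fall exactly on the 14-day grid; the function name and the example dates are hypothetical.

```python
from datetime import date, timedelta


def fill_biweekly(observed, start, end, step_days=14):
    """Build a regular biweekly series, filling unsampled dates with zero.

    observed: {date: average egg count} for dates that were sampled.
    """
    series = {}
    day = start
    while day <= end:
        # Dates without a collection are assumed to have zero eggs.
        series[day] = observed.get(day, 0.0)
        day += timedelta(days=step_days)
    return series


# Hypothetical province with a single sampled date.
observed = {date(2023, 6, 5): 12.0}
series = fill_biweekly(observed, start=date(2023, 5, 22), end=date(2023, 6, 19))
```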

Moreover, we included only the 20 highest egg counts at the provincial level to account for variations in the number of monitored ovitraps over time, helping to reduce dataset skewness. Outliers were then removed using a central moving average as a smoothing method, commonly applied to mitigate white noise, random fluctuations, and extreme values [33].
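The smoothing step can be illustrated with a short Python sketch of a centered moving average. The window length of three points and the shrinking window at the series edges are assumptions, since the source does not report these settings.

```python
def centered_moving_average(values, window=3):
    """Smooth a series with a centered moving average.

    Assumption: at the edges, the window shrinks to the available points.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Hypothetical series with one isolated spike.
raw = [0, 0, 30, 0, 0]
smooth = centered_moving_average(raw, window=3)
# smooth == [0.0, 10.0, 10.0, 10.0, 0.0]
```

The isolated spike of 30 is spread over its neighbors, which dampens the extreme value without discarding the observation.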

For the meteorological data, no imputation was required as daily weather data were available for the entire study period. In this case, outliers were retained as they could signal significant events associated with the presence or absence of mosquito eggs. Basic exploratory analysis was then conducted using descriptive statistics and correlation tests, incorporating both the original and lagged versions of the meteorological data.
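To illustrate what a correlation with a lagged meteorological variable involves, the Python sketch below pairs the feature at time t − lag with the egg count at time t. The use of the Pearson coefficient is an assumption (the source does not name the correlation test), and the function names and toy series are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def lagged_correlation(feature, target, lag):
    """Correlate feature[t - lag] with target[t] (lag in biweekly steps)."""
    if lag == 0:
        return pearson(feature, target)
    return pearson(feature[:-lag], target[lag:])


# Toy series in which egg counts follow temperature with a one-step delay.
temperature = [10, 12, 15, 19, 24, 22, 17]
eggs = [0, 20, 24, 30, 38, 48, 44]

r_lag0 = lagged_correlation(temperature, eggs, lag=0)
r_lag1 = lagged_correlation(temperature, eggs, lag=1)
```

In this constructed example the lag-1 correlation is essentially 1, whereas the unlagged correlation is lower, which is the kind of pattern a lag analysis is meant to reveal.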

Finally, we split the data into training and testing sets, with the training data comprising \(85.71\%\) and \(83.33\%\) of the series for Gipuzkoa and Bizkaia, respectively. The remaining 26 data points (1 year of biweekly data) in each province were allocated for testing.
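The split proportions can be checked with a short Python sketch. The series lengths of 182 and 156 points are not stated in the source; they are inferred here from the reported training shares and the 26 test points. Holding out the last 26 points in time order is likewise an assumption, as are the function names.

```python
def implied_total(train_share, n_test=26):
    """Series length implied by a training share and a fixed test size."""
    return round(n_test / (1 - train_share))


def chronological_split(series, n_test=26):
    """Hold out the last n_test points, keeping the time order intact."""
    if n_test >= len(series):
        raise ValueError("series is too short for the requested test size")
    return series[:-n_test], series[-n_test:]


n_gipuzkoa = implied_total(0.8571)   # 182 points, i.e. 7 years x 26
n_bizkaia = implied_total(0.8333)    # 156 points, i.e. 6 years x 26

# Placeholder series standing in for the biweekly egg counts.
train, test = chronological_split(list(range(n_gipuzkoa)))
```

A chronological split avoids training on observations that come after the test period, which matters for forecasting tasks.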

Models

In this study, we applied different models with and without the lagged egg counts as a proxy and the lagged versions of the independent environmental variables. To appropriately handle the discrete and non-negative nature of counts, we restricted our choice and application of models to those presented here [34–41]. More details about each model can be found in Supplementary Material S1.

We implemented the generalized linear model (GLM), seasonal autoregressive integrated moving average with exogenous variables (SARIMAX), random forest (RF), and conditional inference tree (CTree) models (and other models discussed in Supplementary Material S1) in R (version 3.6.3) using the packages MASS, forecast, randomForest, and party, respectively. Only these four models are presented in this study because (as discussed in Supplementary Material S1) some of the other models exhibit over-fitting, others demonstrate under-fitting (as is the case with the artificial neural network (ANN) model), and some fail to capture any significant features of the data.

Stationarity analysis

We applied the augmented Dickey–Fuller (ADF) test, a commonly used method for testing the presence of a unit root in time series data, to assess whether the time series is nonstationary [17]. In a nonstationary time series, the mean, variance, and covariance change over time, which makes the series difficult to model and forecast. Although some models, such as SARIMAX, can handle nonstationarity, stationary time series often yield more reliable results [42].

The null hypothesis of the ADF test states that the series contains a unit root, indicating nonstationarity, while the alternative hypothesis suggests that the series is stationary. To test the null hypothesis, we computed the P value. A P value less than 0.05 leads us to reject the null hypothesis, implying stationarity.
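For reference, the ADF test is commonly formulated through the auxiliary regression below. This is the general textbook form rather than a specification taken from this study; the symbols \(\alpha\), \(\beta\), \(\gamma\), \(\delta_j\), the lag order \(p\), and the series \(z_t\) are notation introduced here, and the deterministic terms \(\alpha\) and \(\beta t\) may be included or omitted depending on the implementation.

$$\begin{aligned} \Delta z_t = \alpha + \beta t + \gamma z_{t-1} + \sum _{j=1}^{p} \delta _j \Delta z_{t-j} + \varepsilon _t, \end{aligned}$$

where \(\Delta z_t = z_t - z_{t-1}\) and \(\varepsilon _t\) is an error term. In this form, the null hypothesis of a unit root corresponds to \(\gamma = 0\), and the alternative of stationarity corresponds to \(\gamma < 0\).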

We conducted the ADF test using the tseries package in R. For both datasets, Gipuzkoa and Bizkaia, the ADF test on the predictor variable yielded \(P < 0.01\), which is below the 0.05 threshold and indicates that the series are stationary.

Evaluation metrics

To compare the performance of statistical and machine learning models, three widely used evaluation metrics were employed: the mean absolute error (MAE), the root mean squared error (RMSE), and the R-squared (\(R^2\)) score.

The MAE is calculated as:

$$\begin{aligned} {\text{MAE}} = \frac{1}{n} \sum _{i=1}^{n} |y_i - \hat{y}_i|, \end{aligned}$$

(1)

where \(y_i\) and \(\hat{y}_i\) represent the observed and predicted values, respectively, and \(| \cdot |\) denotes the absolute value [43]. MAE measures the average magnitude of the errors in a set of predictions, without considering their direction.

The RMSE is given by:

$$\begin{aligned} {\text{RMSE}} = \sqrt{\frac{1}{n} \sum _{i=1}^{n} (y_i - \hat{y}_i)^2}, \end{aligned}$$

(2)

where \(y_i\) and \(\hat{y}_i\) are the observed and predicted values, respectively. RMSE gives a higher weight to large errors compared with MAE and is sensitive to outliers.

The \(R^2\) score, also known as the coefficient of determination, is calculated as:

$$\begin{aligned} R^2 = 1 - \frac{S_r}{S_t} = 1 - \frac{\sum _{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum _{i=1}^{n} (y_i - \overline{y})^2}, \end{aligned}$$

(3)

where \(S_r\) is the residual sum of squares, representing the sum of squared differences between the observed values (\(y_i\)) and the predicted values (\(\hat{y}_i\)); and \(S_t\) is the total sum of squares, calculated as the sum of squared differences between the observed values (\(y_i\)) and their mean (\(\overline{y}\)). An \(R^2\) score of 1 indicates that the model explains all the variability of the response variable, while a score of 0 indicates no explanatory power.

The selection of the best model is based on achieving the lowest MAE or RMSE values or an \(R^2\) score closest to 1. In this study, the MAE is chosen as the primary evaluation metric due to its suitability for machine learning models [43].
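The three metrics in Eqs. 1–3 can be computed directly from their definitions, as in the minimal Python sketch below (the study itself used R). The function names and the toy observed and predicted values are illustrative assumptions, not results from this study.

```python
from math import sqrt


def mae(y, y_hat):
    """Mean absolute error (Eq. 1)."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def rmse(y, y_hat):
    """Root mean squared error (Eq. 2)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def r_squared(y, y_hat):
    """Coefficient of determination (Eq. 3): 1 - S_r / S_t."""
    y_bar = sum(y) / len(y)
    s_r = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    s_t = sum((a - y_bar) ** 2 for a in y)
    return 1 - s_r / s_t


# Hypothetical observed and predicted egg counts.
observed = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]

print(mae(observed, predicted))        # 2.5
print(rmse(observed, predicted))       # about 2.55
print(r_squared(observed, predicted))  # about 0.948
```

Here the absolute errors are 2, 2, 3, and 3, so the RMSE (about 2.55) slightly exceeds the MAE (2.5), reflecting the extra weight that squaring gives to the larger errors.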
