Machine learning to examine adequate awareness and positive perception of HIV pre-exposure prophylaxis among women in sub-Saharan Africa: evidence from 2021-2024 surveys | BMC Infectious Diseases

Study design, setting and data sources

This study employed a cross-sectional design using nationally representative, population-based surveys conducted in sub-Saharan Africa between 2021 and 2024. Data were drawn from the Demographic and Health Surveys (DHS) program, which is funded by the United States Agency for International Development (USAID) and provides financial and technical support for standardized demographic and health data collection worldwide. For this analysis, we used the most recent DHS datasets available from eight countries—Burkina Faso, Côte d’Ivoire, Ghana, Kenya, Tanzania, Democratic Republic of Congo, Lesotho, and Senegal. The DHS employs rigorous multistage stratified sampling techniques to ensure national and regional representativeness, with large sample sizes designed to capture demographic, behavioral, and health-related indicators. The surveys included standardized modules on HIV knowledge, awareness and perceptions of pre-exposure prophylaxis (PrEP), socio-demographic characteristics, and behavioral risk factors. For comparability, survey datasets were harmonized and pooled to create a secondary dataset, enhancing statistical power and enabling cross-country analyses of factors associated with PrEP awareness and perceptions among women of reproductive age.

Source population

The source population comprised women of reproductive age (15–49 years) residing in sub-Saharan Africa who were eligible to participate in the Demographic and Health Surveys (DHS) conducted between 2021 and 2024. The DHS program is designed to collect nationally representative data using standardized sampling and data collection procedures, ensuring comparability across countries. These surveys represent the general population of women in reproductive age groups within each participating country and serve as the foundation for deriving the study population.

Study population

The study population included women aged 15–49 years from eight sub-Saharan African countries (Burkina Faso, Côte d’Ivoire, Ghana, Kenya, Tanzania, Democratic Republic of Congo, Lesotho, and Senegal) whose most recent DHS rounds collected data on awareness and perception of HIV pre-exposure prophylaxis (PrEP). The final pooled dataset consisted of a weighted sample of 123,132 women. Eligible participants were those who completed the PrEP awareness and perception modules without missing key demographic information. Women who reported being HIV-positive at the time of the survey were excluded to focus the analysis on PrEP awareness and perceptions among HIV-negative women (Table 1).

Table 1 Countries, survey years, sample sizes, and proportions of women with adequate awareness and positive perception of PrEP in the Demographic and Health Surveys included in the analysis, eight sub-Saharan African countries, 2021–2024

Sample size determination and sampling procedures

The Demographic and Health Surveys (DHS) are conducted about every five years in many low- and middle-income countries. They use standardized, pretested questionnaires and consistent methods for sampling, data collection, and coding, which allows for cross-country comparisons and multi-country analyses. In each country included in this study, the surveys relied on the most recent national census as the sampling frame, with samples stratified by urban and rural areas within administrative regions. The DHS applies a two-stage stratified cluster sampling design. In the first stage, clusters, called enumeration areas (EAs), are randomly chosen from census lists within each stratum, with probability of selection proportional to the size of each EA. In the second stage, all households in the selected EAs are listed, and a fixed number are systematically chosen (e.g., every nth household) so that households within an EA have an equal probability of selection. This process generates nationally representative samples of women aged 15–49 years across countries.

For this analysis, data were pooled from eight countries, resulting in a weighted sample of 123,132 women who responded to questions on HIV pre-exposure prophylaxis (PrEP) awareness and perceptions.
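The two-stage design described above can be sketched as follows. This is a minimal illustration, not the DHS implementation: the EA sizes, the number of clusters, the number of households per cluster, and both function names are hypothetical, and stratification and sampling weights are omitted.

```python
import numpy as np


def select_clusters_pps(ea_sizes, n_clusters, rng):
    """Stage 1: systematic selection of EAs with probability proportional to size."""
    cumulative = np.cumsum(np.asarray(ea_sizes, dtype=float))
    step = cumulative[-1] / n_clusters          # sampling interval on the size scale
    start = rng.uniform(0, step)                # random start within the first interval
    points = start + step * np.arange(n_clusters)
    # Each point falls inside exactly one EA's cumulative-size range.
    return np.searchsorted(cumulative, points, side="right")


def select_households_systematic(n_households, n_select, rng):
    """Stage 2: systematic selection of a fixed number of listed households."""
    interval = n_households / n_select
    start = rng.uniform(0, interval)
    return np.floor(start + interval * np.arange(n_select)).astype(int)


rng = np.random.default_rng(seed=0)
ea_sizes = rng.integers(50, 300, size=200)      # hypothetical household counts per EA
clusters = select_clusters_pps(ea_sizes, n_clusters=10, rng=rng)
households = {
    int(ea): select_households_systematic(int(ea_sizes[ea]), n_select=25, rng=rng)
    for ea in clusters
}
```

Larger EAs are more likely to be drawn in the first stage, while the fixed take per EA in the second stage keeps the fieldwork load per cluster constant.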

Outcome variable

The main outcome of interest in this study was women’s awareness and perceptions of HIV pre-exposure prophylaxis (PrEP), a preventive measure against HIV infection. Participants were asked whether they had ever heard of PrEP and, if so, how they perceived its use. Response options were: never heard of it, heard of it, heard and approved of taking it daily, heard but did not approve of daily use, or heard but were unsure about approving it. For analysis, women who were aware of PrEP and approved of its use were coded as “Yes = 1,” while those who had not heard of PrEP, did not approve, or were unsure were coded as “No = 0.” We acknowledge that combining awareness and approval into a single binary variable may misclassify women who are aware but hesitant; however, this approach reflects the study objective of identifying women who are both aware of and positively inclined toward PrEP use. Sensitivity analyses assessing this potential misclassification showed no material change in the results. There were no missing or unknown values for this outcome variable.
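The binary coding described above could be expressed in pandas roughly as below. The response labels and the column name are hypothetical stand-ins for the survey's actual codes, and treating "heard of it" without stated approval as 0 is an assumption, since the text does not say how that category was handled.

```python
import pandas as pd

# Hypothetical labels mirroring the five response options in the text.
APPROVES = "heard and approves of daily use"

responses = pd.Series(
    [
        "never heard of it",
        "heard of it",                      # assumption: no stated approval, coded 0
        "heard and approves of daily use",
        "heard but does not approve of daily use",
        "heard but unsure about approving",
    ],
    name="prep_response",
)

# 1 = aware of PrEP and approves of its use; 0 = every other response.
outcome = (responses == APPROVES).astype(int)
```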

Predictors and feature selection

Predictor variables included socio-demographic characteristics (age, marital status, education level, employment status, income quintile), behavioral factors (number of sexual partners, condom use, history of sexually transmitted infections), health service utilization (recent HIV testing, antenatal care attendance), and contextual variables (urban/rural residence, media exposure, and country-level HIV prevalence). To reduce multicollinearity and improve model efficiency, feature selection was performed using recursive feature elimination (RFE) and correlation analysis, retaining only the most informative predictors for model training [26,27,28].

Data preprocessing

Data cleaning included handling missing values using multiple imputation by chained equations and encoding categorical predictors via one-hot encoding. Continuous variables were normalized with min–max scaling. The dataset was split into training (70%) and testing (30%) subsets, stratified by the outcome variable to ensure balanced class representation [28].
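A minimal sketch of the encoding, scaling, and stratified 70/30 split, using scikit-learn on synthetic data. The column names and values are invented for illustration, imputation is left out, and fitting the transformers on the training split only is a design choice made here, not something the text specifies.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic stand-in for the pooled survey data (hypothetical columns).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(15, 50, size=n),
    "residence": rng.choice(["urban", "rural"], size=n),
    "education": rng.choice(["none", "primary", "secondary", "higher"], size=n),
    "outcome": rng.choice([0, 1], size=n, p=[0.8, 0.2]),
})
X = df.drop(columns="outcome")
y = df["outcome"]

# 70/30 split, stratified so both subsets keep the outcome proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# Min-max scale the continuous column and one-hot encode the categorical ones.
preprocess = ColumnTransformer(
    [
        ("num", MinMaxScaler(), ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["residence", "education"]),
    ],
    sparse_threshold=0,  # always return a dense array
)
X_train_t = preprocess.fit_transform(X_train)   # learn scaling and categories on train
X_test_t = preprocess.transform(X_test)         # reuse them on the held-out data
```

Fitting on the training rows alone keeps information from the test set out of the scaling ranges and category lists.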

Correlation matrix heatmap

A correlation matrix heatmap was generated to visualize the relationships among the predictors included in the models. The heatmap displays both strong and weak correlations, facilitating the identification of potentially redundant or complementary variables. Insights from these correlation patterns informed the subsequent feature selection and model optimization steps, ensuring that only the most informative predictors were retained for model training (Fig. 1).

Fig. 1
Correlation matrix heatmap illustrating pairwise associations among socio-demographic, behavioral, and contextual predictors used in the machine learning models
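The matrix behind such a heatmap can be computed with pandas, as in the sketch below. The data are synthetic, the variable names are illustrative, and `high_corr_pairs` is a hypothetical helper for listing strongly correlated pairs; the threshold is likewise an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic predictors; under-five children is built to correlate with total children.
rng = np.random.default_rng(0)
n = 500
total_children = rng.poisson(3, size=n)
df = pd.DataFrame({
    "age": rng.integers(15, 50, size=n),
    "total_children": total_children,
    "under_five_children": np.minimum(total_children, rng.poisson(1, size=n)),
    "wealth_index": rng.integers(1, 6, size=n),
})

corr = df.corr(method="pearson")   # symmetric matrix of pairwise correlations


def high_corr_pairs(corr, threshold):
    """List each variable pair whose absolute correlation meets the threshold."""
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(r, 2)))
    return pairs


candidates = high_corr_pairs(corr, threshold=0.5)
```

The `corr` frame can be passed to a plotting function such as `seaborn.heatmap` to draw the figure itself.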

Feature ranking using recursive feature elimination (RFE)

In this study, feature selection techniques were applied to remove irrelevant or redundant variables during the development of predictive models, improving efficiency and interpretability. Data preprocessing involved systematically reducing the number of features to retain only the most informative predictors. We employed Recursive Feature Elimination (RFE), a method that iteratively evaluates and removes less important features based on model-derived importance scores until the most relevant variables remain. This approach enhances model performance, reduces overfitting by excluding noise, and simplifies model interpretation. Using RFE, the most influential predictors selected for model building included maternal age, educational status, place of residence, marital status, household wealth index, employment status, media exposure, ANC follow-up, place of delivery, number of health visits, total children, under-five children, contraceptive use, ever heard about STIs, ever tested for HIV, age at first birth, sexual partner working status, age at cohabitation, and abortion history. These selected determinants were then used to train the predictive models, as illustrated in Fig. 2.

Fig. 2
Ranking of the most important features for predicting women’s awareness and perceptions of HIV pre-exposure prophylaxis (PrEP) using recursive feature elimination
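Recursive feature elimination as described can be sketched with scikit-learn's `RFE`. The base estimator, the number of features kept, and the synthetic data are all assumptions; the text does not report which estimator or stopping point was used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE

# Synthetic data: 10 candidate features, only some of them informative.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, n_redundant=2, random_state=0
)

# Repeatedly fit the model and drop the least important feature (step=1)
# until only the requested number remain.
selector = RFE(
    estimator=GradientBoostingClassifier(n_estimators=50, random_state=0),
    n_features_to_select=5,
    step=1,
)
selector.fit(X, y)

selected = [i for i, keep in enumerate(selector.support_) if keep]
ranking = selector.ranking_   # 1 = retained; larger values were eliminated earlier
```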

Machine learning models

Five supervised machine learning classifiers were trained to predict awareness and positive perception of PrEP: K-Nearest Neighbors (KNN), XGBoost, CatBoost, LightGBM, and Gradient Boosting. Hyperparameters were optimized using grid search with 5-fold cross-validation, with accuracy and F1-score as the primary criteria. Model performance on the test set was evaluated using accuracy, precision, recall, F1-score, and the area under the receiver operating characteristic curve (ROC AUC).
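The tuning and evaluation loop might look like the sketch below. It covers only two of the five classifiers, both from scikit-learn; the parameter grids and the synthetic data are invented for illustration, and refitting on F1 when two scoring criteria are tracked is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# Hypothetical grids; XGBClassifier, CatBoostClassifier, and LGBMClassifier
# follow the same estimator interface and could be added in the same way.
candidates = {
    "KNN": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 11]}),
    "Gradient Boosting": (
        GradientBoostingClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [2, 3]},
    ),
}

results = {}
for name, (model, grid) in candidates.items():
    # 5-fold cross-validated grid search tracking accuracy and F1.
    search = GridSearchCV(
        model, grid, cv=5, scoring={"accuracy": "accuracy", "f1": "f1"}, refit="f1"
    )
    search.fit(X_train, y_train)
    pred = search.predict(X_test)
    proba = search.predict_proba(X_test)[:, 1]
    results[name] = {
        "best_params": search.best_params_,
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "f1": f1_score(y_test, pred),
        "roc_auc": roc_auc_score(y_test, proba),
    }
```

The test set is used only once per model, after the hyperparameters have been chosen on the training folds.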

Model interpretation

To enhance interpretability, Shapley Additive Explanations (SHAP) were computed to identify the most influential predictors in each model. Feature importance rankings were also derived from the algorithms, and SHAP summary plots were used to visualize both the direction and magnitude of feature effects. This provided actionable insights into the drivers of PrEP awareness [28].
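To make the idea behind SHAP concrete, the sketch below computes exact Shapley values for a toy scoring function by enumerating feature subsets. It is not the `shap` package, which the study used; the function `shapley_values`, the toy `predict` function, and the all-zero baseline are hypothetical choices for illustration only.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley attribution; features outside a subset take baseline values."""
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a subset of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i when added to subset s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def predict(z):
    # Toy score: two additive effects plus one interaction between features 0 and 2.
    return 0.5 * z[0] + 0.3 * z[1] + 0.2 * z[0] * z[2]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(predict, x, baseline)
```

The attributions sum to the difference between the prediction for `x` and the prediction for the baseline, and the interaction term is shared equally between the two features involved. Exact enumeration grows exponentially with the number of features, which is why dedicated SHAP algorithms are used in practice.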

Statistical analysis and imputation

Descriptive statistics were used to summarize participants’ characteristics and the prevalence of PrEP awareness. Bivariate analyses (chi-square tests for categorical variables and t-tests for continuous variables) were conducted to explore associations between predictors and PrEP awareness. Prior to model training, data preprocessing included checks for multicollinearity among predictors using the variance inflation factor (VIF); highly correlated variables (VIF > 10) were excluded to ensure model stability. Missing values were handled according to the nature of the variable: for categorical variables, the mode imputation method was applied, while for continuous variables, multiple imputation techniques were employed to preserve statistical power and reduce bias. However, the proportion of missing data was minimal across the included variables.
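The VIF screen described above can be sketched with NumPy alone: each predictor is regressed on the others, and VIF is 1 / (1 - R²). The function name and the synthetic predictors are hypothetical; the threshold of 10 is the one stated in the text.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF per column: regress it on the remaining columns, then 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        design = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)


rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)                  # independent of a
c = a + 0.1 * rng.normal(size=500)        # nearly duplicates a
vifs = variance_inflation_factors(np.column_stack([a, b, c]))
flagged = vifs > 10                       # threshold stated in the text
```

Here both `a` and `c` are flagged because each is almost fully explained by the other. A common approach is to drop one flagged variable at a time and recompute, rather than removing every flagged column at once.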

For predictive modeling, multiple supervised machine learning algorithms were applied, including K-Nearest Neighbors (KNN), XGBoost, CatBoost, LightGBM, and Gradient Boosting. Model performance was evaluated using accuracy, precision, recall, F1-score, and ROC AUC metrics. Feature selection was performed using Recursive Feature Elimination (RFE), while model interpretability was assessed through SHAP (Shapley Additive Explanations) values. All analyses were implemented in Python (v3.8+) using libraries such as scikit-learn, XGBoost, CatBoost, LightGBM, SHAP, and pandas.
