Approach
In this research, the unit of analysis is the individual birth. We aim to identify the births with the highest neonatal mortality risk from preventable causes. As such, we included all the available information from the administrative databases, which improved the precision of our targeting.
Data sources
We use administrative databases from the Brazilian Ministry of Health to obtain birth and death records for the entire country from 2015 to 2017, as well as information about health facilities, professionals, and available equipment. All data are publicly available and organized into three health information systems, which we describe below: SINASC (Sistema de Informações sobre Nascidos Vivos), SIM (Sistema de Informação sobre Mortalidade), and CNES (Cadastro Nacional de Estabelecimentos de Saúde).
SINASC covers all live births in the Brazilian territory, recording epidemiological and administrative information about the mothers and children. SIM, in turn, covers all deaths in the territory, containing epidemiological and administrative information about each death and its circumstances. Fetal deaths are not considered, as they are beyond the scope of this paper. Finally, CNES records a snapshot of Brazilian health facilities at a point in time. Together, these systems yield three tables containing, respectively, all live births, all deaths, and health facility information.
To merge SIM and SINASC data, we used the field NUMERODN, a unique number identifying each live birth. SIM records contain this field for deaths occurring within the first year of life. We then merged the result with CNES data using the CNES number, a unique identifier for health facilities present in both the SINASC and CNES databases. The resulting raw dataset totals 8,829,944 records.
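The deterministic linkage can be sketched in Python with pandas (the paper's pipeline is in R; the toy tables, all values, and every column name other than NUMERODN and CNES below are purely illustrative):

```python
import pandas as pd

# Toy stand-ins for the three deduplicated tables (illustrative values only).
sinasc = pd.DataFrame({"NUMERODN": [1, 2, 3], "CNES": [10, 10, 20]})
sim = pd.DataFrame({"NUMERODN": [2], "cause": ["preventable"]})
cnes = pd.DataFrame({"CNES": [10, 20], "facility_type": ["hospital", "clinic"]})

# Deterministic linkage: left-join deaths onto births via NUMERODN,
# then attach facility information via the CNES identifier.
births = sinasc.merge(sim, on="NUMERODN", how="left")
births = births.merge(cnes, on="CNES", how="left")

# A birth counts as a neonatal death record only if it linked to SIM.
births["died"] = births["cause"].notna()
```

Left joins preserve the full set of births, so unlinked SIM records never enter through this step.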
When merging the three databases, we identified and removed duplicate observations in the SIM and SINASC tables to avoid inconsistencies. With the deduplicated tables, we executed deterministic linkages.
We note that we also performed linkages using probabilistic matching, following the method proposed by Enamorado et al. (2019). Although probabilistic matching enabled us to link a larger proportion of records, it did not improve the predictive power of the algorithm; thus, for the sake of parsimony, we used the dataset obtained without probabilistic matching.
We performed a few additional treatments on the raw dataset. SIM records that were not linked to a SINASC record were discarded. Moreover, we removed a few residual records with no birthdate, as well as records in which the date of death preceded the birthdate. After these treatments, the SIM–SINASC linkage had a success rate of 78.78%, whereas the linkage with CNES had a success rate of 96.83%. The resulting cleaned dataset comprises 8,797,968 births and 59,615 neonatal deaths.
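These cleaning rules reduce to simple row filters; a minimal pandas sketch with invented column names and dates:

```python
import pandas as pd

# Invented records: one valid, one with no birthdate, one where the
# recorded death precedes the recorded birth.
df = pd.DataFrame({
    "birth_date": pd.to_datetime(["2015-01-10", None, "2016-03-05"]),
    "death_date": pd.to_datetime(["2015-01-12", "2015-02-01", "2016-03-01"]),
})

# Keep records with a known birthdate and a non-negative interval
# between birth and death (death_date may be missing for survivors).
has_birthdate = df["birth_date"].notna()
interval = (df["death_date"] - df["birth_date"]).dt.days
valid_interval = df["death_date"].isna() | (interval >= 0)
clean = df[has_birthdate & valid_interval]
```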
Feature engineering
Our set of features [16] consists of the following variables: place of delivery, health facility type, maternal age at birth, sex, 1-min Apgar score, 5-min Apgar score, birth weight, gestational age, week of gestation, pregnancy type, delivery type, maternal education, presence of congenital anomaly, maternal ethnicity, antenatal visits, month of first antenatal visit, presentation type, induced labor, professional who assisted the labor, number of previous live births, number of previous fetal losses and abortions, number of previous pregnancies, number of previous vaginal deliveries, and number of previous cesarean deliveries. In addition, we also used marital status and state of birth (the definition and type of each of these features are given in Table 1).
We analyze a nominal categorical target variable with three possible outcomes: alive, preventable death [17], and non-preventable death. The non-preventable deaths comprise deaths from external causes and ill-defined deaths. The number of preventable deaths is 42,290, whereas the number of non-preventable deaths is 17,325.
For efficiency, categorical variables in the raw data are stored as numeric codes. We relabeled them following the data dictionaries issued by DataSUS, using the package microdatasus [18]. We treated missing data via imputation, applying the package Amelia [19]. Both packages are available in the R Statistical Software repository [20].
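Amelia performs EM-based multiple imputation in R; as a simpler stand-in, the sketch below illustrates single mean imputation with scikit-learn (the matrix is invented and not a substitute for the paper's procedure):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative feature matrix with missing entries.
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])

# Replace each missing value with its column mean.
imputer = SimpleImputer(strategy="mean")
X_imp = imputer.fit_transform(X)
```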
As pre-processing, we centered and scaled the data by subtracting the mean and dividing by the standard deviation of each feature. We also identified and excluded features with zero or near-zero variance. Finally, we filtered out highly correlated features. Details are available upon request.
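The three pre-processing steps can be sketched as follows (scikit-learn stands in for the R tooling; the data, thresholds, and column names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "weight": rng.normal(3000, 500, 200),
    "apgar5": rng.integers(0, 11, 200).astype(float),
    "constant": np.ones(200),          # zero variance -> dropped in step 1
})
X["weight_g"] = X["weight"] * 1.0      # perfect duplicate -> dropped in step 2

# 1) Drop zero/near-zero-variance features.
X = X.loc[:, X.var() > 1e-8]

# 2) Drop one of each highly correlated pair (|r| > 0.9).
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
X = X.drop(columns=drop)

# 3) Center and scale the remaining features.
X_scaled = StandardScaler().fit_transform(X)
```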
Modeling
The final dataset is partitioned into training and test sets: 7,038,375 observations (80.00% of the total) are used to train six different machine learning algorithms, while 1,759,593 observations (20.00% of the total) are used to evaluate the performance of our targeting criterion on new unseen data.
We estimated preventable neonatal mortality risk for each birth in the data set through flexible ML methods that use the above features. These methods were logistic regression, least absolute shrinkage and selection operator regression (LASSO), elastic-net regularized logistic regression (elastic net), random forest (RF), extreme gradient boosting over trees (XGBoost), and neural networks (NNs). We used the package caret available in the R Statistical Software [20] to run the machine learning algorithms.
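The paper trains its models with R's caret; purely for illustration, the Python sketch below mimics the workflow on synthetic data with two of the six learners (an 80/20 stratified split, then per-model predicted risks):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the features and the (rare) death indicator.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.05).astype(int)   # ~5% positives, illustrative

# 80/20 split as in the paper; stratify to preserve the rare-outcome share.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "logit": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
}
# Each model yields a predicted mortality risk per test-set birth.
risk = {name: m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
        for name, m in models.items()}
```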
Logistic regression [21] is the standard estimation of a linear model: it estimates the parameters \(\beta _{j}\) for each feature j by minimizing the negative log-likelihood of a logistic model. LASSO [22] is essentially an implementation of logistic regression that uses an \(L_{1}\) (\(\sum _j |\beta _j|\)) norm penalty to regularize or “shrink” the model, preventing overfitting. It is similar to the logistic regression but includes a penalty term equal to \(\lambda \sum _j |\beta _j|\), where the parameter \(\lambda\) is a non-negative real number that determines the strength of the regularization.
Elastic net [23] combines \(L_{1}\) norm (\(\sum _j |\beta _j|\)) and \(L_{2}\) norm (\(\sum _j \beta _j^2\)) penalties to regularize the model. It minimizes the negative log-likelihood plus a penalty term equal to \(\lambda (\alpha \sum _j |\beta _j| + \frac{1-\alpha }{2} \sum _j \beta _j^2)\), where the parameters \(\alpha\) and \(\lambda\) are defined on the unit interval and on the non-negative real numbers, respectively. As particular cases, elastic net comprises LASSO regression (\(\alpha =1\)) and logistic regression (\(\lambda =0\)).
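For illustration, these special cases can be reproduced with scikit-learn's elastic-net logistic regression, where l1_ratio plays the role of \(\alpha\) and 1/C that of \(\lambda\) (synthetic data; this is not the glmnet implementation used in the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)

# Elastic net: mixes the L1 and L2 penalties via l1_ratio (alpha).
enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, C=1.0, max_iter=5000)
# l1_ratio=1 recovers the LASSO; a very large C (lambda -> 0)
# approaches the unpenalized logistic regression.
lasso = LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=1.0, C=1.0, max_iter=5000)
enet.fit(X, y)
lasso.fit(X, y)
```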
Our application first tested a cross-validation procedure to choose the parameters \(\alpha\) and \(\lambda\) in the elastic net and the parameter \(\lambda\) in LASSO. However, the cross-validated specifications performed no better than the logistic regression. We therefore fixed \(\alpha =0.5\) and \(\lambda =0.001\) in the elastic net, and \(\lambda =0.001\) in LASSO. The glmnet method was used for all three algorithms.
The methods RF [24] and XGBoost [25] are tree-based algorithms. The simplest tree-based algorithms are classification and regression trees (CART [26]). These single-tree models recursively group the outcome observations with similar values using cutoff values of the features. Although single-tree models are easy to interpret, their performance is frequently poor and very sensitive to small changes in the input data. By combining several trees, RF and XGBoost improve on the performance of single-tree algorithms. The former averages the estimates of a set of trees, each obtained from a random subset of features and trained on a random subset of the observations. The latter also combines several trees, but it starts with a single tree, and new trees are iteratively trained on the errors of the prior set of trees.
In our application of RF, the ranger method was employed, and (i) each forest encompasses 500 trees, (ii) the number of variables randomly sampled at each tree split (mtry) was set to 5 (the square root of the number of features), (iii) the minimal node size (min.node.size) was set to 1, and (iv) we chose the Gini index as the splitting rule (splitrule). In XGBoost, the xgbTree method was employed, and (i) the number of boosting iterations (nrounds) was set to 250, (ii) the learning rate (\(\eta \in (0,1)\)) was set to 0.3 to prevent overfitting, (iii) the maximum depth of the trees (max_depth) was set to 4, (iv) the proportion of variables considered for tree construction (colsample_bytree) was set within the interval (0.6, 1), and (v) the proportion of training-set observations used for modeling (subsample) was set within the interval (0.5, 1).
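A rough Python analogue of these settings (scikit-learn's GradientBoostingClassifier stands in for XGBoost and RandomForestClassifier for ranger; the data are synthetic and the hyperparameter mapping is approximate):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 25))
y = (X[:, 0] > 0.5).astype(int)

# Random forest with settings analogous to the ranger configuration above:
# 500 trees, sqrt(p) features per split, minimum node size 1, Gini splits.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            min_samples_leaf=1, criterion="gini",
                            random_state=0).fit(X, y)

# Gradient boosting with settings analogous to the xgbTree configuration:
# 250 rounds, learning rate 0.3, tree depth 4, 80% row subsampling.
gb = GradientBoostingClassifier(n_estimators=250, learning_rate=0.3,
                                max_depth=4, subsample=0.8,
                                random_state=0).fit(X, y)
```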
NNs [27] consist of layers of nodes: an input layer, one or more hidden layers, and an output layer. The input layer receives the features, and no processing is done there; all processing is executed in the hidden layers and transferred to the output layer. The output layer, in turn, is the final layer, producing the final value resulting from the learning process in the hidden layers. The nodes, also known as artificial neurons, are connected, and these connections are characterized by their weights, thresholds, and activation functions. A node is activated, sending data to the next layer of the network, when its output exceeds a specified threshold value; otherwise, no data are transmitted to the next layer.
Although we tested specifications with more than one hidden layer using the mlpML method, they performed similarly to the neural network with only one hidden layer. Thus, our application employed the mlp method, specifying a single hidden layer with 25 nodes.
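A single-hidden-layer network with 25 nodes can be sketched with scikit-learn's MLPClassifier (an illustrative analogue of the mlp method, on synthetic data):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# One hidden layer of 25 nodes, mirroring the specification in the text.
nn = MLPClassifier(hidden_layer_sizes=(25,), max_iter=2000,
                   random_state=0).fit(X, y)
risk = nn.predict_proba(X)[:, 1]
```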
Performance metrics
For our task, we did not find it useful to adopt traditional prediction performance metrics, such as classification accuracy, confusion matrices, specificity/sensitivity statistics, or precision/recall statistics, all of which require a threshold for deciding when a risk score is high enough to merit a warning. These can be misleading when applied to rare outcomes, as in the problem we focus on: a model that predicted no neonatal mortality at all would be right more than 99% of the time, yet it would be useless, as it would not identify anyone to be targeted. Nor did we find it useful to adopt “threshold-free” approaches, such as ROC-AUC and F-scores, that report accuracy in a way that does not depend on choosing one threshold, because they are difficult to endow with any meaningful policy interpretation in our context.
Instead, we recognize that a resource constraint (only a certain fraction of cases can be acted on) suggests a natural evaluation: compute the proportion of deaths captured among the births above a threshold level of predicted mortality risk. For example, suppose a policymaker can only provide an intervention to the 5% (or 10%) who need it the most. The threshold can then be set to whatever fraction of high-risk births they have the resources to target. A good targeting criterion, therefore, is one that concentrates a substantial share of neonatal deaths in small percentages of high-risk individuals.
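The metric described here, the share of deaths captured among the top fraction of predicted risk, is straightforward to compute; a minimal sketch with invented scores:

```python
import numpy as np

def capture_rate(y_true, risk, top_frac):
    """Share of all deaths falling among the top_frac highest-risk births."""
    n_top = int(np.ceil(top_frac * len(risk)))
    top = np.argsort(risk)[::-1][:n_top]
    return y_true[top].sum() / y_true.sum()

# Illustrative: a score that ranks ten observations perfectly,
# with the two deaths assigned the two highest scores.
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
risk = np.arange(10) / 10.0
```

Here targeting the top 20% captures both deaths (capture rate 1.0), while the top 10% captures one of the two (0.5).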
Algorithmic bias
Algorithmic bias [10,11,12,13,14] is a well-documented problem with striking implications for health care and public policy. Therefore, besides concentrating a substantial share of preventable neonatal deaths in small percentages of high-risk individuals, our targeting criterion should also avoid disadvantaging the most vulnerable groups.
To check that our preferred model does not disadvantage the most vulnerable populations, we examined its performance for four sub-groups identified using the demographic variables in our dataset: newborns of non-white mothers, low-education mothers, underage mothers, and single mothers. We used the test sample as a reference and compared its composition with that of the individuals with the highest predicted risk of preventable neonatal death at different threshold levels. To do so, we constructed confidence intervals, based on the Normal approximation, for the mortality rate of each group and checked whether these intervals contain the corresponding mortality rates in the test sample. We also performed hypothesis tests to verify whether the proportions of preventable deaths captured by the algorithm (\(\hat{p}\)) are statistically equal to the proportion of preventable deaths in the test sample (\(p_{0}\)). The null hypothesis is \(H_{0}:\hat{p}=p_{0}\) and the alternative hypothesis, \(H_{A}:\hat{p} \ne p_{0}\).
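The Normal-approximation interval and the two-sided proportion test can be written out directly (a sketch using the standard textbook formulas, with \(z=1.96\) for a 95% interval):

```python
import math

def prop_ci(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

def prop_ztest(p_hat, n, p0):
    """Two-sided z-test of H0: p = p0 against HA: p != p0."""
    se = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se
    # p-value = 2 * (1 - Phi(|z|)), with Phi the standard Normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```

For example, with 50 captured deaths out of 100 (\(\hat{p}=0.5\)), the 95% interval is roughly (0.402, 0.598), and testing against \(p_{0}=0.5\) gives \(z=0\) and a p-value of 1.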

