Screening autism spectrum disorder in children using machine learning on speech transcripts
Predictive performance
To evaluate our models, we applied 5-fold stratified cross-validation on two datasets from ASD TalkBank. The Nadig dataset had a total of 38 participants, primarily children with ASD and typically developing (TD) children. SMOTE was applied only to the Nadig dataset because of the imbalance between its classes (ASD vs. TD). The Eigsti dataset is more comprehensive and also includes data from children with Delayed Development (DD). This is important to consider, since ASD and DD characteristics often overlap, which can lead to misdiagnosis. A summary of the number of participants and their characteristics is provided in Table 2.
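The resampling step in this protocol matters: oversampling must happen inside each training fold, never before the split, or synthetic minority samples leak into validation. The sketch below illustrates this with toy data sized like the Nadig dataset and a simplified SMOTE-style interpolation (real SMOTE uses k-nearest neighbors; the features, labels, and class proportions here are placeholders, not the study's data).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Toy stand-in for the Nadig data: 38 participants, imbalanced labels.
X = rng.normal(size=(38, 6))
y = np.array([1] * 26 + [0] * 12)  # 1 = ASD, 0 = TD (illustrative)
X[y == 1] += 0.8                   # give the classes some separation

def smote_like(X_min, n_new, rng):
    """Minimal SMOTE-style oversampling: interpolate between random
    pairs of minority samples (simplified; real SMOTE uses k-NN)."""
    i = rng.integers(0, len(X_min), n_new)
    j = rng.integers(0, len(X_min), n_new)
    lam = rng.random((n_new, 1))
    return X_min[i] + lam * (X_min[j] - X_min[i])

aucs = []
for tr, va in StratifiedKFold(5, shuffle=True, random_state=0).split(X, y):
    X_tr, y_tr = X[tr], y[tr]
    # Oversample the minority class in the training fold only.
    minority = int(y_tr.sum() < len(y_tr) / 2)
    n_new = abs(int((y_tr == 1 - minority).sum() - (y_tr == minority).sum()))
    X_new = smote_like(X_tr[y_tr == minority], n_new, rng)
    X_tr = np.vstack([X_tr, X_new])
    y_tr = np.concatenate([y_tr, np.full(n_new, minority)])
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    aucs.append(roc_auc_score(y[va], clf.predict_proba(X[va])[:, 1]))
print(round(float(np.mean(aucs)), 3))
```

The validation folds are left untouched, so each fold's ROC-AUC reflects performance on real (unsynthesized) samples only.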
We used 6 different metrics to evaluate the performance of our models and establish their effectiveness. Precision and Recall are highly relevant in medical evaluation and are consolidated into the F1 score to make performance comparison easier. Together with the Receiver Operating Characteristic – Area Under the Curve (ROC-AUC) score, these metrics are also widely used to evaluate skewed datasets. Accuracy is a general metric used to evaluate classifier performance. P-values are also used in medical research to determine whether the sample estimate significantly differs from a hypothesized value. Below is a more detailed description of the metrics we used, along with the formulas to compute them.
Precision: The ratio of correct positive predictions to all positive predictions made by the model. High precision ensures that when the model predicts a child has ASD, it is likely to be correct36.
$$\begin{aligned} Precision = \frac{TP}{TP + FP} \end{aligned}$$
Recall: Measures the proportion of true positives that were identified correctly. Having a high recall for the model is important since we want to ensure that ASD kids are identified accurately, minimizing the risk of undiagnosed cases and enabling timely interventions36.
$$\begin{aligned} Recall = \frac{TP}{TP + FN} \end{aligned}$$
F1 Score: The F1 score balances Precision and Recall to evaluate classification performance, particularly in imbalanced datasets, by accounting for both false positives and false negatives.
$$\begin{aligned} F1\ Score = \frac{2 \times Precision \times Recall}{Precision + Recall} \end{aligned}$$
Accuracy: Measures the total number of model outcomes that were predicted correctly36.
$$\begin{aligned} Accuracy = \frac{TP + TN}{ TP + FN + TN + FP} \end{aligned}$$
P-Value: Represents the probability that the observed performance of the model (such as its accuracy or ROC-AUC) would have occurred purely by chance, assuming there is no true effect (i.e., the model is performing at chance level). A p-value below conventional thresholds (e.g., p < 0.05 or p < 0.01) indicates that the model’s performance is statistically significant, meaning it is very unlikely that the observed results are due to random variation alone31.
Area under the Receiver Operating Characteristic Curve (ROC-AUC): ROC is a probability curve and AUC represents the degree or measure of separability. It indicates how much the model is capable of distinguishing between classes. The higher the AUC, the better the model is at distinguishing between patients with ASD and no ASD37,38,39.
We used the following definitions to compute the evaluation metrics:
- True Positives (TP): The number of children with ASD correctly predicted by the model.
- False Positives (FP): The number of children without ASD incorrectly predicted by the model to have ASD.
- True Negatives (TN): The number of children without ASD correctly predicted by the model.
- False Negatives (FN): The number of children with ASD that the model fails to identify.
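The formulas above can be checked with a small worked example. The confusion-matrix counts below are illustrative only, not values from the study:

```python
# Illustrative confusion-matrix counts (not from the study).
TP, FP, TN, FN = 20, 3, 12, 3

precision = TP / (TP + FP)                     # TP / (TP + FP)
recall = TP / (TP + FN)                        # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / (TP + FN + TN + FP)

print(round(precision, 3), round(recall, 3),
      round(f1, 3), round(accuracy, 3))
```

With these counts, precision and recall coincide (both 20/23), so the F1 score equals them as well, while accuracy is 32/38.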
Results from our experiments for each of the classifiers using the 6 performance metrics are captured in Table 3. Of particular significance is the consistency of results achieved for each of these models across the various performance measures.
Logistic regression achieved strong performance on the smaller binary datasets (Nadig and Eigsti), with ROC-AUC scores of 0.93 and 0.87, respectively, suggesting that the model was effective at distinguishing between ASD and TD cases in these datasets. When the datasets were merged, TabNet outperformed the other models, achieving a ROC-AUC score of approximately 0.96, which aligns with the tendency of deep learning models to benefit from larger datasets. For the multi-class Eigsti dataset, Random Forest attained the highest ROC-AUC score of 0.71, indicating its ability to capture more nuanced relationships between ASD, TD, and DD classes despite the smaller sample size per class. As shown in Table 3, the p-values for all models are less than conventional thresholds (p < 0.05), indicating the reliability and statistical significance of the observed model results.
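The paper does not state which statistical test produced the p-values in Table 3; one common choice for cross-validated classifiers is a label-permutation test, which estimates how often a score as high as the observed one arises when labels are shuffled. The sketch below shows scikit-learn's version on toy data (all data and parameters are placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, permutation_test_score

# Toy binary dataset; the study's features are not reproduced here.
X, y = make_classification(n_samples=86, n_features=8, random_state=3)

# Score the real labels, then 200 label-shuffled versions; the p-value
# is the fraction of permuted scores that reach the observed score.
score, perm_scores, pvalue = permutation_test_score(
    LogisticRegression(max_iter=1000), X, y,
    cv=StratifiedKFold(5, shuffle=True, random_state=3),
    scoring="roc_auc", n_permutations=200, random_state=3)
print(round(float(score), 3), round(float(pvalue), 3))
```

A small p-value here means the cross-validated ROC-AUC is unlikely to be an artifact of chance label structure, which is the interpretation given for the p-values above.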
Given the limited sizes of the Eigsti (48 participants) and Nadig (38 participants) datasets, there is a potential risk of overfitting, which could limit the generalizability of our models. However, when we compared training accuracy to validation accuracy, we found a difference of no more than 4%, indicating that our models are not overfitting.
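This train-versus-validation gap check can be computed directly during cross-validation. A minimal sketch, assuming toy data in place of the study's features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Toy dataset sized like Eigsti (48 participants); features are placeholders.
X, y = make_classification(n_samples=48, n_features=8, random_state=1)

# return_train_score=True gives per-fold training accuracy alongside
# the usual validation accuracy, so the overfitting gap is one subtraction.
res = cross_validate(LogisticRegression(max_iter=1000), X, y,
                     cv=StratifiedKFold(5, shuffle=True, random_state=1),
                     scoring="accuracy", return_train_score=True)
gap = res["train_score"].mean() - res["test_score"].mean()
print(round(float(gap), 3))
```

A gap near zero (here, the paper reports at most 4 percentage points) suggests the model is not simply memorizing the training folds.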
Feature importance
Given Logistic Regression’s strong performance across both individual and merged datasets, we leveraged its interpretability to analyze the relationship between key features and ASD classification. During the 5-fold cross-validation process on the merged dataset, we recorded the feature coefficients obtained from each fold and stored them in an accumulator. After completing all folds, we computed the average of these coefficients to obtain a final estimate. By examining the coefficients with the largest magnitudes, we identified the most influential features and their potential association with ASD.
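The accumulate-and-average procedure described above can be sketched as follows. The data and the four feature names are illustrative stand-ins (the study's merged dataset and full feature set are not reproduced here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Illustrative feature names; the study used a larger feature set.
feature_names = ["MLU", "MLT_ratio", "age", "sex"]
X, y = make_classification(n_samples=86, n_features=4, n_informative=4,
                           n_redundant=0, random_state=2)

# Accumulate the coefficient vector from each training fold.
coef_sum = np.zeros(X.shape[1])
cv = StratifiedKFold(5, shuffle=True, random_state=2)
for tr, _ in cv.split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
    coef_sum += clf.coef_[0]
avg_coef = coef_sum / cv.get_n_splits()

# Rank features by the magnitude of the averaged coefficient.
for name, c in sorted(zip(feature_names, avg_coef), key=lambda t: -abs(t[1])):
    print(f"{name}: {c:+.3f}")
```

Averaging across folds smooths out fold-to-fold variation in the coefficients, and the sign of each averaged coefficient indicates the direction of its association with the ASD class.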
We analyzed the top features that were identified by the Logistic Regression model and observed that while certain parts of speech (POS) features were among the highest-ranking, they were excluded to minimize the overall feature set. POS features contribute a large number of variables, and their removal did not lead to a significant drop in performance. To further validate the importance of the remaining top features, we trained a model using only the top four non-POS features and achieved nearly the same accuracy of 86%, reinforcing their relevance in distinguishing ASD cases. These features are:
- MLU (Mean Length of Utterance): Measures the average number of morphemes per utterance spoken by the child. A morpheme is the smallest meaningful unit in a language (e.g., “in,” “come,” “-ing,” forming “incoming”)40,41. A lower MLU is associated with a higher likelihood of ASD, indicating that children with ASD may use shorter and simpler utterances.
- MLT Ratio (Mean Length of Turn Ratio): Represents the ratio of the mean length of a child’s turn to that of the mother or investigator. A lower MLT ratio is correlated with ASD, suggesting that children with ASD contribute less in conversational exchanges than their counterparts.
- Age: A positive coefficient suggests that older children are more likely to exhibit characteristics associated with ASD.
- Sex: A positive coefficient for this feature indicates that male children are more likely to be classified with ASD, consistent with broader epidemiological trends.
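The two conversational features above can be illustrated on a made-up exchange. The sketch below approximates morpheme counts with whitespace tokens and treats each utterance as one turn, both simplifications: real CHAT transcripts mark morphemes explicitly (e.g., "go-ing"), and MLT is defined over turns, which may span several utterances.

```python
# Made-up child/adult utterances; each utterance is treated as one turn.
child_utts = ["want ball", "ball go-ing up", "more"]
adult_utts = ["do you want the ball", "yes the ball is going up high"]

def mlu(utterances):
    """Mean length of utterance in (approximate) morphemes, counting
    hyphen-separated bound morphemes such as 'go-ing' as two units."""
    counts = [len(u.replace("-", " ").split()) for u in utterances]
    return sum(counts) / len(counts)

child_mlu = mlu(child_utts)              # average morphemes per child utterance
mlt_ratio = child_mlu / mlu(adult_utts)  # child length relative to adult length
print(round(child_mlu, 2), round(mlt_ratio, 2))
```

Here the child averages 7/3 ≈ 2.33 morphemes per utterance against the adult's 6, giving a ratio of about 0.39; lower values of both quantities are the direction the model associates with ASD.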

Fig. 1. Feature Importance in ASD Prediction: Averaged Logistic Regression Coefficients from 5-Fold Stratified Cross-Validation on the Merged Dataset. This figure was generated using Python’s Matplotlib library and visualizes the relative contributions of key features (e.g., Mean Length of Utterance, Mean Length of Turn Ratio, Age, Sex) to ASD classification.
As shown in Fig. 1, our analysis of these features suggests that a decrease in the child’s MLU and MLT_ratio is indicative of ASD. The importance of MLU and MLT_ratio aligns with prior research showing that children with ASD tend to produce shorter and less reciprocal conversational patterns. In addition, older children and male children were more likely to be classified as ASD, consistent with broader epidemiological trends.
