A scoping review of self-supervised representation learning for clinical decision making using EHR categorical data

This section provides a comprehensive overview of the findings from our scoping review, organized around subsections that emerged during our analysis. We begin by outlining the characteristics of the included studies and the types of data utilized. Next, we examine the studies from a technical perspective, covering data preprocessing techniques, SSRL model types, SSRL model comparison, models for downstream tasks, evaluation metrics, and interpretability techniques. Finally, we analyze the studies from a clinical perspective, focusing on the fields of clinical application, clinical downstream tasks, and the involvement of medical experts. Table 1 summarizes the key features of the technical aspects, and Table 2 provides essential information on the studies from the medical perspective.
Study characteristics
As illustrated in Fig. 2, most of the research (n = 33, 72%) was conducted by interdisciplinary teams of medical experts and data scientists. The United States led in the number of published studies (n = 21, 46%), followed by China (n = 9, 20%) and the United Kingdom (n = 4, 9%). Despite this geographic diversity, only a few studies (n = 11, 24%) involved international collaborations. For details on the authors and research teams, refer to Supplementary Data 2.

Fig. 2: a Composition of authors, categorized into two groups: those specializing in data science only, and those with expertise in both data science and medical fields. b Annual distribution of published studies from 2019 to 2024, categorized by continent.
Type of model and trend
Five main model types have been identified for representing EHR categorical data: Transformer-based models (n = 20, 43%), Autoencoder (AE) based models (n = 13, 28%), Graph Neural Network (GNN) based models (n = 8, 17%), Word-embedding models (n = 3, 7%), and Recurrent Neural Network (RNN) based models (n = 3, 7%). Studies that combine two or more model types are counted once for each corresponding model type. To assess their impact on research, we analyzed the number of citations for each model type.
Figure 3 shows the papers published from January 2019 to December 2023, their citation counts by July 2024, and their corresponding model types. Based on the number of citations, Transformers, RNN, and GNN models are the most impactful, with Transformer models showing particularly high citation counts for papers published from 2020 to 2023.

Fig. 3: Each data point represents a paper, labeled with its corresponding reference, and is color-coded by the model type used: Transformer, AE, GNN, Word-embedding, RNN, and others. Papers published in 2024 are not shown.
Type of data
Studies utilize various data types to represent patients and medical knowledge. Typically, patient representation is derived from EHRs, incorporating both categorical and non-categorical data. Additionally, external medical knowledge can be integrated into models through data collected beyond EHRs. For detailed information on the modalities used across studies, see Supplementary Data 3.
Among the categorical data types in EHRs, diagnosis codes are the most frequently used (n = 45, 98%), including ICD-9, ICD-10-CM, and SNOMED-CT. Medication codes (n = 32, 70%), such as ATC and SNOMED-CT, and procedure codes (n = 20, 43%), such as CPT and ICD-10-PCS, are also widely used. To enhance patient representation, non-categorical data may also be included. The most common non-categorical data types are patient age (n = 19, 41%), clinical measurement values (n = 15, 33%) such as BMI, heart rate, and systolic blood pressure, and clinical narratives from physicians and practitioners (n = 7, 15%).
The integration of external data sources can further enrich patient profiles. Medical knowledge graphs and ontologies provide rich hierarchical information, while medical text corpora contain expert medical knowledge. These external sources offer a more comprehensive understanding of clinical concept interactions. Among external data sources, ontologies are the most frequently used (n = 7, 15%); they are employed to obtain medical concept embeddings22,23,24,25,26,27,28 and as an SSRL training task23. Other notable external data sources include medical knowledge graphs25,29 and medical text corpora30.
Data preprocessing
Most models treat each data element as a distinct unit or token (n = 44, 95%). The identified data preprocessing techniques address various aspects such as numerical data, categorical data, data cleaning, and data shuffling. Some studies (n = 7, 15%) performed categorization by converting exact ages into intervals and clinical measurements into categories like high, normal, and low, based on clinical evaluation standards31,32,33,34,35,36,37. When maintaining the numerical nature of data, missing value imputation30,38,39 and value normalization31,39,40,41 have also been employed.
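As a concrete illustration, the categorization step can be as simple as binning. The minimal Python sketch below assumes a decade-wide age interval and hypothetical systolic blood pressure cut-offs; neither is taken from the reviewed studies.

```python
def age_to_interval(age: int, width: int = 10) -> str:
    """Map an exact age to a decade-interval token, e.g. 47 -> 'AGE_40_49'."""
    lo = (age // width) * width
    return f"AGE_{lo}_{lo + width - 1}"

def sbp_to_category(sbp_mmhg: float) -> str:
    """Map systolic blood pressure to a coarse category token (hypothetical cut-offs)."""
    if sbp_mmhg < 90:
        return "SBP_LOW"
    if sbp_mmhg <= 120:
        return "SBP_NORMAL"
    return "SBP_HIGH"

print(age_to_interval(47), sbp_to_category(135.0))  # AGE_40_49 SBP_HIGH
```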
Some studies standardize data elements by mapping them to known ontologies23,36,42,43. A common approach to reduce dimensionality and data sparsity is using only the first digits of codes, effectively replacing them with parent nodes in the hierarchical ontology (n = 15, 33%).
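The truncation step amounts to keeping a code's leading characters, which corresponds to replacing it with its parent category in the ICD hierarchy. A minimal sketch, assuming ICD-9 codes with the decimal point removed:

```python
def truncate_icd9(code: str) -> str:
    """Keep only the first three characters of an ICD-9 code, e.g. '428.0' -> '428'."""
    return code.replace(".", "")[:3]

visits = [["4280", "25000", "4019"], ["V4581", "42731"]]
truncated = [[truncate_icd9(c) for c in visit] for visit in visits]
print(truncated)  # [['428', '250', '401'], ['V45', '427']]
```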
In terms of data cleaning, typical practices include the removal of rare medical terms14,32,37,42,44,45 and the elimination of duplicated terms within a specific time range22,42,46,47. Additionally, shuffling the order of medical concepts within a time window33,47 was shown to help the model generalize better by mitigating the impact of arbitrary sequencing and emphasizing co-occurrence over specific order. This method can also be considered a form of data augmentation. Detailed information on data preprocessing across studies can be found in Supplementary Data 4.
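The shuffling augmentation can be expressed compactly: concepts are permuted within each visit (time window) while the visit order itself is preserved. A minimal sketch with toy codes:

```python
import random

def shuffle_within_visits(patient, rng=random.Random(0)):
    """patient: list of visits, each visit a list of code tokens.

    Reorders codes within each visit; the visit sequence is untouched.
    """
    return [rng.sample(visit, k=len(visit)) for visit in patient]

patient = [["I10", "E11", "N18"], ["I50", "J44"]]
print(shuffle_within_visits(patient))
```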
Self-supervised learning models
There are two primary self-supervised learning training strategies: generative and contrastive. Generative tasks involve models predicting parts of the data from other parts, which may be incomplete, transformed, masked, or corrupted. These tasks, such as autoregressive prediction and masked modeling, help the model learn to recover whole or partial features of its original input17,48. Contrastive tasks, on the other hand, focus on distinguishing between similar and dissimilar data points, helping the model capture discriminative features that are essential for understanding different types of data48. Both task types are crucial for training models to generate rich, generalized representations from unlabeled data48,49, and they are applied across various model architectures. The objective of these models is to capture essential patterns and features in the data and to output the learned representation, which is typically a fixed-length, high-dimensional vector that condenses large amounts of information. Five major architecture types have been identified in the studies, each trained on unlabeled data with different training tasks. Details of the SSRL models used and the temporality monitored in each study are provided in Supplementary Data 5.
Transformer-based models are among the most impactful model types in the studies. In the medical domain, most transformer-based models treat patients as documents, visits as sentences, and medical concepts as tokens, capturing detailed patient histories. BERT50 is a transformer encoder-only model that effectively learns data representations by processing and contextualizing complex sequences of information. BERT models can be trained using various techniques, such as the Masked Language Model (MLM) alone, predicting randomly masked medical concepts in each EHR sequence34,43,44,51,52,53 to enhance contextual understanding. Training with both MLM and auxiliary tasks13,22,39,54,55,56 further refines the model's representations by guiding it with specific medical insights. Additionally, self-contrastive learning techniques help improve BERT's robustness and accuracy in capturing meaningful patterns in medical data30,35. Other transformer-based training tasks include next-visit code prediction23,36,45,57, medical code category prediction23, medication-diagnosis cross prediction26, and token replacement detection (ELECTRA)58.
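To make the MLM objective concrete, the sketch below shows only the masking step on a toy sequence of medical codes: a fraction of ordinary tokens is replaced by [MASK] and becomes the prediction target, while special tokens are left intact. The masking rate follows BERT's convention; the codes are illustrative.

```python
import random

def mask_sequence(tokens, mask_rate=0.15, rng=random.Random(0)):
    """Return (model inputs, MLM targets) for one EHR code sequence."""
    inputs, labels = [], []
    for tok in tokens:
        if not tok.startswith("[") and rng.random() < mask_rate:
            inputs.append("[MASK]")
            labels.append(tok)      # the model must recover the original code
        else:
            inputs.append(tok)
            labels.append(None)     # position not scored by the MLM loss
    return inputs, labels

seq = ["[CLS]", "I10", "E11", "[SEP]", "I50", "J44", "N18"]
print(mask_sequence(seq, mask_rate=0.3))
```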
AE-based models are encoder-decoder models that aim to reconstruct the input, enabling the learning of data representations in a compressed, lower-dimensional space. AEs are designed to learn the most salient features of the data, which can be particularly useful for capturing the underlying structure of categorical EHR data. Various AE variants were applied in the studies: Stacked Autoencoders32,59, Denoising Autoencoders60, and Autoencoders with RNN units such as GRU31 and LSTM38,41,61,62,63. Additionally, AEs can be combined with other models such as collective matrix factorization29, CNNs42, and clustering algorithms27,64.
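The sketch below is a minimal PyTorch denoising autoencoder over multi-hot code vectors (one input dimension per medical code); the layer sizes and the 20% input-dropping noise are illustrative choices, not settings reported in the studies.

```python
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    def __init__(self, n_codes=2000, dim=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_codes, 512), nn.ReLU(), nn.Linear(512, dim))
        self.dec = nn.Sequential(nn.Linear(dim, 512), nn.ReLU(), nn.Linear(512, n_codes))

    def forward(self, x):
        noisy = x * (torch.rand_like(x) > 0.2)   # randomly drop ~20% of codes
        z = self.enc(noisy)                      # fixed-length patient representation
        return self.dec(z), z

x = (torch.rand(32, 2000) < 0.01).float()        # toy multi-hot batch
recon, z = DenoisingAE()(x)
loss = nn.functional.binary_cross_entropy_with_logits(recon, x)
loss.backward()
```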
GNN-based models use graph learning to represent medical ontologies, hospital visits, and disease co-occurrence. Nodes represent medical concepts and personal entities, linked by edges indicating their relationships. Graph attention models were used to learn medical concept embeddings within medical ontologies22,26, with these embeddings frequently serving as initializations for further model training. A random-walk technique was used to embed doctors according to their specialty65. Graph contrastive learning25,28 generates multiple views of augmented hospital visit graphs by modifying the original graph with node or edge perturbations, allowing the model to learn robust representations by contrasting positive pairs against negative pairs. These approaches ensure that the learned embeddings accurately reflect the complex relationships inherent in medical data49.
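As a rough illustration of the augmentation step in graph contrastive learning, the sketch below builds two views of a toy visit graph by random edge dropping; the graph encoder (e.g., a graph attention network) and the contrastive loss that pulls the two views together are omitted.

```python
import random

def drop_edges(edges, p=0.2, rng=random.Random(0)):
    """Create one augmented view of a graph by removing each edge with probability p."""
    return [e for e in edges if rng.random() > p]

edges = [("visit_1", "I10"), ("visit_1", "E11"), ("visit_2", "I50"), ("visit_2", "J44")]
view_a = drop_edges(edges, rng=random.Random(0))  # positive pair: two perturbed
view_b = drop_edges(edges, rng=random.Random(1))  # views of the same graph
print(view_a, view_b, sep="\n")
```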
Word-embedding-based models convert words into numerical vectors, allowing computers to capture their meanings and relationships from their context in a sequence of words. The model learns to map each word or concept to a dense vector representation, capturing semantic similarities based on co-occurrence patterns. Patient EHR data, composed of a sequence of medical concepts ordered by time, are used to train the representation model to predict medical concepts from their surrounding context, helping the model understand relationships between concepts. Various algorithms were identified, such as GloVe46, Word2vec33,46,47, and FastText46.
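A minimal sketch of this setup using gensim's Word2Vec, treating each patient's time-ordered code sequence as a "sentence" (skip-gram variant); the codes and hyperparameters are illustrative:

```python
from gensim.models import Word2Vec

patients = [
    ["I10", "E11", "I50", "N18"],
    ["J44", "I10", "I50"],
    ["E11", "N18", "I10"],
]
model = Word2Vec(patients, vector_size=64, window=5, min_count=1, sg=1, epochs=50)
print(model.wv.most_similar("I10", topn=2))  # codes that co-occur with I10
```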
RNN-based models are designed to capture temporal dependencies in sequential data, making them well-suited for tasks involving time-series EHR data. These models are trained with the objective of predicting future medical events based on a patient's historical data. Studies14,36,37 use a specific type of RNN, the GRU. The models were trained to predict the set of medical codes of day t based on the medical codes of previous days. To strengthen temporal modeling, these studies also included time-gap information in the input.
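The sketch below shows this objective in PyTorch under simplifying assumptions: days are encoded as multi-hot code vectors, a normalized time-gap scalar is appended to each day's input, and the model predicts the code set of the final day.

```python
import torch
import torch.nn as nn

n_codes, dim = 500, 64
gru = nn.GRU(input_size=n_codes + 1, hidden_size=dim, batch_first=True)
head = nn.Linear(dim, n_codes)

days = (torch.rand(8, 10, n_codes) < 0.02).float()  # 8 patients, 10 days of codes
gaps = torch.rand(8, 10, 1)                         # normalized time gaps per day
x = torch.cat([days, gaps], dim=-1)

out, _ = gru(x[:, :-1])                             # encode days 1..t-1
logits = head(out[:, -1])                           # predict the code set of day t
loss = nn.functional.binary_cross_entropy_with_logits(logits, days[:, -1])
loss.backward()
```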
SSRL model comparison
Different self-supervised representation learning models offer unique advantages and face specific limitations. The choice of models depends on several factors, including the size of the available dataset, the importance of temporal modeling for the downstream tasks, and the computational resources available at the institution.
AEs excel at dimensionality reduction66 and are well-suited to relatively moderate datasets (average size: 166k in the included studies). However, they struggle with high-sparsity data67 and cannot inherently model temporal dependencies without incorporating sequential components, such as RNNs, CNNs, and Transformers.
Word embedding models are designed to map medical concepts or tokens into dense vector spaces that capture contextual information and syntactic relationships in the data68. They perform well with a moderate dataset (average size: 139k in the included studies). However, traditional word embeddings are static and fail to account for the temporality or the sequential order of the input data, necessitating their integration with sequential components.
GNNs perform well with small to moderate datasets (average size: 55k in the included studies) and are particularly effective at representing relational data, such as knowledge graphs, patient networks, and ontologies69. They offer strong interpretability by visualizing relational data, aligning with clinical knowledge. However, GNNs alone cannot fully address temporal dependencies, necessitating their integration with sequential components.
RNNs70 are well-suited for larger datasets (average size: 1.8 M in the included studies) and excel at capturing temporal patterns in sequential data. However, their training process is not parallelizable, leading to time inefficiencies71.
Transformers dominate SSRL research due to their ability to simultaneously capture long-range dependencies and temporal patterns72, offering scalability for large datasets and robust performance across diverse tasks. However, training these models from scratch necessitates substantial amounts of data (average size: 3 M in the included studies), and their high computational cost and complexity can pose significant challenges for deployment in resource-limited settings73.
Downstream task models
Predictive models for classification use the trained SSRL model as their backbone, to which a task-specific classification head is added. These predictive models require labeled data for training on specific tasks. Among the articles that mention the predictive models used for classification tasks, several model types were identified. These models are predominantly characterized by simple architectures that are easy to train. Some studies employ shallow models such as a linear layer23,39,44,57, logistic regression (LR) (n = 8, 17%), and support vector machines (SVM)31,74. Models that can capture more complex data patterns, such as feedforward neural networks (n = 12, 26%) and RNNs13,40,54,55,62,65 (n = 6, 13%), are also applied.
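A minimal sketch of this "frozen backbone, simple head" setup, using random stand-ins for the SSRL patient embeddings and scikit-learn's logistic regression:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

Z = np.random.randn(1000, 128)        # stand-in SSRL patient representations
y = np.random.randint(0, 2, 1000)     # task labels (e.g., readmission yes/no)

Z_tr, Z_te, y_tr, y_te = train_test_split(Z, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Z_tr, y_tr)
print("test accuracy:", clf.score(Z_te, y_te))
```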
Clustering and visualization models are used with the data representation vector as input. We identified several techniques employed across the literature. T-distributed Stochastic Neighbor Embedding (t-SNE) emerged as the most frequently used model for data representation visualization and cluster interpretation (n = 12, 26%). In terms of clustering techniques, K-means33,38,47,62 was found to be the most common method. These clustering models take the embedding vectors generated by trained representation learning models as input.
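A minimal sketch of this pipeline, clustering stand-in embedding vectors with K-means and projecting them to two dimensions with t-SNE for visualization:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

Z = np.random.randn(500, 128)                    # stand-in patient embeddings
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(Z)
Z_2d = TSNE(n_components=2, random_state=0).fit_transform(Z)
print(Z_2d.shape, np.bincount(labels))           # 2-D coordinates, cluster sizes
```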
Evaluation metrics
The evaluation of these tasks is primarily categorized into classification and clustering assessments, each employing different metrics to measure performance.
For classification tasks, the majority were binary. The most frequently used classification metric was AUROC (n = 21, 46%), followed by AUPRC (n = 14, 30%), accuracy (n = 10, 22%), and F1 (n = 9, 20%); other metrics, such as precision (n = 6, 13%) and sensitivity (n = 5, 11%), were used less frequently. A few studies evaluated multi-class classification tasks, reporting metrics such as average precision51,74, precision at k44,45, macro-F129,65, and weighted F124,29 in the corresponding studies10,24,28,74.
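For reference, the dominant metrics can be computed directly with scikit-learn; AUROC and AUPRC are threshold-free and operate on predicted probabilities, while F1 requires hard labels. The values below are toy data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score

y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

print("AUROC:", roc_auc_score(y_true, y_prob))
print("AUPRC:", average_precision_score(y_true, y_prob))
print("F1:", f1_score(y_true, (y_prob >= 0.5).astype(int)))  # thresholded at 0.5
```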
For clustering tasks, despite the prevalence of clustering studies, only a few employed specific clustering analysis metrics. Silhouette analysis (n = 4, 9%) was the most frequently used metric, followed by the Davies-Bouldin index33,41 (n = 2, 4%) and the purity score42,64 (n = 2, 4%).
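Both the silhouette score and the Davies-Bouldin index are internal measures computable without ground-truth labels; a minimal sketch on random stand-in embeddings and assignments:

```python
import numpy as np
from sklearn.metrics import silhouette_score, davies_bouldin_score

Z = np.random.randn(300, 32)                 # stand-in embedding vectors
labels = np.random.randint(0, 3, 300)        # stand-in cluster assignments

print("silhouette:", silhouette_score(Z, labels))          # higher is better
print("Davies-Bouldin:", davies_bouldin_score(Z, labels))  # lower is better
```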
Interpretability
Interpretability in machine learning is defined as the extraction of relevant knowledge from a machine-learning model concerning relationships either contained in the data or learned by the model75. Attention weight analysis was used in several studies (n = 6, 13%), and statistical analysis of the clusters was employed in some papers (n = 3, 6%). For post-hoc interpretability, methods such as Integrated Gradients13 and gradient-based saliency45 were utilized. Most of the papers interpreted their results using visualizations computed by t-SNE (n = 12, 26%) and Uniform Manifold Approximation and Projection (UMAP) (n = 3, 6%). Ten papers involved medical expert interpretation. Overall, only two papers applied post-hoc interpretability methods to trained models. Refer to Supplementary Data 7 for detailed information on the interpretability methods used in the studies.
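As a rough sketch of attention-weight analysis, the example below reads the head-averaged attention that a single (untrained, toy) attention layer assigns from a [CLS]-like position to each code token and treats it as an importance score; the reviewed studies inspect trained multi-layer models, so this is only illustrative.

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
tokens = torch.randn(1, 6, 32)   # position 0 plays the [CLS] role; 5 code embeddings

_, weights = attn(tokens, tokens, tokens,
                  need_weights=True, average_attn_weights=True)
cls_to_codes = weights[0, 0, 1:]  # attention from [CLS] to each code token
print(cls_to_codes)               # one (rough) importance score per code
```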
Fields of application
Our scoping review identified a variety of downstream tasks across the articles, distributed over multiple clinical domains, with Cardiology24,31,32,34,35,40,41,43,53,54,55,56,60,61,74 (n = 15, 33%), General & multiple diseases (n = 11, 24%), and Neurology & Psychiatry and Primary Care (n = 9, 20%) being the most frequently studied areas. Oncology (n = 6, 13%) followed, while Infectious Diseases39,40,47,61, Endocrinology35,38,42,60, and Respiratory13,23,32,64 each had four downstream tasks (n = 4, 9%). Gastroenterology40,42 and Nephrology27,35 had the fewest downstream tasks (n = 2, 4%). A detailed overview of the clinical events and their corresponding clinical domain mapping can be found in Supplementary Data 1.
Evaluation tasks
Upon training, deep learning models develop an intrinsic representation of the data, which can be general, supporting multiple tasks, or task-specific, focusing on a single task or a few similar tasks. Representation quality is evaluated on various clinical tasks, including predictive tasks and patient phenotyping. For detailed information on the evaluation tasks in the studies, see Supplementary Data 7.
Among the 73 predictive tasks, the primary focus was on disease prediction (n = 27, 59%), followed by mortality prediction (n = 11, 24%), readmission prediction14,26,28,32,36,53,55,65,76 (n = 9, 20%), hospitalization (n = 5, 11%), and length of stay prediction (n = 4, 9%). In addition to these, other tasks included medication recommendations22,26,40 (n = 3, 7%), ICD coding56, doctor recommendations65, ICU transfers14, emergency department visits63, and high medical resource utilization63.
Beyond predictive modeling, patient phenotyping plays a crucial role in understanding patient populations. Of the 33 patient phenotyping tasks, clustering was primarily used for visualization (n = 15, 33%), patient similarity assessment (n = 8, 24%), characterization of clusters (n = 3, 9%), patient subtyping (n = 2, 6%), and patient stratification (n = 1, 3%).
Medical expert involvement
Medical experts were involved at different stages of the studies, with varying degrees of participation. Among the reviewed publications, expert participation was most prominent in study design (n = 14, 30%) and result interpretation (n = 14, 30%). Feature selection also saw substantial expert input (n = 10, 22%), while dataset extraction had more limited expert participation (n = 4, 9%).