A scoping review of self-supervised representation learning for clinical decision making using EHR categorical data

This section provides a comprehensive overview of the findings from our scoping review, organized around subsections that emerged during our analysis. We begin by outlining the characteristics of the included studies and the types of data utilized. Next, we examine the studies from a technical perspective, covering data preprocessing techniques, SSRL model types, SSRL model comparison, models for downstream tasks, evaluation metrics, and interpretability techniques. Finally, we analyze the studies from a clinical perspective, focusing on the fields of clinical application, clinical downstream tasks, and the involvement of medical experts. Table 1 summarizes the key features of the technical aspects, and Table 2 provides essential information on the studies from the medical perspective.
Study characteristics
As illustrated in Fig. 2, most of the research (n = 33, 72%) was conducted by interdisciplinary teams of medical experts and data scientists. The United States led in the number of published studies (n = 21, 46%), followed by China (n = 9, 20%) and the United Kingdom (n = 4, 9%). Despite this geographic diversity, only a few studies (n = 11, 24%) involved international collaborations. For details on the authors and research teams, refer to Supplementary Data 2.

Fig. 2: a Composition of authors, categorized into two groups: those specializing in data science only, and those with expertise in both data science and medical fields. b Annual distribution of published studies from 2019 to 2024, categorized by continent.
Type of model and trend
Five main model types have been identified for representing EHR categorical data: Transformer-based models (n = 20, 43%), Autoencoder (AE) based models (n = 13, 28%), Graph Neural Network (GNN) based models (n = 8, 17%), Word-embedding models (n = 3, 7%), and Recurrent Neural Network (RNN) based models (n = 3, 7%). Studies that combine two or more model types are counted once for each corresponding model type. To assess their impact on research, we analyzed the number of citations for each model type.
Figure 3 shows the papers published from January 2019 to December 2023, their citation counts by July 2024, and their corresponding model types. Based on the number of citations, Transformers, RNN, and GNN models are the most impactful, with Transformer models showing particularly high citation counts for papers published from 2020 to 2023.

Fig. 3: Each data point represents a paper, labeled with its corresponding reference, and is color-coded by the model type used: Transformer, AE, GNN, Word-embedding, RNN, and others. Papers published in 2024 are not shown.
Type of data
Studies utilize various data types to represent patients and medical knowledge. Typically, patient representation is derived from EHRs, incorporating both categorical and non-categorical data. Additionally, external medical knowledge can be integrated into models through data collected beyond EHRs. For detailed information on the modalities used across studies, see Supplementary Data 3.
Among the categorical data types in EHRs, diagnosis codes are the most frequently used (n = 45, 98%), including ICD-9, ICD-10-CM, and SNOMED-CT. Medication codes (n = 32, 70%), such as ATC and SNOMED-CT, and procedure codes (n = 20, 43%), such as CPT and ICD-10-PCS, are also widely used. To enhance patient representation, non-categorical data may also be included. The most common non-categorical data types are patient age (n = 19, 41%), clinical measurement values (n = 15, 33%) such as BMI, heart rate, and systolic blood pressure, and clinical narratives from physicians and practitioners (n = 7, 15%).
The integration of external data sources can further enrich patient profiles. Medical knowledge graphs and ontologies provide rich hierarchical information, while medical text corpora contain expert medical knowledge. These external sources offer a more comprehensive understanding of clinical concept interactions. Among external data sources, ontologies are the most frequently used (n = 7, 15%); they are employed to obtain medical concept embeddings22,23,24,25,26,27,28 and as an SSRL training task23. Other notable external data sources include medical knowledge graphs25,29 and medical text corpora30.
Data preprocessing
Most models treat each data element as a distinct unit or token (n = 44, 95%). The identified data preprocessing techniques address various aspects such as numerical data, categorical data, data cleaning, and data shuffling. Some studies (n = 7, 15%) performed categorization by converting exact ages into intervals and clinical measurements into categories like high, normal, and low, based on clinical evaluation standards31,32,33,34,35,36,37. When maintaining the numerical nature of data, missing value imputation30,38,39 and value normalization31,39,40,41 have also been employed.
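As a concrete illustration, the categorization step can be as simple as binning. The minimal Python sketch below assumes a decade-wide age interval and hypothetical systolic blood pressure cut-offs; neither is taken from the reviewed studies.

```python
def age_to_interval(age: int, width: int = 10) -> str:
    """Map an exact age to a decade-interval token, e.g. 47 -> 'AGE_40_49'."""
    lo = (age // width) * width
    return f"AGE_{lo}_{lo + width - 1}"

def sbp_to_category(sbp_mmhg: float) -> str:
    """Map systolic blood pressure to a coarse category token (hypothetical cut-offs)."""
    if sbp_mmhg < 90:
        return "SBP_LOW"
    if sbp_mmhg <= 120:
        return "SBP_NORMAL"
    return "SBP_HIGH"

print(age_to_interval(47), sbp_to_category(135.0))  # AGE_40_49 SBP_HIGH
```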
Some studies standardize data elements by mapping them to known ontologies23,36,42,43. A common approach to reduce dimensionality and data sparsity is using only the first digits of codes, effectively replacing them with parent nodes in the hierarchical ontology (n = 15, 33%).
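The truncation step amounts to keeping a code's leading characters, which corresponds to replacing it with its parent category in the ICD hierarchy. A minimal sketch, assuming ICD-9 codes with the decimal point removed:

```python
def truncate_icd9(code: str) -> str:
    """Keep only the first three characters of an ICD-9 code, e.g. '428.0' -> '428'."""
    return code.replace(".", "")[:3]

visits = [["4280", "25000", "4019"], ["V4581", "42731"]]
truncated = [[truncate_icd9(c) for c in visit] for visit in visits]
print(truncated)  # [['428', '250', '401'], ['V45', '427']]
```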
In terms of data cleaning, typical practices include the removal of rare medical terms14,32,37,42,44,45 and the elimination of duplicated terms within a specific time range22,42,46,47. Additionally, shuffling the order of medical concepts within a time window33,47 was shown to help the model generalize better by mitigating the impact of arbitrary sequencing and emphasizing co-occurrence over specific order. This method can also be considered a form of data augmentation. Detailed information on data preprocessing across studies can be found in Supplementary Data 4.
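The shuffling augmentation can be expressed compactly: concepts are permuted within each visit (time window) while the visit order itself is preserved. A minimal sketch with toy codes:

```python
import random

def shuffle_within_visits(patient, rng=random.Random(0)):
    """patient: list of visits, each visit a list of code tokens.

    Reorders codes within each visit; the visit sequence is untouched.
    """
    return [rng.sample(visit, k=len(visit)) for visit in patient]

patient = [["I10", "E11", "N18"], ["I50", "J44"]]
print(shuffle_within_visits(patient))
```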
Self-supervised learning models
There are two primary self-supervised learning training strategies: generative and contrastive. Generative tasks involve models predicting parts of the data from other parts, which may be incomplete, transformed, masked, or corrupted. These tasks, such as autoregressive prediction and masked modeling, help the model learn to recover whole or partial features of its original input17,48. Contrastive tasks, on the other hand, focus on distinguishing between similar and dissimilar data points, helping the model capture discriminative features that are essential for understanding different types of data48. Both task types are crucial for training models to generate rich, generalized representations from unlabeled data48,49, and they are applied across various model architectures. The objective of these models is to capture essential patterns and features in the data and to output the learned representation, which is typically a fixed-length, high-dimensional vector that condenses large amounts of information. Five major architecture types have been identified in the studies, each trained on unlabeled data with different training tasks. Details of the SSRL models used and the temporality monitored in each study are provided in Supplementary Data 5.
Transformer-based models are among the most impactful model types in the studies. In the medical domain, most transformer-based models treat patients as documents, visits as sentences, and medical concepts as tokens, capturing detailed patient histories. BERT50 is a transformer encoder-only model that effectively learns data representations by processing and contextualizing complex sequences of information. BERT models can be trained using various techniques, such as the Masked Language Model (MLM) alone, predicting randomly masked medical concepts in each EHR sequence34,43,44,51,52,53 to enhance contextual understanding. Training with both MLM and auxiliary tasks13,22,39,54,55,56 further refines the model's representations by guiding it with specific medical insights. Additionally, self-contrastive learning techniques help improve BERT's robustness and accuracy in capturing meaningful patterns in medical data30,35. Other transformer-based training tasks include next-visit code prediction23,36,45,57, medical code category prediction23, medication-diagnosis cross prediction26, and token replacement detection (ELECTRA)58.
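To make the MLM objective concrete, the sketch below shows only the masking step on a toy sequence of medical codes: a fraction of ordinary tokens is replaced by [MASK] and becomes the prediction target, while special tokens are left intact. The masking rate follows BERT's convention; the codes are illustrative.

```python
import random

def mask_sequence(tokens, mask_rate=0.15, rng=random.Random(0)):
    """Return (model inputs, MLM targets) for one EHR code sequence."""
    inputs, labels = [], []
    for tok in tokens:
        if not tok.startswith("[") and rng.random() < mask_rate:
            inputs.append("[MASK]")
            labels.append(tok)      # the model must recover the original code
        else:
            inputs.append(tok)
            labels.append(None)     # position not scored by the MLM loss
    return inputs, labels

seq = ["[CLS]", "I10", "E11", "[SEP]", "I50", "J44", "N18"]
print(mask_sequence(seq, mask_rate=0.3))
```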
AE-based models are encoder-decoder models that aim to reconstruct the input, enabling the learning of data representations in a compressed, lower-dimensional space. AEs are designed to learn the most salient features of the data, which can be particularly useful for capturing the underlying structure of categorical EHR data. Various AE variants were applied in the studies: Stacked Autoencoders32,59, Denoising Autoencoders60, and Autoencoders with RNN units such as GRU31 and LSTM38,41,61,62,63. Additionally, AEs can be combined with other models such as collective matrix factorization29, CNNs42, and clustering algorithms27,64.
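The sketch below is a minimal PyTorch denoising autoencoder over multi-hot code vectors (one input dimension per medical code); the layer sizes and the 20% input-dropping noise are illustrative choices, not settings reported in the studies.

```python
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    def __init__(self, n_codes=2000, dim=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_codes, 512), nn.ReLU(), nn.Linear(512, dim))
        self.dec = nn.Sequential(nn.Linear(dim, 512), nn.ReLU(), nn.Linear(512, n_codes))

    def forward(self, x):
        noisy = x * (torch.rand_like(x) > 0.2)   # randomly drop ~20% of codes
        z = self.enc(noisy)                      # fixed-length patient representation
        return self.dec(z), z

x = (torch.rand(32, 2000) < 0.01).float()        # toy multi-hot batch
recon, z = DenoisingAE()(x)
loss = nn.functional.binary_cross_entropy_with_logits(recon, x)
loss.backward()
```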
GNN-based models use graph learning to represent medical ontologies, hospital visits, and disease co-occurrence. Nodes represent medical concepts and personal entities, linked by edges indicating their relationships. Graph attention models were used to learn medical concept embeddings within medical ontologies22,26, with these embeddings frequently serving as initializations for further model training. A random-walk technique was used to embed doctors according to their specialty65. Graph contrastive learning25,28 generates multiple views of augmented hospital visit graphs by modifying the original graph with node or edge perturbations, allowing the model to learn robust representations by contrasting positive pairs against negative pairs. These approaches ensure that the learned embeddings accurately reflect the complex relationships inherent in medical data49.
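As a rough illustration of the augmentation step in graph contrastive learning, the sketch below builds two views of a toy visit graph by random edge dropping; the graph encoder (e.g., a graph attention network) and the contrastive loss that pulls the two views together are omitted.

```python
import random

def drop_edges(edges, p=0.2, rng=random.Random(0)):
    """Create one augmented view of a graph by removing each edge with probability p."""
    return [e for e in edges if rng.random() > p]

edges = [("visit_1", "I10"), ("visit_1", "E11"), ("visit_2", "I50"), ("visit_2", "J44")]
view_a = drop_edges(edges, rng=random.Random(0))  # positive pair: two perturbed
view_b = drop_edges(edges, rng=random.Random(1))  # views of the same graph
print(view_a, view_b, sep="\n")
```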
Word-embedding-based models convert words into numerical vectors, allowing computers to capture their meanings and relationships from their context in a sequence of words. The model learns to map each word or concept to a dense vector representation, capturing semantic similarities based on co-occurrence patterns. Patient EHR data, composed of a sequence of medical concepts ordered by time, are used to train the representation model to predict medical concepts from their surrounding context, helping the model understand relationships between concepts. Various algorithms were identified, such as GloVe46, Word2vec33,46,47, and FastText46.
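A minimal sketch of this setup using gensim's Word2Vec, treating each patient's time-ordered code sequence as a "sentence" (skip-gram variant); the codes and hyperparameters are illustrative:

```python
from gensim.models import Word2Vec

patients = [
    ["I10", "E11", "I50", "N18"],
    ["J44", "I10", "I50"],
    ["E11", "N18", "I10"],
]
model = Word2Vec(patients, vector_size=64, window=5, min_count=1, sg=1, epochs=50)
print(model.wv.most_similar("I10", topn=2))  # codes that co-occur with I10
```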
RNN-based models are designed to capture temporal dependencies in sequential data, making them well-suited for tasks involving time-series EHR data. These models are trained with the objective of predicting future medical events based on a patient's historical data. Studies14,36,37 use a specific type of RNN, the GRU. The models were trained to predict the set of medical codes of day t based on the medical codes of previous days. To strengthen temporal modeling, these studies also included time-gap information in the input.
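The sketch below shows this objective in PyTorch under simplifying assumptions: days are encoded as multi-hot code vectors, a normalized time-gap scalar is appended to each day's input, and the model predicts the code set of the final day.

```python
import torch
import torch.nn as nn

n_codes, dim = 500, 64
gru = nn.GRU(input_size=n_codes + 1, hidden_size=dim, batch_first=True)
head = nn.Linear(dim, n_codes)

days = (torch.rand(8, 10, n_codes) < 0.02).float()  # 8 patients, 10 days of codes
gaps = torch.rand(8, 10, 1)                         # normalized time gaps per day
x = torch.cat([days, gaps], dim=-1)

out, _ = gru(x[:, :-1])                             # encode days 1..t-1
logits = head(out[:, -1])                           # predict the code set of day t
loss = nn.functional.binary_cross_entropy_with_logits(logits, days[:, -1])
loss.backward()
```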
SSRL model comparison
Different self-supervised representation learning models offer unique advantages and face specific limitations. The choice of models depends on several factors, including the size of the available dataset, the importance of temporal modeling for the downstream tasks, and the computational resources available at the institution.
AEs excel at dimensionality reduction66 and are well-suited to relatively moderate datasets (average size: 166k in the included studies). However, they struggle with high-sparsity data67 and cannot inherently model temporal dependencies without incorporating sequential components, such as RNNs, CNNs, and Transformers.
Word embedding models are designed to map medical concepts or tokens into dense vector spaces that capture contextual information and syntactic relationships in the data68. They perform well with a moderate dataset (average size: 139k in the included studies). However, traditional word embeddings are static and fail to account for the temporality or the sequential order of the input data, necessitating their integration with sequential components.
GNNs perform well with small to moderate datasets (average size: 55k in the included studies) and are particularly effective at representing relational data, such as knowledge graphs, patient networks, and ontologies69. They offer strong interpretability by visualizing relational data, aligning with clinical knowledge. However, GNNs alone cannot fully address temporal dependencies, necessitating their integration with sequential components.
RNNs70 are well-suited for larger datasets (average size: 1.8 M in the included studies) and excel at capturing temporal patterns in sequential data. However, their training process is not parallelizable, leading to time inefficiencies71.
Transformers dominate SSRL research due to their ability to simultaneously capture long-range dependencies and temporal patterns72, offering scalability for large datasets and robust performance across diverse tasks. However, training these models from scratch necessitates substantial amounts of data (average size: 3 M in the included studies), and their high computational cost and complexity can pose significant challenges for deployment in resource-limited settings73.
Downstream task models
Predictive models for classification use the trained SSRL model as their backbone, to which a task-specific classification head is added. These predictive models require labeled data for training on specific tasks. Among the articles that mention the predictive models used for classification tasks, several model types were identified. These models are predominantly characterized by simple architectures that are easy to train. Some studies employ shallow models such as a linear layer23,39,44,57, logistic regression (LR) (n = 8, 17%), and support vector machines (SVM)31,74. Models that can capture more complex data patterns, such as feedforward neural networks (n = 12, 26%) and RNNs13,40,54,55,62,65 (n = 6, 13%), are also applied.
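A minimal sketch of this "frozen backbone, simple head" setup, using random stand-ins for the SSRL patient embeddings and scikit-learn's logistic regression:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

Z = np.random.randn(1000, 128)        # stand-in SSRL patient representations
y = np.random.randint(0, 2, 1000)     # task labels (e.g., readmission yes/no)

Z_tr, Z_te, y_tr, y_te = train_test_split(Z, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Z_tr, y_tr)
print("test accuracy:", clf.score(Z_te, y_te))
```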
Clustering and visualization models are used with the data representation vector as input. We identified several techniques employed across the literature. T-distributed Stochastic Neighbor Embedding (t-SNE) emerged as the most frequently used model for data representation visualization and cluster interpretation (n = 12, 26%). In terms of clustering techniques, K-means33,38,47,62 was found to be the most common method. These clustering models take the embedding vectors generated by trained representation learning models as input.
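A minimal sketch of this pipeline, clustering stand-in embedding vectors with K-means and projecting them to two dimensions with t-SNE for visualization:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

Z = np.random.randn(500, 128)                    # stand-in patient embeddings
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(Z)
Z_2d = TSNE(n_components=2, random_state=0).fit_transform(Z)
print(Z_2d.shape, np.bincount(labels))           # 2-D coordinates, cluster sizes
```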
Evaluation metrics
The evaluation of these tasks is primarily categorized into classification and clustering assessments, each employing different metrics to measure performance.
For classification tasks, the majority were binary. The most frequently used classification metric was AUROC (n = 21, 46%), followed by AUPRC (n = 14, 30%), accuracy (n = 10, 22%), and F1 (n = 9, 20%); other metrics, such as precision (n = 6, 13%) and sensitivity (n = 5, 11%), were used less frequently. A few studies evaluated multi-class classification tasks, reporting metrics such as average precision51,74, precision at k44,45, macro-F129,65, and weighted F124,29 in the corresponding studies10,24,28,74.
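For reference, the dominant metrics can be computed directly with scikit-learn; AUROC and AUPRC are threshold-free and operate on predicted probabilities, while F1 requires hard labels. The values below are toy data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score

y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

print("AUROC:", roc_auc_score(y_true, y_prob))
print("AUPRC:", average_precision_score(y_true, y_prob))
print("F1:", f1_score(y_true, (y_prob >= 0.5).astype(int)))  # thresholded at 0.5
```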
For clustering tasks, despite the prevalence of clustering studies, only a few employed specific clustering analysis metrics. Silhouette analysis (n = 4, 9%) was the most frequently used metric, followed by the Davies-Bouldin index33,41 (n = 2, 4%) and the purity score42,64 (n = 2, 4%).
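Both the silhouette score and the Davies-Bouldin index are internal measures computable without ground-truth labels; a minimal sketch on random stand-in embeddings and assignments:

```python
import numpy as np
from sklearn.metrics import silhouette_score, davies_bouldin_score

Z = np.random.randn(300, 32)                 # stand-in embedding vectors
labels = np.random.randint(0, 3, 300)        # stand-in cluster assignments

print("silhouette:", silhouette_score(Z, labels))          # higher is better
print("Davies-Bouldin:", davies_bouldin_score(Z, labels))  # lower is better
```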
Interpretability
Interpretability in machine learning is defined as the extraction of relevant knowledge from a machine-learning model concerning relationships either contained in the data or learned by the model75. Attention weight analysis was used in several studies (n = 6, 13%), and statistical analysis of the clusters was employed in some papers (n = 3, 6%). For post-hoc interpretability, methods such as Integrated Gradients13 and gradient-based saliency45 were utilized. Most of the papers interpreted their results using visualizations computed by t-SNE (n = 12, 26%) and Uniform Manifold Approximation and Projection (UMAP) (n = 3, 6%). Ten papers involved medical expert interpretation. Overall, only two papers applied post-hoc interpretability methods to trained models. Refer to Supplementary Data 7 for detailed information on the interpretability methods used in the studies.
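As a rough sketch of attention-weight analysis, the example below reads the head-averaged attention that a single (untrained, toy) attention layer assigns from a [CLS]-like position to each code token and treats it as an importance score; the reviewed studies inspect trained multi-layer models, so this is only illustrative.

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
tokens = torch.randn(1, 6, 32)   # position 0 plays the [CLS] role; 5 code embeddings

_, weights = attn(tokens, tokens, tokens,
                  need_weights=True, average_attn_weights=True)
cls_to_codes = weights[0, 0, 1:]  # attention from [CLS] to each code token
print(cls_to_codes)               # one (rough) importance score per code
```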
Fields of application
Our scoping review identified a variety of downstream tasks across the articles, distributed over multiple clinical domains, with Cardiology24,31,32,34,35,40,41,43,53,54,55,56,60,61,74 (n = 15, 33%), General & multiple diseases (n = 11, 24%), and Neurology & Psychiatry and Primary Care (n = 9, 20%) being the most frequently studied areas. Oncology (n = 6, 13%) followed, while Infectious Diseases39,40,47,61, Endocrinology35,38,42,60, and Respiratory13,23,32,64 each had four downstream tasks (n = 4, 9%). Gastroenterology40,42 and Nephrology27,35 had the fewest downstream tasks (n = 2, 4%). A detailed overview of the clinical events and their corresponding clinical domain mapping can be found in Supplementary Data 1.
Evaluation tasks
Upon training, deep learning models develop an intrinsic representation of the data, which can be general, supporting multiple tasks, or task-specific, focusing on a single task or a few similar tasks. Representation quality is evaluated on various clinical tasks, including predictive tasks and patient phenotyping. For detailed information on the evaluation tasks in the studies, see Supplementary Data 7.
Among the 73 predictive tasks, the primary focus was on disease prediction (n = 27, 59%), followed by mortality prediction (n = 11, 24%), readmission prediction14,26,28,32,36,53,55,65,76 (n = 9, 20%), hospitalization (n = 5, 11%), and length of stay prediction (n = 4, 9%). In addition to these, other tasks included medication recommendations22,26,40 (n = 3, 7%), ICD coding56, doctor recommendations65, ICU transfers14, emergency department visits63, and high medical resource utilization63.
Beyond predictive modeling, patient phenotyping plays a crucial role in understanding patient populations. Of the 33 patient phenotyping tasks, clustering was primarily used for visualization (n = 15, 33%), patient similarity assessment (n = 8, 24%), characterization of clusters (n = 3, 9%), patient subtyping (n = 2, 6%), and patient stratification (n = 1, 3%).
Medical expert involvement
Medical experts were involved at different stages of the studies, with varying degrees of participation. Among the reviewed publications, expert participation was most prominent in study design (n = 14, 30%) and result interpretation (n = 14, 30%). Feature selection also saw substantial expert input (n = 10, 22%), while dataset extraction had more limited expert participation (n = 4, 9%).