Explainable machine learning-based classification of traditional Korean ceramics using XRF chemical composition data

Classification accuracy of machine learning models

A comparative evaluation of the six ML models revealed that both RF and XGB achieved the highest classification accuracy of 95.8% on the test set, followed closely by SVM (93.3%), KNN (88.3%), and DT and PCA-LDA (both 85.0%) (Table 2). The minimal differences between the training and test accuracies across all models indicate effective overfitting control and stable generalization performance. To verify model stability, the learning curve and inter-fold variance analyses are presented in the Supplementary Materials (Fig. S1 and Table S1).

Table 2 Performance comparison of ML models for the classification of traditional Korean ceramics

The comparatively lower performance of the PCA-LDA algorithm can be explained by multiple factors. First, its reliance on linear decision boundaries limits its ability to capture the complex nonlinear relationships inherent in ceramic compositional data. Second, the PCA transformation may have removed subtle but diagnostically meaningful chemical variations, as PCA maximizes overall variance rather than class separability. This issue is particularly critical for ceramic classification, where discriminative cues often reside in minor elemental differences. Finally, with only ten compositional features, dimensionality reduction offers limited benefit and may discard essential information, further constraining the model’s classification capability.

The DT model, which achieved the same test accuracy as PCA–LDA, was further examined through visualization of its decision tree structure (Fig. S2). The visualization revealed that white porcelain could be accurately classified through relatively simple decision rules, whereas celadon and buncheong required more complex, branched pathways. This finding suggests that white porcelain is compositionally distinct from celadon and buncheong, likely reflecting its use of refined clay and reduced concentrations of coloring oxides. However, as a single model rather than an ensemble approach, DT may exhibit lower generalization performance on the test data.

The superior performance of the RF algorithm can be ascribed to its ensemble structure, which effectively models intricate feature interactions by aggregating multiple decision trees. Bagging, employed in RF, reduces the variance and helps to prevent overfitting, enhancing the generalization ability of the model. Similarly, XGB exhibited a robust performance by iteratively correcting residuals from previous models. Although both RF and XGB are tree-based ensemble methods, RF constructs trees in parallel, while XGB builds them sequentially to address previous errors, resulting in subtle differences in model behavior.
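The sequential residual-correction idea behind boosting can be illustrated with a minimal sketch. This is not the XGB implementation used in the study (XGBoost adds regularization and second-order gradient information) and it uses a one-dimensional regression target rather than ceramic classes. The helper names `fit_stump` and `boost`, the learning rate, and the toy data are all assumptions made for illustration.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the current residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate thresholds
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def boost(xs, ys, n_rounds=20, lr=0.5):
    """Build stumps one after another, each fitted to the remaining error."""
    preds = [sum(ys) / len(ys)] * len(xs)  # start from the mean
    history = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        preds = [p + lr * stump(x) for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return history


# Toy data (hypothetical): a nonlinear target that no single stump can fit
xs = list(range(10))
ys = [x ** 2 for x in xs]
mse_history = boost(xs, ys)
print(f"MSE after round 1: {mse_history[0]:.2f}, after round 20: {mse_history[-1]:.2f}")
```

Each round can only lower the training error here, which is why boosting needs a learning rate and a limited number of rounds to avoid overfitting. Bagging, by contrast, fits its trees independently on bootstrap samples and averages them.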

McNemar’s test results (Table S2) indicated that the RF and XGB models achieved statistically significant improvements over PCA-LDA, DT, and KNN (p < 0.05). SVM also performed significantly better than PCA-LDA and DT (p < 0.05), but did not differ significantly from RF, XGB, or KNN (p > 0.05). These findings confirm that the enhanced performance of the ensemble models (RF and XGB) reflects genuine methodological advantages rather than random variation, underscoring the robustness of ensemble learning for ceramic classification based on XRF compositional data. Despite consistent preprocessing and optimization protocols across models, intrinsic architectural differences substantially influence classification performance, highlighting the critical role of model design in chemometric interpretation.
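As a rough illustration of how such a paired comparison works, the sketch below computes a two-sided exact McNemar p-value from the two discordant counts, i.e. the numbers of test samples that only one of the two models classified correctly. The text does not state which variant of the test was used (exact or chi-square approximation), and the counts below are hypothetical rather than taken from Table S2.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: samples model A classified correctly and model B incorrectly
    c: samples model A classified incorrectly and model B correctly
    """
    n = b + c
    if n == 0:
        return 1.0  # the two models never disagree
    k = min(b, c)
    # Under the null hypothesis, each discordant sample favours either
    # model with probability 0.5 (binomial distribution).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical discordant counts for one pair of models
p_value = mcnemar_exact(b=14, c=2)
print(f"p = {p_value:.4f}")
```

Only the discordant samples enter the test, so two models with similar accuracies can still differ significantly if their errors fall on different samples, and vice versa.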

Confusion matrix analysis and PCA visualization

To further assess the classification behavior, the confusion matrices for the training and test sets of each model were analyzed (Fig. 2). The training set matrices confirmed that the models had learned successfully, and the overall trends were consistent with those observed in the test set.

**Fig. 2: Confusion matrices of the training (top) and test (bottom) sets for the six ML models.**

A common misclassification trend was observed for all models: buncheong and celadon were frequently confused with each other, whereas white porcelain was consistently classified with high accuracy. The analysis of the misclassified samples revealed that multiple models often failed to correctly classify the same instances. Specifically, the high-performing RF and XGB models misclassified five samples, and most of these errors also occurred in the other models. The majority of misclassified samples belonged to the buncheong and celadon classes, suggesting that the classification errors reflect intrinsic compositional ambiguities rather than model limitations. This tendency suggests a greater compositional similarity between celadon and buncheong than between either of them and white porcelain. From a material perspective, this result can be attributed to similarities in raw materials and firing conditions. Both celadon and buncheong are typically fired at temperatures exceeding 1200 °C, whereas white porcelain is fired at even higher temperatures, around 1300 °C, resulting in different mineralogical transformations and oxide behaviors⁵². Moreover, the refined nature of the white porcelain clay body, with reduced concentrations of coloring oxides, enhances its chemical distinctiveness. Even under identical firing conditions, white porcelain remains compositionally separate, due to the purity of its raw materials. In contrast, early buncheong often retained features characteristic of inlaid celadon, making the two wares visually and chemically similar⁷. These material and technological similarities explain both the recurrent misclassification patterns observed across all models and the existence of samples that remain fundamentally challenging to classify using XRF-based elemental composition analysis alone.
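To make the link between a confusion matrix and the reported metrics concrete, the following sketch derives overall accuracy and per-class recall from a 3 × 3 matrix. The counts are hypothetical: they assume a balanced test set of 120 samples (40 per class), which is consistent with five errors giving 95.8% accuracy but is not stated in this section, and the distribution of the five errors is invented for illustration.

```python
labels = ["celadon", "buncheong", "white porcelain"]

# Rows: true class; columns: predicted class (hypothetical counts,
# not the values shown in Fig. 2)
cm = [
    [37, 3, 0],
    [2, 38, 0],
    [0, 0, 40],
]


def accuracy(matrix):
    """Fraction of samples on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_recall(matrix):
    """Correct predictions for each true class divided by its row total."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


acc = accuracy(cm)
recall = dict(zip(labels, per_class_recall(cm)))
print(f"accuracy = {acc:.3f}")
for name, value in recall.items():
    print(f"recall[{name}] = {value:.3f}")
```

In a matrix of this shape, all off-diagonal counts sit in the celadon/buncheong block, which is the pattern described above.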

These compositional relationships were further visualized by analyzing PCA scatter plots. PCA achieves dimensionality reduction by projecting high-dimensional data onto orthogonal axes that capture the maximum variance⁵³. In this study, the first two principal components (PC1 and PC2) explained 49.73% and 23.40% of the variance, respectively, accounting for 73.13% of the total variation. The two-dimensional PCA plot in Fig. 3 reveals that white porcelain samples formed a compact cluster in the positive PC1 direction, indicating a distinct chemical composition (namely, lower levels of coloring oxides such as Fe₂O₃ and TiO₂). In contrast, celadon and buncheong substantially overlapped in the negative PC1 region, reflecting their compositional similarity.
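The explained-variance figures can be understood through a minimal two-variable sketch: the principal components are the eigenvectors of the covariance matrix, and each component's share of the variance is its eigenvalue divided by the sum of all eigenvalues. The oxide values below are hypothetical, the function name is an assumption, and standardization or any compositional transformation applied in the study is omitted here.

```python
from math import sqrt


def explained_variance_ratio_2d(xs, ys):
    """Explained variance ratios of the two principal components of 2-D data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Closed-form eigenvalues of the covariance matrix [[sxx, sxy], [sxy, syy]]
    trace = sxx + syy
    disc = sqrt((sxx - syy) ** 2 + 4 * sxy ** 2)
    lam1, lam2 = (trace + disc) / 2, (trace - disc) / 2
    return lam1 / trace, lam2 / trace


# Hypothetical Fe2O3 and TiO2 contents (wt%) for six samples
fe2o3 = [0.8, 1.9, 2.4, 1.1, 2.0, 0.7]
tio2 = [0.1, 0.9, 1.1, 0.3, 0.8, 0.2]
pc1_ratio, pc2_ratio = explained_variance_ratio_2d(fe2o3, tio2)
print(f"PC1: {pc1_ratio:.2%}, PC2: {pc2_ratio:.2%}")

# Cumulative variance reported in the text for ten features
cumulative = round(49.73 + 23.40, 2)
print(f"PC1 + PC2 in the study: {cumulative}%")
```

With ten features, the remaining eight components carry the 26.87% of the variance not shown in the two-dimensional plot, so some class structure may lie outside the PC1–PC2 plane.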

**Fig. 3: Two-dimensional visualization of PCA results, showing the distribution of traditional Korean ceramics.**

These PCA results corroborate the findings based on confusion matrices and underscore the chemical and technological continuity between celadon and buncheong. The blurred compositional boundary between these two ceramic types reflects historical production practices, emphasizing the value of combining statistical analysis with domain-specific knowledge in order to interpret classification outcomes.

Feature importance analysis

To investigate how chemical composition influences the classification of ceramic types, feature importance was evaluated using SHAP values across all ML models (Fig. 4). SHAP quantifies the contribution of each feature to individual model predictions, providing both global and local interpretability within a unified, model-agnostic framework. However, direct numerical comparison of SHAP values across models with different algorithmic structures and feature scaling should be approached with caution. Accordingly, the following analysis focuses on identifying consistent trends in feature importance and interpreting these patterns in conjunction with archeological and technological contexts, rather than making direct quantitative comparisons between models.
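The quantity that SHAP estimates can be illustrated with an exact Shapley value computation on a toy model. This is a sketch under simplifying assumptions: a single baseline sample stands in for the background distribution, the scoring function `toy_score` is hypothetical, and the feature values are invented. It is not the explainer configuration used in the study.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of each feature for one sample."""
    n = len(x)

    def value(subset):
        # Features in `subset` take the sample's value, the rest the baseline
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_score(z):
    """Hypothetical class score with a TiO2-Fe2O3 interaction; SiO2 is unused."""
    tio2, fe2o3, _sio2 = z
    return 2.0 * tio2 + 1.0 * fe2o3 + 0.5 * tio2 * fe2o3


sample = [1.0, 2.0, 70.0]    # hypothetical TiO2, Fe2O3, SiO2 (wt%)
baseline = [0.5, 1.0, 68.0]  # hypothetical reference composition
phi = shapley_values(toy_score, sample, baseline)
print(dict(zip(["TiO2", "Fe2O3", "SiO2"], phi)))
```

The contributions sum to the difference between the model output for the sample and for the baseline, and a feature the model ignores receives zero. Because the values are expressed in units of each model's output, their magnitudes are not directly comparable across models, which is the reason for the caution noted above.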

Among the six models investigated, the tree-based algorithms (DT, RF, and XGB) demonstrated highly consistent feature importance patterns. These models consistently identified TiO₂ as the most important variable, with Fe₂O₃ ranking relatively high, while other oxides contributed to a much lesser degree. This pattern was particularly pronounced in the ensemble models, RF and XGB. This observation aligns with established knowledge of ceramic coloration mechanisms, which considers the concentrations and oxidation states of TiO₂ and Fe₂O₃ as key determinants of body color⁵⁴.

Unlike TiO₂, which generally yields a yellowish tone under both oxidizing and reducing conditions, the color of Fe₂O₃ shifts from yellowish under oxidizing conditions to bluish-green under reducing conditions due to the partial reduction of Fe³⁺ to Fe²⁺. These oxides are therefore closely related to the visual appearance of ceramics, such as the bluish-green hue of celadon and the brownish tone of buncheong ware. In contrast, white porcelain was produced using highly refined clays with minimal amounts of these coloring oxides, resulting in its bright white appearance²⁸. The low TiO₂ content appears to be critical not only for glaze coloration (through redox interactions with iron⁵⁵,⁵⁶) but also for the body composition itself, as such TiO₂-deficient materials are geologically rare yet essential for achieving the characteristic whiteness of high-quality porcelain across East Asian ceramic traditions. The superior classification performance of the RF and XGB models may thus reflect their ability to effectively capture these nonlinear relationships among elemental composition, firing atmosphere, and technological traditions associated with each ceramic type.

In comparison, the SVM model, which ranked third in classification accuracy, displayed a distinct feature importance pattern. While TiO₂ and SiO₂ contents emerged as significant features, Fe₂O₃ was assigned relatively low importance; this likely limited the effectiveness of the model in differentiating celadon and buncheong, in which Fe₂O₃ plays a crucial role.

Interestingly, PCA–LDA and DT, despite their identical test accuracy (85.0%), exhibited markedly different feature importance profiles. DT placed strong emphasis on TiO₂, while PCA–LDA highlighted not only TiO₂ but also SiO₂ and Al₂O₃ as primary features. This tendency reflects the limitations of single models and the constraints of linear models in capturing the nonlinear relationships inherent in ceramic compositional data, particularly when the number of features is limited.

KNN also utilized TiO₂ as an important component for classification, but underestimated Fe₂O₃, which was effectively leveraged by the two highest-performing models (RF and XGB). This difference suggests that the reliance of KNN on geometric proximity in feature space may cause it to overlook subtle chemical effects that tree-based models can capture more effectively.

To further verify the robustness of the interpretability results, SHAP heatmap analyses were performed for both the RF model, which exhibited the highest classification performance, and the PCA-LDA model, which showed the lowest accuracy, across ten independent cross-validation folds (Fig. S3). The visualizations revealed that the key discriminative features, TiO₂ and Fe₂O₃, consistently ranked first and second across all folds in the RF model, indicating a high degree of stability in their contributions to model predictions. However, minor fold-dependent fluctuations were observed among lower-ranking features, reflecting slight variations in model sensitivity to secondary compositional effects. In contrast, the PCA-LDA model showed greater variability overall, with moderate fluctuations even among the top-ranking features: the relative importance of TiO₂ and Al₂O₃ occasionally shifted between the first and second ranks across folds. This comparison highlights the enhanced robustness of tree-based ensemble models in capturing nonlinear composition–class relationships compared to linear models such as PCA-LDA, which are more sensitive to sampling variation and feature intercorrelations.

In summary, the SHAP-based feature importance analysis identified the chemical oxides that drive classification decisions in each model, and confirmed that these ML-derived patterns align well with established knowledge in materials science. TiO₂ and Fe₂O₃ were consistently recognized as key features for white porcelain identification, while TiO₂ also contributed significantly to celadon classification, likely reflecting its synergistic role with Fe₂O₃ in producing the greenish hue of the ceramic under reducing conditions⁵. Moreover, the combination of higher Fe₂O₃ contents and intermediate TiO₂ levels served as a critical marker for buncheong, consistent with its darker body color and intermediate position between celadon and white porcelain². These findings demonstrate that SHAP analysis can enhance the model transparency and facilitate the connection between ML outputs and archeological as well as technological interpretations, thereby strengthening the interpretive value of AI in heritage science.

Analysis of feature impact distribution plots

SHAP beeswarm plots (Fig. 5) were analyzed to investigate both the relative importance and the directional influence of individual features on ceramic classification. These plots clearly show how changes in chemical composition affect the probability of a sample being classified into a particular ceramic type. The analysis focused on comparing the RF with the PCA–LDA model. SHAP beeswarm plots for the remaining models are displayed in Fig. S4.

**Fig. 5: SHAP beeswarm plots comparing feature impact distributions in RF (left) and PCA–LDA (right) models across ceramic types.**

In these plots, the SHAP values on the x-axis indicate how each feature influences classification decisions: positive/negative values reflect an increased/decreased probability of assignment to a specific ceramic type, and the absolute magnitude of the SHAP value represents the strength of the contribution of a feature to the classification outcome.

In the RF model, TiO₂ emerged as the most influential feature for celadon classification. A high TiO₂ content consistently showed positive contributions, indicating that elevated TiO₂ concentrations increased the probability of celadon classification; conversely, low TiO₂ levels contributed negatively. Fe₂O₃ made the second-highest contribution to celadon classification: low Fe₂O₃ levels contributed negatively, whereas intermediate levels contributed positively.

However, in the buncheong classification, higher Fe₂O₃ concentrations exhibited a strong association with increased classification probability. This contrasting behavior suggests that Fe₂O₃ content serves as a hierarchical classification marker: high concentrations favor buncheong, whereas low concentrations are indicative of white porcelain, and intermediate values are more characteristic of celadon. TiO₂ also functioned as a hierarchical classification marker, with intermediate values positively contributing to buncheong classification, high values to celadon, and low values to white porcelain. P₂O₅ and Na₂O, while not appearing among the top-ranked features in white porcelain classification, exhibited contrasting concentration distributions between celadon and buncheong, indicating their partial contribution to differentiating these two ceramic types. The inverse relationship observed in the SHAP distributions underscores the role of P₂O₅ and Na₂O as complementary features that distinguish these two stylistically similar ceramic types.

In the case of white porcelain, the SHAP distributions for TiO₂ and Fe₂O₃ showed clear bimodal patterns, with low oxide concentrations contributing strongly and positively to the white porcelain classification. This dichotomous pattern reflects the well-known use of high-purity white clay in white porcelain production, where the removal of coloring oxides such as TiO₂ and Fe₂O₃ enhances whiteness. These findings corroborate the classification rules learned by RF models and align closely with established knowledge on production techniques of Joseon white porcelain²⁸,⁵⁵.

In contrast, the PCA–LDA model showed different feature contribution patterns, placing greater importance on Al₂O₃ and SiO₂ rather than key coloring oxides such as TiO₂ and Fe₂O₃. As illustrated in Fig. 5 (right), these coloring oxides showed relatively low importance compared to RF and lacked clear directional separation. This suggests that PCA–LDA, due to its linear structure, was less effective at capturing the compositional characteristics most relevant to ceramic classification.

Taken together, the SHAP beeswarm plots provide a detailed explanation of how the elemental composition influences the ceramic classification. They validate the predictive mechanisms of high-performing models such as RF and expose the interpretive limitations of lower-performing algorithms. They also demonstrate the potential of SHAP analysis to link ML outputs with archaeometric knowledge, providing explainable and data-driven insights into ceramic production technologies.

External test dataset validation

To assess the generalizability of the models, an external dataset that was not used for training or internal validation was evaluated (Table 3). The RF model achieved the highest accuracy (93.2%), followed by XGB (91.5%) and SVM (91.5%), consistent with the internal test results and indicating robust generalization capability across independently sourced samples.

Table 3 Performance comparison of ML models on an external test dataset of traditional Korean ceramics

The confusion matrices (Fig. 6) provide a detailed view of class-specific performance. The tree-based models, RF and XGB, demonstrated strong performance in classifying white porcelain, whereas PCA-LDA and SVM, which construct high-dimensional decision boundaries, performed better for celadon and buncheong but were weaker for white porcelain. These differences highlight the methodological distinctions inherent in each model.

**Fig. 6: Confusion matrices for external validation across six ML models using data from an independent researcher.**

Importantly, the overall classification patterns observed in the external validation were consistent with the SHAP-based feature importance trends identified in the internal analyses, further supporting the reliability of the compositional relationships captured by the models. It should be noted that the external dataset was derived from a limited number of studies and may be subject to regional bias; therefore, caution is warranted when extrapolating these findings to broader populations or drawing general conclusions regarding ceramic classification trends.

Comparison with previous studies

Recent studies have increasingly applied ML to the classification of ancient ceramics, particularly using chemical composition data. Among these investigations, Sun et al. developed a classification model for ancient Chinese celadon using XRF compositional data and four ML algorithms (RF, Adaboost, KNN, and SVM), combined with Mahalanobis distance analysis for post-classification verification¹³. Their dataset comprised more than 1000 celadon shards from 18 kiln sites, categorized into eight types. Despite using a relatively large dataset, inherent class imbalance was present among the data, due to varying sample availability from each kiln. To enhance the model reliability, the authors employed both leave-one-out cross-validation (LOOCV) and repeated 10-fold cross-validation, achieving a highest average accuracy of 96.41% with RF and a kappa coefficient of 0.985. The study focused on the classification accuracy and characteristic chemical parameters, and model interpretation was mainly based on global feature importance.

Qi et al. applied RF modeling to classify ceramics from six chronological periods, including modern, Qing, Song, Jin, Yuan, and Tang dynasties, based on LIBS spectral data⁵⁷. Despite using only 35 samples, they optimized the model performance through advanced pre-processing, variable selection based on out-of-bag (OOB) error, and sensitivity/specificity trade-offs. The final RF model, which was optimized using variable importance thresholds and OOB error minimization, achieved an accuracy of 94.33% on the test set. Similar to Sun et al., their focus was on classification accuracy, and performance metrics such as sensitivity, specificity, and OOB error were used for evaluation. However, class imbalance was inevitable due to the limited sample availability per period, and the model interpretation remained confined to overall variable importance.

The present study shares a common methodological foundation with previous research, involving the application of ML models to classify ceramics based on chemical composition data. Similar to the investigations by Sun et al. and Qi et al., supervised learning approaches were employed in this study, and the RF model consistently demonstrated high classification performance, achieving a 95.8% accuracy.

Despite this methodological continuity, there are differences in terms of classification objectives and data structure. Previous studies primarily focused on celadon types or chronological periods, whereas the present work targets typological classification among three major Korean ceramic types (celadon, buncheong, and white porcelain), addressing broader stylistic and technological variations. This study employed a balanced dataset with equal representation of each ceramic type, enabling a more controlled assessment of model performance in a multiclass setting compared to prior studies based on imbalanced datasets of excavated samples.

Differences also exist in the validation strategies applied. Sun et al. employed leave-one-out and repeated ten-fold cross-validation, primarily to improve the model reliability and monitor the sample-level classification stability¹³, while Qi et al. applied OOB error estimation and train/test splits alongside variable selection to optimize the classification performance⁵⁷. The present study combined an initial train/test split to enable independent performance evaluation with stratified ten-fold cross-validation within the training set, maintaining the class proportions unchanged during model optimization. Additionally, the generalizability of the model beyond the original dataset was evaluated using an external dataset from independent studies. This approach enables both external performance verification and robust internal validation in a balanced, multiclass classification context.
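The validation scheme described above (a stratified hold-out split followed by stratified ten-fold cross-validation within the training set) can be sketched in plain Python as follows. The dataset size of 200 samples per class, the 20% test fraction, the random seed, and the function names are assumptions chosen for illustration; they are not values reported in this section.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction, rng):
    """Hold out the same fraction of every class for testing."""
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return train, test


def stratified_folds(indices, labels, k, rng):
    """Deal each class's samples round-robin into k folds."""
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)
    return folds


rng = random.Random(0)
# Hypothetical balanced dataset: 200 samples per ceramic type
labels = ["celadon"] * 200 + ["buncheong"] * 200 + ["white porcelain"] * 200
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, rng=rng)
folds = stratified_folds(train_idx, labels, k=10, rng=rng)

print(len(train_idx), len(test_idx))
print(Counter(labels[i] for i in folds[0]))
```

Because the test indices are set aside before any fold is formed, hyperparameter tuning on the folds never sees the samples used for the final accuracy estimate.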

Another difference lies in the model interpretability. Previous research predominantly relied on overall variable importance to interpret classification outcomes. This study further incorporates both SHAP value analysis and SHAP beeswarm plots. The SHAP values quantify the global and class-specific importance of each chemical feature, while the beeswarm plots further illustrate the direction and magnitude of the contribution of each feature at the individual sample level. This combined approach provides a more comprehensive understanding of how the chemical composition influences classification decisions and enables a clearer connection to be established between ML predictions and archeological interpretations.

Recent advances in materials and heritage science have demonstrated the growing potential of both automated and deep learning frameworks for ceramic classification. Dunn et al.⁵⁸ showed that automated frameworks such as Automatminer achieve excellent performance across diverse materials-prediction tasks, while classical ML remains competitive for smaller datasets (typically <10⁴ samples). Capriotti et al.⁵⁹ reported over 90% accuracy in classifying petrographic thin-section images using CNNs and Vision Transformers. Additionally, Qi et al.⁶⁰ applied a fully connected neural network (FCN) to XRF data for kiln-based ceramic classification, achieving an accuracy of about 93% and demonstrating the growing applicability of deep learning in archaeometric research. These studies highlight the rapid methodological progress in the field and suggest that automated and deep learning frameworks hold great promise for future archaeometric applications, particularly as datasets become larger, multimodal, and more standardized.

In summary, both previous studies and the present work consistently demonstrate the effectiveness of tree-based models, particularly RF, in ceramic classification based on chemical composition data. These studies also emphasize the role of validation strategies in preventing overfitting and improving generalizability, as well as the value of interpretability techniques in linking model outputs to archeological knowledge.

In conclusion, this study demonstrates the potential of ML algorithms for classifying traditional Korean ceramics (celadon, buncheong, and white porcelain) using XRF-derived chemical composition data. Among the six models evaluated, tree-based ensemble methods such as RF and XGB showed the highest accuracy (both 95.8%), followed by SVM, while PCA–LDA exhibited the lowest performance, highlighting the advantages of nonlinear models for this classification task. Analyses using confusion matrices and SHAP values revealed patterns consistent with traditional ceramic typologies and production knowledge. Celadon and buncheong were frequently misclassified due to their shared compositional and technical characteristics, whereas white porcelain was clearly distinguished based on its low TiO₂ and Fe₂O₃ contents. SHAP-based interpretation further demonstrated positive associations between TiO₂ content and celadon classification, and between Fe₂O₃ and buncheong classification. Moreover, contrasting distribution patterns of P₂O₅ and Na₂O were observed between these two ceramic types. In contrast, PCA–LDA assigned greater importance to SiO₂ and Al₂O₃, which contributed less effectively to accurate classification. External validation further confirmed these trends, with the RF, XGB, and SVM models achieving high accuracy using an external dataset, in line with SHAP-based feature importance patterns.

These findings highlight the potential of explainable ML for the classification of traditional Korean ceramics, a task that has historically relied on expert judgment. The developed models achieved accurate and interpretable classification results, providing quantitative insights into the material and technological characteristics underlying celadon, buncheong, and white porcelain. Importantly, the application of compositional data treatment and independent validation confirmed the robustness and reproducibility of the analytical framework. Nevertheless, several limitations should be acknowledged. The current dataset has limited geographical and chronological coverage, which may constrain the model’s generalizability to ceramics from underrepresented kiln sites or production periods. Furthermore, additional factors—such as firing temperature, kiln atmosphere, and regional variability in clay and glaze sources—could contribute to compositional variability and influence classification performance. Expanding the dataset to include more diverse archeological contexts and integrating complementary physicochemical parameters will be essential for further improving model generalization and interpretability.

Although this study focuses on clay body components and is particularly suited for sherds or damaged artifacts, the proposed framework exhibits broader applicability for archeological research. The integration of explainable ML can be further extended to investigate ceramic typologies, kiln production systems, regional distribution patterns, and potentially other cultural heritage materials, contributing to the development of more objective and data-driven approaches in archaeometry.
