Explainable machine learning-based classification of traditional Korean ceramics using XRF chemical composition data
Classification accuracy of machine learning models
A comparative evaluation of the six ML models revealed that both RF and XGB achieved the highest classification accuracy of 95.8% on the test set, followed closely by SVM (93.3%), KNN (88.3%), and DT and PCA-LDA (both 85.0%) (Table 2). The minimal differences between the training and test accuracies across all models indicate effective overfitting control and stable generalization performance. To verify model stability, the learning curve and inter-fold variance analyses are presented in the Supplementary Materials (Fig. S1 and Table S1).
The comparatively lower performance of the PCA-LDA algorithm can be explained by multiple factors. First, its reliance on linear decision boundaries limits its ability to capture the complex nonlinear relationships inherent in ceramic compositional data. Second, the PCA transformation may have removed subtle but diagnostically meaningful chemical variations, as PCA maximizes overall variance rather than class separability. This issue is particularly critical for ceramic classification, where discriminative cues often reside in minor elemental differences. Finally, with only ten compositional features, dimensionality reduction offers limited benefit and may discard essential information, further constraining the model’s classification capability.
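The second point can be illustrated with a toy example. The sketch below uses invented two-feature data (not taken from this study) in which one feature has a large variance but identical class means, while the other has a small variance but well-separated class means. The pooled variance, which is the quantity PCA ranks directions by, favors the first feature, whereas a per-feature Fisher ratio, which is closer to the criterion LDA optimizes, favors the second. All names and values are illustrative assumptions.

```python
from statistics import fmean, pvariance

# Hypothetical toy data, not taken from the study.
# Feature 0: wide spread, same mean in both classes (uninformative).
# Feature 1: narrow spread, clearly different class means (diagnostic).
class_a = [(-10.0, 0.9), (-5.0, 1.0), (0.0, 1.1), (5.0, 1.0), (10.0, 1.0)]
class_b = [(-10.0, -0.9), (-5.0, -1.0), (0.0, -1.1), (5.0, -1.0), (10.0, -1.0)]


def column(rows, j):
    """Return feature j of every sample as a list."""
    return [row[j] for row in rows]


def total_variance(j):
    """Pooled (class-blind) variance of feature j, the quantity PCA ranks by."""
    return pvariance(column(class_a, j) + column(class_b, j))


def fisher_ratio(j):
    """Between-class separation divided by within-class spread for feature j."""
    a, b = column(class_a, j), column(class_b, j)
    between = (fmean(a) - fmean(b)) ** 2
    within = pvariance(a) + pvariance(b)
    return between / within


for j in (0, 1):
    print(f"feature {j}: variance={total_variance(j):.3f}, "
          f"Fisher ratio={fisher_ratio(j):.1f}")
```

In this constructed case, a variance-based projection that retained only the leading direction would keep feature 0 and discard the only feature that separates the classes.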
The DT model, which achieved the same test accuracy as PCA–LDA, was further examined through visualization of its decision tree structure (Fig. S2). The visualization revealed that white porcelain could be accurately classified through relatively simple decision rules, whereas celadon and buncheong required more complex, branched pathways. This finding suggests that white porcelain is compositionally distinct from celadon and buncheong, likely reflecting its use of refined clay and reduced concentrations of coloring oxides. However, as a single model rather than an ensemble approach, DT may exhibit lower generalization performance on the test data.
The superior performance of the RF algorithm can be ascribed to its ensemble structure, which effectively models intricate feature interactions by aggregating multiple decision trees. Bagging, employed in RF, reduces the variance and helps to prevent overfitting, enhancing the generalization ability of the model. Similarly, XGB exhibited a robust performance by iteratively correcting residuals from previous models. Although both RF and XGB are tree-based ensemble methods, RF constructs trees in parallel, while XGB builds them sequentially to address previous errors, resulting in subtle differences in model behavior.
McNemar’s test results (Table S2) indicated that the RF and XGB models achieved statistically significant improvements over PCA-LDA, DT, and KNN (p < 0.05). SVM also performed significantly better than PCA-LDA and DT (p < 0.05), but did not differ significantly from RF, XGB, or KNN (p > 0.05). These findings confirm that the enhanced performance of the ensemble models (RF and XGB) reflects genuine methodological advantages rather than random variation, underscoring the robustness of ensemble learning for ceramic classification based on XRF compositional data. Despite consistent preprocessing and optimization protocols across models, intrinsic architectural differences substantially influence classification performance, highlighting the critical role of model design in chemometric interpretation.
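For readers unfamiliar with McNemar's test, the following minimal sketch computes the two-sided exact (binomial) p-value from the two discordant counts of a paired model comparison. It is an illustration only: the text does not state whether the exact or the chi-square variant was used, the function name is hypothetical, and the counts are invented rather than taken from Table S2.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: samples model A classified correctly and model B misclassified.
    c: samples model B classified correctly and model A misclassified.
    Under the null hypothesis the discordant pairs split 50/50, so the
    smaller count follows a Binomial(b + c, 0.5) distribution.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical discordant counts, not values from Table S2.
print(mcnemar_exact(9, 1))  # one-sided disagreement
print(mcnemar_exact(3, 3))  # balanced disagreement
```

Only samples on which the two models disagree enter the test, which is why two models with similar accuracy can still differ significantly, or not, depending on how their errors overlap.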
Confusion matrix analysis and PCA visualization
To further assess the classification behavior, the confusion matrices for the training and test sets of each model were analyzed (Fig. 2). The training set matrices confirmed the successful model learning, and the overall trends were consistent with those observed in the test set.

Each matrix shows the classification results for celadon (C), buncheong (B), and white porcelain (W), with both raw counts and normalized percentages indicating the classification accuracy for each true class.
A common misclassification trend was observed for all models: buncheong and celadon were frequently confused with each other, whereas white porcelain was consistently classified with high accuracy. The analysis of the misclassified samples revealed that multiple models often failed to correctly classify the same instances. Specifically, the high-performing RF and XGB models each misclassified five samples, and most of these errors also occurred in the other models. The majority of misclassified samples belonged to the buncheong and celadon classes, indicating that the classification errors reflect intrinsic compositional ambiguities rather than model limitations. This tendency suggests a greater compositional similarity between celadon and buncheong compared to white porcelain. From a material perspective, this result can be attributed to similarities in raw materials and firing conditions. Both celadon and buncheong are typically fired at temperatures exceeding 1200 °C, whereas white porcelain is fired at even higher temperatures, around 1300 °C, resulting in different mineralogical transformations and oxide behaviors52. Moreover, the refined nature of the white porcelain clay body, with reduced concentrations of coloring oxides, enhances its chemical distinctiveness. Even under identical firing conditions, white porcelain remains compositionally separate, due to the purity of its raw materials. In contrast, early buncheong often retained features characteristic of inlaid celadon, making them visually and chemically similar7. These material and technological similarities explain both the recurrent misclassification patterns observed across all models and the existence of samples that remain fundamentally challenging to classify using XRF-based elemental composition analysis alone.
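The way raw counts translate into the normalized percentages and the overall accuracy can be sketched as follows. The matrix is hypothetical: its cell values are invented to mimic the pattern described above (confusion confined to celadon and buncheong), and neither the counts nor the assumed class sizes come from Fig. 2.

```python
# Rows are true classes, columns are predicted classes, in the order
# celadon (C), buncheong (B), white porcelain (W).
# Counts are hypothetical and chosen only to illustrate the pattern.
labels = ["C", "B", "W"]
matrix = [
    [37, 3, 0],
    [2, 38, 0],
    [0, 0, 40],
]


def row_normalize(cm):
    """Convert raw counts to per-true-class fractions (each row sums to 1)."""
    return [[count / sum(row) for count in row] for row in cm]


def overall_accuracy(cm):
    """Fraction of all samples that lie on the diagonal."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / sum(sum(row) for row in cm)


normalized = row_normalize(matrix)
for name, row in zip(labels, normalized):
    print(name, [f"{value:.1%}" for value in row])
print(f"accuracy = {overall_accuracy(matrix):.1%}")
```

The diagonal of the row-normalized matrix is the per-class recall, so off-diagonal mass concentrated in the C and B rows corresponds directly to the celadon and buncheong confusion discussed above.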
These compositional relationships were further visualized by analyzing PCA scatter plots. PCA achieves dimensionality reduction by projecting high-dimensional data onto orthogonal axes that capture the maximum variance53. In this study, the first two principal components (PC1 and PC2) explained 49.73% and 23.40% of the variance, respectively, accounting for 73.13% of the total variation. The two-dimensional PCA plot in Fig. 3 reveals that white porcelain samples formed a compact cluster in the positive PC1 direction, indicating a distinct chemical composition (namely, lower levels of coloring oxides such as Fe2O3 and TiO2). In contrast, celadon and buncheong substantially overlapped in the negative PC1 region, reflecting their compositional similarity.
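The relation between the reported percentages and the underlying eigenvalues can be sketched as below. The example assumes ten standardized features, for which the eigenvalues of the correlation matrix sum to ten; this preprocessing is an assumption, not a detail confirmed in the text. Only the first two eigenvalues are set to reproduce the reported 49.73% and 23.40%; the remaining eight are invented placeholders.

```python
def explained_variance_ratio(eigenvalues):
    """Share of total variance carried by each principal component."""
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]


def cumulative(values):
    """Running total of a sequence."""
    out, running = [], 0.0
    for value in values:
        running += value
        out.append(running)
    return out


# Hypothetical eigenvalues for ten standardized features. Only the first two
# are set to match the reported 49.73% and 23.40%; the rest are invented.
eigenvalues = [4.973, 2.340, 0.900, 0.600, 0.450,
               0.300, 0.200, 0.120, 0.080, 0.037]

ratios = explained_variance_ratio(eigenvalues)
cumulative_ratios = cumulative(ratios)
print(f"PC1 = {ratios[0]:.2%}, PC2 = {ratios[1]:.2%}, "
      f"PC1+PC2 = {cumulative_ratios[1]:.2%}")
```

The roughly 27% of variance outside PC1 and PC2 is not visible in a two-dimensional plot, which is one reason why overlap in the plot does not by itself imply that two classes are inseparable in the full feature space.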

The biplot displays the top five contributing variables with the highest loadings, indicating their influence on sample separation.
These PCA results corroborate the findings based on confusion matrices and underscore the chemical and technological continuity between celadon and buncheong. The blurred compositional boundary between these two ceramic types reflects historical production practices, emphasizing the value of combining statistical analysis with domain-specific knowledge in order to interpret classification outcomes.
Feature importance analysis
To investigate how chemical composition influences the classification of ceramic types, feature importance was evaluated using SHAP values across all ML models (Fig. 4). SHAP quantifies the contribution of each feature to individual model predictions, providing both global and local interpretability within a unified, model-agnostic framework. However, direct numerical comparison of SHAP values across models with different algorithmic structures and feature scaling should be approached with caution. Accordingly, the following analysis focuses on identifying consistent trends in feature importance and interpreting these patterns in conjunction with archeological and technological contexts, rather than making direct quantitative comparisons between models.
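The idea underlying SHAP can be illustrated with an exact Shapley computation on a toy scoring function. The sketch below is a simplified single-reference variant in which a feature that is "absent" takes a fixed baseline value; it is not the SHAP library and not the procedure used in this study, and the scoring function, sample, and baseline are hypothetical.

```python
from itertools import permutations


def shapley_values(predict, sample, baseline):
    """Exact Shapley values for one sample against a single reference point.

    A feature "absent" from a coalition takes its baseline value. The
    contribution of each feature is averaged over every ordering in which
    features can be switched from the baseline to their actual value.
    """
    n = len(sample)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for j in order:
            current[j] = sample[j]
            updated = predict(current)
            totals[j] += updated - previous
            previous = updated
    return [total / len(orderings) for total in totals]


def toy_score(x):
    """Hypothetical score with an interaction between features 0 and 1."""
    return 2.0 * x[0] + 1.0 * x[1] + 0.5 * x[0] * x[1] - 0.2 * x[2]


sample = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_score, sample, baseline)
print(phi)
print(sum(phi), toy_score(sample) - toy_score(baseline))
```

The contributions sum to the difference between the prediction for the sample and the prediction for the baseline, and the interaction term is split equally between the two features involved. Because the values are expressed in the output units of each model, their magnitudes are not directly comparable across models, which is the reason for the caution noted above.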

The y-axis shows mean absolute SHAP values; importance values were quantified using SHAP analysis.
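As a minimal sketch of how such a global ranking is obtained, the following code averages the absolute SHAP values of each feature over all samples and sorts the features accordingly. The SHAP values and the function name are hypothetical and serve only to show the aggregation step.

```python
def mean_abs_importance(shap_rows, feature_names):
    """Rank features by the mean absolute SHAP value across samples."""
    n_samples = len(shap_rows)
    scores = {
        name: sum(abs(row[j]) for row in shap_rows) / n_samples
        for j, name in enumerate(feature_names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical SHAP values (rows = samples, columns = features).
features = ["TiO2", "Fe2O3", "SiO2"]
shap_rows = [
    [0.30, -0.10, 0.02],
    [-0.25, 0.15, -0.01],
    [0.35, -0.05, 0.03],
    [-0.30, 0.10, -0.02],
]

ranking = mean_abs_importance(shap_rows, features)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Taking the absolute value before averaging prevents positive and negative contributions from cancelling, so a feature that pushes some samples toward a class and others away from it still registers as important.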
Among the six models investigated, the tree-based algorithms (DT, RF, and XGB) demonstrated highly consistent feature importance patterns. These models consistently identified TiO2 as the most important variable, with Fe2O3 ranking relatively high, while other oxides contributed to a much lesser degree. This pattern was particularly pronounced in the ensemble models, RF and XGB. This observation aligns with established knowledge of ceramic coloration mechanisms, which considers the concentrations and oxidation states of TiO2 and Fe2O3 as key determinants of body color54.
Unlike TiO2, which generally yields a yellowish tone under both oxidizing and reducing conditions, the color of Fe2O3 shifts from yellowish under oxidizing conditions to bluish-green under reducing conditions due to the partial reduction of Fe3+ to Fe2+. These oxides are therefore closely related to the visual appearance of ceramics, such as the bluish-green hue of celadon and the brownish tone of buncheong ware. In contrast, white porcelain was produced using highly refined clays with minimal amounts of these coloring oxides, resulting in its bright white appearance28. The low TiO2 content appears to be critical not only for glaze coloration—through redox interactions with iron55,56—but also for the body composition itself, as such TiO2-deficient materials are geologically rare yet essential for achieving the characteristic whiteness of high-quality porcelain across East Asian ceramic traditions. The superior classification performance of the RF and XGB models may thus reflect their ability to effectively capture these nonlinear relationships among elemental composition, firing atmosphere, and technological traditions associated with each ceramic type.
In comparison, the SVM model, which ranked third in classification accuracy, displayed a distinct feature importance pattern. While TiO2 and SiO2 contents emerged as significant features, Fe2O3 was assigned relatively low importance; this likely limited the effectiveness of the model in differentiating celadon and buncheong, in which Fe2O3 plays a crucial role.
Interestingly, PCA–LDA and DT, despite their identical test accuracy (85.0%), exhibited markedly different feature importance profiles. DT placed strong emphasis on TiO2, while PCA–LDA highlighted not only TiO2 but also SiO2 and Al2O3 as primary features. This tendency reflects the limitations of single models and the constraints of linear models in capturing the nonlinear relationships inherent in ceramic compositional data, particularly when the number of features is limited.
KNN also utilized TiO2 as an important component for classification, but underestimated Fe2O3, which was effectively leveraged by the two highest-performing models (RF and XGB). This difference suggests that the reliance of KNN on geometric proximity in feature space may cause it to overlook subtle chemical effects that tree-based models can capture more effectively.
To further verify the robustness of the interpretability results, SHAP heatmap analyses were performed across ten independent cross-validation folds for both the RF model, which exhibited the highest classification performance, and the PCA-LDA model, which showed the lowest accuracy (Fig. S3). The visualizations revealed that the key discriminative features, TiO2 and Fe2O3, consistently ranked first and second across all folds in the RF model, indicating a high degree of stability in their contributions to model predictions. However, minor fold-dependent fluctuations were observed among lower-ranking features, reflecting slight variations in model sensitivity to secondary compositional effects. In contrast, the PCA-LDA model showed greater variability overall, with moderate fluctuations even among the top-ranking features: the relative importance of TiO2 and Al2O3 occasionally shifted between the first and second ranks across folds. This comparison highlights the greater robustness of tree-based ensemble models in capturing nonlinear composition–class relationships compared to linear models such as PCA-LDA, which are more sensitive to sampling variation and feature intercorrelations.
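One simple way to summarize this kind of fold-to-fold stability is to count how often each feature occupies a given rank. The sketch below does so for two sets of invented per-fold importance values (three folds each, for brevity); the numbers are not those behind Fig. S3 and the function name is hypothetical.

```python
from collections import Counter


def rank_counts(fold_importances, rank=1):
    """Count how often each feature occupies the given rank across folds."""
    counts = Counter()
    for importances in fold_importances:
        ordered = sorted(importances, key=importances.get, reverse=True)
        counts[ordered[rank - 1]] += 1
    return counts


# Hypothetical per-fold mean |SHAP| values; not the values behind Fig. S3.
stable_model = [
    {"TiO2": 0.30, "Fe2O3": 0.12, "Al2O3": 0.04},
    {"TiO2": 0.28, "Fe2O3": 0.14, "Al2O3": 0.05},
    {"TiO2": 0.31, "Fe2O3": 0.11, "Al2O3": 0.03},
]
unstable_model = [
    {"TiO2": 0.20, "Fe2O3": 0.05, "Al2O3": 0.18},
    {"TiO2": 0.17, "Fe2O3": 0.06, "Al2O3": 0.19},
    {"TiO2": 0.21, "Fe2O3": 0.04, "Al2O3": 0.16},
]

print("stable top feature:", dict(rank_counts(stable_model)))
print("unstable top feature:", dict(rank_counts(unstable_model)))
```

A feature that holds the same rank in every fold yields a single-entry count, whereas rank swaps such as those described for TiO2 and Al2O3 appear as counts split between features.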
In summary, the SHAP-based feature importance analysis identified the chemical oxides that drive classification decisions in each model, and confirmed that these ML-derived patterns align well with established knowledge in materials science. TiO2 and Fe2O3 were consistently recognized as key features for white porcelain identification, while TiO2 also contributed significantly to celadon classification, likely reflecting its synergistic role with Fe2O3 in producing the greenish hue of the ceramic under reducing conditions5. Moreover, the combination of higher Fe2O3 contents and intermediate TiO2 levels served as a critical marker for buncheong, consistent with its darker body color and intermediate position between celadon and white porcelain2. These findings demonstrate that SHAP analysis can enhance the model transparency and facilitate the connection between ML outputs and archeological as well as technological interpretations, thereby strengthening the interpretive value of AI in heritage science.
Analysis of feature impact distribution plots
SHAP beeswarm plots (Fig. 5) were analyzed to investigate both the relative importance and the directional influence of individual features on ceramic classification. These plots clearly show how changes in chemical composition affect the probability of a sample being classified into a particular ceramic type. The analysis focused on comparing the RF with the PCA–LDA model. SHAP beeswarm plots for the remaining models are displayed in Fig. S4.

Each point represents a test sample, colored by feature value (blue: low, red: high). The x-axis shows SHAP values, indicating both the direction (positive/negative) and the strength of the influence of each feature on the classification. Only the five most influential features are shown, ranked by impact magnitude.
In these plots, the SHAP values on the x-axis indicate how each feature influences classification decisions: positive/negative values reflect an increased/decreased probability of assignment to a specific ceramic type, and the absolute magnitude of the SHAP value represents the strength of the contribution of a feature to the classification outcome.
In the RF model, TiO2 emerged as the most influential feature for celadon classification. A high TiO2 content consistently showed positive contributions, indicating that elevated TiO2 concentrations increased the probability of celadon classification; conversely, low TiO2 levels contributed negatively. Fe2O3 made the second-highest contribution to celadon classification: low Fe2O3 levels contributed negatively, whereas intermediate levels contributed positively.
However, in the buncheong classification, higher Fe2O3 concentrations exhibited a strong association with increased classification probability. This contrasting behavior suggests that Fe2O3 content serves as a hierarchical classification marker: high concentrations favor buncheong, whereas low concentrations are indicative of white porcelain, and intermediate values are more characteristic of celadon. TiO2 also functioned as a hierarchical classification marker, with intermediate values positively contributing to buncheong classification, high values to celadon, and low values to white porcelain. P2O5 and Na2O, while not appearing among the top-ranked features in white porcelain classification, exhibited contrasting concentration distributions between celadon and buncheong, indicating their partial contribution to differentiating these two ceramic types. The inverse relationship observed in the SHAP distributions underscores the role of P2O5 and Na2O as complementary features that distinguish these two stylistically similar ceramic types.
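A non-monotonic pattern of this kind, in which low, intermediate, and high concentrations point toward different classes, can be summarized numerically by averaging SHAP values within bins of the feature value. The sketch below uses equal-count bins and invented Fe2O3 contents and SHAP values for the celadon class; the sign of the high-Fe2O3 bin in particular is an assumption made for illustration, and the function name is hypothetical.

```python
def binned_mean_shap(feature_values, shap_values, n_bins=3):
    """Average SHAP value within equal-count bins of the feature value.

    Samples are sorted by feature value and split into n_bins groups
    (low to high), so non-monotonic effects remain visible.
    """
    if len(feature_values) < n_bins:
        raise ValueError("need at least one sample per bin")
    pairs = sorted(zip(feature_values, shap_values))
    size = len(pairs) // n_bins
    means = []
    for b in range(n_bins):
        start = b * size
        # The last bin absorbs any remainder.
        stop = (b + 1) * size if b < n_bins - 1 else len(pairs)
        chunk = [shap for _, shap in pairs[start:stop]]
        means.append(sum(chunk) / len(chunk))
    return means


# Hypothetical Fe2O3 contents (wt%) and SHAP values for the celadon class.
fe2o3 = [0.6, 0.8, 1.0, 1.8, 2.0, 2.2, 3.0, 3.3, 3.6]
shap_celadon = [-0.20, -0.18, -0.15, 0.12, 0.15, 0.10, -0.05, -0.08, -0.06]

low, mid, high = binned_mean_shap(fe2o3, shap_celadon)
print(f"low={low:+.3f}, mid={mid:+.3f}, high={high:+.3f}")
```

A single correlation coefficient between feature value and SHAP value would average out such a pattern, which is why a binned summary (or the beeswarm plot itself) is better suited to features that act as hierarchical markers.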
In the case of white porcelain, the SHAP distributions for TiO2 and Fe2O3 showed clear bimodal patterns, with low oxide concentrations contributing strongly and positively to the white porcelain classification. This dichotomous pattern reflects the well-known use of high-purity white clay in white porcelain production, where the removal of coloring oxides such as TiO2 and Fe2O3 enhances whiteness. These findings corroborate the classification rules learned by RF models and align closely with established knowledge on production techniques of Joseon white porcelain28,55.
In contrast, the PCA–LDA model showed different feature contribution patterns, placing greater importance on Al2O3 and SiO2 rather than key coloring oxides such as TiO2 and Fe2O3. As illustrated in Fig. 5 (right), these coloring oxides showed relatively low importance compared to RF and lacked clear directional separation. This suggests that PCA–LDA, due to its linear structure, was less effective at capturing the compositional characteristics most relevant to ceramic classification.
Taken together, the SHAP beeswarm plots provide a detailed explanation of how the elemental composition influences the ceramic classification. In addition to validating the predictive mechanisms of high-performing models such as RF, they expose the interpretive limitations of lower-performing algorithms. They also demonstrate the potential of SHAP analysis to link ML outputs with archaeometric knowledge, providing explainable and data-driven insights into ceramic production technologies.
External test dataset validation
To assess the generalizability of the models, an external dataset that was not used for training or internal validation was evaluated (Table 3). The RF model achieved the highest accuracy (93.2%), followed by XGB (91.5%) and SVM (91.5%), consistent with the internal test results and indicating robust generalization capability across independently sourced samples.
The confusion matrices (Fig. 6) provide a detailed view of class-specific performance. The tree-based models, RF and XGB, demonstrated strong performance in classifying white porcelain, whereas PCA-LDA and SVM, which construct high-dimensional decision boundaries, performed better for celadon and buncheong but were weaker for white porcelain. These differences highlight the methodological distinctions inherent in each model.

Each matrix shows the classification results for celadon (C), buncheong (B), and white porcelain (W), with both raw counts and normalized percentages indicating the classification accuracy for each true class.
Importantly, the overall classification patterns observed in the external validation were consistent with the SHAP-based feature importance trends identified in the internal analyses, further supporting the reliability of the compositional relationships captured by the models. It should be noted that the external dataset was derived from a limited number of studies and may be subject to regional bias; therefore, caution is warranted when extrapolating these findings to broader populations or drawing general conclusions regarding ceramic classification trends.
Comparison with previous studies
Recent studies have increasingly applied ML to the classification of ancient ceramics, particularly using chemical composition data. Among these investigations, Sun et al. developed a classification model for ancient Chinese celadon using XRF compositional data and four ML algorithms (RF, Adaboost, KNN, and SVM), combined with Mahalanobis distance analysis for post-classification verification13. Their dataset comprised more than 1000 celadon shards from 18 kiln sites, categorized into eight types. Despite using a relatively large dataset, inherent class imbalance was present among the data, due to varying sample availability from each kiln. To enhance the model reliability, the authors employed both leave-one-out cross-validation (LOOCV) and repeated 10-fold cross-validation, achieving a highest average accuracy of 96.41% with RF and a kappa coefficient of 0.985. The study focused on the classification accuracy and characteristic chemical parameters, and model interpretation was mainly based on global feature importance.
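Since the kappa coefficient is quoted alongside accuracy, a brief sketch of how Cohen's kappa is obtained from a confusion matrix may be useful. The matrix below is hypothetical and is not data from Sun et al. or from the present study; the function name is likewise illustrative.

```python
def cohen_kappa(cm):
    """Cohen's kappa from a square confusion matrix (rows = true classes)."""
    n = sum(sum(row) for row in cm)
    k = len(cm)
    observed = sum(cm[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in cm]
    col_totals = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance from the marginal class frequencies.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (observed - expected) / (1.0 - expected)


# Hypothetical three-class matrix; not data from Sun et al. or this study.
matrix = [
    [37, 3, 0],
    [2, 38, 0],
    [0, 0, 40],
]
print(f"kappa = {cohen_kappa(matrix):.4f}")
```

Kappa discounts the agreement expected by chance from the class frequencies, which makes it a useful complement to accuracy when the classes are imbalanced, as in the dataset described above.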
Qi et al. applied RF modeling to classify ceramics from six chronological periods, including modern, Qing, Song, Jin, Yuan, and Tang dynasties, based on LIBS spectral data57. Despite using only 35 samples, they optimized the model performance through advanced pre-processing, variable selection based on out-of-bag (OOB) error, and sensitivity/specificity trade-offs. The final RF model, which was optimized using variable importance thresholds and OOB error minimization, achieved an accuracy of 94.33% on the test set. Similar to Sun et al., their focus was on classification accuracy, and performance metrics such as sensitivity, specificity, and OOB error were used for evaluation. However, class imbalance was inevitable due to the limited sample availability per period, and the model interpretation remained confined to overall variable importance.
The present study shares a common methodological foundation with previous research, involving the application of ML models to classify ceramics based on chemical composition data. Similar to the investigations by Sun et al. and Qi et al., supervised learning approaches were employed in this study, and the RF model consistently demonstrated high classification performance, achieving an accuracy of 95.8% on the test set.
Despite this methodological continuity, there are differences in terms of classification objectives and data structure. Previous studies primarily focused on celadon types or chronological periods, whereas the present work targets typological classification among three major Korean ceramic types (celadon, buncheong, and white porcelain), addressing broader stylistic and technological variations. This study employed a balanced dataset with equal representation of each ceramic type, enabling a more controlled assessment of model performance in a multiclass setting compared to prior studies based on imbalanced datasets of excavated samples.
Differences also exist in the validation strategies applied. Sun et al. employed leave-one-out and repeated ten-fold cross-validation, primarily to improve the model reliability and monitor the sample-level classification stability13, while Qi et al. applied OOB error estimation and train/test splits alongside variable selection to optimize the classification performance57. The present study combined an initial train/test split to enable independent performance evaluation with stratified ten-fold cross-validation within the training set, maintaining the class proportions unchanged during model optimization. Additionally, the generalizability of the model beyond the original dataset was evaluated using an external dataset from independent studies. This approach enables both external performance verification and robust internal validation in a balanced, multiclass classification context.
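The stratification step can be sketched as follows. This is a minimal standard-library illustration of the general technique, not the implementation used in this study; the function name, the seed, and the assumed training set of 30 samples per class are hypothetical. It is meant to be applied to the training portion only, after the test set has been held out.

```python
import random
from collections import Counter, defaultdict


def stratified_kfold(labels, k, seed=0):
    """Split sample indices into k folds with near-equal class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Hypothetical balanced training labels: 30 samples per ceramic type.
labels = ["C"] * 30 + ["B"] * 30 + ["W"] * 30
folds = stratified_kfold(labels, k=10)

for i, held_out in enumerate(folds):
    # Each fold serves once as validation; the remaining folds are used to fit.
    counts = Counter(labels[j] for j in held_out)
    print(f"fold {i}: {dict(counts)}")
```

Because every fold preserves the class proportions, the validation score of each fold is computed on the same class mix as the full training set, which reduces the fold-to-fold variance caused by uneven class representation.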
Another difference lies in the model interpretability. Previous research predominantly relied on overall variable importance to interpret classification outcomes. This study further incorporates both SHAP value analysis and SHAP beeswarm plots. The SHAP values quantify the global and class-specific importance of each chemical feature, while the beeswarm plots further illustrate the direction and magnitude of the contribution of each feature at the individual sample level. This combined approach provides a more comprehensive understanding of how the chemical composition influences classification decisions and enables a clearer connection to be established between ML predictions and archeological interpretations.
Recent advances in materials and heritage science have demonstrated the growing potential of both automated and deep learning frameworks for ceramic classification. Dunn et al.58 showed that automated frameworks such as Automatminer achieve excellent performance across diverse materials-prediction tasks, while classical ML remains competitive for smaller datasets (typically <10^4 samples). Capriotti et al.59 reported over 90% accuracy in classifying petrographic thin-section images using CNNs and Vision Transformers. Additionally, Qi et al.60 applied a fully connected neural network (FCN) to XRF data for kiln-based ceramic classification, achieving an accuracy of about 93% and demonstrating the growing applicability of deep learning in archaeometric research. These studies highlight the rapid methodological progress in the field and suggest that automated and deep learning frameworks hold great promise for future archaeometric applications, particularly as datasets become larger, multimodal, and more standardized.
In summary, both previous studies and the present work consistently demonstrate the effectiveness of tree-based models, particularly RF, in ceramic classification based on chemical composition data. These studies also emphasize the role of validation strategies in preventing overfitting and improving generalizability, as well as the value of interpretability techniques in linking model outputs to archeological knowledge.
In conclusion, this study demonstrates the potential of ML algorithms for classifying traditional Korean ceramics (celadon, buncheong, and white porcelain) using XRF-derived chemical composition data. Among the six models evaluated, tree-based ensemble methods such as RF and XGB showed the highest accuracy (both 95.8%), followed by SVM, while PCA–LDA exhibited the lowest performance, highlighting the advantages of nonlinear models for this classification task. Analyses using confusion matrices and SHAP values revealed patterns consistent with traditional ceramic typologies and production knowledge. Celadon and buncheong were frequently misclassified due to their shared compositional and technical characteristics, whereas white porcelain was clearly distinguished based on its low TiO2 and Fe2O3 contents. SHAP-based interpretation further demonstrated positive associations between TiO2 content and celadon classification, and between Fe2O3 and buncheong classification. Moreover, contrasting distribution patterns of P2O5 and Na2O were observed between these two ceramic types. In contrast, PCA–LDA assigned greater importance to SiO2 and Al2O3, which contributed less effectively to accurate classification. External validation further confirmed these trends, with the RF, XGB, and SVM models achieving high accuracy using an external dataset, in line with SHAP-based feature importance patterns.
These findings highlight the potential of explainable ML for the classification of traditional Korean ceramics, a task that has historically relied on expert judgment. The developed models achieved accurate and interpretable classification results, providing quantitative insights into the material and technological characteristics underlying celadon, buncheong, and white porcelain. Importantly, the application of compositional data treatment and independent validation confirmed the robustness and reproducibility of the analytical framework. Nevertheless, several limitations should be acknowledged. The current dataset has limited geographical and chronological coverage, which may constrain the model’s generalizability to ceramics from underrepresented kiln sites or production periods. Furthermore, additional factors—such as firing temperature, kiln atmosphere, and regional variability in clay and glaze sources—could contribute to compositional variability and influence classification performance. Expanding the dataset to include more diverse archeological contexts and integrating complementary physicochemical parameters will be essential for further improving model generalization and interpretability.
Although this study focuses on clay body components and is particularly suited for sherds or damaged artifacts, the proposed framework exhibits broader applicability for archeological research. The integration of explainable ML can be further extended to investigate ceramic typologies, kiln production systems, regional distribution patterns, and potentially other cultural heritage materials, contributing to the development of more objective and data-driven approaches in archaeometry.
