Deep learning for tooth detection and segmentation in panoramic radiographs: a systematic review and meta-analysis | BMC Oral Health


Artificial intelligence-based models are increasingly being explored in dentistry to enhance diagnostic accuracy, particularly with orthopantomographies, which are fundamental for pathology identification. Nevertheless, these images frequently present superimpositions and deformations that pose unavoidable challenges when training neural networks for tasks such as object detection (OD) and object segmentation (OS). Thus, although human diagnostic capability may still surpass computer-based approaches, deep learning models are regarded as a potential means of refining diagnosis within these inherent limitations, especially for inexperienced operators. However, the diagnostic power of these models should be interpreted with caution, as biases and performance variability across different architectures and datasets can impact their generalizability [9].

When referring to image interpretation and Computer Vision (CV), it is crucial to discern between the different tasks involved in object identification. In this regard, OD and OS are the predominant terms used in contemporary literature, with OS further categorized into Semantic Segmentation (SS) and Instance Segmentation (IS). SS assigns a category label to every pixel (e.g., dental structures are delineated individually but all labeled simply as "tooth"), while IS additionally distinguishes the individual instances within a category (e.g., each delineated structure is labeled as "tooth" and also assigned its specific tooth number) [1].
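The SS/IS distinction can be sketched with a toy one-dimensional mask, purely as an illustration (the pixel layout and the FDI-style tooth numbers 11 and 21 below are hypothetical, not taken from any included study):

```python
# Illustrative 8-pixel "radiograph" containing two teeth separated by background (0).

# Semantic segmentation: every tooth pixel receives the same class label (1 = "tooth").
semantic_mask = [0, 1, 1, 0, 0, 1, 1, 0]

# Instance segmentation: each tooth pixel additionally carries an identity,
# here hypothetical FDI tooth numbers (11 and 21).
instance_mask = [0, 11, 11, 0, 0, 21, 21, 0]

# The semantic view collapses everything into a single class...
semantic_classes = {p for p in semantic_mask if p > 0}
# ...while the instance view keeps the two teeth apart.
instances = {p for p in instance_mask if p > 0}

print(len(semantic_classes))  # 1 — one category ("tooth")
print(len(instances))         # 2 — two individual teeth
```

The point is that both views delineate the same pixels; only IS preserves which tooth each pixel belongs to, which is what tooth-numbering (labeling) tasks require.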

Out of the twenty studies included in the present review, only two did not evaluate OD performance [18, 19]. Among those that did, four addressed tooth labeling but did not specify performance metrics for this task [16, 22,23,24]. Most studies employing both OD and labeling methodologies utilized two-stage CNNs, except for three studies that used a single-stage CNN [16, 22, 24]. OS methodologies varied: one study utilized SS [18], three used IS [9, 19, 25], and another incorporated both [4] to enhance the CNN's diagnostic performance. Two further studies also combined both approaches, but did not report results [17, 26].

Establishing a reliable ground truth (GT) is crucial for evaluating deep learning models. Seven studies reported having multiple operators perform manual annotation and labeling [15, 24, 25, 27,28,29,30], while eight relied on a single clinician. Five did not specify the number of practitioners involved. Operator experience was often unreported; where stated, it ranged from 5 to 30 years. Notably, studies focused on mesiodens detection with a single operator utilized Cone Beam Computed Tomography (CBCT) as GT [21, 26, 31, 32], which may introduce bias.

Sample dataset division is another key factor in model evaluation. Eleven studies correctly followed the recommended protocol of creating three independent sets of images for training (TrS), validation (VS), and testing (TeS) [4, 9, 19, 21,22,23,24,25, 29,30,31]. However, eight studies did not follow this protocol [16,17,18, 20, 26,27,28, 32], and one only included a TeS, as it analyzed the diagnostic performance of a pre-trained and validated CNN [15]. Only one study used a publicly available dataset [4]. Concerns about generalizability were raised for studies that did not include different world populations, nor image sets from different institutions or acquired with various X-ray machines. Indeed, only four studies reported using assorted testing sets [9, 18, 22, 25]. Nonetheless, studies with limited datasets attempted to reduce overfitting by performing cross-validation [26, 28, 31] or implementing data augmentation [4, 9, 17, 19, 22, 26, 27, 31, 32].
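The recommended three-set protocol amounts to one shuffle and two non-overlapping cuts. A minimal sketch is shown below, assuming hypothetical file names and a 70/15/15 split (the included studies used various proportions):

```python
import random

def three_way_split(items, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle once, then cut into three independent, non-overlapping sets:
    training (TrS), validation (VS), and testing (TeS)."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = items[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Hypothetical dataset of 100 panoramic radiographs.
radiographs = [f"opg_{i:03d}.png" for i in range(100)]
train, val, test = three_way_split(radiographs)
print(len(train), len(val), len(test))  # 70 15 15
```

The critical property, which several included studies violated, is that the three sets share no images: the test set must never influence training or hyperparameter tuning.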

Metrics used to evaluate the performance of DL models vary across the included studies, with precision, recall, and F1 score being the most frequently reported. Other pixel-based metrics, such as IoU, were also included. However, performance should be interpreted cautiously due to variability in dataset quality and study design. Vinayahalingam et al. published exceptional results for OD and OS, with OD precision of 0.997, recall of 0.989, and F1 score of 0.992. While OS also achieved strong results, this study had notable limitations, since blurred or incomplete OPGs were excluded from the dataset [25]. Similarly, Choi et al. achieved impressive average precision and recall of 0.991 and 0.996, respectively, but the exclusion of images with primary and mixed dentition, impacted teeth, or partially edentulous patients limits the generalizability of their results and introduces substantial bias [22].
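For reference, these metrics follow directly from detection counts and box overlap. A brief sketch (the counts of 989 true positives, 3 false positives, and 11 false negatives are hypothetical values chosen only to illustrate the arithmetic, not taken from any included study):

```python
def precision_recall_f1(tp, fp, fn):
    """Standard detection metrics from true-positive, false-positive,
    and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def box_iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

p, r, f1 = precision_recall_f1(tp=989, fp=3, fn=11)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.997 0.989 0.993

# Two boxes overlapping over half of each: IoU = 50 / 150 = 1/3.
print(round(box_iou((0, 0, 10, 10), (5, 0, 15, 10)), 3))  # 0.333
```

Because F1 is the harmonic mean of precision and recall, a model can only report a high F1 when both components are high, which is why it is often preferred as a single summary figure.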

Inconsistencies in reporting were evident when comparing studies evaluating similar tasks. When evaluating the same NN, Tuzoff et al. reported high sensitivity and precision for OD (0.9941 and 0.9945, respectively), while Bonfanti-Gris et al. reported a lower sensitivity value for this same task (0.693). Similarly, discrepancies were found for the labeling task, with Tuzoff et al. reporting sensitivity and specificity of 0.980 and 0.999 and Bonfanti-Gris et al. reporting 0.500 for both [15, 27]. Other studies also reported poor results. Yüksel et al. observed a mean average precision of 0.477 for object detection evaluated over IoU thresholds of 0.5–0.95; only when the threshold was fixed at 0.5 did the model reach a maximum precision of 0.894 [17]. Nevertheless, contrary to what Bonfanti-Gris et al. found with a reduced dataset, Leite et al. obtained strong results both for OD and OS (S = 0.989, P = 0.996 and P = 0.958, R = 0.975, IoU = 0.936 and F1 score = 0.966, respectively) [9]. A threshold effect was thus observed but not explicitly discussed, leaving a gap that should be considered when applying these models in clinical settings, as varying decision thresholds may significantly impact both sensitivity and specificity.
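The threshold effect reported by Yüksel et al. can be made concrete: a detection only counts as a true positive if its IoU with a ground-truth box meets the chosen threshold, so the same set of detections yields very different precision figures at 0.5 versus stricter cut-offs. A minimal sketch, using hypothetical IoU values (not data from any included study):

```python
def precision_at_threshold(ious, threshold):
    """Precision over a set of detections, where each value in `ious` is a
    detection's IoU with its best-matching ground-truth box. A detection
    counts as a true positive only if that IoU meets the threshold."""
    tp = sum(1 for iou in ious if iou >= threshold)
    return tp / len(ious) if ious else 0.0

# Ten hypothetical detections: all roughly localize a tooth, but with
# varying tightness of fit to the ground-truth box.
detection_ious = [0.92, 0.88, 0.81, 0.77, 0.73, 0.69, 0.64, 0.58, 0.55, 0.51]

print(precision_at_threshold(detection_ious, 0.50))  # 1.0 — every detection passes
print(precision_at_threshold(detection_ious, 0.75))  # 0.4 — only tight boxes survive
print(precision_at_threshold(detection_ious, 0.90))  # 0.1
```

Metrics such as mAP@0.5:0.95 average precision over this whole range of thresholds, which is why they are systematically lower than mAP@0.5 and why reporting the threshold is essential for comparability.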

Results for reduced sample sizes without data augmentation techniques were also reported by Kilic et al., who achieved S = 0.9804, P = 0.9571, and F1 score = 0.9686 for object detection and labeling [23]. Estai et al. also reported favorable outcomes for OD (P = 0.992 and R = 0.994) and labeling tasks (P, R and F1 score = 0.980, E = 0.999 and A = 0.998), using three different NNs for each. Nevertheless, the use of two image sets instead of three may have introduced a risk of bias [28].

In contrast, both Bilgir et al. and Kaya et al. reported remarkable results using a single CNN for OD. In the first case, the authors reported high sensitivity, precision, F1 score, False Discovery Rate, and False Negative Rate, while Kaya et al. demonstrated excellent results with mAP metrics [20, 24].

Two distinct DL approaches were explored within the included studies. Mahdi et al. utilized an optimization technique based on transfer learning, showcasing positive results with CNNs such as ResNet-101 and ResNet-105. Chandrashekar et al., instead, introduced a collaborative learning approach in which two DL models were integrated to obtain better results. The authors compared the studied CNNs' performance metrics both individually and in collaboration, obtaining, for both OD and OS, higher accuracy, F1 score, and mAP with the latter (> 0.973).

Finally, even studies focusing solely on OS tasks presented varying results. Sheng et al. reported an accuracy of 0.885, a mean IoU of 0.468, and an F1 score of 0.637 [18]. Lee et al., by contrast, achieved better performance metrics while using a significantly smaller dataset with data augmentation techniques: IoU = 0.877, F1 score = 0.875, P = 0.858, and R = 0.893 [19].

When comparing the results obtained from different neural networks, the depth of the CNN should be taken into consideration, as this can affect model performance. It has been reported that deeper architectures improve accuracy but risk overfitting – especially in small datasets [4, 18]. Although data augmentation techniques might mitigate this, increasing model complexity does not always yield proportional accuracy improvements. Also, while some architectures may perform well with specific dataset sizes, others may suffer from overfitting or diminishing returns [18]. Thus, the absence of a standardized framework for selecting optimal depth and learning parameters limits the comparability and reproducibility of results [33].

Deep learning OD and OS models have also been reported to accurately perform impacted-tooth identification. This systematic review identified six studies in which this objective was tackled by evaluating several CNNs' mesiodens identification and classification capacities. Overall, results were impressive: Dai et al. reported A, S, E, P, and mAP values of 0.94, 0.95, 0.93, 0.93, and 0.99, respectively [29]. Similarly, Ha et al. obtained outcomes ranging from 0.915 to 0.043 for A, S, and E with a dataset of comparable sample size [21].

When comparing different CNNs, Kuwada et al. observed that DetectNet outperformed AlexNet and VGG-16, with sensitivity, specificity, and accuracy values of 0.920, 1.000, and 0.960, respectively [30]. Other studies reported similar outcomes for architectures such as ResNet-18, ResNet-101, Inception-ResNet-V2, and SqueezeNet [31]. Variability in outcomes was detected by Aljabri et al. when analyzing four different DL models across experiments with two different sample sizes; overall, the worst results were observed for the VGG-16 architecture [32].

Kim et al. achieved outstanding results by employing a novel OS technique to restrict analysis to the maxillary anterior region, enhancing detection accuracy for mesiodens. However, the study excluded images with distortion and blurriness, so generalizability was not ensured [34].

DL models have also been employed in dentistry for detecting ectopic eruption of maxillary first molars [35] and classifying mandibular third molar positioning [36, 37]. Object segmentation automation is crucial for digital applications, especially in 3D imagery, where manual segmentation is labor-intensive and skill-dependent. This can be particularly relevant for treatment planning, addressing intra-operative complications and planning auto-transplantations [2].

Although AI-based applications have been widely studied, the clinical implications of DL models warrant further discussion. While models performed well in tooth detection and segmentation, practical challenges remain, such as the need for standardized training data, external validation, and regulatory approval before implementation in clinical practice. Additionally, model interpretability and clinicians' trust in AI-generated reports must be addressed.

Despite promising results, this systematic review has several limitations. First, including studies within a limited timeframe and focusing only on DL methods may be considered limitations in themselves.

Based on the reviewed data, future research should prioritize diverse and generalizable datasets, incorporate multicenter images to address overfitting and adopt standardized reference tests and reporting guidelines, such as STARD-AI and the CLAIM Checklist. These steps will enhance research quality, robustness and reliability in AI-based diagnostic tools for dentistry.

