PointMoment: a mixed-moment self-supervised learning approach for 3D Terracotta Warriors


This section presents a comprehensive evaluation of our proposed method. First, we elucidate our pre-training strategy and describe the datasets employed. Second, we assess the transferability of our approach through prevalent downstream tasks: object classification, part segmentation, and semantic segmentation. Third, to validate the efficacy of our loss function and parameter selection, we conduct thorough ablation studies. Finally, we demonstrate the practical utility of our method by applying it to the Terracotta Warriors dataset, showcasing its performance in a real-world scenario.

Pre-training

Dataset. For pre-training, we utilized the ShapeNet38 dataset, a comprehensive repository of synthetic 3D shapes comprising over 50,000 unique models across 55 common object categories. To ensure comparability, we adhered to the training protocol established by STRL24. This procedure involved randomly sampling 2048 points from each model in the dataset. Subsequently, we applied a series of data augmentation techniques, including random rotation, translation, scaling, clipping, and jittering, followed by normalization. These augmented samples were then fed into the network for pre-training, maintaining consistency with established methodologies in the field.
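The sketch below illustrates such an augmentation chain on a sampled point cloud of shape (2048, 3). It is only a minimal illustration: the parameter ranges, the crop heuristic, and the resampling step are assumptions and do not reproduce the exact STRL settings.

```python
import numpy as np

def augment(points, rng=None):
    """Illustrative augmentation chain for an (N, 3) point cloud.

    Mirrors the operations listed above (rotation, translation, scaling,
    clipping, jittering, normalization); parameter ranges are assumed.
    """
    if rng is None:
        rng = np.random.default_rng()
    n = len(points)

    # Random rotation about the up (z) axis
    theta = rng.uniform(0, 2 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    points = points @ rot.T

    # Random anisotropic scaling and translation
    points = points * rng.uniform(0.8, 1.25, size=(1, 3))
    points = points + rng.uniform(-0.1, 0.1, size=(1, 3))

    # Random clipping: keep points inside a slightly shrunken bounding box
    lo = points.min(0) + rng.uniform(0.0, 0.1, size=3)
    hi = points.max(0) - rng.uniform(0.0, 0.1, size=3)
    mask = np.all((points >= lo) & (points <= hi), axis=1)
    if mask.sum() > 128:                 # avoid degenerate crops
        points = points[mask]

    # Per-point jitter
    points = points + np.clip(0.01 * rng.standard_normal(points.shape), -0.05, 0.05)

    # Resample back to the original point count after cropping
    idx = rng.choice(len(points), n, replace=len(points) < n)
    points = points[idx]

    # Normalize to the unit sphere
    points = points - points.mean(0, keepdims=True)
    return points / np.max(np.linalg.norm(points, axis=1))
```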

Pre-training Details. For a fair comparison with existing methods, we adopt the same setup as STRL24, OCCO19, etc., using PointNet5 and DGCNN7 as point cloud feature extractors and a two-layer multilayer perceptron as the projection head that maps feature vectors into a 512-dimensional embedding space. We train the model end-to-end for 200 epochs using the Adam optimizer with a weight decay of 1 × 10−6 and an initial learning rate of 1 × 10−3, and we adjust the learning rate with a cosine-annealing decay schedule. The batch size is 16. After pre-training, we discard the projection head \(g_{\phi}(\cdot)\) and retain the feature extractor \(f_{\theta}(\cdot)\) for the downstream tasks.
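A minimal PyTorch sketch of this configuration is given below. Here `backbone` stands for PointNet or DGCNN (\(f_{\theta}\)), `loader` is assumed to yield two augmented views per sample with batch size 16, `feat_dim` is an assumed backbone output width, and `pointmoment_loss` is a placeholder for the PointMoment objective; none of these are taken from the original code.

```python
import torch
import torch.nn as nn

def pretrain(backbone, loader, pointmoment_loss,
             feat_dim=1024, embed_dim=512, epochs=200, device="cuda"):
    """Schematic pre-training loop matching the configuration described above."""
    # Two-layer MLP projection head g_phi mapping features to a 512-d embedding
    projector = nn.Sequential(
        nn.Linear(feat_dim, feat_dim), nn.BatchNorm1d(feat_dim), nn.ReLU(inplace=True),
        nn.Linear(feat_dim, embed_dim),
    ).to(device)

    params = list(backbone.parameters()) + list(projector.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-3, weight_decay=1e-6)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

    for _ in range(epochs):
        for x1, x2 in loader:                      # two augmented views per sample
            z1 = projector(backbone(x1.to(device)))
            z2 = projector(backbone(x2.to(device)))
            loss = pointmoment_loss(z1, z2)        # invariance + mixed-moment terms
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()

    # g_phi (the projector) is discarded; f_theta (the backbone) is kept.
    return backbone
```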

Downstream tasks

3D Object classification

The 3D object classification task involves categorizing point cloud data to identify the specific class of each point cloud. We assess the shape understanding and generalization of pre-trained models using two benchmark datasets: ScanObjectNN39 and ModelNet4040. ScanObjectNN, a challenging dataset of real-world 3D point clouds from indoor scenes, contains 2880 objects across 15 categories, of which 2304 are used for training and 576 for testing. ModelNet40, featuring synthetic objects, includes 12,311 CAD models across 40 categories, with 9843 for training and 2468 for testing, allowing us to evaluate classification performance on synthetic data.

Linear classification is a common protocol for evaluating the transferability and generalization ability of self-supervised models in classification tasks. We follow the standard protocols of ref. 24 and ref. 19 to test the accuracy of our network model in object classification. On each classification dataset, a linear Support Vector Machine (SVM) classifier is trained on features extracted from the training set by the pre-trained feature extractor, whose parameters remain fixed during this process. The trained SVM is then used to predict the classes of the 3D features extracted from the test set. This methodology is widely adopted in the field for assessing the effectiveness of learned feature representations in downstream classification tasks. For our experiments, we employ two commonly used backbone networks, PointNet and DGCNN, as feature extractors.
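The protocol can be summarized by the following sketch, which trains a linear SVM on features from the frozen backbone \(f_{\theta}\). The data loaders and the SVM regularization constant C are assumptions for illustration; only the frozen-feature evaluation itself follows the protocol described above.

```python
import numpy as np
import torch
from sklearn.svm import LinearSVC

@torch.no_grad()
def extract_features(backbone, loader, device="cuda"):
    """Extract global features with the frozen, pre-trained backbone f_theta."""
    backbone.eval()
    feats, labels = [], []
    for points, label in loader:
        feats.append(backbone(points.to(device)).cpu().numpy())
        labels.append(label.numpy())
    return np.concatenate(feats), np.concatenate(labels)

def linear_svm_eval(backbone, train_loader, test_loader):
    """Fit a linear SVM on frozen features and report test accuracy."""
    train_x, train_y = extract_features(backbone, train_loader)
    test_x, test_y = extract_features(backbone, test_loader)
    svm = LinearSVC(C=1.0)          # C = 1.0 is an assumed default, not tuned here
    svm.fit(train_x, train_y)
    return svm.score(test_x, test_y)
```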

Table 1 presents the accuracy of PointMoment for linear classification on ModelNet40. To conserve computational resources, we randomly chose one network branch for the high-order mixed moment constraint, achieving comparable results to using all branches, and adopted this single-branch constraint for subsequent experiments. Our method surpasses other state-of-the-art (SOTA) unsupervised and self-supervised algorithms when employing PointNet or DGCNN as the backbone network. Notably, our approach uses a basic network architecture without the complex features of STRL, such as asymmetric networks or gradient stopping. Specifically, our method outperforms STRL by 0.5% and 0.1% when using PointNet and DGCNN, respectively, highlighting the effectiveness of our approach. We assessed the generalizability of PointMoment in real-world scenarios by testing on ScanObjectNN with an SVM classifier.

Table 1 Comparison of the linear SVM classification on ModelNet40

Table 2 compares the linear classification accuracy of our method with other self-supervised methods on ScanObjectNN. Our method outperformed the previous SOTA approaches by 4.8% and 2.6% when using PointNet and DGCNN as feature extractors, respectively. This result underscores the generalizability of representations learned from synthetic data, confirming the effectiveness of our approach.

Table 2 Comparison of classification on ScanObjectNN

3D Object part segmentation

Object part segmentation, a complex and crucial task in 3D recognition, involves categorizing each point of an object into specific part classes, such as a table’s leg or a car’s tire. We conducted experiments using the ShapeNetPart41 dataset, which includes 16,991 objects across 16 categories with 50 distinct parts, ranging from 2 to 6 parts per object. As a benchmark, ShapeNetPart effectively measures object part segmentation performance. Following the approach of previous studies24,19, we pre-trained our model using DGCNN as the backbone network, followed by fine-tuning to enhance performance. Specifically, we conducted fine-tuning experiments on the ShapeNetPart dataset in an end-to-end manner. For the fine-tuning process, we employed SGD as the optimizer, with an initial learning rate of 0.1 and weight decay of 1 × 10−4. The momentum was set to 0.9, with a batch size of 8. The model was trained for 300 epochs. We selected mean Intersection over Union (mIoU) as the evaluation metric, given its precision and widespread use in the field.
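For reference, the sketch below computes per-shape IoU in the way commonly used for ShapeNetPart; we assume the usual convention that a part absent from both prediction and ground truth counts as IoU = 1, and that instance mIoU is the average over all test shapes.

```python
import numpy as np

def shape_miou(pred, target, part_ids):
    """Mean IoU for one shape over the part labels of its object category.

    pred, target: (N,) integer part labels for each point.
    part_ids: the part labels valid for this shape's category.
    """
    ious = []
    for p in part_ids:
        inter = np.sum((pred == p) & (target == p))
        union = np.sum((pred == p) | (target == p))
        # A part missing from both prediction and ground truth counts as 1.0
        ious.append(1.0 if union == 0 else inter / union)
    return float(np.mean(ious))

# Instance mIoU is the average of shape_miou over all test shapes.
```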

Table 3 compares segmentation outcomes for supervised learning methods and various self-supervised approaches on ShapeNetPart. Our pre-training approach offered better initial weights for DGCNN than those from random initialization by supervised learning, increasing mIoU by 0.3%. Additionally, our model outperformed the current SOTA self-supervised method by 0.3% in mIoU, indicating that our use of high-order mixed moments yields more discriminative and less redundant features. The visualization results are presented in Fig. 3. Our segmentation outcomes demonstrate a high degree of similarity to the ground truth, indicating that our method effectively captures fine-grained information within the point cloud. This close correspondence between our results and the actual segmentation underscores the capability of our approach to discern and represent detailed structural features in point cloud data.

Table 3 Part segmentation results on ShapeNetPart dataset
Fig. 3: The visualization of part segmentation results on ShapeNetPart.
figure 3

The first row is the ground truth, and the second row is our method.

3D semantic segmentation

Semantic segmentation is a challenging task that aims to assign a semantic label to each point in a point cloud, enabling the grouping of regions with meaningful significance. This task is particularly important in complex indoor and outdoor scenes, which are often characterized by substantial background noise. To evaluate the representational capacity and generalization capability of our model, we conducted semantic segmentation experiments on the Stanford Large-Scale 3D Indoor Spaces (S3DIS) dataset. S3DIS is a widely used 3D indoor scene dataset that comprises scanned data from 272 rooms across 6 areas, covering a total area of approximately 6000 square meters. The dataset defines 13 semantic categories and provides fine-grained, point-wise semantic labels, where each point is annotated with comprehensive 9-dimensional feature information, including spatial coordinates (XYZ), color attributes (RGB), and normalized positional coordinates.

In our experiments, we fine-tuned the pre-trained model on all areas except Area 5 (the largest region in the dataset) and evaluated it on Area 5. The backbone network of our model is PointNet. To ensure experimental fairness, we strictly adhered to the experimental protocol proposed by Qi et al.5 and Wang et al.7. Specifically, we divided each room into small blocks of 1 m × 1 m and randomly sampled 4096 points from each block as inputs to the model, using only geometric features (XYZ coordinates) for each point.
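The block-sampling step can be sketched as follows. The choice of a random block center and sampling with replacement for under-populated blocks are illustrative assumptions; only the 1 m × 1 m block size, the 4096-point budget, and the XYZ-only input follow the protocol described above.

```python
import numpy as np

def sample_block(room_xyz, block_size=1.0, num_points=4096, rng=None):
    """Pick a random 1 m x 1 m block from one room and sample 4096 XYZ points."""
    if rng is None:
        rng = np.random.default_rng()

    # Random block center in the horizontal plane, centered on an existing point
    center = room_xyz[rng.integers(len(room_xyz)), :2]
    half = block_size / 2.0
    mask = np.all(np.abs(room_xyz[:, :2] - center) <= half, axis=1)
    block = room_xyz[mask]

    # Sample exactly num_points points (with replacement if the block is small)
    idx = rng.choice(len(block), num_points, replace=len(block) < num_points)
    return block[idx]
```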

The experimental results are summarized in Table 4. Our method demonstrates significant performance improvements compared to existing supervised and self-supervised learning approaches. Specifically, it achieves a 0.2-point mIoU improvement over the Multi-view rendering method. Furthermore, by applying self-supervised pre-training on PointNet, our approach achieves a 1.5-point gain in mIoU compared to PointNet trained from scratch. These results clearly highlight the effectiveness of self-supervised pre-training in learning robust and transferable features, particularly in scenarios with limited labeled data.

Table 4 Semantic segmentation results on S3DIS dataset, evaluated on Area 5

Ablations and analysis

Impact of high-order mixed moment

To assess the impact of high-order mixed moments, we conducted ablation studies with three loss functions: i) an invariance-based loss as a baseline; ii) the baseline plus a second-order mixed moment loss to evaluate its effectiveness; iii) the second-order loss plus a third-order mixed moment loss to understand the third-order’s impact. Table 5 presents the comparative results. Relying only on the invariance-based loss led to an uneven feature distribution and reduced classification accuracy. Adding the second-order mixed moment mitigated model collapse and redundancy, significantly improving the model’s representational power for both PointNet and DGCNN across datasets. The third-order mixed moment further minimized redundancy. It increased classification accuracy by 0.8% for PointNet and 1.7% for DGCNN on ModelNet40 compared to using only the second-order moment. These findings underscore the importance of incorporating higher-order mixed moments.
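The sketch below shows one plausible way the three ablated objectives could be composed over a batch of embeddings: an invariance term between the two views, an off-diagonal second-order mixed-moment penalty, and a third-order mixed-moment penalty evaluated on a random subset of feature triples for tractability. The standardization, the triple sampling, and the \(\lambda\) weighting are all illustrative assumptions; the exact definitions and normalization are those given by the paper's equations.

```python
import torch

def pointmoment_losses(z, z_prime, lam=0.5, num_triples=4096):
    """Schematic composition of the ablated objectives (ii) and (iii).

    z, z_prime: (B, D) embeddings of two augmented views of the same batch.
    Returns the second-order variant and the second+third-order variant.
    """
    B, D = z.shape
    # Standardize each feature dimension over the batch
    z = (z - z.mean(0)) / (z.std(0) + 1e-6)
    z_prime = (z_prime - z_prime.mean(0)) / (z_prime.std(0) + 1e-6)

    # (i) invariance term: the two views of each sample should agree
    inv = ((z - z_prime) ** 2).sum(dim=1).mean()

    # (ii) second-order mixed moments: penalize off-diagonal correlations
    corr = (z.T @ z) / B                                   # (D, D)
    off_diag = corr - torch.diag(torch.diag(corr))
    second = (off_diag ** 2).sum() / (D * (D - 1))

    # (iii) third-order mixed moments over a random subset of distinct triples
    i, j, k = (torch.randint(0, D, (num_triples,)) for _ in range(3))
    keep = (i != j) & (j != k) & (i != k)
    m3 = (z[:, i[keep]] * z[:, j[keep]] * z[:, k[keep]]).mean(0)
    third = (m3 ** 2).mean()

    return inv + lam * second, inv + lam * (second + third)
```

Consistent with the single-branch constraint adopted earlier, the mixed-moment terms here are computed on one branch (z) only.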

Table 5 The accuracy of linear SVM classification using retrained embedding on ModelNet40 and ScanObjectNN for PointMoment

Figure 4 visualizes features from the ModelNet10 test set using t-SNE, extracted with the pre-trained PointNet. Incorporating second- and third-order mixed moments improved the separability of different classes. Notably, the addition of the third-order mixed moment allowed for clearer differentiation of objects with less distinct boundaries, including sofas, beds, and bathtubs.
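Such a visualization can be produced with a few lines of scikit-learn and matplotlib; the t-SNE hyperparameters and output path below are assumptions, and `feats`/`labels` are the precomputed frozen-backbone features and class indices.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(feats, labels, out_path="tsne_modelnet10.png"):
    """Project frozen-backbone features to 2D with t-SNE and color by class."""
    emb = TSNE(n_components=2, perplexity=30, init="pca",
               random_state=0).fit_transform(feats)
    fig, ax = plt.subplots(figsize=(6, 6))
    scatter = ax.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=4)
    ax.legend(*scatter.legend_elements(), title="class", fontsize=7)
    ax.set_axis_off()
    fig.savefig(out_path, dpi=300)
```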

Fig. 4: t-SNE feature visualization of the ModelNet10 test set after self-supervised pre-training with PointNet as the backbone network.
figure 4

The features learned with the third-order mixed moment (right) provide better discrimination of classes (e.g., sofas, beds, and bathtubs) than using only the invariance term (left) or the second-order mixed moment (middle).

Sensitivity analysis of \(\lambda\)

Intuitively, the parameter \(\lambda\) significantly influences pre-training and, consequently, the performance on downstream tasks. Our study examines how varying \(\lambda\) affects classification performance. We tested \(\lambda\) values from 0.001 to 5, using PointNet as the backbone network for pre-training, and conducted linear classification on both ScanObjectNN and ModelNet40 datasets. As Table 6 shows, optimal classification performance on both datasets was achieved when \(\lambda\) was set to 0.5.

Table 6 Linear classification results for different \(\lambda\) values on ModelNet40 and ScanObjectNN datasets after pre-training using PointNet

Application on the Terracotta Warrior Dataset

Terracotta Warriors Dataset

The Terracotta Warriors, renowned as one of the Seven Wonders of the World, represent a significant ceramic cultural relic in China. Their virtual restoration holds great importance for cultural heritage preservation and transmission. This study focuses on the 3D digitization and processing of Terracotta Warrior fragments for neural network analysis. Our dataset was acquired using a Creaform VIU 718 handheld 3D scanner in the Visualization Laboratory. Due to the high resolution of the resulting point clouds, which poses challenges for direct neural network input, we employed a preprocessing step. The Clustering Decimation method, available in the Meshlab tool, was utilized to downsample the point cloud data. This approach effectively preserves structural information while reducing each fragment to a uniform 2048 points. The dataset was categorized according to the anatomical parts of the Terracotta Warriors: arms, heads, legs, and bodies (as illustrated in Fig. 5). The sample distribution across these categories is presented in Table 7. For our experimental protocol, we adopted an 80–20 split, allocating 80% of the data for training and the remaining 20% for testing.
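The dataset itself was decimated with MeshLab's Clustering Decimation filter; for readers without MeshLab, the NumPy sketch below only mimics the idea (average points per grid cell, then resample to exactly 2048 points). The cell-size heuristic and resampling step are assumptions, not the filter's actual algorithm.

```python
import numpy as np

def cluster_downsample(points, target=2048, rng=None):
    """Illustrative grid-clustering downsample of an (N, 3) cloud to `target` points."""
    if rng is None:
        rng = np.random.default_rng(0)

    # Pick a cell size so the number of occupied cells is roughly `target`
    extent = points.max(0) - points.min(0)
    cell = (extent.prod() / target) ** (1.0 / 3.0)
    keys = np.floor((points - points.min(0)) / cell).astype(np.int64)

    # Average the points falling into each occupied cell
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    sums = np.zeros((inverse.max() + 1, 3))
    counts = np.zeros(inverse.max() + 1)
    np.add.at(sums, inverse, points)
    np.add.at(counts, inverse, 1)
    centers = sums / counts[:, None]

    # Resample to exactly `target` points
    idx = rng.choice(len(centers), target, replace=len(centers) < target)
    return centers[idx]
```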

Fig. 5
figure 5

Illustration of the Terracotta Warriors fragments.

Table 7 Number of fragments for each class in the Terracotta Warriors fragments dataset

Classification of Terracotta Warrior Dataset

To validate the efficacy of our method on real-world 3D Terracotta Warrior fragments, we conducted classification experiments using our dataset. We fine-tuned a pre-trained DGCNN model on the Terracotta Warrior dataset, with the results presented in Table 8. It is worth noting that research on self-supervised representation learning methods based on Terracotta Warrior point cloud data is scarce; consequently, our comparisons primarily involve traditional and supervised methods. The highest accuracy achieved by existing traditional methods is 87.64%. Our approach significantly outperforms this benchmark, demonstrating an improvement of 4.46%. Moreover, our method yields competitive results when compared to supervised learning approaches. These experimental outcomes indicate that our technique effectively bridges the gap between supervised and self-supervised learning. It provides a robust set of initial model parameters for the task of Terracotta Warrior fragment classification, thus contributing a valuable research methodology for the virtual restoration of these artifacts. These results not only demonstrate the potential of our method for the specific task of Terracotta Warrior restoration but also suggest its applicability to broader cultural heritage preservation efforts involving 3D artifact reconstruction.

Table 8 Compared with other methods on the 3D Terracotta Warrior fragment datasets

Segmentation of Terracotta Warrior Dataset

Segmentation of Terracotta Warriors plays a crucial role in the effective and accurate restoration of cultural relics, particularly in the virtual reconstruction of ceramic artifacts. Unlike the fragment classification task, our segmentation study uses complete Terracotta Warrior models. We compiled a dataset of 150 complete models using 3D scanners and data augmentation techniques, and all models are uniformly downsampled to 4096 points. Traditionally, the three-dimensional model of a Terracotta Warrior is divided into six parts: head, body, left arm, right arm, left leg, and right leg. However, to enhance the restoration process and rigorously evaluate our method’s performance, we manually annotated the original models into eight distinct segments: head, body, left hand, left arm, right hand, right arm, left leg, and right leg. We employed an 80–20 split, allocating 80% of the data for training and 20% for testing. To validate the effectiveness of our approach in segmenting Terracotta Warriors, we fine-tuned a pre-trained Dynamic Graph CNN (DGCNN) on our Terracotta Warriors dataset. The segmentation results are presented in Table 9. The empirical evidence demonstrates that our approach significantly outperforms existing unsupervised segmentation methods for Terracotta Warriors. Specifically, our method achieves improvements of 6.8% and 3.8% in segmentation accuracy compared to SRG (DGCNN) and EGG (DGCNN), respectively. The resulting segmentation outcomes are illustrated in Fig. 6. The visual results demonstrate that our method achieves high-quality segmentation, effectively distinguishing between the eight predefined parts of the Terracotta Warriors. These results not only showcase the capability of our method in accurately segmenting Terracotta Warrior models but also highlight its potential in facilitating more precise virtual restoration processes. The improved granularity of segmentation (eight parts instead of six) allows for more detailed analysis and reconstruction, potentially leading to more accurate and comprehensive restoration outcomes.

Table 9 Comparison of different methods on Terracotta Warrior dataset
Fig. 6
figure 6

PointMoment segmentation results on Terracotta Warrior Datasets.

Conclusion

The digital preservation of cultural heritage has become increasingly crucial in our technologically advancing world. This paper has explored the application of point cloud self-supervised learning technology to the Terracotta Warriors, introducing high-order mixed moments as an innovative approach to enhance feature characterization while reducing redundant information in high-dimensional embedded features.

Firstly, our research demonstrates the significant potential of high-order mixed moments in feature redundancy reduction. By effectively minimizing redundant information in high-dimensional embedded features, we have developed a feature extractor with superior representational capabilities. This approach results in more independent and compact representational information, which is crucial for accurate analysis and restoration of complex artifacts like the Terracotta Warriors.

Secondly, a key advantage of our method is its ability to address the model collapse problem inherent in self-supervised learning without resorting to complex techniques such as asymmetric network frameworks. This simplification in implementation, while maintaining robust performance, represents a significant step forward in the field of self-supervised learning for 3D data.

Furthermore, extensive testing on multiple downstream tasks using existing public datasets has validated the versatility and effectiveness of our technique. The competitive results achieved across various applications underscore the broad applicability of our approach beyond the specific context of cultural heritage preservation.

Notably, when applied to the Terracotta Warriors dataset, our method has shown remarkable performance, outperforming existing approaches in both fragment classification accuracy and segmentation precision. These results are particularly significant given the complexity and historical importance of the Terracotta Warriors. The improved accuracy in classification and segmentation can potentially lead to more precise virtual restorations and deeper insights into the manufacturing techniques and artistic styles of ancient China.

Looking ahead, we recognize that the computational complexity of calculating high-order mixed moments is a challenge that requires further improvement. To address this, we plan to explore more efficient algorithms to approximate or estimate high-order mixed moments. Our initial idea is to utilize feature decomposition techniques to decompose high-order mixed moments into combinations of lower-order moments, thereby reducing computational complexity. Additionally, we can employ block processing techniques to divide large-scale data into smaller chunks, compute each separately, and then combine the results. This approach would reduce memory usage and improve computational efficiency. Furthermore, we intend to perform random sampling on the final result matrix based on existing high-order mixed-moment algorithms. By avoiding the inclusion of all matrix elements in the loss calculation, we can significantly reduce computational complexity. Specifically, we plan to adopt Monte Carlo sampling techniques to randomly select a subset of matrix elements for estimation, approximating the true values. We will then apply weighted sampling according to the importance of feature variables to ensure that the impact of key features is fully considered. Finally, during backpropagation, we will compute gradients only for the important sampled points, further reducing computational load.
This continued research aims to optimize the performance of our methods and push the boundaries of what’s possible in digital cultural heritage preservation. Additionally, we anticipate that our approach could be extended to other types of 3D cultural artifacts, potentially opening up new possibilities in the field of virtual museums and digital archiving. In conclusion, our work not only presents a novel technical approach but also demonstrates the transformative potential of interdisciplinary research combining computer science and archaeology. As we continue to refine and expand these techniques, we move closer to a future where our cultural heritage is not only preserved but also made more accessible and understandable through advanced digital technologies. The success of our method in both general point cloud tasks and specific Terracotta Warrior applications underscores the potential of high-order mixed moments as a powerful technique for point cloud representation learning, opening up new avenues for research in self-supervised learning for 3D data across various domains.
