A deep learning approach for automatic 3D segmentation of hip cartilage and labrum from direct hip MR arthrography

Study design
The study protocol was approved by the local ethics committee (Kantonale Ethikkommission Bern, Switzerland, KEK 2022-00618) with a waiver for informed consent. We performed a retrospective feasibility study to develop and validate a deep learning approach for automated 3D segmentation of hip cartilage and labrum based on direct hip MR arthrography, using a recently introduced 3D T1 cartilage mapping sequence (MP2RAGE, magnetization-prepared 2 rapid gradient-echo) and a high-resolution balanced steady-state free precession sequence (TrueFISP, true fast imaging with steady-state precession). This study was performed in accordance with all relevant guidelines and regulations, including the Declaration of Helsinki.
Data
The internal dataset was developed by querying the picture archiving and communication system of the radiology department of Bern University Hospital for direct hip MR arthrograms acquired between January 2020 and October 2021. Inclusion criteria were a symptomatic hip deformity, patient age > 18 years, and a complete MRI scan according to the institutional routine protocol including a 3D T1 MP2RAGE sequence (Table 1). This resulted in a consecutive series of 197 patients. Exclusion criteria were posttraumatic deformity, previous hip surgery, pediatric hip deformities, and insufficient image quality such as motion artefacts or extra-articular contrast agent. This resulted in a total of 100 patients in the internal dataset (Fig. 1).

Patient inclusion and exclusion. Flowchart of patient inclusion and exclusion.
The external dataset was developed by querying the picture archiving and communication system of the radiology department of Bern University Hospital for direct hip MR arthrograms acquired between December 2021 and September 2022. Inclusion criteria were a symptomatic hip deformity, patient age > 18 years, and a complete MRI scan according to the institutional routine protocol including a T2-weighted 3D TrueFISP sequence (Table 1). This resulted in a consecutive series of 55 patients. The same exclusion criteria as for the internal dataset were applied, resulting in a total of 40 patients in the external dataset (Fig. 1).
MR image acquisition
For both the internal and external datasets, direct hip MR arthrography was performed under fluoroscopic guidance with injection of 1-2 ml of iodinated contrast agent (Iopamidol 200 mg/ml; Iopamiro 200; Bracco), 2-5 ml of local anaesthetic (ropivacaine hydrochloride, 2 mg/ml; Ropinaest; GebroPharma), and 15-20 ml of diluted MR contrast agent (gadopentetate dimeglumine, 2 mmol/l; Magnevist; Bayer Healthcare). Multiplanar proton density (PD)-weighted turbo spin echo (TSE) images of the hip were acquired in coronal, sagittal, and radial orientations. The internal dataset was acquired on a 3 T unit (Magnetom Skyra, Siemens Healthineers) and included an axial-oblique 3D T1 MP2RAGE sequence, which was used for manual and automatic segmentation as well as for postcontrast T1 mapping of the hip joint (dGEMRIC). The external dataset was acquired on a 1.5 T unit (Magnetom Aera, Siemens Healthineers) and included an axial-oblique 3D T2-weighted TrueFISP sequence, which was used for manual and automatic segmentation.
Ground truth segmentation
Manual segmentations were performed in a standardized multistep approach with commercially available software (Amira 6.1, FEI, Hillsboro, Oregon, USA) using the 3D T1 MP2RAGE (internal dataset) and 3D T2-weighted TrueFISP (external dataset) images in the axial-oblique plane, without any preprocessing steps. Since the T1 maps do not provide adequate image contrast for segmentation, the raw data acquired with the second inversion pulse (Inv2 images) were used; these images yield improved contrast of the chondrolabral structures and bone. Manual segmentations were performed by two residents (MKM, AB), each with 3 years of experience in hip imaging, and each slice was checked and corrected if needed by a radiologist (FS) with 8 years of experience in hip imaging. Using a standardized segmentation approach (Figs. 2 and 3), the osseous acetabulum and proximal femur were first labelled with a threshold-assisted method. In a second step, the femoroacetabular cartilage layers were labelled from the medial border (acetabular fossa) to the lateral border, using the acetabular rim as reference. Finally, based on the created models of hip cartilage and the acetabular rim, the labrum was labelled from its attachment at the acetabular rim and cartilage to its apex (Figs. 2 and 3). To assess inter- and intra-rater variability, a randomly chosen subset of 20 patients of the internal dataset was additionally segmented by the second reader (AB), who repeated the segmentation 6 weeks later; these segmentations served as a benchmark for automatic segmentation accuracy.

Manual segmentation workflow of internal dataset. Manual segmentation workflow for hip cartilage and labrum on the internal dataset. (a) Raw images (Inv2) of the 3D T1 magnetization-prepared 2 rapid acquisition gradient echoes (MP2RAGE) sequence were used for manual segmentation of hip cartilage (blue) and labrum (red). (b) Corresponding manual and automatic 3D morphologic models. (c) Masks of cartilage and labrum were then applied to the co-registered T1 map for (d) 3D visualization of post-contrast T1 relaxation time (dGEMRIC) using a voxel-wise color-graded scale. The Dice similarity coefficient between manual and automatic segmentation for this particular example was 0.95 for cartilage and 0.87 for labrum. 3D models were visualized using a custom-made software application for automatic segmentation and visualization8.

Manual segmentation workflow of external dataset. Workflow of manual segmentation of hip cartilage and labrum for the external dataset. (a) The T2-weighted true fast imaging with steady state precession (TrueFISP) sequence was used for manual segmentation of hip cartilage (blue) and labrum (red). (b) Corresponding 3D visualization of cartilage (blue) and labrum (red) morphology. The Dice similarity coefficient for this particular example was 0.92 for cartilage and 0.78 for labrum. 3D models were visualized using a custom-made software application for automatic segmentation and visualization8.
Data partitions
The overall study sample comprised 140 patients: 100 patients in the internal and 40 patients in the external dataset (Fig. 1). Of the internal dataset, the first 80 consecutive patients were selected for model training (internal training set). The remaining 20 patients were excluded from training and used as unseen data for model testing (internal testing set).
For external validation, the 40 patients were split into two external datasets of 20 patients each. The first external testing set included 20 unseen cases (external set 1 test). To investigate whether model performance could be improved with additional model training, we then used external test set 1 for additional training and the second set ("external set 2 retraining") for testing.
Model
The pipeline comprised initial cropping, segmentation, and metric calculation8. As a preprocessing step, the image volume was automatically cropped to 80 × 200 × 200 voxels around the femoral head center to increase efficiency and to reduce background complexity for the convolutional neural network. The automatic cropping was performed using a landmark detection algorithm based on a U-Net architecture with heatmap regression and a receptive field width of 140 pixels8,11. No image reformatting was performed. The external sets were resampled to the same voxel spacing and cropped to the same size. For segmentation of the three-dimensional structure of hip cartilage and labrum, the 3D U-Net architecture12 was used. For generation of the 3D U-Net as well as for training, evaluation, and prediction, the self-configuring nnU-Net framework of Isensee et al.13 was applied.
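As an illustration only (the study's actual pipeline is published on GitHub), the fixed-size cropping step can be sketched in NumPy. The femoral head centre is assumed to be given by the landmark-detection network, which is outside the scope of this sketch; the function name and interface are hypothetical.

```python
import numpy as np

def crop_around_center(volume, center, size=(80, 200, 200)):
    """Crop a fixed-size block around a landmark, zero-padding at the borders.

    volume: 3D array (z, y, x); center: (z, y, x) femoral head centre.
    Returns an array of shape `size` regardless of the landmark position.
    """
    out = np.zeros(size, dtype=volume.dtype)
    starts = [c - s // 2 for c, s in zip(center, size)]
    src, dst = [], []
    for start, s, dim in zip(starts, size, volume.shape):
        src_lo, src_hi = max(start, 0), min(start + s, dim)
        dst_lo = src_lo - start  # offset into the padded output
        src.append(slice(src_lo, src_hi))
        dst.append(slice(dst_lo, dst_lo + (src_hi - src_lo)))
    out[tuple(dst)] = volume[tuple(src)]
    return out
```

Cropping to a constant 80 × 200 × 200 block keeps the network input size fixed while discarding background far from the joint.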
The nnU-Net prevents overfitting through data augmentation, adaptive patch sampling, and an ensemble of five networks trained during cross-validation. This ensemble improves generalization, achieving similar or better results on the test set than the validation set. Early stopping also helps by halting training when validation performance stabilizes, preventing excessive fitting.
The deep learning network architecture included encoder and decoder paths connected via skip connections, as well as a bottleneck at the lowest layer (Fig. 4). The base building block was a convolution followed by an instance normalization (IN) layer and a leaky rectified linear unit (LReLU; negative slope, 0.01) as activation function. Two consecutive building blocks were used for each resolution step. The first layer in the encoder and decoder paths operated only on the axial planes in a pseudo-2D configuration to account for the anisotropic voxel spacing when creating or extracting image features. Downsampling was implemented as strided convolution and upsampling as transposed convolution. The initial feature map consisted of 32 channels, which doubled with each downsampling operation up to a maximum of 320 channels. The patch size was 64 × 192 × 160 voxels. Segmentation masks were generated with a 1 × 1 × 1 convolution and a softmax layer followed by an argmax operation over the background, cartilage, and labrum classes. The network was trained with deep supervision: auxiliary losses were added in the decoder at all but the two lowest resolutions, allowing gradients to be injected deeper into the network and facilitating the training of all layers13.
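The normalization and activation of the base building block can be sketched in NumPy; this is an illustrative re-implementation of the two generic operations only, not the authors' code, and the 3D convolutions themselves are left to the framework.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Instance normalization: each channel of a (C, D, H, W) feature map
    is normalized over its own spatial dimensions, per sample."""
    mean = x.mean(axis=(1, 2, 3), keepdims=True)
    var = x.var(axis=(1, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def leaky_relu(x, negative_slope=0.01):
    """LReLU with the negative slope of 0.01 used in the network."""
    return np.where(x >= 0, x, negative_slope * x)
```

Unlike batch normalization, instance normalization does not mix statistics across the mini-batch, which suits the small batch size of two patches described below.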

3D U-net architecture. 3D U-net architecture with a single channel input patch of 64 × 192 × 160 voxels. Each blue box corresponds to a multi-channel feature map. The number of channels is denoted on top of the box. The size of the volumetric patch is constant per layer and indicated on the left and the right. The white boxes represent copied feature maps. The arrows denote the different operations. The first layer is based on a pseudo 2D configuration. The deeper layers are based on a 3D configuration. The output segmentation feature maps, including the maps for the deep supervision, consist of three channels, one each for the background, cartilage, and labrum labels.
Training
The network was trained on the internal dataset (n = 80) for 60 epochs from scratch with random weights. One epoch was defined as an iteration over 250 mini-batches; owing to limited GPU memory, a mini-batch consisted of two patches. Stochastic gradient descent with Nesterov momentum (µ = 0.99) and an initial learning rate of 0.01 was used for learning the network weights. The learning rate was decayed throughout training following a polynomial schedule, lr = 0.01 × (1 − epoch/epoch_max)^0.9. The loss function was the sum of cross-entropy and Dice loss.
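The schedule and loss described above can be sketched as follows; this is an illustrative NumPy re-implementation, not the nnU-Net code, and the exact reduction details of the framework's Dice term may differ.

```python
import numpy as np

def poly_lr(epoch, max_epochs=60, initial_lr=0.01, exponent=0.9):
    """Polynomial decay: lr = lr0 * (1 - epoch/epoch_max)^0.9."""
    return initial_lr * (1 - epoch / max_epochs) ** exponent

def dice_ce_loss(probs, onehot, eps=1e-5):
    """Sum of soft Dice loss and cross-entropy.

    probs:  (C, N) softmax outputs over N voxels.
    onehot: (C, N) one-hot ground-truth labels.
    """
    inter = (probs * onehot).sum(axis=1)
    denom = probs.sum(axis=1) + onehot.sum(axis=1)
    dice = 1.0 - ((2.0 * inter + eps) / (denom + eps)).mean()
    ce = -(onehot * np.log(probs + eps)).sum(axis=0).mean()
    return dice + ce
```

Combining a region-overlap term (Dice) with a voxel-wise term (cross-entropy) is a common remedy for the strong class imbalance between background and the thin cartilage and labrum structures.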
The framework was trained with 5-fold cross-validation to evaluate performance on the validation sets and to define the post-processing steps. nnU-Net empirically opted for "non-largest component suppression" as a post-processing step if performance gains were measured13. The final network was an ensemble to increase generalizability, in which the final softmax output was the average of the five subnetworks.
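The ensembling step amounts to averaging the per-fold softmax probability maps before the argmax; a minimal sketch (illustrative, with a hypothetical function name):

```python
import numpy as np

def ensemble_predict(fold_softmaxes):
    """Average the softmax maps of the cross-validation folds, then take
    the channel-wise argmax: 0 = background, 1 = cartilage, 2 = labrum.

    fold_softmaxes: sequence of (C, D, H, W) probability volumes.
    """
    mean_probs = np.mean(fold_softmaxes, axis=0)
    return np.argmax(mean_probs, axis=0)
```

Averaging probabilities rather than hard label maps lets a confident minority fold outvote weakly confident folds, which is part of why the ensemble generalizes better than any single subnetwork.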
Evaluation
Model performance was evaluated with the following metrics: Dice similarity coefficient (DSC), average symmetric surface distance (ASSD), precision, recall, and absolute and relative differences in volume and dGEMRIC indices.
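The overlap metrics reduce to simple set operations on binary masks; a minimal NumPy sketch (ASSD is omitted here because it requires surface extraction):

```python
import numpy as np

def dice_coefficient(pred, ref):
    """DSC = 2|P ∩ R| / (|P| + |R|) for binary masks."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    inter = np.logical_and(pred, ref).sum()
    return 2.0 * inter / (pred.sum() + ref.sum())

def precision_recall(pred, ref):
    """Precision = TP/|P|, recall = TP/|R| for binary masks."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    tp = np.logical_and(pred, ref).sum()
    return tp / pred.sum(), tp / ref.sum()
```

In this setting, precision penalizes over-segmentation (labelling voxels outside the reference) while recall penalizes under-segmentation; the DSC balances both.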
Code availability statement
The code for preprocessing, training, and evaluation is publicly available on GitHub and archived at https://doi.org/10.5281/zenodo.14316889.
Statistical analysis
Data were tested for normal distribution using the Kolmogorov-Smirnov test. For comparisons between datasets, the Kruskal-Wallis test with Dunn's correction for multiple comparisons was used for continuous parameters and the chi-square test for binary data. Differences in dGEMRIC indices between manual and automatic cartilage segmentation were assessed with the Wilcoxon signed-rank test for paired data. All statistical tests were conducted at the two-sided 5% significance level using GraphPad Prism (version 9.5, GraphPad Software).
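As an illustrative sketch only (the study used GraphPad Prism, and the example data below are made up), the same tests are available in SciPy; Dunn's post-hoc correction is not part of SciPy itself and would require a third-party package such as scikit-posthocs.

```python
from scipy import stats

# Hypothetical dGEMRIC-like example values for two groups (ms)
group_a = [510.0, 498.0, 523.0, 505.0, 517.0]
group_b = [470.0, 482.0, 465.0, 488.0, 476.0]

# Normality check against a standard normal after z-scoring
ks = stats.kstest(stats.zscore(group_a), 'norm')

# Kruskal-Wallis test for continuous comparisons between datasets
kw_stat, kw_p = stats.kruskal(group_a, group_b)

# Wilcoxon signed-rank test for paired manual vs. automatic indices
manual = [510.0, 498.0, 523.0, 505.0, 517.0]
automatic = [508.0, 501.0, 520.0, 507.0, 515.0]
w_stat, w_p = stats.wilcoxon(manual, automatic)
```

The paired Wilcoxon test is the appropriate choice for the manual-versus-automatic comparison because both indices are measured on the same patient.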