Reviewer #1 (Public Review):
Main contributions / strengths
The authors propose a process to improve the ground truth segmentation of fetal brain MRI via a semi-supervised approach based on several iterations of manual refinement of atlas label propagations. This procedure represents an impressive amount of work, likely resulting in a very high-quality ground truth dataset. The corrected labels (obtained from multiple different datasets) are then used to train the final model, which performs the brain extraction and tissue segmentation tasks. We also acknowledge the caution exercised by the authors regarding the future application of their pipeline to unseen datasets.
The conclusions of this paper are mostly well supported by data, but some aspects of the analysis and validation procedure need to be clarified and extended. In addition, the article would greatly benefit from providing further descriptions of crucial aspects of the study.
Main limitations and potential improvements
1) New nomenclature/atlas not sufficiently described/justified.
The proposed nomenclature and atlas are one of the main contributions of this work. We clearly acknowledge the importance for the community of such a contribution. The definition of any nomenclature implies that decisions were taken regarding the acceptable level of ambiguity in the identification of the boundary between neighboring anatomical structures with respect to the gradient in the intensities in the MRI. It is acceptable (and probably inevitable) to set relatively arbitrary criteria in ambiguous regions, providing that these criteria are explicitly stated. The explicit statement of the decisions taken is essential in particular for better interpretation of residual segmentation inaccuracies in application studies.
By way of comparison, the postnatal atlas and nomenclature were based on the Albert protocol, which is described in extensive detail. While such a complete description might fall beyond the scope of this work, we believe that an additional description of the nomenclature and protocol, allowing reproduction of the manual segmentation on external datasets, is required, at least for the most ambiguous junctions between structures. For instance, the boundaries across substructures within the DGM are difficult to visualize on the exemplar subjects shown in Fig. 5 and Fig. 6.
Please provide additional detail on how the following boundaries were defined: between the lateral ventricles and the cavum; between the cavum and the CSF; the delineation of the 3rd and 4th ventricles; and the definition of the vermis, especially its junctions with the cerebellum and the brainstem.
How are these boundaries impacted by the changes in the image intensities related to tissue maturation?
We would also greatly appreciate an extension of the qualitative comparison with the two most commonly used protocols (Albert and FETA): for instance, why did the authors not isolate the hippocampus/amygdala structure? And how is the boundary between gray and white matter then defined in this region?
2) More detailed comparison with FETA for some structures would be informative despite obvious limitations.
More specifically, the GM should have a very similar definition. In the "Impact of anomalies" section (page 7), the authors compare their results with the Dice scores from the FETA challenge and conclude that the difference "highlights the advantages of using high-quality consistent ground truth labels for training". The improved performance (from ~0.78 to ~0.88) might be mostly due to the improved ground truth of the test set. This could be confirmed by inspecting the FETA ground truth of the GM for a few cases in which the Dice score shows a strong increase with respect to FETA. Note that the gain in performance is appreciable even if it is due to better ground truth.
3) Improvement of the ground truth labels is an important contribution of this work, thus we would appreciate a more quantitative description of the impact of the manual correction, such as reporting the change in Dice score induced by the correction.
Quantification of the refinement process would help to better evaluate the relevance of the proposed approach in future studies, e.g. when introducing a different nomenclature. More specifically, a marked change would be expected after the first training, when there is a switch (and refinement) from the registration-propagated labels to those predicted by the DL model (as shown in Fig. 5, the changes are quite strong). Again, a Dice score quantifying how much improvement results from each iteration would be informative. Along the same lines, was the last iteration of this process needed, or did the authors observe a 'stabilization' (i.e. less manual editing being required)?
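To illustrate the kind of quantification we have in mind, the change induced by each round of correction could be reported as the Dice overlap between the labels before and after editing. The sketch below uses synthetic label maps; the function and variable names are ours, not the authors'.

```python
import numpy as np

def dice(a, b, label):
    """Dice overlap of one label between two segmentation maps."""
    a_mask = (a == label)
    b_mask = (b == label)
    denom = a_mask.sum() + b_mask.sum()
    if denom == 0:
        return 1.0  # label absent from both maps: count as agreement
    return 2.0 * np.logical_and(a_mask, b_mask).sum() / denom

# Synthetic stand-ins for the labels before and after one round of
# manual correction (one iteration of the semi-supervised loop).
rng = np.random.default_rng(0)
before = rng.integers(0, 3, size=(8, 8, 8))
after = before.copy()
after[0, :2, :2] = 1  # simulate a small manual edit
for lab in (1, 2):
    print(f"label {lab}: Dice(before, after) = {dice(before, after, lab):.3f}")
```

A per-iteration curve of this score (averaged over subjects and labels) would directly show whether the process stabilizes, i.e. whether the score approaches 1 as less manual editing is needed.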
4) The testing / training data-splitting strategy is not sufficiently detailed and difficult to follow. The following points deserve clarification:
a) Why did the authors select only four sites for the test set (out of six studies presented in the 3.1 section)?
b) Data used for training: in the first step, the authors selected 200 images for label propagation and retained only the best 100. In the second stage, predictions are computed for the whole training/validation set (380 images) and only 200 are selected. When the process was iterated, why did the authors select only 200 out of the 380? Are the same subjects selected across iterations?
Were the acquisition parameters / gestational age controlled for in each selection? If so, please specify the distributions precisely.
Did the authors control for the potentially imbalanced proportions present in the dataset (e.g. more subjects from dHCP)? (Line 316: 100 subjects were selected from only three centers. Why only three? Did the authors keep the same sub-sites for the other stages?)
c) "The testing dataset includes 40 randomly selected images from four different acquisition protocols" which shows that attention was paid to variations in the scanning parameters, which is of crucial importance. However, no precision is provided regarding the gestational age of this dataset, which impedes the interpretation since a potential influence of age on the accuracy of the segmentation would be problematic. Indeed, the authors mention that the manual correction deserved special attention for late GA (>34 weeks). Please specify precisely the age distribution across the 10 subjects of each of the four acquisition protocols. In addition, the qualitative results shown in Fig.6 and subsection "Impact of GA at scan" are not sufficient and an additional result table reporting the same population and metrics as in Table 2, but dissociating younger versus older fetuses, would be much more informative to rule out potential bias related to gestational age.
d) The definition of the ground truth labels for the test set is not described.
We understand (from the results) that the ground truth for the test set was defined by manual refinement of the propagated atlas labels. This should be explicitly described on page 5, after the "Preparation of training datasets" section.
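The age-stratified table requested in point (c) amounts to reporting the same metrics grouped by a GA cut-off. A minimal sketch follows; the cut-off at 34 weeks matches the authors' remark, but the column names and numbers are purely illustrative assumptions.

```python
import pandas as pd

# Illustrative per-subject scores; "ga_weeks" and "dice" are our own
# hypothetical column names, and the values are made up for the sketch.
scores = pd.DataFrame({
    "subject": range(8),
    "ga_weeks": [22, 24, 26, 29, 33, 35, 36, 37],
    "dice": [0.91, 0.90, 0.92, 0.89, 0.87, 0.85, 0.86, 0.84],
})
# Split at 34 weeks, the threshold the authors flag as requiring
# special attention during manual correction.
scores["age_group"] = pd.cut(scores["ga_weeks"], bins=[0, 34, 99],
                             labels=["<=34w", ">34w"])
summary = scores.groupby("age_group", observed=True)["dice"] \
                .agg(["mean", "std", "count"])
print(summary)
```

Reporting such a table for each metric of Table 2 (and each acquisition protocol) would make any age-related bias immediately visible.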
5) The validation of segmentation accuracy based on the volumetry growth chart is invalid.
In Section "4.3. Growth charts of normal fetal brain development", since manual corrections were involved, the reported results cannot be considered as a validation of the segmentation pipeline. Regarding the validation of the segmentation pipeline, the quantitative and qualitative results provided in Table 2 and the corresponding text and figures seem sufficient to us (providing our concerns above are addressed, especially regarding the impact of the gestational age).
The growth charts are still valuable to support the validity of the nomenclature and segmentation protocol, but then why are the growth charts computed only for some structures? Reporting the growth charts and the statistical evaluation of the impact of acquisition settings (using ANCOVA) for all substructures of the proposed protocol would be expected here, in particular for structures whose delineation might be ambiguous, such as the cavum, the vermis, and DGM substructures such as the thalamus.
Finally, please provide further details on the type and amount of manual correction needed for computing the growth charts.
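For clarity, the ANCOVA we are suggesting tests whether acquisition settings affect a structure's volume after adjusting for gestational age. A minimal sketch with synthetic data follows; the column names ("volume", "ga", "protocol") and the simulated growth trend are our own assumptions, not the authors' data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Synthetic cohort: volume grows with GA; protocol has no true effect,
# so a well-behaved ANCOVA should not flag it as significant.
rng = np.random.default_rng(1)
n = 120
ga = rng.uniform(21, 38, n)                      # gestational age, weeks
protocol = rng.choice(["A", "B", "C"], n)        # acquisition protocol
volume = 0.9 * ga**2 + rng.normal(0, 20, n)      # synthetic growth trend

df = pd.DataFrame({"volume": volume, "ga": ga, "protocol": protocol})
# ANCOVA: protocol effect on volume, controlling for GA.
model = smf.ols("volume ~ ga + C(protocol)", data=df).fit()
print(anova_lm(model, typ=2))                    # type-II F-test per term
```

Running this per substructure (with a higher-order GA term if the growth curve warrants it) would give exactly the per-structure statistical evaluation we are asking for.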
6) MRI data was acquired only on Philips scanners.
We acknowledge the efforts to maximize heterogeneity in the MRIs, e.g. with both 1.5T and 3T scanners and variations in TE and image resolution, but all MRIs included in this study were still acquired using the SSTSE sequence on Philips scanners. The study does not include any MRI acquired on Siemens or GE scanners, and no image was acquired using a balanced-FFE/TrueFISP/FIESTA-type sequence. This might limit generalizability.