Reviewer #3 (Public review):
Summary:
The authors assess how well their automatic dimension prediction approach (DimPred) can support similarity judgements and compare it to more standard RSA approaches. The authors show that the DimPred approach does better when assessing out-of-sample heterogeneous image sets, but worse for out-of-sample homogeneous image sets. DimPred also does better at predicting brain-behaviour correspondences than an alternative approach. The work appears to be well done, but I'm left unsure what conclusions the authors are drawing.
In the abstract, the authors write: "Together, our results demonstrate that current neural networks carry information sufficient for capturing broadly-sampled similarity scores, offering a pathway towards the automated collection of similarity scores for natural images". If that is the main claim, then they have done a reasonable job supporting this conclusion. However, the importance of automating this process for broadly-sampled object categories is not made sufficiently clear.
But the authors also highlight the importance that similarity judgements have had for theories of cognition and the brain; for example, in the first paragraph of the paper they write: "Similarity judgments allow us to improve our understanding of a variety of cognitive processes, including object recognition, categorization, decision making, and semantic memory6-13. In addition, they offer a convenient means for relating mental representations to representations in the human brain14,15 and other domains16,17". The fact that the authors also assess how well a CLIP model using DimPred can predict brain activation suggests that their work is not just about automating similarity judgements, but also about showing that their approach reveals ANNs to be more similar to brains than previously assessed.
My main concern is with the claim that DimPred reveals greater similarity between ANNs and brains (a claim that the authors may not be making, but this should be clarified). The fact that predictions are poor for homogeneous images is problematic for this claim, and I expect DimPred scores would be very poor under many conditions, such as when applied to line drawings of objects or to a variety of additional out-of-sample stimuli that are easily identified by humans. The fact that so many different models achieve such similar prediction scores (Fig 3) also raises questions about the inferences one can make regarding ANN-brain similarity based on these results. Do the authors want to claim that CLIP models are more like brains?
With regard to the brain prediction results, why is the DimPred approach doing so much better in V1? I would not think the 49 interpretable dimensions are encoded in V1, and the ability to predict would likely reflect a confound rather than V1 encoding these dimensions (e.g., if a dimension were "things that are burning", then a DNN might predict V1 activation based on the encoding of colour).
In addition, more information is needed on the baseline model, as it is hard to appreciate, based on what is provided, whether we should be impressed by the better performance of DimPred: "As a baseline, we fit a voxel encoding model of all 49 dimensions. Since dimension scores were available only for one image per category36, for the baseline model, we used the same value for each image of the same category and estimated predictive performance using cross-validation". Is it surprising that predictions are not good with one image per category? Is this a reasonable comparison?
Relatedly, what was the predictive performance of the baseline model? (I don't think that information was provided.) Did the authors attempt to predict activity outside the visual brain areas? What would it mean if predictions were still better there?
Minor points:
The authors write: "Please note that, for simplicity, we refer to the similarity matrix derived from this embedding as "ground-truth", even though this is only a predicted similarity". Given this, it does not seem a good idea to use "ground truth", as this clarification will be lost in future work citing this article.
It would be good to have the 49 interpretable dimensions listed in the supplemental materials rather than having to go to the original paper.
Strengths:
The experiments seem well done.
Weaknesses:
It is not clear what claims are being made.