Reviewer #2 (Public review):
Summary:
The authors attempt to estimate the heritability of brain activity evoked from a naturalistic fMRI paradigm. No new data were collected; the authors analyzed the publicly available and well-known data from the Human Connectome Project. The paper has 3 main pieces, as described in the Abstract:
(1) Heritability of movie-evoked brain activity and connectivity patterns across the cortex.
(2) Decomposition of this heritability into genetic similarity in "where" vs. "how" sensory information is processed.
(3) Heritability of brain activity patterns, as partially explained by the heritability of neural timescales.
Strengths:
The authors investigate a very relevant topic: how heritable patterns of brain activity are among individuals exposed to the same kind of naturalistic stimulation. Notably, the authors complement their analysis of movie-watching data with resting-state data.
Weaknesses:
The paper has numerous problems, most of which stem from the statistical analyses. I also note the lack of mapping between the subsections of the Methods section and those of the Results section. Results can only be assessed after the methods have been understood and confirmed to be valid; here, however, Methods and Results, as written, are not aligned, so one cannot always be sure which results come from which analysis.
(A) Intersubject correlation (ISC) (section that starts from line 143): "We used non-parametric permutation testing to quantify average differences in ISC for each parcel in the Schaefer 400 atlas for each day of data collection across three groups: MZ dyads, DZ dyads, and unrelated (UR) dyads, where all UR dyads were matched for gender and age in years." ... "some participants contributed to ISC values for multiple dyads (thus violating independence assumptions)"
This is an indirect attempt to demonstrate heritability. It is also incorrect since, as the authors themselves point out, some subjects contribute to more than one dyad, violating the independence assumptions of the test.
Permutation tests do not quantify "average differences"; they provide a measure of evidence about whether the observed differences are sufficient to reject a hypothesis of no difference.
Matching subjects is also incorrect, as it artificially alters the sample; covarying for age and sex, as is done in standard analyses of heritability, would have been appropriate.
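The standard alternative to matching is straightforward: residualize the phenotype on age and sex before any group comparison, keeping the full sample. A minimal sketch with simulated, hypothetical data (not the authors' pipeline):

```python
import numpy as np

def residualize(y, covariates):
    """Regress y on covariates (plus an intercept) and return the residuals."""
    X = np.column_stack([np.ones(len(y)), covariates])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

# toy data: ISC values confounded with age and sex
rng = np.random.default_rng(0)
age = rng.uniform(22, 36, size=200)
sex = rng.integers(0, 2, size=200).astype(float)
isc = 0.02 * age + 0.10 * sex + rng.normal(0, 0.1, size=200)

# adjusted ISC is orthogonal to the covariates; no dyads are discarded
isc_adj = residualize(isc, np.column_stack([age, sex]))
```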
It is not clear why the authors went to the trouble of implementing their own non-parametric test when HCP recommends using PALM, which already contains validated and documented permutation methods developed precisely for HCP data.
The results from this analysis, in their current form, are likely incorrect.
(B) Functional connectivity (FC) (section that starts from line 159): Here the authors compute two 400x400 FC matrices for each subject, one for rest and one for movie-watching, then correlate the correlations within each dyad, and compare the average correlation of correlations across MZ, DZ, and UR groups. In addition to the same problems as the previous analysis, it is not clear here what is meant by "averaging correlations [...] within a network combination". What is a "network combination"? Further, to average correlations, they must be r-to-z transformed first. As with the above, the results from this analysis, in their current form, are likely incorrect.
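The r-to-z point can be made concrete: averaging correlation coefficients directly biases the summary, whereas Fisher-transforming, averaging, and back-transforming does not. A minimal illustration (values hypothetical):

```python
import numpy as np

def fisher_avg(rs):
    """Average correlations in Fisher z-space, then back-transform to r."""
    return np.tanh(np.mean(np.arctanh(rs)))

rs = [0.2, 0.5, 0.9]
naive = float(np.mean(rs))   # simple average of r values: ~0.533
proper = fisher_avg(rs)      # the z-space average is noticeably larger here
```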
(C) ISC and FC profile heritability analyses (section that starts from line 175): Here, the authors first use a valid method, remarkably similar to the old Haseman-Elston approach, to compute heritability, complemented by a permutation test. That is fine. But then they proceed with two novel, ill-described, and likely invalid methods: (1) to "compare the heritability of movie and rest FC profiles" and (2) to "determine the sample size necessary for stable multidimensional heritability results". For (1), they permute rest and movie-watching timeseries, seemingly under the alternative; for (2), they drop subjects and estimate changes in the distribution.
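For readers unfamiliar with it, the Haseman-Elston logic that the valid method resembles can be sketched in a few lines: squared within-pair trait differences are regressed on expected genetic relatedness (1.0 for MZ, 0.5 for DZ), and the negated, rescaled slope estimates heritability. A toy simulation under an assumed true h2 of 0.6, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_dyads(n, relatedness, h2=0.6):
    """Simulate n trait pairs whose genetic components correlate at `relatedness`."""
    g1 = rng.normal(size=n)
    g2 = relatedness * g1 + np.sqrt(1 - relatedness**2) * rng.normal(size=n)
    s_g, s_e = np.sqrt(h2), np.sqrt(1 - h2)
    return (s_g * g1 + s_e * rng.normal(size=n),
            s_g * g2 + s_e * rng.normal(size=n))

mz = simulate_dyads(5000, 1.0)
dz = simulate_dyads(5000, 0.5)

# Haseman-Elston: E[(x - y)^2] = 2*sigma^2 - 2*relatedness*sigma_g^2, so with
# unit trait variance the slope of squared differences on relatedness is -2*h2
d2 = np.concatenate([(mz[0] - mz[1]) ** 2, (dz[0] - dz[1]) ** 2])
rel = np.concatenate([np.ones(5000), np.full(5000, 0.5)])
slope = np.polyfit(rel, d2, 1)[0]
h2_est = -slope / 2   # should land near the simulated h2 of 0.6
```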
Method (1) might be correct, but some items are not clearly described, so the reader cannot be sure of what was done. What are the "153 unique network combinations"? Why do the authors separate by day here, whereas the previous analyses concatenated both days? Were the correlations r-to-z transformed before averaging?
Method (2) is also not well described, and in any case power can be computed analytically; it is unclear why the authors needed to resort to this ad hoc approach, the validity of which is unknown. If the issue is the possibility that the multidimensional phenotypic correlation matrix is rank-deficient, it suffices to have more independent measurements per subject than subjects.
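The rank-deficiency concern is easy to check directly: if each subject contributes fewer independent measurements than there are subjects, the subject-by-subject phenotypic correlation matrix cannot be full rank. A toy check with hypothetical dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)
n_subjects, n_features = 100, 40          # fewer measurements than subjects
pheno = rng.normal(size=(n_subjects, n_features))

K = np.corrcoef(pheno)                    # 100 x 100 phenotypic similarity matrix
rank = np.linalg.matrix_rank(K)           # bounded by n_features, not n_subjects
```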
(D) Frequency-dependent ISC heritability analysis (from line 216): Here, the authors decompose the timeseries into frequency bands and then repeat the earlier analyses, thus inheriting the same problems and questions: non-exchangeability in the permutations given the dyad structure, r-to-z transforms, and sex/age covariates.
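The band decomposition itself is unproblematic; for reference, a crude frequency-band split of a timeseries can be done with a brick-wall FFT mask (a TR of 0.72 s is assumed, as in HCP; the signal and band edges are hypothetical):

```python
import numpy as np

def bandpass_fft(x, fs, lo, hi):
    """Crude brick-wall band-pass: zero FFT coefficients outside [lo, hi] Hz."""
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)
    X = np.fft.rfft(x)
    X[(freqs < lo) | (freqs > hi)] = 0
    return np.fft.irfft(X, n=len(x))

fs = 1 / 0.72                              # HCP sampling rate (TR = 0.72 s)
t = np.arange(0, 600, 1 / fs)              # 10 minutes of signal
slow = np.sin(2 * np.pi * 0.03 * t)        # 0.03 Hz component
fast = np.sin(2 * np.pi * 0.20 * t)        # 0.20 Hz component
x_band = bandpass_fft(slow + fast, fs, 0.01, 0.10)   # keeps only the slow part
```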
(E) FC strength heritability analysis (from line 236): Here, the authors use the univariate FC to compute heritability using valid and well-established methods as implemented in SOLAR. There is no "linkage" being done here (thus, the statement in line 238 is incorrect in this application). SOLAR already produces standard errors, so it is unclear why the authors went out of their way to obtain jackknife estimates. If the issue is non-normality, I note that the assumption of normality is already present at the stage at which the parameters themselves are estimated, not just the standard errors; for non-normal data, a rank-based inverse-normal transformation could have been used. Moreover, r-to-z transformed values typically tend to be fairly normally distributed. So, while the heritabilities might be correct, the standard errors may not be (the authors do not demonstrate that their jackknife SE estimator is valid). The comparison of h2 between dyads raises the same questions about permutations, age/sex covariates, and r-to-z transforms as above.
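The rank-based inverse-normal transformation mentioned above is essentially a one-liner; a sketch using the Blom offset (not the authors' code):

```python
import numpy as np
from scipy.stats import norm, rankdata

def inverse_normal_transform(x, c=3 / 8):
    """Rank-based inverse-normal (Blom) transform to approximately Gaussian scores."""
    ranks = rankdata(x)
    return norm.ppf((ranks - c) / (len(x) + 1 - 2 * c))

# a heavily skewed toy phenotype becomes symmetric after the transform
rng = np.random.default_rng(3)
x = rng.exponential(size=1000)
z = inverse_normal_transform(x)
```

The transform is monotone, so it preserves the ordering of subjects while discarding the non-normal shape of the raw distribution.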
(F) Hyperalignment (from line 245): It is not clear at this point in the manuscript how hyperalignment would help decompose heritability into "where" vs. "how" (from the Abstract); that information and the relevant references only appear much later, from around line 459. The description itself provides no references, and one cannot even attempt to reproduce what is described here in the Methods section. Regardless, it is not entirely clear why this analysis was done: by matching functional areas, all heritabilities will be reduced because there will be less variance between subjects. Perhaps studying the parameters that drive the alignment (akin to what is done in tensor-based and deformation-based morphometry) would have been more informative. Moreover, the alignment process itself may introduce errors, which could also reduce heritability; this is an alternative explanation for the reduced heritability after hyperalignment and should be discussed. An investigation of the hyperalignment parameters, their heritability, and their co-heritability with the BOLD phenotypes could inform on this.
(G) Relationships between parcel area and heritability (from line 270): As under (F), how much the results are distorted likely depends on the accuracy of the alignment and on the error variance (vs. heritable variance) it introduces.
(H) Neural timescale analyses (from line 280): Here, a valid phenotype (NT) is assessed with statistical methods that share the same limitations as above (exchangeability of dyads, age/sex covariates, and r-to-z transforms). NT values are combined across space and used as covariates in "some multivariate analyses". As a reader, I really wanted to see the results related to NT, something as simple as its heritability, but these are not clearly shown; only differences between types of dyads are.
(I) Significance testing for autocorrelated brain maps and FC matrices (from line 310): Here, the authors suddenly bring up something entirely different (the reliability of heritability maps) and then never return to the topic of reliability. As a reader, I find this confusing. In any case, analyses with BrainSMASH are fine for well-behaved, normally distributed data; whether their data are well behaved, or whether they ensured that the data would be well behaved so that BrainSMASH is valid, is not described. Why Spearman correlations rather than Mantel tests are used here, and whether the 1000 "surrogate" maps are valid realizations of the data under the null, remains undemonstrated.
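For what it is worth, the surrogate-map test itself is simple to state: given autocorrelation-preserving surrogates (e.g., from BrainSMASH), the p-value is the rank of the observed map correlation within the surrogate null. A sketch with hypothetical inputs (random stand-ins for real surrogates, and Pearson rather than Spearman for brevity):

```python
import numpy as np

def surrogate_map_p(map_a, map_b, surrogates):
    """Two-sided p-value for the spatial correlation of two brain maps, with the
    null distribution built from surrogate versions of map_a."""
    obs = np.corrcoef(map_a, map_b)[0, 1]
    null = np.array([np.corrcoef(s, map_b)[0, 1] for s in surrogates])
    return (1 + np.sum(np.abs(null) >= np.abs(obs))) / (1 + len(surrogates))

rng = np.random.default_rng(4)
map_b = rng.normal(size=400)                # e.g., one value per Schaefer parcel
map_a = map_b + 0.1 * rng.normal(size=400)  # a strongly related map
surrogates = rng.normal(size=(1000, 400))   # stand-ins; real surrogates must match
                                            # the spatial autocorrelation of map_a
p = surrogate_map_p(map_a, map_b, surrogates)
```

Whether the surrogates are valid draws from the null is exactly the point the manuscript leaves undemonstrated; the p-value is only as good as that null.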
(J) The global signal was removed, and the authors neither acknowledge that this could be a limitation of their analyses nor offer a side analysis in which the global signal is preserved.
(K) FDR is used to control the error rate, but in many cases it is applied to multiple sets of p-values at once, so the proportion of false discoveries is only controlled across all tests, not within each set; the number of errors within any given set remains unknown.
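The distinction matters in practice: applying Benjamini-Hochberg to each set separately controls FDR within that set, while pooling only controls it over the union, and the two can reject different hypotheses. A self-contained sketch (p-values hypothetical):

```python
import numpy as np

def bh_fdr(pvals, q=0.05):
    """Benjamini-Hochberg procedure; returns a boolean mask of rejections."""
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    thresh = q * np.arange(1, len(p) + 1) / len(p)
    below = p[order] <= thresh
    k = below.nonzero()[0].max() + 1 if below.any() else 0
    reject = np.zeros(len(p), dtype=bool)
    reject[order[:k]] = True
    return reject

set_a = [0.001, 0.002, 0.003, 0.004]   # one family of tests, strong signal
set_b = [0.030, 0.200, 0.500, 0.900]   # another family, mostly null

within = bh_fdr(set_a).sum() + bh_fdr(set_b).sum()   # per-set control: 4 rejections
pooled = bh_fdr(set_a + set_b).sum()                 # pooled control: 5 rejections
```

Here pooling sweeps in a p-value from the mostly null set that per-set control would not reject, which is exactly the ambiguity the review points to.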
(L) Generally, when studying the heritability of a trait, the trait must be defined first. Here, multiple traits are investigated but never rigorously defined; worse, the trait being analyzed changes at every turn.