- Jul 2018
-
europepmc.org
-
On 2015 Sep 17, Daniel Himmelstein commented:
Thanks Dr. Levine for your thoughtful response. As you mention, the practices I criticize in your discovery phase are not unique to your study. Your study caught my attention because of its remarkable finding that such a small sample yielded a highly predictive and robust classifier for such a complex phenotype. Hopefully, others will benefit from our discussion here.
Additionally, I agree that replication grants researchers the freedom to discover as they wish. Suboptimal model training should not cause "type I" replication error, if the replication dataset is independent.
However, a replication p-value alone is insufficient to identify the probability of a model being true. This probability depends on the plausibility of the model. Since I think that the odds are low that your study design could produce a true model, I require extraordinary evidence before accepting the proposed PRS model. I think your replication provides good evidence but not extraordinary.
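The dependence on prior plausibility described above can be sketched with Bayes' rule. This is only an illustration: the prior probabilities, the replication power of 0.8, and the significance level of 0.05 are assumed values chosen for the example, not figures from the study or from this discussion.

```python
def posterior_true(prior, power, alpha):
    """P(model is true | replication is significant), by Bayes' rule.

    prior: assumed probability that the model is true before replication
    power: assumed probability of a significant replication if the model is true
    alpha: probability of a significant replication if the model is false
    """
    true_positive = prior * power
    false_positive = (1.0 - prior) * alpha
    return true_positive / (true_positive + false_positive)


# Hypothetical priors: a plausible model, an implausible one, a long shot.
for prior in (0.5, 0.05, 0.001):
    post = posterior_true(prior, power=0.8, alpha=0.05)
    print(f"prior={prior:<6} posterior={post:.3f}")
```

Under these assumed numbers, the same significant replication leaves a low-prior model far less credible than a high-prior one, which is the sense in which a p-value alone does not settle the question.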
Follow-up studies on different populations will be important for establishing extraordinary evidence. I think it would be helpful to specify which allele is minor for each PRS SNP: my impression is that minor alleles are sample-specific, not population-specific.
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY. -
On 2015 Sep 16, Morgan E Levine commented:
Daniel Himmelstein, thank you for your comments. I will try to address them to the best of my ability.
We acknowledge that the sample size is very small, which we mention in our limitations section of the paper. Because we are studying such a rare phenotype, there is not much that can be done about this. “Long-lived smokers” is a phenotype that has been the subject of a number of our papers, and that we think has strong genetic underpinnings. Despite the small sample size, we decided to go ahead and see if we could detect a signal, since there is evidence to suggest that the genetic influence may be larger for this phenotype compared to many others—something we discuss at length in our introduction section.
To the best of our knowledge, the finding that highly connected genes contain more SNPs has not been published in a peer-reviewed journal. Therefore, we had no way of knowing or evaluating the importance of this for our study. Similarly, we used widely used networks and acknowledge the limitations of these networks in our discussion section. The network you link to was not available at the time this manuscript was accepted.
We acknowledge the likelihood of over-fitting in our PRS, which is probably due to our sample size. This score did validate in two independent samples. Therefore, while it is likely not perfect, we feel that it may still capture some of the true underlying signal. We followed standard protocol for calculating our score (which we reference). In the literature there are many examples of scores that have been generated by linearly combining information from SNPs that are below a given p-value threshold in a GWAS. While not all of these replicate, many do. Our study used very similar methods, but introduced one additional SNP selection criterion: SNPs also had to be in genes that were part of an FI network. I don't think this last criterion would introduce additional bias that would cause a type I error in the replication analysis. However, we still recognize and mention some of the limitations of our PRS. We make no claim that the score is free from error/noise or that it should be used in a clinical setting. In fact, in the paper we suggest future methods that can be used to generate better scores.
We feel we have provided sufficient information for replication of our study. The minor alleles we used are consistent with those reported for CEU populations, which is information that is readily available. Thus, the only information we provide in Table S2 pertains to things specific to our study that cannot be found elsewhere. Lastly, the binning of ages is not 'bizarre' from a biogerontology and longevity research perspective. A number of leaders in the field have hypothesized that the association between genes and lifespan is not linear (variants that influence survival to age 100+ are not the same as variants that influence survival to age 80+). Thus, using a linear model would not be appropriate in this case, and instead we chose to look at survival by age group.
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY. -
On 2015 Sep 15, Daniel Himmelstein commented:
I have several concerns with the discovery phase of this study. Specifically:
Underpowered: The sample size (90 cases) is insufficient for a complex phenotype where effect sizes are small. My own research, Himmelstein DS, 2015, does not consider GWAS with under 1000 samples because any findings are likely to be false (Sawcer S, 2008). The pathway-based prioritization attempts to alleviate the underpowered study design but suffers from the following two criticisms.
SNP abundance confounding: Genes were selected for the network analysis if they contained any SNPs with p < 5×10⁻³. Therefore, genes containing more SNPs were more likely to get selected by chance. Since genes with more SNPs appear more frequently in curated pathway databases, the enrichment analysis is confounded. A permutation test that shuffles case-control status and recomputes SNP p-values would prevent SNP abundance confounding. However, this does not appear to be the permutation test that was performed.
Limited network relevance: The network analysis uses only curated pathway databases. These databases are heavily biased towards well-studied genes as well as being incomplete. In our recent network that includes more curated pathways than Reactome FI, only 9,511 genes are in any curated pathway. In other words, the majority of genes aren't curated to a single pathway and hence cannot contribute to this study's prioritization approach.
Overfitting: The polygenic risk score (PRS) was ruthlessly overfit. The PRS perfectly discriminated the 90 long-lived smokers from the younger smokers. The authors don't appear to appreciate that the performance is due to overfitting and write:
"Results showed that the score completely accounted for group membership, with no overlap between the two groups."
"Not only were scores significantly higher for the long-lived group, but scores also appeared to be more homogeneous."
Egregious overfitting is guaranteed by their PRS approach since 215 logistic regressions are fit, each with only 90 positives and without regularization or cross-validation. When a model is overfit on training data, its performance on novel data diminishes.
Unreplicable: Table S2 of the supplement does not specify which allele is minor for each SNP. Therefore, the PRS computation cannot be replicated by others.
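The SNP abundance point in the list above can be made concrete with a short calculation. As a simplifying assumption that ignores linkage disequilibrium, treat the SNPs in a gene as independent and null, so each has probability 5×10⁻³ of passing the threshold by chance. The SNP counts per gene below are hypothetical.

```python
THRESHOLD = 5e-3  # per-SNP selection threshold quoted in the comment


def chance_selection_prob(n_snps, threshold=THRESHOLD):
    """Probability that at least one of n_snps independent null SNPs
    has p < threshold, i.e. that the gene is selected purely by chance."""
    return 1.0 - (1.0 - threshold) ** n_snps


# Hypothetical gene sizes, measured in number of genotyped SNPs.
for n_snps in (1, 10, 100, 1000):
    print(f"{n_snps:>5} SNPs -> P(selected by chance) = "
          f"{chance_selection_prob(n_snps):.3f}")
```

Under this independence assumption, the chance of selection rises steeply with the number of SNPs in a gene. Real SNPs within a gene are correlated, so the true increase would be slower, but the direction of the bias is the same.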
Given these issues, I find it unlikely that the study found a reliable genotype of longevity. Rather, I suspect the successful validation resulted from confounding, p-value selection bias, or an implementation error.
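One of these issues, overfitting from selecting SNPs and scoring the same small sample, can be illustrated with a rough simulation on pure noise. This is a sketch under stated assumptions, not the study's pipeline: the number of controls (200), the number of candidate SNPs (3000), and the fixed genotype frequencies are invented for illustration, and SNPs are ranked by difference in mean allele count rather than by per-SNP logistic regression. Only the 90 cases and the 215 selected SNPs come from the comment.

```python
import random


def simulate_null_genotypes(n_samples, n_snps, rng):
    """genotypes[j][i] = minor-allele count of SNP j in sample i.
    Genotypes are drawn independently of case status (no true signal)."""
    return [rng.choices([0, 1, 2], weights=[0.49, 0.42, 0.09], k=n_samples)
            for _ in range(n_snps)]


def mean(values):
    return sum(values) / len(values)


def auc(case_scores, control_scores):
    """Fraction of case/control pairs where the case scores higher."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            wins += 1.0 if c > k else 0.5 if c == k else 0.0
    return wins / (len(case_scores) * len(control_scores))


def run(seed=0, n_cases=90, n_controls=200, n_snps=3000, n_selected=215):
    rng = random.Random(seed)
    n = n_cases + n_controls
    train = simulate_null_genotypes(n, n_snps, rng)
    test = simulate_null_genotypes(n, n_snps, rng)
    # The first n_cases samples are labelled cases; labels carry no signal.
    diffs = [mean(g[:n_cases]) - mean(g[n_cases:]) for g in train]
    top = sorted(range(n_snps), key=lambda j: abs(diffs[j]),
                 reverse=True)[:n_selected]
    signs = {j: 1 if diffs[j] > 0 else -1 for j in top}

    def scores(data):
        # Unweighted score: signed sum of allele counts over selected SNPs.
        return [sum(signs[j] * data[j][i] for j in top) for i in range(n)]

    tr, te = scores(train), scores(test)
    return (auc(tr[:n_cases], tr[n_cases:]),
            auc(te[:n_cases], te[n_cases:]))


train_auc, test_auc = run()
print(f"AUC on the discovery sample: {train_auc:.3f}")
print(f"AUC on independent null data: {test_auc:.3f}")
```

Because the labels are unrelated to the genotypes by construction, any separation on the discovery sample here comes from selection alone, and it should not carry over to the independent data.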
Finally, the binning of continuous outcomes, primarily age, is bizarre. The binning serves only to reduce the study's power, while providing much room for unintended p-value selection bias.
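The cost of binning can be sketched with a small simulation. It assumes a linear relationship between a predictor and a continuous outcome, which is exactly the assumption Dr. Levine disputes for longevity, so it shows the loss only under that assumption. The variable names, effect size, and median split are hypothetical choices and do not reproduce the study's age groups.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)
n = 20000
# Hypothetical standardized predictor (e.g. a score) and continuous outcome.
predictor = [rng.gauss(0, 1) for _ in range(n)]
outcome = [0.3 * p + rng.gauss(0, 1) for p in predictor]

# Dichotomize the outcome at its median, as a stand-in for age binning.
cutoff = sorted(outcome)[n // 2]
outcome_binned = [1.0 if o >= cutoff else 0.0 for o in outcome]

r_cont = pearson(predictor, outcome)
r_bin = pearson(predictor, outcome_binned)
print(f"correlation with continuous outcome: {r_cont:.3f}")
print(f"correlation with binned outcome:     {r_bin:.3f}")
```

A weaker correlation for the binned outcome means more samples are needed to detect the same underlying effect, which is the loss of power the comment refers to.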
Update 2015-09-15: I am not suggesting any misconduct, negligence, or intentional bad practices. Rather, the methods are clearly described and the validation is quite impressive and seemingly honest. I believe the study makes a valuable contribution by proposing a genotype of longevity, which future studies can confirm or deny.
Update 2015-09-19: I replaced "p-hacking" with "p-value selection bias". My intended meaning is the greater investigation and preferential publication granted to more significant findings.
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY.
-