On 2015 Sep 15, Daniel Himmelstein commented:
I have several concerns with the discovery phase of this study. Specifically:
Underpowered: The sample size (90 cases) is insufficient for a complex phenotype where effect sizes are small. My own research, Himmelstein DS, 2015, does not consider GWAS with under 1000 samples because any findings are likely to be false (Sawcer S, 2008). The pathway-based prioritization attempts to alleviate the underpowered study design but suffers from the following two criticisms.
SNP abundance confounding: Genes were selected for the network analysis if they contained any SNPs with p < 5×10<sup>-3</sup>. Therefore, genes containing more SNPs were more likely to get selected by chance. Since genes with more SNPs appear more frequently in curated pathway databases, the enrichment analysis is confounded. A permutation test that shuffles case-control status and recomputes SNP p-values would prevent SNP abundance confounding. However, this does not appear to be the permutation test that was performed.
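To make the size of this effect concrete, here is a minimal sketch of how the chance of a gene passing the "any SNP with p < 5×10<sup>-3</sup>" filter grows with its SNP count when there is no true association. It assumes SNP p-values within a gene are independent and uniform under the null, which ignores linkage disequilibrium, so the numbers are illustrative only. The function names and the choice of 20,000 simulated genes are hypothetical and not taken from the study.

```python
import numpy as np

def prob_gene_selected(n_snps, threshold=5e-3):
    """Null probability that at least one of n_snps independent SNPs has p < threshold."""
    return 1.0 - (1.0 - threshold) ** n_snps

def simulate_selection_rate(n_snps, threshold=5e-3, n_genes=20_000, seed=0):
    """Monte Carlo check: fraction of null genes whose smallest SNP p-value passes."""
    rng = np.random.default_rng(seed)
    # Assumption: under the null, each SNP p-value is an independent Uniform(0, 1) draw.
    p_values = rng.random((n_genes, n_snps))
    return float(np.mean(p_values.min(axis=1) < threshold))

if __name__ == "__main__":
    for n_snps in (1, 10, 100):
        print(n_snps, prob_gene_selected(n_snps), simulate_selection_rate(n_snps))
```

Under these assumptions the selection probability rises steadily with SNP count, so any gene property that correlates with SNP count (such as pathway annotation) will look enriched among selected genes even without real signal.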
Limited network relevance: The network analysis uses only curated pathway databases. These databases are heavily biased towards well-studied genes as well as being incomplete. In our recent network that includes more curated pathways than Reactome FI, only 9,511 genes are in any curated pathway. In other words, the majority of genes aren't curated to a single pathway and hence cannot contribute to this study's prioritization approach.
Overfitting: The polygenic risk score (PRS) was ruthlessly overfit. The PRS perfectly discriminated the 90 long-lived smokers from the younger smokers. The authors don't appear to appreciate that the performance is due to overfitting and write:
Results showed that the score completely accounted for group membership, with no overlap between the two groups.
Not only were scores significantly higher for the long-lived group, but scores also appeared to be more homogeneous.
Egregious overfitting is guaranteed by their PRS approach since 215 logistic regressions are fit, each with only 90 positives and without regularization or cross-validation. When a model is overfit on training data, its performance on novel data diminishes.
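The following simulation sketches why near-perfect in-sample discrimination is expected even when no SNP has any real effect. It is not the authors' pipeline. The control group size (90), the number of candidate SNPs (5,000), and the allele frequency (0.3) are assumptions chosen for illustration, and a simple case-control mean difference stands in for each per-SNP logistic regression coefficient. Only the 90 cases and the 215 SNPs in the score come from the study.

```python
import numpy as np

def auc(case_scores, control_scores):
    """Probability that a random case outscores a random control (ties count half)."""
    diff = case_scores[:, None] - control_scores[None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

def simulate_null_prs(n_cases=90, n_controls=90, n_snps=5000,
                      n_selected=215, maf=0.3, seed=0):
    """Build a risk score from pure-noise genotypes; return (training AUC, holdout AUC)."""
    rng = np.random.default_rng(seed)
    n = n_cases + n_controls
    is_case = np.arange(n) < n_cases  # labels are unrelated to genotype by construction

    # Genotypes are minor-allele counts (0, 1, 2), drawn independently of case status.
    train = rng.binomial(2, maf, size=(n, n_snps)).astype(float)
    holdout = rng.binomial(2, maf, size=(n, n_snps)).astype(float)

    # Per-SNP weight fit on the training data only (stand-in for a logistic coefficient).
    weights = train[is_case].mean(axis=0) - train[~is_case].mean(axis=0)

    # Keep the SNPs that look most associated in the same training data.
    top = np.argsort(np.abs(weights))[-n_selected:]

    train_scores = train[:, top] @ weights[top]
    holdout_scores = holdout[:, top] @ weights[top]
    return (auc(train_scores[is_case], train_scores[~is_case]),
            auc(holdout_scores[is_case], holdout_scores[~is_case]))

if __name__ == "__main__":
    train_auc, holdout_auc = simulate_null_prs()
    print(f"training AUC: {train_auc:.3f}, holdout AUC: {holdout_auc:.3f}")
```

Because the SNPs are selected and weighted on the same samples that are then scored, the training AUC should approach 1 while the holdout AUC stays near 0.5. Evaluating the score only on samples that played no role in selection or weighting, for example through cross-validation, would expose the gap.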
Unreplicable: Table S2 of the supplement does not specify which allele is minor for each SNP. Therefore, the PRS computation cannot be replicated by others.
Given these issues, I find it unlikely that the study found a reliable genotype of longevity. Rather, I suspect the successful validation resulted from confounding, p-value selection bias, or an implementation error.
Finally, the binning of continuous outcomes, primarily age, is bizarre. The binning serves only to reduce the study's power, while providing much room for unintended p-value selection bias.
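As a rough illustration of the power cost, the sketch below compares the correlation between a trait and a continuous predictor with the correlation after the predictor is split at its median. The normally distributed "age" variable, the true correlation of 0.3, and the median split are all assumptions for the example and do not reproduce the study's actual bins.

```python
import numpy as np

def correlation_before_and_after_binning(n=100_000, true_r=0.3, seed=0):
    """Return (correlation with continuous age, correlation with median-split age)."""
    rng = np.random.default_rng(seed)
    age = rng.standard_normal(n)  # standardized continuous predictor (assumed normal)
    noise = rng.standard_normal(n)
    trait = true_r * age + np.sqrt(1.0 - true_r ** 2) * noise

    # Dichotomize at the median: 1.0 for the older half, 0.0 for the younger half.
    older_half = (age > np.median(age)).astype(float)

    r_continuous = float(np.corrcoef(age, trait)[0, 1])
    r_binned = float(np.corrcoef(older_half, trait)[0, 1])
    return r_continuous, r_binned

if __name__ == "__main__":
    r_continuous, r_binned = correlation_before_and_after_binning()
    print(f"continuous: {r_continuous:.3f}, binned: {r_binned:.3f}")
```

For a normally distributed predictor, a median split attenuates the correlation by a factor of about sqrt(2/π) ≈ 0.80, so the same sample detects the association less reliably. Each additional choice of cut point also adds an analysis decision that could be tuned toward significance.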
Update 2015-09-15: I am not suggesting any misconduct, negligence, or intentional bad practices. Rather the methods are clearly described and the validation quite impressive and seemingly honest. I believe the study makes a valuable contribution by proposing a genotype of longevity, which future studies can confirm or deny.
Update 2015-09-19: I replaced "p-hacking" with "p-value selection bias". My intended meaning is the greater investigation and preferential publication granted to more significant findings.
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY.