- Jul 2018
europepmc.org
On 2014 Oct 11, Igor Zwir commented:
Pre-selected SNPs
The list of 2,891 pre-selected SNPs from the MGS study (1) that was used in (2) (see supplemental material) is available at http://m4m.ugr.es/resources/2891-SNP-PreSelection.txt. This list contains 99% of the SNPs reported in the supplemental material of (1).
(1) Shi J, Levinson DF, Duan J, Sanders AR, Zheng Y, Pe'er I, Dudbridge F, Holmans PA, Whittemore AS, Mowry BJ, Olincy A, Amin F, Cloninger CR, Silverman JM, Buccola NG, Byerley WF, Black DW, Crowe RR, Oksenberg JR, Mirel DB, Kendler KS, Freedman R, Gejman PV. Common variants on chromosome 6p22.1 are associated with schizophrenia. Nature. 2009;460(7256):753-7.
(2) Arnedo J, Svrakic DM, del Val C, Romero‑Zaliz R, Hernández-Cuervo H, Molecular Genetics of Schizophrenia Consortium, et al. Uncovering the Hidden Risk Architecture of the Schizophrenias: Confirmation in Three Independent Genome-Wide Association Studies. American J of Psychiatry 2014 (in press);172:1–15. Epub 9/15/2014.
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY. -
On 2014 Oct 07, Igor Zwir commented:
Interdisciplinary work, not so fast … Geneticists with a statistics background vs. engineers/computer scientists with a genetics background
Unfortunately, interdisciplinary work has been stimulated and frozen at the same time. Much of this contradiction is due to the lack of universal reviewers who know about everything. Science is specialized, in terms of education and consequently in terms of results. For example, most studies of complex disorders focus on single sources of data (genetics, neuroimages), eliminating the possibility of having complementary perspectives on the patients. In addition, communication between biomedical researchers and mathematics or computer science investigators is disrupted. Biologists and physicians usually search for articles in PubMed (http://www.ncbi.nlm.nih.gov/pubmed), whereas most engineers look to the communications of the Institute of Electrical and Electronics Engineers (IEEE) (https://www.ieee.org), and many of them ignore PubMed. Conversely, many molecular biologists do not know that IEEE exists or what it represents. Few IEEE publications are available in PubMed, and most of them are old. However, many of those that are not in PubMed are much more rigorous than any method proposed in the best biomedical journals. This comment encourages both PubMed and IEEE to resolve their differences, which in turn will help to produce more, and better informed, reviewers for novel techniques in the post-genomic era.
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY. -
On 2014 Oct 07, Igor Zwir commented:
Rationale of the method applied in Arnedo et al.
Clustering methods have long been applied to pattern recognition problems in a variety of fields, and they are embedded in most of our devices, from a simple refrigerator to a Boeing engine. In the biomedical sciences there is a tendency to assume that only a few classes of clustering exist (e.g., hierarchical clustering or K-means). However, data mining, knowledge discovery, and machine learning are fields of computer science with very rich and diverse methods, both theoretical and empirical. Much of the power of these methods comes from fusing them, taking advantage of the strengths of each one.
The methodology and framework proposed in Arnedo et al. encode a generalized factorization method (GFM) that was designed to identify structural patterns or clusters (substructures) that characterize complex objects (1-8) embedded in databases (6, 7, 9, 10). Unfortunately, these kinds of complex problems cannot be solved by a single clustering method; they require combining the advantages of many of them, as implemented in the GFM and summarized below.
First, we utilized the notion of model-based clusters (5, 11, 12), where the centroid/prototype of a cluster is not a point but a model (e.g., line segments, ellipsoids (5, 13, 14), or see the C-varieties algorithm (12)), or a family of models (8), defined as a relation between a cohesive subset of points that meet certain constraints (e.g., relational clustering (12), undirected graph (12)). When the underlying equations of the models are unknown, they can be estimated from the data by identifying relations between subsets of observations that show similar patterns under a specific subset of relevant descriptive features (attributes) (4, 5, 15). Here we proposed that in a purely data-driven analysis these unknown models (grey boxes) can be learned as local relationships between subsets of features and observations, which are termed biclusters. The biclusters are often represented as sub-matrices (4, 5, 15). Other models proposed in this work (black boxes) consist of those that are encoded by a particular method devoted to recognizing certain classes of patterns.
Second, to identify a family of models (8) or biclusters we used NMF (16) factorization algorithms (tensor analysis), which can uncover cohesive data represented as sub-matrices (factors). When sorted and thresholded, each sub-matrix represents a bicluster. Notably, biclustering produces several local models of the data (17), each one capturing intrinsic relationships between subsets of observations that share subsets of features rather than all features. Diverse biclusters, rather than a uniform explanation of the data, provide a better understanding of the described object, which is diluted when all features are used together in the single global model of the data typically produced by clustering methods (17).
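As a rough illustration of the "factorize, then threshold" idea described above, here is a minimal sketch in plain Python. It is an assumption-laden toy, not the procedure used in the paper: it uses the basic multiplicative-update NMF rather than the nonsmooth variant cited in (16), and the function names, the 0.5 threshold fraction, and the toy matrix are all hypothetical.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, k, iterations=500, seed=0, eps=1e-9):
    """Basic multiplicative-update NMF (illustrative; not nsNMF)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # Update H: H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # Update W: W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h

def biclusters(w, h, fraction=0.5):
    """Threshold each factor: keep rows/columns whose loading is at least
    `fraction` of that factor's maximum (hypothetical thresholding rule)."""
    result = []
    for a in range(len(h)):
        col = [row[a] for row in w]
        rows = [i for i, x in enumerate(col) if x >= fraction * max(col)]
        cols = [j for j, x in enumerate(h[a]) if x >= fraction * max(h[a])]
        result.append((rows, cols))
    return result

# Toy matrix: rows are subjects, columns are features, with two clean blocks.
v = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
w, h = nmf(v, k=2)
print(sorted(biclusters(w, h)))
```

Each returned pair lists the subjects (rows) and features (columns) of one sub-matrix, which is how a thresholded factor can be read as a bicluster.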
Third, we proposed in this work that biclusters can be fuzzy (overlapping, see Fuzzy clustering (18)), where voxels and/or subjects can belong to more than one bicluster, as well as possibilistic (non-exhaustive partitions, see Possibilistic clustering (12)), where certain voxels and/or subjects may not be assigned to any bicluster, so that outliers can be detected.
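The difference between fuzzy (overlapping) and possibilistic (non-exhaustive) assignment can be sketched in a few lines. This is only an illustration under assumed inputs: the membership degrees, the 0.3 threshold, and the function name are hypothetical and do not come from the paper.

```python
def assign_memberships(memberships, threshold=0.3):
    """Assign each subject to every bicluster whose membership degree
    reaches the threshold; subjects assigned to none are outliers."""
    assigned = {
        subject: [b for b, degree in enumerate(degrees) if degree >= threshold]
        for subject, degrees in memberships.items()
    }
    outliers = [subject for subject, bs in assigned.items() if not bs]
    return assigned, outliers

# Hypothetical membership degrees of three subjects in two biclusters.
memberships = {
    "s1": [0.9, 0.1],   # belongs to bicluster 0 only
    "s2": [0.6, 0.5],   # overlaps both biclusters (fuzzy)
    "s3": [0.1, 0.05],  # belongs to none (possibilistic outlier)
}
assigned, outliers = assign_memberships(memberships)
print(assigned, outliers)
```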
Fourth, we demonstrated that biclusters could be identified without choosing a priori a particular number of clusters (11, 13, 14, 19). Setting this parameter to a small number may generate few but large data partitions, here defined as general biclusters, that may underfit the data. In contrast, using a large number of clusters will generate many clusters with partitions of small size (i.e., more granular partitions), here defined as specific biclusters, that may overfit the data (1, 5, 11, 13, 14, 19-21). Although there are many different validity indices (12, 18) that suggest the best number of clusters for a given dataset, they often produce contradictory results (11, 13, 14, 19). This problem is exacerbated when Centroid-based (e.g., k-means, NMF) or Distribution-based clustering (e.g., EM) is utilized, because two successive numbers of clusters may generate completely different cluster arrangements (see (12, 18) for a review). Instead of fighting the lost battle of uncovering a single optimal number of clusters, we calculated biclusters from partitions generated by all numbers of clusters between 2 and √n (12), where n is the number of observations (subjects). We postponed the selection of the best clusters until all partitions had been examined, and then selected a set of clusters that together provide an optimal description of the sample. These clusters could be chosen from different partitions that were generated by different numbers of clusters (see Consensus clustering (1, 5, 11, 13, 14, 19-21)).
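The sweep over cluster numbers can be made concrete with a small sketch. It assumes that "between 2 and √n" means every integer from 2 up to the floor of √n; the function name is hypothetical.

```python
import math

def candidate_cluster_counts(n_subjects):
    """All numbers of clusters from 2 up to floor(sqrt(n)), inclusive.
    Rounding sqrt(n) down is an assumption of this sketch."""
    return list(range(2, math.isqrt(n_subjects) + 1))

# With 100 subjects, partitions would be generated for k = 2, 3, ..., 10,
# and clusters from all of them pooled before any selection is made.
print(candidate_cluster_counts(100))
```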
Fifth, formulating the bicluster identification problem over different partitions may produce redundant, excessively specific and/or excessively general biclusters (4, 5, 15, 22, 23). To select optimal biclusters, we proposed a GFM that performs the following Consensus clustering strategy: (i) It eliminates all redundant biclusters by comparing the similarity between their underlying sub-matrices using Hypergeometric statistics (see Methods and Material in Supplement 1) and selecting only one representative bicluster from each set of equivalent sub-matrices. Biclusters harboring <10% of the total number of subjects were not considered, to avoid a tendency toward singleton biclusters. (ii) From the non-redundant biclusters, the GFM selects all biclusters that are optimal in terms of specificity, generality and diversity. To do so, it selects all non-dominated biclusters, that is, those that are not worse than (i.e., not dominated by) another bicluster in both objectives: specificity and generality (4, 5, 15, 22, 23). Moreover, to ensure diversity, the GFM applies the non-dominance criterion only among biclusters within a neighborhood (22, 23) (i.e., biclusters with a relatively high overlap of subjects and voxels). This Multiobjective and Multimodal optimization process is analogous to Minimum Description-Length methods (24) and was carried out as described in (1-8). (iii) The remaining optimal biclusters are hierarchically organized by inclusion of their subjects into connected or disjoint hierarchies (i.e., subgraphs or subnetworks).
Sixth, after the first step of identifying interesting patterns describing objects embedded in the data, we went one step further by also finding characteristic descriptions for each group, where each group represents a concept or class, using Conceptual clustering. In this work we incorporated factors such as the generality and simplicity of the derived bicluster descriptions by (1) validating the relational clustering against independent data-driven descriptions obtained in other domains of knowledge, and (2) incorporating the previously structured knowledge into predictive multiclassifiers that mix machine learning and multiobjective optimization techniques.
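The size filter and the non-dominance selection in steps (i) and (ii) can be sketched as follows. This is a simplified illustration, not the GFM itself: the specificity and generality scores are hypothetical inputs (the comment does not define how they are computed), standard Pareto dominance is assumed, and the neighborhood restriction used to preserve diversity is omitted.

```python
def drop_small(biclusters, total_subjects, min_fraction=0.10):
    """Discard biclusters harboring fewer than 10% of all subjects."""
    return [b for b in biclusters
            if len(b["subjects"]) >= min_fraction * total_subjects]

def dominates(a, b):
    """a dominates b if it is at least as good in both objectives
    and strictly better in at least one (standard Pareto dominance)."""
    return (a["specificity"] >= b["specificity"]
            and a["generality"] >= b["generality"]
            and (a["specificity"] > b["specificity"]
                 or a["generality"] > b["generality"]))

def non_dominated(biclusters):
    """Keep every bicluster that no other bicluster dominates."""
    return [b for b in biclusters
            if not any(dominates(other, b) for other in biclusters)]

# Hypothetical candidates drawn from partitions with different cluster counts.
candidates = [
    {"name": "A", "subjects": set(range(12)), "specificity": 0.9, "generality": 0.2},
    {"name": "B", "subjects": set(range(30)), "specificity": 0.5, "generality": 0.5},
    {"name": "C", "subjects": set(range(25)), "specificity": 0.4, "generality": 0.4},
    {"name": "D", "subjects": set(range(60)), "specificity": 0.2, "generality": 0.9},
    {"name": "E", "subjects": set(range(3)),  "specificity": 1.0, "generality": 0.1},
]
kept = non_dominated(drop_small(candidates, total_subjects=100))
print([b["name"] for b in kept])
```

In this toy example E is removed by the size filter and C is dominated by B, so neither the most specific nor the most general candidate is favored outright: the whole trade-off front is kept.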
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY. -
On 2014 Oct 07, Igor Zwir commented:
References
1. Cordon O, Herrera F, Zwir I (2002): Linguistic modeling by hierarchical systems of linguistic rules. IEEE Trans Fuzzy Syst. 10:2-20.
2. Ruspini EH, Zwir I (2002): Automated generation of qualitative representations of complex objects by hybrid soft-computing methods. In: Pal SK, Pal A, editors. Pattern recognition: from classical to modern approaches. New Jersey: World Scientific, pp 454-474.
3. Cordon O, Herrera F, Zwir I (2003): A hierarchical knowledge-based environment for linguistic modeling: models and iterative methodology. Fuzzy Set Syst. 138:307-341.
4. Zwir I, Shin D, Kato A, Nishino K, Latifi T, Solomon F, et al. (2005): Dissecting the PhoP regulatory network of Escherichia coli and Salmonella enterica. Proc Natl Acad Sci U S A. 102:2862-2867.
5. Zwir I, Huang H, Groisman EA (2005): Analysis of Differentially-Regulated Genes within a Regulatory Network by GPS Genome Navigation. Bioinformatics. 21:4073-4083.
6. Romero-Zaliz R, C. Rubio R, Cordón O, Cobb P, Herrera F, Zwir I (2008): A multi-objective evolutionary conceptual clustering methodology for gene annotation within structural databases: a case of study on the gene ontology database. IEEE Transactions on Evolutionary Computation. 12(6):679-701.
7. Romero-Zaliz R, Del Val C, Cobb JP, Zwir I (2008): Onto-CC: a web server for identifying Gene Ontology conceptual clusters. Nucleic Acids Research. 36:W352-357.
8. Harari O, Park SY, Huang H, Groisman EA, Zwir I (2010): Defining the plasticity of transcription factor binding sites by Deconstructing DNA consensus sequences: the PhoP-binding sites among gamma/enterobacteria. PLoS Computational Biology. 6:e1000862.
9. Cheeseman P, Stanford University (1990): Bayesian learning. [Palo Alto, Calif.]: Board of Trustees of the Leland Stanford Junior University.
10. Cook DJ, Holder LB, Su S, Maglothin R, Jonyer I (2001): Structural mining of molecular biology data. IEEE Eng Med Biol Mag. 20:67-74.
11. Fraley C, Raftery AE (1998): How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal. 41:578-588.
12. Bezdek JC (1998): Pattern Analysis. In: Pedrycz W, Bonissone PP, Ruspini EH, editors. Handbook of Fuzzy Computation. Bristol: Institute of Physics, pp F6.1.1-F6.6.20.
13. Bittner T, Smith B (2003): A theory of granular partitions. Foundations of geographic information science. 117-151.
14. Fred AL, Jain AK (2005): Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 27:835-850.
15. Arnedo J, del Val C, de Erausquin GA, Romero-Zaliz R, Svrakic D, Cloninger CR, et al. (2013): PGMRA: a web server for (phenotype x genotype) many-to-many relation analysis in GWAS. Nucleic Acids Research. 75.
16. Pascual-Montano A, Carazo JM, Kochi K, Lehmann D, Pascual-Marqui RD (2006): Nonsmooth nonnegative matrix factorization (nsNMF). IEEE Transactions on Pattern Analysis and Machine Intelligence. 28:403-415.
17. Madeira SC, Oliveira AL (2004): Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 1:24-45.
18. Bezdek JC, Pal SK, IEEE Neural Networks Council (1992): Fuzzy models for pattern recognition: methods that search for structures in data. New York: IEEE Press.
19. Latorre Carmona P, Sánchez JS, Fred ALN (2013): Mathematical Methodologies in Pattern Recognition and Machine Learning: Contributions from the International Conference on Pattern Recognition Applications and Methods, 2012. Springer Proceedings in Mathematics & Statistics. New York, NY: Springer.
20. Kim EY, Kim SY, Ashlock D, Nam D (2009): MULTI-K: accurate classification of microarray subtypes using ensemble k-means clustering. BMC Bioinformatics. 10:260.
21. Saeed F, Salim N, Abdo A (2012): Voting-based consensus clustering for combining multiple clusterings of chemical structures. Journal of Cheminformatics. 4:37.
22. Deb K (2001): Nonlinear goal programming using multi-objective genetic algorithms. J Oper Res Soc. 52:291-302.
23. Deb K (2001): Multi-objective optimization using evolutionary algorithms. 1st ed. Chichester; New York: John Wiley & Sons.
24. Rissanen J (1989): Stochastic complexity in statistical inquiry. Singapore: World Scientific.
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY. -
On 2014 Oct 06, C. Robert Cloninger commented:
References
1. Arnedo J, Svrakic DM, Del Val C, Romero-Zaliz R, Hernandez-Cuervo H, Fanous AH, et al. Uncovering the Hidden Risk Architecture of the Schizophrenias: Confirmation in Three Independent Genome-Wide Association Studies. The American journal of psychiatry. 2014.
2. Arnedo J, del Val C, de Erausquin GA, Romero-Zaliz R, Svrakic D, Cloninger CR, et al. PGMRA: A web server for (Phenotype X Genotype) many-to-many relation analysis in GWAS. Nucleic Acids Res. 2013(Web Server issue). PMCID: 2447763.
3. Risch N. Linkage strategies for genetically complex traits. I. Multilocus models. American journal of human genetics. 1990;46(2):222-8. PMCID: 1684987.
4. Biological insights from 108 schizophrenia-associated genetic loci. Nature. 2014;511(7510):421-7. PMCID: 4112379.
5. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. American journal of human genetics. 2007;81(3):559-75. PMCID: 1950838.
6. Shi J, Levinson DF, Duan J, Sanders AR, Zheng Y, Pe'er I, et al. Common variants on chromosome 6p22.1 are associated with schizophrenia. Nature. 2009;460(7256):753-7. PMCID: 2775422.
7. Graves JA. Review: Sex chromosome evolution and the expression of sex-specific genes in the placenta. Placenta. 2010;31 Suppl:S27-32.
8. Reich DE, Cargill M, Bolk S, Ireland J, Sabeti PC, Richter DJ, et al. Linkage disequilibrium in the human genome. Nature. 2001;411(6834):199-204.
9. Slatkin M. Linkage disequilibrium--understanding the evolutionary past and mapping the medical future. Nat Rev Genet. 2008;9(6):477-85.
10. Wiehe T, Slatkin M. Epistatic selection in a multi-locus Levene model and implications for linkage disequilibrium. Theor Popul Biol. 1998;53(1):75-84.
11. Pritchard JK, Donnelly P. Case-control studies of association in structured or admixed populations. Theor Popul Biol. 2001;60(3):227-37.
12. Koch E, Ristroph M, Kirkpatrick M. Long range linkage disequilibrium across the human genome. PLoS One. 2013;8(12):e80754. PMCID: 3861250.
13. Wright S. The shifting balance theory and macroevolution. Annual review of genetics. 1982;16:1-19.
14. Wu MC, Kraft P, Epstein MP, Taylor DM, Chanock SJ, Hunter DJ, et al. Powerful SNP-set analysis for case-control genome-wide association studies. Am J Hum Genet. 2010;86(6):929-42. PMCID: 3032061.
15. Zwir I, Shin D, Kato A, Nishino K, Latifi T, Solomon F, et al. Dissecting the PhoP regulatory network of Escherichia coli and Salmonella enterica. Proc Natl Acad Sci U S A. 2005;102(8):2862-7.
16. Zwir I, Huang H, Groisman EA. Analysis of Differentially-Regulated Genes within a Regulatory Network by GPS Genome Navigation. Bioinformatics. 2005;21(22):4073-83.
17. Romero-Zaliz R, Del Val C, Cobb JP, Zwir I. Onto-CC: a web server for identifying Gene Ontology conceptual clusters. Nucleic Acids Res. 2008;36(Web Server issue):W352-7.
18. Romero-Zaliz R, C. Rubio R, Cordón O, Cobb P, Herrera F, Zwir I. A multi-objective evolutionary conceptual clustering methodology for gene annotation within structural databases: a case of study on the gene ontology database. IEEE Transactions on Evolutionary Computation. 2008;12(6):679-701.
19. Harari O, Park SY, Huang H, Groisman EA, Zwir I. Defining the plasticity of transcription factor binding sites by Deconstructing DNA consensus sequences: the PhoP-binding sites among gamma/enterobacteria. PLoS Comput Biol. 2010;6(7):e1000862. PMCID: 2908699.
20. Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature. 1999;401(6755):788-91.
21. Mejia-Roa E, Carmona-Saez P, Nogales R, Vicente C, Vazquez M, Yang XY, et al. bioNMF: a web-based tool for nonnegative matrix factorization in biology. Nucleic Acids Res. 2008;36(Web Server issue):W523-8. PMCID: 2447803.
22. Pascual-Montano A, Carmona-Saez P, Chagoyen M, Tirado F, Carazo JM, Pascual-Marqui RD. bioNMF: a versatile tool for non-negative matrix factorization in biology. BMC Bioinformatics. 2006;7:366. PMCID: 1550731.
23. Tamayo P, Scanfeld D, Ebert BL, Gillette MA, Roberts CW, Mesirov JP. Metagene projection for cross-platform, cross-species characterization of global transcriptional states. Proc Natl Acad Sci U S A. 2007;104(14):5959-64. PMCID: 1838404.
24. Cichocki A. Nonnegative matrix and tensor factorizations: applications to exploratory multi-way data analysis and blind source separation. Chichester, U.K.: John Wiley; 2009.
25. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM. Systematic determination of genetic network architecture. Nat Genet. 1999;22(3):281-5.
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY. -
On 2014 Oct 06, C. Robert Cloninger commented:
Conclusion
We appreciate the opportunity to clarify the fundamental differences between the assumptions and goals of traditional GWAS and our novel approach, which addresses the complexity of common disorders with sophisticated and well-validated machine-learning and data-mining methods. We hope that the profound differences between the approaches with which Breen and colleagues are familiar and those developed by us will stimulate greater understanding of the challenges faced by the fields of psychiatric and medical genetics. We recognize that this new approach will cause a period of reexamination of standard methodology in this field, but every major advance in genetics, and in all of science for that matter, has required flexibility and creative thinking. There are always things that we can improve upon in any method, and we recognize that many incremental improvements are essential for the advance of science.
We have put forth a new data-driven method that allows complex genotypic-phenotypic relations to be uncovered when they are present, without imposing their presence as an a priori assumption. The relationships we uncovered are in fact highly complex, which allowed us to identify individuals at high risk and to associate specific SNP clusters with specific clinical syndromes despite the presence of extensive pleiotropy and heterogeneity. This approach, like all those that have preceded it, is undoubtedly imperfect; it will require refinement and may ultimately give way to yet another approach that will explain more. Such methodological evolution is nothing more than the typical course of advancement in science. We hope that these exciting developments will lead to new ways to push the boundaries of accepted science, and help us to question a priori assumptions that restrict our understanding of all the information embedded in data.
If this discussion has shown us nothing else, it is that this process of questioning and reflection has already begun. Ultimately, beyond all of the technical issues, our main goal is to help those in need. With schizophrenia, we know the need is great from the tremendous outpouring of requests for guidance and help that we have received, and we know that there are many people with other diseases who may benefit from our new approach. We can all be comforted knowing that our debate can bring us closer to doing what we are really here to do – that is, helping those suffering from debilitating diseases and finding ways to promote their health and well-being. Whatever path leads us there is worth considering. So let us not permit our philosophical or scientific differences to prevent us from allowing for a sufficient diversity in our tactics, because we never know what path will lead us towards our common goals of improving health and reducing the burden of disease.
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY. -
On 2014 Oct 06, C. Robert Cloninger commented:
The Facts about Replication and Significance Testing of SNP Selection
B-S expressed concern about our replication process. Of course, in traditional GWAS replication has always been a serious problem, which is the rationale for the PGC carrying out meta-analyses of large collections of samples despite their heterogeneity and limited phenotypic description. It was most challenging for us to identify samples with adequate clinical description to apply our novel approach, but the reward was in identifying strong effects that replicated consistently across three samples, including the Portuguese Islands study, which used the same diagnostic instrument in a specific ethnic sample. The samples were independently recruited and independently analyzed, as we stated clearly in the published report. SNP sets, phenotypic sets and associations were calculated separately for the three samples to avoid weighted or biased aggregations. Then, we used a well-known co-clustering test based on the hypergeometric distribution to establish the replicability of results from one sample in the others. This test has been used widely in molecular biology (15, 16, 25) and as a general strategy for validating clusters; for example, it has been implemented in software packages such as TIBCO/Spotfire. Thus we feel that the concerns expressed by B-S about replication are overstated and empirically unfounded. Again, the strength of this new approach is that it allows us to avoid some of the major problems that plague traditional GWAS approaches.
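For readers unfamiliar with hypergeometric overlap tests, the following sketch shows the general form of such a calculation: the probability that two sets drawn from a common pool share at least the observed number of elements by chance. It is a generic illustration with hypothetical names and toy numbers, not the exact test configuration used in the published analysis.

```python
from math import comb

def hypergeom_overlap_pvalue(population, size_a, size_b, overlap):
    """P(X >= overlap), where X is the number of elements shared by a fixed
    set of size_a and a random set of size_b drawn from `population` items."""
    upper = min(size_a, size_b)
    tail = sum(comb(size_a, k) * comb(population - size_a, size_b - k)
               for k in range(overlap, upper + 1))
    return tail / comb(population, size_b)

# Toy example: two sets of 5 items from a pool of 10 that coincide completely.
p_value = hypergeom_overlap_pvalue(population=10, size_a=5, size_b=5, overlap=5)
print(p_value)
```

A small tail probability indicates that the overlap between a set found in one sample and a set found in another is unlikely to have arisen by chance.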
B-S also expressed their concern about the use of a permutation test, claiming that "because SNP sets differ in allele frequency between cases and controls, this procedure does not generate a valid null distribution". The permutation test was used not to establish the significance of the SNP sets, which was evaluated by the SKAT method (14), but rather to test the validity (and approximate probability) of the association between SNP sets and symptom sets. Controls were not used in this test at all, as they have no symptoms of psychosis. Moreover, these symptoms were not even evaluated in the reported inventories. The misunderstanding of B-S is probably due to a lack of familiarity with this new statistical procedure, which highlights the previously discussed difficulties people have when first trying to understand a novel approach.
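The general shape of a label-permutation test restricted to cases can be sketched as follows. This is an assumption-based illustration only: the comment does not specify the test statistic or the permutation scheme, so the co-membership count used here, the function name, and the toy data are all hypothetical.

```python
import random

def permutation_pvalue(in_snp_set, in_symptom_set, n_permutations=2000, seed=0):
    """Estimate how often a random reassignment of symptom-set membership
    among cases yields at least the observed number of co-memberships."""
    rng = random.Random(seed)
    observed = sum(a and b for a, b in zip(in_snp_set, in_symptom_set))
    shuffled = list(in_symptom_set)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # permute symptom labels across cases only
        if sum(a and b for a, b in zip(in_snp_set, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical data for 40 cases; no controls are involved.
snp_members = [True] * 10 + [False] * 30
symptom_members_same = [True] * 10 + [False] * 30                   # full overlap
symptom_members_disjoint = [False] * 10 + [True] * 10 + [False] * 20  # no overlap
p_same = permutation_pvalue(snp_members, symptom_members_same)
p_disjoint = permutation_pvalue(snp_members, symptom_members_disjoint)
print(p_same, p_disjoint)
```

Because only cases are permuted, differences in allele frequency between cases and controls do not enter the null distribution in a scheme of this kind.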
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY. -
On 2014 Oct 06, C. Robert Cloninger commented:
The Facts about Linkage Disequilibrium
B-S also expressed concern that our SNP sets may be merely artifacts of blocks of markers in linkage disequilibrium (LD). LD is strictly defined as the non-random association of alleles of neighboring polymorphisms derived from single ancestral chromosomes, but some broad measures of LD extend the concept to include covariation of polymorphisms that are not linked, including even associations among genetic variants on separate chromosomes (8). Many variables influence covariation of polymorphisms, including demographic variables (admixture, population size, migration, ancestral population bottlenecks), selection (including epistasis), and variation in recombination rates in different parts of the genome (9, 10). Consequently, LD is a serious problem for the group-wise statistical approach of traditional GWAS, so care is taken to analyze groups with distinct ancestries separately in order to help disentangle different causes of association. However, in our novel person-centered approach, the identification of subpopulations is an intrinsic aspect of identifying the genotypic-phenotypic architecture. We identify sets of variables that naturally cluster within individual subjects, as measured by covariance of polymorphisms or phenotypic traits within particular subgroups of individuals in a population. We identify SNP sets and phenotypic sets independently of one another and then test how these independently identified sets fit together like a lock and a key. In a strictly data-driven manner, the hidden structure of the overall population is decomposed into subpopulations of subjects to allow valid tests of genotypic-phenotypic association despite admixture in the total population from which SNP sets and phenotypic sets are extracted (11).
We allow the possibility that some constituent SNPs of a particular set may be associated (in LD) as adventitious hitchhikers that are closely linked or may be epistatic sets that are functionally adaptive and maintained by selection pressure even though they are unlinked (12). However, being linked (co-localized) or in LD is neither a necessary nor a sufficient condition for being a constituent in a SNP set: set membership depends on co-variation of polymorphisms in particular subpopulations of individuals whether the genetic variants are in LD in the total population or not. LD is actually one way that epistatic sets of genetic variants can be maintained in functionally adaptive blocks if the epistatic selection is strong, but most interactive sets of genetic variants are not in LD. Accordingly, we uncover constituents of SNP sets regardless of their LD status as candidates for functionally adaptive epistatic sets. Then we measure their potential functional interaction by testing for their differential association with phenotypic variability. We also consider the known function of the genes and regulatory sites as part of our analysis of the complex pathway from genotypic networks to distinct clinical syndromes. Thus we jointly utilize genotypic, biological, and phenotypic information as part of an integrated systems analysis that allows for observed or hidden stratification in the total population.
In addition to this fundamental difference in conceptual and procedural approach to LD, the concerns of B-S about our findings are simply unfounded empirically. In total, approximately 2/3 of the SNPs in high-risk sets map to regions that are so far apart in genomic distance (greater than 100,000 base pairs) that they are highly unlikely to be in LD. We found that 9 of 42 high-risk SNP sets have some SNPs located on different chromosomes. These facts indicate that the identified SNP sets are not the result of particular genomic constraints such as LD or being within the region of the same gene. In any case, the presence of LD would not explain or invalidate the association of groups of SNPs within a particular SNP set with a particular phenotypic set. For example, one of our SNP sets maps exclusively to SNPs upstream of the NTRK3 gene, which was also found to be strongly associated with schizophrenia by standard GWAS techniques published by the authors of the commentary. In addition, SNPs from another SNP set map inside the same gene. Each of these SNP sets involving different components of the NTRK3 gene is associated with different symptoms. Although LD is viewed as a statistical problem for traditional GWAS, in our person-centered approach it is viewed as the result of adaptive mechanisms that can conserve the functional connectivity of epistatic sets of genetic variants, thereby contributing to the differential development of individuals in subpopulations. The functional adaptation facilitated by gene-gene interactions is fundamentally important for healthy development of individuals and for the evolution of populations, as described in Sewall Wright’s classical work on complex adaptive systems and evolution (13). Our concurrent consideration of the functions of gene products and associations between different genotypic networks with specific phenotypic syndromes precludes any suggestion that the highly replicable effects we observed are artifacts.
B-S have also suggested that matrix factorization algorithms similar to the methods we employed have been used to identify regions with long-range LD. This is certainly true and is not a problem in itself. Long-range LD is an indicator of functional connectivity that is not adequately explained by physical proximity, so it is included in what we want to detect in order to account for gene-gene interactions thoroughly (14). Matrix factorization methods like ours have been used in most current data-mining software applications, including a wide variety of biomedical problems (15-19), facial recognition (20), gene expression (21-23), and other complex problems (24). There is no reason to avoid the use of this powerful method for pattern recognition within fuzzy data sets for uncovering hidden order within the complex association of genotypic and phenotypic variables that characterize complex medical disorders.
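The two checks mentioned above (SNPs more than 100,000 base pairs apart, and SNP sets spanning more than one chromosome) are simple to express. The sketch below is illustrative only: the function names and the chromosome/position values are made up and do not correspond to SNPs from the study.

```python
def far_apart(snp_a, snp_b, min_distance=100_000):
    """True if two SNPs lie on different chromosomes or are separated
    by more than `min_distance` base pairs on the same chromosome."""
    chrom_a, pos_a = snp_a
    chrom_b, pos_b = snp_b
    return chrom_a != chrom_b or abs(pos_a - pos_b) > min_distance

def spans_chromosomes(snp_set):
    """True if the SNP set contains SNPs from more than one chromosome."""
    return len({chrom for chrom, _ in snp_set}) > 1

# Hypothetical SNP set given as (chromosome, base-pair position) pairs.
snp_set = [("15", 88_000_000), ("15", 88_050_000), ("6", 27_000_000)]
print(far_apart(snp_set[0], snp_set[1]))  # same chromosome, 50,000 bp apart
print(far_apart(snp_set[0], snp_set[2]))  # different chromosomes
print(spans_chromosomes(snp_set))
```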
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY. -
On 2014 Oct 06, C. Robert Cloninger commented:
The Challenge of Understanding and Accepting a Change in Perspective
Is our approach to concurrent genotypic-phenotypic of possible complex relationships a fruitful new approach without the limiting assumptions of standard GWAS, or are our observations really artifactual in ways that have been overlooked by us and by multiple sets of peer reviewers with relevant expertise about these novel methods? The American Journal of Psychiatry gave us generous space for the published article, including clinical vignettes with associated genotypic information to help people see what we have done even if the technical details of the statistical procedures may seem obscure when you first start looking at complex genotype-phenotype relationships through the illuminating lens of sophisticated machine-learning and data mining procedures. We expected that there would be widespread interest and scrutiny of this new data-driven approach with less restrictive assumptions, so we prepared an extensive online supplement specifying procedures, all components of the sets of SNPs and clinical variables as well as a detailed analysis of the associated gene products, their functions, and disease associations.
The full list of SNPs used in our analysis is being made available for others to continue to test. We believe in transparency and collegiality as key ingredients in the advance of science because they are essential for the spirit of empiricism that our data-driven method emphasizes. The precise procedures for reproducing the list of SNPs were already detailed in our supplemental information and should be reproducible by experienced investigators. We will continue to consider reasonable requests for assistance from qualified investigators.
B-S expressed their concern that the “methodology is opaque (even to experts), meaning that their results cannot be independently validated”. First of all, the complexity of a method does not invalidate the approach: complex methods may be necessary to deconstruct and understand complex processes. We cannot continue to look for hidden relationships with methods that do not shine light where it is needed. That said, the manuscript was exhaustively evaluated under a strict peer review process, which included a separate report from an independent statistician. Because of their many insightful comments, there is no doubt that the referees understood the method and provided recommendations that we conscientiously addressed in the re-submission process. Moreover, the PGMRA method utilized in this work was also evaluated by expert reviewers in bioinformatics and genetics for the journal Nucleic Acids Research (2). The method is well-described but does require relevant expertise beyond what is required for traditional GWAS. Fortunately, we have made a web-server application of the method publicly available as a service to the field. PGMRA is applicable to a wide variety of analyses besides GWAS, including brain imaging and related methods for uncovering order in complex hidden relationships, which may help to further characterize the pathway from genotype to phenotype more objectively than can be done by categorical diagnoses or symptom inventories in samples so large that costs become prohibitive for thorough assessment. We know that many are increasingly criticizing overspecialization in the fields of science, but the neglect of strictly data-driven techniques from machine-learning and data-mining that do not require restrictive a priori assumptions may well be precisely what has prevented us from understanding the complex genotypic-phenotypic architecture of common disorders like the schizophrenias.
We understand that our new approach is challenging long-held assumptions and that there may be a desire by some to put the genie back in the bottle, but we feel that looking at the complexity of the schizophrenias is a necessary evolution for the field; it is an evolution whose time has come and is currently transforming other fields of science and genetics. There is overwhelming evidence across multiple disciplines that living systems and psychosocial behavior are simply too complex and interactive to ignore the real underlying complexity. Nonetheless, we were a bit surprised to see this discussion in a public forum that is not peer reviewed. We would rather have thoughtful constructive consideration of the scientific merits of alternative approaches, including their fundamental philosophical differences in perspective and goals, as well as scientific differences in assumptions and procedures. One of the major obstacles to evaluating GWAS is that it can be difficult or impossible for scientists in many fields to evaluate complex technical procedures with which they are unfamiliar. The challenge of changing one’s perspective can be great and feel counterintuitive, as physicists experienced more than a century ago when quantum mechanics called into question our more natural inclination to a Newtonian perspective. That is why we feel it would have been more constructive to have neutral review by people with relevant expertise in many aspects of methods that span bioinformatics, statistics, genomics, and phenomics, all of which are needed to adequately judge the strengths and weaknesses of a novel approach like ours. Even people with extensive experience in traditional approaches to GWAS may not be sufficiently knowledgeable about these well-tested, but relatively new, machine-learning and data-mining techniques that have allowed us to develop a new, and, we hope, more generative approach to GWAS.
Nevertheless, as scientists we are dedicated to identifying and learning how to move our fields of inquiry forward in order to better understand the underlying mechanisms of disease and to identify effective personalized treatments for complex disorders. We have found it is crucial to pay balanced attention to both phenotypic variability and genotypic variability if we are ever to describe the complex development of common and complex medical disorders like the schizophrenias. We do not feel that this public forum is the best place to have this discussion with Breen and colleagues, but here too we may have a philosophical difference. That said, because they have chosen this forum to voice their criticism, we feel it is important that we take the time to address the facts and give people a broader context so that they can understand the arguments and our responses to their concerns. We feel that this is more of a misunderstanding and a miscommunication due to the lack of a common scientific and philosophical approach, and we hope to find more common ground with time. Ultimately, the data will settle any dispute.
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY. -
On 2014 Oct 06, C. Robert Cloninger commented:
Reply to Breen et al (their Part 2 of 2) On behalf of: C. Robert Cloninger, M.D., Ph.D. (Departments of Psychiatry and Genetics, Washington University School of Medicine, St. Louis, MO, USA); Igor Zwir, Ph.D. (Department of Psychiatry, Washington University School of Medicine, St. Louis, MO, USA; Department of Computer Science and Artificial Intelligence, University of Granada, Spain); Gabriel A. de Erausquin, M.D., Ph.D. (Roskamp Laboratory of Brain Development, Modulation and Repair, Department of Psychiatry and Behavioral Neurosciences, University of South Florida, Tampa, FL, USA); Dragan M. Svrakic, M.D., Ph.D. (Department of Psychiatry, Washington University School of Medicine, St. Louis, MO, USA); Coral del Val, Ph.D. (Department of Computer Science and Artificial Intelligence, University of Granada, Spain); Javier Arnedo, M.S. (Department of Computer Science and Artificial Intelligence, University of Granada, Spain); Rocío Romero-Zaliz, Ph.D. (Department of Computer Science and Artificial Intelligence, University of Granada, Spain); Helena Hernández-Cuervo, M.D., B.Sc. (Roskamp Laboratory of Brain Development, Modulation and Repair, Department of Psychiatry and Behavioral Neurosciences, University of South Florida, Tampa, FL, USA);
The reply to the comments of Breen et al (their Part 2 of 2) by Cloninger on behalf of all of us responsible for analysis in the Arnedo et al article has the following five components which are posted separately as follows: (4) The Challenge of Understanding and Accepting a Change in Perspective (5) The Facts about Linkage Disequilibrium (6) The Facts about Replication and Significance Testing of SNP Selection (7) Conclusion (8) References for all sections of our postings
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY. -
On 2014 Sep 28, Gerome Breen commented:
Part 2 of 2 (see below) On behalf of: Gerome Breen, PhD (Institute of Psychiatry, King’s College London, London, UK) * Brendan Bulik-Sullivan (Broad Institute, Cambridge, MA, USA) * Mark Daly, PhD (Broad Institute, Cambridge, MA, USA) * Sarah Medland, PhD (QIMR Berghofer, Brisbane, Australia) * Benjamin Neale, PhD (Broad Institute, Cambridge, MA, USA) * Michael O’Donovan, MD PhD (Cardiff University, Cardiff, UK) * Stephan Ripke, PhD (Broad Institute, Cambridge, MA, USA) * Patrick Sullivan, MD (Karolinska Institutet, Stockholm, Sweden) * Peter Visscher, PhD (University of Queensland, Brisbane, Australia) * Naomi Wray, PhD (University of Queensland, Brisbane, Australia) *
* Authors ordered alphabetically, all made equivalent contributions
C. Linkage disequilibrium (LD).
Pairs of SNPs that are physically close in the genome are often correlated due to LD. Furthermore, in samples containing individuals with different ancestry, SNPs on different chromosomes whose allele frequencies differ between populations will appear to be correlated. These are both well-known phenomena from population genetics.
The typical size of blocks defined by high LD is on the order of 20,000 bases, but LD is far from uniform across the genome. Using a large European sample genotyped with Affymetrix 6.0 arrays, we had previously computed the locations of particularly large blocks of LD (defined using SNPs with r² > 0.5). The first step in the statistical methodology described by Arnedo et al. is to identify so-called “SNP sets” – sets of SNPs that travel together – which the authors believe contain some information about clinical subtypes of schizophrenia: “we first identified sets of interacting … SNPs that cluster within subgroups of individuals … regardless of clinical status” (no LD limitations were imposed). Of the 237 SNPs in Table S3 from Arnedo et al., 153 (65%) mapped to exceptionally large LD blocks larger than 100,000 bases (median 275 kb, interquartile range 165-653 kb, maximum 1.2 Mb).
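As an illustration of the r² threshold mentioned above, the following minimal sketch computes the squared Pearson correlation between the allele dosages (0/1/2) of two SNPs, a common genotype-based approximation to LD r². The comment does not state how r² was actually computed, so the function name `genotype_r2` and the toy dosage vectors are assumptions made purely for illustration.

```python
from statistics import mean

def genotype_r2(x, y):
    """Squared Pearson correlation between two dosage vectors (0/1/2)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic SNP: correlation undefined, treated as no LD
    return (sxy * sxy) / (sxx * syy)

# Hypothetical dosages for eight individuals (not real data).
snp_a = [0, 1, 2, 1, 0, 2, 1, 0]
snp_b = [0, 1, 2, 1, 0, 2, 0, 1]  # agrees with snp_a for most individuals
r2_ab = genotype_r2(snp_a, snp_b)
in_same_block = r2_ab > 0.5  # threshold quoted in the comment
```

In this toy case the two SNPs would count as being in high LD under the r² > 0.5 criterion; real LD block detection works on phased or genome-wide data and is considerably more involved.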
Arnedo et al. claim repeatedly that sets of SNPs that travel together are informative about clinical subtypes of schizophrenia. A more parsimonious interpretation of the SNP clusters identified by Arnedo et al. is that these SNPs represent a combination of (1) SNPs in large LD blocks and (2) SNPs whose allele frequencies differ substantially between European and African sample subsets. Indeed, matrix factorization algorithms similar to the methods employed by Arnedo et al. have been used to identify regions with long-range LD (see Price et al., AJHG, 2008, link below).
D. SNP selection.
Arnedo et al. conducted genetic clustering analyses on 2,891 SNPs selected on the basis of in-sample P-values from analysis of association with case-control status and selected from a total of ~700,000 SNPs. It is therefore expected that linear or non-linear combinations of these SNPs will be associated with case-control status in the same sample (their risk statistic); this is true even if the selected SNPs are not truly associated. A permutation test is used to assess the significance of the observed phenotype/genotype clustering. In this permutation test, subjects are randomly allocated to “SNP sets” but, since the SNPs were selected because they differ in allele frequency between cases and controls, this procedure does not generate a valid null distribution. As a result, the reported P-values are incorrect.
The strategy used by Arnedo et al. is an example of estimation and selection of effects in a dataset and then testing (or re-estimating) them in the same data, a common pitfall of prediction analyses (Wray et al. 2013, Nature Reviews Genetics, link below). To construct a valid permutation test, the authors should have randomized case-control status in the association analysis step, selected a new set of ~3,000 SNPs and generated a distribution of their coincident test index under a truly null distribution.
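The permutation scheme described above can be sketched as follows. This is a toy illustration under stated assumptions, not the procedure of Arnedo et al. or of the commenters: the statistic (`mean_abs_diff_top_k`), the simulated noise genotypes, and all parameter values are hypothetical. The point it shows is only that case-control labels are shuffled before SNP selection, so every permuted data set passes through the same select-then-score pipeline as the observed data.

```python
import random

def mean_abs_diff_top_k(genotypes, labels, k):
    """Select the k SNPs with the largest case/control mean-dosage difference
    and return the mean of those k differences (a toy test statistic)."""
    cases = [i for i, y in enumerate(labels) if y == 1]
    controls = [i for i, y in enumerate(labels) if y == 0]
    diffs = []
    for snp in genotypes:  # one list of dosages per SNP
        case_mean = sum(snp[i] for i in cases) / len(cases)
        control_mean = sum(snp[i] for i in controls) / len(controls)
        diffs.append(abs(case_mean - control_mean))
    top = sorted(diffs, reverse=True)[:k]
    return sum(top) / k

def permutation_p_value(genotypes, labels, k, n_perm, rng):
    """Build the null by shuffling labels BEFORE selection, then re-selecting
    and re-scoring in every permuted data set."""
    observed = mean_abs_diff_top_k(genotypes, labels, k)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mean_abs_diff_top_k(genotypes, shuffled, k) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

rng = random.Random(0)
n_subjects, n_snps = 60, 200
labels = [1] * 30 + [0] * 30
# Pure-noise genotypes: no SNP is truly associated with case status.
genotypes = [[rng.choice((0, 1, 2)) for _ in range(n_subjects)]
             for _ in range(n_snps)]
p_value = permutation_p_value(genotypes, labels, k=10, n_perm=200, rng=rng)
```

If the labels were instead shuffled only after the top SNPs had been chosen on the real labels, the selected SNPs would already differ between cases and controls by construction, which is the invalid null the comment warns about.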
E. Replication.
Replication of results is a well-acknowledged strategy for generating confidence in reported findings. Arnedo et al. state that they replicated their findings in two samples but, upon closer examination, it is unclear precisely what replicated, exactly how this was done, and whether the degree of “replication” deviated from that expected by chance. It was also unclear whether the replication control samples were or were not independent from the discovery sample. Such non-independence is another common pitfall in prediction or validation analysis.
Conclusions.
Given the remarkable claims made by Arnedo et al., it is essential that alternative explanations be excluded. Unfortunately, the authors do not provide the necessary evidence. As presented, their methodology is opaque (even to experts), meaning that their results cannot be independently validated. Arnedo et al. do not consider alternative explanations for the phenomena that they observe, such as confounding from ancestry and LD, even though these are well-known issues for the statistical methods that they employ and have been studied extensively in the statistical and population genetics literature. In addition, their multistep analysis approach is subject to multiple issues as noted above.
We believe that it is highly likely that the results of Arnedo et al. are not relevant for schizophrenia. We urge great caution in the interpretation of the results of this study.
Links: • Arnedo et al. in PubMed, http://www.ncbi.nlm.nih.gov/pubmed/25219520 • Press release from Washington University (St Louis), https://news.wustl.edu/news/Pages/27358.aspx • Media coverage via Google News, https://news.google.com/news?ncl=dwmwseq1Mco0xVMLXz5qVDa7uJqbM&q=cloninger&lr=English&hl=en&sa=X&ei=hgIjVO-0GoGOyASxtIGAAw&ved=0CD0QqgIwCQ • Nonnegative matrix factorization and ancestry inference: http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1001117 • Pitfalls of predicting complex traits from SNPs: http://www.ncbi.nlm.nih.gov/pubmed/23774735 • Using principal components analysis to identify regions with long-range LD: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2443852
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY. -
On 2014 Oct 06, C. Robert Cloninger commented:
The Facts about Gender and the X-Chromosome
B-S express concern about gender effects in our results. Traditional GWAS focuses on average effects in heterogeneous groups, but our novel approach focuses on uncovering genotypic-phenotypic relationships in individuals regardless of their gender. B-S were concerned about the possible bias of results from 15 SNPs on the X-chromosome among 245 SNPs in high-risk SNP sets. However, a simple test based on the number of chromosomes shows that 15 SNPs cannot substantially confound the results. It is true that three of our 42 high-risk SNP sets have some SNPs on the X Chromosome, but when this is considered in context along with the remaining 39 SNP sets, the influence of gender was insignificant by a Kolmogorov test. All SNP sets have consistent associations with distinct phenotypic sets regardless of gender. The effects of gender and location of genetic variants on the X-chromosome have a negligible influence on our findings. In fact, the small number of SNPs on the X-chromosome in SNP sets at high risk for schizophrenia shows that our person-centered method does not select SNPs that are in LD indiscriminately; the X-chromosome has many highly conserved sets of epistatic genes in LD that influence gender and brain function (7), but these are not overrepresented in our SNP sets at high risk for schizophrenia. We thank B-S for calling attention to another finding that demonstrates that our method for identifying SNP sets is highly selective for particular phenotypes.
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY. -
On 2014 Oct 06, C. Robert Cloninger commented:
The Facts about Ancestry and Population Stratification
B-S expressed concern that our findings may be an artifact. As scientists we are committed to trying to disconfirm findings we have made, no matter how strong the existing evidence may be. Findings about association need to be resolved experimentally at the molecular level not only for our findings but also for the findings of other published GWAS, which we plan to do in the near future. We have previously considered the variables that concern B-S carefully, but did not report these observations in detail for two reasons. First, their impact was empirically negligible for our approach, as we will describe, and we prioritized space to the variables that were most significant. Second, although the phenomena that concern B-S are often a serious problem for traditional group-wise approaches to GWAS, they are not problematic for our novel approach because it directly tests the association between genotypic variability and phenotypic variability within individuals after deconstruction of the observed or hidden structure of the population. We will describe the facts about each of these phenomena and explain why these criticisms fall short of explaining our findings.
B-S expressed concern that we did not take the necessary steps to correct for population stratification bias in the way they would have done using their traditional group-wise statistical approach. They suggest that the clusters we identified simply reflect “SNPs whose allele frequencies differ substantially between European and African sample subsets”, and go so far as to claim that the SNP sets we uncovered may not be relevant to schizophrenia at all. However, for that claim to be valid, ethnicity must have a strong influence on the risk and symptoms of schizophrenia, but that requirement is unlikely to be satisfied based on previous observations and was not found in our data. We did consider both sex and ancestry as co-variates in the pre-selection of SNPs with at least loose association with schizophrenia. This pre-selection was performed to reduce the large search space using the logistic association function included in the PLINK software suite (5). Our analysis was performed in this way to be compatible with the supplementary tables reported in (6) for African Americans (AA), European-Americans (EA), and individuals of mixed African and European ancestry (AA-EA). The most important fact about ethnic stratification is that there were multiple examples of SNP sets containing varying mixes of subjects from different subpopulations for each disease sub-type in each of our three independent samples of subjects. For example, in the Molecular Genetics of Schizophrenia (MGS) sample, the SNP set 22-11 was represented by 48% AA and 52% EA, SNP set 21-8 by 55% AA and 45% EA, SNP set 31-22 by 53% AA and 47% EA, SNP set 54-51 by 79% AA and 21% EA, and SNP set 71-55 by 52% AA and 48% EA.
In fact, all the SNP sets that appeared to be ethnically stratified (i.e., contained mostly AA or EA subjects in MGS sample, such as 56-30 or 42-37) replicated their association with specific phenotypic indicators of different classes of schizophrenia in subjects of another ethnicity in CATIE or in the Portuguese sample. Although concerns about ethnic stratification may be valid elsewhere, ethnic stratification had little impact on our results and cannot explain the robust association of specific SNP sets with specific phenotypic sets regardless of ethnicity or sample. These observations show the great utility of detailed consideration of phenotypic variability in individual people in our approach, compared to the sensitivity to confounding by population stratification in traditional GWAS when heterogeneous phenotypes are lumped together indiscriminately as cases. The concerns of B-S about ethnic stratification point out a limitation of traditional group-wise GWAS that is averted by our novel approach. We thank them for drawing attention to another strength of our approach, one that we did not have enough space to report previously.
We will discuss population stratification in more detail together with our reply about linkage disequilibrium (see posting 5), and discuss significance testing following our comments on replication in later sections of our reply to Part 2 of their comments (see posting 6).
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY. -
On 2014 Oct 06, C. Robert Cloninger commented:
Two Distinct Perspectives and Methodological Approaches to GWAS
We expected our paper uncovering the hidden risk architecture of the schizophrenias to be controversial because it takes a fundamentally new approach to solve problems that have plagued the field of medical genetics for more than a decade without resolution (1). We went through rigorous peer review regarding the method with experts in bioinformatics and genetics (2) and then again regarding the application of our new approach to the schizophrenias (1). The critical comments of Breen and other colleagues of Sullivan (B-S) highlight the fact that we have a fundamentally new approach with a distinct perspective and properties from the traditional method they have used for several years. It is important to understand how our novel approach differs from the traditional one in order to appreciate the opportunities it provides for the advancement of science.
First, in our novel approach, common disorders are recognized to have a complex etiology in which multiple genetic and environmental variables interact in complex ways to influence the risk of disease in an individual person. B-S are experienced in approaches to genome-wide association studies (GWAS) that allow detection of only the average (additive) effects of individual genes in groups of people. We regard the traditional group-wise approach to GWAS as overly restrictive because it is well established that genes typically function in concert with one another, resulting in substantial epistasis in schizophrenia and many other common disorders (3). Fitness, health, and behavior are properties of persons, not genes. Nevertheless, the traditional approach can be useful when its a priori assumptions are satisfied. The traditional and novel approaches to GWAS should be viewed as being complementary perspectives and procedures.
Second, our novel approach allows for the possibility of complex relationships between multilocus genotypes and multifaceted phenotypes. In other words, different sets of genetic polymorphisms can be associated with the same phenotype (“equi-finality” or genetic heterogeneity), and the same set of genetic polymorphisms can be associated with multiple distinct phenotypes (“multifinality” or pleiotropy). B-S, in contrast, focus only on heterogeneous groups of cases, neglecting phenotypic variability among cases. It is important to note that our novel approach does not make any a priori assumption that complexity is present, but we do allow it to emerge from the data when present, as occurred clearly in our analysis of multiple independent samples of people with the schizophrenias.
Third, we carry out person-centered analyses that specify genotypic-phenotypic relationships within each individual by using clustering methods in which subjects are one matrix dimension and the other matrix dimension is either genotypic or phenotypic information. In other words, our analyses are informative about each individual, thereby providing a basis for identifying specific causes of illness in each person as a basis for tailoring treatment in a personalized way. In contrast, the traditional GWAS considers only average effects in groups of people, making it unjustified to say anything with confidence about a specific individual. Such traditional methods have failed to produce any reliable genetic test for the diagnosis of any psychiatric disorder in an individual person. In addition, phenomena such as population stratification and linkage disequilibrium may confound interpretation of the group-wise statistics of traditional GWAS, whereas these phenomena are easily evaluated in our person-centered approach.
Fourth, our approach is entirely data-driven using machine learning and data mining procedures that are unbiased and unsupervised (i.e., no a priori assumptions are made). Such data-driven methods have been used successfully in many fields of science but not for GWAS prior to our work. In contrast, the a priori assumptions made in the traditional GWAS approach, as used by B-S, have produced only weak associations between the average effects of individual genes and the diagnosis of schizophrenia. In fact, Sullivan and others proposed the formation of the Psychiatric Genetics Consortium (PGC) to address the problem of weak and inconsistent associations. Unfortunately, even large samples still produce only weak associations (4). In addition, even large collections of subjects have encountered what is called the “missing heritability problem” in medical genetics: most of the variability in risk for the schizophrenias has remained unexplained. For example, the resemblance of monozygotic cotwins of people with schizophrenia is much greater than can be explained by the average effects of individual genes, indicating that multiple genes act in concert to influence risk (3). Whereas the heritability of schizophrenia is estimated to be 81% from twin studies, only about 25% of the variability has been explained by traditional GWAS.
In contrast, by applying our novel approach to GWAS we observed that sets of single nucleotide polymorphisms (SNPs) allow identification of individuals at very high risk (70% or more) and replicated the findings consistently in three independent samples. Our results can explain much more about disease risk than traditional group-wise approaches, so we are not surprised that such strong findings may come as a shock to those who have become accustomed to the weak associations identified by traditional GWAS. Again, we expected that this would be very controversial, but we’re optimistic that this fundamentally new approach will open up many new opportunities for people interested in medical genetics. Our results were so unexpected that peer-reviewers demanded replication in independent samples before acceptance. Even we were delighted with the strong replication: 81% of 42 SNP sets associated with 70 to 100% risk of schizophrenia replicated almost exactly across three independent samples, and different SNP sets were associated with distinct phenotypic syndromes in which the gene products suggest possible pathways by which the functions and expression of the genes in the brain may explain the different clinical features of individual patients.
In summary, traditional approaches to GWAS focus on the average effects of individual genes in groups of people, whereas our novel approach focuses on the interactive effects of groups of genes in an individual person. Consequently the differences between these two approaches have profound consequences for the way they view and handle phenomena like linkage disequilibrium, population stratification, and X-linkage. Unfortunately, Breen and colleagues have not adequately appreciated the profound differences between the traditional methods of GWAS with which they are familiar and the novel approach we have developed. As a result, the criticisms they made reflect their concerns about problems that regularly occur when a traditional group-wise approach is implemented, but these concerns may have minimal pertinence to our person-centered approach and findings.
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY. -
On 2014 Oct 06, C. Robert Cloninger commented:
Reply to Breen et al (their Part 1 of 2)
On behalf of: C. Robert Cloninger, M.D., Ph.D. (Departments of Psychiatry and Genetics, Washington University School of Medicine, St. Louis, MO, USA); Igor Zwir, Ph.D. (Department of Psychiatry, Washington University School of Medicine, St. Louis, MO, USA; Department of Computer Science and Artificial Intelligence, University of Granada, Spain); Gabriel A. de Erausquin, M.D., Ph.D. (Roskamp Laboratory of Brain Development, Modulation and Repair, Department of Psychiatry and Behavioral Neurosciences, University of South Florida, Tampa, FL, USA); Dragan M. Svrakic, M.D., Ph.D. (Department of Psychiatry, Washington University School of Medicine, St. Louis, MO, USA); Coral del Val, Ph.D. (Department of Computer Science and Artificial Intelligence, University of Granada, Spain); Javier Arnedo, M.S. (Department of Computer Science and Artificial Intelligence, University of Granada, Spain); Rocío Romero-Zaliz, Ph.D. (Department of Computer Science and Artificial Intelligence, University of Granada, Spain); Helena Hernández-Cuervo, M.D., B.Sc. (Roskamp Laboratory of Brain Development, Modulation and Repair, Department of Psychiatry and Behavioral Neurosciences, University of South Florida, Tampa, FL, USA);
The reply to the comments of Breen et al (their Part 1 of 2) from all of us responsible for analysis in the Arnedo et al article has the following three components which are posted separately as comments as follows: (1) Two Distinct Perspectives and Methodological Approaches to GWAS (2) The Facts about Ancestry and Population Stratification (3) The Facts about Gender and the X-chromosome
The references for all the parts of our reply are included in the final postings about Part 2 of their comments (see posting 8 of our reply).
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY. -
On 2014 Sep 28, Gerome Breen commented:
Part 1 of 2. On behalf of Gerome Breen, PhD (Institute of Psychiatry, King’s College London, London, UK) * Brendan Bulik-Sullivan (Broad Institute, Cambridge, MA, USA) * Mark Daly, PhD (Broad Institute, Cambridge, MA, USA) * Sarah Medland, PhD (QIMR Berghofer, Brisbane, Australia) * Benjamin Neale, PhD (Broad Institute, Cambridge, MA, USA) * Michael O’Donovan, MD PhD (Cardiff University, Cardiff, UK) * Stephan Ripke, PhD (Broad Institute, Cambridge, MA, USA) * Patrick Sullivan, MD (Karolinska Institutet, Stockholm, Sweden) * Peter Visscher, PhD (University of Queensland, Brisbane, Australia) * Naomi Wray, PhD (University of Queensland, Brisbane, Australia) *
* Authors ordered alphabetically, all made equivalent contributions
In this study published on September 15, Arnedo et al. asserted that schizophrenia is a heterogeneous group of disorders underpinned by different genetic networks mapping to differing sets of clinical symptoms. As a result of their analyses, Arnedo et al. have made remarkable and perhaps unprecedented claims regarding their capacity to subtype schizophrenia. This paper has received considerable media attention. One claim that features in many media reports is that schizophrenia can be delineated into “8 types”. If these claims are replicable and consistent, then the work reported in this paper would constitute an important advance in our knowledge of the etiology of schizophrenia.
Unfortunately, these extraordinary claims are not justified by the data and analyses presented. Their claims are based upon complex (and we believe flawed) analyses that are said to reveal links between clusters of clinical data points and patterns of data generated by looking at millions of genetic data points. Instead of the complexities favored by Arnedo et al., there are far simpler alternative explanations for the patterns they observed. We believe that the authors have not excluded important alternative explanations – if we are correct, then the major conclusions of this paper are invalidated.
Analyses such as these rely on independence in many ways: among variables used in prediction, absence of artifactual relationships between genotypes and clinical variables, and between the methods of assessing significance and replication. Below we identify five specific areas of concern that are not adequately addressed in the manuscript, each of which calls into question the conclusions of this study.
A. Ancestry/population stratification.
Two of the three samples the authors studied (MGS and CATIE) have substantial proportions of subjects of European and African ancestry. The third sample is from southern Europe. Ancestry is an extremely well known confounder in genetic studies with a great capacity to yield false associations. Correct inference from genomic data in samples like these requires exceptional care. In the analyses they present, there is almost no mention of how this known bias was addressed or evaluation of its impact on their results. In the samples they used, their description of sets of SNPs that track together is essentially the definition of uncorrected population structure/stratification. Indeed, a central component of their statistical methodology – nonnegative matrix factorization – has been previously employed as a method for ancestry inference in the population genetics literature (Engelhardt and Stephens, PLoS Genetics, 2010, link below).
We were unsuccessful in our attempts to obtain the full list of SNPs that Arnedo et al. analyzed. Instead, we evaluated the SNPs listed in Table S3 (448 SNP entries; 245 unique SNPs, because a SNP could be listed more than once; and 237 SNPs with valid allele frequencies in HapMap3). We computed the absolute value of the difference in allele frequencies between the CEU (northwest European) and YRI (Yoruba from Nigeria) groups for all HapMap3 SNPs passing basic quality control (688K SNPs genotyped using Affymetrix 6.0 arrays, to match the MGS sample). We then contrasted the SNPs used by Arnedo et al. with all other Affymetrix 6.0 SNPs. The Table S3 SNPs had markedly larger allele frequency differences between the European and the African group: the mean absolute difference in allele frequency was 0.27 for the Table S3 SNPs used by Arnedo et al. versus 0.19 for all other SNPs. These highly significant differences underscore our concerns about population stratification bias.
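For readers who want to repeat this kind of check on their own SNP list, the comparison can be sketched roughly as follows. This is a minimal illustration, not the script used for the numbers above. The function name, the SNP identifiers and the allele frequencies are all made up for the example and are not HapMap3 values.

```python
from statistics import mean

def mean_abs_freq_diff(freq_a, freq_b, snp_ids):
    """Mean |p_a - p_b| over the given SNP ids.

    SNPs missing from either frequency table are skipped, mirroring the
    restriction to SNPs with valid allele frequencies in both groups.
    """
    diffs = [abs(freq_a[s] - freq_b[s])
             for s in snp_ids if s in freq_a and s in freq_b]
    return mean(diffs)

# Toy allele frequencies for two reference groups (hypothetical values).
ceu = {"rsA": 0.10, "rsB": 0.50, "rsC": 0.30, "rsD": 0.80}
yri = {"rsA": 0.60, "rsB": 0.55, "rsC": 0.35, "rsD": 0.40}

selected = {"rsA", "rsD"}      # stands in for the SNPs of interest
others = set(ceu) - selected   # stands in for all other array SNPs

selected_diff = mean_abs_freq_diff(ceu, yri, selected)
others_diff = mean_abs_freq_diff(ceu, yri, others)
print(f"selected: {selected_diff:.2f}, others: {others_diff:.2f}")
```

A larger mean difference in the selected set than in the background set is what one would expect if the selection step had enriched for ancestry-informative SNPs.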
B. X chromosome (chrX).
We noted that 15 of the 237 SNPs in Table S3 were on chrX (again, Table S3 contains only a fraction of the SNPs used in the modeling). Genotypes at chrX SNPs will partly reflect the sex of participants. Arnedo et al. say in their supplement that they include sex as a covariate in their regressions, but they do not describe how they account for sex in their matrix factorization. For example, since males have only one copy of chrX, chrX genotypes for males will be either 0 or 1, whereas chrX genotypes for females will be 0, 1 or 2. This difference will be salient to clustering algorithms such as those employed by the authors, so it seems likely that some component of the clusters of individuals identified by Arnedo et al. simply reflects genotype differences between the sexes rather than clinical features of schizophrenia. It is well known in statistical genetics that the sex chromosomes require special handling, but this issue is not addressed by Arnedo et al.
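The effect of the coding difference can be illustrated with a small simulation. This is a sketch under stated assumptions: 15 independent chrX SNPs with the same allele frequency of 0.3 in both sexes, and a crude one-dimensional cut on total allele count standing in for a clustering algorithm. It is not the method of Arnedo et al., and all names and parameter values are hypothetical.

```python
import random

def simulate_chrx_dosage(n_people, n_snps, freq, copies, rng):
    """Allele counts per person; `copies` is 1 for males, 2 for females."""
    return [[sum(rng.random() < freq for _ in range(copies))
             for _ in range(n_snps)]
            for _ in range(n_people)]

rng = random.Random(0)
males = simulate_chrx_dosage(1000, 15, 0.3, 1, rng)
females = simulate_chrx_dosage(1000, 15, 0.3, 2, rng)

# Total chrX allele count per person: same allele frequency in both sexes,
# but females carry twice as many copies.
male_totals = [sum(row) for row in males]
female_totals = [sum(row) for row in females]
male_mean = sum(male_totals) / len(male_totals)
female_mean = sum(female_totals) / len(female_totals)

# A single cut halfway between the two means already separates the sexes
# far better than chance, with no phenotype information at all.
cut = (male_mean + female_mean) / 2
correct = (sum(t < cut for t in male_totals)
           + sum(t >= cut for t in female_totals))
accuracy = correct / (len(male_totals) + len(female_totals))
print(f"male mean {male_mean:.2f}, female mean {female_mean:.2f}, "
      f"sex recovered for {accuracy:.0%} of subjects")
```

Any method that groups subjects by genotype similarity is exposed to the same signal unless chrX is excluded or coded in a sex-aware way.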
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY.
-
- Feb 2018
-
europepmc.org europepmc.org
-
On 2014 Sep 28, Gerome Breen commented:
Part 2 of 2 (see below) On behalf of: Gerome Breen, PhD (Institute of Psychiatry, King’s College London, London, UK) * Brendan Bulik-Sullivan (Broad Institute, Cambridge, MA, USA) * Mark Daly, PhD (Broad Institute, Cambridge, MA, USA) * Sarah Medland, PhD (QIMR Berghofer, Brisbane, Australia) * Benjamin Neale, PhD (Broad Institute, Cambridge, MA, USA) * Michael O’Donovan, MD PhD (Cardiff University, Cardiff, UK) * Stephan Ripke, PhD (Broad Institute, Cambridge, MA, USA) * Patrick Sullivan, MD (Karolinska Institutet, Stockholm, Sweden) * Peter Visscher, PhD (University of Queensland, Brisbane, Australia) * Naomi Wray, PhD (University of Queensland, Brisbane, Australia) *
- Authors ordered alphabetically, all made equivalent contributions
C. Linkage disequilibrium (LD).
Pairs of SNPs that are physically close in the genome are often correlated due to LD. Furthermore, in samples containing individuals of different ancestries, SNPs on different chromosomes whose allele frequencies differ between populations will appear to be correlated. Both are well-known phenomena in population genetics.
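The second phenomenon is easy to reproduce in a toy simulation. The sketch below assumes two populations of equal size and two unlinked SNPs whose allele frequencies are 0.1 in one population and 0.9 in the other. These values and the helper names are chosen purely for illustration.

```python
import random

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def genotypes(n, freq, rng):
    """Independent diploid genotypes (0, 1 or 2) at allele frequency `freq`."""
    return [(rng.random() < freq) + (rng.random() < freq) for _ in range(n)]

rng = random.Random(0)
# Two unlinked SNPs, drawn independently within each population.
pop1_snp1, pop1_snp2 = genotypes(1000, 0.1, rng), genotypes(1000, 0.1, rng)
pop2_snp1, pop2_snp2 = genotypes(1000, 0.9, rng), genotypes(1000, 0.9, rng)

r_within = pearson(pop1_snp1, pop1_snp2)
r_pooled = pearson(pop1_snp1 + pop2_snp1, pop1_snp2 + pop2_snp2)
print(f"within one population r = {r_within:.2f}, pooled r = {r_pooled:.2f}")
```

Within either population the two SNPs are uncorrelated by construction, yet in the pooled sample they "travel together" simply because both track ancestry.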
The typical size of blocks defined by high LD is on the order of 20,000 bases, but LD is far from uniform across the genome. Using a large European sample genotyped with Affymetrix 6.0 arrays, we had previously computed the locations of particularly large blocks of LD (defined using SNPs with r² > 0.5). The first step in the statistical methodology described by Arnedo et al. is to identify so-called “SNP sets” – sets of SNPs that travel together – which the authors believe contain some information about clinical subtypes of schizophrenia: “we first identified sets of interacting … SNPs that cluster within subgroups of individuals … regardless of clinical status” (no LD limitations were imposed). Of the 237 SNPs in Table S3 from Arnedo et al., 153 (65%) mapped to exceptionally large LD blocks larger than 100,000 bases (median 275 kb, interquartile range 165-653 kb, maximum 1.2 Mb).
Arnedo et al. claim repeatedly that sets of SNPs that travel together are informative about clinical subtypes of schizophrenia. A more parsimonious interpretation of the SNP clusters identified by Arnedo et al. is that these SNPs represent a combination of (1) SNPs in large LD blocks and (2) SNPs whose allele frequencies differ substantially between European and African sample subsets. Indeed, matrix factorization algorithms similar to the methods employed by Arnedo et al. have been used to identify regions with long-range LD (see Price et al., AJHG, 2008, link below).
D. SNP selection.
Arnedo et al. conducted genetic clustering analyses on 2,891 SNPs selected on the basis of in-sample P-values from analysis of association with case-control status and selected from a total of ~700,000 SNPs. It is therefore expected that linear or non-linear combinations of these SNPs will be associated with case-control status in the same sample (their risk statistic); this is true even if the selected SNPs are not truly associated. A permutation test is used to assess the significance of the observed phenotype/genotype clustering. In this permutation test, subjects are randomly allocated to “SNP sets” but, since the SNPs were selected because they differ in allele frequency between cases and controls, this procedure does not generate a valid null distribution. As a result, the reported P-values are incorrect.
The strategy used by Arnedo et al. is an example of estimating and selecting effects in a dataset and then testing (or re-estimating) them in the same data, a common pitfall of prediction analyses (Wray et al. 2013, Nature Reviews Genetics, link below). To construct a valid permutation test, the authors should have randomized case-control status in the association analysis step, selected a new set of ~3,000 SNPs, and generated the distribution of their coincident test index under a truly null distribution.
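The difference between the two permutation schemes can be made concrete with a toy example. In the sketch below the genotypes are pure noise and the case-control labels are unrelated to them, so any "signal" is an artifact. The statistic (case-control difference of a signed sum over the top-k SNPs) is a simplified stand-in chosen for illustration and is not the coincident test index of Arnedo et al. The sample sizes and function names are likewise hypothetical.

```python
import random

def score_difference(genotypes, labels, weights):
    """Case minus control mean of the signed sum over the selected SNPs."""
    scores = [sum(sign * row[j] for j, sign in weights) for row in genotypes]
    case = [s for s, y in zip(scores, labels) if y == 1]
    ctrl = [s for s, y in zip(scores, labels) if y == 0]
    return sum(case) / len(case) - sum(ctrl) / len(ctrl)

def select_and_score(genotypes, labels, k):
    """Select the k SNPs with the largest |case - control| mean difference
    using `labels`, then evaluate the score on the same labels."""
    cases = [i for i, y in enumerate(labels) if y == 1]
    ctrls = [i for i, y in enumerate(labels) if y == 0]
    ranked = []
    for j in range(len(genotypes[0])):
        d = (sum(genotypes[i][j] for i in cases) / len(cases)
             - sum(genotypes[i][j] for i in ctrls) / len(ctrls))
        ranked.append((abs(d), j, 1 if d >= 0 else -1))
    weights = [(j, sign) for _, j, sign in sorted(ranked, reverse=True)[:k]]
    return weights, score_difference(genotypes, labels, weights)

rng = random.Random(1)
n, m, k, n_perm = 60, 100, 10, 100
genotypes = [[(rng.random() < 0.5) + (rng.random() < 0.5) for _ in range(m)]
             for _ in range(n)]
labels = [1] * (n // 2) + [0] * (n // 2)   # unrelated to the genotypes

weights, observed = select_and_score(genotypes, labels, k)

naive_hits = valid_hits = 0
for _ in range(n_perm):
    perm = labels[:]
    rng.shuffle(perm)
    # Naive null: keep the SNPs that were selected on the real labels.
    if score_difference(genotypes, perm, weights) >= observed:
        naive_hits += 1
    # Valid null: repeat the selection step on the permuted labels.
    if select_and_score(genotypes, perm, k)[1] >= observed:
        valid_hits += 1

naive_p = (naive_hits + 1) / (n_perm + 1)
valid_p = (valid_hits + 1) / (n_perm + 1)
print(f"naive P = {naive_p:.3f}, valid P = {valid_p:.3f}")
```

Because the selection step already used the real labels, the naive scheme reports a very small P-value even though no SNP is associated with status. Only the scheme that repeats the selection inside each permutation compares like with like.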
E. Replication.
Replication of results is a well-acknowledged strategy for generating confidence in reported findings. Arnedo et al. state that they replicated their findings in two samples but, upon closer examination, it is unclear precisely what replicated, exactly how this was done, and whether the degree of “replication” deviated from that expected by chance. It was also unclear whether the replication control samples were or were not independent from the discovery sample. Such non-independence is another common pitfall in prediction or validation analysis.
Conclusions.
Given the remarkable claims made by Arnedo et al., it is essential that alternative explanations be excluded. Unfortunately, the authors do not provide the necessary evidence. As presented, their methodology is opaque (even to experts), meaning that their results cannot be independently validated. Arnedo et al. do not consider alternative explanations for the phenomena that they observe, such as confounding from ancestry and LD, even though these are well-known issues for the statistical methods that they employ and have been studied extensively in the statistical and population genetics literature. In addition, their multistep analysis approach is subject to multiple issues as noted above.
We believe it is highly likely that the results of Arnedo et al. are not relevant for schizophrenia. We urge great caution in the interpretation of the results of this study.
Links:
- Arnedo et al. in PubMed: http://www.ncbi.nlm.nih.gov/pubmed/25219520
- Press release from Washington University (St Louis): https://news.wustl.edu/news/Pages/27358.aspx
- Media coverage via Google News: https://news.google.com/news?ncl=dwmwseq1Mco0xVMLXz5qVDa7uJqbM&q=cloninger&lr=English&hl=en&sa=X&ei=hgIjVO-0GoGOyASxtIGAAw&ved=0CD0QqgIwCQ
- Nonnegative matrix factorization and ancestry inference: http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1001117
- Pitfalls of predicting complex traits from SNPs: http://www.ncbi.nlm.nih.gov/pubmed/23774735
- Using principal components analysis to identify regions with long-range LD: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2443852
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY. -
On 2014 Oct 07, Igor Zwir commented:
References
(1) Cordon O, Herrera F, Zwir I (2002): Linguistic modeling by hierarchical systems of linguistic rules. IEEE Trans Fuzzy Syst. 10:2-20.
(2) Ruspini EH, Zwir I (2002): Automated generation of qualitative representations of complex objects by hybrid soft-computing methods. In: Pal SK, Pal A, editors. Pattern recognition: from classical to modern approaches. New Jersey: World Scientific, pp 454-474.
(3) Cordon O, Herrera F, Zwir I (2003): A hierarchical knowledge-based environment for linguistic modeling: models and iterative methodology. Fuzzy Set Syst. 138:307-341.
(4) Zwir I, Shin D, Kato A, Nishino K, Latifi T, Solomon F, et al. (2005): Dissecting the PhoP regulatory network of Escherichia coli and Salmonella enterica. Proc Natl Acad Sci U S A. 102:2862-2867.
(5) Zwir I, Huang H, Groisman EA (2005): Analysis of differentially-regulated genes within a regulatory network by GPS genome navigation. Bioinformatics. 21:4073-4083.
(6) Romero-Zaliz R, C. Rubio R, Cordón O, Cobb P, Herrera F, Zwir I (2008): A multi-objective evolutionary conceptual clustering methodology for gene annotation within structural databases: a case of study on the gene ontology database. IEEE Transactions on Evolutionary Computation. 12(6):679-701.
(7) Romero-Zaliz R, Del Val C, Cobb JP, Zwir I (2008): Onto-CC: a web server for identifying Gene Ontology conceptual clusters. Nucleic Acids Research. 36:W352-357.
(8) Harari O, Park SY, Huang H, Groisman EA, Zwir I (2010): Defining the plasticity of transcription factor binding sites by deconstructing DNA consensus sequences: the PhoP-binding sites among gamma/enterobacteria. PLoS Computational Biology. 6:e1000862.
(9) Cheeseman P, Stanford University (1990): Bayesian learning. [Palo Alto, Calif.]: Board of Trustees of the Leland Stanford Junior University.
(10) Cook DJ, Holder LB, Su S, Maglothin R, Jonyer I (2001): Structural mining of molecular biology data. IEEE Eng Med Biol Mag. 20:67-74.
(11) Fraley C, Raftery AE (1998): How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal. 41:578-588.
(12) Bezdek JC (1998): Pattern Analysis. In: Pedrycz W, Bonissone PP, Ruspini EH, editors. Handbook of Fuzzy Computation. Bristol: Institute of Physics, pp F6.1.1-F6.6.20.
(13) Bittner T, Smith B (2003): A theory of granular partitions. Foundations of geographic information science. 117-151.
(14) Fred AL, Jain AK (2005): Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 27:835-850.
(15) Arnedo J, del Val C, de Erausquin GA, Romero-Zaliz R, Svrakic D, Cloninger CR, et al. (2013): PGMRA: a web server for (phenotype x genotype) many-to-many relation analysis in GWAS. Nucleic Acids Research. 75.
(16) Pascual-Montano A, Carazo JM, Kochi K, Lehmann D, Pascual-Marqui RD (2006): Nonsmooth nonnegative matrix factorization (nsNMF). IEEE Transactions on Pattern Analysis and Machine Intelligence. 28:403-415.
(17) Madeira SC, Oliveira AL (2004): Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 1:24-45.
(18) Bezdek JC, Pal SK, IEEE Neural Networks Council (1992): Fuzzy models for pattern recognition: methods that search for structures in data. New York: IEEE Press.
(19) Latorre Carmona P, Sánchez JS, Fred ALN, editors (2013): Mathematical Methodologies in Pattern Recognition and Machine Learning: Contributions from the International Conference on Pattern Recognition Applications and Methods, 2012. Springer Proceedings in Mathematics & Statistics. New York: Springer.
(20) Kim EY, Kim SY, Ashlock D, Nam D (2009): MULTI-K: accurate classification of microarray subtypes using ensemble k-means clustering. BMC Bioinformatics. 10:260.
(21) Saeed F, Salim N, Abdo A (2012): Voting-based consensus clustering for combining multiple clusterings of chemical structures. Journal of Cheminformatics. 4:37.
(22) Deb K (2001): Nonlinear goal programming using multi-objective genetic algorithms. J Oper Res Soc. 52:291-302.
(23) Deb K (2001): Multi-objective optimization using evolutionary algorithms. 1st ed. Chichester; New York: John Wiley & Sons.
(24) Rissanen J (1989): Stochastic complexity in statistical inquiry. Singapore: World Scientific.
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY. -
On 2014 Oct 07, Igor Zwir commented:
Rationale of the method applied in Arnedo et al.
Clustering methods have long been applied to pattern recognition problems in a variety of fields and are embedded in most of our devices, from a simple refrigerator to a Boeing engine. In the biomedical sciences there is a tendency to assume that only a few kinds of clustering exist (e.g., hierarchical clustering or k-means). However, data mining, knowledge discovery, and machine learning are fields of computer science with a rich variety of theoretical and empirical methods. Much of the power of these methods comes from combining them so as to take advantage of the strengths of each.
The methodology and framework proposed in Arnedo et al. encode a generalized factorization method (GFM) that was designed to identify structural patterns or clusters (substructures) that characterize complex objects (1-8) embedded in databases (6, 7, 9, 10). Unfortunately, complex problems of this kind cannot be solved by a single clustering method; they require combining the advantages of many methods, as implemented in the GFM and summarized below.
First, we utilized the notion of model-based clusters (5, 11, 12), where the centroid/prototype of a cluster is not a point but a model (e.g., line segments or ellipsoids (5, 13, 14); see also the C-varieties algorithm (12)), or a family of models (8), defined as a relation among a cohesive subset of points that meet certain constraints (e.g., relational clustering (12), undirected graphs (12)). When the underlying equations of the models are unknown, they can be estimated from the data by identifying relations between subsets of observations that show similar patterns under a specific subset of relevant descriptive features (attributes) (4, 5, 15). Here we proposed that, in a purely data-driven analysis, these unknown models (grey boxes) can be learned as local relationships between subsets of features and observations, which are termed biclusters. Biclusters are often represented as sub-matrices (4, 5, 15). Other models proposed in this work (black boxes) are those encoded by a particular method devoted to recognizing certain classes of patterns.
Second, to identify a family of models (8) or biclusters, we planned to use NMF (16) factorization algorithms (tensor analysis), which can uncover cohesive data represented as sub-matrices (factors). When sorted and thresholded, each sub-matrix represents a bicluster. Notably, biclustering produces several local models of the data (17), each capturing intrinsic relationships between subsets of observations that share a subset of the features rather than all of them. Diverse biclusters, rather than a uniform explanation of the data, provide a better understanding of the described object; that understanding is diluted when all features are used together in a single global model of the data, as is typical of clustering methods (17).
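As a rough illustration of the factorize-then-threshold idea, the sketch below runs a plain multiplicative-update NMF on a small subjects-by-features matrix and reads one bicluster off each factor. It is a generic stand-in written only for this example: it uses standard NMF rather than nsNMF (16), and the relative threshold of 0.5, the toy matrix and all function names are assumptions, not the published implementation.

```python
import random

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def squared_error(v, w, h):
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Multiplicative-update NMF minimising squared error, V ~ W H."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h

def biclusters_from_factors(w, h, threshold=0.5):
    """One (rows, columns) pair per factor: keep entries whose loading is at
    least `threshold` times the largest loading of that factor."""
    result = []
    for k in range(len(h)):
        loadings = [row[k] for row in w]
        rows = [i for i, x in enumerate(loadings)
                if x >= threshold * max(loadings)]
        cols = [j for j, x in enumerate(h[k]) if x >= threshold * max(h[k])]
        result.append((rows, cols))
    return result

# Toy matrix with two planted blocks (subjects 0-2 x features 0-2,
# subjects 3-5 x features 3-5); values are invented for the example.
v = [[1, 1, 1, 0, 0, 0]] * 3 + [[0, 0, 0, 1, 1, 1]] * 3

w0, h0 = nmf(v, rank=2, n_iter=0)   # the random starting point
w, h = nmf(v, rank=2)
error_start, error_end = squared_error(v, w0, h0), squared_error(v, w, h)
biclusters = biclusters_from_factors(w, h)
print(f"squared error {error_start:.2f} -> {error_end:.4f}")
print(biclusters)
```

On a matrix this clean one would expect the two planted blocks to come back, but NMF depends on its starting point, so that is not guaranteed for every seed. Note also that the factorization finds whatever block structure is present in the matrix and cannot by itself say what caused it.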
Third, we proposed in this work that biclusters can be fuzzy (overlapping; see fuzzy clustering (18)), where voxels and/or subjects can belong to more than one bicluster, as well as possibilistic (non-exhaustive partitions; see possibilistic clustering (12)), where certain voxels and/or subjects may not be assigned to any bicluster and outliers can thus be detected.
Fourth, we demonstrated that biclusters could be identified without choosing a particular number of clusters a priori (11, 13, 14, 19). Setting this parameter to a small number may generate a few large data partitions, here defined as general biclusters, that may underfit the data. In contrast, using a large number of clusters will generate many small partitions (i.e., more granular partitions), here defined as specific biclusters, that may overfit the data (1, 5, 11, 13, 14, 19-21). Although there are many different validity indices (12, 18) that suggest the best number of clusters for a given dataset, they often produce contradictory results (11, 13, 14, 19). This problem is exacerbated when centroid-based (e.g., k-means, NMF) or distribution-based (e.g., EM) clustering is used, because two successive numbers of clusters may generate completely different cluster arrangements (see (12, 18) for a review). Instead of fighting the losing battle of uncovering a single optimal number of clusters, we calculated biclusters from the partitions generated by every number of clusters between 2 and √n (12), where n is the number of observations (subjects). We postponed the selection of the best clusters until all partitions had been examined, and then selected a set of clusters that together provide an optimal description of the sample. These clusters could be chosen from different partitions generated with different numbers of clusters (see consensus clustering (1, 5, 11, 13, 14, 19-21)).
Fifth, formulating the bicluster identification problem over different partitions may produce redundant, excessively specific and/or excessively general biclusters (4, 5, 15, 22, 23). To select optimal biclusters, we proposed a GFM that performs the following consensus clustering strategy. (i) It eliminates all redundant biclusters by comparing the similarity between their underlying sub-matrices using hypergeometric statistics (see Methods and Material in Supplement 1) and selecting only one representative bicluster from each set of equivalent sub-matrices. Biclusters harboring <10% of the total number of subjects were not considered, to avoid a tendency toward singleton biclusters. (ii) From the non-redundant biclusters, the GFM selects all biclusters that are optimal in terms of specificity, generality and diversity. To do so, it selects all non-dominated biclusters, that is, biclusters that are not dominated by (i.e., not worse than) another bicluster in both objectives, specificity and generality (4, 5, 15, 22, 23). Moreover, to ensure diversity, the GFM applies the non-dominance criterion only among biclusters within a neighborhood (22, 23) (i.e., biclusters with a relatively high overlap of subjects and voxels). This multiobjective and multimodal optimization process is analogous to minimum description length methods (24) and was carried out as described in (1-8). (iii) The remaining optimal biclusters are hierarchically organized by inclusion of their subjects into connected or disjoint hierarchies (i.e., subgraphs or subnetworks).
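Two ingredients of this strategy can be sketched in a few lines: a hypergeometric tail probability for the overlap between the subject sets of two biclusters, and a non-dominated (Pareto) filter over specificity and generality. Both are generic textbook versions written for illustration. The exact test and scores used in the GFM are described in the supplement, the neighborhood restriction is omitted here, and the bicluster names and score values below are invented.

```python
from math import comb

def hypergeom_tail(population, size_a, size_b, overlap):
    """P(X >= overlap) when size_b subjects are drawn at random from
    `population` subjects, of which size_a belong to the first bicluster."""
    upper = min(size_a, size_b)
    favourable = sum(comb(size_a, i) * comb(population - size_a, size_b - i)
                     for i in range(overlap, upper + 1))
    return favourable / comb(population, size_b)

def dominates(a, b):
    """True if score tuple `a` is at least as good as `b` in every objective
    and strictly better in at least one (higher is better)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def non_dominated(scores):
    """Names of the candidates that no other candidate dominates."""
    return [name for name, s in scores.items()
            if not any(dominates(t, s)
                       for other, t in scores.items() if other != name)]

# Two biclusters of 5 subjects each, out of 10, sharing all 5 subjects:
# such a complete overlap is unlikely by chance, so they would be
# treated as redundant.
p_overlap = hypergeom_tail(population=10, size_a=5, size_b=5, overlap=5)

# Hypothetical (specificity, generality) scores for four biclusters.
scores = {"B1": (0.9, 0.2), "B2": (0.5, 0.5), "B3": (0.4, 0.4), "B4": (0.1, 0.95)}
front = non_dominated(scores)
print(f"P(overlap >= 5) = {p_overlap:.4f}")
print("non-dominated:", front)
```

In this toy example B3 is dropped because B2 is better on both objectives, while B1, B2 and B4 are kept as different trade-offs between specificity and generality.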
Sixth, after the first step of identifying interesting patterns describing objects embedded in the data, we went one step further by also finding characteristic descriptions for each group, where each group represents a concept or class, using conceptual clustering. In this work we incorporated factors such as the generality and simplicity of the derived bicluster descriptions by (1) validating the relational clustering against independent data-driven descriptions obtained in other domains of knowledge, and (2) incorporating the previously structured knowledge into predictive multiclassifiers that mix machine learning and multiobjective optimization techniques.
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY. -
-