- Jul 2018
europepmc.org
On 2013 Oct 25, Stephen Turner commented:
This paper examines a wide range of the available algorithms and software for sequence classification as applied to metagenomic data (i.e., given a sequence, determine its origin). The authors comprehensively evaluated the performance of over 25 programs that fall into three categories (alignment-based, composition-based, and phylogeny-based) on several different datasets where the composition was known, using a common set of evaluation criteria (sensitivity = number of correct assignments / number of sequences in the data; precision = number of correct assignments / number of assignments made).

They concluded that the performance of particular programs varied widely between datasets for reasons including highly variable taxonomic composition and diversity, the level of sequence representation in the underlying databases, read length, and read quality. The authors specifically point out that even though some methods lack sensitivity (as they've defined it), they are still useful because they have high precision. For example, marker-based approaches (like Metaphyler) might classify only a small number of reads, but they're highly precise, and that may still be enough to accurately recapitulate organismal distribution and abundance.

Further, the authors note that you can't ignore computational requirements, which varied by orders of magnitude between programs. Selecting the right method depends on the goals (is sensitivity or precision more important?) and the available resources (time and compute power are never infinite; these are real constraints in practice). Overall, this paper was a great demonstration of how one might evaluate many different tools ostensibly aimed at solving the same problem but functioning in completely different ways.
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY.
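To make the sensitivity and precision definitions in the comment concrete, here is a minimal Python sketch (not code from the paper; the function, data structures, and values are hypothetical) that scores a classifier's per-read assignments against a known truth set using exactly those definitions.

```python
# Minimal sketch of the evaluation metrics described in the comment; all names
# and values here are hypothetical and for illustration only.

def evaluate(truth, assignments):
    """Return (sensitivity, precision) for per-read taxonomic assignments.

    truth:       dict mapping read ID -> true taxon (one entry per sequence in the data)
    assignments: dict mapping read ID -> assigned taxon (unclassified reads are absent)
    """
    correct = sum(1 for read_id, taxon in assignments.items()
                  if truth.get(read_id) == taxon)
    sensitivity = correct / len(truth)                               # correct / sequences in the data
    precision = correct / len(assignments) if assignments else 0.0   # correct / assignments made
    return sensitivity, precision


if __name__ == "__main__":
    truth = {"read1": "Escherichia coli",
             "read2": "Bacillus subtilis",
             "read3": "Staphylococcus aureus"}
    # A conservative, marker-style classifier: it assigns only one read, but correctly.
    assignments = {"read1": "Escherichia coli"}
    sens, prec = evaluate(truth, assignments)
    print(f"sensitivity = {sens:.2f}, precision = {prec:.2f}")       # 0.33 and 1.00
```

Under these definitions, a tool that assigns only a few reads scores low on sensitivity but can still show high precision, which is exactly the trade-off the comment highlights for marker-based approaches.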