- Jul 2018
-
europepmc.org
-
On 2013 Oct 23, Stephen Turner commented:
This paper warrants a closer look for both its strategy and its implementation for pulling microbial next-generation sequencing (NGS) reads out of a highly contaminated host background. IMSA (integrated metagenomic sequence analysis) is a computational pipeline that does this, and it is flexible enough to let the user select and update the databases used and set the stringency for removing host sequence. It also has some decent post-processing and output functionality.
The algorithm removes host sequences in a series of steps, each more computationally intensive than the last (e.g. Bowtie, then BLAT, then BLAST). After that, it BLASTs everything that remains against NCBI/nt. It scores reads in a simple but intuitive manner: a read that maps perfectly and uniquely to one sequence in the reference database gets a score of 1; a read that maps perfectly to two conserved regions scores 0.5 for each; a read that maps to three scores 0.333 for each; and so on. It then outputs a list of taxonomic IDs and annotated FASTQ files of filtered reads aligning to those IDs, which can then be used downstream (assembly, etc.).
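The scoring rule described above can be sketched in a few lines of Python. This is only an illustration of the 1/n idea, not IMSA's actual code: the function name `score_taxa`, the input layout (read ID mapped to the list of taxa for its equally good best hits), and the taxon labels are all assumptions made for the example.

```python
from collections import defaultdict


def score_taxa(read_hits):
    """Sum fractional read scores per taxon.

    read_hits is a hypothetical structure: read ID -> list of taxon labels,
    one entry per equally good best hit. Each hit of a read with n hits
    contributes 1/n, so every read adds a total of 1 across all taxa.
    """
    scores = defaultdict(float)
    for taxa in read_hits.values():
        if not taxa:
            continue  # reads with no hit contribute nothing
        weight = 1.0 / len(taxa)
        for taxon in taxa:
            scores[taxon] += weight
    return dict(scores)


# Made-up example: one unique read, one read with two hits, one with three.
example_hits = {
    "read1": ["taxon_A"],
    "read2": ["taxon_A", "taxon_B"],
    "read3": ["taxon_A", "taxon_B", "taxon_C"],
}
example_scores = score_taxa(example_hits)
```

Here `taxon_A` accumulates 1 + 0.5 + 0.333, so taxa supported by unique reads rank above taxa that only share reads in conserved regions.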
They ran this pipeline on a combined set of viral reads from two different human papillomaviruses (HPVs) in two different cell lines and were able to distinguish the two strains and pull out reads from those strains at the expected proportions. Interestingly, they filter against both the genome and the transcriptome. They found that when they filtered against RefSeq RNAs alone, their read coverage for certain regions in HPV dropped to zero. This is because RefSeq still contains annotation errors: some genes annotated as human actually contain HPV sequence, so reads matching them are discarded as host.
In addition to outputting a breakdown of what's in the sample and an annotated FASTA file of sequences that aligned to taxa, the pipeline also has tools to output data in a format for phylogenetic tree analysis with Treeview, Cluster, etc. With respect to performance, they claim that 50 bp single-end reads can be processed at a rate of 4.5 hours per million reads per node used.
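For a rough sense of what that throughput figure means in practice, here is a back-of-the-envelope calculation. It assumes the claimed rate scales linearly with read count and divides evenly across nodes, which the comment does not state; the function name `estimated_hours` and the example numbers are made up for illustration.

```python
def estimated_hours(n_reads, n_nodes, hours_per_million_per_node=4.5):
    """Estimate wall-clock hours under an assumed linear-scaling model.

    The default rate is the claimed 4.5 hours per million reads per node.
    """
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    return hours_per_million_per_node * (n_reads / 1_000_000) / n_nodes


# Hypothetical run: 20 million reads spread over 10 nodes.
print(estimated_hours(20_000_000, 10))  # prints 9.0
```

Under the same assumption, a single node would need about 90 hours for that run.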
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY.
-