4 Matching Annotations
  1. Last 7 days
    1. DNA is sequenced to depths targeted to maximize diversity capture using a combination of Oxford Nanopore and Illumina for long and short reads, respectively, allowing for the generation of high quality and high contiguity genomic assemblies.

      The combination of ONT and Illumina is great. I wondered whether you have encountered a tradeoff between maximizing the diversity captured (i.e., retaining reads that genuinely differ) and minimizing the retention of reads whose sequencing errors make them look artificially dissimilar. Presumably, walking the line between the two is critical to avoid inflating diversity estimates and to retain only confident, 'true' standing diversity. I would love to know more about how you navigate this!

    2. the Basecamp Research supply chain allows royalty disbursements to be triggered at the point of data use and not only at the point of final product commercialisation

      I believe that a profit-sharing model for the country of origin of biodiversity has to be central to the commodification of biological diversity. I am curious about a couple of practical aspects of your implementation. Firstly, how do you determine the 'value' of data at the point of use, and therefore the royalties owed, prior to commercialization (are there minimum royalties that are immediately owed to the country of origin at the point of use)? Secondly, I couldn't find a description in the manuscript of what constitutes a royalty vs. profit from the use of a sequence. When you say that 100% of royalties will go to data source A when a natural sequence is used, how does this compare with the profit gleaned from products developed from that sequence? Without this clarity, it is hard to tell how much countries are truly being compensated (my impression is that 'royalties' models of compensation have rightly long been criticized in other sectors for their opacity and for underweighting small to mid-size contributors).

    3. Each sequence within BaseData is also embedded within a deep metadata layer capturing environmental, chemical, and physical parameters, as well as genomic and metagenomic context.

      Given that the strength of biological foundation models will lie in their breadth of understanding, how do you balance sampling previously unsampled or sparsely sampled environments (which presumably contribute substantially to new taxa/sequences) against sampling less unique environments that exhibit more homogeneous taxonomic diversity, in order to capture standing patterns of biological variation? I would imagine that capturing that standing variation is also an important component of understanding biology as a whole. Presumably, if novel environments are sampled more selectively than others, models will fail to generalize patterns and will overweight the prevalence of novelty?

    4. This novelty extends beyond sequence space into taxonomic space: BaseData includes over 1 million new species, as defined by unique Operational Taxonomic Units, not found in GTDB or OMG, highlighting its unprecedented contribution to species-level discovery

      Increasing the breadth of sampling to this extent is fantastic. I was wondering whether you have an estimate of the increase in phylogenetic branch length across the data resulting from the addition of these taxa. I'm also curious whether these 'species' are all microbes or whether you also pick up DNA from macro-organisms and, if so, how the increase in 'traditionally' described species compares with the increase measured using OTUs.