- Mar 2025
-
www.biorxiv.org www.biorxiv.org
-
Monomers generated through structural diffusion appear to only occupy a small region in comparison to both the UniRef50 and PISCES sequences, whereas generative models of sequences appear to more evenly populate the space of similar length natural proteins.
Taxonomic biases of the training data likely also play an important role here. The data sources aren't equal in how they've sampled the protein universe. This is especially apparent when comparing the structure and sequence databases. For example, certain taxa (e.g., humans) are overrepresented in the PDB, while others dominate UniRef.
It's not hard to imagine how the distribution differences in the t-SNE might reflect this, especially given the strong overlap of sequence-based methods with the UniRef samples. Do you know if the same is true for the structure-based methods? If you visualized where, say, PDB proteins are, would there be strong overlap?
Any ideas on how to disentangle approach from the taxonomic makeup of training data?
-
- Feb 2025
-
www.biorxiv.org www.biorxiv.org
-
Passeriformes (perching birds) and non-passeriforms show distinct relative brain size-dependent diversification patterns when fitting BiQuaSSE models, which allow both groups to have different speciation and extinction rates (Fig. 1D-G).
Did you perform any comparisons other than Passeriformes vs. not?
An examination of the 3 parameters (speciation, extinction, diversification) as a function of all taxonomic comparisons seems like a useful and potentially more agnostic analysis. More generally, I wonder about the extent to which taxonomic coarseness might influence sensitivity to detecting "cognitive buffer" over "behavioral drive." Might there be smaller clades of large-brained species that defy the macro-level extinction trends?
-
-
-
The area under the receiver operator curve (AUC-ROC) showed that protein compartments could be predicted with remarkable accuracy (0.83-0.95) across the 12 different compartments (Fig. 1D).
ESM2 performance can be sensitive to the makeup of training data used (e.g. https://www.biorxiv.org/content/10.1101/2024.03.07.584001v1.abstract). Specifically, class biases in training data can be recapitulated in generated sequences.
Given that AUC-ROC varies as a function of compartment type (Fig 1D) and the compartments themselves are associated with diverse input sequence numbers (Fig 1B), I wonder if you examined possible biases in ProtGPS's behavior? Does ProtGPS more readily generate sequences that are suited for certain compartments than others? Is this explainable by the statistical distribution of the training data?
-
- Dec 2024
-
www.biorxiv.org www.biorxiv.org
-
Nine age-matched adult females and adult males each were chosen from each of the four taxa, 72 individuals are included in total in the overall analysis. As somatic organs we included brain (whole brain), heart, liver (left medial lobe), kidney (right) and mammary gland (fourth, right). Note that the mammary glands in mice have similar sizes in both sexes before lactation and are therefore directly comparable.
There's some evidence of sex-specific cell type heterogeneity in organs (e.g. https://pmc.ncbi.nlm.nih.gov/articles/PMC10210449/; https://pmc.ncbi.nlm.nih.gov/articles/PMC7615307/#S1). It seems possible that consistent sex-specific organ heterogeneity might be another explanation for the patterns you see and, if present, could change interpretations/conclusions. E.g., sex-biased differences could arise from cell number variation rather than intrinsic transcriptional differences. How much of a concern is that here?
-
-
www.biorxiv.org www.biorxiv.org
-
46.4% - 86.2% of sequencing reads (mean 67.5%) mapped confidently to the reference genome, and 32.5% - 75.6% (mean 52.8%) mapped confidently to the reference transcriptome (Table S2). We obtained a cell / gene count matrix for each sample, which consisted of 1,133-8,226 cells (mean 4,498 cells), with a means of 33,058 reads, 2,255 UMI counts, and 713 detected genes (Figure S1A-C). In total, we detected in each sample between 20,877 and 26,535 genes (mean 24,195 genes).
Does mapping percentage or gene count vary with phylogeny and/or ecology? Put differently, is there any reason to worry that technical variation here might influence your sensitivity for detecting cell type abundance, especially given the low number of replicates per species?
-
- Oct 2024
-
www.biorxiv.org www.biorxiv.org
-
The phylostratigraphy map of M. musculus and D. rerio was constructed by comparing 22,769 M. musculus and 25,787 D. rerio protein sequences with the protein sequence database by blastp algorithm V2.9.0 with a 10⁻³ e-value threshold[101].
Can you expand on why you chose blastp? There are a number of other (likely more sensitive) alignment methods. Given that many of the analyses in this manuscript rely on specific assumptions with respect to evolutionary age, it seems that identifying the most accurate approach possible would be useful.
Also, why use this specific e-value threshold for all proteins? Proteins often vary in e-value distributions due to differences in sequence length/composition, evolutionary history, etc. Methods that account for this (e.g. OrthoFinder) might be worth exploring.
-
- Sep 2024
-
www.biorxiv.org www.biorxiv.org
-
The ability to move strategically allows these algae to seek desirable niches for growth and survival, especially in extreme habitats where resources are scarce or conditions are rapidly changing.
Is C. pacifica's capacity to live in higher-salinity environments accompanied by differences in motility patterns relative to non-extremophiles? Given that the Reynolds number varies with salinity, it might be enlightening to measure C. pacifica's speed distribution at different salinities. I wonder if these experiments might uncover more interesting axes of diversity within C. pacifica that differentiate it from species like C. reinhardtii.
-
- Aug 2024
-
www.ncbi.nlm.nih.gov www.ncbi.nlm.nih.gov
-
A schematic showing the evolution of higher activity variants with EVOLVEpro. The mutagenesis landscape of proteins is often conceptualized as a complex terrain with numerous potential paths. Shown here is a gray road that conceptualizes the protein mutagenesis landscape where traversing upwards results in higher protein activity and traversing downwards reduces protein fitness. Traditional frameworks of evolutionary plausibility attempt to navigate this terrain based on natural selection, which is constrained by historical and environmental factors.
In the manuscript, "fitness" generally refers to landscapes learned by pLMs. However, at other times, it is used to describe the actual landscapes traversed by evolution (via processes like natural selection). Given the limitations of pLMs - including those you cover in the introduction - it feels dangerous to conflate these two. It is far from established that language models are able to infer the true structure of evolutionary processes, much less model the complex activities of natural selection.
This feels important to note since the discontinuity between fitness and trait distributions has been recognized for a long time (e.g. Fisher 1930). Many factors contribute to this relationship, both at the individual gene/protein level and at the level of genetic interactions. It is likely that variation in relationships between pLM fitness/activity will also be affected by multiple such factors (as evidenced by the differences observed even here across the 5 proteins of focus). It is also likely that these will at least somewhat differ from the factors influencing empirical fitness landscapes. Delineating these differences clearly seems to be useful for future model development/refinement.
-
- Jul 2024
-
www.biorxiv.org www.biorxiv.org
-
Strikingly, the resulting landscape was dominated by generated proteins, which comprised 94.1% of the total phylogenetic diversity (as measured by cumulative branch length) and resulted in a 10.3-fold increase in diversity relative to the entire CRISPR-Cas Atlas (Fig. 2b). Novel phylogenetic groups were distributed across the tree, suggesting that the model has captured the full diversity of Cas9 and is not overfitting to any particular lineage.
I find it hard to interpret the importance of these results without more context.
For example, how surprising is it to see this enrichment given the initial n of natural and generated proteins?
How might decisions with respect to tree construction affect the branch length distribution? It seems possible that you would get a different outcome if you varied the mmseqs parameters or implemented different criteria for choosing representative proteins.
Furthermore - though novel phylogenetic groups are distributed throughout the tree - it would be interesting to know if the overall distribution across clades is predicted by the abundance of natural proteins across the tree. I.e. do clades with more natural proteins in the training data tend to produce more generated proteins?
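One simple way to ask the last question is a rank correlation between per-clade counts of natural and generated proteins. Below is a minimal sketch using only the Python standard library; the clade counts are made-up placeholders, and how clades are delimited is an assumption you would need to settle first.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical counts per clade (not taken from the manuscript).
natural_per_clade = [120, 45, 300, 10, 80]
generated_per_clade = [900, 400, 2500, 150, 700]

rho = spearman(natural_per_clade, generated_per_clade)
print(f"Spearman rho between natural and generated counts: {rho:.2f}")
```

A rho near 1 would suggest that the distribution of generated proteins largely tracks the abundance of natural proteins, in which case "novel groups across the tree" would be less surprising.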
-
-
www.biorxiv.org www.biorxiv.org
-
We ran the analysis using a rooted time-calibrated species tree obtained from timetree.org 33.
What was the rationale for using a tree from timetree.org as opposed to inferring one from the gene families?
I imagine that comparing the effects of using timetree vs. an inferred tree on CSUBST outputs would be enlightening. Such a comparison could be an empirical way to assess the effects of topological error in this data set (and would be a nice complement to some of the analyses in Fukushima and Pollock).
-
- May 2024
-
www.biorxiv.org www.biorxiv.org
-
where nj is the raw sequence count for species j, d(i, j) is the time to last common ancestor between species i and j collected from the TimeTree of Life resource (Kumar et al., 2022), and α ∈ R≥0 is a hyperparameter used to scale d appropriately. Under the assumption that mutations occur at a fixed rate, [inline equation] gives the expected overlap in sequence between two species’ orthologs, to approximate the effective sequence counts they contribute to each other.
It's great that even with the use of fixed rates you see a substantial increase in fraction of bias explained. Since mutation rates obviously do vary, I wonder just how much better you might do using a model that doesn't explicitly fix them...
-
Under the assumption that mutations occur at a fixed rate, [inline equation] gives the expected overlap in sequence between two species’ orthologs, to approximate the effective sequence counts they contribute to each other.
What does it look like if you just use the branch lengths from the phylogeny to do this weighting? I would guess you get at least some increase in the Spearman correlations and it's a straightforward approach.
-
- Mar 2024
-
www.biorxiv.org www.biorxiv.org
-
For each behavior, all individuals seem to exhibit very similar bout duration distributions.
It is hard not to notice that the distributions for certain states (e.g. Meerkat vigilant/resting state) are noisier than others. It would be interesting to see a comparison of the variance of these distributions as a function of species and/or state to see if the claim in this sentence is statistically supported.
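As a rough sketch of the kind of comparison I mean, one could compute a Brown-Forsythe statistic (Levene's test on absolute deviations from each group's median) across individuals for a given state, with a permutation p-value. This is my suggestion rather than anything in the manuscript; the bout durations below are invented, and the function names are hypothetical.

```python
import random
from statistics import mean, median

def brown_forsythe(groups):
    """Brown-Forsythe W: one-way ANOVA F on absolute deviations from group medians."""
    z = [[abs(v - median(g)) for v in g] for g in groups]
    n_total = sum(len(g) for g in z)
    k = len(z)
    grand = mean([v for g in z for v in g])
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in z)
    within = sum((v - mean(g)) ** 2 for g in z for v in g)
    return (n_total - k) / (k - 1) * between / within

def permutation_p(groups, n_perm=2000, seed=0):
    """Permutation p-value for W, shuffling median-centred values across groups."""
    rng = random.Random(seed)
    observed = brown_forsythe(groups)
    # Centre each group first so that only spread, not location, is tested.
    pooled = [v - median(g) for g in groups for v in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if brown_forsythe(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical bout durations (s) for three individuals in one behavioral state.
bouts = [
    [4.1, 5.0, 4.7, 5.3, 4.9, 5.1],
    [4.8, 5.2, 4.6, 5.0, 5.4, 4.5],
    [1.2, 9.8, 3.0, 7.5, 0.9, 8.8],  # visibly noisier individual
]
print("W =", brown_forsythe(bouts), "p =", permutation_p(bouts))
```

Repeating this per state and per species would give a table of variance comparisons to set against the "very similar distributions" claim.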
-
- Feb 2024
-
www.biorxiv.org www.biorxiv.org
-
We hypothesize that sea robins initially developed fin ray-like legs for locomotion. Ancestral organs then evolved limited sensory capability to facilitate manipulation of the visible substrate in search of food. Finally, evolution of sensory papillae further specialized legs to localize and uncover buried prey.
How much history/ecological data is available for these species? It could be interesting to pair the phylogenetic patterns with other trait data to explicitly test different evolutionary hypotheses. E.g., is there a relationship with prey type, substrate, depth, or biotic diversity?
-
To test this ability, we developed a simple behavioral assay in which sea robins (Prionotus carolinus) were housed in a controlled tank with either mussels or capsules containing crude or filtered mussel extract buried in sand without visual cues (Fig. 1a, b, Supplementary movie 1). Sea robins alternated between short bouts of swimming and walking (Fig. 1b) and appeared to “scratch” at the sand surface with their legs while walking, which we hypothesized represented sensory behavior.
Do these behaviors vary at all as a function of what prey are used? I'm guessing you tested squid and crabs with P. carolinus as you did with P. evolans?
Presumably motile prey (squid/crabs) would give off a different set of cues than less/non-motile prey (mussels)? Specifically, I wonder if there is a tradeoff between chemo- and mechanosensation that depends on the amount of movement? Examining this relationship could be a potential route into the neural computations underlying digging behavior...
-
- Jan 2024
-
www.biorxiv.org www.biorxiv.org
-
We classified individual haploid yeast cells into five different cell cycle stages (M/G1, G1, G1/S, S, G2/M) via unsupervised clustering of the expression of 787 cell-cycle-regulated genes30 in combination with 22 cell-cycle-informative marker genes (Figures 1B, S2 and S3)
How sparse is this matrix? Given an average of ~1,500 UMIs and ~800 cell-cycle genes, I'm assuming the distribution of expression for the cell-cycle genes is quite distributed/uneven across the cells?
If very uneven, I wonder if some of the cell cycle designations might be driven by sparsity as opposed to canonical expression signatures associated with each stage? One way to parse this out might be to look at the loadings of the PCs used as input to clustering/UMAP/etc. Do any show signatures of extreme sparsity (e.g. binary expression of only one or several genes)?
More broadly, it might be helpful to report the average # of cell-cycle genes detected in each cell.
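The two summary numbers I am asking about are cheap to compute. A minimal sketch, assuming a cells-by-genes UMI count matrix restricted to the cell-cycle genes (the toy matrix here is invented):

```python
def sparsity(matrix):
    """Fraction of entries in a cells-by-genes count matrix that are zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for count in row if count == 0)
    return zeros / total

def genes_detected_per_cell(matrix):
    """Number of genes with at least one UMI in each cell (one value per row)."""
    return [sum(1 for count in row if count > 0) for row in matrix]

# Toy matrix: 3 cells (rows) x 4 cell-cycle genes (columns), hypothetical UMI counts.
counts = [
    [0, 3, 0, 1],
    [0, 0, 0, 2],
    [5, 0, 0, 0],
]

detected = genes_detected_per_cell(counts)
print("fraction of zeros:", sparsity(counts))
print("genes detected per cell:", detected)
print("mean genes detected:", sum(detected) / len(detected))
```

Reporting the distribution of genes detected per cell, split by assigned cell cycle stage, would show whether any stage is dominated by very sparse cells.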
-
- Dec 2023
-
www.biorxiv.org www.biorxiv.org
-
Second, we evaluated whether sequences of codas reflect longer-term trends. To do so, we collected coda triples of the same discrete coda type, and measured the correlation between tempo drift across adjacent pairs. We found a significant positive correlation, compared to a null hypothesis that drift between adjacent pairs is uncorrelated (test: Spearman’s rank-order correlation (two-sided), r(2586) = 0.57, p = 2e−220, 95% CI = [0.54, 0.60], n = 2588). Thus, rubato is distributed across sequences of multiple codas. Finally, we evaluated whether rubato is perceived and controlled by measuring whales’ ability to match their interlocutors’ coda durations when chorusing. We measured the average absolute difference in duration between (1) pairs of overlapping codas from different whales, and (2) pairs of non-overlapping codas of the same discrete coda type. Durations are significantly more closely matched for overlapping codas (0.099s on average) than would be expected under a null hypothesis that chorusing whales match only discrete coda type (which would give a drift of 0.129s on average) (test: permutation test (one-sided), p = 0.0001, n = 908; see Supplementary Section 6).
I wonder if calculating the autocorrelation of coda durations might be a nice complementary measure here. Autocorrelation could give you a sense of the time scale over which the rubatos decay and, seemingly, might also provide a sense for the timescale of longer-term trends.
Similarly, I wonder if cross-correlation might be useful for comparing the information quantity shared with interlocutors? The correlation value would be interesting, in addition to any patterns of temporal lag between codas. It might be a comprehensive metric for comparing the similarities of codas over time (as opposed to just looking at overlapping codas).
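To make the two suggestions concrete, here is a minimal sketch of lagged autocorrelation and cross-correlation on coda duration sequences. It assumes durations are indexed by coda order rather than by clock time, which is a simplification; the values and variable names are hypothetical.

```python
import math
from statistics import mean

def autocorr(x, lag):
    """Sample autocorrelation of sequence x at a non-negative integer lag."""
    m = mean(x)
    denom = sum((v - m) ** 2 for v in x)
    num = sum((x[t] - m) * (x[t + lag] - m) for t in range(len(x) - lag))
    return num / denom

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    sa = math.sqrt(sum((u - ma) ** 2 for u in a))
    sb = math.sqrt(sum((v - mb) ** 2 for v in b))
    return cov / (sa * sb)

def cross_corr(x, y, lag):
    """Correlation between x[t] and y[t + lag] over the overlapping segment."""
    n = min(len(x), len(y) - lag)
    return pearson(x[:n], y[lag:lag + n])

# Hypothetical coda durations (s) for two whales; whale_b echoes whale_a one coda later.
whale_a = [0.30, 0.32, 0.35, 0.33, 0.31, 0.29, 0.30, 0.34]
whale_b = [0.31] + whale_a[:-1]

for lag in range(4):
    print(lag, round(autocorr(whale_a, lag), 3), round(cross_corr(whale_a, whale_b, lag), 3))
```

The lag at which the autocorrelation falls toward zero gives a rough timescale for rubato, and the lag with the highest cross-correlation indicates how quickly one whale tracks the other.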
-
- Nov 2023
-
www.biorxiv.org www.biorxiv.org
-
cellPLATO performs UMAP on morphological/motility parameters then uses HDBSCAN cluster analysis to define behavioural clusters
It is hard to tell from the text if HDBSCAN is run on the behavioral parameters or on the UMAP output. If the latter, then I would take extreme caution in thinking about the generalizability of the method given the numerous issues with clustering on nonlinear manifolds. Either way, it would also be helpful to report more information on what the specific morphological/motility parameters are and any normalizations/manipulations that were done on them prior to UMAP and clustering.
Also, any justification for choosing UMAP and HDBSCAN would be useful.
-
UMAPs 1, 2 and 3
This might be a slightly confusing way to refer to UMAP dimensions (is it accepted that a UMAP dimension = a single 'UMAP'?)
-
We first investigated two fundamental measurements of cell migration and morphology, namely cell speed and cell area. When comparing conditions, the median migration speed of NK cells on VCAM-1 was 3.48 μm/min and 2.54 μm/min on ICAM-1 (Fig. 2A). The effect size distribution for VCAM-1 was greater, demonstrating statistical significance (p < 0.00001) (41), and its distribution did not overlap with the control condition (ICAM-1). NK cells migrating on VCAM-1 also had smaller median cell area (114 μm²) compared with ICAM-1 (175 μm²) (Fig. 2B), with nonoverlapping effect size distribution (p < 0.00001).
Does donor identity have any effect here? Do the donors differ at all in their speed/area distributions and effect sizes? This would be useful to know here and for many other analyses presented in the manuscript. More broadly, it is a little hard to assess the generalizability of the behavioral results presented here (including the cellPLATO analyses) without knowing more about the influence of experimental variables like this.
-
- Oct 2023
-
www.biorxiv.org www.biorxiv.org
-
We next adapted an experimental paradigm used to study prey capture in zebrafish for these other species (Mearns et al., 2020). Individual larvae were placed in chambers with prey items (either artemia or paramecia).
These species are ecologically diverse (e.g. benthic vs. riverine) and likely possess corresponding sensory differences. Given this, it seems possible that their prey capture behaviors may vary as a function of sensory environment. For example, benthic species may display different repertoires in dark conditions.
Have you tested the effect of varying the sensory environment on prey capture behaviors? Is there intra-specific variation? Are species-specific behaviors invariant? Whatever the outcome, these experiments would help refine the picture of how these behaviors evolved and could lead to more specific sensorineural hypotheses.
-
- Sep 2023
-
www.biorxiv.org www.biorxiv.org
-
Finally, 14 convergent amino acid substitutions with high confidence among known echolocating mammalian lineages were obtained (Table S3), and these sites were found to be effective in differentiating echolocating and nonecholocating mammals (Fig. 1A; Fig. S2).
I wonder if it might be worth including a brief comment on the identity of these genes and/or their potential relationships with echolocation? Do they seem to be sensible functional hits? Seeing as the echolocation score appears to work quite well, it would be interesting to know a bit more about any molecular context for these predictive loci.
-
-
www.biorxiv.org www.biorxiv.org
-
phonotypes
phenotypes
-
(1) at least two orthology prediction algorithms agree the human and worm genes are orthologs; (2) the WormBase (version WS270) (Harris et al., 2020) gene description includes either ‘neuro’ or ‘musc’ (this captures variants of neuronal, neural, muscle, muscular etc.);
I'm wondering how varying these criteria would affect the number of genes detected, and which ones.
For criterion 1, what was the rationale for choosing agreement between ≥2 algorithms? From Fig 1C, it's hard to tell if there is a relationship between % homology and # of agreeing algorithms. What benefit do you get from using this cutoff? What are the tradeoffs? It might be helpful to include a figure similar to 1C, but including the full set of genes before filtering, and to walk through the outcomes of different cutoffs.
Similarly, for criterion 2, what type of/how many hits do you get if you don't select for 'neuro' or 'musc'? Is there any chance that, though you are using a behavioral readout, genes not annotated 'neuro'/'musc' might still contribute to a behavioral phenotype (e.g. via pleiotropy/epistasis)? Would be useful to include a statement of your thinking on this!
-
- Jun 2023
-
www.biorxiv.org www.biorxiv.org
-
Our study is unique in that instead of using gene expression values directly, we use principal components calculated from gene expression values as our phylogenetic characters. In addition, we remove later principal components that may represent highly heterogeneous cell-specific signal.
Seems like it would be worth including a direct comparison of Brownian motion to other evolutionary models. The computational overhead shouldn't be very high and, if the comparison supports the use of Brownian motion, it could be a more compelling argument than this.
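Once each model has been fit to the same characters, the comparison itself is a small calculation. A minimal sketch using AIC and Akaike weights; the log-likelihoods and parameter counts below are placeholders I made up, and the actual fitting (e.g. Brownian motion vs. Ornstein-Uhlenbeck vs. early burst) would come from whatever comparative-methods software you already use.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def akaike_weights(aic_by_model):
    """Relative support for each model, normalised to sum to 1."""
    best = min(aic_by_model.values())
    rel = {name: math.exp(-0.5 * (value - best)) for name, value in aic_by_model.items()}
    total = sum(rel.values())
    return {name: r / total for name, r in rel.items()}

# Hypothetical fits: (maximised log-likelihood, number of free parameters).
fits = {
    "brownian_motion": (-412.3, 2),
    "ornstein_uhlenbeck": (-410.9, 3),
    "early_burst": (-412.1, 3),
}

scores = {name: aic(ll, k) for name, (ll, k) in fits.items()}
weights = akaike_weights(scores)
for name in sorted(scores, key=scores.get):
    print(f"{name}: AIC = {scores[name]:.1f}, weight = {weights[name]:.2f}")
```

If Brownian motion carries most of the weight across principal components, that would be a stronger justification for using it than an argument from convenience.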
-
This dataset was chosen for the uniformity of sampling, consistency of lab and sequencing protocols, the high quality of its cell type annotations, and the abundance of genomic resources available for the five model species. UMI counts were downloaded as CSV files from the NCBI GEO database (GSE146188). A file containing meta-data, including cluster assignment and cell type labels, was obtained from the Broad Institute Single Cell Portal
I wonder about the effect of scRNA-seq methodology on downstream results here. How do droplet-based approaches (like that used for van Zyl et al.) compare to others (e.g. Smart-seq2) when generating cell type trees? There can be substantial differences in the # of genes detected by these methods, with droplet-based approaches often generating datasets with fewer genes. Does this affect the estimation of rank and/or the outputs of the PCA you use for evolutionary modeling? It seems like this would be an important issue to solve since droplet-based methods are essentially downsampling informative data in a nonrandom way that may bias evolutionary inference.
TLDR: are cell tree topologies consistent independent of sequencing methodologies?
-
-
www.biorxiv.org www.biorxiv.org
-
we designed and constructed a low-cost parallel imaging platform capable of measuring C. elegans growth for 60 individual animals simultaneously over the course of their ≈ 70 hour development at a temporal resolution of 0.001 Hz, resulting in a time series of ≈ 200 observations per animal. In addition to length and area measured automatically, egg hatching, and first egg-laying by mature adults are manually recorded.
Is there a reason for the coarse sampling at 0.001 Hz? Mechanical constraints of the XY plotting robot? Data size constraints? Obviously faster sampling would open up locomotion/behavior as a read out of other possibly interesting, orthogonal phenotypes (with their own developmental modes). Given the video data you are already collecting, it seems like if faster sampling is possible this would be a relatively straightforward - and informative - set of phenotypes to add in?
-
- May 2023
-
www.biorxiv.org www.biorxiv.org
-
we designed and constructed a low-cost parallel imaging platform capable of measuring C. elegans growth for 60 individual animals simultaneously over the course of their ≈ 70 hour development at a temporal resolution of 0.001 Hz, resulting in a time series of ≈ 200 observations per animal. In addition to length and area measured automatically, egg hatching, and first egg-laying by mature adults are manually recorded.
Is there a reason for the coarse sampling at 0.001 Hz? Mechanical constraints of the XY plotting robot? Data size constraints? Obviously faster sampling would open up locomotion/behavior as a read out of other possibly interesting, orthogonal phenotypes (with their own developmental modes). Given the video data you are already collecting, it seems like if faster sampling is possible this would be a relatively straightforward - and informative - set of phenotypes to add in?
-
- Mar 2023
-
www.biorxiv.org www.biorxiv.org
-
This inter-annotator variability can be associated with (a) subjective differences of behavior definition among human labelers (b) varying level of annotator’s expertise, and (c) training within and across labs.
What about intra-annotator variability? Seemingly this could also be an important contributor to inter-annotator variation. Might it make sense to compare multiple annotations from a single annotator and use the average as the basis for the ethograph generation?
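One way to quantify intra-annotator variability is Cohen's kappa between two labeling passes by the same person over the same frames. A minimal sketch, with invented frame labels (this is my suggested metric, not one used in the manuscript):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences of equal length.

    Undefined (division by zero) if both sequences use a single identical label.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(count_a) | set(count_b)
    # Agreement expected by chance from each pass's marginal label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in categories) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical per-frame labels from the same annotator on two separate days.
pass_1 = ["turn", "turn", "none", "none", "turn", "none"]
pass_2 = ["turn", "none", "none", "none", "turn", "none"]

print("intra-annotator kappa:", round(cohens_kappa(pass_1, pass_2), 3))
```

Comparing this within-annotator kappa against the between-annotator value would show how much of the inter-annotator variability is simply labeling noise.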
-
In order to test inter-annotator variability, we use generated a set of single mouse behavior classifiers for two simple behaviors, left and right turn. We inferred behavior from all four classifies on a large set of videos and compared the two pairs of classifiers from each annotator
How do these comparisons look for other (potentially more 'complex') behaviors? Presumably, turning should be among the more straightforward behaviors for a human to recognize. Do the patterns of inter-annotator agreement change with other behaviors (e.g. grooming) and, if so, would accounting for this increase/decrease performance of the neural network? This is a general risk when using human-based annotations for behavioral classification and it seems to me not easily solved by focussing on a single behavior.
-
- Feb 2023
-
www.biorxiv.org www.biorxiv.org
-
To assess nuclear density, we measured the average distance from each nucleus to its nearest neighbour
I wonder if it might be useful to do some analyses of the specific spatial arrangement of nuclei across the centrifugation experiments. While it is sensible that density may be the primary signal driving cellularization, it is also interesting to consider that there may be higher-order relationships in cell distribution or spatial organization that are predictive of the different outcomes (i.e., Flip, lysis, irregular invaginations), since centrifugation is a relatively forceful and disruptive approach. Spatial relationships of nuclei could theoretically be extracted by segmentation/registration, followed by some basic statistical comparisons (e.g. PCA) to uncover relationships between the images.
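As one simple step beyond mean nearest-neighbour distance, the same distances can be turned into a Clark-Evans ratio, which compares the observed mean to the value expected for a random (Poisson) arrangement at the same density. This is a sketch of a technique I am suggesting, not the authors' analysis; it assumes 2D nucleus centroids from segmentation and a known imaging area, and it ignores edge correction.

```python
import math

def nearest_neighbour_distances(points):
    """Distance from each 2D point to its closest other point."""
    distances = []
    for i, (xi, yi) in enumerate(points):
        distances.append(min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points) if j != i
        ))
    return distances

def clark_evans_ratio(points, area):
    """Observed mean nearest-neighbour distance over the Poisson expectation.

    Roughly 1 for random, below 1 for clustered, above 1 for regular spacing.
    """
    d = nearest_neighbour_distances(points)
    observed = sum(d) / len(d)
    density = len(points) / area
    expected = 0.5 / math.sqrt(density)
    return observed / expected

# Hypothetical centroids: a perfectly regular 3 x 3 grid with unit spacing.
grid = [(float(x), float(y)) for x in range(3) for y in range(3)]

print("mean NN distance:", sum(nearest_neighbour_distances(grid)) / len(grid))
print("Clark-Evans ratio:", clark_evans_ratio(grid, area=9.0))
```

Two samples with the same mean nearest-neighbour distance can still differ in this ratio or in the spread of the distances, which is the kind of higher-order signal that might separate the Flip, lysis, and irregular invagination outcomes.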
-