Hypothesis

67 Matching Annotations

Jan 2026
www.biorxiv.org

The use of cross-validation has overestimated the value of genomic selection in plant breeding

5
1. george_sandler 13 Jan 2026
  
  in Arcadia Science
  
  Instead, I show that analytical expressions and computational simulations are more informative about the likelihood of success of GS than cross-validation, and can be more effectively employed to evaluate GS program design.
  
  I really enjoyed reading this preprint. I think similar lessons about what exactly CV evaluates, depending on your strategy, have been (re)learned several times across deep learning in biology.
2. george_sandler 13 Jan 2026
  
  in Arcadia Science
  
  predictand
  
  typo
3. george_sandler 13 Jan 2026
  
  in Arcadia Science
  
  foundations plant
  
  typo
4. george_sandler 13 Jan 2026
  
  in Arcadia Science
  
  Thus the gains in Figure 1 (particularly B and C) are likely optimistic. The low predicted gains from Genomic Prediction in small, diverse populations (second panel of Figure 1D) are particularly concerning, as this indicates that such Genomic Prediction models are likely to have very low accuracy.
  
  There are also other secondary costs to aggressive recurrent GS, such as drift in traits for which it isn't feasible to build GP models.
5. george_sandler 13 Jan 2026
  
  in Arcadia Science
  
  Genomic Prediction, except under limited narrow contexts, perhaps including CV2 and CV0 breeding scenarios where the goal is closer to phenotype prediction than breeding value prediction.
  
  I understand the rationale for why the current study is set up to ignore CV0/CV2 scenarios; however, I would argue that CV2 is likely one of the more promising study designs for genomic prediction, at least for commercial breeding programs. By leveraging sparse testing + GP, one can test material across a broader environmental footprint for a fixed budget. The GP-based breeding values from such a study design are particularly valuable because they allow more relevant selection decisions to be made, as commercial breeding programs are often aiming for broad-acreage products as the end goal.

Annotators

george_sandler

URL

biorxiv.org/content/10.64898/2026.01.05.697784v1
www.biorxiv.org

Explaining how mutations affect AlphaFold predictions

2
1. george_sandler 09 Jan 2026
  
  in Arcadia Science
  
  AF overpredicted the dimer conformation substantially
  
  It might be valuable to check which conformation (if not both) was included in the original model training datasets.
2. george_sandler 09 Jan 2026
  
  in Arcadia Science
  
  XCL1 attention heads displayed an interaction network unique to the dimer fold (Figure 2B). Using an interpretation strategy originally suggested by the AF team (C), this network is characterized by vertical lines corresponding to interacting amino acids (Figure S5A,B).
  
  It's interesting to see how the key residues in these attention maps interact globally with the total sequence. This feels somewhat distinct from the results of Zhang et al. on the categorical Jacobian, which picks up strong pairwise patterns between amino acids (predicting the contact map of a folded sequence). I wonder if this pattern is a unique feature of these fold-switching proteins or a general phenomenon in AlphaFold.

Annotators

george_sandler

URL

biorxiv.org/content/10.64898/2025.12.30.697132v1
Nov 2025
arxiv.org

Biology-informed neural networks learn nonlinear representations from omics data to improve genomic prediction and interpretability

6
1. george_sandler 14 Nov 2025
  
  in Public
  
  BINNs improve prediction accuracy and interpretability through utilization of gene expression in genotype to phenotype modeling.
  
  Very cool approach!
2. george_sandler 14 Nov 2025
  
  in Public
  
  B2P-RR
  
  Might be another typo in the figure here "E2P"
3. george_sandler 14 Nov 2025
  
  in Public
  
  (G2P-RR) and Ridge Regression on bio-informed (eQTL-selected) markers (G2P-RR-B
  
  This is probably the second most informative model comparison in the manuscript for validating the BINN framework, after the permuted gene/eQTL BINN models. It would be helpful to see a statistical test of predictive power between the random vs bio-informed ridge regression models.
4. george_sandler 14 Nov 2025
  
  in Public
  
  Expression data overall offers improved cross-population generalization compared to marker-based models as expression reflects functional output of many regulatory layers that can “normalize” some of the divergence in raw markers
  
  I think it might be useful to add an asterisk to this statement. In principle, by tracking the cascade of G->B->P one might construct a more powerful model through biologically informed sparsity. However, it is crucial to keep in mind that while expression variation is partly driven by genotype, in a field setting a significant fraction of expression might realistically be attributable to environmental effects. (This tracks with the performance of expression-only models in Sup Fig 3/4.) E.g. a stressed plant likely has a unique transcriptomic profile and an easier-to-predict silking phenotype. However, these relationships won't translate to the G->B->P model since transcript abundance itself is not used during model inference. This effect is what likely explains the difference in performance between the expression RR models in Sup Fig 3/4 and the G2B2P model.
5. george_sandler 14 Nov 2025
  
  in Public
  
  An important limitation to this approach is that if B2P underperforms G2P
  
  The results from Fig 2 in the main text do not seem to suggest that the B2P model is substantially more powerful than the G2P model, which might suggest that this could be an issue in the maize dataset. On the other hand, Figs 3/4 in the supplement do show the (expected) pattern that expression is the most powerful predictor of phenotype. I have struggled to reconcile the results from these two figures while reading the manuscript.
6. george_sandler 14 Nov 2025
  
  in Public
  
  Silking NE,
  
  This doesn't match the "Anthesis MI" label in the figure, might be a typo.

Annotators

george_sandler

URL

arxiv.org/pdf/2510.14970
Oct 2025
www.biorxiv.org

Coalescence and Translation: A Language Model for Population Genetics

3
1. george_sandler 10 Oct 2025
  
  in Arcadia Science
  
  robustness
  
  One more dimension of robustness that I think could be useful to explore is sources of error in the data (e.g. genotyping/phasing errors). Adding a small amount of noise to genotypes during training could make cxt quite robust to real dataset errors, giving it a further performance edge over alternatives like Singer.
2. george_sandler 10 Oct 2025
  
  in Arcadia Science
  
  the model autoregressively learns the conditional distribution:
  
  In principle, do you reckon it would be possible to use a random masking approach (as in models like ESM2) for this problem? Currently one (entire) side of a region informs the model's prediction on the focal window, while in principle the most informative regions are those to the immediate left and right of the focal window. Random masking as a strategy could allow the model to leverage this information bidirectionally, but could be technically more challenging.
3. george_sandler 10 Oct 2025
  
  in Arcadia Science
  
  Our work moves towards a foundation model for population genetics, bridging deep learning and coalescent theory to enable flexible, scalable inference of genealogical history from genomic data.
  
  We greatly enjoyed reading this preprint for our internal journal club. This seems like a very principled and useful application of the transformer architecture in biology.

Annotators

george_sandler

URL

biorxiv.org/content/10.1101/2025.06.24.661337v1
www.biorxiv.org

From Mechanistic Interpretability to Mechanistic Biology: Training, Evaluating, and Interpreting Sparse Autoencoders on Protein Language Models

1
1. george_sandler 06 Oct 2025
  
  in Arcadia Science
  
  Intriguingly, a large subset of latents appear to be protein family-specific
  
  It would be very interesting to see how this effect scales with model size. I'd be curious to see whether the larger ESM models in particular cause more general latents to break down into more family-specific groupings. This would mesh with some of the evidence of family-specific overfitting that keeps popping up in the larger models.

Annotators

george_sandler

URL

biorxiv.org/content/10.1101/2025.02.06.636901v2
Sep 2025
www.biorxiv.org

Determinants of mutation load in birds

4
1. george_sandler 05 Sep 2025
  
  in Arcadia Science
  
  ρ / π = 4Ner / 4Neµ = r / µ.
  
  This way of backing out recombination rate from Ner makes some strict assumptions, particularly that pi reflects mutation-drift equilibrium and that mu is invariant across the genome. This is unlikely to be true in most datasets. It would be very helpful if more region-specific comparisons could be made to pedigree-based estimates to evaluate how well this approximation works in this dataset.
2. george_sandler 05 Sep 2025
  
  in Arcadia Science
  
  We fitted a mixed effect model including all three parameters and a choice of meaningful interactions as fixed effects using the lmer function of the R package lme4
  
  Again I think the influence of phylogenetic signal is important to consider here. A phylogenetically corrected regression (e.g. phylogenetic generalized least squares) would take into account the non-independence of species observations and provide a more accurate view of the data.
3. george_sandler 05 Sep 2025
  
  in Arcadia Science
  
  Chromosome-level recombination rates were high across all comparisons (median R2=0.668, Supplementary Figure 3)
  
  It would be nice to get an idea of how stable this porting over of the estimated recombination map is across phylogenetic distance. There is considerable variation in the correlation among species pairs which would be good to understand. Also would be valuable to show the reader how the phylogenetic distance of these comparisons relates to the distances used for comparisons of the main analysis.
4. george_sandler 05 Sep 2025
  
  in Arcadia Science
  
  Theoretical considerations on mutation load.
  
  The message of this figure gets a little muddled. One would expect that as Ne increases, the proportion of mutations experiencing a larger (more negative) Nes would grow (as in the case of increasing r), but the opposite is shown. Unless the figure implies that fewer segregating mutations will have large Nes? But then this conflicts with the sequence for recombination.
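The ρ / π identity discussed in the first annotation on this preprint can be sketched numerically. This is a minimal illustration, assuming per-window estimates of ρ and π and a single genome-wide µ; the function name and all values are hypothetical and are not taken from the preprint.

```python
def recombination_rate_from_rho(rho, pi, mu):
    """Back out r from rho = 4*Ne*r and pi = 4*Ne*mu, i.e. r = (rho / pi) * mu.

    Assumes pi reflects mutation-drift equilibrium and that mu is the same
    in every window (the two assumptions questioned in the annotation).
    """
    if pi <= 0:
        raise ValueError("pi must be positive to form the ratio rho / pi")
    return (rho / pi) * mu


# Hypothetical per-window values (invented for illustration).
mu = 4.6e-9
windows = [
    {"rho": 0.002, "pi": 0.004},
    {"rho": 0.010, "pi": 0.005},
]
r_estimates = [recombination_rate_from_rho(w["rho"], w["pi"], mu) for w in windows]
```

Because r is obtained by multiplying the ratio by an assumed µ, any window whose true mutation rate differs from that assumed value would have its r estimate off by the same factor, which is why region-specific checks against pedigree-based estimates would be informative.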

Annotators

george_sandler

URL

biorxiv.org/content/10.1101/2025.08.18.670172v1
Jul 2025
www.biorxiv.org

Borzoi-informed fine mapping improves causal variant prioritization in complex trait GWAS

1
1. george_sandler 25 Jul 2025
  
  in Arcadia Science
  
  To score a variant, we extract the 524 kb sequence centered on the reference allele and compute model predictions yREF. We create an alternative sequence by replacing the reference allele with the alternative allele and recompute model predictions yALT.
  
  I understand that for computational efficiency/interpretability it is probably best to restrict the Borzoi comparisons to a difference in one focal variant. But given the size of these windows there will almost certainly be many variants that differ across sequences in the study population. I'm curious if you have experimented with using actual UKBB haplotypes as input for a focal position and tested if this introduces meaningful variance in the predictions for said focal variant. Could be valuable to assess how much artificially restricting sequence space differences to one position affects model predictions.
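The REF/ALT scoring described in the highlighted passage, and the haplotype-background check suggested in the comment, can be sketched as follows. This is only an illustration under stated assumptions: `toy_predict` is a hypothetical stand-in for a sequence model (it is not Borzoi or any real API), the sequences are 8 bp rather than 524 kb, and all values are invented.

```python
def score_variant(predict, sequence, position, ref, alt):
    """Return predict(ALT sequence) - predict(REF sequence) for one substitution.

    `predict` is any callable mapping a DNA string to a number; here it
    stands in for a sequence-to-activity model.
    """
    if sequence[position] != ref:
        raise ValueError("reference allele does not match the sequence")
    alt_sequence = sequence[:position] + alt + sequence[position + 1:]
    return predict(alt_sequence) - predict(sequence)


def toy_predict(seq):
    # Hypothetical model: squared GC fraction. It is deliberately nonlinear,
    # so the effect of one variant depends on the rest of the sequence.
    gc = sum(base in "GC" for base in seq) / len(seq)
    return gc ** 2


reference_background = "ACGTACGT"
haplotype_background = "GCGTACGT"  # differs from the reference at position 0

# Same focal variant (position 4, A>G) scored on two backgrounds.
delta_on_reference = score_variant(toy_predict, reference_background, 4, "A", "G")
delta_on_haplotype = score_variant(toy_predict, haplotype_background, 4, "A", "G")
```

In this toy setting the two deltas differ, which is the kind of background dependence the comment asks about; whether a real model shows a meaningful difference is an empirical question the sketch cannot answer.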

Annotators

george_sandler

URL

biorxiv.org/content/10.1101/2025.07.09.663936v1
May 2025
www.biorxiv.org

From Likelihood to Fitness: Improving Variant Effect Prediction in Protein and Genome Language Models

2
1. george_sandler 30 May 2025
  
  in Arcadia Science
  
  We found a simple minimum percentage identity threshold of 30% performed best
  
  The figure reports an average but my gut feeling is that this threshold should maybe be protein/protein family specific. I imagine the overall shape of the phylogeny/distribution of branch lengths around a focal protein will influence how much predictive gain LFB provides. For example, it might make sense to set this threshold higher for a protein with lots of intermediate divergence homologs, vs one that has few. An explicit analysis of what features of a protein family's phylogeny favour differing thresholds might in itself be a very useful analysis for guiding the application of LFB/LFB-like methods to PLM improvement.
2. george_sandler 30 May 2025
  
  in Arcadia Science
  
  The LFB estimators proposed in this work are intentionally simple and serve as a starting point for more sophisticated inference strategies
  
  One alternative/complementary 'modify the model' strategy that might be useful to compare to this method is protein family specific PLM fine-tuning. One could test how fine-tuning on a much narrower region of protein space affects a PLM's ability to soak up phylogenetic signal by testing if a fine-tuned model is similarly improved with LFB in zero-shot fitness prediction tasks.

Annotators

george_sandler

URL

biorxiv.org/content/10.1101/2025.05.20.655154v1
www.biorxiv.org

A Cross-Species Generative Cell Atlas Across 1.5 Billion Years of Evolution: The TranscriptFormer Single-cell Model

2
1. george_sandler 09 May 2025
  
  in Arcadia Science
  
  Despite this technical variation, cell types cluster consistently across species (Fig. 2E), highlighting the biological relevance of the learned embeddings. TranscriptFormer learns to group cells in a biologically relevant fashion by species and cell types (Fig. 2E), without the model being trained or run with species or cell type labels.
  
  With an emphasis on model generalizability, the most interesting signal one could observe is that embeddings cluster not by species (which is driven by strong phylogenetic signal) but rather by other conserved biologically meaningful differences like cell type. An explicit quantification of how much clustering in UMAP (or some other dimensionality reduction method) is explained by species identity vs cell type would be convincing of model generalizability (and the benefit of having multiple species in the training data). At a glance the plots right now suggest the dominant driver in clustering does seem to be species identity but it is hard to tell.
2. george_sandler 09 May 2025
  
  in Arcadia Science
  
  Despite this ceiling effect, the multi-species variants (TF-Metazoa and TF-Exemplar) performed marginally better than the human-only model (TF-Sapiens) despite the same number of active parameters during inference and identical pretraining protocols,
  
  Is there some quantification of this? In the challenging cell types for example the performance seems roughly equivalent between the Metazoa and Exemplar versions of the model. Overall it is hard to see evidence of a benefit of adding diverged species to the training data.
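One way the species-versus-cell-type question raised in the first annotation on this preprint could be quantified is to compare a clustering score for the same embeddings under the two labellings. The sketch below uses the mean silhouette width, which is a swapped-in choice and not something the preprint reports; the coordinates and labels are invented, and a real analysis would use the model's embeddings rather than four toy points.

```python
import math


def mean_silhouette(points, labels):
    """Mean silhouette width of `points` under a given labelling (Euclidean)."""
    scores = []
    for i, p in enumerate(points):
        by_label = {}
        for j, q in enumerate(points):
            if i != j:
                by_label.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_label.pop(labels[i], None)
        if not own or not by_label:
            scores.append(0.0)  # singleton cluster, or only one label present
            continue
        a = sum(own) / len(own)  # mean distance to own cluster
        b = min(sum(d) / len(d) for d in by_label.values())  # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy 2-D embeddings (invented): species separates cells along x,
# cell type only weakly along y.
embeddings = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
species = ["human", "human", "mouse", "mouse"]
cell_type = ["T cell", "B cell", "T cell", "B cell"]

species_score = mean_silhouette(embeddings, species)
cell_type_score = mean_silhouette(embeddings, cell_type)
```

In this toy layout the species labelling scores far higher than the cell-type labelling, which is the pattern one would hope not to see if the embeddings generalize across species.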

Annotators

george_sandler

URL

biorxiv.org/content/10.1101/2025.04.25.650731v1
Apr 2025
www.biorxiv.org

Inferring genotype-phenotype maps using attention models

4
1. george_sandler 18 Apr 2025
  
  in Arcadia Science
  
  As expected, the performance of the attention-based model, as characterized by R2 on the test dataset, is much better than that of the linear model (see Fig. 3)
  
  It would have been interesting to see how a simpler approach, say a vanilla MLP, would stack up here, to really sell the advantage of attention over other deep learning approaches.
2. george_sandler 18 Apr 2025
  
  in Arcadia Science
  
  Predicting phenotype from genotype is a central challenge in genetics. Traditional approaches in quantitative genetics typically analyze this problem using methods based on linear regression.
  
  I greatly enjoyed reading this paper. The rigorous and rational approach to testing model performance on simulated data, reasonable model architecture, and smart dataset choice are a much needed advance beyond haphazardly applying deep learning networks to G-P datasets with minimal performance gain.
3. george_sandler 18 Apr 2025
  
  in Arcadia Science
  
  With this in mind, we subsample the loci (effectively combining highly correlated loci) to create a representative set of L = 1,164
  
  Did you experiment with how LD-based pruning affects model performance? For linear genomic prediction models the relationship between marker number and predictive performance is well characterized: as long as you capture LD structure well, marker number is not very critical. However, this is not well characterized for deep learning models in this context. Epistatic interactions in particular will depend on products of LD between markers and causal QTLs, which could cause performance degradation if causal QTLs are not very well tagged.
4. george_sandler 18 Apr 2025
  
  in Arcadia Science
  
  higher-order epistatic interactions
  
  I was curious why you chose to simulate fourth order epistatic interactions. Statistically one expects higher order epistatic interactions to contribute progressively less to genetic variance, so most studies tend to focus on pairwise epistasis.
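The LD-based pruning asked about in the third annotation on this preprint can be sketched as a greedy filter on pairwise r². This is a generic textbook-style procedure and not the subsampling the authors describe; the function names, the 0.8 threshold, and the genotype matrix are all hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 dosages)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # monomorphic locus: correlation undefined, treat as unlinked
    return cov * cov / (vx * vy)


def ld_prune(genotypes, threshold):
    """Greedy pruning: keep a locus only if r^2 with every kept locus <= threshold."""
    kept = []
    for idx, locus in enumerate(genotypes):
        if all(r_squared(locus, genotypes[k]) <= threshold for k in kept):
            kept.append(idx)
    return kept


# Rows are loci, columns are individuals (toy data, invented).
toy_genotypes = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],  # identical to locus 0 (r^2 = 1)
    [2, 0, 1, 1, 2, 0],  # weakly correlated with locus 0 (r^2 = 0.25)
    [2, 1, 0, 2, 1, 0],  # perfectly anti-correlated with locus 0 (r^2 = 1)
]
kept_loci = ld_prune(toy_genotypes, threshold=0.8)
```

Sweeping the threshold and retraining at each setting would be one way to probe how sensitive a nonlinear model is to how well causal loci are tagged.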

Annotators

george_sandler

URL

biorxiv.org/content/10.1101/2025.04.11.648465v1
Mar 2025
www.biorxiv.org

Genome modeling and design across all domains of life with Evo 2

3
1. george_sandler 07 Mar 2025
  
  in Arcadia Science
  
  as preliminary analysis indicated that most features of interest were represented at this point
  
  It doesn't seem that Fig S5 shows how layer 26 was selected. It would be interesting to at least get a short description in the methods of how this layer was chosen. Other work on mechanistic interpretability in protein language models has shown that different types of features can be learned in different layers of the model.
2. george_sandler 07 Mar 2025
  
  in Arcadia Science
  
  Together, these results highlight the competitive performance of Evo 2 in predicting the pathogenic effects of human coding SNVs
  
  As an evolutionary geneticist, to me the most interesting benchmark here is the PhyloP scores. When I see models like EVO2, my concern is always that they are able to effectively memorise phylogenetic conservation. This is totally valid from a biological standpoint; however, this can be done with a far simpler, phylogenetically explicit method like PhyloP, GERP etc. What is far more exciting is the possibility that a flexible, large model like EVO2 could pick up on non-linear (e.g. epistatic) patterns, which is something PhyloP-type methods are blind to. That PhyloP is very competitive in all these tasks is, I think, quite telling: for the most part the power of all these models comes from identifying conservation rather than more general 'biological rules'. However, that PhyloP can be improved upon in some instances is very exciting nonetheless; in my opinion this is the golden benchmark to be trying to beat.
3. george_sandler 07 Mar 2025
  
  in Arcadia Science
  
  These values were then used as a predictive variable in a logistic regression model of gene essentiality, and directly compared to simple genetic metrics such as GC content and transcript length. Gene age values from the original lncRNA essentiality study (Sarropoulos et al., 2019) were used where available as an additional control.
  
  Aside from NT, these alternative metrics of lncRNA essentiality seem overly simplistic compared to a model as complex as EVO2. Are there no other alternative models for lncRNA essentiality? Maybe a tweak of sequence conservation methods could work here too.

Annotators

george_sandler

URL

biorxiv.org/content/10.1101/2025.02.18.638918v1
Feb 2025
www.biorxiv.org

Unraveling genetic load dynamics during biological invasion: insights from two invasive insect species

4
1. george_sandler 08 Feb 2025
  
  in Arcadia Science
  
  Adult HA were sampled at four sites: two in the native area (Russia [Siberia] and China)
  
  The results certainly seem to suggest these native populations might be bottlenecked too. Is there any indication of how central these sampling locations are to the species' native range? Is it possible that the range edge was sampled?
2. george_sandler 08 Feb 2025
  
  in Arcadia Science
  
  In all populations studied and for each species, derived alleles were mostly rare (with frequencies below 0.1)
  
  Site-frequency spectrum plots per population+mutation would quantitatively demonstrate these patterns without the need to arbitrarily bin allele frequency.
3. george_sandler 08 Feb 2025
  
  in Arcadia Science
  
  and a negative correlation between t
  
  This correlation is based on two autocorrelated measures (as theta pi synonymous appears on both the X and Y axes), so it should be interpreted with caution.
4. george_sandler 08 Feb 2025
  
  in Arcadia Science
  
  crop pest (DVV)
  
  I wonder how much the fact that DVV is a crop pest might influence the results for this species. It would be easy for me to imagine that most DVV populations (native and invasive) have experienced agriculture-related bottlenecks and/or population expansions. Pests like corn rootworm have repeatedly adapted to the use of pesticides/GM crops, a process which often involved a bottleneck (followed by expansion) and may cause similar effects on the evolution of load in native/invasive populations. Data on the population ecology or local agricultural practices (and history of pest load) may be helpful in figuring out whether the selective landscape of these populations could have such effects.
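The per-population, per-mutation-class site-frequency spectra suggested in the second annotation on this preprint can be tabulated without binning. This is a minimal sketch assuming derived-allele counts per site are already available; the function name, the class labels, and the counts are hypothetical.

```python
from collections import Counter


def site_frequency_spectrum(derived_counts, n_chromosomes):
    """Unfolded SFS: sfs[k] = number of sites with k derived copies (0 < k < n)."""
    tally = Counter(derived_counts)
    return {k: tally.get(k, 0) for k in range(1, n_chromosomes)}


# Hypothetical derived-allele counts per site for one population with
# 10 sampled chromosomes, split by mutation class (invented values).
counts_by_class = {
    "synonymous": [1, 1, 2, 3, 5, 9],
    "deleterious": [1, 1, 1, 2],
}
sfs_by_class = {
    label: site_frequency_spectrum(counts, n_chromosomes=10)
    for label, counts in counts_by_class.items()
}
```

Plotting each spectrum as a bar chart per population would show the whole distribution, so no frequency cutoff such as 0.1 needs to be chosen in advance.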

Annotators

george_sandler

URL

biorxiv.org/content/10.1101/2024.09.02.610743v2
Jan 2025
www.biorxiv.org

Ectopic gene conversion causing quantitative trait variation

4
1. george_sandler 03 Jan 2025
  
  in Arcadia Science
  
  but that selection acts against EGC near codons specifying fixed derived amino acids, i.e., mutations that differentiate Arabidopsis thaliana gene copies from those of Arabidopsis lyrata.
  
  Very cool result!
2. george_sandler 03 Jan 2025
  
  in Arcadia Science
  
  (Fig. 2)
  
  It would be helpful to have colour labelled legends in the figures.
3. george_sandler 03 Jan 2025
  
  in Arcadia Science
  
  deed, visual inspection identified numerous linked specific variants at polymorphic sites, sometimes spanning hundreds of positions, indicative of gene conversion tracts.
  
  As you point out, it seems some of the strongest evidence for EGC is that polymorphism is not just shared, but that shared polymorphisms are linked. One way you could statistically quantify this is by running a tool that scans for evidence of identity by descent (IBD), or by using a tree sequence approach, treating each gene from each accession as an individual genome (as in the multiple sequence alignment you construct). This isn't strictly what IBD tools are for, but it should provide a good proxy given that A. thaliana has low polymorphism and high linkage disequilibrium. Relatedly, it would help if intervals for putative EGC could be filled in, not just the limits marked as in Fig S6. This would make it easier to see what the length of EGC tracts is.
4. george_sandler 03 Jan 2025
  
  in Arcadia Science
  
  This can explain why there is more segregating fitness variation within populations than predicted under mutation selection balance (1).
  
  This seems to me a fairly strong statement that doesn't quite line up with the goals/results of this study. Technically, what this study shows is how EGC can contribute to standing genetic variation in a population. The specific paradox of standing genetic variation in phenotypes, however, is about why there is more variation than expected under mutation selection balance (MSB). MSB in practice is agnostic to the type of mutations and to when/where they arose; only their fitness effects matter. As detailed in the first reference, it is clear that SNPs alone are inconsistent with MSB, which is not surprising since they are only a fraction of the genetic variants found in populations. However, again as detailed in the first reference, quantitative genetics approaches that use mutation accumulation experiments are agnostic to mutation type, and provide a framework for testing if MSB is sufficient to explain standing levels of genetic variation. Paradoxically, they often find that MSB is not enough to explain the high variation in phenotypes (due to genetic effects) we observe (see https://doi.org/10.1098/rspb.2018.1864 for an example), implying that forces such as balancing selection must also be working to maintain excess genetic variation. What this study demonstrates is that non-SNPs can contribute to variation (as other studies have demonstrated for transposons, inversions, indels etc.). This alone does not demonstrate the sufficiency of MSB to explain observed genetic variation in phenotypes, though.

Annotators

george_sandler

URL

biorxiv.org/content/10.1101/2024.12.26.630369v1
Dec 2024
www.biorxiv.org

Fluctuations and the limit of predictability in protein evolution

1
1. george_sandler 13 Dec 2024
  
  in Arcadia Science
  
  Conversely, at larger time scales, the dynamical noise contribution dominates and the trajectory-to-trajectory fluctuations are large enough to hide the signal coming from the ancestral sequence, precluding the possibility to reconstruct i
  
  It might be interesting to see what the scale of the Hamming distance distribution is in the underlying MSAs for the focal protein families vs. the scales of Hamming distance at which such effects are observed in the simulations. One potential concern could be that couplings/epistasis are estimated from the MSA on one scale of sequence divergence, but the simulations are pushed to much larger scales, in which case the epistatic interactions inferred from the MSA might no longer be accurate.
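The comparison suggested in the comment above needs the distribution of pairwise Hamming distances within the alignment. A minimal sketch, assuming the MSA is available as a list of equal-length aligned strings; the function names and the toy alignment are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_hamming(msa):
    """All pairwise Hamming distances among the rows of an alignment, sorted."""
    return sorted(hamming(a, b) for a, b in combinations(msa, 2))


# Toy alignment (invented); a real MSA would be read from a FASTA file.
toy_msa = ["MKTAY", "MKTAF", "MRTAF", "LRSAF"]
msa_distances = pairwise_hamming(toy_msa)
max_msa_distance = msa_distances[-1]
```

Simulated trajectories whose distance from the ancestral sequence exceeds `max_msa_distance` would lie outside the divergence range the couplings were inferred from, which is the regime the comment flags as a potential concern. For large alignments the all-pairs computation grows quadratically, so subsampling sequences would likely be needed.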

Annotators

george_sandler

URL

biorxiv.org/content/10.1101/2024.12.04.626874v1
www.biorxiv.org

Genotypic and phenotypic consequences of domestication in dogs

6
1. george_sandler 06 Dec 2024
  
  in Arcadia Science
  
  For over a century, scholars and dog-enthusiasts alike have sought to unravel the complex evolutionary history of man’s best friend
  
  I enjoyed reading your paper! This canid dataset offers such a great opportunity for exploring genotype-phenotype mappings, it's great to see how such associations can be teased apart in studies such as this.
2. george_sandler 06 Dec 2024
  
  in Arcadia Science
  
  For small individuals, 34 SNP
  
  Relatedly, it might be interesting to compare effect size estimates across these different data subsets. A large swing in additive effect sizes for markers across populations has implications regarding epistatic interactions the focal locus may be part of, see: https://www.nature.com/articles/nrg3627
3. george_sandler 06 Dec 2024
  
  in Arcadia Science
  
  For breed average height, we found 27 SNPs
  
  It would be interesting to see a simple summary in the text of how much sharing there was in significant markers between the breed average/small/large subsets.
4. george_sandler 06 Dec 2024
  
  in Arcadia Science
  
  detect non-additivity
  
  It's hard to tell from the description in this manuscript how non-additivity is captured by this analysis. A quick one-liner on this might be helpful for readers.
5. george_sandler 06 Dec 2024
  
  in Arcadia Science
  
  ROH sharing matrix as a kinship matrix.
  
  Do you have a sense for how much of a difference there is between using an ROH sharing vs a general whole-genome SNP/marker based kinship matrix? My initial reaction was that a more independent whole genome derived matrix that captures structure across the whole genome might be more desirable to capture fine scale population structure/ancestry differences etc. But perhaps using ROH runs themselves since they are the target of the analyses is better.
6. george_sandler 06 Dec 2024
  
  in Arcadia Science
  
  The second genomic narrative
  
  A small content suggestion. The introduction of this paper spends a considerable amount of time discussing the potential history of dog domestication. However, this background seems only tangentially related to the content of this paper which rather aims to take advantage of population structure in dogs to explore associations between genotypes and phenotypes in the context of domestication. More background on the genomics of domestication might be more relevant.

Annotators

george_sandler

URL

biorxiv.org/content/10.1101/2024.05.01.592072v2
Oct 2024
www.biorxiv.org

Population genomics reveals strong impacts of genetic drift without purging and guides conservation of bull and giant kelp

7
1. george_sandler 14 Oct 2024
  
  in Arcadia Science
  
  We expected that purging would remove putatively deleterious alleles from small populations but have no effect on frequencies of alleles in less deleterious categorie
  
  I'm a little confused by this statement. This would make sense in certain cases (e.g., highly recessive mutations). But SNPeff and GERP scores give no information on recessiveness. If these are just any deleterious variants, then the expectation should be the opposite: high-Ne populations should purge load more easily.
2. george_sandler 14 Oct 2024
  
  in Arcadia Science
  
  such as in recently bottlenecked populations
  
  Does this seem plausible in the case of kelp? If bottlenecking has been recent, the effect on Ne will be instant, but it will take time for the signal of differences in purging to build up. Have strong kelp declines been more recent (>50 years)?
3. george_sandler 14 Oct 2024
  
  in Arcadia Science
  
  A strong isolation-by-distance pattern of increasing genetic distance (dXY) with geographic distance (Figure S4) and the presence of populations admixed between clusters (Figure 1A-D) suggest that adjacent clusters are connected by gene flow.
  
  It seems that in bull kelp lots of gene flow occurs across the southern tip of Vancouver Island. On the other hand, the northern tip seems to represent a barrier (judging by the clusters in Fig. 1). Are there any hydrological/oceanographic reasons to expect this, maybe?
4. george_sandler 14 Oct 2024
  
  in Arcadia Science
  
  Bull kelp and giant kelp are the principal canopy-forming species in kelp forests of the northeast Pacific, supporting highly productive and biodiverse ecosystems13
  
  I really liked reading this paper. It's great to see such detailed sampling and interrogation of the pop-gen of these keystone species.
5. george_sandler 14 Oct 2024
  
  in Arcadia Science
  
  We estimated effective population size (Ne)
  
  Does this mean you report selfing Ne (rather than typical coalescent Ne) in your analyses? This would be useful to highlight in the main text.
6. george_sandler 14 Oct 2024
  
  in Arcadia Science
  
  1) All else being equal, indivi
  
  Is there any correlation in Ne/inbreeding/low diversity between the two species of kelp where they co-occur? This could be a useful indicator for conservation efforts.
7. george_sandler 14 Oct 2024
  
  in Arcadia Science
  
  We observed no evidence of purging in either species. We predicted that smaller populations would show a reduction in DMA frequency at evolutionarily conserved sites (GERP analysis) due to increased homozygosity and exposure to selection, yet DMA frequency was uncorrelated with population size (Figure 3A-B)
  
  While there are no differences between populations, the regression lines for the GERP/SNPeff analyses clearly show that less constrained sites harbor more diversity than more constrained sites, implying that purifying selection is acting to purge deleterious variants in both species. It seems that purging is present in this dataset, and that detecting it is more of a time-scale issue. This makes sense particularly for recessive variants, since they will be hidden from selection for a long time.

Annotators

george_sandler

URL

biorxiv.org/content/10.1101/2024.10.10.617648v1.full.pdf
Sep 2024
www.biorxiv.org www.biorxiv.org

Experimental evolution of evolvability

4
1. george_sandler 06 Sep 2024
  
  in Arcadia Science
  
  Capacity to generate adaptive variation can evolve by natural selection. However, the idea that mutation becomes biased toward specific adaptive outcomes is controversial. Here, using experimental bacterial p
  
  I greatly enjoyed reading this preprint. The dissection of the origin of the de novo contingency locus was very cool.
2. george_sandler 06 Sep 2024
  
  in Arcadia Science
  
  Central to our findings was a selective process where lineages better able to generate, by mutation, adaptive phenotypic variants, replaced those that were less proficient (Figure 1). In one metapopulation, a single lineage emerged with capacity to transition rapidly between phenotypic states via expansion and contraction of a short nucleotide repeat in a manner precisely analogous to that of contingency loci in pathogenic bacteria
  
  Do you have any thoughts on why global mutator alleles underpinned evolvability in two populations, and a local mechanism in the other? It seems that increased mutation rates are a common by-product of experimental evolution (e.g. instances in the LTEE). There is a nice paper in yeast demonstrating that mutator alleles tend to be favoured where local population size is high (which allows selection to act more efficiently on the beneficial variants they produce). Might be relevant here: Sign of selection on mutation rate modifiers depends on population size, Raynes et al., 2018
3. george_sandler 06 Sep 2024
  
  in Arcadia Science
  
  As the number of repeats increased, the rate at which transitions occurred visibly increased (Figure 3E).
  
  What a cool result. This reminds me of the observation in stickleback that the independent evolution of loss of pelvic hindfins tends to target the same locus because of the specific molecular characteristics of that stretch of sequence. This may be of relevance to this study: DNA fragility in the parallel evolution of pelvic reduction in stickleback fish, Xie et al. 2019.
4. george_sandler 06 Sep 2024
  
  in Arcadia Science
  
  The selective regime employed was contrived, with selection on lineages being strictly enforced. Such stringent conditions are likely limited in nature. However, microbial pathogens faced with the challenge of persistence in face of the host immune response, will experience strong lineage-level selection, with repeated transitions through selective bottlenecks [38]. As we have shown here, precisely these conditions can promote the evolution of evolvability.
  
  It seems to me that the key component of the experimental selection regime is that individual-level and lineage-level selection were allowed to act in separate consecutive timesteps, shielding lineage-level selection from being swamped out by 'short-sighted' individual-level selection. I wonder how likely this scenario is to play out in circumstances such as pathogen evolution, where I would imagine that both levels of selection should still be acting concurrently. I agree, however, that the presence of contingency loci implies that some form of ecological conditions likely exists that allows this to happen.

Annotators

george_sandler

URL

biorxiv.org/content/10.1101/2024.05.01.592015v4
Aug 2024
www.biorxiv.org www.biorxiv.org

Direct inference of the distribution of fitness effects of spontaneous mutations from recombinant inbred C. elegans mutation accumulation lines

4
1. george_sandler 02 Aug 2024
  
  in Arcadia Science
  
  The magnitude of the raw difference is typically much larger than that of the posterior effects. The difference is likely caused by LD, in that the raw difference of a single mutation contains contributions from other linked mutations, which may inflate the estimates.
  
  Could you constrain this analysis to mutations that are in LE with other de-novo mutations to test this hypothesis?
2. george_sandler 02 Aug 2024
  
  in Arcadia Science
  
  Here we employ a classical line-cross strategy with MA lines, to break down the linkage disequilibrium among the accumulated mutations. We then combine whole-genome sequencing with high-throughput competitive fitness assays to estimate the DFE of a set of 169 spontaneous mutations.
  
  I greatly enjoyed reading this paper. True experimental estimates of the DFE in MA studies are super valuable and provide a very interesting comparison for pop-gen based DFE methods as pointed out by the authors.
3. george_sandler 02 Aug 2024
  
  in Arcadia Science
  
  Averaged over all RI(AI)Ls, accounting for variation among assay blocks and removing two outlying lines, the regression of W on number of mutations is not significantly different from 0 (slope = −0.0051, F1,509=1.83, P>0.17), although the trend suggests that mutations are deleterious, on average.
  
  Is there a chance that false negative mutations (i.e. incorrectly unobserved events in the MA lines) could contribute to this result?
4. george_sandler 02 Aug 2024
  
  in Arcadia Science
  
  The simplest way to infer the mutational effect at a locus is to calculate the mean value of all lines with a mutant allele and all lines with an ancestral allele at that locus; the difference is the raw difference (uRAW) of the mutation at that locus. As a sanity check, we plotted the inferred Bayesian posterior effect against the raw difference; ideally, the correlation should be +1. The correlations were positive, but well below 1 in all three cases (Figure 4). The magnitude of the raw difference is typically much larger than that of the posterior effects. The difference is likely caused by LD, in that the raw difference of a single mutation contains contributions from other linked mutations, which may inflate the estimates.
  
  Two quick thoughts for further sanity checks. 1) Does this regression look any different for SNPs vs indels? 2) Do the individual mutation specific effects conform to expectations one might have based on the functional annotations available for these mutational events?
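The raw difference described in the passage quoted in annotation 4 can be sketched in a few lines. This is only an illustrative sketch, not the authors' code: the function name `raw_difference`, the sign convention (mutant mean minus ancestral mean) and the toy fitness values are all assumptions made for the example.

```python
from statistics import mean


def raw_difference(fitness, has_mutation):
    """Mean fitness of lines carrying the mutant allele minus the mean
    fitness of lines carrying the ancestral allele at one locus.

    The sign convention (mutant minus ancestral) is an assumption.
    """
    mutant = [w for w, m in zip(fitness, has_mutation) if m]
    ancestral = [w for w, m in zip(fitness, has_mutation) if not m]
    if not mutant or not ancestral:
        raise ValueError("need at least one line of each genotype")
    return mean(mutant) - mean(ancestral)


# Toy, made-up data: four lines, two of which carry the mutation.
fitness = [1.00, 0.98, 1.02, 0.95]
has_mutation = [False, True, False, True]
u_raw = raw_difference(fitness, has_mutation)
```

Because each line carries several mutations, this per-locus contrast also absorbs the effects of any mutations that tend to co-occur with the focal one, which is the LD inflation the quoted passage describes.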

Annotators

george_sandler

URL

biorxiv.org/content/10.1101/2024.05.08.593038v2
Jul 2024
www.biorxiv.org www.biorxiv.org

Environment by environment interactions (ExE) differ across genetic backgrounds (ExExG)

4
1. george_sandler 12 Jul 2024
  
  in Arcadia Science
  
  An important caveat is that, although the DE framework makes reasonable fitness predictions for these two drug pairs, it fails in many other environments and for many other genotypes, again highlighting the prevalence of ExExG.
  
  The DE approach seems quite powerful especially since it adds a 'benign E' reference line for fitness comparisons. I would love to see how the prediction from this model lines up with true fitness in figure 2 for all lines tested.
2. george_sandler 12 Jul 2024
  
  in Arcadia Science
  
  In terms of synergy vs antagonism, our results suggest that a small number of mutations can change a drug combination from having a synergistic to an antagonistic effect. For example, figure 2C shows a case where LRLF acts synergistically on a yeast strain harboring a single nucleotide mutation to the HDA1 gene, but acts antagonistically on a different evolved yeast mutant. Similarly, figure 3 shows cases where a drug pair changes from having a synergistic to an antagonistic effect across different mutants.
  
  It seems from figures 2 and 3 that the dominant pattern in the dataset is that of antagonistic interactions (at least with respect to the additive model). This made me wonder two things: 1) Are there any general biological explanations for such a pattern, or considerations for why this might be expected? I'm thinking of the GxG equivalent, where we know for example that diminishing-returns epistasis is a common feature of adaptive populations, and this can be linked to theoretical models of fitness landscapes in the context of Fisher's geometric model etc. 2) Is this the correct biological null model to use? Certainly in the quant-gen world the additive approach would be the go-to starting point, but is this relevant for the context of these fitness estimates? My first gut feeling was that the average null model should be more relevant. Not sure if a pop-gen multiplicative approach is another potential null.
3. george_sandler 12 Jul 2024
  
  in Arcadia Science
  
  Here, we take a large collection of roughly 1,000 antifungal drug-resistant yeast mutants evolved using this method and ask how often fitness in multidrug environments is predicted by fitness in single drug environments (Figure 1D)
  
  I enjoyed reading this paper and the novel ExExG framing of the study! This is a great dataset, and I hope more genomic data can be attached to it in the future, enabling even more mutation-specific questions to be asked.
4. george_sandler 12 Jul 2024
  
  in Arcadia Science
  
  Four different models (horizontal axis) are used to calculate expected fitness for each of roughly 1000 mutants per drug pair
  
  It would be useful to get a short description of these models here (aside from the methods) for clarity.
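The three candidate nulls raised in annotation 2 (additive, average, multiplicative) can be written out as a short sketch. These are generic textbook-style definitions chosen for illustration, not the four models used in the preprint; the function names and the convention that `s_a` and `s_b` are fitness effects measured relative to a no-drug reference are assumptions.

```python
def additive_null(s_a, s_b):
    """Expected two-drug effect if single-drug effects simply add."""
    return s_a + s_b


def average_null(s_a, s_b):
    """Expected two-drug effect as the mean of the single-drug effects."""
    return (s_a + s_b) / 2


def multiplicative_null(s_a, s_b):
    """Expected two-drug effect if relative fitnesses (1 + s) multiply."""
    return (1 + s_a) * (1 + s_b) - 1


# Hypothetical single-drug effects for one mutant (made-up numbers).
s_a, s_b = 0.1, 0.2
expected = {
    "additive": additive_null(s_a, s_b),
    "average": average_null(s_a, s_b),
    "multiplicative": multiplicative_null(s_a, s_b),
}
```

For small effects the additive and multiplicative nulls nearly coincide, while the average null predicts a markedly smaller combined effect, so the apparent prevalence of antagonism depends on which null serves as the baseline.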

Annotators

george_sandler

URL

biorxiv.org/content/10.1101/2024.05.08.593194v2
