  1. May 2026
    1. On 2023-10-10 13:37:06, user Tom Langton wrote:

      There is a distinct lack of transparency because the data are not linked in the supplementary material, and neither is the Stata code. Even if the code were there, it would be difficult to reproduce, as Stata is not very widely used.

      Does the analysis decisively link badger culling to decline in bovine TB in cattle? Culling is confounded with complex changes to gamma and other testing and other measures. There are places in the ms where a claim is made for a link ‘The effect of badger culling…’ but then later: ‘However, this data analysis cannot explicitly distinguish…….’ APHA have stated more than once that the breakdown data alone are insufficient to show an effect of culling.

      However, the DID approach is inappropriate because the 52 cull and pre-cull study areas are surrounded by and in close contact with each other and so are not adequately discrete in space and time, breaking a DID requirement. Notably, in the four-year period 2016/17-2019/20, pre-cull and cull areas are closely juxtaposed. Pre-cull and cull interventions may influence change in adjacent pre-cull and cull areas in an irregular and unpredictable manner, via the constant movement of cattle with different intervention histories.

      Some of the very important bTB testing controls used and the timing of their introduction are not correctly described and some not even mentioned.

      The approach mixes data from 46 High Risk Area (HRA) study areas with 6 from the Edge Area. These have very different epidemiological and disease control history profiles, and the reason for mixing them is not explained. Pre-cull gamma testing was intense in the Edge Area. While the manuscript mentions additional gamma testing from 2017, gamma testing was erratic between areas and over time. There were considerable numbers of gamma reactors in many of the cull areas in cull years 1 and 2, with similar disclosures to years 3 and 4, but pre-cull use in the HRA was generally low.

      In both HRA and Edge areas, OTFW incidence was declining when many of the additional disease control rules were intensified and badger culling was introduced. The government chief scientist's view, that determining with any precision any contribution of badger culling to disease control is not possible, is therefore maintained. It is impossible to distinguish between the effects of badger culling and disease control measures using the DID approach in any meaningful way.

      The comment made on being unable to match cull and control areas is unevidenced, and a peer-reviewed study of this published in 2022 has not been cited. Nor are alternative approaches considered, such as matching a series of individual farms in cull and control areas, which would have been simpler and easier to monitor on a quarterly basis since 2018, as was the expectation of a High Court ruling.

      The suggestion that incidence is the better indicator of true burden goes against APHA’s own epidemiology and the greatly clarified understanding of SICCT test sensitivity and specificity. The specificity change resulting from SICCT severe interpretation can be estimated and applied to OTFS, providing, with OTFW, a safe index of known infection. Gamma testing has shown numbers of unidentified reactors in both identified bTB infected and unidentified herds, and this ‘hidden’ reservoir remains undocumented although its size can be deduced.

      OTFW relates by its nature to older infections with lesions, yet it is newer infections (OTFS) that indicate disease control success (elimination) more accurately over time. There is no reason to believe that OTFS+OTFW (as above) is not the superior approach; OTFW is only half of the full picture. During the RBCT, prevalence was growing considerably and OTFS was rising, so the analogy made in this manuscript is inappropriate and tries unconvincingly to defend the historic reference source.

      By reducing the pre-cull analysis to one data point the significant pre-cull decline is hidden and this also masks true trends.

      Because the methods are inappropriate, the results are flawed and the conclusions are flawed. There is no way to distinguish between different interventions and change in herd breakdowns since 2013 with the approach taken.

      There is also a need for clarification of the analysis that is presented in the Appendix – this does not seem to be the output from the constrained DID analysis, but something else not fully described in the methods which ‘matches’ it – which Stata process was used? It is the first analysis referred to in the main text. As well as showing that the BCP effect accounts for less than 2% of breakdowns, it doesn’t look quite right that the BCP effect is based on error dof – pseudoreplication?

    1. On 2023-08-22 17:11:14, user Stephanie Sibinelli wrote:

      Reviewed by: Julia Takuno Hespanhol and Stephanie Sibinelli de Sousa

      This review resulted from the graduate-level course "How to Read and Evaluate Scientific Papers and Preprints" from the University of São Paulo, which aimed to provide students with the opportunity to review scientific articles, develop critical and constructive discussions on the endless frontiers of knowledge, and understand the peer review process.

      In the study titled “Toxinome - The Bacterial Protein Toxin Database”, Danov et al. developed a publicly accessible website compiling the information on bacterial toxins and antitoxins from all available species from five distinct databases (SecRet6, BactiBase, TADB, BAGEL, UniProt). This system offers user-friendly browsing options, enabling users to explore the compiled data by organism or by Pfam code. Additionally, the website incorporates advanced search tools allowing searches based on amino acid sequences or specific keywords like product name, organism name, Pfam ID, and Pfam name. A particularly intriguing observation of the dataset was the identification of ‘Toxin Islands’, genomic regions that encode multiple toxins and/or antitoxins. These Toxin Islands not only provide information about known toxins but also serve as indicators of potential unidentified toxins within these genomic regions. The paper demonstrates the utility of this concept by providing an example where a hypothetical protein encoded within a Toxin Island exhibits structural similarities to a known toxin. Although the creation of the database and website is quite impressive, the compiled information could have been further explored.

      Major comments

      1. What specific analysis ensures that the domains present in the database truly belong to toxins and are not false positives? A search solely based on the "toxin/toxic" word through InterPro might have introduced many false positives. Furthermore, was there a previous quality check for the proteins imported from each database to ensure they were actually toxins? It is possible that false-positive toxins were included from the original datasets. One example is the structural protein Hcp from the T6SS (annotated in the SecReT6) gene id: 2503566284, which is depicted as a toxin, but the predicted protein does not encode any toxic domains. There are several examples of Hcp domain encoding proteins in the dataset that probably present only structural functions.

      In line 150, "The resulting dataset was then manually curated for quality assurance, and toxin or antitoxin genes erroneously included were removed”. Could you please elaborate on the procedure of this manual analysis? What specific criteria were employed to identify and eliminate false positives? This information is crucial to ensure the reliability of the identified toxins and antitoxins.

      1. A more comprehensive discussion of the existing databases would greatly enhance the work. It would be beneficial to elaborate on the total number of proteins present in each database and highlight any intersection between them. This additional information will provide a clearer picture of the data's scope and contribute to the overall credibility of the analysis.

      2. Although the analysis of toxin presence in bacteria living across varying temperature ranges does yield intriguing insights into the evolution of extremophiles, this aspect appears distant from the central topic of the work and offers a relatively superficial analysis. It would also be interesting to include other information: 1) what is the toxin distribution by habitat type and hosts; despite being a well-studied area, this association would lend support to the dataset's intrinsic conclusion; 2) target specificity and protein domains, what is the frequency of toxins targeting specific cellular components (lipids, proteins, nucleic acids, metabolites) based on protein domains (Pfam); 3) what is the toxin diversity among different phylogenetic groups; figure 6 explores this aspect to some extent, however, the toxin types (domains) are not mentioned.

      3. The concept of Toxin Islands is crucial to the study and could be explored more deeply. While Figure 6 focuses on the topic, only the toxin and antitoxin counts are provided. Details about toxin types within each island could be valuable. An observation is highlighted regarding the Bacilli class, where a higher toxin count than antitoxin count is observed. The hypothesis in the discussion part suggests that Bacilli may have an abundance of toxins targeting eukaryotes, potentially explaining the lower need for antitoxins. This hypothesis could be further investigated using the Toxinome data itself.

      4. The work uses protein structural prediction tools to examine the specific case of unannotated toxin and antitoxin found in the Toxin Island of Thauera phenylacetica B4P (Figure 7). It would be beneficial to the work to highlight other examples.

      Minor comments

      1. Figure 2 appears to be very complex, leading to difficulties in following the numerical sequence. This figure could be separated into Panels A and B.

      2. In Figure 3, consider providing annotations for Archaea species. Additionally, indicate on the phylogenetic tree the specific bacterial clades that have a depletion of toxins and antitoxins.

      3. There is no reference to a figure or table in line 285: "As we expect, there is a high correlation between toxin and anti-toxin content (R = 0.6581, pvalue = 7.59x10^-13)".

      4. There is no reference to a figure/table/supplementary material in line 143: "To increase the number of toxins and immunity proteins into our database we used protein domain information. We added 219 and 94 toxin and antitoxin domains, respectively, to the resulting toxin gene set that we downloaded from the Pfam database.".

      5. The data present in Tables 1-4 could be effectively visualized using graphs, which would enhance the clarity and comprehension of the information.

      6. In the discussion (line 440): "For example, certain hyperthermophilic and halophilic Archaea were described to produce bacteriocins called sulfolobicins and halocins, respectively". It would be interesting to know if the authors could find these toxins in the compiled Toxinome dataset.

    1. On 2023-06-05 11:13:47, user Peter Sabol wrote:

      Nice paper. Just a minor technicality - for the final publication, don't forget to indicate the At code for the genes mentioned in the paper.

    1. On 2023-05-23 03:53:12, user Louise wrote:

      For the SLE dataset, did you use GSE137029 or GSE174188? The code seems to say Perez et al GSE174188, but the manuscript says Mandric et al GSE137029?

    1. On 2023-04-30 15:24:14, user Gul Zerze wrote:

      I sincerely thank both Emil Thomasen and Kresten Lindorff-Larsen for their time, careful reading, and comments on the manuscript. Below, I attach my responses to each point, with each comment reproduced. Since this commenting system cannot display modified visuals, the added or modified visuals can be seen in the published version of the manuscript (doi: 10.1021/acs.jctc.2c01273).

      Comment: The manuscript by Zerze reports on molecular dynamics simulations of the intrinsically disordered low complexity domain (LCD) of FUS using a beta version of the coarse-grained force field Martini 3. The author performed simulations to study the formation of FUS LCD condensates under varying protein-water interaction strengths (in the Martini force field) and at different NaCl concentrations, and concludes that strengthening protein-water interactions by a factor of 1.03 improves the agreement with experimental transfer free energies between the dilute and dense phases. Additionally, the author concludes that the NaCl concentration affects condensate morphology and protein-protein interactions in the condensate, and that the effect of NaCl concentration on protein-protein interactions in the condensate is sensitive to rescaling of the protein-water interactions. The manuscript provides an interesting and novel benchmark of the (beta) Martini 3 model in predicting phase separation of IDPs, and reveals potential short-comings of the model in predicting protein concentrations in (or volumes of) the condensed and dilute phases. This benchmark will be useful for readers who wish to simulate liquid-liquid phase separation of IDPs with Martini 3, and the work will be interesting to a wider audience interested in the biophysics of IDPs and their condensates.

      Below we outline some questions and comments that the author might take into account when revising the manuscript. Our main comment regards a clearer assessment of the convergence of the simulations and correspondingly the lack of error estimates for observables calculated from the simulations. We also suggest a clearer presentation of the experimental data used to validate the simulations. While some of these changes are mostly textual, in other cases we suggest additional simulations. We realize that some of these simulations require substantial resources; if these are beyond what is available, we suggest at least to clarify caveats as per the points below.

      The author’s response: I thank the reviewer for their scrutiny and thoughtful comments, which greatly helped to substantiate the optimization analysis in the revised version of the manuscript.

      Comment: We have the following suggestions for revisions to the manuscript:

      1) Fig. 1 and 2: The finding of non-spherical droplets is interesting and intriguing. To examine whether the formation of these shapes in the simulations with higher salt and λ-values represent stable states or perhaps trapped metastable states of the system, we suggest that:

      1a) The author runs simulations with the parameters that give rise to non-spherical morphologies (e.g. λ=1.025 and 50 mM NaCl) starting from the structure of the spherical droplet (for example formed with λ=1.0 and no salt) and observe whether the non-spherical morphology is recovered or the droplet remains stable. If the droplet remains stable, then the effect of salt concentration on the inter-chain contacts (Fig. 6) could be assessed without potentially confounding factors from different dense phase morphologies.

      The author’s response: Following the reviewer’s suggestion, I have performed an additional set of simulations for all λ values (1, 1.01, 1.02, 1.025, 1.03) at 50 mM salt concentration starting from a preformed spherical droplet. The initial condition with the preformed droplet is obtained from the last saved frame of the λ=1 simulation for 0 mM salt. We ran the simulations for 10 microseconds each. Within the given time frame the droplet remained stable for λ values 1, 1.01, 1.02, and 1.025 without a dilute phase concentration. I have now added these findings to the supporting information (Figure S5).

      I also modified the main text (Page 9 last paragraph and Page 10 first paragraph) as follows:

      “Recent studies from independent groups show that the nonspherical droplet formation might be a kinetic arrest, playing an important role in droplet maturation and aging [51–53]. To test whether the nonspherical morphologies we observed are impacted by the initial conditions, we rerun 50 mM at all λ values starting from a preformed droplet (last saved configuration of 0 mM salt, λ = 1 condition). We simulated each λ for 10 μs and presented the analysis in Figure S5. Within the given simulation time, the initially spherical droplets stayed intact and spherical, except for λ=1.03 (which had one copy of the FUS LC protein exchange back and forth between the dense and dilute phases). The enlarged droplet in the case of λ=1.03 also deviated from its initially spherical shape. These findings show that the nonspherical morphology was not reproducible for λ values less than 1.03 when starting from a preformed spherical droplet. We argue that the strength of effective protein-protein interactions at low λ are largely responsible from the initial spherical droplet staying intact.”

      Since the droplets stayed nearly spherical, I also analyzed the contact formation in these simulations (50 mM added salt, initially starting from a spherical preformed droplet) and presented the findings in Figure S7.

      I also discussed these findings in the main text as follows (Page 19, 20, the last paragraph before Conclusions):

      “Finally, we also examined the contact formation for the case of 50 mM added salt that starts from a preformed droplet (see Identification of condensate formation subsection for the description). As presented in Figure S5, we found that the initially spherical droplet remains largely spherical within the simulation time (never forms rod-like percolated structures) for this case. Therefore, this case helps us assess the effect of salt concentration on the inter-chain contacts without potentially confounding factors from different dense phase morphologies. Figure S7 shows both the contact propensity (A.) and the effect of salt concentration (B.) on the contact propensity. Figure S7A shows that the contact propensity decreases as the λ parameter increases, similar to the findings in Figure 5. Figure S7B shows, however, that the change in contact fraction with respect to 0 mM salt at λ = 1 is weaker (resembling λ = 1.02 at 50 mM salt in Figure 6A) although the salting out effect at high λ (λ = 1.025 and 1.03) are more prominent and stronger compared to those in Figure 6A.”

      Comment: 1b) The author shows time-series or distributions of an observable that reports on the dynamics of the proteins in the non-spherical droplet (e.g. Rg, mean square displacement, residue-residue contacts) and/or of an observable that reports on the dynamics of the droplet shape (e.g. the x-, y-, and z-components of the gyration tensor).

      The author’s response: Following the reviewer’s suggestion, we added an analysis of observables that report on the dynamics of the shape fluctuations and size, and presented them in Figure S4.

      We also modified the main text (Page 9, second half of the second paragraph) to discuss these findings:

      “We also investigated the time dependence of the size and shape of these morphologies by quantifying the radius of gyration (Rg) and the ratio of the smallest and largest eigenvalues of the gyration tensor (Figure S4). The latter offers a measure of sphericity of droplets. We found that low λ cases (λ = 1, 1.01, 1.02) at 0 mM salt have the most spherical morphologies. Beyond λ = 1.025 at no salt, the cluster formation is not tight (as evident from the Rg) so it also loses its sphericity. The condition that shows percolation (λ = 1 at 50 mM salt) has the largest deviation from the sphericity (it is rod-like instead) combined with a large Rg.”
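The two shape observables named in the quoted passage can be sketched in a few lines. This is a minimal, unweighted illustration and not the analysis code behind Figure S4: the function names, the equal weighting of all beads, and the closed-form eigenvalue routine are assumptions made for this example only.

```python
import math

def gyration_tensor(coords):
    """Unweighted 3x3 gyration tensor of a list of (x, y, z) bead positions."""
    n = len(coords)
    center = [sum(p[k] for p in coords) / n for k in range(3)]
    tensor = [[0.0] * 3 for _ in range(3)]
    for p in coords:
        d = [p[k] - center[k] for k in range(3)]
        for a in range(3):
            for b in range(3):
                tensor[a][b] += d[a] * d[b] / n
    return tensor

def symmetric_eigenvalues(m):
    """Eigenvalues of a real symmetric 3x3 matrix, largest first (trigonometric closed form)."""
    off = m[0][1] ** 2 + m[0][2] ** 2 + m[1][2] ** 2
    if off == 0.0:  # already diagonal
        return sorted((m[0][0], m[1][1], m[2][2]), reverse=True)
    q = (m[0][0] + m[1][1] + m[2][2]) / 3.0
    p = math.sqrt((sum((m[k][k] - q) ** 2 for k in range(3)) + 2.0 * off) / 6.0)
    b = [[(m[i][j] - (q if i == j else 0.0)) / p for j in range(3)] for i in range(3)]
    det = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
           - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
           + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
    r = max(-1.0, min(1.0, det / 2.0))  # clamp against round-off
    phi = math.acos(r) / 3.0
    largest = q + 2.0 * p * math.cos(phi)
    smallest = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    return [largest, 3.0 * q - largest - smallest, smallest]

def shape_descriptors(coords):
    """Return (Rg, smallest/largest eigenvalue ratio); assumes the beads do not all coincide."""
    eig = symmetric_eigenvalues(gyration_tensor(coords))
    rg = math.sqrt(max(sum(eig), 0.0))  # Rg^2 is the trace of the tensor
    return rg, eig[2] / eig[0]
```

A ratio close to 1 indicates a sphere-like cluster, while a ratio close to 0 indicates a rod-like or planar one. In practice the coordinates would come from the beads assigned to the largest cluster in each trajectory frame.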

      Comment: 1c) Additionally, independent replicas of droplet formation for each condition and parameter set would be ideal, but we realize that this would be expensive in computational resources and may be infeasible.

      The author’s response: We agree with the reviewer that the molecular simulations presented in this work are highly computationally demanding (e.g., a 10-microsecond simulation at a given salt concentration and a given λ takes about 25 days of wall-clock time, occupying 28 CPUs and 4 GPUs). While it certainly is computationally demanding to replicate all λ parameters at all salt concentrations, we have now rerun the 50 mM salt concentration at all λ parameters, starting from a completely different initial condition (preformed droplet) for each. We found that the morphology was not reproducible within the given simulation time at low λ, highlighting the initial condition dependence at low λ conditions. We have now discussed this in the main text (Page 21, Conclusions).

      “We also note that we observed an initial condition dependence of the morphology at low λ conditions at 50 mM salt. This finding emphasizes the necessity of future work for exploring condensate morphology with proper advanced sampling techniques.”

      Comment: 2) “As λ increases, the volume of the dense phase increases (and condensed phase concentration decreases accordingly) until the system is not capable of forming a dense phase (λ >1.03)”: From Fig. 1 it seems that the rate of cluster formation decreases as λ increases. Is it not then possible that droplet formation at λ>1.03 is stable at equilibrium, but occurs on time-scales greater than those tested in the simulations? To support the statement that no droplets are stable at λ>1.03, we suggest that the author runs simulations with a higher value of λ starting from the structure of the spherical droplet (formed with λ=1.0 and no salt) to observe whether the droplet is dissolved or remains stable.

      The author’s response: Following the reviewer’s suggestion, we have performed a simulation for λ=1.04 at no salt condition starting from the preformed spherical droplet (last saved configuration of λ=1.0 at 0mM salt) and we found that the droplet quickly dissolves for λ=1.04. This finding is now presented in Figure S3.

      The main text is also modified as follows (Page 9, end of the first paragraph):

      “To further verify that no droplets are stable beyond λ = 1.03, we also ran λ = 1.04 simulations at no salt conditions starting from a preformed spherical droplet (last saved configuration of λ = 1 at 0 mM salt). We then analyzed the cluster formation as a function of time (Figure S3) and found that the initial droplet dissolves quickly (at a timescale shorter than that of the formation of the droplets).”

      Comment: 3) Figure 3: The use of the radial distribution does not seem ideal for the droplets that have a non-spherical morphology, as certain distances will report on an average over the dense and dilute phases. This should at a minimum be discussed.

      The author’s response: Following the reviewer’s suggestion, we have added further discussion of the radial density distribution to the main text (Page 12, first paragraph):

      “This approach works reasonably well for droplets that have spherical/ellipsoidal shapes. However, since the condensates for the conditions with finite salt concentrations significantly deviate from a sphere (they do not show a clear plateau as the center is approached), we used a surface reconstruction method [54] to estimate the volume and concentration instead of fitting the radial density profiles/using the limiting values.”

      Comment: 4) Table 1: It seems that the discrepancy between the sigmoidal fit approach and the surface reconstruction approach increases with λ, possibly due to sensitivity to the shape of the droplets, illustrating that there might be significant uncertainty associated with the reported dense phase volumes. We think it would be useful to have an error estimate for the reported dense phase volumes (e.g. an error over volume calculation approaches and/or over different probe sizes).

      The author’s response: The volume obtained by surface reconstruction is indeed highly sensitive to the probe size. To justify the size of the probe that I used, I directly compared the protein concentrations from the sigmoidal fit with those from surface reconstruction calculated with different probe sizes (Figure S3 in the old SI, Figure S6 in the revised SI). Based on that comparison, a probe radius of 10 Å was the size that minimized the differences across all λ values. For the uncertainty/error estimates, I performed block averaging analyses (please also see the response to point 7).

      Comment: 5) Table 2 and Fig. 4: We suggest that the author more explicitly states which experimental data was used for comparison with the simulations in Fig. 4. We also suggest a more direct comparison with experimental data points where possible (e.g. by showing the experimental values of csat as a function of NaCl concentration).

      The author’s response: We used two experimental papers to extract the experimental data, one is reference 36. In reference 36, the authors state: “Using incubation on ice to increase the driving force for droplet formation followed by centrifugation to fuse the droplets due to their higher density, our 15 ml samples of 1 mM FUS LC phase-separated to form an ∼400 μl viscous, protein-dense phase stable for weeks at room temperature. FUS LC concentration in the phase is approximately 7 mM (120 mg/ml FUS LC) as determined by spectrophotometry.”

      We note that the salt concentration is not specified in this case (or the authors obtained approximately the same protein concentration in the dense phase regardless of the salt concentration). Also, the thermodynamic conditions defined here do not exactly correspond to those in our simulations. That is partly the reason why we looked for multiple sources of experimental data. The other experimental work that we used is reference 39. In reference 39, the authors state that “The relative intensity of the glutamine side chain residue NMR resonances in the condensed phase compared to a standard concentration (100 μM) dispersed phase FUS LC suggests a concentration of 27.8 mM = 477 mg/ml in the condensed phase.”

      The corresponding NMR experiments were carried out at 25 °C in 50 mM MES, 150 mM NaCl, pH 5.5. These conditions do not exactly correspond to our thermodynamic conditions, either. Since an exact match in conditions is not available, we preferred not to present a direct comparison of dense phase concentrations; instead, we show a range in Figure 4. We have now modified the main text (Page 15, right above the Contact Maps subsection) to state the source of the data more explicitly:

      “The experimental data range is referenced from the work by Fawzi and coworkers; [36,39] where reference [36] measures the FUS LC concentration in the dense phase as approximately 120 mg/mL (spectroscopically) and in reference [39], a 477 mg/mL FUS LC concentration is deduced from the relative intensity of the glutamine side chain residue NMR resonances in the condensed phase (compared to a standard protein concentration in the dispersed phase, which is given as 100 μM, or 1.71 mg/mL). 477 mg/mL FUS LC dense phase has been obtained from 15 ml samples of 1 mM FUS LC solutions [36] (from which we calculated the dilute phase concentration as approximately 14.3 mg/mL). We used these dense phase and their respective dilute phase concentrations to calculate the experimental range of transfer free energy (gray-shaded areas in Figure 4).”
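The figures quoted in this exchange can be checked with a short mass balance. The sketch below is an illustration only. It assumes a conversion of 17.1 mg/mL per mM (implied by the quoted equivalence of 100 μM and 1.71 mg/mL), takes the 15 ml sample and the roughly 400 μl dense phase at 120 mg/mL from the passage quoted from reference 36, and assumes the common convention ΔG = −RT ln(c_dense/c_dilute) at 298.15 K. The temperature, the sign convention, the pairing of 477 mg/mL with the 100 μM dispersed-phase standard, and all function names are assumptions rather than details confirmed by the manuscript.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ/(mol K)

def dilute_concentration(total_volume_ml, total_conc, dense_volume_ml, dense_conc):
    """Dilute-phase concentration (mg/mL) from conservation of protein mass."""
    total_mass = total_volume_ml * total_conc
    dense_mass = dense_volume_ml * dense_conc
    return (total_mass - dense_mass) / (total_volume_ml - dense_volume_ml)

def transfer_free_energy(c_dense, c_dilute, temperature_k=298.15):
    """Assumed convention: dG = -RT ln(c_dense / c_dilute), in kJ/mol."""
    return -R_KJ_PER_MOL_K * temperature_k * math.log(c_dense / c_dilute)

MG_PER_ML_PER_MM = 17.1  # assumed: 100 uM corresponds to 1.71 mg/mL

# 15 ml of 1 mM FUS LC separating into ~0.4 ml of dense phase at 120 mg/mL
c_dilute = dilute_concentration(15.0, 1.0 * MG_PER_ML_PER_MM, 0.4, 120.0)
print(f"dilute phase: {c_dilute:.1f} mg/mL")
print(f"dG (120 mg/mL vs dilute): {transfer_free_energy(120.0, c_dilute):.2f} kJ/mol")
print(f"dG (477 vs 1.71 mg/mL):   {transfer_free_energy(477.0, 1.71):.2f} kJ/mol")
```

Under these assumptions the mass balance gives a dilute-phase concentration of about 14.3 mg/mL, the value stated in the quoted text, and both transfer free energies are negative, which under this convention means the dense phase is favoured.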

      Comment: 6) “We used the “tiny” bead type (TQ1) both for Na+ and Cl- ions”: The author should clarify the reason for and possible effects of choosing the TQ1 bead type, as TQ5 is, we think, the standard bead type for Na+ and Cl- ions in Martini 3.

      The author’s response: We would like to clarify that “tiny” refers to the bead type being Txx. We would also like to clarify that the TQ5 type was not available in the MARTINI version that we used. The ion topology file in the version that we used only had the TQ1 type for these ions. We are pasting the contents of the “martini_v3.0_ions.itp” file below:

      ;;; IONS
      ;

      ;;;;;; SODIUM ION

      [moleculetype]
      ; molname nrexcl
      TNA 1

      [atoms]
      ;id type resnr residu atom cgnr charge
      1 TQ1 1 ION NA 1 1.0

      ;;;;;; CHLORIDE ION

      [moleculetype]
      ; molname nrexcl
      TCL 1

      [atoms]
      ;id type resnr residu atom cgnr charge
      1 TQ1 1 ION CL 1 -1.0

      ;;;;;; CHOLINE ION

      [moleculetype]
      ; molname nrexcl
      NC3 1

      [atoms]
      ;id type resnr residu atom cgnr charge
      1 Q0 1 ION NC3 1 1.0

      ;;;;;; CALCIUM ION

      [moleculetype]
      ; molname nrexcl
      SCA 1

      [atoms]
      ;id type resnr residu atom cgnr charge
      1 SQ2 1 ION CA 1 2.0

      Since we understand that this is causing confusion, we modified the sentence as below (Page 6, right above the Simulation Details section):

      “We used the relevant TQ bead types for Na+ and Cl- ions and kept the ion-water and ion-protein interactions unmodified.”

      For further details of the parameters (e.g., epsilon-sigma), we made our topology and run parameter files publicly available (please see the response to the point 10).

      Comment: 7) We suggest that the author, where possible, reports error estimates for the various observables, for example from block error analysis and/or repeated simulations.

      The author’s response: We performed block averaging analysis (using two blocks) for volume estimation (and, accordingly, the protein concentration in the dense phase) and included the error estimates in Table 1 (Page 12). We note that for most λ parameters the error was less than 1%, but we have now added the errors larger than 1% in Figure 4. We modified the Table 1 caption as:

      “…. Statistical errors calculated by block averaging of the data (dividing the equilibrated data into two equal blocks) are less than 1% at low λ conditions. Errors larger than 1% are reported.”
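The block averaging described in this response can be illustrated with a small sketch. It is not the author's analysis script: the function name is hypothetical and the volume values are made-up numbers chosen only to show the arithmetic.

```python
import math
import statistics

def block_average(samples, n_blocks=2):
    """Mean and standard error estimated from the means of n_blocks equal, contiguous blocks."""
    block_len = len(samples) // n_blocks
    if n_blocks < 2 or block_len < 1:
        raise ValueError("need at least two non-empty blocks")
    # Trailing samples that do not fill a whole block are dropped
    block_means = [
        statistics.fmean(samples[i * block_len:(i + 1) * block_len])
        for i in range(n_blocks)
    ]
    mean = statistics.fmean(block_means)
    std_err = statistics.stdev(block_means) / math.sqrt(n_blocks)
    return mean, std_err

# Hypothetical dense-phase volumes (nm^3) from the equilibrated part of a trajectory
volumes = [1010.0, 1005.0, 995.0, 990.0, 1012.0, 1004.0, 998.0, 1002.0]
mean, err = block_average(volumes, n_blocks=2)
print(f"{mean:.1f} +/- {err:.1f} nm^3 ({100 * err / mean:.2f}% relative error)")
```

With two blocks, the standard error reduces to half the absolute difference between the two block means, so the estimate is coarse. Using more blocks, each longer than the correlation time of the observable, would give a more robust estimate where the trajectory length allows.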

      Comment: 8) It would be useful to include a discussion of the effects of simulation convergence and simulation starting configurations on the reported results.

      The author’s response: We added a discussion of the reproducibility issue and the initial condition dependence both to the Results and Discussion section and the Conclusions section (please also see the responses to the point 1a and 1c).

      Comment: 9) A discussion of the potential differences in the effect of non-bonded cut-offs in the dilute and dense phase would also be useful.

      The author’s response: We used a fairly large cutoff distance (1.1 nm) for the short-range treatment of vdW and electrostatics, but a potential nonbonded cutoff effect that I can think of is the long-range treatment of electrostatics. While vdW interactions decay with a large power of r in the denominator (and therefore make a negligible contribution to the potential at large r), the long-range treatment of electrostatics might be a concern in general. It is well known that a simple cutoff of electrostatic interactions introduces artifacts in the phase behavior of anomalous liquids that have two distinct phases [e.g., J. Chem. Phys. 131, 104508 (2009)]. Here, we applied the reaction field method for the long-range treatment of electrostatics. In this method, a given particle is assumed to be surrounded by a spherical cavity of finite radius within which the electrostatic interactions are calculated explicitly. Outside the cavity, the system is treated as a dielectric continuum. Any net dipole within the cavity induces a polarization in the dielectric, which in turn interacts with the given molecule. The reaction field method allows the replacement of the infinite Coulomb sum by a finite sum plus the reaction field. One caveat of this approach might be the nonuniform distribution of the particles within the system (i.e., one protein-dense phase and one protein-dilute phase), which may jeopardize the assumption that the region outside the cavity is a uniform continuum dielectric. While this caveat may make the Ewald summation (or particle mesh Ewald, a faster version of the Ewald sum) look preferable, we note that Ewald sum and reaction field techniques yield nearly identical phase behavior for liquid crystals (also nonuniform in nature) (see Molecular Physics 92(4), 723-734 (1997)). We discussed some of these points in the main text as follows (Page 6, third from the last sentence):

      “Long-range electrostatic interactions were calculated using a generalized reaction field method [45]. We note that a long-range treatment of electrostatic interactions is essential to obtain accurate phase behavior [46].”
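
      To make the reaction field idea described in this response concrete, here is a minimal sketch of the plain (not generalized) reaction-field pair energy in the form used by common MD engines. The 1.1 nm cutoff is taken from the response; the relative permittivity of 15 and the infinite continuum permittivity are assumptions typical of Martini setups and are not confirmed by the response, and the function name is illustrative only.

```python
import math

# Coulomb prefactor 1/(4*pi*eps0) in kJ mol^-1 nm e^-2.
COULOMB_CONSTANT = 138.935458


def reaction_field_energy(q_i, q_j, r, r_cut=1.1, eps_r=15.0, eps_rf=math.inf):
    """Pair energy (kJ/mol) of two charges (in e) at distance r (nm).

    eps_r is the permittivity inside the cavity and eps_rf the permittivity
    of the continuum outside it. Both defaults are assumptions, not values
    reported in the response above.
    """
    if r >= r_cut:
        return 0.0  # explicit interactions are only counted inside the cavity
    if math.isinf(eps_rf):
        k_rf = 1.0 / (2.0 * r_cut**3)
    else:
        k_rf = (eps_rf - eps_r) / ((2.0 * eps_rf + eps_r) * r_cut**3)
    # The constant shift makes the energy vanish exactly at the cutoff.
    c_rf = 1.0 / r_cut + k_rf * r_cut**2
    return COULOMB_CONSTANT * q_i * q_j / eps_r * (1.0 / r + k_rf * r**2 - c_rf)
```

      The shift term is what distinguishes this from a bare truncated Coulomb sum: the energy goes to zero smoothly at the cutoff instead of jumping, which is one reason the method is less prone to the cutoff artifacts mentioned above. It still assumes a uniform medium outside the cavity, which is the caveat the response raises for a two-phase system.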

      Comment: 10) It would be very useful if the inputs/settings (including starting configurations) used for simulation and code for analysis were available.

      The author’s response: Following the reviewer’s suggestion, we uploaded the initial configurations and run files for all lambda values at 0 mM and 100 mM salt to GitHub and made them publicly available. We now note the availability of the data in the main text by modifying the last paragraph of the Modeling subsection as follows:

      “Equilibrated initial conditions, topology files, and run parameter files for all λ values of 0 mM and 100 mM salt are publicly available on GitHub (https://github.com/gzerze/m...

      Comment: We also have the following suggestions for minor revisions to the manuscript: 1) “We kept the protein-protein interactions unmodified (and no additional elastic backbone constraints were applied)”: The author should clarify whether this includes assignment of secondary structure and/or side chain angle and dihedral restraints (ss and scfix in Martinize).

      The author’s response: Yes, this would apply for any restraints (i.e., they would remain unmodified). This particular protein, FUS LC, is left fully flexible, without any backbone/side chain structure. We clarified this in the main text by modifying the relevant part in the Modeling subsection:

      “No elastic backbone (or side chain) constraints were applied (i.e., FUS LC is kept fully flexible). We kept the protein-protein interactions unmodified but systematically tested a range of scaled protein-water interactions.”

      Comment: 2) “All simulations were performed using GROMACS MD engine (version 2016.3).”: Error in references.

      The author’s response: The references are fixed.

      Comment: 3) In the Cluster Formation Analysis section: We suggest that the author cites the specific package used (e.g. SciPy).

      The author’s response: Following the reviewer’s suggestion, we added the name of the routine and the related references by modifying the relevant part of the Cluster Formation Analysis subsection as follows:

      “Any two protein molecules are considered to be in the same cluster if any two beads of the molecules are within 0.5 nm (or less) distance from each other. Based on this criterion, we built adjacency matrices and then found the connected components by using the compressed sparse graph routines of public Python libraries [50]”
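
      The clustering criterion quoted above can be sketched as follows. This is an illustration under stated assumptions, not the author's code: it uses SciPy's `connected_components` (the package the reviewers suggested citing), takes bead coordinates in nm, ignores periodic boundary conditions for brevity, and `cluster_molecules` is a hypothetical name.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import cdist


def cluster_molecules(molecules, cutoff=0.5):
    """Group molecules into clusters.

    molecules: list of (n_beads, 3) coordinate arrays in nm.
    Returns (number_of_clusters, cluster_label_per_molecule).
    Periodic images are NOT considered in this simplified sketch.
    """
    n = len(molecules)
    adjacency = np.zeros((n, n), dtype=np.int8)
    for i in range(n):
        for j in range(i + 1, n):
            # Two molecules are linked if any bead pair is within the cutoff.
            if cdist(molecules[i], molecules[j]).min() <= cutoff:
                adjacency[i, j] = adjacency[j, i] = 1
    # Connected components of the undirected contact graph are the clusters.
    return connected_components(csr_matrix(adjacency), directed=False)
```

      The all-pairs loop is quadratic in the number of molecules; for production-size systems a neighbor search with periodic boundaries would be needed, which this sketch deliberately leaves out.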

      Comment: 4) Fig. 2: There are small red dots on the droplets, which should either be explained in the figure text or removed.

      The author’s response: Following the reviewer’s suggestion, we remade Figure 2 and removed the red dots.

      Comment: 5) Fig. 3: It would be useful for the reader if the NaCl concentration was labelled at the top of each column. Additionally, the radial distribution of the ion concentration is shown as two separate rows, which we assume corresponds to Na+ and Cl- ions. This should be clearly labelled.

      The author’s response: Following the reviewer’s suggestion, we updated Figure 3 with proper labels.

      Comment: 6) “We found the largest water fraction For the ionic species…”: Typo?

      The author’s response: We have now removed that incomplete sentence.

      Comment: 7) Fig. 4: Depending on how the plot is updated with more details on the experiments, perhaps the range shown on the y-axis could be made smaller.

      The author’s response: Figure 4 is updated as presented above (please see the response to point 7 above).

      Comment: 8) Fig. 5: May be clearer with a colourmap with three colours, as in figure 6.

      The author’s response: Figure 5 uses a color scale that changes uniformly from black to white. For contact maps (like Figure 5), since the plotted quantity is a fraction that grows sequentially, we thought a perceptually uniform sequential color scale fits better than a diverging color scale (e.g., the color scale in Figure 6).

    2. On 2022-11-27 12:46:31, user Kresten Lindorff-Larsen wrote:

      Review of “Optimizing the Martini 3 force field reveals the effects of the intricate balance between protein-water interaction strength and salt concentration on biomolecular condensate formation” by Gül H. Zerze
      Reviewed by F. Emil Thomasen and Kresten Lindorff-Larsen

      Comments: The preprinted manuscript by Zerze reports on molecular dynamics simulations of the intrinsically disordered low complexity domain (LCD) of FUS using a beta version of the coarse-grained force field Martini 3. The author performed simulations to study the formation of FUS LCD condensates under varying protein-water interaction strengths (in the Martini force field) and at different NaCl concentrations, and concludes that strengthening protein-water interactions by a factor of 1.03 improves the agreement with experimental transfer free energies between the dilute and dense phases. Additionally, the author concludes that the NaCl concentration affects condensate morphology and protein-protein interactions in the condensate, and that the effect of NaCl concentration on protein-protein interactions in the condensate is sensitive to rescaling of the protein-water interactions. The preprint provides an interesting and novel benchmark of the (beta) Martini 3 model in predicting phase separation of IDPs, and reveals potential shortcomings of the model in predicting protein concentrations in (or volumes of) the condensed and dilute phases. This benchmark will be useful for readers who wish to simulate liquid-liquid phase separation of IDPs with Martini 3, and the work will be interesting to a wider audience interested in the biophysics of IDPs and their condensates.

      Below we outline some questions and comments that the author might take into account when revising the manuscript. Our main comment regards a clearer assessment of the convergence of the simulations and correspondingly the lack of error estimates for observables calculated from the simulations. We also suggest a clearer presentation of the experimental data used to validate the simulations. While some of these changes are mostly textual, in other cases we suggest additional simulations. We realize that some of these simulations require substantial resources; if these are beyond what is available, we suggest at least to clarify caveats as per the points below.

      We have the following suggestions for revisions to the manuscript:

      1) Fig. 1 and 2: The finding of non-spherical droplets is interesting and intriguing. To examine whether the formation of these shapes in the simulations with higher salt and λ-values represent stable states or perhaps trapped metastable states of the system, we suggest that:

      1a) The author runs simulations with the parameters that give rise to non-spherical morphologies (e.g. λ=1.025 and 50 mM NaCl) starting from the structure of the spherical droplet (for example formed with λ=1.0 and no salt) and observe whether the non-spherical morphology is recovered or the droplet remains stable. If the droplet remains stable, then the effect of salt concentration on the inter-chain contacts (Fig. 6) could be assessed without potentially confounding factors from different dense phase morphologies.

      1b) The author shows time-series or distributions of an observable that reports on the dynamics of the proteins in the non-spherical droplet (e.g. Rg, mean square displacement, residue-residue contacts) and/or of an observable that reports on the dynamics of the droplet shape (e.g. the x-, y-, and z-components of the gyration tensor).

      1c) Additionally, independent replicas of droplet formation for each condition and parameter set would be ideal, but we realize that this would be expensive in computational resources and may be infeasible.
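
      For point 1b, one way to follow the droplet shape over time is through the gyration tensor of the bead positions in the droplet. The sketch below is a minimal illustration, assuming equal-mass beads and coordinates that have already been made whole across periodic boundaries; the function names are illustrative and not taken from the preprint. It returns the rotation-invariant eigenvalues rather than the raw x-, y-, and z-components, which depend on droplet orientation.

```python
import numpy as np


def gyration_tensor(positions):
    """3x3 gyration tensor of an (n_beads, 3) coordinate array (equal masses)."""
    centered = positions - positions.mean(axis=0)
    return centered.T @ centered / len(positions)


def shape_descriptors(positions):
    """Return (sorted eigenvalues, squared radius of gyration, relative shape anisotropy)."""
    eigenvalues = np.sort(np.linalg.eigvalsh(gyration_tensor(positions)))
    rg_squared = eigenvalues.sum()
    # Relative shape anisotropy: 0 for a spherically symmetric object, 1 for a rod.
    kappa_squared = 1.5 * (eigenvalues**2).sum() / rg_squared**2 - 0.5
    return eigenvalues, rg_squared, kappa_squared
```

      Evaluated per frame, a time series of the anisotropy that drifts steadily toward zero would point to slow relaxation toward a sphere, whereas a stationary non-zero value would be more consistent with a stable (or long-lived) non-spherical state.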

      2) “As λ increases, the volume of the dense phase increases (and condensed phase concentration decreases accordingly) until the system is not capable of forming a dense phase (λ >1.03)”: From Fig. 1 it seems that the rate of cluster formation decreases as λ increases. Is it not then possible that droplet formation at λ>1.03 is stable at equilibrium, but occurs on time-scales greater than those tested in the simulations? To support the statement that no droplets are stable at λ>1.03, we suggest that the author runs simulations with a higher value of λ starting from the structure of the spherical droplet (formed with λ=1.0 and no salt) to observe whether the droplet is dissolved or remains stable.

      3) Figure 3: The use of the radial distribution does not seem ideal for the droplets that have a non-spherical morphology, as certain distances will report on an average over the dense and dilute phases. This should at a minimum be discussed.

      4) Table 1: It seems that the discrepancy between the sigmoidal fit approach and the surface reconstruction approach increases with λ, possibly due to sensitivity to the shape of the droplets, illustrating that there might be significant uncertainty associated with the reported dense phase volumes. We think it would be useful to have an error estimate for the reported dense phase volumes (e.g. an error over volume calculation approaches and/or over different probe sizes).

      5) Table 2 and Fig. 4: We suggest that the author more explicitly states which experimental data was used for comparison with the simulations in Fig. 4. We also suggest a more direct comparison with experimental data points where possible (e.g. by showing the experimental values of csat as a function of NaCl concentration).

      6) “We used the “tiny” bead type (TQ1) both for Na+ and Cl- ions”: The author should clarify the reason for and possible effects of choosing the TQ1 bead type, as TQ5 is, we think, the standard bead type for Na+ and Cl- ions in Martini 3.

      7) We suggest that the author, where possible, reports error estimates for the various observables, for example from block error analysis and/or repeated simulations.

      8) It would be useful to include a discussion of the effects of simulation convergence and simulation starting configurations on the reported results.

      9) A discussion of the potential differences in the effect of non-bonded cut-offs in the dilute and dense phase would also be useful.

      10) It would be very useful if the inputs/settings (including starting configurations) used for simulation and code for analysis were available.
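
      As an illustration of the block error analysis mentioned in point 7, here is a minimal sketch of block averaging for a correlated time series of some observable. The function name and arguments are illustrative; the choice of observable and block count is left to the user.

```python
import numpy as np


def block_standard_error(series, n_blocks):
    """Standard error of the mean estimated from n_blocks contiguous blocks."""
    series = np.asarray(series, dtype=float)
    block_length = len(series) // n_blocks
    if n_blocks < 2 or block_length < 1:
        raise ValueError("need at least 2 blocks with at least 1 sample each")
    # Drop trailing samples that do not fill a complete block.
    trimmed = series[: block_length * n_blocks]
    block_means = trimmed.reshape(n_blocks, block_length).mean(axis=1)
    # Spread of the block means, scaled to an error on the overall mean.
    return block_means.std(ddof=1) / np.sqrt(n_blocks)
```

      In practice the estimate is repeated for decreasing numbers of blocks (longer blocks); the error typically grows and then plateaus once blocks are longer than the correlation time. If no plateau appears, the simulation is probably too short for a reliable error estimate, which ties into the convergence question in point 8.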

      We also have the following suggestions for minor changes to the manuscript:

      1) “We kept the protein-protein interactions unmodified (and no additional elastic backbone constraints were applied)”: The author should clarify whether this includes assignment of secondary structure and/or side chain angle and dihedral restraints (ss and scfix in Martinize).

      2) “All simulations were performed using GROMACS MD engine (version 2016.3).”: Error in references.

      3) In the Cluster Formation Analysis section: We suggest that the author cites the specific package used (e.g. SciPy).

      4) Fig. 2: There are small red dots on the droplets, which should either be explained in the figure text or removed.

      5) Fig. 3: It would be useful for the reader if the NaCl concentration was labelled at the top of each column. Additionally, the radial distribution of the ion concentration is shown as two separate rows, which we assume corresponds to Na+ and Cl- ions. This should be clearly labelled.

      6) “We found the largest water fraction For the ionic species…”: Typo?

      7) Fig. 4: Depending on how the plot is updated with more details on the experiments, perhaps the range shown on the y-axis could be made smaller.

      8) Fig. 5: May be clearer with a colourmap with three colours, as in figure 6.

    1. On 2023-01-18 18:20:58, user Nick Bauer wrote:

      I'm disappointed to see that the code for this has not been made public yet. The preprint states that the code will be made available with formal publication, but it has been three years since the preprint was posted and I can find no formal published version.

    1. On 2023-01-05 20:51:34, user Gregory Way wrote:

      Wong et al. present a deep learning approach called MOAProfiler (MP) to specifically predict compound mechanism of action (MOA) from Cell Painting images. They benchmark MP against CellProfiler (the standard image-based profiling approach) and DeepProfiler (an emerging image-based profiling approach also based on deep learning) using two publicly-available datasets (JUMP-pilot and LINCS). They evaluate these approaches using precision, recall, and f1 score at k for held out MOA predictions and by comparing similarity between same-MOA and different-MOA profiles. They report an astounding 1,000% performance increase for MP over CellProfiler and DeepProfiler in grouping like MOAs. We thank the authors for posting their work as a preprint - thank you!

      The primary innovation in MP is the specific training approach. MP uses the same architectural backbone as DeepProfiler (EfficientNet), but trains the model directly to predict compound MOA (instead, DeepProfiler uses EfficientNet to derive representations). Additionally, MP does not perform single-cell segmentation, instead training using full field of views and a series of data augmentations. The authors use the last layer as the per-compound feature embedding in their performance benchmarks. The authors also include several convincing supplementary analyses that further support their claims.

      Overall, the paper presents a very interesting observation and pushes against a commonly held mindset of analyzing Cell Painting data with generalist/universal approaches. Instead, the paper suggests that fine-tuned models for specific applications are vastly superior for specifically tailored tasks.

      However, we have two major concerns and several relatively minor comments that the authors might clarify in order to strengthen their findings and claims.

      Major concerns:

      • Our primary concern involves publicly-available resources. Namely, the github url is not public: https://github.com/pfizer-r.... Because we were unable to access the code, we were not able to perform a detailed code review. Additionally, the authors link to the CellProfiler and DeepProfiler embeddings they used to benchmark. These embeddings were derived from https://github.com/broadins.... These are not the official LINCS and JUMP resources, and at least one of the links pointed to level 3 profiles, which are not normalized. This could at least partially explain the exceptionally poor performance for CellProfiler and DeepProfiler.
      • Second, the authors train two separate MP models for both datasets. Did the authors try applying a trained-MP on the alternative dataset? The authors state: “To simulate the real-world use case of identifying MOAs of unknown held-out compounds, we performed an analysis where we split the dataset by compound instead of by wells (Methods).” We imagine that analyzing future compounds using embeddings of a pre-trained MP is also a common real-world application. This analysis would also reveal the level of overfitting occurring in each independently trained dataset. Would combining datasets improve performance?

      Minor comments and concerns:

      • The authors state: “Although traditional computer vision techniques have proved useful, they often require much fine-tuning and require human intelligence and intuition for deciding which phenotypic features and their parameters are important to measure.” We think this is a really good point, and we are glad someone else brought up the parameters and all the fine-tuning that typically needs to happen, even for generalist approaches.

      • The authors state: “In contrast, deep learning has emerged as a tool for learning and encoding meaningful representations (i.e. embeddings) without requiring humans to know beforehand what features may be useful for the task of interest.” We may have missed this, but the authors might decide to mention the deep learning limitation of having unlabeled and difficult-to-interpret features.

      • Figure 1B needs a scale bar

      • The authors state: “We divided the dataset such that 60% of the wells were assigned to training, 10% to validation, and 30% to test (Methods).” What does “class-balanced the training set” mean? Is this during cross validation? The authors should clarify.

      • The authors state: “We also ensured each MOA’s test wells spanned multiple plates (at least seven, Figure 2D, left)”. However, Figure 2D shows that most MOAs in LINCS spanned fewer than 7 plates, what did the authors do with those?

      • The authors state: “We also included the negative DMSO as a class to learn but excluded it from all performance metrics because of its overrepresentation in the dataset”. It would be helpful for the authors to clarify how they handled positive controls. Also related, the authors state: “we performed four analyses to assess how well the embeddings captured MOA-specific features.” How did DMSO perform? It would be interesting to see the distribution of DMSO probabilities across classes, which could point to classes with no effect or how often DMSO features might be influenced by batch effects.

      • For Figure 3A, the authors should clarify that their supervised learning architecture was multi-class. This is not explicitly stated.

      • The authors state: “On the held-out test set, the model achieved an area under the precision recall curve (AUPRC) of 0.46 (random AUPRC = 0.006) for image field classification over 176 MOA classes (Figure 3A)”. How are the authors calculating this random AUPRC? If this is theoretical, the authors should compare performance with a model trained with a randomly shuffled baseline.

      • Additionally, the authors state elsewhere: “it was able to correctly predict MOAs for 10.2-13.6% of the compounds in a space of 176 possible MOAs. Compared to a random baseline of 0.6%, this is a 17.9-23.9x improvement.” This raises another question of how the authors formed the baselines. Also, why did the authors choose not to include DP and CP in this evaluation?

      • The Figure 4B plate map might be wrong. There are more DMSO and what are the NAs?

      • How did the authors determine the categories “strongly correlated” and “weakly correlated”? At different thresholds did MP still outperform?

      • The authors state: “Performance varied depending upon whether we predicted MOA by the neural network’s classification output or by a compound’s latent similarity to training compound embeddings”. The authors should clarify how they determined these classification outputs and latent similarities as they are introduced.

      • The authors state: “(delta = 0.44 for MP, Figure 3C). For both CP and DP, the difference was smaller (delta = 0.03 for CP, 0.03 for DP).” The authors define delta in the figure legend, but this should also be clearly delineated in the methods.

      • We were confused by the legend in figure 4D - why are each of the models showing a different k? Is this the optimal k? MP doesn’t look optimal at k=4.

      • The authors state: “From a low-dimensional t-distributed stochastic neighbor embedding (TSNE) visualization of embeddings from three example MOAs, we could see that different compounds with the same MOA were clustered together with different MOAs inhabiting different areas in latent space (Figure 3G).” How did the authors choose these three example MOAs? Why not include all of them? It would be nice to visualize all embeddings for both datasets, and the TSNE plots look a bit strange, with highly similar distances between points.

      • The authors state: “However, the model created embeddings that were clustered by MOA despite each MOA being represented by multiple compounds (Supplemental Figure 1).” Supplementary Figure 1 is not a specific enough reference - there are multiple panels and it is unclear which panel the reader should focus on.

      • The authors state: “We found minor differences in classification accuracy (0.54 vs .50) suggesting that the model was not leveraging much confounding edge-specific features for its learning” Given the number of NA’s (especially in LINCS platemap in Figure 4), normalization to remove batch effects or TSNE/UMAP to suggest no batch effects would be more convincing.

      • Figure 6G mentions different shapes in the legend but all look like circles in the image (they are different but it's very hard to tell). The authors also forgot to include the letter g in the figure legend.

      • Does supplementary figure 3 show MP embeddings? This is not explicitly stated.

      • Performance across MOA counts for MP is impressive! Very strong performance at low n

      • In the discussion, the authors state: “Second, all the analyses were performed on compounds with just one known MOA. Understanding drugs that are associated with multiple MOAs is an important task, but our study did not address this question.” The authors seem to avoid explaining this in-depth throughout the article. Why is it an important task and is their justification for not including drugs with multiple MOAs good enough? They mention that they didn’t include compounds with multiple MOAs to simplify the compound space and limit polypharmacology intricacies. Did they try to include compounds with multiple MOAs? If so, I think they should report the results. If the results are bad, then that could give insight into how we can improve performance.

      • In the discussion, the authors state: “Although DP is another deep learning based approach to phenotypic profiling that also uses an EfficientNet backbone architecture, we observed larger performance gains with MP.” What was the authors’ rationale for using EfficientNet? Also, “architecture” here and in other sentences appears to have a broad meaning. Could another word be substituted for greater specificity? We think it would be helpful to include a diagram of their model architecture.

      • The authors state: “We permuted each channel’s brightness and contrast independently by a random factor in the range of 0 to 0.30 (just for the LINCS dataset).” This seems non-traditional, the authors should provide a citation. Why not perform this in the JUMP dataset?

      • The authors state: “As a final training augmentation step, we performed random 90-degree rotations on each image, along with random horizontal flips.” The authors should specify how many augmentations they performed, how did this expand the dataset, were any specific augmentations particularly helpful?

      • The authors state: “We kept only the compound data that had no more than one known MOA according to the CLUE Connectivity Map…” How often does compound data have more than one MOA from the CLUE Connectivity Map? Would it create a significant difference in results if others were included? How was CLUE connectivity data joined with or used as a filter for JUMP1?

      • The authors state: “We trained for 100 epochs and selected the model that had the highest accuracy on the validation set. We used a learning rate of 0.1, a weight decay of 0.0001, a dropout rate of 0.2, a learning momentum of 0.9, a learning rate scheduler with a gamma decay of 0.1 at epoch 50 and 75, and batch size of 56 for training.” Did the authors perform any sort of hyperparameter optimization? How did they select these hyperparameters?

      • For their CellProfiler pipelines, the authors do not explain why they used specific modules. The pipeline utilizes various different modules that I haven’t seen in other pipelines so it would help to know what is being done if there were notes.
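
      On the random-baseline questions raised above: one plausible reading, which is our assumption and not something the preprint confirms, is that the quoted baselines are simply the uniform chance level over 176 classes. The short sketch below reproduces the arithmetic under that assumption, using only numbers quoted in this review.

```python
N_CLASSES = 176  # number of MOA classes quoted from the preprint

# Chance of a uniformly random guess being correct (assumed baseline).
chance = 1.0 / N_CLASSES


def fold_improvement(accuracy, baseline=chance):
    """How many times better than the baseline a given accuracy is."""
    return accuracy / baseline


print(f"chance level: {chance:.4f}")
for accuracy in (0.102, 0.136):
    print(f"accuracy {accuracy:.3f} -> {fold_improvement(accuracy):.2f}x over chance")
```

      A uniform chance level of this kind is only appropriate when classes are balanced. With imbalanced MOA classes, an empirical baseline from a model trained on shuffled labels, as suggested above, would be the more informative comparison.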

      Reviewed by:
      Gregory P. Way, PhD
      Jenna Tomkinson
      Roshan Kern
      Dave Bunten
      Parker Hicks
      Rose Doss
      Keenan Manpearl

    1. On 2022-11-05 08:46:05, user Heather Etchevers wrote:

      Very convincing, careful enhancer analyses.

      For the record, my collaborators and I published the following some time ago (J Med Genet 2006;43:211–217. doi: 10.1136/jmg.2005.036160):

      "Presumably, the CHD7 protein plays an important role in chromatin remodelling during early development and allows a level of epigenetic control over target genes expressed in mesenchymal cells derived from the cephalic neural crest. We analysed the expression pattern of the CHD7 gene during early human development. CHD7 is widely expressed in the undifferentiated neuroepithelium and in mesenchyme of neural crest origin. Towards the end of the first trimester it is expressed in dorsal root ganglia, cranial nerves/ganglia, and auditory, pituitary, and nasal tissues as well as in the neural retina."

      The involvement of some of the transcription factors of interest in the present work (SOX10, FOXD3, CHD7), were also established in human neural crest and other embryonic tissues in this early transcriptomics paper back in 2008 (Hum Mol Genet. doi: 10.1093/hmg/ddn235):

      "We finally examined the spatial expression of a selection of genes identified by SAGE using in situ hybridization on human embryo sections at C13 (Fig. 4). SOX11 and MAZ code for transcription factors and GJA1 for a critical gap junction protein; other genes we studied (not shown) include the transcription factors SOX10 (24), ZEB2 (25) and CHD7 (26) and HEYL; the receptors encoded by NOTCH2 and FGFR2; and the cytoskeleton-associated CTNNB1 and MID1 (27). All were expressed in both the neuroepithelium and NCC, with the exception of SOX10, which only postmigratory hNCC appeared to express at C13. (...) As in animals, SOX10 (24) and FOXD3 (Fig. 3) appeared to be more expressed by early postmigratory hNCC than the neural tube."

      Due probably in part to restrictions on the number of references, neither of these contributions had been mentioned in the important basic research published thereafter by Bajpai et al. 2010, Sperry et al. 2014, or your own group's Williams et al. 2019.

      But it may not be too late to recognize that others, too, laid foundations for the community's further insights into the specific impacts of widely important (perhaps not "housekeeping") chromatin modifiers such as CHD7 on vertebrate neural crest lineages in particular.

    1. On 2022-11-04 01:00:24, user E. Castedo Ellerman wrote:

      Wonderful that you have been sharing the code as it is getting developed. You can use a reference more permanent than a github.com URL if you use a SWHID. For instance, your official Sep 10 release has SWHID swh:1:rev:a5523b8abe525c0630308dec31801eafc83133d7 which can be used at archive.softwareheritage.org or any service in the future that supports SWHIDs. Software Heritage has more details on how to cite, badges, etc.
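
      For readers unfamiliar with the identifier format, here is a small sketch that checks and splits a core SWHID such as the one above. It handles only the core form (no qualifiers); the helper names are illustrative, and the resolver URL pattern is our understanding of how archive.softwareheritage.org resolves identifiers, so treat it as an assumption to verify.

```python
import re

# Core SWHID: scheme, schema version 1, object type, 40-character hex digest.
SWHID_PATTERN = re.compile(r"^swh:1:(snp|rel|rev|dir|cnt):([0-9a-f]{40})$")
OBJECT_TYPES = {
    "snp": "snapshot",
    "rel": "release",
    "rev": "revision",
    "dir": "directory",
    "cnt": "content",
}


def parse_swhid(swhid):
    """Split a core SWHID into its object type and hash, or raise ValueError."""
    match = SWHID_PATTERN.match(swhid)
    if match is None:
        raise ValueError(f"not a core SWHID: {swhid!r}")
    object_type, object_id = match.groups()
    return {"type": OBJECT_TYPES[object_type], "id": object_id}


def resolver_url(swhid):
    """Assumed resolver URL form; check against the Software Heritage docs."""
    parse_swhid(swhid)  # validate before building the URL
    return f"https://archive.softwareheritage.org/{swhid}"
```

      Because the identifier is derived from the content itself, it stays valid even if the GitHub repository is renamed or deleted, which is the point of the suggestion above.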

    1. On 2022-10-18 15:12:35, user Shaun Mahony wrote:

      This is an interesting paper that applies neural networks to model ATAC-seq and TF binding data in Plasmodium for the first time. However, it would be useful to provide a more detailed description of the methods, particularly to clarify the following points:

      Q1: The training labels are applied to 200bp bins, but the input sequences are expanded to 1Kbp centered on each bin. I'm confused about how the labels propagate when the windows are expanded. Let's say you have a negative bin that neighbors a positive bin. When you expand the 1Kbp window around the negative bin, the input sequence will encompass the positive bin. Is this 1Kbp input then labeled as positive or negative?

      Q2: When you split the data into training/validation/test, do you randomly choose bins or restrict them to particular chromosomes? If you randomly choose 200bp bins for your test data, won't the sequences from the test bins also be present in the training set (because of the way that windows are expanded to 1Kbp)?

      Q3: How were the Enformer and Basenji2 models trained? Are they trained on all data at once (i.e., multi-headed output model) or on each dataset separately?

      Q4: According to Supp. Table 1, most models include the following hyperparameter settings: convKnernal1st=320, convKnernal2ed=480. These hyperparameters are labeled as follows in the table: "convKnernal1st: The filter size of first convolution layer; convKnernal2ed: The second size of first convolution layer". I'm confused by this description. Do these hyperparameters represent the filter width or the number of filters in each layer? I'm presuming the latter, but then what width filters were used?

      Q5: The Github doesn't seem to contain code for training the models; is this code available?
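
      To illustrate the concern in Q1 and Q2, the sketch below works through the geometry of 200 bp bins expanded to 1 kbp windows. It assumes the window is centred on the bin and that bins are indexed consecutively along one chromosome; these are our assumptions about the setup, and the function names are illustrative.

```python
BIN_SIZE = 200      # bp, labelled bin width quoted in the questions above
WINDOW_SIZE = 1000  # bp, input sequence width quoted in the questions above


def window_for_bin(bin_index):
    """Half-open [start, end) input window centred on a bin.

    Coordinates may run past the chromosome ends in this simplified sketch.
    """
    flank = (WINDOW_SIZE - BIN_SIZE) // 2
    start = bin_index * BIN_SIZE - flank
    return start, start + WINDOW_SIZE


def windows_overlap(bin_a, bin_b):
    """True if the expanded windows of two bins share any sequence."""
    start_a, end_a = window_for_bin(bin_a)
    start_b, end_b = window_for_bin(bin_b)
    return start_a < end_b and start_b < end_a


def leaked_test_bins(train_bins, test_bins):
    """Test bins whose window shares sequence with at least one training bin."""
    train = set(train_bins)
    reach = WINDOW_SIZE // BIN_SIZE - 1  # bins up to 4 apart share sequence
    return sorted(
        b for b in test_bins
        if any((b + d) in train for d in range(-reach, reach + 1))
    )
```

      Under these assumptions every bin shares input sequence with its four neighbours on each side, so a random per-bin split would leak almost every test window into training. Holding out whole chromosomes avoids this by construction.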

    1. On 2022-10-09 16:47:55, user Jordan Willis wrote:

      Hello,

      Any software that bridges the gap between computationalist and experimentalist is great. However, the primary use case for this will be remote servers. People just don't have the laptops required to run the cellranger software. Even if they did, once you throw docker into the mix, you ask the experimentalist end user to know quite a bit about the command line. If they know enough to use docker, they would probably be able to use a lot of the command line apps available for this. With that in mind, I'd like to see way more documentation on the remote setup. Especially the incorporation of the ShinyProxy. I want to use this for folks on my team but I'm not especially clear on how this app works with the proxy.

      My other concern is the closed source code baked into the docker image. I'd encourage you to release it.

      Also, the YouTube documentation, while great, should also be posted as static documentation.

      Thanks,
      Jordan

    1. On 2022-09-19 14:09:03, user Gregory Way wrote:

      We reviewed this preprint as a part of Arcadia's preprint review initiative: https://twitter.com/Arcadia...

      Peidli et al. present a data resource (for single-cell perturbations) and apply energy distance (e-distance) to quantify differences in perturbations. For the data resource, the authors focus on curating single-cell RNAseq and ATACseq measurements perturbed with CRISPR, drug treatments, and a few other perturbation types. The authors curate a total of 44 datasets. Overall, the paper is very well written with a sound logical flow. However, many elements of the paper seem incomplete. We provide several specific comments regarding our views on how the paper could improve. We thank the authors for posting their preprint and code publicly.

      Our two primary comments are:

      1. The data are not harmonized from reads. Instead, the authors process (in most cases) already processed read count by gene matrices. The authors also use different versions of scanpy to process different datasets. This is definitely still valuable, but the authors should state these facts earlier and probably decrease the use of “harmonization”. Additionally, there is no evaluation to determine the effect or benefit of this read count harmonization. Calculating e-distance before and after harmonization across datasets might be helpful.

      2. E-distance is not sufficiently benchmarked. The math and intuition are described marvelously, but how does E-distance behave across datasets and common perturbations? How does subsampling read depth impact E-distance calculations? How does drug dose impact e-distance? How does sequencing technology impact e-distance? How does modifying the distance metric within the E-distance calculation impact calculations?
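
      To make the last question concrete, here is a minimal sketch of an energy distance between two groups of cells with a pluggable distance metric. It follows the textbook form (twice the mean between-group distance minus the two mean within-group distances); the preprint's implementation may use a different estimator, embedding, or metric, so this is an illustration and not a reproduction of their method.

```python
import numpy as np
from scipy.spatial.distance import cdist


def energy_distance(x, y, metric="euclidean"):
    """Energy distance between two samples given as (n_cells, n_features) arrays.

    metric is any metric name accepted by scipy.spatial.distance.cdist,
    e.g. "euclidean", "sqeuclidean" or "cityblock".
    """
    delta_xy = cdist(x, y, metric=metric).mean()  # mean between-group distance
    sigma_x = cdist(x, x, metric=metric).mean()   # mean within-group distance of x
    sigma_y = cdist(y, y, metric=metric).mean()   # mean within-group distance of y
    return 2.0 * delta_xy - sigma_x - sigma_y
```

      Swapping the `metric` argument while holding the data fixed is one simple way to run the sensitivity check asked for above; subsampling cells or reads before calling the function would address the depth question in the same way.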

      We also have several general comments on different aspects of the paper and github repository. We hope that the authors can benefit from our deep dive on the paper. Thanks again!

      Introduction

      • Definition of single-cell perturbation data (SCPD)

      Overall, this subsection is more of a “methods/techniques overview” of how to collect SCPD rather than defining what SCPD actually is. What is output from these techniques?<br /> - The authors should define these data in more detail.<br /> - The authors should also further define the techniques as it is helpful to have a general idea of why the data collected from the techniques are “good” and not just “more data are better”.

      Motivation for distance measure of high-dimensional profiles:

      • The authors claim that E-distance can identify strong or weak perturbations. It’s unclear what a strong or weak perturbation is. I was unable to find this information from a quick google search so I think they should define that here (not found in methods either).

      Motivation for unifying datasets

      • Their motivation only seems to be “it doesn’t exist yet because it’s difficult to do” so therefore we should do it. What will/could come of the integrated and standardized datasets? What would we hope to find?

      Web Interface

      • The authors claim, “a web interface for data access, analysis and visualization is available at scperturb.org.” There is data access on that site, but analysis and visualization appear to be absent using Brave and Safari browsers.
      • It seems that one would require a computer with enough memory (500G) to run scPerturb to reproduce the analysis. The authors present solutions for how to overcome these requirements, but it did not seem that they attempted to solve them.
      • The authors state that there are Quality Control plots for each dataset on the website, but we could not find them.

      Results<br /> - The authors should briefly describe the methods underlying the statement “dense low-dimensional embeddings of the original data (see Methods for details)” in a bit more detail upon introduction.<br /> - It is surprising to me that there are so many cells with 2 perturbations (proportionally to a single perturbation) (sup fig 1). Is this because of an overweighting of a specific study?<br /> - It might be helpful to add targeted sequencing depth to table 1 per study, also helpful to add the sequencing platforms used.<br /> - Data source trust: Zenodo sources appear to be auxiliary data downloads as opposed to direct sources. How might other researchers assume trust in the sources? Are the included metadata implied or entrusted to the authors?<br /> - Are the UMAPs in Figure 3E the same UMAP space or are the spaces fit independently in both panels?<br /> - Need to provide a bit more rationale for why the authors chose E-distance over the other options.<br /> - Did they calculate E-distance for all perturbations? Sup Fig 3 shows this, so maybe? It was not obvious where to find the measurements.<br /> - There are only 11 drug perturbations in common. This is a very interesting observation! How many genes are perturbed in common datasets?

      Methods<br /> - For the scATAC-Seq data, it’s not clear to me if they perform LSI jointly across all samples or not. This would cause non-interoperability across datasets if not done jointly since each LSI dimension may mean something different in each dataset. In addition, they provide a peaks x counts matrix -- which is dataset specific. I would suggest aligning jointly using a uniform set of peaks -- running MACS2 on all datasets would be a huge benefit to the community.<br /> - How do the different versions of scanpy impact data processing? Typically, harmonized data are generated with a single pipeline.<br /> - When performing subsampling to fit PCA, did the authors transform the full data subsequently? In other words, does the PCA fitting step impact cell count for e-distance calculation?<br /> - What distance measure is used in the E-distance calculation for ||x_i - x_j||? L2? For perturbations, comparing L2 to other metrics would help benchmark the method.
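To make the distance-metric question concrete, here is a minimal sketch of the textbook energy distance with a pluggable point-to-point distance. It is an illustration under stated assumptions, not the preprint's implementation: the function names (`e_distance`, `mean_pairwise`, `euclidean`, `manhattan`) are hypothetical, and the normalization (means over all pairs, including each point with itself) is assumed rather than taken from the authors' code.

```python
import math
from itertools import product


def euclidean(a, b):
    """L2 distance between two points given as equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """L1 distance, included only to show how the metric can be swapped."""
    return sum(abs(x - y) for x, y in zip(a, b))


def mean_pairwise(xs, ys, dist):
    """Mean of dist(x, y) over every pair (x, y) with x in xs and y in ys."""
    return sum(dist(x, y) for x, y in product(xs, ys)) / (len(xs) * len(ys))


def e_distance(xs, ys, dist=euclidean):
    """Energy distance: 2 * delta(X, Y) - sigma(X) - sigma(Y).

    delta is the mean between-group distance; sigma is the mean
    within-group distance (assumed normalization, see lead-in).
    """
    delta = mean_pairwise(xs, ys, dist)
    sigma_x = mean_pairwise(xs, xs, dist)
    sigma_y = mean_pairwise(ys, ys, dist)
    return 2 * delta - sigma_x - sigma_y


if __name__ == "__main__":
    # Hypothetical low-dimensional embeddings of control and perturbed cells.
    control = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    perturbed = [(3.0, 4.0), (4.0, 4.0), (3.0, 5.0)]
    print("L2:", e_distance(control, perturbed))
    print("L1:", e_distance(control, perturbed, dist=manhattan))
```

Running the same comparison with several `dist` functions across perturbations is one simple way to carry out the benchmark suggested above.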

      Code/Github<br /> - It seems to us a good idea to spend time improving the existing model / code at https://github.com/theislab.... The authors should justify why they are not contributing to existing open source code.<br /> - I can’t find the script “fragments2outputs.R” in their github. From their paper: “All features described in the overview above were computed with ArchR functions. For details inspect the “fragments2outputs.R” script in our code repository (see Data Availability).”

      Data Repo comments:<br /> - Manual data testing for reproducibility within https://github.com/sanderla... (one must perform the steps, the repo doesn’t provide or outline within the code itself)<br /> - Suggests using “mamba” but does not provide instructions on how to install mamba <br /> - Would suggest a small description for each folder in the directory (README) explaining its contents <br /> - There’s no usage example on how to download the data or use the program<br /> - Would be best to have a notebook (or bash script) that describes the entire workflow. <br /> - The notebooks are not sequentially executed and there are no execution instructions<br /> - What environment (OS/hardware/configuration/etc) is required to run the code?<br /> - Is notebook (.ipynb) output expected within committed code? (should these have been scrubbed with nbconvert/jupytext?)

      Data Availability<br /> - Based on this section, their website only contains the first three bullet points (e.g scRNA-seq data, scATAC-seq data, and details about the datasets). We could not easily find the last three bullet points (Quality control plots for each dataset, Filtering, e.g., by readout or type of perturbation, Commands for direct file download using the Unix command curl)

      This review was produced jointly at The University of Colorado by:

      Gregory P. Way, PhD<br /> Natalie Davidson, PhD<br /> Erik Serrano<br /> Parker Hicks<br /> Jenna Tomkinson<br /> Dave Bunten

    1. On 2022-08-25 15:11:03, user Sam Nooij wrote:

      I would like to thank the authors for sharing an updated version (v3) of such a fascinating manuscript. I enjoyed reading it and I have a couple of questions/points of feedback that I would like to share.<br /> 1. Do I understand correctly that in lines 133-134 the authors state that higher percentages of read recruitment to donor MAGs suggest bacterial colonisation? This seems to me very indirect evidence and not a justifiable conclusion based on this observation alone.<br /> 2. In figure 1, Canada is shown slightly bigger than the other countries. Is this because the donors and patients from the current study are also from Canada? I could not find this information in the text.<br /> Furthermore, I like that the authors made a distinction between industrialised and less industrialised countries. Would it be possible to change the colours slightly (e.g. make the red magenta) to make it easier for colour blind people to see?

      Also, the blue and purple rows are somewhat difficult to distinguish for me. Is this intentional, as the donor and post-FMT recipient microbiota are supposed to be similar?<br /> 3. I find it interesting that the authors found that only 16 and 44% of donor microbial genomes were detected in all donor metagenomes, suggesting that only a minority of bacteria is stably present over longer periods. (Lines 180-182.) Have the authors considered doing similar analyses on these genomes to find if they are HMI and LMI bacteria?<br /> 4. I find the comparison made between short-read taxonomy and donor population detection (lines 184-188) a little difficult to follow. Would it be possible to rephrase this part or add a little explanation of what exactly is compared?<br /> 5. Lines 190-194 are also fascinating! Even with such small numbers, it is striking to see that more bacteria colonise from pills than from colonoscopic transfer. Could the authors speculate or provide additional info on why this may be the case?<br /> 6. From the final conclusion (lines 431-435) I gather that FMT or similar microbiota therapeutics are unlikely to (temporarily?) cure IBD. Do the authors have suggestions as to what might work better, and would they like to share their perspectives on promising new treatment options?<br /> 7. And finally, what do the ellipses in supplementary figures 2 and 3 represent? I suppose they show some sort of area around each cluster centroid. A few extra words of explanation in the figure captions would be nice. This information is also not easily found in the analysis scripts. (And by the way, it is wonderful that the authors share all code and instructions on how to reproduce the analyses!)

    1. On 2022-07-10 01:25:17, user Charles Warden wrote:

      Hi,

      Thank you very much for posting this preprint.

      I think this provides a number of useful references and comparisons.

      I also have a few comments:

      1) While rMATS and MISO are popular programs, I don't think they are necessarily my preferred starting point for splicing analysis.

      For example, I might consider the following as starting points for splicing analysis:

      a) QoRTS + JunctionSeq (an extension of DEXSeq with a separate dispersion estimate for junctions versus exons)<br /> b) Custom analysis of the tab-delimited splice junction counts (SJ.out.tab) from STAR.

      Perhaps b) is too open-ended. However, if you labeled the junctions like genes or transcripts, you could calculate Count-Per-Million (CPM) values and perhaps use traditional differential expression methods (DESeq2, edgeR, limma-voom, etc.).
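As a rough sketch of option b), the snippet below reads STAR `SJ.out.tab` lines, keys each junction by its coordinates, and converts the uniquely mapped read counts (column 7) to CPM. The function names and the two example lines are hypothetical, and normalizing by the total junction-spanning reads rather than by total mapped reads is an assumption made only for illustration.

```python
def parse_sj_out_tab(lines):
    """Map (chrom, start, end, strand) -> uniquely mapped junction reads.

    Expects the 9 tab-separated columns of STAR's SJ.out.tab; column 7
    (index 6) holds the number of uniquely mapping reads crossing the junction.
    """
    counts = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip blank or malformed lines
        key = (fields[0], int(fields[1]), int(fields[2]), fields[3])
        counts[key] = int(fields[6])
    return counts


def counts_per_million(counts):
    """Scale raw counts so that they sum to one million (assumed library size)."""
    total = sum(counts.values())
    if total == 0:
        return {key: 0.0 for key in counts}
    return {key: value * 1e6 / total for key, value in counts.items()}


if __name__ == "__main__":
    # Two made-up junctions in SJ.out.tab layout.
    example = [
        "chr1\t1000\t2000\t1\t1\t1\t30\t2\t40",
        "chr1\t3000\t4000\t1\t1\t1\t10\t0\t35",
    ]
    for junction, cpm in counts_per_million(parse_sj_out_tab(example)).items():
        print(junction, cpm)
```

The resulting junction-by-sample table could then be passed to a standard differential expression method, with each junction treated as a feature.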

      Unlike a), there would not really be specialized splicing visualization for b) without additional coding.

      However, both of these strategies do not fit the description of "alternative splicing events (ASEs) are conceptualized as binary choices". Instead, each junction is treated like a feature.

      So, I can see how that can complicate comparisons. Nevertheless, limma-voom and DESeq2 are used as part of this package. If I look at the ASE-methods.R code and the supplemental methods, I think the main difference is the "Treatment:Isoform" interaction variable.

      In other words, if you had 2 groups with 3 replicates and 2 junctions as part of a splicing event, am I understanding correctly that the difference between SpliceWiz and what I would have otherwise expected is that you now have 12 count measurements (2 junctions) instead of 6 count measurements (1 junction)?

      So, if I am understanding everything, I am not sure if you have comments with respect to why a strategy more like JunctionSeq (or what I might write with custom code for individual junctions) is not preferable and not provided as an option within SpliceWiz?

      2) I thought Figure S4 was interesting and useful.

      However, the retained intron counts are quite different, as also mentioned in the text: "Of the 15,188 extra retained introns annotated by SpliceWiz, 15,144 (99.7%) of these would have been excluded due to this extra criterion."

      Identifying additional differential intron retention events might be helpful, but I also wonder if perhaps this also corresponds to additional false positives. For example, I am also not sure how often there might be pre-mRNA and/or pseudogenes where no introns are spliced (not a single retained intron event). For example, if there are confounding factors not sufficiently modeled in simulated data, then I would expect estimates of accuracy could be considerably different than in actual data.

      3) As a minor point, there is a typo in the description for Figure 5: "Heirarachical" instead of "Hierarchical".

      Thanks Again,<br /> Charles

    1. On 2022-06-30 16:17:06, user Arianna Basile wrote:

      Hello :) Very interesting, thank you for this tool and best luck for its publication. May I ask if you are planning to release the code on a public GitHub page?

      Best,<br /> Arianna

    1. On 2022-06-18 13:53:05, user Marc RobinsonRechavi wrote:

      Dear Yamaguchi et al,

      You write under Data accessibility:

      All code necessary to repeat the analysis described in this study have been made available. SLiM source codes of our model for speciation dynamics will be hosted on Dryad Digital Repository upon acceptance. There are no data to be archived.

      This is a publication, i.e. it is made public as part of the scientific record and is citable, thus I strongly invite you to make the corresponding source code available without delay.

    1. On 2022-06-14 08:39:13, user Jade Bruxaux wrote:

      Hi!<br /> Would it be possible to get access to the Supplemental methods / code mentioned in the text?<br /> Thanks in advance!

    1. On 2022-05-20 03:55:55, user Jake Gratten wrote:

      Response to Morton et al. (2022): model mis-specification criticism overlooks sensitivity analyses and orthogonal analyses

      The core criticism of our study (Yap et al., 2021) made by Morton et al. was that the linear mixed model (LMM) framework we employed includes a questionable biological assumption – that diet and the microbiome are independent. They correctly note that diet is known to influence the microbiome (David et al., 2014; Rothschild et al., 2018), and thus, as these factors are inter-related, our model may be prone to biased inference. We acknowledge these points in relation to the specific LMM (see below) on which the critique by Morton et al. is focused. However, we respectfully disagree with their conclusion that this issue invalidates the findings reported in our paper, because their critique (1) incorrectly asserts that this result formed the basis of our conclusions, and (2) it overlooks several key analyses, including extensive sensitivity analyses that were specifically performed to test this (and other) assumptions.

      Morton and colleagues focus on a single LMM analysis of ASD in their critique, in which we adjusted for sex, age and diet, the latter by fitting the top three principal components from PCA of the centre log ratio (clr)-transformed percent energy variables from the Australian Eating Survey (AES), a validated food frequency questionnaire. In this analysis, we found that 0% of the variance in ASD diagnosis was associated with the microbiome, irrespective of the microbiome features used to construct the correlation matrix describing the relationships between random effects (e.g., common species, rare species, common genes, rare genes) (Yap et al., 2021). As diet is correlated with the microbiome, it is possible that adjusting for diet in this analysis has removed variance in ASD diagnosis that may be attributable to the microbiome. In their critique, the authors present simulations purporting to show that this issue could lead to failure to detect even very large proportions of variance (in their example 83%) (Morton, Donovan, & Taroncher-Oldenburg, 2022).

      Unfortunately, they fail to mention that we also performed a LMM analysis of ASD in which we did not adjust for diet (or sex or age). If there was an effect of the microbiome on ASD that had previously been removed by adjusting for diet, then this should now be “revealed” (i.e., captured by the microbiome random effect). However, we found precisely the same result as in our original analysis: that is, 0% of the variance in ASD diagnosis is associated with the microbiome (Yap et al., 2021). Based upon this analysis of the available data we believe it is unlikely that our conclusions have been biased by model mis-specification.

      The authors also do not acknowledge that we performed LMM analyses of traits other than ASD, and whereas there was negligible signal for ASD, IQ and sleep problems, we found large and significant associations of the microbiome with age, sex and stool consistency. Our results for age (i.e., ~30% of the variance associated with common microbiome species) are particularly notable because they recapitulate the findings reported in a large (independent) sample of >30K adult stool metagenomes (Rothschild et al., 2020). Our LMM results for age, sex and stool consistency were also largely unaffected by adjusting for diet (Yap et al., 2021). These analyses, which were specifically included for the purpose of benchmarking the findings for ASD, provide further evidence that our methods are not prone to under-estimating the proportion of trait variance associated with the microbiome.

      It is also relevant to highlight that the directionality of the causal graphs presented by Morton et al. in Figure 1 of their article (i.e., a causal effect of both the microbiome and diet on the host phenotype) are problematic, since the variance component estimates from these models might reflect cause or consequence of the focal trait. This is because microbiome taxonomic proportions change, unlike genotypes used in analogous LMM methods for estimating heritability (which are present at birth and therefore representative of causality). To demonstrate this, take as an example our analysis in which age was the dependent variable and microbiome measures were fitted as random effects (allowing capture of their interdependence). We find roughly 30% of the variance in age is associated with common microbiome species. Clearly, the way to interpret this result is that age is causal for the variance in the microbiome, not the other way round. It is equally possible that ASD influences diet and in turn the microbiome, as opposed to the opposite view espoused by Morton et al. Indeed, the wording used in their critique (i.e., “A more accurate model would have assumed an architecture that explicitly incorporates the direct influence of diet on the ASD phenotype as well as an indirect influence of diet on the ASD phenotype via the microbiome”) appears not to recognise this possibility.

      Looking beyond the LMM analyses in our paper, Morton and colleagues also did not consider several other key sets of analyses on which our conclusions are based, including differential abundance testing using ANCOM (Analysis of Composition of Microbiomes) and extensive linear model analyses. In our ANCOM analysis of ASD, we find a single robustly associated species (Romboutsia timonensis) when adjusting for sex, age and dietary PCs, but this same species remains the only significant finding in analyses without covariates (Yap et al., 2021). This is entirely consistent with our LMM model findings but is not what would be expected if the microbiome was associated with a high proportion of variance in ASD diagnosis. Indeed, irrespective of how the data are analysed (e.g., sibling pairs only, excluding siblings, excluding children with recent exposure to antibiotics, and others), we find negligible evidence for association of individual species with ASD (other than R. timonensis), and no support whatsoever for taxa previously reported to be associated with ASD.

      In our linear model analyses, we show that quantitative measures of the autism spectrum, including both psychometric measures (e.g., ADOS-2/G Restricted and Repetitive Behaviour (RRB) calibrated severity scores) and polygenic scores were associated with reduced dietary diversity (Yap et al., 2021). The most parsimonious interpretation of these findings is that RRBs, which are one of the core diagnostic signs of ASD, manifest in the form of more selective dietary preferences. Polygenic scores, as an immutable component of propensity to ASD-associated traits, are an important and novel aspect of our analysis, given they facilitate preliminary causal inference (noting that we were careful to avoid strong statements about causality in our paper). In contrast, other cross-sectional autism microbiome studies – whose results have been prioritised by Morton et al. – have not exploited genetic predictors for autism-related traits and so cannot distinguish between cause and consequence.

      Overall, using a variety of orthogonal analytical approaches, we find a strong and consistent signal that ASD (and autistic traits) is associated with reduced dietary diversity, and that diet in turn is associated with the microbiome (Yap et al., 2021). These results are consistent with existing evidence for dietary effects on the microbiome (David et al., 2014; Rothschild et al., 2018) – as pointed out by Morton et al. – and with prior evidence (backed by clinical and lived experience) for an association of autism with diet (Berding & Donovan, 2018). We find no direct association of ASD with the microbiome, a result to which Morton and colleagues express surprise, their argument being that if ASD is associated with diet and diet influences the microbiome, then how can there be no direct ASD-microbiome association? The answer is simply that we have a finite sample, and the effect sizes are subtle. We expect that in a larger sample we might observe a direct association, but also stronger evidence that this is due to changes in diet that are related to autistic traits. This is a considerably more intuitive and parsimonious explanation for associations of the microbiome with ASD than the idea that the microbiome contributes to autistic traits, not least because there is strong evidence that ASD is a neuro-developmental condition, and expression of established ASD genes is enriched prenatally (Satterstrom et al., 2020). In this context, it is worth emphasising that the high estimated heritability of ASD (70-80%) (Bai et al., 2019) leaves relatively little room for other putative etiological causal factors (e.g., maternal immune activation). 
      This is especially true given de novo mutations that are known to be important in ASD (Sanders et al., 2015; Sanders et al., 2012; Satterstrom et al., 2020) largely do not contribute to heritability estimates (i.e., because they are not shared by relatives) and so must consume an additional proportion of the remaining 20-30% of variance.

      Morton et al.’s criticism of our study comes despite it being the largest (and therefore most statistically well-powered) to date. Our study also has the dual benefits of matching data on diet and other confounders, which are lacking in many prior studies, and deep metagenomic sequencing, compared to inferior 16S technology in most published ASD microbiome papers. We note that ours is not the first study to report negligible association of the microbiome with ASD (Gondalia et al., 2012; Son et al., 2015). We also point to a recent review in Cell on microbiome studies in animal models (including for autism) highlighting the implausibility of the high proportion of positive findings, asserting that the field suffers from publication bias (Walter, Armet, Finlay, & Shanahan, 2020). That said, we acknowledge that our study has limitations, reflecting difficulties of collecting idealised data sets. Prospective studies collecting faecal samples from infants prior to autism diagnosis are needed to further advance the field, but these are challenging both logistically and because sample size is limited by the population prevalence of ASD (~1%).

      To sum up, we thank Morton et al. for their comments in relation to one specific analysis in our paper. This provides us with the opportunity to clarify the detailed analyses that we performed to reach our conclusions. Unfortunately, the critique from Morton et al. (based solely on simulations) overlooks most of our results, including sensitivity analyses that directly address their criticism. The authors suggest that our data should be re-analysed. We note that our data are available by application to the Australian Autism Biobank which allows other researchers to provide objective empirical evaluation. We are committed to transparent research and provide extensive supplementary materials and publicly available code and hope others in the research community will build upon our work.

      Chloe X. Yap, Peter M. Visscher, Naomi R. Wray and Jacob Gratten <br /> (On behalf of all authors)

      References<br /> Bai, D., Yip, B. H. K., Windham, G. C., Sourander, A., Francis, R., Yoffe, R., . . . Sandin, S. (2019). Association of Genetic and Environmental Factors With Autism in a 5-Country Cohort. JAMA Psychiatry, 76(10), 1035-1043. doi:10.1001/jamapsychiatry.2019.1411

      Berding, K., & Donovan, S. M. (2018). Diet Can Impact Microbiota Composition in Children With Autism Spectrum Disorder. Front Neurosci, 12, 515. doi:10.3389/fnins.2018.00515

      David, L. A., Maurice, C. F., Carmody, R. N., Gootenberg, D. B., Button, J. E., Wolfe, B. E., . . . Turnbaugh, P. J. (2014). Diet rapidly and reproducibly alters the human gut microbiome. Nature, 505(7484), 559-563. doi:10.1038/nature12820

      Gondalia, S. V., Palombo, E. A., Knowles, S. R., Cox, S. B., Meyer, D., & Austin, D. W. (2012). Molecular characterisation of gastrointestinal microbiota of children with autism (with and without gastrointestinal dysfunction) and their neurotypical siblings. Autism Res, 5(6), 419-427. doi:10.1002/aur.1253

      Morton, J. T., Donovan, S. M., & Taroncher-Oldenburg, G. (2022). Decoupling diet from microbiome dynamics results in model mis-specification that implicitly annuls potential associations between the microbiome and disease phenotypes—ruling out any role of the microbiome in autism (Yap et al. 2021) likely a premature conclusion. biorxiv. doi:https://doi.org/10.1101/202...

      Rothschild, D., Leviatan, S., Hanemann, A., Cohen, Y., Weissbrod, O., & Segal, E. (2020). An atlas of robust microbiome associations with phenotypic traits based on large-scale cohorts from two continents. biorxiv. doi:https://doi.org/10.1101/202...

      Rothschild, D., Weissbrod, O., Barkan, E., Kurilshikov, A., Korem, T., Zeevi, D., . . . Segal, E. (2018). Environment dominates over host genetics in shaping human gut microbiota. Nature, 555(7695), 210-215. doi:10.1038/nature25973

      Sanders, S. J., He, X., Willsey, A. J., Ercan-Sencicek, A. G., Samocha, K. E., Cicek, A. E., . . . State, M. W. (2015). Insights into Autism Spectrum Disorder Genomic Architecture and Biology from 71 Risk Loci. Neuron, 87(6), 1215-1233. doi:10.1016/j.neuron.2015.09.016

      Sanders, S. J., Murtha, M. T., Gupta, A. R., Murdoch, J. D., Raubeson, M. J., Willsey, A. J., . . . State, M. W. (2012). De novo mutations revealed by whole-exome sequencing are strongly associated with autism. Nature, 485(7397), 237-241. doi:10.1038/nature10945

      Satterstrom, F. K., Kosmicki, J. A., Wang, J., Breen, M. S., De Rubeis, S., An, J. Y., . . . Buxbaum, J. D. (2020). Large-Scale Exome Sequencing Study Implicates Both Developmental and Functional Changes in the Neurobiology of Autism. Cell, 180(3), 568-584 e523. doi:10.1016/j.cell.2019.12.036

      Son, J. S., Zheng, L. J., Rowehl, L. M., Tian, X., Zhang, Y., Zhu, W., . . . Li, E. (2015). Comparison of Fecal Microbiota in Children with Autism Spectrum Disorders and Neurotypical Siblings in the Simons Simplex Collection. PLoS ONE, 10(10), e0137725. doi:10.1371/journal.pone.0137725

      Walter, J., Armet, A. M., Finlay, B. B., & Shanahan, F. (2020). Establishing or Exaggerating Causality for the Gut Microbiome: Lessons from Human Microbiota-Associated Rodents. Cell, 180(2), 221-232. doi:10.1016/j.cell.2019.12.025

      Yap, C. X., Henders, A. K., Alvares, G. A., Wood, D. L. A., Krause, L., Tyson, G. W., . . . Gratten, J. (2021). Autism-related dietary preferences mediate autism-gut microbiome associations. Cell, 184(24), 5916-5931 e5917. doi:10.1016/j.cell.2021.10.015

    1. On 2022-05-16 21:15:23, user Jingyi Jessica Li wrote:

      We thank Dr. Hejblum et al for sending us a draft of this article on May 3 before posting it. Below I'm pasting our reply sent to Dr. Hejblum et al on the same day. We believe that our discussion will be beneficial for the community.

      Dear Dr. Hejblum and all,

      Thank you for sending us your correspondence draft. We appreciate your professionalism.

      The main message of our article is that using popular methods without a sanity check may lead to inflated FDR, and permutation offers an easy sanity check.

      We agree that normalization is a tricky issue, and when samples do not need normalization (as is the case for permuted samples, which all come from the same "condition"), normalization may introduce unwanted bias, violate the null hypothesis, and thus deteriorate the FDR control. Meanwhile, we stand with our fundamental assumption that permuted samples should contain no true DE genes. Since many DE methods include normalization as an internal step and only accept count data as input, the only way to fairly compare them is to apply each method as a whole pipeline, not just its DE statistical test step, to the permuted samples. (That is, the "normalization first" approach in your manuscript is inapplicable to the DE methods that only accept count data, unless we dissect these methods and modify their code, which is beyond the scope of our benchmark study.) As a result, any bias introduced by normalizing the permuted samples (which do not need normalization) would be reflected in the actual FDR inflation. The Wilcoxon test is an exception because it is not a DE analysis pipeline, so we applied it to permuted samples without doing normalization in Figure 2A. This explains why our Figure 2A differs from your Figure 1A.
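The permutation sanity check described above can be sketched as follows: shuffle the condition labels, run the whole DE pipeline on the permuted data, and record how many genes it calls. This is a minimal illustration under stated assumptions, not the code from our study. `permutation_sanity_check` and `mean_difference_caller` are hypothetical names, and the toy caller stands in for a real pipeline such as DESeq2 or edgeR, which would be wrapped as the `call_de` argument.

```python
import random


def permutation_sanity_check(counts, labels, call_de, n_permutations=100, seed=0):
    """Count discoveries made by call_de on label-permuted data.

    counts: list of per-gene rows, one value per sample.
    labels: list of 0/1 condition labels, one per sample.
    call_de: callable(counts, labels) -> set of discovered gene indices;
             it should run the full pipeline, normalization included.
    Permuted labels should carry no true signal, so many discoveries
    suggest the pipeline does not control false positives on these data.
    """
    rng = random.Random(seed)
    discoveries = []
    for _ in range(n_permutations):
        shuffled = list(labels)
        rng.shuffle(shuffled)  # group sizes are preserved, assignment is random
        discoveries.append(len(call_de(counts, shuffled)))
    return discoveries


def mean_difference_caller(counts, labels, threshold=2.0):
    """Toy stand-in for a DE method: flag genes whose group means differ."""
    hits = set()
    for gene_index, row in enumerate(counts):
        group_a = [v for v, lab in zip(row, labels) if lab == 0]
        group_b = [v for v, lab in zip(row, labels) if lab == 1]
        difference = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
        if difference > threshold:
            hits.add(gene_index)
    return hits


if __name__ == "__main__":
    counts = [[5, 6, 5, 30, 31, 29], [10, 11, 9, 10, 12, 10]]
    labels = [0, 0, 0, 1, 1, 1]
    per_permutation = permutation_sanity_check(
        counts, labels, mean_difference_caller, n_permutations=20
    )
    print("discoveries per permutation:", per_permutation)
```

Because some random shuffles reproduce or nearly reproduce the original grouping, a small dataset like this toy one will still show occasional calls; the check is informative mainly at larger sample sizes.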

      We would like to clarify that our study is not a comprehensive benchmark because (1) there are numerous DE methods and (2) we did not want to dilute the cautionary message against using the popular DESeq2 and edgeR without a sanity check. Hence, we did not do a dissection of each method to find out how to fix the inflated FDR issue. Our dearseq results are based on dearseq (asymptotic), not dearseq (permuted), because we deemed dearseq (asymptotic) more appropriate when the sample size is large.

      We appreciate your clarification about the effect of normalization on the dearseq performance, and your results motivated us to think about the problem more clearly. However, we respectfully disagree with your conclusion that dearseq outperforms Wilcoxon in your results. Our reasoning is that only dearseq (asymptotic), not dearseq (permuted) has a slight power advantage over Wilcoxon, but dearseq (asymptotic) does not guarantee to control the FDR when the sample size is under 40; on the other hand, Wilcoxon only sacrifices power but not FDR control when the sample size is small. Nevertheless, we agree that dearseq is advantageous in that it can account for more complex experimental designs.

      We would be happy to publicly respond to your correspondence when needed. We believe that our discussion will be beneficial for the community.

      Best,<br /> Jessica


      Jingyi Jessica Li, Ph.D.

      Associate Professor<br /> Department of Statistics<br /> University of California, Los Angeles

      http://jsb.ucla.edu

    1. On 2022-05-12 15:43:02, user L. Collado Torres wrote:

      Hi,

      Congratulations on getting this project to the pre-print finish line! Kudos to you!

      Given some of my research projects, I'll need to read in detail your pre-print as I find it very interesting. That's why I made a feature request (FR) on GitHub asking for a documentation website or information on how to use GPSA https://github.com/andrewch.... I might have missed it, and look forward to further interacting with you. As you are likely acutely aware, sometimes testing software in a different computational system or dataset might reveal some bugs or potential new feature requests. I recognized that you have implemented a GitHub Actions workflow and automatically test your software https://github.com/andrewch... on Python 3.8, which is formidable. As I'm a Python novice, I don't know if there's an equivalent to covr in R (https://CRAN.R-project.org/...) for code coverage.

      On the pre-print itself, I greatly appreciate how you've shared all appendices, and in particular, I love sections 6.1 and 6.2 where you described where you got the datasets you analyzed and link to the code you used, respectively. I would further encourage you to deposit your code at a permanent repository like Zenodo or Figshare (or even bioRxiv) since code can be deleted from GitHub. You'll get a DOI that you can cite in an updated pre-print or peer-reviewed version of your manuscript.

      While I'll need external help and/or quite a bit of time to understand your mathematical models, I also like how you have described it in detail.

      I'll repeat here (and edit) some of the questions I asked publicly on Twitter & live during Andrew Jones' talk at #BoG22:

      * Would you be interested in trying out your method in our 2021 data with spatially-adjacent replicates?<br /> * How far can you go in µm? We have some replicates 300 µm apart. (Jean Fan from JHU BME asked the same question framing it as a Z-axis distance question).<br /> * Can you combine H&E + smFISH images?<br /> * For the future studies that you described, what contingencies are you considering for cases where an intermediate tissue slide has a technical problem like tissue folding?

      I recognize that these questions are beyond the scope of this pre-print and will likely be answered elsewhere.

      If it helps, we would be happy to chat with you about our data we have publicly available and some that we are also generating.

      Best,<br /> Leonardo

    1. On 2022-05-12 15:05:36, user L. Collado Torres wrote:

      Hi!

      This is a massive project, kudos to you for the pre-print!

      I noticed that no data was shared at the pre-print stage, though you have promised that it'll be shared by publication time on the "Data Availability" section. That is a bummer: we have an example where we shared the data at the pre-print stage in Feb 2020, another method used it in October 2020 for their pre-print, which accelerated since our published version appeared in February 2021. While I recognize that with a large team, coordinating when to release data is tricky, I would encourage you to share data at the pre-print stage in the future. I'd be happy to chat with you in more detail about the advantages of open science: you are on that route already by having a pre-print and participating in the bioRxiv commentathon trial.

      On a similar theme, I greatly appreciate that you have stated specific version numbers of the software you used. That's really useful as software changes frequently, particularly in frontier fields like yours. It's great to see GitHub links to the software pipelines that were developed as part of this project. However, code on GitHub is not permanent. Are you considering depositing the code at Zenodo, Figshare or another permanent code repository where you'll get a DOI? In addition, ultimately code is the final documentation for what you did. I might have missed it, but I don't see a repository for the actual code used for this project (and its DOI), only for the software pipelines. That code will be useful to see since you might have used non-default parameters which can greatly influence the results.

      Thank you!<br /> Leonardo

    1. On 2022-04-18 19:25:05, user Daniel Himmelstein wrote:

      Living manuscripts with transparent version history are hopefully the way of the future. Since 2017, we've been building an open source tool called Manubot for collaborative authoring on Git/GitHub. Our core repositories currently have 300+ stars on GitHub and have been used to author a wide range of manuscripts.

      For more information, see Open collaborative writing with Manubot. In fact, I encourage the authors of this preprint to submit a pull request to the Manufesto source code to add a citation to their survey in the "Living Manuscripts" section!

    1. On 2022-03-28 21:55:53, user Fraser Lab wrote:

      In this paper, the author uses innovative SVD rotation methods to analyze previous serial crystallography datasets with the aim of delineating retinal isomerization in bacteriorhodopsin (bR). This analysis reveals novel intermediate states that lend insight into how the transition takes place as well as the role played by residues in bR’s active site. The author hypothesizes that a non-specific Coulomb attraction force triggers sampling of multiple cis conformations and single bond rotations in retinal, but the selection of the 13-cis conformation is achieved by the stereochemical constraints in the protein environment. The author lays out the major questions of the paper very clearly in the introduction and the results and discussion sections remain well-focused on answering those questions. The mathematical and computational methods presented here demonstrate how they can be used to tease out the intermediates from “multi” datasets (as shown by the author in his other publications). We share the author’s frustration that “never-observed dataset reconstituted from a collection of experimental datasets does not match the well-established crystallographic template of PDB” - there are interesting new directions in joint refinement across datasets that are considered siloed by the PDB - and this paper contributes well to that frontier. Overall, the paper is well written and the methods section is particularly instructive in the general approach, but lacking details that would allow us to follow the exact implementation used for this paper.

      Major Points:

      1. Overall: no statement on data/code availability is given. Even if proprietary software is used, it is essential to know what options are used and to have a detailed protocol to reproduce the work.

      2. The author comments on how timepoints from Kovacs et al. drive the vectors of interest (U10, U14, U17), yet it’s challenging to differentiate the insights gained from the SVD over the FO-FO maps from Kovacs. We suggest a quantitative comparison showing signal-to-noise or goodness of fit, which may help the reader interpret the benefits gained from SVD.

      3. While the methods section is quite detailed in describing how SVD is performed and how the Ren rotations retain the orthonormal properties of the SVD, the details of how the angles for rotations are chosen are not given. The author says “SVD analysis presented in this paper employs rotations extensively” but what angles of rotations were sampled? Which were the good rotations and which were the bad rotations? A read-through of the earlier papers of the author (Ren 2019, Ren 2016) did not provide answers for these questions either (with large portions of the text in the methods sections being nearly identical between these papers). We infer this is using the software “Dynamix” based on the Acknowledgements. While it is understandable that decisions on what rotations need to be done are subjective, it is imperative that the author guides the reader through these decisions to help in reproduction/application of this method. Without these details it is difficult to know, for example, whether the oscillatory vectors are related to particular choices in the SVD procedure or are relatively without bias.

      4. The majority of datasets are clustered in the current SVD iteration - did another Ren rotation not separate Kovacs datasets according to an interesting piece of metadata? This seems like an opportunity to rotate the SVD and further demonstrate the utility of the Ren rotation method.

      5. The author shows the oscillatory behavior of 5 pairs of components in figure S3. But how did he arrive at these 5 pairs? Did other pairs not show this kind of oscillatory behavior?

      6. The author goes into fine details of the intermediate conformations of retinal and shows differences in distances in the range of less than an angstrom. Is there a goodness-of-fit metric that can be used by the author to show that the atomic positions determined by the author are indeed the best/only possible positions? What if there is another intermediate conformation of retinal that fits equally well into the densities but does not have the S-shaped creasing? How do these conformations navigate the need to fit to density without being too high energy?

      Minor points:

      1. Labeling of “inboard” and “outboard” terms can be done in main Figure 1 rather than referring the reader to S1

      2. Figure 2(b): overall trend is useful, might there be an easier way to visualize? Some sort of 1D - distance plot of integrated intensities?

      3. Panel C in figure 2 is cited extensively in the text. Rather than mentioning the sub-panels as the 1st, 2nd, 3rd panel, etc., the author could just split panel c into d,e,f, etc.

      4. Figure 2(c) - lines are difficult to visualize. They could be made bold. The author could also reduce the panel to show only atomic displacement and torsion angle and move the rest to supplement if necessary, so that the two main plots could be made bigger for easier interpretation.

      5. Figure 1(c) and Figure 3(a) & (b) are very difficult to interpret. The author can zoom in further and use translucent volume rendering rather than mesh

      6. Figure 3(d) - The author could label as “helix A, B, …, G” so that it’s immediately obvious what’s shown in figure, otherwise a nice illustration of motion

      7. Figure 3(e) retinal could be color coded or more easily separated from amino acids to make it easier and faster to interpret.

      8. The maps of I’ (Fig S6), I (Fig S7), J’ (Fig S9) and J (Fig S10) are contoured at different sigma levels presumably due to the Fo contributions differing based on the reconstitution procedure. Is there any trend in the reconstitution that would guide us in calibrating which sigma levels are equivalent across datasets, or is this purely visual?

      Alexander Wolff, Ashraya Ravikumar, James Fraser (UCSF)

    1. On 2022-02-11 14:28:21, user Mikhail Schelkunov wrote:

      There is a statement in the manuscript "For each query, we greedily collect only the best alignment for each location (option ‘--range-culling --max-target-seqs 1’)". However, I see no "--max-target-seqs 1" in the code of Proovframe 0.9.7.

    1. On 2022-02-08 22:51:11, user Fraser Lab wrote:

      The application of new machine learning techniques to the protein structure prediction problem has produced revolutionary results in recent years, culminating in the AlphaFold2 performance at the most recent CASP competition. These approaches often leverage deep multiple sequence alignments, which may limit their applicability to sequences without many (sequenced) homologs and to designed proteins. The most important aspect of this paper is that it takes an approach to the problem that is distinct from AlphaFold's, differentiating on speed and (lack of) reliance on multiple sequence alignments. While the compute efficiency is not likely a major breakthrough in the applications that are the major focus of this paper (protein structure prediction of relatively small proteins), the embedding of this approach within a protein design cycle could make it multiplicatively important (and additional speculation to that end would be warranted in the discussion). Given the sophisticated nature of all of these approaches, RGN2 stands out as something that can help unpack what drives the “black box” to work well and it could have an invigorating effect on the field to see how a wider range of model ablations tease out the contribution of different parts of RGN2. The contributions to geometric representation are likely to have an impact in other approaches as well. Overall this paper nicely synthesizes several innovations (Language Models, Geometric Representations, etc) in a fast moving discipline and includes an important new benchmark set of “orphan” proteins.

      Major Points:

      The experiments in this manuscript detail proteins with no homologous sequences. It would be interesting to see a more granular tradeoff between single sequence structure predictors and MSA-based structure predictors for varying MSA depths. See https://proceedings.mlr.pre...<br /> Figure 2-Right as an example plot that could be adapted for the task in this study. It is possible that many methods perform well without MSAs for many targets.

      Minor Points:

      Do RF/AF2 perform better when there is structural (even in absence of sequence) similarity to a known protein or are the orphans/designed proteins topologically distinct enough to make this question silly?<br /> How are disulfides handled, especially in the context of cyclic peptides?

      Lines 68-70 allude to single sequence structure prediction being useful for function. I find it unlikely that structure predictor outputs would be useful for functional/fitness prediction. Could the authors find any indication that their models can perform fitness prediction on DMS datasets organized by DeepSequence, Envision, or FLIP [1,2,3]?<br /> [1] https://www.nature.com/arti...<br /> [2] https://www.cell.com/cell-s...<br /> [3] https://www.biorxiv.org/con...

      How well calibrated is RGN2? Are there probability outputs that are reliable estimators of the model’s confidence?

      Did the authors use the entirety of UniParc as-is or perform some type of downsampling based on sequence identity clusters? In general, rebalancing the train dataset sequences based on 50%/80%/90% sequence identity clusters could enable a more versatile representation useful for structural and functional prediction tasks.

      Lines 147-149: The ProtTrans https://www.biorxiv.org/con... model uses a span masking objective derived from T5 in NLP https://arxiv.org/abs/1910.... which is essentially very similar to what the authors use.

      Line 153: why didn’t they finetune?

      Could authors provide a comparison on how RGN2 would perform with respect to language models that perform contact prediction such as ESM-1b https://www.pnas.org/conten... or MSA-Transformer https://proceedings.mlr.pre...

      Given the authors' previous record, we assume the code will be available soon: https://github.com/aqlabora... is out of date...

      James Fraser (UCSF) and Ali Madani (Salesforce Research)

    2. On 2021-10-02 21:14:31, user Travis Wheeler wrote:

      The results SEEM relevant and important ... but without code to test/review, there's really not much to say about the paper. The preprint has been posted for 2 months, but no RGN2 code is available; please share it.

    1. On 2021-11-16 02:38:55, user Iris Young wrote:

      The capability of cryo-electron microscopy (cryoEM) to capture multiple and native-like conformations of large macromolecules is transforming structural biology. This manuscript explores intricacies of the cooling process as it relates to structural ensembles. Specifically, how do variations in starting sample conditions (water layer thickness, water/sample starting temperature) and cooling (ethane layer thickness, rate of cooling) affect the distribution of structural states captured in the resulting micrographs? Can we be confident that the results of "time-resolved" cryoEM experiments are representative of barriers and basins we hope to capture? By a combination of molecular dynamics simulations and cryoEM experiments, the authors guide us to an empirical understanding of these questions.

      To understand the relationship between cooling and structural ensembles, we must begin with the thermodynamic principles in play. Ensembles represent the many possible conformational states of a structural unit, and the occupancies of the individual states depend on the energy landscape across which they are related. At the extremes, we may imagine an ensemble cooled instantaneously to 0 K, whose component structures would not be able to traverse the energy landscape in any direction, as well as an ensemble injected with enough energy to overcome any energy barrier on the landscape (i.e. a system at thermal equilibrium), whose component structures would move freely to occupy all possible states. In the latter case we would not expect all states to be occupied by the same number of particles, however — particles with exactly enough energy to breach a particular energy barrier are equally likely to fall to either side of it, but particles at different starting points with the same starting energy have different likelihoods of escaping their local energy minima. In aggregate, this produces the Boltzmann distribution, in which the populations of different states depend entirely on their relative energies.
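      As a small numerical illustration of that last point, the sketch below computes Boltzmann populations, p_i = exp(-E_i / RT) / Σ_j exp(-E_j / RT), for a set of states. The three energies and the two temperatures are made-up values chosen only for illustration; they are not taken from the manuscript, and the function name `boltzmann_populations` is hypothetical.

```python
from math import exp

# Molar gas constant in kcal/(mol*K), i.e. the Boltzmann constant per mole.
R_KCAL = 0.0019872041


def boltzmann_populations(energies_kcal, temperature_k):
    """Equilibrium populations of states with the given energies (kcal/mol).

    Energies are shifted by their minimum before exponentiation; this leaves
    the populations unchanged but avoids overflow at low temperature.
    """
    e_min = min(energies_kcal)
    weights = [exp(-(e - e_min) / (R_KCAL * temperature_k)) for e in energies_kcal]
    partition = sum(weights)
    return [w / partition for w in weights]


if __name__ == "__main__":
    energies = [0.0, 0.5, 1.5]  # hypothetical relative energies, kcal/mol
    for temperature in (300.0, 100.0):  # hypothetical temperatures, K
        pops = boltzmann_populations(energies, temperature)
        print(temperature, [round(p, 3) for p in pops])
```

      At equilibrium, lowering the temperature shifts population toward the lowest-energy state. Whether a real sample reaches that equilibrium during cooling is a kinetic question, which is the subject of the manuscript.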

      For intermediate temperatures, it is useful to speak in shorthand of energy barriers and a system's ability to overcome them. We believe the introduction of this manuscript deserves a more complete illustration of the fundamental principles, however, and would encourage the authors to possibly even add a diagram or two to aid in this effort. It is too easy to confuse the fact that cooled structures move toward global energy minima with the idea that cooling gives them the energy needed to overcome energy barriers, which is precisely the opposite of the truth. The abstract in particular ought to be reworded to avoid this misinterpretation.

      The design of the computational and wet lab experiments is carefully geared toward isolating the relevant variables and reproducing the relevant states and processes. For example, to choose the equilibration times to use in MD simulations, the authors first simulated how long it would take for samples of various thicknesses to vitrify. By and large we are satisfied with the parameters of these experiments, but we question one unsupported assumption: the authors enforced an ethane bath outer boundary held at constant temperature. This could be possible if the ethane bath remains in contact with a heat sink and the equilibration time between ethane and the heat sink is negligible compared with that between ethane and the sample, but we do not see this discussed or justified, nor is this standard practice when freezing grids, as the ethane bath is isolated from the standard liquid nitrogen heat sink after reaching the desired temperature to prevent the ethane from freezing solid. We were confused by the plot of equilibration times for ethane layers of different thicknesses, and hypothesize that the unexpected (to us) trend is a result of this boundary condition: we would have imagined a thicker ethane layer to allow quicker absorption of heat from the water layer, but the opposite is shown. Moreover, there is often a cold gas layer (see work by Rob Thorne on hyperquenching: https://pubmed.ncbi.nlm.nih... and https://journals.iucr.org/m...). While this complication might be very difficult to simulate, it should be explained how it might affect the interpretation of results.

      Although it does not impact the methods or results of this paper, we are also unconvinced that the use of any vitrification bath held at a lower temperature than the commonly used ethane bath would necessarily result in faster freezing, as heat transfer is also dependent on heat capacity (hence the selection of ethane for plunge-freezing rather than liquid nitrogen!). Propane does indeed appear to have a higher heat capacity than ethane at similar cryogenic temperatures, so in the case of an ethane-propane mixture, this assumption does hold, but we would prefer the authors include this detail.

      As for the results of the study, it is well-evidenced and clearly presented that conformational distributions present before plunge-freezing are reflected in vitrified samples when a standard vitrification protocol is followed, and that the rate of cooling does indeed impact the degree to which these distributions are preserved. We especially applaud the authors' careful wording around what interpretations are supported or suggested by the data, leaving open the remote possibility of other explanations — they draw very clear distinctions between observations and analyses. This is good science!

      Finally, we find the implications of the study meaningful. The selection of a ribosomal complex as an example particle perfectly illustrates the biologically relevant range of flexibilities and temperature-dependent conformational ensembles. This example gives us an intuitive measuring stick for other types of structures. Taken as a whole, the analyses inform future "time-resolved" studies using cryoEM and the design of other experiments that depend on the preservation of a conformational ensemble by rapid cooling. This is a very exciting direction of inquiry that we will continue to watch with great interest!

      Minor points:<br /> - The authors could make the introduction even more clear and accessible by specifying that liquid specimens present a challenge because their vapor pressure is incompatible with high vacuum.<br /> - We favor the wording "most often liquid ethane" over "mostly liquid ethane" for describing the standard vitrification setup.<br /> - Sentences such as the second to last sentence in the second paragraph of the introduction could be broken into multiple sentences or otherwise simplified to avoid confusion among the several "which," "from" and "and" clauses.<br /> - Sobolevsky’s work on TRP channels and vitrification probably deserves a mention in the intro alongside the other examples, esp. because those probe a natural temperature sensor! (https://www.nature.com/arti..., https://www.nature.com/arti...)<br /> - Tomography can be used to resolve position in the ice layer, which bears on this passage: “Apart from the water-layer thickness, the temperature drop also depends on the position within the layer with the slowest drop in the center (Fig. 1a), which is relevant, because in the time between the spreading of the sample onto the grid and the plunging, the biomolecules tend to adsorb to the air-water interface 62.” (https://pubmed.ncbi.nlm.nih...)<br /> - The authors could comment on whether being in different parts of the ice layer would affect RMSF values. If the effects are not too small, reconstructions from different layers could yield B-factors that could deconvolute different effects!<br /> - Are there local effects to their RMSF kinetic/thermodynamic models in the ribosome? For example, could they subdivide RNA/Protein, small/large subunit, exterior/interior sites, etc? Are there any regions that increase conformational heterogeneity upon cooling (as we have seen often in multi temperature crystallography)? Using a global RMSF metric may be leaving out interesting phenomena.<br /> - Are the frames and code deposited somewhere for others to examine?

      Iris Young and James Fraser (UCSF)

    1. On 2021-11-12 18:37:39, user Subhamoy Mahajan wrote:

      Please visit https://github.com/subhamoy... for associated code and tutorials. There have been several updates since the pre-print was posted: two new depth-variant PSFs, compatibility with the TIFF format to generate multi-dimensional images compatible with ImageJ, and the addition of Gaussian and Poisson noise, etc.

    1. On 2021-10-05 18:24:10, user Marc RobinsonRechavi wrote:

      Just after the discussion, the authors write:

      The source code and weights for the trained models will be made available shortly.

      This is a publication, i.e. it is made public as part of the scientific record and is citable, thus I strongly invite the authors to make the corresponding code available without delay.

    1. On 2021-08-26 23:28:45, user Laura Sanchez wrote:

      Dear Willems et al, this preprint was discussed in a lab meeting and we would like to offer the following for review. Thank you for posting this very interesting manuscript. Best, The Sanchez Lab:

      The manuscript by Willems et al. presents an overview of a new software, AlphaTims, which allows for fast indexing and retrieval of LC-TIMS-MS/MS data. The authors present a clear explanation of how software indexing and data matrix construction works, along with an example of how their software is used to simplify the accession of data from complex samples. This open-source program has the potential to make data more accessible without proprietary software, and has many possible applications in conjunction with further data processing software. Overall, AlphaTims looks to be an impressive piece of software, and figure 1 especially did a great job of visualizing and explaining the difficult concept of data indexing in 4 dimensions. This was a great, polished read, and very informative in elucidating the experimental process for those who are less familiar with the LC-TIMS-MS/MS workflow. It was also appreciated that the data was publicly available on PRIDE! Below please find a list of critiques that may improve the accessibility of some of the concepts and better illustrate the software’s utility.

      ● Figure 1 is very well done and all the terms are well-explained. It may add further clarity if the authors were to add a part (d) to help visualize what the data looks like once indexed, perhaps with number values. Additionally, the color palette is difficult to interpret when printed in black and white; the authors may wish to consult ColorBrewer.

      ● Some of the data on the speed of the software is difficult to conceptualize without benchmark data for comparison. There’s a brief comparison between the program and Bruker’s DataAnalysis on two different systems (one local and one GitHub VM) with different specifications, which doesn’t provide a lot of context as these do not seem to be directly comparable. While the speed of the software compared to others is not necessarily a major focal point, as AlphaTims carries out different processes to DataAnalysis, it’s still a little difficult to conceptualize the relevance of the times in figure 2, as these events are somewhat decontextualized. With that being said, it was apparent that both systems used solid state drives. Therefore, even if runtimes cannot be directly compared, it may be helpful to note whether the code is CPU-bound, storage speed bound, etc. and suggest what hardware upgrades may help increase performance.

      ● The documentation in general is very good! The accessibility and the included explanations for all of AlphaTims’ reading, writing, slicing, and visualization workflows are impressive. However, more examples for other included processing functions (i.e. centroiding) would be helpful.

      ● Some more broad data visualizations in figures 3 and 4 would be helpful. Seeing an overall LC-MS and/or TIMS-MS spectrum in figure 3 could help contextualize the complexity of the data, beyond the quality control purposes expressed in the figure. This would also help to visualize the described “polygon filter” - again, not an expressed priority of the paper, but helpful for readers to connect the software to data. Additionally, in figure 4, showing a comparison of the selected peptides between the 6 sample conditions could help to visualize the utility of the software for determining sample optimization. There’s also a slight disconnect in the captions for figure 4 - “cell 24” could refer to In [24] or Out[24]. This is worth clarifying with traditional a, b, c, d, e, f labels.

      ● The forward looking statement in the conclusion may provide a great opportunity to expand on future possible applications. For instance, do the authors see this being developed with additional data processing/analysis functionalities, or integrated into new/existing data analysis programs? Is there possible functionality to visualize and compare multiple datasets at once?

    1. On 2021-07-15 06:54:53, user Marc RobinsonRechavi wrote:

      The GitHub repository referenced in this manuscript is empty:

      The code for this project is available at: https://github.com/jokelley...

      At this url I only find a README which contains the title of this manuscript.

      This is a publication, i.e. it is made public as part of the scientific record and is citable, thus I strongly invite the authors to make the corresponding code available without delay.

    1. On 2021-07-08 10:55:42, user Jelger Risselada wrote:

      After the manuscript has been accepted via the regular peer-review process, the EVOMD code used here will be made publicly accessible on GitHub. Nevertheless, if you would like to try out or use our evoMD method already, simply drop us a line.

    1. On 2021-06-28 03:16:33, user Jesse Bloom wrote:

      I thank Stephen Goldstein and an anonymous commenter with the username ACC for their helpful comments above on the initial version of the bioRxiv pre-print.

      Below I outline how I have addressed these comments in a revised version of the manuscript.

      These revisions distinguish version 1 of the bioRxiv pre-print (uploaded on June-18-2021) and version 2 of the bioRxiv pre-print (uploaded on June-27-2021). To see the details of all revisions, go to https://github.com/jbloom/SARS-CoV-2_PRJNA612766/compare/initial_bioRxiv_version...second_bioRxiv_version#diff-983b58b8186b0a4ed7f280f258cdab3eb0dd7d5136f8ac361ba982a43cfb7136 and if you are specifically interested in the manuscript (rather than the code) find the file called paper/paper.tex and click on Load diff to see all the changes.

      Below I summarize the revisions:

      ADDING E-MAILS REQUESTING SRA DELETIONS<br /> Both commenters pointed out that it is hard to be sure why the data were deleted from the SRA. In the revised version, I have added additional information relevant to this point.

      Specifically, after I e-mailed the pre-print to the NIH on June 18, 2021, they sent me a copy of the e-mails from Wuhan University requesting that deletion. I now include these e-mails (with the redactions and highlighting made by the NIH) as Figure 6 of the revised manuscript (they are also at https://raw.githubusercontent.com/jbloom/SARS-CoV-2_PRJNA612766/main/paper/figures/SRA_email.png). I have added these e-mails instead of referring to the NIH statement (as suggested by Stephen Goldstein) because these e-mails contain more detail than the NIH statement.

      In the revised version, I have refrained from assigning a motive to the authors. However, I do point out that although their e-mail said they were submitting the data to another website, I can find no website with the data. I also point out that whatever the motive, the practical consequence of removing the data from the SRA was that no one was aware of its existence even if it was technically available both on the Google Cloud and in a table in the journal Small. In particular, removing the sequencing data from the SRA meant it was not in the list of locations from which the joint WHO-China report collected their data.

      While I think the commenters were correct in asking me to moderate my strong suggestions about the motives for removing the data, I similarly think commenter ACC is incorrect when (s)he says that the fact that the paper corresponding to the data was published shows with high confidence that the authors were not trying to obscure the data. In fact, the authors had uploaded a pre-print to medRxiv on their study in early March, and there is no mechanism for deleting a pre-print (unlike for SRA data). So the authors were committed to having the manuscript permanently available as soon as the pre-print posted in early March.

      EXAMPLE SRA DELETION FROM ANOTHER STUDY<br /> In the original pre-print, I included as Figure 2 an example of how SRA data are removed by showing an e-mail illustrating this process for data from a different study on pangolin coronaviruses. This was an e-mail excerpted from page 50 of https://usrtk.org/wp-content/uploads/2020/12/NCBI-Emails.pdf. I included the e-mail because it was the only publicly available example I could find of the SRA deletion process.

      However, both commenters suggested this e-mail was confusing because it was from a different study. I accept this point. In addition, as described above, I subsequently received from the NCBI the actual e-mail to withdraw the BioProject relevant to the current study, so have included that as a more relevant example of the deletion process.

      The commenter ACC further suggests that my original reference to the deleted pangolin coronavirus accessions SRR11119760 and SRR11119761 is a “mistake” because those data are available on the SRA again. I checked, and this is indeed the case, but those two SRA files only re-appeared on June-18-2021 (according to timestamps from vdb-dump --info, see: https://github.com/jbloom/SARS-CoV-2_PRJNA612766/blob/main/paper/figures/SRR1119760_SRR1119761_obj_timestamps.png) which is the same day that I submitted my pre-print to bioRxiv. I obviously had no way of knowing these two pieces of missing data would reappear on the SRA after a 15-month hiatus almost concurrently with submission of my pre-print.

      MAKING CLEAR DATA ARE STILL LISTED IN TABLE OF PAPER IN SMALL
      Both commenters correctly pointed out that even after the SRA deletions, the mutations were still available in a table in the paper in the journal Small. In the revised version, I have added additional text to make that point very explicitly (it was mentioned before, but only in passing). However, I also note that Small is primarily a chemistry journal that is not read by virologists, and the practical consequence of removing the sequencing data from the SRA is that no one was aware of the list of mutations in the paper in Small until I recovered the sequencing data from the SRA.

      HOW TO REFER TO EARLY EPIDEMIC SEQUENCES
      The commenter ACC says it is too vague to describe the samples as from “early in the epidemic.” However, this is an exact quote from the Wang et al pre-print, where they describe the samples as being from outpatients “early in the epidemic.” The final published version of Wang et al changes this to “early in the epidemic (January 2020),” but neither the pre-print nor the paper gives more exact dates of sample collection. I have therefore retained this description, since it is how the authors of the study themselves describe their samples.

      IMPORTANCE OF THE NEW SEQUENCES
      Both commenters correctly emphasize that the new sequences do not transform our understanding of viruses in Wuhan at this time. Rather, they provide modest additional information on viral diversity at this time, including more evidence for the C29095T mutation (which makes the virus more similar to the bat outgroup viruses). I fully agree that the data are not transformative, but contend that any new data are valuable. The original manuscript already made this point: for instance, clearly saying that the data supported the prior inferences of Kumar et al about proCoV2 (which was even used as the alignment reference). However, in the revised version I have further emphasized that the data are supportive of existing conclusions by Kumar et al and several others, while leading to incremental advances (such as the possible importance of C29095T). However, I am unsure how to respond to commenter ACC’s criticism of the title and abstract: the title says “sheds more light,” which is a modestly worded phrase that is appropriate for the moderate increase in knowledge that has accrued from these new sequences. These are the only new early Wuhan sequences we have had in over a year, so I consider that a net plus for scientific knowledge even if it’s not transformative.

      DISCUSSION OF HUANAN SEAFOOD MARKET
      In the original manuscript, I had a paragraph describing why I think two of the theories about the difficulty of rooting the SARS-CoV-2 phylogeny are rather unlikely: the idea that RaTG13 is faked to confuse root placement, and the idea that there were multiple zoonoses from multiple markets. Both commenters raise various questions about this part of the manuscript. I still think both of these theories are unlikely, but discussing them in detail is not central to the main points of the paper. So I have shortened this part to just mention that I think the two-market zoonosis is less plausible, for the reason explained by Trevor Bedford here: https://twitter.com/trvrb/status/1408080716286414852.

      ADDITION OF SRR11313490 AND SRR11313499
      In the original manuscript, I could recover all of the deleted sequence data from the Google Cloud except SRR11313490 and SRR11313499. After I posted the original pre-print, I was contacted by several individuals who realized that they had downloaded copies of those data prior to the data deletion in June of 2020. I have added those new data in the revised manuscript. It turns out that their addition makes no meaningful difference, as both are from low-coverage samples for which it is still impossible to call meaningful sequence information.

      PROPER TRIMMING OF THE READS
      It was brought to my attention by Brendan Larsen that I had failed to properly trim the sequencing data in the original analysis, and so was analyzing the primer overlap sites as well as the sequenced region. I have masked these sites in the analysis in the revised manuscript. This also makes no meaningful difference in the results, since no mutations of interest are in these binding sites, although it does slightly reduce the fractional coverage over the region by opening small gaps at the primer binding sites.
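
The masking step described above can be illustrated with a minimal sketch. This is not the code used in the revised manuscript: the function names, the toy sequence, and the primer coordinates are all hypothetical, and intervals are assumed to be 0-based and half-open. It only shows why masking primer binding sites opens gaps and lowers fractional coverage.

```python
def mask_intervals(seq, intervals, mask_char="N"):
    """Return seq with every 0-based, half-open [start, end) interval masked."""
    chars = list(seq)
    for start, end in intervals:
        if not 0 <= start <= end <= len(seq):
            raise ValueError(f"interval ({start}, {end}) is outside the sequence")
        for i in range(start, end):
            chars[i] = mask_char
    return "".join(chars)


def fractional_coverage(seq, mask_char="N"):
    """Fraction of positions that still carry a called (unmasked) base."""
    if not seq:
        return 0.0
    called = sum(1 for c in seq if c != mask_char)
    return called / len(seq)


# Toy consensus and hypothetical primer binding sites (illustration only).
example_seq = "ACGTACGTACGTACGTACGT"
example_primers = [(0, 4), (16, 20)]

masked = mask_intervals(example_seq, example_primers)
print(masked)
print(fractional_coverage(example_seq), fractional_coverage(masked))
```

Any mutation call falling inside a masked interval is dropped, which is why masking changes nothing when no mutations of interest lie in the primer binding sites.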

    1. On 2021-06-14 15:16:03, user Ariel Mundo wrote:

      My co-authors and I welcome any comments or feedback on this work. Comments on the GitHub site where the code and data are shared are also appreciated!

    1. On 2021-06-10 09:34:35, user Marc RobinsonRechavi wrote:

      Under Data availability, the authors write:

      All underlying data and code will be persistently archived upon acceptance

      This is a publication, i.e. it is made public as part of the scientific record and is citable, thus I strongly invite the authors to make the corresponding data and code available without delay.

    1. On 2021-06-05 10:20:07, user Marc Robinson-Rechavi wrote:

      Under Data and code availability, the authors write:

      The code for evolutionary simulations together with sample notoebooks and data used in the article will be made available on GitHub along with the publication of the manuscript.

      This is a publication, i.e. it is made public as part of the scientific record and is citable, thus I strongly invite the authors to make the corresponding GitHub available without delay.

    1. On 2021-06-01 17:42:44, user Claudio Tennie wrote:

      We have uploaded our response to the main point raised in Mielke’s 2020 bioRxiv comment on our oranzee MS elsewhere (below his original comment). Due to character limitations of disqus (or in any case upload issues we experienced) we will now answer the remaining points here. Having answered Mielke’s main point, below, we will therefore only answer Mielke’s additional points where he a) directly claims to critique us beyond a critique of our specifics claims and/or b) where we deemed that Mielke raises points that might appear to readers to apply beyond a critique of our specifics claims. Given that we can show with data how and why these critiques are not relevant (see below), and given that the appropriate literature containing this data is and was already cited in the manuscript, we have not changed our manuscript based on the comments addressed below (although note that we have since worked on the manuscript more generally, and therefore it is likely that these points are now made clearer in the new manuscript). Note that we experienced upload issues again here at disqus when uploading our reply, which is why we split our reply. This is PART 2 of our reply. A third part will follow.

      The claim that we deny the ecological importance of ape culture. E.g. Mielke writes “The model, and I would say the theory underlying it, ignore that an ape who fails to learn a skill correctly faces immense costs.”

      Answer: We do not ignore the ecological relevance of ape cultures, and never did. As for the implication by Mielke - that copying is necessary for ape cultures - this is a widespread idea, but it is empirically unsupported. Apes do not require copying to acquire even their most complex behaviours (e.g. even nutcracking with hammers can be reinnovated from scratch, as shown by Bandini et al. (accepted with minor revisions); available at https://www.biorxiv.org/con...). Therefore, while it is true that apes have to learn, the question we keep on asking is: how do they learn it? We already laid out how they do so (based on the relevant literature) in our original manuscript. In sum, and as reflected already in our original manuscript, the evidence points away from form copying, and supports socially mediated reinnovation of know-how. See more on this below.

      The claim that a lack of certain social learning biases in our model renders our model invalid. E.g. Mielke writes “So, if learning probability in the simulations is based on the frequency a behaviour is observed in the population, treating all potential models evenly and not weighting the impact of potential models by their age (e.g. remove infants and juveniles) biases the outcome of the results.”

      Answer: Our general results do not depend on such specifics of social learning biases. Specific outcomes (e.g. exact number of cultural traits) do, but we no longer make claims based on such specifics (i.e. we no longer make the ‘specifics claim’, see above). In fact, social learning biases could be easily added to the model, but note that, because of the stochastic nature of some of them (e.g. prestige bias) their addition may even be likely to increase the probability of finding distributional “cultures” without form copying.

      The claim that there is an “open-endedness” to ape culture in real-life and which needs to be inbuilt into our model. E.g. Mielke writes “There are hundreds of different ways to groom someone.”.

      Answer: In theory, there could be hundreds of ways to groom - thousands even. It is therefore all the more surprising - but only under the assumption that apes regularly copy forms - that, in this case and across cases, apes are nevertheless empirically restricted towards certain forms, which are few in number. Indeed, in a first, indirect “ballpark-count” it was recently concluded that apes are thus biased towards only a few thousand behavioural forms overall, across all ape species, behavioural domains (incl. grooming) and populations (the wild included; Motes-Rodrigo & Tennie 2021 Bio Rev). This is in heavy contrast to humans, who have many more behavioural forms; indeed counting not in thousands or even in millions, but in the billions (Motes-Rodrigo & Tennie 2021). Therefore, whilst the theoretical logic proposed by Mielke holds, it only holds for humans in real life; it does not hold for apes. Across our publications, and all considered, we continue to explain the most parsimonious reason why this is - ultimately, this difference derives from apes not basing their behavioural forms on copying. Instead, they seem to trigger forms that they latently possess (latent solutions).

      Having a finite number is a simplification for modelling purposes. In our model we consider a finite number - 64 - of forms, but notice the specific number is arbitrary and does not change our main result. And, again, ape culture does not show signs of open-endedness in real life, so, strictly in keeping with Mielke’s call to make models more real-life like, we also do not see the need to implement open-endedness.

      The claim that the meaning of social behaviours is equal to their behavioural form. E.g., Mielke writes “Play elements that nobody else knows will not lead to successful play.”

      Answer: The question of how apes glean meaning from behaviour is a question that we did not study. We also fail to see the relevance of this comment to the source of behavioural form. Meanings might be related to ‘this or that’ factor, yes. However, none of this determines the behavioural form itself, and which is what our manuscript discusses.

      The claim that the available data regarding ape social learning must be interpreted as spontaneous form copying. E.g., Mielke writes “Apes have probably in access of 100 different play elements in each group (Nishida et al 2010; Petru et al 2009), and it can easily be expected that innovation and social transmission occurs in this context (Perry 2011).”

      Answer: Apes fail to copy forms under controlled settings (unless this skill is artificially planted via training and/or enculturation), and apes’ behavioural forms are reinnovated across populations and, to a large extent, even across ape species (Motes-Rodrigo & Tennie 2021 Bio Rev and references therein). This should be expected, given that apes reinnovate these forms from scratch (and this is including social behavioural forms; e.g. handclasp grooming, mentioned by Mielke, is a nice case in point as its (few) various subforms also emerged across culturally unconnected populations, including in captivity). It is likewise irrelevant for questions of form copying whether a behaviour involves multiple individuals (Mielke: “rain dancing, cannot conceivably be reinnovated by one individual – what would that even look like, given that it is a coordinated action of several individuals with no discernible physical function?”). To show this, let us take the example of bird flight: to form a V formation in the sky cannot be done by a single bird either, but in contrast to Mielke’s claim this specific fact does not answer the question of what leads to this V shape - what leads to this V form. The answer is, perhaps in this case, a large influence of the environment, i.e. aerodynamics in conjunction with socially flying birds. The exact drivers most likely will differ across cases (e.g., sometimes the genetic level may play more of a role than the environment), but the general point does not depend on these details. The main point remains that none of these factors can pinpoint form copying. The fact that the various raindance forms occur elsewhere, however, speaks against form copying (as we explain here: Tennie & van Schaik 2020, Phil Trans B - compare Motes-Rodrigo & Tennie 2021 Bio Rev).

      The claim that our model cannot - but should - capture multi-step behavioural forms. E.g. Mielke writes “Many of the described behaviours in Whiten et al 1999 are not simple behaviours that occur in a vacuum, but action sequences with several elements that have to be fulfilled in the exact right order and are embedded in sequential behaviour patterns; for example leaf clipping.”

      Answer: We do not deny that ape behaviour can be complex and composed of multiple steps. Nobody we are aware of denies this. However, this, too, does not pinpoint form copying. To see why, consider this. A spider making a web is (and must be) using multiple steps in the correct order, too. But this does not mean that the spider must have copied this behavioural form [important note: we use the spider example as an illustration of the logical principle, we are not saying that apes learn just like spiders]. Another way of saying it is this: does multi-step behaviour require copying? It does not logically (see the spider example), but is there maybe a reason to assume that multi-step behaviour requires copying in the case of apes? If so, then we would agree. However, instead we know that multi-step behaviours do not require form copying in apes (e.g., see the empirical finding of ape nut-cracking reinnovation from scratch, using hammer sequences; Bandini et al (accepted with minor revisions)). We therefore must reject this notion on both logical and empirical grounds - despite it being intuitive and widespread.

      As for the open-ended comment/aspect, if Mielke here refers to the fact that our model does not consider multi-step behavioural forms, again, this is a simplification for modelling purposes (and see also our notes on open-endedness above).

      Claims regarding ‘known unknowns’. E.g. Mielke writes “detailed analysis will likely find that there are different ways of dragging a branch, as has been found for other behaviour on the list (e.g. digging for honey Estienne et al 2017).”

      Answer: We would not like to engage with claims regarding known unknowns. Yet, note that some previous unknowns have been studied and thus turned into interesting knowns. For example, we recently ran a study on target-naive, unenculturated chimpanzees in which these nevertheless spontaneously developed a whole range of digging behaviours using organic tools - i.e. without the need for form copying. Again, we are not denying that apes show several behaviours - they clearly do (see above) - we are questioning whether any of these require form copying. Our study was designed (among other things) to test the widespread claim that wild ape cultural patterns per se involve such copying and/or that they must involve such copying. Our study showed that, conversely, they do not. These aspects of ape behaviour do not pinpoint form copying. Another recently published study used a new, dedicated method for its search for ape form copying (the method of local restriction). It likewise demonstrated that most ape behavioural forms essentially auto-repeat. It also showed that the number of ape behaviours that show substantial (indirect) signs of being influenced by form copying in apes is extremely low. It ranges between zero and three (!) cases - across all ape populations, species and behavioural domains (Motes-Rodrigo & Tennie 2021 Bio Rev).

      The claim that it matters how well our model states match real ape states. E.g. Mielke writes “Function and context might be specific to one sex or age-group: drumming in juveniles, for example, is often part of play; female chimps will slap the ground in displays rather than drum, even though drumming is sometimes observed.”

      Answer: Our general results - and our main message - proved very robust and do not depend on these types of details.

      The claim that our simulation was unbound by reality. E.g. Mielke writes “In general, it would be fantastic if there was even a basic description of why any of the simulations was designed as described here”

      Answer: These descriptions were already part of the original manuscript, and several colleagues who had read our manuscript understood our approach. However, again, our newest manuscript is generally improved and clarified, so we believe that our new manuscript should be fully understandable (in addition to being replicable, as we lay open its code).

      The claim that “‘Innovation probability’ is meaningless for social behaviours”.

      Answer: Social behaviours are innovated just like material behaviours are, and so we fail to see a problem here. Indeed, one social behaviour in chimpanzees, handclasp grooming, has clearly been reinnovated multiple times across multiple culturally unconnected populations - including in captivity. More to the point of our manuscript, the source of the forms of these behaviours (in apes, at least) is very unlikely to be form copying (e.g. Tennie et al. 2020 Bio & Phil). In social and other contexts, apes can be triggered in various ways to reinnovate forms similar to those shown by their social partners.

      The claim that the ape form copying assumption is still superior to the assumption of socially mediated reinnovation. E.g. Mielke writes “These seem to completely ignore anything we know about social learning or the life history of animals.”

      Answer: We must strongly disagree, and indeed we do so on strictly empirical grounds. There is a long-held assumption of form copying underlying ape cultures, yes - but it has fared very poorly empirically. For some time, it indeed had appeared as if apes are spontaneously copying forms. However, claims for such spontaneous ape copying either had tested enculturated/trained apes (irrelevant for the question at hand as the skill is no longer spontaneous) or had merely tasked apes to seemingly (!) copy behaviours that they could already do on their own. The problem here is that, clearly, things one can do on one’s own do not need to be copied. Hence, any “copying” in these situations might have been an illusion of form copying. These results were then indeed proven illusory in other, controlled studies - in studies that tasked apes with the copying of forms that they would not (!) already develop on their own (note, when human children learn behavioural forms, e.g. the word “laptop”, they are indeed learning forms that they would not otherwise produce). When tested in this way, unenculturated, untrained apes invariably fail to copy such novel forms - of either behavioural type (Clay & Tennie 2017 and references therein) or of artefact/tool type (Tennie et al. 2009 Phil Trans B). Note that we have another manuscript at the ready in which we additionally show that the failure of apes to copy in such situations is also not due to any social learning bias (and even ape mothers are not copied in such situations). Indeed, in stark contrast to human form learning (e.g. of the word “laptop”), ape behavioural forms repeat across populations and even species - and the overall number of ape behavioural forms is very small when compared with (form-copying) humans (Motes-Rodrigo & Tennie 2021 Bio Rev).
      The perhaps last possible objection that might be raised here, namely that maybe all these forms are still copied in the wild, but go back to some original cultural source population, can also be rejected. Not only would unavoidable copying error have led to a different picture today (Tennie et al. 2016, 2017), but when we test apes whose previous experiences are known and for whom any such cultural connections are effectively severed, they spontaneously reinnovate these forms, too (including multi-step, complex behavioural forms such as nut-cracking with hammers; Bandini et al. (accepted with minor revisions)). Thus, truly taking into account everything we know about social learning of apes, the only available mechanism that can explain all these findings is that these forms are instead socially mediated, but not copied. Their form requires individual reinnovation (Bandini & Tennie, 2017; 2018). As a side note, Mielke also writes “by the time any chimp starts nut cracking, they will have observed several millions of strikes by their mother and other group members – are we to believe that they did not in any way take this information into account when acting?” Here, there can only be two answers in the light of the available data. First, socially mediated reinnovation happens in such cases, and therefore juveniles do absorb information (e.g. know-what (nuts), know-where (these trees) etc.; see Bandini et al. 2020, Biol Lett; Tennie et al. 2020 (Bio & Phil), Motes-Rodrigo & Tennie (2021)). Second, the idea that form copying is responsible for the similarities in behavioural forms between mother and offspring is strong and widespread, but it is - everything considered - most likely an illusion.
      Relevant here is, again, that when tested at a later age, in the absence of demonstrations (note that orangutans do not even crack nuts in the wild), older, and therefore more individually (bodily) capable orangutans reinnovate the nutcracking form from scratch (Bandini et al) - and they can do so fast (for more and in-depth details on this comparison, please consult Bandini et al. (accepted w. minor corrections)).

      The claim that an absence of evidence for form copying does not mean evidence of absence for form copying. E.g. Mielke then continues:“That reinnovation [of nut-cracking] is potentially possible does not rule out that most individuals in a community do not, in fact, reinnovate”

      Answer: We have been very explicit in our manuscript from the start that our model does not allow for this claim. Instead, we did and do point to additional findings - for example, that apes do not copy novel forms, and that they spontaneously reinnovate the ones shown elsewhere (see also above). And so, the possibility of ape form copying playing any large role is rendered unparsimonious by that other, relevant, data. We merely apply Occam's razor across the available, relevant data. However, our model adds to this evidence as well. It helps triangulate the question by disproving one claim about what the method of exclusion can pinpoint.

      The claim that innovation has no long-lasting effects in our model. E.g. Mielke writes “They don’t ‘reinnovate’ nut-cracking every time they are hungry, this renders the idea of innovation meaningless.”

      Answer: This is simply a misunderstanding, as it is not how innovation is implemented in our model. Sensibly, innovation does lead to long-term consequences in our model. In particular, it does so by changing the “state” of our agents, i.e. when the agents innovate in a particular domain they (probabilistically) do not innovate any more within that domain.
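
The state change described in this answer can be sketched as follows. This is a minimal illustration and not the authors' model code: the class name, the domain label, and the two probabilities (`p_innovate` before a first innovation, `p_repeat` afterwards) are hypothetical placeholders chosen only to show how a single innovation event can have lasting consequences for an agent.

```python
import random


class Agent:
    """Toy agent that tracks, per domain, whether it has already innovated."""

    def __init__(self, domains):
        # State: False until the agent innovates in that domain.
        self.innovated = {domain: False for domain in domains}

    def maybe_innovate(self, domain, p_innovate, p_repeat, rng):
        """Attempt an innovation in `domain`; return True if one happened.

        Before the first innovation the chance is p_innovate; afterwards the
        agent's state has changed and the (typically much lower) p_repeat applies.
        """
        p = p_repeat if self.innovated[domain] else p_innovate
        if rng.random() < p:
            self.innovated[domain] = True
            return True
        return False


rng = random.Random(0)
agent = Agent(["foraging"])
# With p_innovate=1.0 and p_repeat=0.0 the agent innovates exactly once.
events = [agent.maybe_innovate("foraging", 1.0, 0.0, rng) for _ in range(5)]
print(events)
```

Setting `p_repeat` above zero would give the probabilistic version, in which an agent that has innovated in a domain only rarely innovates there again.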

      The claim that our results do not provide relevant information. E.g. Mielke writes “However, the same thing is also true the other way round” in response to our claim that individual level mechanisms cannot be identified through population level outcomes.

      Answer: This might have been true if our results existed in a vacuum. Yet there are many other, relevant types of data (cited above, and also in our manuscript), which in turn allow us to determine with high likelihood (using parsimony as our guide) the types of social learning mechanisms used by apes. The most parsimonious interpretation is that apes produce and maintain their cultures not via form copying, but via individual reinnovation of these forms which however is very often socially mediated (the latter bit renders these very clearly cultural, yes, but just not cumulative cultural in their forms).

    1. On 2021-05-28 18:28:26, user morgandjackson wrote:

      Hi, I was keyed onto your recent in-press paper (Molecular Biology & Evolution) by a colleague and figured it would be prudent to leave a comment here regarding the status of your new species "Bombus incognitus". I'm not sure whether the intent was to describe "B. incognitus" as a new species or whether you're just using "Bombus incognitus" as a placeholder name for communication purposes as you continue to unravel the evolutionary patterns within the populations, but as it currently stands the name "Bombus incognitus" does not meet the requirements of the International Commission on Zoological Nomenclature (ICZN) and is thus not considered a valid species or available name yet. If the intent is to formally describe your cryptic population as a putative new species, I would recommend you become familiar with the requirements for doing so as outlined by the ICZN in The Code, particularly chapters 3 & 4, which detail the steps and requirements necessary to publish a new name that can be recognized officially. It may seem inane and inconsequential, but there remains the possibility that the organism you're referring to could be misconstrued by others, or the name you've proposed replaced with something else entirely by someone else, creating a web of complications for future studies involving the taxa in question. Good luck.

    1. On 2021-05-12 07:49:10, user Wouter De Coster wrote:

      Dear authors,

      This looks very interesting, and I am eager to try this on our datasets. However, I am disappointed that you did not make the tool available upon posting the preprint. I could not find any availability information in the preprint, nor a repository on GitHub. If this is just a way to "claim priority" without actually "moving the field forward", and you plan to release the code upon actual publication then you have not understood the purpose of preprints.

      I hope this can be rectified.

      Sincerely,
      Wouter De Coster

    1. On 2021-04-29 03:04:51, user Doug Shepherd wrote:

      Hi Bin and crew, awesome work and beautiful presentation!!

      Regarding:

      Previous OPM systems with full NA detection12,14 reported an imaging volume of 180um × 180um × 60um, whereas DaXi can image a volume about 400 times larger

      In the eLife paper we showed stage scanning without motion blur for >1 cm x 220 um x 60 um (with some loss in resolution outside the Snouty v1.0 sweet spot as you note in your discussion). We then stitched each FOV to generate up to 1 cm x 1 cm x 60 um images. See Figure 9 and "stage scanning" setup in the methods.

      We've made some changes since publication. All of our current Pycro-manager based control and Python reconstruction code (as well as Micro-manager 2.0 control code used in eLife version of the instrument) is available at our GitHub repo: qi2lab OPM repo.

      Doug

    1. On 2021-04-08 22:20:07, user Charles Warden wrote:

      Hi,

      Thank you for putting together this preprint.

      I wish Genome Research had a Disqus-style comment system, similar to some other journals. I am currently taking something that I wanted to post as a comment on a different Genome Research article, and I instead will post it on PubPeer. So, in general, that is an option, even though I do not think that it is ideal for something truly intended as a comment.

      I think I would have to take a closer look to develop more of an opinion on this specific matter, but I did have a couple of thoughts:

      1) My impression from this Tweet was different than my impression from the beginning of this preprint:

      https://twitter.com/KVittin...

      I can see certain parts of the preprint where I think the tone is more like a comment or complementary study. However, that is different than a "Contradictory Results" preprint or a PubPeer posting (assuming that the intention is to flag something about the study that may need to be corrected or retracted). I am not sure if that can still be changed for this preprint.

      My preference would probably be something closer to my impression of the Tweet (as positive as possible), but it is not immediately clear to me if that is the intention. Essentially, I am noting that I am not entirely clear about the severity of the message.

      For example, I think there are usually limitations to simulated data. I can also believe that there are viewpoints that may benefit from being made more clear, beyond what is within the paper. However, I believe that I would agree with the idea of limitations due to extra noise in the RNA-Seq data, and I am not sure why this has to be "contradictory" instead of "alternative" results.

      I am not sure if it helps, but I have some notes about the RNA-Seq DEG method limits here. While I could see how someone could call the discordance "contradictory," I would say it is more of an overall limit (where no 1 strategy is best, and you can't lock down everything in advance). In that sense, I would say different freely available methods complement each other, rather than detract from each other.

      It is certainly possible that I missed something, since I did not have a strong opinion when I first saw the Varabyou et al. 2020 paper. However, unless I am missing something important, I wonder if a slightly different tone might help?

      In terms of material linked from my blog post (and other experiences), I think what I have seen matches some extra noise in the Salmon gene lists (relative to the TopHat2 / STAR htseq-count gene lists), but not with a horrible amount of extra noise. So, in terms of what I see in the abstract, I think my experiences might agree with the last sentence in the abstract (even though I am talking about extra noise in differential gene lists and the authors are talking about detection of expression in that particular sentence). Is there something different that you are concerned about?

      2) You say that "All data and scripts are available upon request." I am not sure how large the intermediate files are in this situation. However, I think code and perhaps some processed data can be made available through means like GitHub, Zenodo, etc. (similar to what you have on FigShare).

      In general, I think it is best to avoid providing materials upon request, if at all possible. So, I am not sure what is the limitation (or how much you are not providing already), and I think either making more available and/or making clear what subset you couldn't provide might help.

      Best Wishes,<br /> Charles

    1. On 2021-03-08 23:34:17, user Milka Kostić, PhD wrote:

      Dear authors,

      thank you for sharing this preprint with everyone. I recently decided to support research in chemical biology by providing feedback and thoughts on preprints in this field. Your preprint caught my eye, and I hope you will find the comments below useful (I hesitate to call this peer review, but if you would like to share what I wrote with a journal please feel free).

      Kind regards,

      Milka

      Comments to authors:<br /> In this preprint the authors combine genetic code expansion (a strategy that allows one to incorporate non-canonical amino acids at a specific position within the protein) with bioorthogonal labeling (a type of chemical reaction executed in living systems in a way that is orthogonal to biological/physiological reactions) and optical microscopy to visualize two members of the transmembrane AMPA receptor (AMPAR) regulatory protein (TARP) family, gamma2 and gamma8, in live neurons and brain slices.

      The main problem that the authors set out to address is as follows:<br /> - The authors present evidence that antibodies can’t recognize endogenous TARP gamma2 and gamma8 in neurons because these proteins are found to be associated with AMPAR’s ligand binding domain (LBD) in a manner that masks the epitope. This suggests that strategies like immunostaining using fluorescently labeled antibodies are not appropriate for TARP imaging. Therefore, to solve this problem the authors decided to pursue a completely different strategy that does not rely on antibody use. Overall, this is an important research problem and the approach the authors chose to pursue is appropriate for addressing this problem.

      Summary of the approach:<br /> - The authors incorporated clickable trans-cyclooctene derivatized lysine into TARP gamma2 and gamma8 using an established strategy for genetic code expansion. Exposing the modified TARP gamma2 and gamma8 to cell-impermeable tetrazine-dyes resulted in a bioorthogonal reaction called strain-promoted inverse electron-demand Diels-Alder cycloaddition (SPIEDAC), whereby TARP gamma2 (or gamma8) featuring a modified lysine is covalently labeled with the fluorescent dye. The strategy does not seem to prevent TARP gamma2 and gamma8 from interacting with AMPARs, or to affect TARP gamma2 and gamma8 function, making this a well-tolerated modification. Additionally, the fluorescent signal is strong enough to allow visualization in primary neurons, as well as organotypic hippocampal slice cultures.

      Summary of key biological observations:<br /> - The key observations made in this study are:<br /> o Most endogenous TARP gamma2 and gamma8 are found to be associated with AMPAR at the surface of hippocampal neurons.<br /> o TARP gamma2 was found to be mostly localized to synaptic spines as opposed to extrasynaptic sites, while TARP gamma8 had a more even distribution between the two, while still favoring synaptic spines.<br /> o Analysis of visual data allowed for identification of TARP gamma2 synaptic clusters (80 nm in size) that localized within dendritic spines. TARP gamma8 was not observed to form clusters under the conditions used in this work.

      Concluding remarks:<br /> Overall, I found the application of the genetic code expansion in this neurobiology context to be of interest and I can see the potential of this methodology (with further improvements elaborated on by the authors in their Discussion section) for imaging systems that are otherwise difficult to address using optical microscopy.

    1. On 2021-01-29 09:05:33, user Meghamsh Teja wrote:

      In Fig. 3A, I need some clarification. First, size exclusion chromatography was performed, fractions were collected and then subjected to SDS-PAGE. Color codes represent the type of sample, and each color-coded box represents the respective gel analysis result. In gel 3, according to the color code, only nucleosomes were added, so how did Mer2 appear on the gel? Can someone explain this?

    1. On 2021-01-26 16:07:06, user Chris ADAM wrote:

      Hello,<br /> I'm a Master student in bioinformatics working on epigenetics data. I aim to diagnose genetic diseases using machine learning. At the end of this article, it's said the code will be made available in a public repository. Has it been done yet ?<br /> Best regards,<br /> Chris

    1. On 2021-01-09 19:44:10, user Miha Kosmač wrote:

      Hi Izzy and Rafa,

      I've really enjoyed reading your manuscript. If it's not too computationally expensive I think it could be a really nice way of assigning known cell types with considerable certainty. <br /> I guess the one thing I am a bit iffy on is the paragraph on the canonical CD4 marker genes. FOXP3 is a generally accepted marker of Tregs (a particular and usually very small subset of CD4 T cells). PTPRC (a.k.a. CD45) is used as a pan-immune cell marker and IL7R (CD127) is used as a marker of memory B cells as well as an exclusion marker for Tregs (CD3+ CD4+ CD25+ CD127-). In flow cytometry there is usually a hierarchy of how the markers are used for cell assignment, which I suppose is lost when people use them in scRNA-seq. <br /> I think the question of marker specificity is a very complex problem in cell biology. It is always dependent on what cell types are being compared and specific markers in absolute terms are very rare. <br /> I thought I'd share my thoughts on this particular aspect of the manuscript. Thank you for both the publication of the framework and the code.

      Best wishes,<br /> Miha

    1. On 2021-01-05 00:58:07, user Charles Warden wrote:

      Hi,

      Thank you for posting this pre-print.

      I see that this pre-print has both supplementary material and a link to code:

      https://www.biorxiv.org/con...

      https://github.com/OSU-BMBL...

      However, the section for "Supplementary Materials" says "Supplementary Data are available at Science Advance online".

      Is this intentional (perhaps part of an automatic submission from a journal?), or should the section say something else?

      Best Wishes,<br /> Charles

    1. On 2020-12-15 11:05:02, user Dominik wrote:

      As others pointed out, this paper is seriously flawed in many aspects and should be retracted, especially considering that many people are afraid of a new type of vaccine (mRNA) and conspiracy theorists certainly will take this paper to "proof" that mRNA vaccines can in fact alter your genetic code.

    1. On 2020-11-29 22:33:24, user Simon Burgermeister wrote:

      Hi, the GitHub repository contains the processed result analysis but not the original data. Would it be possible to have access to it as well in order to test the execution of the code with a working example?

    1. On 2020-11-21 18:52:07, user Saurabh Gayali wrote:

      Excellent work. I have been working with a software package named filterforge (https://filterforge.com/) for years. It gave me the freedom to create flow-based image-editing workflows. The software is not standard in the scientific world, and it is not exactly free. Its biggest gap is that it has no feature for exporting code in standard languages, so this work will surely help me and other non-programmers by offering a programming-free approach to algorithm design. Congratulations once again.

    1. On 2020-11-16 23:18:09, user Fraser Lab wrote:

      This manuscript details the efforts of a team of structural biology computational experts to cross-validate the proliferating SARS-CoV-2 structures emerging during the COVID-19 pandemic. Over the past five months, as soon as each new SARS-CoV-2 structure is made publicly available, the authors have subjected it to a barrage of validation metrics as well as residue-by-residue manual inspection. When they were able to get a hold of the raw data, they analyzed that as well for several of the most commonly occurring pathologies. Re-refined structures were sent back to the structures' original authors for reupload to the PDB via the recently available versioning option that preserves the PDB code (although it would be nice to quantify how many authors were contacted and what the “re-versioning” rate is after contact). In this manner, the structural biology community has simultaneously benefitted from an increased number of experimentalists' single-minded focus on the coronavirus (even where these efforts fall partly outside their areas of expertise) and these experts' careful curation of the resulting structures.

      The manuscript represents an incredible effort. As the authors call attention to in a few places, the errors in data processing and modeling are not only inevitable (especially under the circumstances) but tolerable, as long as they can be identified and corrected in a timely manner — the goal is not to gatekeep so that only experts are permitted to do this work, but to tag-team as effectively and efficiently as possible. Furthermore, there is the separate issue of pathologies resulting from decisions during data collection that cannot be corrected after the fact. It is critical that fixable and unfixable issues are extremely clearly distinguished from each other. We suggest the authors rewrite some of these narratives with the deliberate aim of identifying the origins of pathologies that can be mitigated or corrected in full, again differentiating between these, and taking care that the wording is as charitable as possible toward the researchers responsible.

      There are a few cases of oversimplified concepts that we believe can be succinctly expressed more accurately. For example, where the authors describe data as being "incomplete due to radiation damage," they could instead take the time to explain the difference between incompleteness resulting from a poorly chosen collection strategy, incompleteness in higher resolution bins, and radiation-induced damage that renders some reflections (and some real-space features) self-consistent but inaccurate. The "lower quality" of datasets suffering from these pathologies could be separated into uniformly low resolution datasets, which are more easily recognizable, and seemingly high-resolution datasets with serious systematic errors.

      The authors could also be more clear with a couple choices of wording around concepts of correctness. They write, "While the deposited structures are often improved by PDB-REDO, they need to be checked and should not be viewed as 'more correct' purely on [the] basis of a lower R value." In this and several other instances, we challenge the authors to replace any terms assigning value (improved, correct, error, bad, misidentified) with descriptions of what metrics they are examining and what they mean for the model and data. This publication is an opportunity to instill readers with a stronger sense of how to use the existing validation tools, and what to do when they turn up serious issues. It would be highly useful to go into some explanation of what constitutes model bias and how this is detected in crystallographic and EM data, what metrics we traditionally use to detect it, what happens when we refine against those metrics (!), and how the tradeoff between agreement with priors (geometry, clashscore) and agreement with data (real space CC, FSC) should vary with map quality. If the authors are willing to go as deep as explaining how the available validation metrics were devised, the average reader might learn quite a bit!

      A separate but closely related issue is the identification of real features that conflict with prior knowledge. Under what circumstances do we accept "bad" geometry is actually the right way to model something? These are often information-rich and functionally relevant discoveries, such as Hoogsteen base pairing or very strained geometries at a catalytic site. This is worth calling attention to.

      We read the opening of the "manual evaluation" section as a framing of structure solution as tedium that should be automated as much as possible, but whose results nevertheless fall short in the absence of an expert's intervention. This is unfortunate. We would rather laud both the amazing efficiency (and thereby throughput) that automating routine steps has made possible and the important role of the researcher in guiding the process and interpreting the results.

      On the topic of data not deposited in the PDB, the authors describe a case of a severely radiation damaged dataset and how it was necessary to reprocess the raw data to improve it. We strongly agree that raw data should be made publicly available for exactly these sorts of reasons. Once again, separating this administrative barrier from the researchers' decisions during data collection would be helpful in setting a positive tone. The authors point out the amazing proteindiffraction.org resource and should call for more deposition there (or to SBGrid DataGrid). In EM, the EMPIAR database plays a similar role (with greater proportional adoption) and the reprocessing potential of datasets deposited there should be highlighted and celebrated.

      The "supplying context", "summary" and especially "outlook" sections bring up some extremely important points that could bear to be repeated at the beginning of the manuscript to help frame this work. The tradeoff necessary under the present circumstances in particular — the fact that imperfect "first draft" structures are still useful, and much more useful when they can be quickly updated with any corrections — deserves greater emphasis, and perhaps further discussion of how the field should go about addressing and documenting problems with models and data after the pandemic. We are overall very excited to see this work in print alongside the resources already publicly available at insidecorona.net. Collectively, that resource and this manuscript represent an exciting development in peer review away from gatekeeping and toward continuous improvement!

      Finally, we note a handful of points that we suggest would improve readability:<br /> SARS-CoV is now also known as SARS-CoV-1. We strongly suggest using this term throughout the manuscript to differentiate it from SARS-CoV-2.<br /> The phrase "not by experimentalists, but scientists from other fields" suggests a false dichotomy. We recommend rewording so as to recognize the existence of experimentalists in other fields. <br /> The rationale for annotating secondary structures with the Haruspex neural network is not yet clear.<br /> The COVID-19 pandemic is "unprecedented" in very recent history, but arguably not unique even in recorded history — we would favor a different term here.<br /> The abbreviation RdRp is not defined.<br /> "fulfil" is a typo.<br /> “Structures solved in a hurry to address a pressing medical and societal need _are_ even more prone to mistakes.” - suggest "may be"

      James Fraser and Iris Young (UCSF)

    1. On 2020-11-09 21:31:18, user Clemantine wrote:

      Why would they use aborted fetal tissue lines to experiment with? Using any cell line after the 1947 Nuremberg code requires the informed consent of the Subject and the ability to withdraw from the experiment, both denied to aborted human babies. All cell lines derived after 1947 using aborted fetal tissue are immoral, unethical and illegal and must be discontinued!

    1. On 2020-11-07 00:51:18, user Charles Warden wrote:

      Thank you for posting this preprint.

      I am not sure what the expectations are for describing the methods in the Abstract (for your communication goals as a preprint and/or if you have a specific eventual peer-reviewed journal in mind). When I saw the abstract, I thought it was a bit strange because some steps were left out (such as your alignment using STAR).

      So, I might consider re-wording the "Methods" within the Abstract (either adding some more information like the aligner, or perhaps providing a broader description in the Abstract and the full details in the Methods section)? However, that is just my individual opinion.

      If possible, you could provide a link to code for maximum reproducibility. Am I overlooking that in the main text?

    1. On 2020-10-09 19:44:54, user Whitney Hansen wrote:

      At the stage where the authors share code on how to utilize the list column feature to make a column for each home range estimator, using the "map" function applied to the "data" list column, I received an error that "hr_mcp" could not be applied to a tibble object. I had to first perform a mutation where I generated a new list column of tracks, i.e.:<br />
      dat1 <- dat1 %>%<br /> mutate(<br /> track = map(data, ~ mk_track(., x, y, timestamp,<br /> crs = crs, order_by_ts = TRUE))<br /> )

      Then I was able to proceed with home range estimation.

    1. On 2020-10-04 12:53:46, user Michael Clerx wrote:

      On the downside, CellML does not support full object orientation for composing models. This means that base classes that define an interface need to be imported as components of the model, which requires a more verbose syntax. It also does not support discrete variables, but only event triggers. To implement a variable like the interbeat interval d_interbeat, which stays constant between events, an additional equation would have to be added to the model, which sets the derivative of this variable to zero. This both introduces unnecessary code and makes the model less understandable as there is no clear distinction between discrete and continuous parts apart from the labels assigned to the variables.

      These concerns should be somewhat alleviated in the new standard, I hope: https://doi.org/10.1515/jib...

    1. On 2020-09-22 23:38:27, user Fraser Lab wrote:

      Overall, this paper aims to demonstrate the ability to build high resolution Cryo-EM models from data collected in less than a day without recently reported new hardware upgrades, as well as establish two metrics for Cryo-EM model refinement. These metrics are a b-factor equivalent and a ‘goodness of fit’ of water molecules. As resolutions improve, the need for these metrics and for procedures for water placement is becoming more urgent. <br /> This work expands on this group’s previous work on Cryo-EM model building evaluation, mainly the Q-score metric (Pintilie et al. 2020). However, we think that this paper could be stronger if the authors applied their metrics to other high resolution structures (and perhaps even other ApoF maps/models) to demonstrate the transferability of these metrics. Additionally, we would like to see more data provided to better understand the value of certain metrics shown including the change in the sigma value for Q-score and the chosen sigma value for the SWIM method. Finally, we believe this paper would be strengthened by comparing the proposed methods (updated Q-score and SWIM) with existing methods (original Q-score and phenix.dowser). We would like to see these comparisons both qualitatively with images and quantitatively with tables.

      Clarifying comments:

      1) Please explicitly state or name the two different models created from the two distinct maps (1.36 and 1.34 A resolution) - or is the same starting model real space refined into both? The wording throughout the paper makes it unclear if there were two different models built. For example, in the introduction, you state ‘we achieved cryo-EM maps of apoferritin, reconstructed from the images collected from the commonly available 300-kV Titan Krios microscopes at 1.34A using K3 detector and at 1.36A using Falcon4 detector’. <br /> However, in the results you refer to both 2 different maps (the majority of the time) and 2 different models (once in a while). ‘per-residue Q-score plot for our two maps to range between 0.85 and 0.88’ and ‘The Molprobity and PDB reports of our two models are ranked very highly in all the assessment scores on the adherence of models to the chemical properties of proteins’.

      2) Please draw strong semantic distinctions between atomic displacement parameters and Wilson B and sharpening B, specifically in the following sentences:<br /> a) ‘Both maps have comparable resolution based on the Fourier Shell correlation of 0.143 threshold6 and similar cumulative B-factor as estimated from reconstructions with varying numbers of particles.’<br /> b) ‘Traditionally, B-factor is used to assess the atom position uncertainty in crystallography and is a weighting factor to allow computing a model-based map identical to the experimental map.’<br /> c) ‘We thus introduce B’ factors derived from per-atom Qscores’<br /> d) ‘The B’ factor, which will be deposited to the PDB, serves the same purpose as the crystallographic B-factor in such a way that we can compute a model-based map that can match well with the experimental cryo-EM density map.’

      Major comments:<br /> 1) Throughout the paper, we would like to see the authors use data from their structures as well as other high resolution cryo-EM apoferritin structures that have recently been published (https://doi.org/10.1101/202..., https://doi.org/10.1101/202...). While these structures are mentioned, this paper could be strengthened by comparing Q-score and water placements across this group's collected apoferritin models and the two new high resolution apoferritin models. Specifically, please comment on the number and placements of water molecules as well as the Q-score of hydrogen atoms where modeled in the structure from Yip et al (https://doi.org/10.1101/202...). <br /> 2) We would also like the authors to analyze other high resolution cryo-EM structures, such as β3 GABA-A and the recently resolved B1 GPCR complexes (https://doi.org/10.1101/202..., https://doi.org/10.1101/202...) to better demonstrate the transferability of the metrics proposed. <br /> 3) Please clarify why and how a sigma value of 0.4 was chosen for Q-score. Please clarify what would happen if there are structures with resolutions better than 1.2A (see EMDB https://www.ebi.ac.uk/pdbe/...). One potential solution would be to provide a sliding scale of Q-scores so that Q-score could only be compared among structures with a similar resolution. However, we would still like to see justification of sigma values for each resolution ‘bucket’.<br /> 4) Please comment on if/how the Q-score can be used alongside metrics of model quality such as bond angles, lengths, etc.<br /> 5) We would like the authors to provide data on why certain scaling factor values were used (‘the calculation involves a single scaling factor; the optimum scaling factor is determined empirically by testing which value makes the resulting model-map matched the cryo-EM map better by FSC’). Additionally, instead of trying to match the scaling factor to the FSC, is it possible to compare it to an iso_bfactor for the overall map?<br /> 6) Are there any examples of modeled alternative conformers with better Q-scores than the ones shown in Figure 1f? Do the Q-scores of these residues improve with less map sharpening? <br /> 7) Please provide details on how the Q-score would change with varying sharpening factors. Specifically, as Q-score increases with ‘sharper peaks’, is there any correlation between a sharper map and an increased Q-score? In the context of the other two high resolution apoferritin papers, please compare the differences in map sharpening from improved hardware versus post processing updates. Please include comments on sharpening the B-factor using the Ewald Sphere in Nakane et al (https://doi.org/10.1101/202...). <br /> 8) Please comment on how you recommend evaluating or modifying Q-scores for alternative conformers. <br /> 9) Please provide data and/or rationale on why 2x sigma was chosen within the SWIM method.<br /> 10) It is unclear what the SWIM method would be replacing/improving. Will this be used during refinement or to adjust water molecule placement afterwards? Either way, please compare this method to the state of the art (Phenix.dowser and/or coot water picking). <br /> 11) In the results section, the differences in water placement between different ApoF structures are compared, but it is unclear how your method would improve these discrepancies. Please provide a table of water and ion placement discrepancies between apoferritin structures (both those mentioned in the paper plus the two additional high resolution structures) using current methods and your updated method.

      Provide the open source code (or link to said code) for the updated Q-score and SWIM analysis.

      Minor comments:<br /> 12) There is an incomplete sentence in the last paragraph on the second page. ‘Since the resolution used in cryo-EM is neither a conventional optics criterion nor the detection of diffraction spots in a diffraction pattern of a crystal.’

      Stephanie Wankowicz and James Fraser (UCSF)

    1. On 2020-08-13 08:23:35, user Martin R. Smith wrote:

      This is an interesting study and a promising approach. <br /> My one question would be whether the Robinson-Foulds distance is the most suitable measure of tree distance on which to base the linkage ratio. The RF distance suffers from a number of shortcomings, many of which are exacerbated when pectinate (i.e. fully unbalanced) trees are involved, as moves of a single leaf can 'knock out' a disproportionately large number of splits – so it might have particular scope to produce misleading results in the examples that you have used.<br /> I've reviewed some possible alternatives in Smith (2020), Bioinformatics, doi:10.1093/bioinformatics/btaa614 , and fast implementations are available in my R package 'TreeDist' – though unfortunately I don't yet have a Python front end to the underlying C++ code. The quartet distance might be particularly relevant, as the distance between two entirely random trees takes a constant value (1/3) – would this obviate the need to generate unlinked topologies in order to normalize the linkage ratio?
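
      The sensitivity of the Robinson-Foulds distance on pectinate trees can be illustrated with a minimal Python sketch. It is not taken from the manuscript or from TreeDist; the nested-tuple tree encoding and the helper names (`pectinate`, `splits`, `rf_distance`) are assumptions made for this illustration. Moving one leaf from one end of an eight-leaf caterpillar tree to the other leaves no split in common, whereas swapping two neighbouring leaves changes only one split per tree.

```python
def pectinate(leaves):
    """Build a fully unbalanced (caterpillar) tree as nested 2-tuples."""
    tree = leaves[0]
    for leaf in leaves[1:]:
        tree = (tree, leaf)
    return tree


def splits(tree):
    """Return the non-trivial bipartitions of the tree, treated as unrooted.

    Each split is stored as the side that does NOT contain the smallest
    leaf label, so the same bipartition always gets the same key.
    """
    clades = []

    def leaf_set(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        merged = frozenset().union(*(leaf_set(child) for child in node))
        clades.append(merged)
        return merged

    all_leaves = leaf_set(tree)
    reference = min(all_leaves)
    result = set()
    for clade in clades:
        side = all_leaves - clade if reference in clade else clade
        # Keep only splits with at least two leaves on each side.
        if 2 <= len(side) <= len(all_leaves) - 2:
            result.add(side)
    return result


def rf_distance(tree_a, tree_b):
    """Robinson-Foulds distance: number of splits found in exactly one tree."""
    return len(splits(tree_a) ^ splits(tree_b))


base = pectinate([0, 1, 2, 3, 4, 5, 6, 7])
one_leaf_moved = pectinate([1, 2, 3, 4, 5, 6, 7, 0])     # leaf 0 moved to the far end
neighbours_swapped = pectinate([0, 1, 3, 2, 4, 5, 6, 7])  # leaves 2 and 3 swapped

print(rf_distance(base, one_leaf_moved))      # 10, the maximum 2 * (n - 3) for n = 8
print(rf_distance(base, neighbours_swapped))  # 2
```

      Both rearrangements move a single leaf, yet the first one reaches the maximum possible distance, which is the behaviour described above.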

    1. On 2020-08-02 22:50:59, user E. Castedo Ellerman wrote:

      In terms of lessons learned, it might be worth mentioning in the paper the pithy, easy-to-remember software guideline of writing "DRY" [1] code. The second bug mentioned in the paper is a classic example motivating "DRY": it is one of the kinds of bugs that writing DRY code is intended to avoid.

      [1] https://en.wikipedia.org/wi...
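
      As a generic illustration of the guideline (the paper's actual bug is not reproduced here, and every function name below is hypothetical), the following Python sketch keeps a scaling rule in one function, so that a later fix cannot be applied to one code path and forgotten in another:

```python
def scale_to_unit_sum(values):
    """Single source of truth for the scaling rule."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot scale values that sum to zero")
    return [v / total for v in values]


def prepare_control(counts):
    # Reuses the shared rule instead of carrying its own copy of the formula.
    return scale_to_unit_sum(counts)


def prepare_treatment(counts):
    # Same rule, same code: the two paths cannot silently drift apart.
    return scale_to_unit_sum(counts)


print(prepare_control([1, 1, 2]))   # [0.25, 0.25, 0.5]
print(prepare_treatment([2, 2]))    # [0.5, 0.5]
```

      Had each `prepare_*` function contained its own copy of the division, a change to one copy would leave the other silently out of date.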

    1. On 2020-07-29 13:23:01, user Jamie Carpenter wrote:

      This ensemble ML/AI method sounds really interesting as it appears to handle large multi-dimensional data by compressive-sensing-like stochastic sampling of subsets of manageable size. The reported convergence is very encouraging and hopefully this can be backed up with a rigorous mathematical derivation (e.g., upper bounds on bias, precision, variances, information, etc.). I wonder if the code is available for community testing and independent validation.

    1. On 2020-07-13 19:31:15, user Open-Source Helps Science wrote:

      This is interesting work, but the authors should consider adding a detailed section to Materials and Methods explaining exactly how they implemented their dynamical correction. To add transparency, the authors should also consider releasing the code they used and making it OPEN-SOURCE as part of the Supplementary Information. This would allow anyone to implement their procedure on their own data. As presented, it is not at all clear how anyone would go about reproducing the authors’ procedure, because insufficient detail is provided about what was actually done. This is not good for science.

      The authors do cite previous work (Clabbers et al., 2019), but they also say that this procedure is distinct from the one detailed in that paper. Also, many of the parameters defined in that work are not discussed here. For instance, what was the scale of the correction applied to each dataset? The authors' procedure appears to have promising effects on improving refinement, but more detail is necessary.

    1. On 2020-07-13 02:35:26, user Charles Warden wrote:

      Hi,

      Thank you for putting together this pre-print.

      1) In this pre-print, you mention “our previous study in which we identified a higher frequency of genes with a dN/dS ratio significantly above 1 encoded on the lagging strand six diverse species”. However, in that study, you say “We initially observed a higher frequency of genes with dN/dS values exceeding 1 among the HO genes of each species (listed in Supplementary Table 3). Yet the total number of genes in each species was too small to establish statistical significance, necessitating a combined analysis of all data points.”

      In other words, I thought plots showing differences in means for dN/dS values were smaller than 1 (in part) because you couldn’t achieve statistical significance for the sets of genes with dN/dS values greater than 1. Am I misunderstanding something?

      While I understand 1 is a commonly used threshold, I would guess a dN/dS value of 1.01 could really be neutral. In order to shorten this response, I moved the other details here.

      So, if the proportion of a collection of genes with dN/dS > 1 was not significantly different between groups based upon genomic orientation, then it might still be possible to say that there was a qualitatively higher frequency of genes with dN/dS whose individual “significant” values were substantially greater than 1 (when referring to the earlier paper). If that is the case, then the word “significantly” is being used accurately in the current sentence in this pre-print. However, I think that might cause some confusion for readers. Plus, I am currently having a hard time finding such a subset of genes (or the criteria to define some threshold significantly greater than 1) in the Nature Communications paper. So, if that is what you mean, I would very much appreciate it if you could please help point out where those specific results can be found (or confirm that is not what you mean).

      A single mutation can cause an important phenotype, but I am not sure if the current phrasing is giving the reader the right impression about the genes with dN/dS greater than 1.

      2) It is a minor point, but I had a problem with the link to the code under “Data/Code” (https://github.com/lh64/MultihitSimulation).

      I can see other repositories under https://github.com/lh64.

      Best Wishes,<br /> Charles

    1. On 2020-07-02 20:26:44, user Matt Grainger wrote:

      I really enjoyed reading this paper. I have a few questions/ comments.

      You talk about subgroup analysis, but from what I can see from the R code in the appendix you have not actually carried out subgroup analysis but have subsetted your data and run rma. Subgroup analysis would involve looking at the differences between subgroups (e.g. male versus female; or USA versus the rest of the world or brassica versus other crop cover types). Am I misunderstanding the code and the approach here?
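
      To make the distinction concrete, here is a minimal Python sketch of a fixed-effect test for subgroup differences (the between-subgroup Q statistic). It is not the authors' R code; the effect sizes, variances, subgroup labels and function names are made up for illustration, and the p-value helper only covers the two-subgroup case (one degree of freedom).

```python
import math


def pool_fixed(effects, variances):
    """Inverse-variance (fixed-effect) pooled estimate and its variance."""
    weights = [1.0 / v for v in variances]
    estimate = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    return estimate, 1.0 / sum(weights)


def q_between(subgroups):
    """Test statistic for differences between subgroup estimates.

    subgroups maps a label to (effects, variances).
    Returns (Q_between, degrees of freedom).
    """
    pooled = [pool_fixed(effects, variances) for effects, variances in subgroups.values()]
    overall, _ = pool_fixed([est for est, _ in pooled], [var for _, var in pooled])
    q = sum((est - overall) ** 2 / var for est, var in pooled)
    return q, len(pooled) - 1


def p_value_df1(q):
    """Upper-tail chi-square probability for exactly 1 degree of freedom."""
    return math.erfc(math.sqrt(q / 2.0))


# Hypothetical data: two subgroups with two studies each.
data = {
    "brassica": ([0.2, 0.4], [0.04, 0.04]),
    "other": ([0.6, 0.8], [0.04, 0.04]),
}
q, df = q_between(data)
print(round(q, 3), df, round(p_value_df1(q), 4))
```

      Running a separate meta-analysis on each subset yields the two pooled estimates, but only a comparison such as the one above addresses whether the subgroups differ from each other.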

      With the meta-regression you might need to think (or help the user of the shiny/website think) about the theoretical basis of the model - dredging the model (all possible combinations of variables) might lead to some combinations that are not reflective of the underlying mechanisms being modelled. I have had this comment from reviewers in the past when using the same/similar approach.

      Subgrouping the model will require higher power as will meta-regression. The probability of a type 1 error is going to be quite high, and will get higher as you reduce the number of studies and sample sizes. How will you deal with this and communicate the risk to end users?

      From my point-of-view it seems like you are throwing away very useful information (by subsetting the limited data) and risking bad inference. Using a Bayesian model (of some sort - it could be a meta-analysis or decision model) would allow this information to be useful perhaps?

      Finally, I think one question remains (though perhaps some of Alec's work touches on this): how much difference does local evidence actually make? Is there a big enough return to the manager or farmer from this local subsetting, given its limitations, to make a difference to the "local" effect estimate? This is a decision problem, and I would be interested in exploring it more.

      I hope these comments help. I really like the concept and I would love to see this up and running. I think it could be very useful to embed meta-analysis & ES in conservation decision making.

      Cheers<br /> Matt

    1. On 2020-06-05 11:38:46, user Patrick Lemaire wrote:

      ASTEC software and datasets: Updated Access.

      Following further developments, we are happy to share below updated access to the data and software presented in the preprint. The information below and more can also be found on the ASTEC webpage: http://www.crbm.cnrs.fr/en/...

      We hope that you will like these tools and data and that they will be useful to your own work.

      Patrick Lemaire and co-authors.

      Software and Tutorials:

      Github repositories:

      The software, standard parameters files and tutorials can be found in the Github repository: https://github.com/astec-se....

      "astec-2019-published" is the repository dedicated to this paper, with a fixed and autonomous (full version, python and C codes) ASTEC software package. This includes all the codes and libraries necessary to install the software, with examples such as standard parameter files and a tutorial to test it. Complete documentation is also provided to guide users.

      "astec" is the repository for ongoing development and enhancement of the package, and for storing its development history. Only the python codes are included in this repository. The standard parameter files or tutorial provided in the astec-2019-published repository may have to be adapted to more recent versions of the code.

      Standard parameters. We provide a standard example parameter file that can be used initially either to set up each ASTEC step on the test set we provide (see below) or to initiate a project on personal data.

      Tutorial and test dataset. We provide a tutorial to test each step of ASTEC, together with a set of 20 timepoints from ASTEC-Pm1 downsized to a resolution of 1 μm³. The ASTEC pipeline can be run on this dataset on a standard workstation (4 threads; < 1 min for the fusion and segmentation of a timepoint).

      Jupyter notebook. Examples of how to use the outputs of ASTEC can be found in the Github repository: https://github.com/leoguign...

      Datasets

      The ASTEC imaging, segmentation or geometric information for two wild-type Phallusia mammillata embryos (ASTEC-Pm1 and ASTEC-Pm2) can be downloaded from Figshare (https://figshare.com/s/765d...). These datasets include:

      – Raw images: The four views for some time points of the movie ASTEC-Pm1 are shared to test the fusion algorithms or evaluate the quality of the acquisition obtained with the MuViSPIM light-sheet microscope. The images are .hdf5 files that can be easily read with an ImageJ/FIJI plug-in.

      – Fused images: complete sequences of fused images are shared for ASTEC-Pm1 and -Pm2.

      – Segmented images: complete sequences of segmented images (voxelic representation) are provided for the entire ASTEC dataset.

      – Meshed segmented images: a meshed version of the segmented images (.obj format), which can be uploaded into the interactive Morphonet web tool (morphonet.org), is provided for the complete dataset.

      – Geometric properties: Properties measured after spatial registration of the movies are shared in both .pkl and .xml format for the complete dataset.

      Note: For each embryo, all files are compressed into a single archive file (tar.gz). A tool provided in the astec-2019-published Github repository can be used to convert .inr format files into the more common .mha files.

    1. On 2020-06-03 12:36:36, user Иван Азаров wrote:

      Hello! Thank you for your tool!<br /> Using CPM modelling this way is a very interesting possibility.<br /> Do you plan to develop your code further?<br /> Can users modify the full energy function used inside the CPM models to add their own terms?

    1. On 2020-05-18 16:18:00, user Anita Bandrowski wrote:

      "Hi, we're trying to improve preprints using automated screening tools. Here's some stuff that our tools found. If we're right then you might want to look at your text, but if we're not then we'd love it if you could take a moment to reply and let us know so we can improve the way our tools work. Have a nice day.

      Specifically, your paper (DOI:10.1101/2020.04.03.023846) was checked for the presence of transparency criteria such as blinding, which may not be relevant to all papers, as well as research resources such as statistical software tools, cell lines, and open data.

      We did not detect information on sex as a biological variable, which is particularly important given known sex differences in COVID-19 (Wenham et al, 2020).

      We also screened for some additional NIH & journal rigor guidelines:<br /> IACUC/IRB: not detected ; randomization of experimental groups: not detected ; reduction of experimental bias by blinding: not detected ; analysis of sample size by power calculation: not detected .

      We found that you used the following key resources: cell lines (1). We recommend using RRIDs so that others can tell exactly what research resources you used. You can look up RRIDs at rrid.site

      We did not find a statement about open data. We also did not find a statement about open code. Researchers are encouraged to share open data when possible (see Nature blog).

      More specific comments and a list of suggested RRIDs can be found by opening the Hypothes.is window on this manuscript, direct link https://hyp.is/U4WZvo-tEeqx...<br /> References cited: https://tinyurl.com/y7fpsvzy"

    1. On 2020-05-18 16:17:13, user Anita Bandrowski wrote:

      "Hi, we're trying to improve preprints using automated screening tools. Here's some stuff that our tools found. If we're right then you might want to look at your text, but if we're not then we'd love it if you could take a moment to reply and let us know so we can improve the way our tools work. Have a nice day.

      Specifically, your paper (DOI:10.1101/2020.03.02.972935) was checked for the presence of transparency criteria such as blinding, which may not be relevant to all papers, as well as research resources such as statistical software tools, cell lines, and open data.

      We did not detect information on sex as a biological variable, which is particularly important given known sex differences in COVID-19 (Wenham et al, 2020).

      We also screened for some additional NIH & journal rigor guidelines:<br /> IACUC/IRB: detected ; randomization of experimental groups: not detected ; reduction of experimental bias by blinding: not detected ; analysis of sample size by power calculation: not detected .

      We found that you used the following key resources: antibodies (2), cell lines (3), software (8). We recommend using RRIDs so that others can tell exactly what research resources you used. You can look up RRIDs at rrid.site

      We did not find a statement about open data. We also did not find a statement about open code. Researchers are encouraged to share open data when possible (see Nature blog).

      More specific comments and a list of suggested RRIDs can be found by opening the Hypothes.is window on this manuscript, direct link https://hyp.is/puNh0o-sEeqg...<br /> References cited: https://tinyurl.com/y7fpsvzy"

    1. On 2020-05-18 16:16:27, user Anita Bandrowski wrote:

      "Hi, we're trying to improve preprints using automated screening tools. Here's some stuff that our tools found. If we're right then you might want to look at your text, but if we're not then we'd love it if you could take a moment to reply and let us know so we can improve the way our tools work. Have a nice day.

      Specifically, your paper (DOI:10.1101/2020.03.02.972927) was checked for the presence of transparency criteria such as blinding, which may not be relevant to all papers, as well as research resources such as statistical software tools, cell lines, and open data.

      We did not detect information on sex as a biological variable, which is particularly important given known sex differences in COVID-19 (Wenham et al, 2020).

      We also screened for some additional NIH & journal rigor guidelines:<br /> IACUC/IRB: detected ; randomization of experimental groups: not detected ; reduction of experimental bias by blinding: not detected ; analysis of sample size by power calculation: not detected .

      We found that you used the following key resources: antibodies (1), cell lines (2). We recommend using RRIDs so that others can tell exactly what research resources you used. You can look up RRIDs at rrid.site

      We did not find a statement about open data. We also did not find a statement about open code. Researchers are encouraged to share open data when possible (see Nature blog).

      More specific comments and a list of suggested RRIDs can be found by opening the Hypothes.is window on this manuscript, direct link https://hyp.is/pn6YiI-sEeqN...<br /> References cited: https://tinyurl.com/y7fpsvzy"

    1. On 2020-05-18 16:15:38, user Anita Bandrowski wrote:

      "Hi, we're trying to improve preprints using automated screening tools. Here's some stuff that our tools found. If we're right then you might want to look at your text, but if we're not then we'd love it if you could take a moment to reply and let us know so we can improve the way our tools work. Have a nice day.

      Specifically, your paper (DOI:10.1101/2020.03.10.986711) was checked for the presence of transparency criteria such as blinding, which may not be relevant to all papers, as well as research resources such as statistical software tools, cell lines, and open data.

      We did not detect information on sex as a biological variable, which is particularly important given known sex differences in COVID-19 (Wenham et al, 2020).

      We also screened for some additional NIH & journal rigor guidelines:<br /> IACUC/IRB: not detected ; randomization of experimental groups: not detected ; reduction of experimental bias by blinding: not detected ; analysis of sample size by power calculation: not detected .

      We found that you used the following key resources: cell lines (1), software (2). We recommend using RRIDs so that others can tell exactly what research resources you used. You can look up RRIDs at rrid.site

      We did not find a statement about open data. We also did not find a statement about open code. Researchers are encouraged to share open data when possible (see Nature blog).

      We also found bar graphs of continuous data. We recommend replacing bar graphs with more informative graphics, as many different datasets can lead to the same bar graph. The actual data may suggest different conclusions from the summary statistics. For more information, please see Weissgerber et al (2015).

      More specific comments and a list of suggested RRIDs can be found by opening the Hypothes.is window on this manuscript, direct link https://hyp.is/xbw2sI-sEeqf...<br /> References cited: https://tinyurl.com/y7fpsvzy"

    1. On 2020-05-18 16:14:36, user Anita Bandrowski wrote:

      "Hi, we're trying to improve preprints using automated screening tools. Here's some stuff that our tools found. If we're right then you might want to look at your text, but if we're not then we'd love it if you could take a moment to reply and let us know so we can improve the way our tools work. Have a nice day.

      Specifically, your paper (DOI:10.1101/2020.04.14.041228) was checked for the presence of transparency criteria such as blinding, which may not be relevant to all papers, as well as research resources such as statistical software tools, cell lines, and open data.

      We did not detect information on sex as a biological variable, which is particularly important given known sex differences in COVID-19 (Wenham et al, 2020).

      We also screened for some additional NIH & journal rigor guidelines:<br /> IACUC/IRB: not detected ; randomization of experimental groups: not detected ; reduction of experimental bias by blinding: not detected ; analysis of sample size by power calculation: not detected .

      We found that you used the following key resources: cell lines (2). We recommend using RRIDs so that others can tell exactly what research resources you used. You can look up RRIDs at rrid.site

      We did not find a statement about open data. We also did not find a statement about open code. Researchers are encouraged to share open data when possible (see Nature blog).

      We also found bar graphs of continuous data. We recommend replacing bar graphs with more informative graphics, as many different datasets can lead to the same bar graph. The actual data may suggest different conclusions from the summary statistics. For more information, please see Weissgerber et al (2015).

      More specific comments and a list of suggested RRIDs can be found by opening the Hypothes.is window on this manuscript, direct link https://hyp.is/vCb_gI-tEeqa...<br /> References cited: https://tinyurl.com/y7fpsvzy"

    1. On 2020-05-14 16:35:22, user Anita Bandrowski wrote:

      "Hi, we're trying to improve preprints using automated screening tools. Here's some stuff that our tools found. If we're right then you might want to look at your text, but if we're not then we'd love it if you could take a moment to reply and let us know so we can improve the way our tools work. Have a nice day.

      Specifically, your paper (DOI:10.1101/2020.02.10.936898) was checked for the presence of transparency criteria such as blinding, which may not be relevant to all papers, as well as research resources such as statistical software tools, cell lines, and open data.

      We did not detect information on sex as a biological variable, which is particularly important given known sex differences in COVID-19 (Wenham et al, 2020).

      We also screened for some additional NIH & journal rigor guidelines:<br /> IACUC/IRB: not detected ; randomization of experimental groups: not detected ; reduction of experimental bias by blinding: not detected ; analysis of sample size by power calculation: not detected .

      We found that you used the following key resources: cell lines (8), software (4) . We recommend using RRIDs so that others can tell exactly what research resources you used. You can look up RRIDs at rrid.site

      We did not find a statement about open data. We also did not find a statement about open code. Researchers are encouraged to share open data when possible (see Nature blog).

      More specific comments and a list of suggested RRIDs can be found by opening the Hypothes.is window on this manuscript, direct link https://hyp.is/b2aE0o-sEeqm...<br /> References cited: https://tinyurl.com/y7fpsvzy"

    2. On 2020-05-13 16:53:24, user Anita Bandrowski wrote:

      "Hi, we're trying to improve preprints using automated screening tools. Here's some stuff that our tools found. If we're right then you might want to look at your text, but if we're not then we'd love it if you could take a moment to reply and let us know so we can improve the way our tools work. Have a nice day.

      Specifically, your paper (DOI:10.1101/2020.02.10.936898) was checked for the presence of transparency criteria such as blinding, which may not be relevant to all papers, as well as research resources such as statistical software tools, cell lines, and open data.

      We did not detect information on sex as a biological variable, which is particularly important given known sex differences in COVID-19 (Wenham et al, 2020).

      We also screened for some additional NIH & journal rigor guidelines:<br /> IACUC/IRB: not detected ; randomization of experimental groups: not detected ; reduction of experimental bias by blinding: not detected ; analysis of sample size by power calculation: not detected .

      We found that you used the following key resources: cell lines (8), software (4) . We recommend using RRIDs so that others can tell exactly what research resources you used. You can look up RRIDs at rrid.site

      We did not find a statement about open data. We also did not find a statement about open code. Researchers are encouraged to share open data when possible (see Nature blog).

      More specific comments and a list of suggested RRIDs can be found by opening the Hypothes.is window on this manuscript, direct link https://hyp.is/b2aE0o-sEeqm...<br /> References cited: https://tinyurl.com/y7fpsvzy"

    1. On 2020-05-13 16:48:07, user Anita Bandrowski wrote:

      "Hi, we're trying to improve preprints using automated screening tools. Here's some stuff that our tools found. If we're right then you might want to look at your text, but if we're not then we'd love it if you could take a moment to reply and let us know so we can improve the way our tools work. Have a nice day.

      Specifically, your paper (DOI:10.1101/2020.04.07.029090) was checked for the presence of transparency criteria such as blinding, which may not be relevant to all papers, as well as research resources such as statistical software tools, cell lines, and open data.

      We did not detect information on sex as a biological variable, which is particularly important given known sex differences in COVID-19 (Wenham et al, 2020).

      We also screened for some additional NIH & journal rigor guidelines:<br /> IACUC/IRB: not detected ; randomization of experimental groups: not detected ; reduction of experimental bias by blinding: not detected ; analysis of sample size by power calculation: not detected .

      We found that you used the following key resources: cell lines (1). We recommend using RRIDs so that others can tell exactly what research resources you used. You can look up RRIDs at rrid.site

      We did not find a statement about open data. We also did not find a statement about open code. Researchers are encouraged to share open data when possible (see Nature blog).

      More specific comments and a list of suggested RRIDs can be found by opening the Hypothes.is window on this manuscript, direct link https://hyp.is/dcw1vo-tEeqL...<br /> References cited: https://tinyurl.com/y7fpsvzy"

    1. On 2020-05-13 16:47:11, user Anita Bandrowski wrote:

      "Hi, we're trying to improve preprints using automated screening tools. Here's some stuff that our tools found. If we're right then you might want to look at your text, but if we're not then we'd love it if you could take a moment to reply and let us know so we can improve the way our tools work. Have a nice day.

      Specifically, your paper (DOI:10.1101/2020.02.01.929976) was checked for the presence of transparency criteria such as blinding, which may not be relevant to all papers, as well as research resources such as statistical software tools, cell lines, and open data.

      We did not detect information on sex as a biological variable, which is particularly important given known sex differences in COVID-19 (Wenham et al, 2020).

      We also screened for some additional NIH & journal rigor guidelines:<br /> IACUC/IRB: not detected ; randomization of experimental groups: not detected ; reduction of experimental bias by blinding: not detected ; analysis of sample size by power calculation: not detected .

      We found that you used the following key resources: cell lines (1), organisms (2). We recommend using RRIDs so that others can tell exactly what research resources you used. You can look up RRIDs at rrid.site

      We did not find a statement about open data. We also did not find a statement about open code. Researchers are encouraged to share open data when possible (see Nature blog).

      More specific comments and a list of suggested RRIDs can be found by opening the Hypothes.is window on this manuscript, direct link https://hyp.is/Y29BKI-sEeqb...<br /> References cited: https://tinyurl.com/y7fpsvzy"

    1. On 2020-05-13 16:45:45, user Anita Bandrowski wrote:

      "Hi, we're trying to improve preprints using automated screening tools. Here's some stuff that our tools found. If we're right then you might want to look at your text, but if we're not then we'd love it if you could take a moment to reply and let us know so we can improve the way our tools work. Have a nice day.

      Specifically, your paper (DOI:10.1101/2020.03.29.014290) was checked for the presence of transparency criteria such as blinding, which may not be relevant to all papers, as well as research resources such as statistical software tools, cell lines, and open data.

      We did not detect information on sex as a biological variable, which is particularly important given known sex differences in COVID-19 (Wenham et al, 2020).

      We also screened for some additional NIH & journal rigor guidelines:<br /> IACUC/IRB: not detected ; randomization of experimental groups: not detected ; reduction of experimental bias by blinding: not detected ; analysis of sample size by power calculation: not detected .

      We found that you used the following key resources: cell lines (1), software (6). We recommend using RRIDs so that others can tell exactly what research resources you used. You can look up RRIDs at rrid.site

      We did not find a statement about open data. We also did not find a statement about open code. Researchers are encouraged to share open data when possible (see Nature blog).

      More specific comments and a list of suggested RRIDs can be found by opening the Hypothes.is window on this manuscript, direct link https://hyp.is/J5kQBo-tEeq5...<br /> References cited: https://tinyurl.com/y7fpsvzy"

    1. On 2020-05-13 16:44:21, user Anita Bandrowski wrote:

      Hi, we're trying to improve preprints using automated screening tools. Here's some stuff that our tools found, and our team verified. If we're right then you might want to look at your text, but if we're not then we'd love it if you could take a moment to reply and let us know so we can improve the way our tools work. Have a nice day.

      Specifically, your paper (DOI:10.1101/2020.03.04.976662) was checked for the presence of transparency criteria such as blinding, which may not be relevant to all papers, as well as research resources such as statistical software tools, cell lines, and open data.

      We did not detect information on sex as a biological variable, which is particularly important given known sex differences in COVID-19 (Wenham et al, 2020).

      We also screened for some additional NIH & journal rigor guidelines:<br /> IACUC/IRB: not detected ; randomization of experimental groups: not detected ; reduction of experimental bias by blinding: not detected ; analysis of sample size by power calculation: not detected .

      We found that you used the following key resources: cell lines (1), software (8). We recommend using RRIDs so that others can tell exactly what research resources you used. You can look up RRIDs at rrid.site

      We did not find a statement about open data. We also did not find a statement about open code. Researchers are encouraged to share open data when possible (see Nature blog).

      More specific comments and a list of suggested RRIDs can be found by opening the Hypothes.is window on this manuscript, direct link https://hyp.is/sgFLao-sEeqy...<br /> References cited: https://tinyurl.com/y7fpsvzy"

    1. On 2020-05-12 20:34:06, user SciScore Test wrote:

      Hi, we're trying to improve preprints using automated screening tools. Here's some stuff that our tools found. If we're right then you might want to look at your text, but if we're not then we'd love it if you could take a moment to reply and let us know so we can improve the way our tools work. Have a nice day. Specifically, your paper (DOI:10.1101/2020.03.29.013490) was checked for the presence of transparency criteria such as blinding, which may not be relevant to all papers, as well as research resources such as statistical software tools, cell lines, and open data.

      We did not detect information on sex as a biological variable, which is particularly important given known sex differences in COVID-19 (Wenham et al, 2020).

      We also screened for some additional NIH & journal rigor guidelines:<br /> IACUC/IRB: not detected ; randomization of experimental groups: not detected ; reduction of experimental bias by blinding: not detected ; analysis of sample size by power calculation: not detected.

      We found that you used the following key resources: antibodies (3), cell lines (1). We recommend using RRIDs so that others can tell exactly what research resources you used. You can look up RRIDs at rrid.site

      We did not find a statement about open data. We also did not find a statement about open code. Researchers are encouraged to share open data when possible (see Nature blog). We also found bar graphs of continuous data. We recommend replacing bar graphs with more informative graphics, as many different datasets can lead to the same bar graph. The actual data may suggest different conclusions from the summary statistics. For more information, please see Weissgerber et al (2015).

      More specific comments and a list of suggested RRIDs can be found by opening the Hypothes.is window on this manuscript, direct link https://hyp.is/JskapI-tEeqh...<br /> References cited: https://tinyurl.com/y7fpsvzy

    1. On 2020-05-12 02:02:46, user SciScore Test wrote:

      Hi, we're trying to improve preprints using automated screening tools. Here's some stuff that our tools found. If we're right then you might want to look at your text, but if we're not then we'd love it if you could take a moment to reply and let us know so we can improve the way our tools work. Have a nice day. Specifically, your paper (DOI:10.1101/2020.03.16.990317) was checked for the presence of transparency criteria such as blinding, which may not be relevant to all papers, as well as research resources such as statistical software tools, cell lines, and open data.

      We did not detect information on sex as a biological variable, which is particularly important given known sex differences in COVID-19 (Wenham et al, 2020).

      We also screened for some additional NIH & journal rigor guidelines:<br /> randomization of experimental groups: not detected ; reduction of experimental bias by blinding: not detected ; analysis of sample size by power calculation: not detected . IACUC/IRB: detected.

      We found that you used the following key resources: antibodies (1), cell lines (1), organisms (1). We recommend using RRIDs so that others can tell exactly what research resources you used. You can look up RRIDs at rrid.site

      We did not find a statement about open data. We also did not find a statement about open code. Researchers are encouraged to share open data when possible (see Nature blog).

      We also found bar graphs of continuous data. We recommend replacing bar graphs with more informative graphics, as many different datasets can lead to the same bar graph. The actual data may suggest different conclusions from the summary statistics. For more information, please see Weissgerber et al (2015).

      More specific comments and a list of suggested RRIDs can be found by opening the Hypothes.is window on this manuscript, direct link https://hyp.is/3BOPqI-sEeqk...<br /> References cited: https://tinyurl.com/y7fpsvzy

    1. On 2020-05-11 22:58:25, user Donna K. McCullough wrote:

      Hello. As part of a graduate program journal club, my class reviewed and critiqued your paper. We included comments on aspects of your article that would have enhanced our understanding of your work. The comments are below.

      Summary:

      The authors want to combine previously considered cell-fate networks with models of how methylation affects gene silencing to get a better overall picture of what can happen to a cell. (Cell-fate networks generally do not consider methylation, because the networks act through gene regulation, whereas methylation is a chromatin modification rather than something you can measure or predict from the expression levels of transcription factors.)

      Comments:

      In general, the background does a very good job giving the reader information that they need to understand this work. A bit more explanation on the BoA and why it cannot accurately be used to describe high dimension systems using just s0 would have been helpful but I managed without it. (Explain a bit more about why you needed to use the volumetric definition of BoA.)

      How does making this one assumption reduce the state space from 17 to 4? More detail would be useful to some readers.

      The equations at the bottom of section 2.1.1 are very thin and hard to read. Could you consider making the font bold here? Perhaps dividing a longer equation into two (breaking the line) would allow a larger font size.

      Some of the legends on the figures (like figure 8) are also very hard to see. The writing is thin and small so when printed it is almost impossible to read. Perhaps increasing the font size could help. Also, in general, figure legends are very short. Consider putting more information into legends because readers often focus on figures/legends to understand the paper.

      Do cells generally change their methylation/demethylation rates throughout their lives? Are the rates not fixed over the life of a cell? Some clarification of the methylation process could be useful for very inexperienced readers.

      BoAp in Figure 12 legend is written two different ways (in terms of capitalization). Is there a reason for that?

      I would have liked to get a bit more of an understanding of the current state of the literature. Has no one ever looked at methylation computationally, or is it specifically in cell determination networks that methylation parameters have been lacking? <br /> Here is another paper (https://journals.plos.org/p...) that looks at methylation as well. It may be useful for the introduction of your paper.

      Where is the evidence of BoA for all the states of a cell? Is it like a ratio where there is only 100% to draw from, or can the BoAs all increase? It is unclear how you quantified the BoA of S1 compared to S0. Do they add up to 1?
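      As an illustration of the question above, here is a minimal Python sketch of a volumetric BoA estimate on a toy system. It is not the paper's model: the one-dimensional drift, the sampled interval [-2, 2], the grid of initial conditions and all function names are assumptions made for illustration. Each initial condition is integrated forward and assigned to the stable state it reaches, so the BoA of a state is the fraction of the sampled volume that ends up there.

```python
def drift(x, a=0.5):
    """Toy bistable drift: stable states at x = -1 and x = +1, unstable point at x = a."""
    return -(x + 1.0) * (x - a) * (x - 1.0)


def settle(x0, a=0.5, dt=0.02, steps=1000):
    """Integrate dx/dt = drift(x) with explicit Euler and label the state reached."""
    x = x0
    for _ in range(steps):
        x += dt * drift(x, a)
    return "high" if x > a else "low"


def basin_fractions(lo=-2.0, hi=2.0, n=400, a=0.5):
    """Fraction of a regular grid of initial conditions that reaches each stable state."""
    counts = {"low": 0, "high": 0}
    width = (hi - lo) / n
    for i in range(n):
        x0 = lo + (i + 0.5) * width  # midpoint of the i-th grid cell
        counts[settle(x0, a)] += 1
    return {state: count / n for state, count in counts.items()}


fractions = basin_fractions()
print(fractions)
```

      In this toy case the "low" state attracts every initial condition below the unstable point, that is 62.5% of the sampled interval, and the two fractions sum to 1 by construction. Under this definition one BoA can only grow at the expense of another, as long as the sampled volume is fixed and every trajectory reaches one of the listed states.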

      For the equations on page 8, a table explaining what the parameters in all the equations mean would be helpful. Also, one of the equations is missing its equation number.

      The figures at the beginning of the paper were helpful to understand how the genes were connected but it may be useful to have the equations next to them.

      Concerning figures 5 and 6: why does the BoA axis not go to 100%? A horizontal cut-off line would be useful if there is one. Also, the many line colors in these figures are confusing. Does each individual parameter being tested get its own colored line? What do the colors mean? This is what the figure legend is supposed to tell us.

      Kinetic rates should have a dimension, generally, per unit of time.… Figure 7 legend says the values are kinetic rates but no dimensions for each rate are provided. Are these dimensionless parameters? Then they should not be called kinetic rates.

      A table of your parameter names, their units, and what they encompass would help the readability of this paper.

      Would epsilon-independent BoA always increase when the methylation state increases? Are there no instances where this would not be the case?

      I like that you run the simpler model and then also do a numerical analysis to see if that would hold up in a higher level model.

      It would be nice to see a comparison of a different type of model to see if yours is the best one to explain how methylation affects BoA of a cell’s state.

      Can your model be used to connect to real data? Can the model be used to make a prediction that could be tested experimentally? For example, you could make a prediction for a specific system where you could predict the impact of methylation/demethylation on BoA of a given state.

      Why is the programming code only available upon request? Is it possible to get it included in the public repositories, e.g., github?

      A table in the supplement listing all of the abbreviations used in the main text would help, so readers aren't constantly going back and forth in the text to remember what the abbreviations stand for.

    1. On 2020-05-07 07:53:00, user Wiep Klaas Smits wrote:

      This looks quite interesting. However, the paper does not seem to have any reference to code in a repo, or web implementation. This severely limits the uptake as a "tool" (that is suggested by giving the algorithm a name). I hope the authors will fix this.

    1. On 2020-05-04 21:38:47, user Daniel Himmelstein wrote:

      My review of version 2 of this preprint is available online. The TLDR summary is that the approach and utility of SciScore are important advances, however the lack of data and code availability for the study limits its impact.

    1. On 2020-05-03 20:52:07, user Jean-Yves BOULAY wrote:

      This paper may interest the author: this article investigates the molecular modules system proposed by Professor Sergei Petoukhov. This study describes numerous phenomena of symmetry in the distribution of the amino acids in the genetic code table. These phenomena consist of arithmetical arrangements of sets of module numbers and/or proton numbers, which are counted in each of the 20 amino acids used by the standard genetic code. These arithmetical phenomena also involve configurations of multiples of prime numbers.<br /> https://www.researchgate.ne...<br /> Some examples:<br /> Without the rebel group (see paper):<br /> In the genetic code table, the total module sum of columns 1 and 2 (126 + 112) and the total module sum of<br /> lines 3 and 4 (122 + 116) are identical! And the sum of the right chequered configuration is also the same.<br /> Also, the total module sum of columns 3 and 4 (114 + 158) and the total module sum of lines 1 and 2<br /> (144 + 128) are identical! And the total module sum of the second chequered configuration is also the same.<br /> https://uploads.disquscdn.c...

    1. On 2020-03-16 22:12:55, user Laura Sanchez wrote:

      Dear Yu and Petrick, this preprint was discussed in a lab meeting and we would like to offer the following for review. Thank you for posting this very interesting manuscript. Best, The Sanchez Lab:

      As frequent users of molecular networking tools which are based on tandem MS data, we enjoyed reviewing a method not commonly used in our lab. It is clear that reactomics networks may help to identify analogs of compounds that are not easily matched by MS2 networking, especially for molecules with poor fragmentation patterns. We are very interested to see the application of linking enzymes to disease states through metabolomics approaches. That said, we have also included a list of critiques and suggestions for the preprint.<br /> Major:

      • Networking tools incorporating mass shift and KEGG integration such as MetaNetter 2 (Cytoscape plugin) and MetaMapR (R package) exist. The manuscript could benefit from clarifying what makes this approach unique.

      • Overall, the manuscript assumes a high level of knowledge. There are processes that go unexplained, such as the full extent of pre-processing accomplished in CAMERA and RAMclust, and why the Pearson correlation coefficient threshold was set to 0.6. The pre-processing is extremely important for this technique to work due to the issues that were stated in the text. For example, how is the difference between an isotope and a PMD determined, and how do you differentiate in-source fragments and neutral losses from the PMDs reported?

      • As it stands, self-loops in figure 1 are confusing and add a lot of visual clutter. Based on the source code, it appears that retention time is taken into account when defining peaks, so we are assuming what’s happening is two peaks with the same accurate mass have different retention times. However, this should be explicitly stated within the text.

      • In “Source appointment of unknown compounds”, the second paragraph is very difficult to understand. It sounds as if all compounds plotted are carcinogenic, but then “carcinogenic compounds were not connected by high frequency PMDs”? This then doesn’t seem to align with the next sentence discussing the average degree of connection, for which the values of 8.1 and 2.3 are also not explained. The carcinogenic 1, 2A, and 2B could be briefly explained or perhaps just calling them level 1, level 2A etc. would help. In the same vein, the Figure 3 legend could benefit from changing labels from “Endogenous compound 2” to “Level 2 endogenous compound” or something else to clarify that it is referring to the toxicity level. Even then, it is not clear whether including that information is relevant to the figure.

      • The final example in “Biomarker Reactions” feels underwhelming in its current state. Even though it’s claimed to have a high significance, it is not convincing that the +2H mass shift is a significant biomarker when it’s such a common reaction. If this could be tied to a specific enzyme that has been implicated in lung cancer, or some other biological evidence to support this claim, it would be more convincing. Otherwise, it could be worth making networks based off of the biomarker metabolites found in the original publication e.g. NANA?

      • Figure 4 does not effectively communicate the significance of the data points as explained in the text; the plots look very similar to the naked eye. Perhaps there is a statistical visualization that would make the point more clearly.
      • Finally, it might be worth adding a discussion regarding the resolving power needed and limitations. The described workflow seems to depend heavily on the ability to accurately measure three decimal places. This will require a higher mass resolving power as the molecular weight of the analytes of interest increases. It is unclear right now what the targeted mass ranges were for the measured compounds and what resolving power may have been used. Given that this is an experimental variable users can designate on an Orbitrap, for instance, is there a recommended number or do the authors envision leaving this to the mass spectrometrist to define? Depending on the target audience this could greatly aid in implementation and general understanding. <br /> Minor:
      • The first sentence of the manuscript defines metabolomics using specifically an untargeted metabolomics definition. Some clarification of targeted vs untargeted or a focus on untargeted would make this clearer.

      • Methylation and oxidation are common artefacts of the extraction process and analysis of biological samples. These are unavoidable facts of metabolomics, but perhaps the authors could comment on accounting for sample degradation processes.

      • As a proof-of-concept, we would have liked to see orthogonal identification for some molecules to prove that the molecules are related. An exact mass is only a “Level 5” identification (doi/10.1021/es5002105).

      • The two matrices might be clarified by being presented as tables or figures with abbreviations as an element of explanation. For instance, adding in “Ethylnitronate (S1), Oxygen (S2)” would help clarify and remove the necessity for the sentence starting in “For KEGG reaction R00025..”. The following sentence could be made into the table caption.

      • Use of the phrase “Topological structure” in the section “PMD Network Analysis” is confusing when it’s used in proximity with a compound’s chemical structure. It is unclear whether stating that chemicals with similar topological structure have similar biological activity refers to the topological properties of the network representation or if it is referring to the chemical structure itself. The reference to Figure 1 does not clarify this and either way this is a statement that requires further experimental or literature support.

      • Figure 2 is very low resolution and the yellow and peach colors are hard to make out. Overall we’d suggest labeling edges with a text reference to the reaction being used, as opposed to color. Addition of the chemical structures for TBBPA and some of the analogs into this figure would help to visually make the point that what you are seeing is a representation of molecular analogs. We also offer that the node color could be based on what has been confirmed in the previous study and what was newly annotated by this analysis, to more clearly make the point that there is new information to be gleaned by running this analysis.

      • Would it be possible to re-examine tandem data collected from the original paper to match fragmentation and confirm that the secondary network is related to the confirmed network?
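
      A rough numerical sketch of two of the points raised above (the +2H mass shift and the reliance on three-decimal mass accuracy), offered only as an illustration and not as the authors' implementation. The function names, the example peak list, and the 0.001 Da tolerance are our own assumptions; the only physical constant used is the monoisotopic mass of hydrogen (about 1.007825 Da).

```python
# Illustrative sketch only; not the authors' code.
H_MONOISOTOPIC = 1.007825      # Da, monoisotopic mass of 1H
PMD_2H = 2 * H_MONOISOTOPIC    # paired mass distance for a +2H shift


def find_pmd_pairs(masses, target_pmd, tol=0.001):
    """Return (lighter, heavier) pairs whose mass difference matches target_pmd within tol (Da)."""
    ordered = sorted(masses)
    pairs = []
    for i, low in enumerate(ordered):
        for high in ordered[i + 1:]:
            if abs((high - low) - target_pmd) <= tol:
                pairs.append((low, high))
    return pairs


def required_resolving_power(mz, delta_m=0.001):
    """Resolving power m / delta_m needed to separate two peaks delta_m apart at a given m/z."""
    return mz / delta_m


# Hypothetical peak list: the first two masses differ by exactly +2H.
peaks = [300.1000, 300.1000 + PMD_2H, 450.2000]
print(find_pmd_pairs(peaks, PMD_2H))
print(required_resolving_power(200.0), required_resolving_power(800.0))
```

      Under this simple definition the required resolving power grows linearly with m/z, which is why we ask about the targeted mass range and instrument settings.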
    1. On 2020-03-11 03:02:04, user V Zenkov wrote:

      Summary of the paper:

      The authors propose a new deep learning framework, NetFLICS-CR, which speeds up hyperspectral lifetime imaging. It is an extension of a previous product, NetFLICS (from https://arxiv.org/abs/1711...). Both strategies use single-pixel imaging, which is typically based on an inverse problem that reconstructs 2D images from measurements featuring a series of patterns. They speed up the process using compressed sensing. NetFLICS is 4 orders of magnitude faster than the previous solution and requires fewer photons, but it has a compression ratio of 50% for 32x32 images. NetFLICS-CR improves on this by staying fast while working on 128x128 images. For perspective, NetFLICS-CR reduces the acquisition time from 2.5 hours at 50% compression to 3 minutes at 99% compression while using a single-pixel Hyperspectral Macroscopic Fluorescence Lifetime Imaging (HMFLI) system.

      Comments:

      • is there a reference for TVRecon? There is a summary in the supplement, perhaps it should be referenced in the text.

      • there should be a consistent number of iterations. Why are there sometimes only five? Some justification would be appreciated.

      • it would be very helpful to have the code on GitHub for reusability.

      • what are the limitations of this research?

      • Figures 1,2,3,4: The legends would benefit from having summary sentences and conclusions. (The current legends only guide the reader to what is in the figure.)

      • Figures 1,3: If the paper is printed in gray scale, the lines in the graphs look largely similar.

      • Figure 1b: What are the units? Also, why is mean absolute error (MAE) used instead of mean squared error (MSE)? Mean squared error is traditional, as far as I know.

      • Figure 1b: The top right of both graphs shows a "zoomed in" view of the bottom right rectangles. However, it is not immediately clear that that is what is happening because the rectangles have different proportions. One way to help would be to add numbers on the axes to show that the epoch number / MAE are "zoomed in".

      • Figure 1b: It could help to add the title "compression ratios" next to the compression ratios at the top of the figure.

      • Figure 1b: The legend could say "six different compression ratios" to be even clearer.

      • Page 2: The text below Figure 1b defines Compression Ratios as #acquired patterns / #full patterns, but the examples in the text are instead the ratio (#full patterns - #acquired patterns) / #full patterns. I believe the latter is used throughout the paper.

      • Figure 2b, especially the top graph: The legend may be confusing because all the lines seem to overlap, with the only visible color in the graph being yellow. This might be fixable with smaller points.

      • Only one metric was used (Structural Similarity Index Metric) in the paper, which could lead to bias. The optimization is then based on the same metric used to assess the performance of the model, which could be biased toward that metric. Having multiple metrics / measures of performance could help remove this possible bias.

      • Figure 3c: I would find 3 panels easier to read than the 3D graph.

      • Figure 3c: P and I/R are labeled backwards.

      • Figure 3c: The legend says NetFLICS, but it seems like it should say NetFLICS-CR. (This is confusing because NetFLICS is also a thing!) The legend says "per CR" and that would make more sense in the context of the paper.

      • Figure 4a: Next to the red images, why is External CCD turned 90 degrees to the left while other text in the figure is turned 90 degrees to the right? If all the text is turned in the same direction then it will be more consistent.

      • Figure 4b: TVRecon and NetFLICS-CR are not next to each other in the graph, which makes it difficult to compare them.

      • Figure 4b: Why is this the only figure with error bars?

      • There is no actual labeled conclusion section - the conclusion begins with "in conclusion" in the text. Having a set-off conclusion section would make it easier for someone to quickly read the conclusion.
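
      To illustrate the Page 2 comment about the two compression-ratio definitions, here is a small sketch. It assumes, as my own simplification, that a full basis for an N x N image has N*N patterns and that acquisition time scales linearly with the number of acquired patterns; the function names are hypothetical and not taken from the paper.

```python
def fraction_acquired(acquired, full):
    """Definition given below Figure 1b: #acquired patterns / #full patterns."""
    return acquired / full


def fraction_discarded(acquired, full):
    """Definition the examples appear to use: (#full - #acquired) / #full."""
    return (full - acquired) / full


full = 128 * 128                    # assumed full basis size for a 128x128 image
acquired_99 = round(full * 0.01)    # patterns kept at "99% compression"
print(acquired_99,
      fraction_acquired(acquired_99, full),
      fraction_discarded(acquired_99, full))

# Consistency check on the quoted timings, assuming time is proportional to
# the number of acquired patterns (50% kept versus 1% kept).
minutes_at_50 = 2.5 * 60
minutes_at_99 = minutes_at_50 * (0.01 / 0.50)
print(minutes_at_99)
```

      Under these assumptions, the quoted drop from 2.5 hours to about 3 minutes is what one would expect if "99%" refers to the fraction of patterns discarded, which is consistent with the second definition.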

    1. On 2020-03-05 15:16:47, user Wouter De Coster wrote:

      Dear authors,

      Your method looks very interesting, and I have some datasets in mind on which I would like to try this. Do you plan to make your code available?

      Thanks,<br /> Wouter

    1. On 2020-02-24 22:03:07, user Fraser Lab wrote:

      The authors propose a method that would automate atomic model building for near-atomic resolution cryo-EM maps of proteins for which no prior structural information is available. The proposed method is discussed in the context of segmenting cryo-EM maps of a multi-subunit protein complex in order to accurately identify individual subunits. The advantage of this method is that it allows for accurate model building of a multi-subunit complex when there are no structural models for individual subunits by leveraging evolutionary couplings (as a scoring criterion and as a validation metric). Overall, due to the lack of benchmarking against other methods and the limited test cases examined here, we are not yet sure whether this manuscript demonstrates that the method is robust or that it presents any advances over existing methods that do not rely on evolutionary couplings.

      The authors state that their proposed method will improve segmentation of cryo-EM maps, but there is little discussion of how their method addresses this task. The majority of the paper focuses on model building, not segmentation. On page 4, the authors outline several tasks involved in “tracing a chain in the EM density map of a multi-component complex… i) the map needs to be segmented into sub-maps of the individual components; ii) probable locations of amino acid residues need to be identified in the map; iii) these locations need to be assigned to the primary sequences of the components; iv) full atom models of the protein need to be constructed.” Of these tasks, only the first one is relevant to the proposed method. The other tasks, including “automatic map segmentation, automatic map sharpening, automatic interpretation of multiple chain types, and automatic application of reconstruction symmetry” have already been implemented by software packages, some of which this manuscript cites (Cit. 10 - Terwilliger et al 2018; Not cited - Terwilliger et al, 2019 doi: 10.1002/pro.3740 also see: “Automated Segmentation of Molecular Subunits in Electron Cryomicroscopy Density Maps” by Baker et al, 2006). It may very well be that this method is a major advance over these previous studies, but it is currently impossible to tell from the manuscript.

      The major flaw in what is presented is that the proposed method is not rigorously tested on multiple test datasets (e.g., those that are publicly available on EMPIAR or EMDB). Rather, the method is used to build a model into a map of the TssF protein, a subunit of the Type 6 Secretion System baseplate. However, the authors modify their method to skip steps 1 and 2 (including the map segmentation, supposedly a major strength of the method) due to the low resolution of the map. The motivation for using a Minimum Spanning Tree in the method is not well described, especially in the absence of significantly improved results relative to existing methods. This brings into question over what resolution range this method is appropriate. It is also concerning that the authors report building an atomistic model of TssF and TssG components, but do not include the model in their manuscript or a statement about PDB deposition.

      The authors acknowledge the necessity of sequence information from numerous protein homologs in order to accurately create a contact prediction map, one of the required inputs in their method. It would be informative for the manuscript to include benchmarking that demonstrates how many sequences or contact couplings are necessary for accurate segmentation. This can be achieved by plotting the local correlation per amino acid residue against a varying number of input sequences. This section could also be strengthened by a discussion in the manuscript of how sequence co-evolution information relies on sequence conservation and how this information may not be useful if homologs are highly divergent.

      Finally, we encourage the authors to make the code for their method publicly available so it may be more widely useful and transparently understood.

      Minor Points<br /> There is no comparison of the authors’ proposed method and the automation tools provided within Coot, yet the authors declare that Coot’s tools are “largely manual, time-consuming, and prone to subjectivity.” There is also no comparison to Rosetta or Phenix. For example, running: phenix.map_to_model emd_2513.map 4ci0.fasta.txt resolution=3.4 find_symmetry=True does a pretty good job tracing the majority of the chains (in about 3 hours on a laptop), but does not perform as well as what is potentially described here. Comparison would be helpful for demonstrating the proposed method’s added capabilities, if any, or establishing the method as a less or more computationally intensive alternative to Coot, Rosetta or Phenix.

      The Results section contains a great deal of detail that would be better laid out in the Methods section, not only for organizational purposes but also to make explicit what steps are intended to be routine. As written, the manuscript suggests a nontrivial number of judgement calls will be involved in applying the method, detracting from its usability by a wider audience.

      Figures<br /> Fig 3 - Predicted contact maps vs final contact maps<br /> This figure does not clearly compare predicted and final residue contacts.

      Fig 4 - Alignment and comparison with reference model<br /> It would be easier for readers to interpret comparisons if presented with individual chains. The reference model could be overlaid with the final model for easy visual comparison (similar to the visualization of multi-conformers).

      Fig 6 - Method applied to wedge complex of bacT6SS baseplate<br /> 6A - Can the authors compare their model to a deposited model (in addition to other benchmarks listed above)?

      We review non-anonymously, James Fraser, Iris Young, and Roberto Efrain Diaz (UCSF) and have posted this review on BioRxiv as a comment.

    1. On 2020-01-31 21:55:30, user Alex Crits-Christoph wrote:

      All four of the identified amino acid insertions are extremely short and are found in the genomes of many other organisms, not just HIV. In other words, the primary finding of this work is entirely a highly expected coincidence.

      All organisms contain a DNA code that has the genetic instructions for development, functioning, and growth - this is known as the "genome". You can imagine each genome as a book of instructions. What these authors did is look in the genome book of the 2019 novel coronavirus and identify 4 sets of letters that aren't found in the genome book of SARS, a related coronavirus. They then compared these letters to the genome book of HIV, and found some places where they looked somewhat similar - but not even identical. However, because these sets of letters are so short, they are often found in many genome books by chance - the way you might search for the phrase "can be there" in Google Books and find that thousands of books contain those words - but this is not an example of plagiarism.

      Note here: We call these sets of letters "insertions" because they are in one genome, but not in a close relative - "insertion" does not imply human interference or engineering - it is an evolutionary term and refers to a natural evolutionary mutation.

      Here are the four insertions:

      TNGTKR

      HKNNKS

      RSYLTPGDSSSG

      QTNSPRRA

      These four insertions are protein sequences, that are encoded by a DNA sequence (which you may know uses molecular "letters" of A, G, C, and T to encode for proteins, which uses 20 molecular amino acid "letters").

      You, dear reader, do not have to take anybody's word for it that these letters are a coincidence - you can do the bioinformatics yourself!

      If you would go to:<br /> https://blast.ncbi.nlm.nih....

      You will arrive at a search engine for these genome books, kind of like the Google of biology. Click on "Protein Blast", because we are going to search for these protein sequences.

      Under where it says "Enter accession number", you can paste any one of the four sequences above.

      And then you can hit the "BLAST" button at the bottom of the screen. In a few minutes you will get a set of results.

      Let's go through the results for the longest sequence, "RSYLTPGDSSSG", together.

      Under the "Description" field you can see resulting hits. The first hit you see is to "spike glycoprotein [Wuhan seafood market pneumonia virus]" - this is good, because we know that this sequence came from this genome. Under "Per. Id" you can see the similarity of this sequence to other hits - in this case, you can start by seeing that this sequence is also found in Bat coronavirus, so isn't actually novel at all! And there are many comparative hits that are as good as, or often better than, the HIV comparison.

      Let's then take a look at the second sequence, "HKNNKS", together.

      If you go through the same search process for this sequence and look again at the results, you can see hundreds of perfectly identical matches. Maybe you see Sipha flava - that's an aphid, or Tetrahymena - that's a ciliate. Drosophila is a fruit fly. Clearly this sequence is found in thousands of genomes.

      Fortunately, the search has a built in way of answering the question "How likely was this result to have occurred by chance?". It is called the E-value, or Expect Value - the number of times we'd expect to see this result purely by chance. As you can see here, many of the E-values listed on this page are greater than 7829 - so we'd have expected to see 7829 instances of matches like these completely by chance! This is not evidence for gene transfer or gene similarity - it's simply a coincidence. As you now search for the other insertions described by this paper, you'll see that all of them hit hundreds of other genomes simply by chance. It is no surprise at all that they could have matches with some similarity in the HIV genome.
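
      For readers who like to see the arithmetic, here is a back-of-the-envelope sketch of why short sequences match by chance. This is not how BLAST computes its E-value (BLAST works from alignment scores and a statistical model); the sketch simply assumes all 20 amino acids are equally likely and counts expected exact matches. The database size is a made-up round number chosen for illustration.

```python
def expected_exact_matches(peptide_length, database_residues):
    """Expected number of exact chance matches under uniform amino acid frequencies.

    Each position in the database starts an exact match with probability
    (1/20) ** peptide_length, so the expectation is roughly residues / 20**length.
    """
    return database_residues / 20 ** peptide_length


HYPOTHETICAL_DB_RESIDUES = 10 ** 11  # assumed database size, for illustration only

for peptide in ["TNGTKR", "HKNNKS", "QTNSPRRA", "RSYLTPGDSSSG"]:
    print(peptide, len(peptide),
          expected_exact_matches(len(peptide), HYPOTHETICAL_DB_RESIDUES))
```

      Under these assumptions a six-letter sequence is expected to turn up well over a thousand times by chance alone. Longer sequences match exactly far less often, so chance hits to them tend to be partial rather than identical.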

      Congratulations! You are now a more careful and proficient bioinformatician than the authors of this paper.

    1. On 2020-01-27 19:02:57, user Simon Frost wrote:

      Some quick comments:<br /> 1. Add acknowledgements to the people who generated the data.<br /> 2. Include the data as analysed in the manuscript - the data themselves are not accessible directly via the link, and are rapidly being updated.<br /> 3. Include the code used to fit the model.<br /> 4. Describe how confidence intervals were obtained.<br /> 5. Discuss how/why the estimates of R0 are higher than those obtained by other groups.

    1. On 2019-12-31 00:29:29, user Charles Warden wrote:

      Hi,

      Thank you for posting this pre-print.

      I looked into something somewhat similar in yeast, but this was predicted to occur more often in humans and flies.

      One measure by which that would be true is with EvoFold predictions. While I think I have encountered some issues currently finding the source code, you can still see the predictions as a track for hg19:

      https://genome.ucsc.edu/cgi...

      It looks like you mostly use thermodynamic measures, but I think comparing overlap with conserved co-evolution of sites may also be interesting (which is what the EvoFold hg19 track may help accomplish).

      There is some mention of human coding fRNAs in the original EvoFold paper (such as in the "New Coding fRNAs" section):

      https://journals.plos.org/p...

      That paper also mentions some recovery of some specific examples (like A-to-I RNA editing), which I don't think I saw in this paper (and recovery of those types of annotations may provide some useful validation).

      EvoFold was also able to identify a novel (non-coding) human gene (back in 2006):

      https://www.nature.com/arti...

      So, I think it may be worth looking into.

      Best Wishes,<br /> Charles

    1. On 2019-12-30 18:12:36, user Fraser Lab wrote:

      The major goal of this paper is to use diffuse scattering data to inform models of collective protein motions. This is a landmark paper that unites many disparate observations in the field and pushes the state of the art forward much more so than any paper since Wall et al, 1997 PNAS.

      Through careful data collection, the authors are able to separate Bragg and diffuse scattering. The major experimental advance over previous work is that their fine-scale analysis enables them to integrate diffuse halos surrounding the Bragg peaks. This data yields the observations needed to model lattice dynamics. They find that lattice dynamics explain a significant fraction of the diffuse scattering data. Nonetheless, the authors noticed residual B-factors and turned to internal protein motions to explain the remaining disorder, which leaves signals both around the Bragg peaks and in hazy streaks and clouds between them.

      To explain these residual features, they tested both normal mode analysis (NMA) and full molecular dynamics (MD). Furthermore, they were able to use Patterson analysis to choose between redundant NMA models, conquering an outstanding challenge in the field of macromolecular diffuse scattering. Surprisingly, the NMA model that accounts for lattice motions and internal protein motions matches the data better than a crystalline MD model. What does it mean for MD that a reduced representation fits better?

      Overall, the data collection and processing are extremely thorough. Opening up these analytical methods to the community is the next step - and publishing their code is the only essential revision we would request prior to publication.

      Despite our enthusiastically positive interpretation, we do have a few minor questions and requests for clarification:

      While examining the exponential decay in halos around the Bragg peaks, why are the 100 most intense peaks between 2 Å and 10 Å focused on? In Figure 2 it appears that there is a skew in the distribution of exponents toward a sharper decay (n > 2). How do the histograms look when more halos are sampled? Is it possible that this sharp decay could be explained by Bragg peaks that are leaking into adjacent voxels?
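
      To make the question about the exponent distribution concrete, here is a minimal sketch of how a decay exponent n could be estimated for a single halo. It assumes, as our reading rather than a description of the authors' procedure, that halo intensity falls off as a power law of the distance from the Bragg peak, I ~ d^(-n), and fits n by least squares in log-log space. The function name and the synthetic data are hypothetical.

```python
import math


def fit_decay_exponent(distances, intensities):
    """Estimate n in I ~ d**(-n) by ordinary least squares on log(I) versus log(d)."""
    xs = [math.log(d) for d in distances]
    ys = [math.log(i) for i in intensities]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # the slope is -n, so flip the sign


# Synthetic halo profile with a known exponent, used only to exercise the fit.
distances = [1.0, 2.0, 4.0, 8.0]
intensities = [d ** -2.5 for d in distances]
print(fit_decay_exponent(distances, intensities))
```

      If Bragg intensity leaks into the voxels nearest the peak, the innermost points would be inflated and a fit of this kind would return a larger apparent exponent, which is one way the skew toward n > 2 could arise.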

      The authors are rigorous and explicit in their modeling efforts and make impressive strides forward. Still, we are left with questions about these models. For refinement of the lattice dynamics model, a small fraction of halos were chosen. Why did the authors not use all the halos? Why was the resolution range of 2 Å to 2.5 Å chosen for refinement? Why does this resolution range differ from that used in the analysis of halo decay (2 Å to 10 Å)?

      As we commented above, we were surprised to see that a NMA model matched the diffuse intensities better than a crystalline MD model. We wonder whether incorporating the isotropic component of the diffuse scatter would alter this interpretation? Furthermore, since the authors scrupulously subtracted sources of isotropic background scatter, why was the remaining isotropic portion of diffuse scattering not used for refinement of the NMA and MD models?

      Using diffuse scattering data to distinguish between competing models of motion has been a longstanding challenge in the field of macromolecular diffuse scattering, and we are impressed with the authors’ work in this regard. This is really a breakthrough! We were surprised to see how subtle the effects of restraining domain motions were upon the ΔPDF in Figure S17. Can the authors comment on the statistical significance of this difference? What is the uncertainty in the Patterson map, and how does this play into the interpretation of the best model?

      We have no major stylistic recommendations. The figures are elegant and clearly represent the main points of the paper. Similarly, the text is clear and concise, with thorough expansion in the supplemental material.

      On a final note, this paper pushes the field forward, and we believe there is room for further speculation. A few areas to consider:<br /> How might crystallographers who encounter more mosaic Bragg peaks (these are some of the least mosaic crystals in existence!) separate the Bragg signal from the diffuse signal to analyze halos? <br /> In what ways can NMA models and MD be further improved to match diffuse scattering data? <br /> What complications might arise in crystals with more complex unit cells, and how can this be overcome? <br /> How do they reconcile the results of ref 18 with their analysis of the lattice dynamics (different systems obviously)?

      The authors have done an excellent job of carefully collecting data, thoroughly analyzing it, and clearly explaining their work. We think that digging into the questions above may add to the already substantial impact of this paper, and look forward to their replies. Nonetheless, we think this important paper is worthy of publication as is (noting the caveat of code release).

      We review non-anonymously, James Fraser and Alex Wolff (UCSF)

    1. On 2019-12-02 16:17:02, user stephens999 wrote:

      Review of Yurko et al, Matthew Stephens

      This paper applies recently-developed statistical methods<br /> ("adaptive p-value thresholding", AdaPT) to Genome-wide association studies (GWAS),<br /> with a particular focus on schizophrenia (SCZ).<br /> The key feature of AdaPT is that it facilitates analyses that control FDR<br /> while taking account of independent prior information ("covariates") on each test -- here,<br /> information on each SNP, such as its significance in studies of related phenotypes,<br /> and its status as an expression QTL in eQTL studies. The AdaPT framework is very flexible,<br /> and Yurko et al exploit this flexibility to use flexible models,<br /> specifically Gradient Boosted Trees, to model the effects of each covariate.<br /> The results show a substantial increase in power to detect significant associations,<br /> at least in the SCZ study.

      I like the idea, espoused and exemplified here,<br /> of applying these flexible FDR frameworks to GWAS. And I can see the potential<br /> for this type of analysis to have widespread use in genomics beyond the specific<br /> studies considered here. However, I do have one major concern --<br /> about the effects of linkage disequilibrium (LD) -- that I feel<br /> is not adequately addressed by the submitted manuscript.<br /> These issues need addressing because i) they<br /> potentially undermine some of the main conclusions about<br /> biological significance here; and ii) similar issues related to<br /> LD will arise in many genomic studies, and so they represent a potential<br /> barrier to these methods fulfilling their potential more generally in genomics.

      Major issue:

      My concern about LD comes from the strong clustering of discoveries<br /> (orange) in Figure 2, panels B and C, which seems likely to be due,<br /> at least in part, to LD among SNPs.<br /> (I would be happy to be corrected about this --<br /> it is hard to be sure from the plot.) The results on Chromosome 6, presumably near the MHC,<br /> are the most obvious cluster, but more generally the positive discoveries seem to be<br /> generally characterized by almost-vertical sets of points quite near to one another.<br /> This visual impression is strengthened by the result that the 843 discoveries in<br /> panel C apparently involve eQTLs for only 136 distinct genes (line 341).

      Let me emphasize that my concern is not about failure of FDR control for correlated tests:<br /> the authors do simulations to check for this, so let us assume that FDR is indeed controlled<br /> at the SNP level (although I note that this would not imply<br /> FDR control at the gene level).

      Rather, my concern is that, in the presence of LD, every actual "causal" association (ie<br /> an association between a SNP and phenotype due to the SNP affecting phenotype) will<br /> come with a cluster of "non-causal" associations at other nearby SNPs. These<br /> non-causal associations are real associations<br /> (ie they are present in the population, not only the sample), but do not add<br /> biological insight. For this reason,<br /> most GWAS studies only report one "lead" SNP in each cluster of<br /> associations. If indeed many of the associations being reported here are due to LD<br /> then it seems unreasonable to simply count the number of discoveries in each<br /> analysis as a measure of biological significance of the SNPs or covariates used.

      Eg for Fig 2 B,C the comparison of 843 vs 361 discoveries<br /> seems not helpful if the 843 are simply in LD with the original 361.<br /> It also potentially undermines subsequent attempts to glean biological insights<br /> from the associations (eg Fig 3).

      The simplest way to address this seems to be to perform more aggressive<br /> LD pruning of SNPs before running the analysis. This should maintain<br /> adequate FDR control, but would lose power<br /> as one does not know which SNP in each cluster to keep in advance.<br /> Unfortunately, simply pruning for LD post analysis seems like it would<br /> violate FDR guarantees. I am not sure whether pruning within the analysis<br /> (eg counting multiple SNPs in LD as just one finding when computing |R_t|<br /> and |A_t|) could work. Anyway, unless my impression from Fig 2 is mistaken,<br /> something needs to be done to address this issue.<br /> (In the longer term it could be interesting to investigate combining<br /> the covariate-based methods with ideas like the knock-off filter<br /> to deal with LD, but this clearly lies outside the scope of this paper.)
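
      The pre-analysis pruning suggested above can be sketched as follows. This is a minimal illustration only, not the procedure used in the manuscript: the r² threshold of 0.2, the window of 50 SNPs, and the toy genotype data are all assumptions chosen for the example. SNPs are visited in genomic order and kept only if they are not in strong LD with an already-kept neighbour, so the choice of which SNP to keep never looks at p-values.

```python
from statistics import fmean


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 codes)."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # monomorphic SNP: correlation undefined, treat as unlinked
    return (sxy * sxy) / (sxx * syy)


def ld_prune(genotypes, threshold=0.2, window=50):
    """Greedy LD pruning in genomic order; returns indices of retained SNPs.

    `threshold` and `window` (measured in SNP index positions) are
    illustrative assumptions, not values taken from the manuscript.
    """
    kept = []
    for i, g in enumerate(genotypes):
        recent = [j for j in kept if i - j <= window]
        if all(r_squared(g, genotypes[j]) < threshold for j in recent):
            kept.append(i)
    return kept


# Toy data: rows are SNPs, columns are individuals (hypothetical values).
genotypes = [
    [0, 1, 2, 1, 0, 2],  # SNP 0
    [0, 1, 2, 1, 0, 2],  # SNP 1: identical to SNP 0, so in perfect LD
    [1, 0, 1, 2, 1, 1],  # SNP 2: uncorrelated with SNP 0
]
kept = ld_prune(genotypes)
print("retained SNP indices:", kept)
```

      Because the rule uses only genotype correlations and position, it is fixed before any association testing, which is the property the comment above relies on for FDR control.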

      Other comments/ questions:

      • I find the "selective inference" label rather uninformative.<br /> I think "covariate-informed FDR" or "covariate-moderated FDR" better describes<br /> the analyses here; maybe there are other informative names you could use.

      • Why select the eQTLs prior to analysis,<br /> rather than treat eQTL status<br /> as a covariate and learn whether it is relevant, just like the other covariates?

      • l28 "fully explore" is maybe overly strong.

      • l47: can you be sure that the BD results are entirely independent samples<br /> (in both cases and controls?) Does it matter?

      • l73: "greater support" - greater than what?

      • l123: is this a numeric issue? Or is the z score really exactly 0?

      • Fig 1: EM algorithm seemed to come from nowhere here.<br /> And gradient boosting not mentioned? Also the "Select SNPs" box is unclear.<br /> What kind of selection?

      • l222: is there usually only one gene in the set? Or do you often see<br /> one SNP is an eQTL for multiple genes?

      • l234: Any particular reason to use WGCNA rather than carefully-curated<br /> gene sets or pathway databases?

      • The section to justify "only" 55% replication seemed excessive to me --<br /> you could just point out that incomplete power is entirely expected.<br /> I did not really understand the reference to Winner's curse, which relates<br /> more to effect sizes.

      • l382 although it is clear that once one reaches 100% power<br /> the addition of covariates cannot help, I'm not convinced that<br /> power is anywhere near high enough in GWAS to make this argument. Across<br /> GWAS we have seen that as sample sizes increase we find more and more associations<br /> and it seems unlikely that power is near to saturation.

      • Is intercept-only AdaPT similar to BH?

      • providing the code for the modified adaptMT is a good start,<br /> but code for the analyses performed here would also be necessary<br /> for readers to reproduce results.

    1. On 2019-11-14 22:42:18, user Jubin Rodriguez wrote:

      In my opinion, an incomplete tutorial; for example, where is the R code for generating the volcano plots? Also, it is important to show on the plot or as a separate heatmap the names of some of the top upregulated or downregulated genes in the dataset under study.

    1. On 2019-09-24 02:02:00, user Fraser Lab wrote:

      The major goal of this paper is to put electron density maps on an absolute scale. Ideally, this would rid the world of “sigma” scaling and allow for electron density contours to take on a meaning that could map between different datasets or even over the course of refinement. This is also something that has been attempted previously, most notably (and with obvious conflict of interest on our end) by Lang...Alber, PNAS, 2014. Other important papers that have similar elements include the computational analysis by Shapovalov and Dunbrack, Proteins, 2009 (which examines the relationship between density, atom-type, and B-factor; see Fig 4) and experimental work by Brian Matthews (Quillin PNAS 2004 and Liu PNAS 2006, reviewed in https://www.ncbi.nlm.nih.go...). What is exciting about this work is that it is a fresh start to the problem and it is optimistic that structural biologists and other users are eager for an “absolute scale”. However, our major reservation about this paper is that it fails to build on or incorporate some of the lessons of these papers:

      For example - we think they are downloading 2mFo-DFc maps, but fail to account for FOM weighting to get an absolute scale - see Matthews' work for a guide on how to do this. The F000 corrections they outline are missing the bulk solvent contribution - this is tricky and dealt with in the Lang/Alber paper. Their B-factor normalization scheme is difficult to follow and seems ad hoc, whereas the Dunbrack paper at least outlines a relationship to the physical meaning of B-factor to accomplish a similar normalization. Finally, when recalculating Fo-Fc maps (or mFo-DFc maps after accounting for FOM weighting), there is no need to normalize, as the map is already on an absolute scale when “volume” scaling is applied in phenix or (I recall) by default in REFMAC.

      Moreover, despite developing a method to convert electron density values into units of electrons, the examples are all based on comparisons within a map where the rank order of strength of voxels does not change. While we applaud their idealism to move the community, an absolute scale is just part of the move beyond sigma scaling; we also need to think about a “confidence” metric (the RAPID part of the Lang paper, the EDIA metric in Meyder et al 2017 that they did not really respond to in the previous review, or Beckers et al IUCrJ 2019 for an interesting alternative approach). We haven't reviewed the code, but it is really great that they have put their code up on GitHub and it appears well documented.

      Minor point: The authors switch between using “chain deviation fraction” “chain fraction”, “chain density ratio”, median chain deviation fraction, median chain density ratio, chain median, median of chain density ratio, etc...

      We review non-anonymously, James Fraser and Roberto Efrain Diaz (UCSF)

    1. On 2019-09-15 03:21:19, user Fraser Lab wrote:

      The major goal of this paper is to develop and benchmark a new metric, Q-score, which aims to quantify the resolvability of atoms, residues, and entire macromolecules in density maps emerging from electron microscopy. The major strength of the paper is the model-directed approach that allowed them to get local metrics for map quality and resolvability with residue- or even atom-scale resolution. Of course, as noted in the manuscript, the method is also applicable to X-ray maps, where the need is less urgent - because of the standard practice of refining B-factors, which have a straightforward physical interpretation in crystallography. B-factor refinement has less of an established grounding in EM, as the single particles themselves cannot have a true “B factor”. Several refinement packages refine a “B-factor”, which ideally would model the imprecision in alignment and genuine conformational differences between particles that cannot be classified away. For example, phenix.real\_space\_refine refines residue-level ADPs by default, and Rosetta and REFMAC can both refine atomic B factors. Regardless of the source of the target map (or X-ray structure factors), a refined B-factor is a function of the conformational variability of the atom, noise, modelling errors, and the refinement restraints. The lack of a thorough accounting of how Q-scores compare and contrast with refined B-factors is the major weakness of this manuscript. Our interpretation of the Q-score metric is that it largely accomplishes the same goal as the B-factor, replacing the process of fitting a Gaussian to the density around an atom with measuring the quality of fit of the Gaussian-like curve of the density to a reference Gaussian. It does so entirely in real space like Rosetta, whereas the Phenix and REFMAC calculations are performed in reciprocal space. 
      A particular strength of this kind of Gaussian-comparison approach is that the results should be largely unaffected by atom or residue ID for non-heavy atoms.

      It is not clear from reading the paper whether significant additional information is provided by the Q-score that is not captured by refined B factors. The authors could address this concern in a few ways. First, we would like to see correlation of calculated B factors with Q-score, both globally and in the case of refined atomic B factors or ADPs. Second, comparisons with other resolvability estimates would be informative, including local resolution estimation and alternate model-directed metrics such as the MDFF RMSF (Singharoy et al., 2016) and multimodel convergence RMSD (Herzik et al., 2019). That said, the metric is valuable and probably provides information that is not straightforward to abstract from the refined B-factors of deposited structures. It is also probably faster to recalculate Q-scores than to re-refine B-factors and the Q-score is likely less subject to the biases and restraints of individual refinement programs.

      We also identified a potential implementational weakness of the approach in examining the paper. The method by which the reference Gaussian is calculated in Equations 2-5 relies on the average and standard deviation of all voxel values in the map. Voxel values represent Coulombic potential, but the scalar affecting this relationship is highly variable in cryoEM, with very little consistency. Our concern is that both the mean and standard deviation of the map, which are used to generate the upper and lower bounds of the Gaussian, also may be strongly influenced by factors including masking and, in unmasked maps, the solvent content in the box. Since box sizes are chosen semi-arbitrarily in electron microscopy reconstruction, we are concerned that this selection of a reference Gaussian might result in less robust results when comparing maps that are not all generated with the same solvent content. We would be interested to see the results of comparisons of masked and unmasked maps, as well as maps of varying box sizes relative to the protein mass in the same resolution range, to confirm that the chosen reference Gaussian does not interfere with the ability to use Q-scores effectively in such cases. Similarly, we would expect that B-factor sharpening would have a dramatic effect on the Q-scores calculated for a given map, and would want to see to what degree the resolution vs Q-score correlation is altered by B-factor sharpening. This may be complicated by the incomplete reporting of whether sharpening was performed prior to map deposition in EMDB, so use of highly controlled data on their part may be preferable.

      This method could prove to be a valuable supplement to the local metrics available for assessing map and model quality. In our opinion, the ability of Q-scores to report on poor model fit is perhaps underemphasized in this manuscript. Deviations in Figure 5 might indicate poor fit, for example. In particular, we would be curious to see how well it would work in cases of model error, in contrast to its use for studying map quality. It would be valuable to determine whether incorrect or underdetermined models can be identified with this approach to measuring the local density around an atom. This tool seems able to give statistically useful results for single-atom and single-residue samples, and this could prove useful for identifying poorly modeled regions for further refinement. The preliminary data for this use case presented in the 2019 EM Model challenge ( http://model-compare.emdata... ) is promising, and this may be a worthwhile application of the Q-score in this manuscript or a future one. Conversely, in the absence of discussion of assessing model correctness, we would like to know how sensitive the Q-score metric will be to model errors - both minor displacements and larger registration errors are still relatively common in medium-resolution cryoEM, and ideally a tool used for assessment of local resolvability will be robust to such model errors. Because voxels that are closer to other atoms are excluded from Q-score calculations, we expect that registration errors in particular may lead to Q-score errors that are especially challenging to parse.

      Minor points:<br /> The sentence on lines 62-64 suggests that masking does the evaluation, rather than allowing the FSC to do so. <br /> The selection of 0.6Å as a standard reference standard deviation based on the reference 1.54Å structure may be problematic as structures get to even higher resolution and lower global B-factors - if the peak becomes sharper than that of the reference Gaussian, scores will decrease, despite the map becoming even more resolvable. Admittedly this is a problem on the distant horizon, as significant resolution improvements beyond 1.54 Å for novel samples may prove very difficult to achieve. However, the <br /> In lines 191 to 198, the authors speculate about the differential rate of dropoff of Q-score with resolution, and both suggestions they raise seem reasonably testable. First, the question of radiation damage might be more readily examined using reconstructions with differential dose, rather than number of particles. Second, the effects of local environment on disorder could be examined by quantifying Q-score as a function of solvent accessibility.<br /> In Table 1 on the right panel, entry 4, the Q-score reported is much larger (2.8) than any other reported score. Should this entry be 0.8?<br /> For the Q-score comparisons of solvents in X-ray and CryoEM maps, the authors argue that the X-ray map has better resolved waters. However, there are a number of alternative explanations. One is that X-ray maps typically have lower global B factors than EM maps at the same resolution. This effect would result in a global dampening of scores for cryoEM maps that could be counteracted by B-factor sharpening. <br /> The authors describe doing real space refinement on the solvent atoms in the X-ray map. It would be useful to report the effect this refinement had on the R/Rfree. If this real space refinement resulted in overfitting the atomic positions, the subsequent position and radial distribution comparisons may have been overfit as well.<br /> In general, comparisons of solvent positions between X-ray and CryoEM closely resemble discussion more broadly of solvent radial distribution in bulk vs ordered solvent, and some comparisons to that literature might help to place this comparison in context - does the solvent in cryoEM behave more like bulk solvent compared to the ordered solvent shells seen in crystallography? Does it behave more like solvent at room temperature or under cryogenic conditions?<br /> The authors suggest the use of Q-scores to validate solvent molecule placement by refinement strategies. However, as existing tools such as phenix largely place solvent atoms at Gaussian peaks in density, the Q-score may not provide orthogonal validation information. It may prove to be a useful estimate of confidence in this context, but this may require additional testing. Again, is there any information added over the individual B-factors refined for the solvent “O” atoms or are they well correlated?<br /> An alternative, but similar metric that is also highly correlated with B-factors and has similar approaches to Q-scores with regard to atom “ownership” of voxels is EDIA (Meyder et al., 2017) - this work should be contrasted and cited. Ideally, correlations between individual B-factors, EDIA and Q-scores should be calculated for some standard model/map pairs to ground the discussion of specific situations where the correlations break down. Another related preprint has also been posted recently, but we have not reviewed it in detail: https://www.biorxiv.org/con... <br /> The authors should be commended for their tutorial, open code, and vetting with multiple OSes. Bravo!
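
      The minor point above about peaks sharper than the reference Gaussian can be illustrated with a toy calculation. This is a sketch under stated assumptions, not the authors' implementation: we assume the score behaves like a Pearson correlation between a radial density profile and a reference Gaussian of width 0.6 Å, and we assume the profile is sampled at radii from 0 to 2 Å. The function names are hypothetical.

```python
import math

REFERENCE_SIGMA = 0.6  # reference width in Angstrom, as quoted in the review
RADII = [0.1 * i for i in range(21)]  # assumed sampling radii: 0.0 to 2.0 Angstrom


def gaussian_profile(sigma, radii=RADII):
    """Idealised radial density profile around an atom with the given width."""
    return [math.exp(-0.5 * (r / sigma) ** 2) for r in radii]


def pearson(u, v):
    """Pearson correlation of two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    du = [a - mu for a in u]
    dv = [b - mv for b in v]
    num = sum(a * b for a, b in zip(du, dv))
    den = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return num / den


def q_like(sigma):
    """Correlation-style score of a profile of width `sigma` against the reference."""
    return pearson(gaussian_profile(sigma), gaussian_profile(REFERENCE_SIGMA))


for sigma in (0.3, 0.6, 1.0):
    print(f"profile sigma = {sigma:.1f} A -> score = {q_like(sigma):.3f}")
```

      Under these assumptions the score peaks when the profile width matches the reference and falls off on both sides, so a profile sharper than the reference scores lower even though the atom is better resolved.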

      We review non-anonymously, James Fraser (UCSF) and Ben Barad (formerly UCSF, now Genentech, soon to be Scripps)

      References:<br /> Herzik, M.A., Jr, Fraser, J.S., and Lander, G.C. (2019). A Multi-model Approach to Assessing Local and Global Cryo-EM Map Quality. Structure 27, 344–358.e3, PMCID: PMC6365196.<br /> Meyder, A., Nittinger, E., Lange, G., Klein, R., and Rarey, M. (2017). Estimating Electron Density Support for Individual Atoms and Molecular Fragments in X-ray Structures. J. Chem. Inf. Model. 57, 2437–2447.<br /> Singharoy, A., Teo, I., McGreevy, R., Stone, J.E., Zhao, J., and Schulten, K. (2016). Molecular dynamics-based refinement and validation for sub-5 Å cryo-electron microscopy maps. Elife 5, PMCID: PMC4990421.

    1. On 2019-08-26 19:10:48, user Richard White III wrote:

      This pre-print doesn't acknowledge the previous work from which it came back in 2017.<br /> https://peerj.com/preprints...

      If one can take a pre-print and its code from one pre-print server, remove authors, and then not acknowledge the previous work with a citation, it puts the whole pre-printing process in question.

    1. On 2019-08-24 12:50:04, user WJR wrote:

      Regarding the paper, "Sex solves Haldane's Dilemma" (currently unpublished), by Donal A. Hickey and G. Brian Golding. The following comments concern the paper and its accompanying computer simulation. These comments arise primarily from reading the software code, and may be less obvious from reading the paper.

      SUMMARY:

      The paper needs clarifications and expanded discussion on key points. (1) The simulation is biologically unrealistic in ways that lend support to the paper's conclusions. (2) The simulation artificially (and completely) removes the advantages of asexuality, and also artificially decreases the disadvantages of sexuality. (3) The paper thereby reaches the (questionable) conclusion that sex provides faster evolution. The paper will need to clarify these matters, if it is to be successful.

      (a) SELECTIVE ADVANTAGE:

      The paper specifies that the beneficial alleles have a selective advantage of 0.02. However, the ambiguity of that wording might mislead readers. The authors ought to clarify explicitly that they mean a homozygote will have an advantage of 0.04.

      That is significant here, because that figure is much higher (between 4 and 40 times higher) than is typical of the textbooks/papers in this field. This high a selective advantage will need justification, especially since it is used for each of 100 separate alleles simultaneously.

      (b) STARTING FREQUENCY:

      The simulation begins with a cloned population of identical genomes, and initializes these by randomly creating beneficial alleles at each locus. The starting frequency of these is set to 0.05, (which is 1 out of 20). In other words, each individual, at each diploid locus, has nearly **a ten-percent chance** of possessing a beneficial allele. And this high starting frequency occurs at each of 100 loci simultaneously. This unusually favorable starting situation needs more justification in the paper.
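
      The "nearly ten-percent" figure above follows from the starting frequency. A minimal check, assuming the two gene copies at a diploid locus are assigned independently (the function name is ours, not from the simulation code):

```python
def carrier_probability(p, ploidy=2):
    """Probability of carrying at least one copy of an allele of frequency p,
    assuming each of the `ploidy` gene copies is drawn independently."""
    return 1.0 - (1.0 - p) ** ploidy


START_FREQ = 0.05  # starting allele frequency quoted in the comment
N_LOCI = 100       # number of loci quoted in the comment

per_locus = carrier_probability(START_FREQ)
# Mean number of beneficial allele copies per diploid individual at the start.
expected_copies = 2 * N_LOCI * START_FREQ

print(f"P(at least one beneficial copy at a locus) = {per_locus:.4f}")
print(f"Expected beneficial copies per individual  = {expected_copies:.1f}")
```

      The per-locus value is 1 - 0.95², and with 100 loci each individual starts with about ten beneficial copies on average.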

      (c) RANDOM GENETIC DRIFT:

      Sexuality uses a randomized recombination of alleles, while asexuality does not. Because of that, a sexual species experiences more random genetic drift than does an otherwise equivalent asexual species. And this excess genetic drift often eliminates beneficial alleles. These tend to be randomly eliminated when they are yet few in number. In a sexual population, this random genetic drift is like extra genetic 'noise', that can push a rare beneficial allele into extinction.

      In an extremely large sexual population, a newly-minted beneficial allele will succeed with a probability of only about 2*s. For example, an allele with a typical selection coefficient, s=0.01, will be eliminated 98 times out of a hundred. (For s=0.001, it is eliminated 998 times out of a thousand.) The situation is worse for smaller population sizes, because genetic drift is stronger there.

      In the above-described way, genetic drift is a disadvantage to sex. But the simulation minimizes that disadvantage by using a large population size (=100,000), together with high initial frequency (=0.05), together with high selection coefficients (s=0.04). This setup virtually guarantees that none of the beneficial alleles will be lost through this genetic drift. Indeed that is the case, as seen in the posted results of the simulation. This artificially benefits the sexual population in the simulation.
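
      The contrast between a single new mutation and the simulation's starting conditions can be made concrete with Kimura's diffusion approximation for fixation probability. This is a sketch under assumptions: additive selection (heterozygote advantage s, homozygote 2s), effective size equal to the census size of 100,000, and s=0.02 per allele copy as stated earlier in this comment. It is not taken from the simulation code.

```python
import math


def fixation_probability(s, n, p0):
    """Kimura's diffusion approximation for an additive beneficial allele.

    s  : selective advantage of the heterozygote (homozygote has 2s)
    n  : diploid effective population size (assumed equal to census size)
    p0 : starting allele frequency
    """
    if s == 0.0:
        return p0  # neutral allele: fixation probability equals its frequency
    # (1 - exp(-4 n s p0)) / (1 - exp(-4 n s)), written with expm1 for accuracy.
    return math.expm1(-4.0 * n * s * p0) / math.expm1(-4.0 * n * s)


N = 100_000

# A single new mutation starts at frequency 1/(2N): roughly 2s.
single_copy = fixation_probability(0.01, N, 1.0 / (2 * N))

# The simulation's starting conditions as described in the comment.
simulated = fixation_probability(0.02, N, 0.05)

print(f"new mutation, s=0.01:       P(fix) = {single_copy:.4f}")
print(f"start at p=0.05, s=0.02:    P(fix) = {simulated:.6f}")
```

      With a starting frequency of 0.05 the exponent 4·N·s·p0 is in the hundreds, so loss by drift is negligible under this approximation, whereas a single new copy is lost in the large majority of cases.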

      (d) MUTATION RATE:

      The simulation uses an unusual manner of mutation, where harmful mutations are entirely disallowed. Instead, only a specific type of back-mutation is allowed; where a beneficial allele reverts back to the original allele, (which has a multiplicative fitness contribution of 1.0). In this way, the simulation artificially eliminates the problem of error catastrophe (also known as mutational meltdown), since fitness is automatically never allowed to fall below 1.0.

      Also, the back-mutations occur at an extremely low rate, given by:

      Mutation_rate_per_progeny = MUT_RATE * number_of_loci * 2 * p

      where: <br /> MUT_RATE=1.0e-08, given as a mutation rate per gametic locus<br /> number_of_loci = {1, 2, 4, or 100}, <br /> p is the frequency of the beneficial alleles (which starts near 0.05 and ends near 1.0), <br /> the "2" is because each progeny is a diploid.

      The factor 'p' arises because the back-mutation merely converts an existing beneficial allele back to the original allele. Due to that handling, the mutation rate varies throughout the simulation; it starts low (for p=0.05), and slowly increases by a factor of twenty (for p=1.0). This varying mutation rate is peculiar.

      These back-mutations are the *only* mutations throughout the simulation. (Note: The simulation is hard-coded for 400 generations, with a population size of 100,000 progeny each generation.) Yet the mutation rate is so low that this entire simulation will sometimes experience not even one mutation. This low rate of mutation is trivial, and can be ignored.

      This must be compared with recent measurements of the human mutation rate, which is around 100 new mutations per progeny. That is about 50 million times higher than the highest rate employed in the simulation. The paper needs much more justification of its handling of harmful mutation. An explicit attempt should be made. [Note: This issue runs far deeper than it first appears.]
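
      The arithmetic behind these mutation-rate statements can be checked directly from the formula and constants quoted above. The helper names are ours; holding p fixed for a whole run is a simplifying assumption that gives an upper bound when p=1.0.

```python
import math

MUT_RATE = 1.0e-08   # back-mutation rate per gametic locus (quoted above)
POP_SIZE = 100_000   # progeny per generation (quoted above)
GENERATIONS = 400    # hard-coded run length (quoted above)
HUMAN_RATE = 100.0   # new mutations per progeny, as cited in the comment


def rate_per_progeny(n_loci, p):
    """Mutation_rate_per_progeny = MUT_RATE * number_of_loci * 2 * p."""
    return MUT_RATE * n_loci * 2 * p


def expected_total_mutations(n_loci, p):
    """Expected mutations over a whole run, holding p fixed (an assumption)."""
    return rate_per_progeny(n_loci, p) * POP_SIZE * GENERATIONS


highest_rate = rate_per_progeny(100, 1.0)          # 100 loci, p = 1.0
ratio = HUMAN_RATE / highest_rate                  # human rate vs. simulation
one_locus_total = expected_total_mutations(1, 1.0)
# Poisson probability that a one-locus run sees no mutation at all.
p_no_mutation = math.exp(-one_locus_total)

print(f"highest per-progeny rate      = {highest_rate:.2e}")
print(f"human rate / highest rate     = {ratio:.2e}")
print(f"expected mutations (1 locus)  = {one_locus_total:.2f}")
print(f"P(no mutation in 1-locus run) = {p_no_mutation:.2f}")
```

      For the one-locus case the expected number of mutations over the entire run stays below one even at p=1.0, which is consistent with the statement that a run will sometimes experience no mutation at all.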

      The remaining items (below) address the simulation's handling of sexuality versus asexuality.

      (e) FECUNDITY and REPRODUCTION RATE:

      In the simulation of sexual reproduction, the FECUNDITY is set to 2. That is, for males the FECUNDITY is 2, and for females the FECUNDITY is 2. The authors ought to remind readers that such a female would need to produce 4 progeny. This arrangement correctly represents the fact that half the female's reproduction goes toward reproducing her mate's genetic material.

      However, in the simulation of asexuality, the FECUNDITY is likewise set to 2, which is a mistake. It should be 4. That way, the females produce 4 progeny in both cases (sexual versus asexual). We must compare apples to apples.

      Asexuality is twice as efficient at transmitting its genetic material into the next generation. But the simulation artificially cuts the asexual reproduction rate in half, thereby disallowing this advantage of asexuality.

      (f) The SLOWING-EFFECT versus STARTING FREQUENCY:

      A human-like population has around 23 chromosome pairs. There is no linkage between alleles on different chromosomes, and such alleles segregate independently. (Also, a human-like population has a somewhat higher recombination rate than used in the simulation.) Because of those things, a collection of, say, 100 different alleles, (randomly distributed across the genome), would expect little or no linkage between them. To a first approximation, they would segregate independently. And this produces a well-known disadvantage of sexual reproduction. That is, yes, sex can bring favored alleles together into one progeny, but it tears them apart just as effectively. (Some theorists describe sexual reproduction as a genetic shredding machine, each generation shredding and re-mixing the genomes.)

      By 'tearing apart' the beneficial combinations of alleles, sex slows evolution. This slowing-effect is strongest when the beneficial alleles are yet rare, at low frequencies. Then, they can only fleetingly exert their combined selective effect, before sexual reproduction separates them again. This is all standard theory.

      This slowing-effect doesn't happen in asexual populations. Once a beneficial combination of alleles is obtained, it is not shredded or separated. Rather, it is inherited, intact, into the next generations.

      This slowing-effect ordinarily places a sexual population at a disadvantage. But the simulation minimizes that disadvantage by starting the beneficial alleles at an extraordinarily high frequency (=0.05), thereby artificially avoiding the worst of the slowing-effect.

      (g) EPISTASIS:

      The above-described slowing-effect is even stronger when there is epistasis. (Epistasis occurs when a group of alleles have a combined selective effect that is much stronger than the sum of their effects taken individually.)

      And the simulation employs strong epistasis. (The epistasis in this simulation occurs through its use of a multiplicative-fitness model with high selection coefficients over many loci.)
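
      To show the size of the effect being described, the sketch below compares a multiplicative fitness model with an additive baseline. It assumes each beneficial allele copy contributes a factor of (1 + s) with s=0.02, which is our reading of the comment rather than a quotation of the simulation code; the function names are hypothetical.

```python
def multiplicative_fitness(copies, s):
    """Fitness when each beneficial allele copy multiplies fitness by (1 + s)."""
    return (1.0 + s) ** copies


def additive_fitness(copies, s):
    """Baseline in which each beneficial allele copy adds s to fitness."""
    return 1.0 + copies * s


S = 0.02              # advantage per allele copy (assumed from the comment)
MAX_COPIES = 2 * 100  # 100 diploid loci, all homozygous for the beneficial allele

for copies in (2, 20, MAX_COPIES):
    mult = multiplicative_fitness(copies, S)
    add = additive_fitness(copies, S)
    print(f"{copies:3d} copies: multiplicative = {mult:7.2f}, additive = {add:5.2f}")
```

      For a couple of copies the two models are nearly indistinguishable, but with all 200 copies present the multiplicative fitness is roughly ten times the additive one, which is the gap between the combined effect and the sum of individual effects that the comment refers to.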

      The evolutionary genetics literature regards the following as a robust and firm result: Sex-with-epistasis makes evolution slower than asexuality-with-epistasis. So how does the simulation minimize this slowing-effect? See below.

      (h) CHROMOSOME NUMBER and RECOMBINATION RATE:

      The paper seeks to challenge that prevailing view and show that sex speeds evolution. The paper aims to prove it via simulation. Unfortunately, the simulation attempts it by artificially decreasing one of the classic disadvantages of sex. It does that by reducing the chromosome number to 1, (and also by slightly reducing the recombination rate). This allows the substituting alleles to (unrealistically) experience linkages that would be unexpected in a human-like population. In the simulation, the substituting alleles are all on *one* chromosome; with various groupings effectively linked together as one; transmitted together into progeny as one; exerting their combined selective effect as one; generation after generation. And this situation makes them substitute faster. In other words, the simulation artificially increases the speed under sexuality by mimicking asexuality.

      For this simulation to effectively challenge the prevailing view and resolve this question, a more life-like chromosome number would be needed, (say, 23 to 25). This would be a reasonably simple change to the software. [For example, take the simulation's model with 4 loci, and increase it to 25 chromosomes. Easier still, just let all the alleles segregate independently. There would still be 100 alleles, and the computer run-time would be about the same.]

      (i) THE HORSE RACE:

      After initializing the simulation, no further beneficial alleles are added throughout the duration. You can think of this as lining up many race horses together at a starting gate, then after the start, no further horses are added to the race. In the simulation, (with all the horses lined up at the starting gate), all the beneficial alleles are guaranteed to eventually join-up together within the sexual individuals. But that is forbidden in an asexual population. That is the advantage of sexuality.

      But that setup artificially disallows a major advantage of asexuality. That is, new horses (i.e., new beneficial alleles) are added to the race throughout time, continuously, through mutation. Then an asexual species can more rapidly acquire those. How? As mentioned above, an asexual female's genome effectively has double the reproduction rate of its sexual peers. This allows a fit asexual female to more rapidly increase its sub-population size, and thereby (through having a larger size) more rapidly 'receive' its next beneficial mutation. (For example, if a sub-population is ten times larger, then that group receives its next beneficial mutation ten times sooner, and then the cycle begins anew.) This real advantage of asexuality is explicitly disallowed in the simulation.

      That fact undermines the legitimacy of the simulation for comparing sexual and asexual populations. Fixing this would require substantial alterations to the simulation and paper.

      CONCLUSION:

      There are at least eight distinct ways this simulation is biologically unrealistic, and these give the uncanny appearance of having been tuned to support the authors' conclusions. That is an undesirable result, as we all want a simulation we can rely on, and believe in. I encourage the authors to continue their work (with software upgrades and such), as I believe it can lead to a useful research tool.

    2. On 2019-08-21 18:59:50, user Donal Hickey wrote:

      Yes, the comment by WJR correctly surmises that the structure 'int' named .sex is a legacy variable. But no: while the code is mildly inefficient, it is not grossly inefficient, nor is it incorrect. Nor is there a great deal of overwriting being done. In fact the code is multipurpose, and for the uses in this paper there is no need for the .sex flag. Nor is the quicksort subroutine ever called.

      Because the code is multipurpose, the remnant of the .sex flag can be confusing. The comment by WJR that lines 233 to 252 are slightly inefficient is correct, but the comment that they are very inefficient is wrong. In this section of the code, each individual gets to produce FECUNDITY offspring. In this case FECUNDITY is 2. The individual's two offspring can come from two matings with different individuals or from two matings with a single other individual (but not from self-mating).

      The break statement at line 246 ensures that these matings occur exactly FECUNDITY times. As soon as the mating is accomplished, the break statement causes the program to exit the loop and continue with the next mating. The progeny from each mating (of parent n[i]) is stored in a new structure (m[.]) so that there is no overwriting done. It is common practice to create a second temporary structure/variable to momentarily store results.

      So no, there is no overwriting done except when the temporary structure/variable is copied back to the original. Every individual will mate at least FECUNDITY times.
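
      The pattern described in this reply (each parent produces FECUNDITY offspring, partners are drawn at random excluding self, and results go into a separate buffer) can be sketched as follows. This is a simplified illustration, not the code in sexual7.c: the tuple layout for individuals, the function name, and the way alleles are drawn are all assumptions.

```python
import random

FECUNDITY = 2  # offspring per parent, as stated in the reply


def reproduce(parents, rng):
    """Return a new list of offspring; `parents` is never modified.

    Each parent i contributes to exactly FECUNDITY offspring, each time
    paired with a randomly chosen partner other than itself.
    Individuals are (maternal, paternal) tuples, a simplification used
    only for this illustration. Requires at least two parents.
    """
    offspring = []  # plays the role of the temporary structure m[.]
    for i, parent in enumerate(parents):
        for _ in range(FECUNDITY):
            k = rng.randrange(len(parents))
            while k == i:  # redraw to rule out self-mating
                k = rng.randrange(len(parents))
            partner = parents[k]
            # One allele from each parent, chosen at random.
            offspring.append((rng.choice(parent), rng.choice(partner)))
    return offspring


rng = random.Random(1)
parents = [(f"a{i}", f"b{i}") for i in range(5)]
children = reproduce(parents, rng)
```

      Writing into a fresh list keeps the parental generation intact until every mating has been drawn, which is the role the reply assigns to m[.].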

      It is true that the three "if" statements that query the .sex variable are inefficient. It would be possible to rewrite the code so that it no longer has remnants of its multi-functionality, and this would slightly increase speed. But the program is already very fast, and there seems no reason at all to speed up a program that takes only minutes to run when doing so would reduce the functionality of the code.

      Also, I would guess that deleting the -g debug flag would increase speed a great deal more. But again there is no need.

      Brian

      Uncomment line 61 so that idum has a fixed value and you will reproducibly get the results below.

      Recompile

      run with breaks at these lines ...

      line 230: m[0]=n[0]
      line 231: k=88604
      line 244: m[0].maternal=n[88604].maternal
      line 246: break -> line 229

      line 230: m[1]=n[0]
      line 231: k=66537
      line 240: m[1].paternal=n[66537].maternal
      line 246: break -> line 229

      line 230: m[2]=n[1]
      line 231: k=55470
      line 240: m[2].paternal=n[55470].maternal
      line 246: break -> line 229

      line 230: m[3]=n[1]
      line 231: k=11152
      line 238: m[3].paternal=n[11152].paternal
      line 246: break -> line 229

      line 230: m[4]=n[2]

    3. On 2019-08-21 11:31:21, user WJR wrote:

      My goal here is that G.B. Golding's simulation ought be fashioned into a widely useable research tool. To do that, it needs to be: (1) computationally efficient (in terms of computer time and computer memory), and (2) widely understandable. The simulation can be improved along both those lines.

      Toward that goal, I offer the following comments:

      1) As noted in my previous post, the 'reproduction' portion of the simulation is exceedingly inefficient. (And awkward to understand.) In effect, the simulation produces a specific number of progeny (= FECUNDITY times number_of_parents), by producing EACH individual progeny through a random mating among the parents. If that is the goal, then this code (as written) is both inefficient and awkward to understand. Also, compared to my direct wording above, the paper's description is needlessly cryptic, "The number of offspring produced by each individual was determined by sampling a Poisson distribution with a mean of 2. All parents were diploid and had the same average fertility." (Further note about this simulation: For matings, a parent can switch back-and-forth between male and female. The software merely picks two random individuals and 'mates' them. This is not necessarily fatal to the simulation's validity, but it ought be clarified in the text.)

      2) The following items appear in the code, but are never used: RUNS, ITERATES, RECOMB_RATE, l, iter, poindev(), gammln(), and quicksort(). These can be eliminated, with no loss.

      3) The structure variable, '.sex', is not needed; is confusing; and increases the inefficiency of the simulation. The sexual and asexual versions of this simulation are already different, so nothing is gained by having this (inefficient) flag that indicates whether a specific individual is sexual versus asexual. They are ALL sexual or ALL asexual, and this is already accomplished by the two different versions of software.

      4) The parameter, MAX_SIZE, sets the size of various arrays large enough to accommodate the population size. It is currently set to 1 million, (for one-million individual progeny). However, this is nine- to ten-times larger than it needs to be, (as the size of the progeny population is only 100,000). This needlessly wastes computer memory space.

    4. On 2019-08-16 08:02:18, user WJR wrote:

      Bug Report:

      This (preprint) paper by Hickey and Golding relies on a software simulation. (Available at: https://github.com/gbgolding/evolutionSex)

      There is a bug in the currently posted version of that software. It occurs in the file "sexual7.c" (seen August 16, 2019). The bug strongly decreases the computational efficiency of the simulation, and slightly affects the results.

      In that file, there is a structure integer variable named '.sex', which is initialized to zero, and tested several times by IF-statements, but it is never changed. In other words, it does nothing. That is the first clue something is awry.

      (Note: For some reason, when I try to post the offending code here, it gets displayed as an unreadable mess. I don't know why. I tried placing the appropriate code formatting brackets around it. So, instead I must here describe where the bug occurs.)

      Lines 233 through 252 comprise a While-loop, which creates each progeny by a mating of some parent_i with some other parent. However, the loop then proceeds to OVERWRITE that progeny's allele data by a mating with EVERY OTHER parent. The original parent_i's allele data gets obliterated within that progeny, because it gets over-written many times. Each and every progeny is computed by mating each parent with EVERY OTHER parent. But only the last matings count, as the previous matings get over-written. This is exceedingly inefficient computationally. If the population size is n, then the computer time increases with n-squared, rather than n.
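
      The n-squared versus n claim can be illustrated by counting writes in two hypothetical variants of a partner-selection loop: one that stops at the first accepted partner, and one that keeps overwriting the offspring until the candidates are exhausted. Neither variant is taken from sexual7.c; which one the posted code corresponds to depends on whether its loop exits early.

```python
def count_assignments(n_parents: int, fecundity: int, early_exit: bool) -> int:
    """Count how many times an offspring slot is written.

    Hypothetical model: for every offspring, candidate partners are
    scanned in turn and each accepted candidate overwrites the slot.
    With early_exit the scan stops after the first write.
    """
    writes = 0
    for parent in range(n_parents):
        for _ in range(fecundity):
            for candidate in range(n_parents):
                if candidate == parent:
                    continue  # no self-mating
                writes += 1   # offspring slot (re)written
                if early_exit:
                    break
    return writes


linear = count_assignments(200, 2, early_exit=True)      # grows like n
quadratic = count_assignments(200, 2, early_exit=False)  # grows like n squared
```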

      The end result is that each progeny is indeed the result of a random pairing of parents, but not the ones the simulation-writers intended. Moreover, the simulation aims that each parent be involved in AT LEAST a minimum number of reproductive events (given in the code by "FECUNDITY"), but that goal is not achieved. Due to randomness, some parents can mate numerous times, while others don't mate at all -- contrary to the stated design of the simulation. There is a disparity between what the code intends to do, and what it actually does, and this can affect the results.

      I'm guessing the variable .sex is a vestigial remnant of code (now largely absent) that had originally matched each parent with exactly one other parent (i.e., for an obligate monogamy model). I imagine .sex was originally a flag, used to indicate that a given individual has (or has not) been used yet as a parent. This approach would be one of the simplest ways of guaranteeing that each parent has the fecundity claimed in the paper.

    1. On 2019-08-01 15:07:03, user stephens999 wrote:

      This interesting and impressive paper presents extensions, implementation and application of a recently-developed statistical methodology (the knockoff filter) to large GWAS (UK Biobank). The methods provide guaranteed control of False Discovery Rates when testing pre-specified contiguous groups of SNPs (or other variants). Importantly, the null hypothesis being tested here is not the commonly-used null that the group of SNPs is *marginally* unassociated with the trait; instead the null is that the group is *conditionally* unassociated with the trait given all other observed SNPs. This conditional test is in many ways more informative than conventional marginal tests because it ensures that a significant group cannot be explained by linkage disequilibrium (LD) with other measured SNPs outside the group. Thus the conditional test comes closer to identifying groups of potentially-causal SNPs than do conventional marginal tests.

      The paper is very well presented, and the results and comparisons with other methods seem generally appropriate and interesting. My main request is that the paper should better highlight the limitations of the method -- specifically, at high resolution ("fine-mapping") the need to confine tests to pre-specified contiguous groups of SNPs seems a clear disadvantage compared with existing fine-mapping methods. This is not to take away from the other important contributions of this work.

      Major Comment

      As mentioned above, the main limitation of the current implementation (and perhaps the whole framework?) is the requirement that groups of tested markers be both contiguous and pre-specified. At coarser resolutions, where the main goal is to identify genomic regions (conditionally) associated with the trait, these requirements are not a major limitation. However, at fine-scale resolutions, where one is trying to get down to the likely causal markers, these requirements become more bothersome. For example, suppose we have 4 SNPs, in order, A-B-C-D, and A and D are in very strong LD with each other (say LD of 1 for concreteness), but not in strong LD with B or C, and A is the causal SNP. Then the contiguity requirement of knockoffZoom will not allow it to refine the association beyond the entire group (A-D), even though in principle one could narrow it down further to SNPs A and D. Existing fine-mapping methods do not have this limitation and could report (A,D) as the set of potential causal markers. Further, even if the contiguity requirement were relaxed (e.g. to allow prespecified non-contiguous groups), the need to prespecify groups to be tested may still limit the resolution to which associations can be refined.
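
      The A-B-C-D example can be written out as a toy calculation. The sketch below only encodes the contiguity constraint described above; it performs no statistical test, and the function name is hypothetical.

```python
def smallest_contiguous_group(snp_order, candidates):
    """Smallest run of adjacent SNPs that contains every candidate."""
    positions = [snp_order.index(s) for s in candidates]
    return snp_order[min(positions): max(positions) + 1]


snp_order = ["A", "B", "C", "D"]
# A is causal and D is in perfect LD with it, so the data cannot
# tell the two apart; B and C are not in strong LD with either.
indistinguishable = {"A", "D"}

contiguous = smallest_contiguous_group(snp_order, indistinguishable)
unrestricted = sorted(indistinguishable)
```

      Here `contiguous` contains all four SNPs while `unrestricted` contains only A and D, which is the gap in resolution the comment describes.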

      For this reason I think it is premature to claim "...KnockoffZoom unifies locus discovery and fine-mapping into a coherent statistical framework" (p15). Specifically, I think its abilities to solve the fine-mapping problem are not yet adequate to make this claim, and that studies interested in fine mapping will continue to want to use existing Bayesian fine-mapping methods like SUSIE (quite possibly as a complement to knockoffZoom) to refine associations as far as possible. In any case, the limited resolution that comes with testing contiguous pre-specified marker groups should be better highlighted in the text.

      Besides better highlighting this limitation in text, the comparisons with fine-mapping methods should be extended to quantify the effect. Currently the comparisons show the "width" of region identified by each method (Figure 4, right panel). However, fine-mapping methods do not strictly identify a region but a set of SNPs, so the figure should also compare the number of SNPs identified by each method. It would also be informative to show the minimum pairwise LD between the markers identified -- does knockoffZoom sometimes report markers not in high LD with one another due to the contiguity constraint? (Incidentally, the y axis on this figure is too large to see the interesting region, which for fine-mapping is <0.1 Mb. Getting to a region of 0.5 Mb is not really fine mapping in my opinion.)

      It would also be interesting to get the authors' perspective on how easy or difficult it might be for the contiguity requirement to be relaxed in the future. (Also the pre-specification requirement, although this seems more fundamental.)

      Other main comments

      • Some aspects of Table 1 are surprising to me. E.g. the number of bmi findings going from 24 -> 0 -> 15 as resolution increases. Shouldn't power increase as larger groups are tested? (I realize there are fewer tests as groups get bigger...so this is not a simple issue.) The hypothyroidism results are perhaps even weirder. Can you provide any intuitive explanation for why this might occur? Is it simply chance, since the knockoff procedure can produce different results if run multiple times?

      • The introduction criticizes the two-step approach as "not fully satisfactory because it requires switching models and assumptions in the middle of the analysis, obfuscating the interpretation of the findings and possibly invalidating type-I error guarantees." However, from Table 1 (see above comment), performing separate analyses at different resolutions appears to have similar problems regarding interpretation. The method that avoids "floating" discoveries at high resolution (Supplement S1B) seems to address this, but at a cost in power. What is that cost in power for the analyses here? How does Table 1 look if you apply that method (with or without the 1.93 factor mentioned in the supplement)?

      • As I understand it, the output at each resolution depends on a single generation of the knockoff variables, and so the method will report different significant results each time it is run. Is this correct? If so, how different/similar are the results if you run things a second time with another knockoff realization? (It could suffice to do one trait twice to illustrate this.)

      • The notation (X,Xtilde) suggests that the knockoffs are always included after the real variables in the input file to the lasso/bigsnpr. In principle the location of the knockoffs in the input file should not matter when a convex method like lasso is being applied (with the exception of variables with LD=1, which is already dealt with here as a special case). However, if one were to replace the lasso with non-convex methods, the non-random order of the markers into the method could lead to failure to control FDR (e.g. if the method has a bias towards choosing columns earlier in the list of covariates). Further, even for convex methods, there is some concern that numerical issues could arise to create this bias. As a safety check I suggest running the method with randomly ordered columns, or if that is too much of a pain, simply reversing to (Xtilde,X) to check it makes no difference.

      • I found the references to the Li-Stephens model vs fastPHASE model in the Supplement confusing. The description of Li-Stephens as "This HMM describes the distribution of genotypes as a patchwork of latent ancestral motifs" is incorrect - this describes the fastPHASE model. The Li-Stephens model describes each haplotype as a patchwork of other observed haplotypes, not latent motifs. As I understand the text all the models here are essentially fastPHASE models not Li-Stephens models. Please clarify.

      • The results in the supplement that reduce forward-backward calculations to O(K) and O(K^2) look similar to results that are already well established (e.g. Fearnhead and Donnelly, 2001, Estimating recombination rates from population genetic data, Genetics). Is there anything new here?

      • Please provide more details about the comparisons with other methods, including versions of software and the settings used. Ideally the code used to run the comparisons with other methods should be made available - even without documentation this can be invaluable for others to see what was done.
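
      The column-order safety check suggested in the (X,Xtilde) comment above could be wired up roughly as follows. This is a minimal sketch: `fit` stands in for whatever routine returns one importance score per column and is a hypothetical placeholder, not a bigsnpr or knockoffZoom function, and `toy_fit` exists only to make the example runnable.

```python
import random


def scores_with_shuffled_columns(columns, fit, rng):
    """Fit on randomly ordered columns, then report scores in the original order.

    columns: list of column vectors (originals followed by knockoffs).
    fit: hypothetical callable returning one score per input column.
    """
    order = list(range(len(columns)))
    rng.shuffle(order)  # order[j] = original index of shuffled column j
    shuffled = [columns[i] for i in order]
    shuffled_scores = fit(shuffled)
    restored = [0.0] * len(columns)
    for j, original_index in enumerate(order):
        restored[original_index] = shuffled_scores[j]
    return restored


def toy_fit(cols):
    # Stand-in scorer that ignores column position: mean absolute value.
    return [sum(abs(v) for v in col) / len(col) for col in cols]


cols = [[1.0, -2.0], [0.5, 0.5], [3.0, 0.0], [0.0, 0.1]]
baseline = toy_fit(cols)
shuffled_result = scores_with_shuffled_columns(cols, toy_fit, random.Random(0))
```

      An order-insensitive method should return the same scores either way, up to numerical tolerance; a systematic advantage for columns that appear earlier would be the warning sign.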

      Other comments/questions:

      • Getting the method working on problems of UK biobank scale is impressive, even though limited to "only" 591k SNPs. Would applying to ~50 million SNPs be feasible, and require about 100 times the computation? For coarse resolution it might not matter much to include the extra SNPs, but for fine-mapping it ultimately seems important to include as many SNPs as possible.

      • The paper discards tests where the knockoffs are very highly correlated with the original variables (which makes sense as they have no power). For intuition I would be interested to see the distribution of the correlation of knockoffs with the original variables (say at the finest resolution).

      • What is the MAF distribution of the variants analyzed here? Does the method work equally well for common vs rare variants? (I ask because the LD models may tend to work best for common variants.)

      • It would perhaps be helpful to cite (and contrast with) previous work that attempts to control error rates of conditional tests of groups of variables (e.g. work on hierarchical testing by Yekutieli, Meinshausen, Bühlmann etc).

      Minor:

      p6: "by likelihood of the trait" -> "in distribution of the trait"

      p11: "its intrinsic limitations discussed above" - I do not see where they were discussed.

      p12: "As the resolution increases, we report fewer findings" - not always!

      Table 1: I suggest giving resolution in terms of kb instead of Mb. Is 0.000 down to single SNP resolution?

      p16: "possible *to* construct"

      refs: markov -> Markov ; uk -> UK

    1. On 2019-07-18 20:58:17, user Morten S. Dueholm wrote:

      Dear Robin,

      We appreciate very much that you have taken the time to review our manuscript. We are happy that you can see the potential of the approach, and we appreciate your comments and suggestions, some of which we will surely implement in our manuscript before submission. We have provided comments on the points raised in your review below in italics. If you have further comments or questions, you are very welcome to get back to us.

      Best regards,
      Morten and Per

      AutoTax BioRxiv Review, Jul 03, 2019, Robin Rohwer

      This manuscript describes a new method, AutoTax, to create databases out of full-length 16S rRNA gene sequences that can be used for taxonomy assignment of 16S rRNA gene amplicon data. This is a valuable contribution, because availability of full-length 16S rRNA gene sequences is increasing, and ecosystem-specific databases improve classification of amplicon data. Combining high-throughput generation of full-length 16S rRNA gene references with high-throughput database creation would be a major advancement. This manuscript also describes the application of AutoTax databases to the design of FISH primers, but I will focus my review on taxonomy assignment because that is my area of expertise.

      I believe this tool will be a valuable resource, which is why I have taken the time to review it carefully. I hope these comments will be addressed in the published version of this paper. I have three main concerns with this manuscript.

      First, it needs more detailed descriptions of the underlying methods that AutoTax uses for alignment and identity calculations. The manuscript refers readers to a github repo and supplemental documents for details, but this manuscript is introducing a new method so the algorithms/tools it employs should be clear in the main text.

      We have elaborated on the methods in the main manuscript; however, we keep the detailed information on settings etc. in the supplementary material, as we believe that this information is only relevant for a small subset of the target audience and would draw focus away from the practical aspects.

      Second, the taxon levels in the AutoTax database are assigned based on identity thresholds. The authors acknowledge briefly that this means their AutoTax databases lack phylogenetic information, but this is a major limitation so they should justify in more detail why they chose this method and why the resulting taxon names are still meaningful. They also cite a 2018 paper by Edgar (https://peerj.com/articles/...) several times to support other claims, but do not mention that the main finding of this paper is that percent identity is a poor predictor of taxon level.

      Many of our sequences have close relatives in the SILVA database and obtain their taxonomy directly from this reference database. These taxa are therefore supported by phylogenetic information. The de novo taxa are constructed based on sequence identity alone. This is a simple solution which, unlike phylogenetics, can be reproduced. Phylogenetics is especially problematic with the large datasets that will become the standard in the future, as heuristic approaches are required for processing the data.

      The conclusion that "Percent identity is a poor predictor of taxon levels" in the 2018 paper by Edgar relates to V4 amplicons, which are known to have sparse phylogenetic information. That is why, in our studies, we only cluster full-length 16S rRNA sequences, for which taxon thresholds are supported by statistics.
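
      To illustrate what threshold-based rank assignment looks like in practice, here is a minimal sketch. The cutoffs are the full-length 16S rRNA gene values commonly quoted from Yarza et al. (2014), cited further down in this reply; they are entered here from memory as assumptions and should be checked against that paper. The function is hypothetical and is not necessarily the rule AutoTax applies.

```python
# Illustrative identity thresholds (percent) for full-length 16S rRNA genes,
# as commonly quoted from Yarza et al. (2014). Assumed values; verify before use.
RANK_THRESHOLDS = [
    ("species", 98.7),
    ("genus", 94.5),
    ("family", 86.5),
    ("order", 82.0),
    ("class", 78.5),
    ("phylum", 75.0),
]


def lowest_shared_rank(identity_percent: float):
    """Return the most specific rank whose threshold the identity meets, else None."""
    for rank, threshold in RANK_THRESHOLDS:
        if identity_percent >= threshold:
            return rank
    return None


example_rank = lowest_shared_rank(95.0)
```

      With these cutoffs, a pair of sequences at 95.0% identity would be placed in the same genus but not the same species.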

      Third, the importance of AutoTax would be more clear if the manuscript discussed how it fits in with similar work on improving taxonomy classifications. Many previous studies (examples below) have also found improvements in taxonomy assignment when databases are improved, yet here this is presented as a novel finding. Many other methods also exist for creating custom databases, yet they are also not discussed. I believe AutoTax is a novel contribution, but its value will only be clear and meaningful in the context of previous work.

      We do not consider “better databases provide better classifications” to be a novelty. This is already clear from our previous versions of the MiDAS database (and other databases), which is a manually curated version of the SILVA database. However, we will include some of the references in our manuscript as support for the need for improved taxonomy.

      Specific Comments:

      line 68- Great explanation of why taxonomy is needed for cross-study comparisons, there is a common confusion that ASV's have solved the problem.

      line 102- Since you are introducing an alternative tool to build databases, you should include more detail on the existing ways that databases are created and the comparative improvements and drawbacks when using AutoTax. (Perhaps in discussion instead of right here in introduction.) For example, how does AutoTax compare to the results from SINA (https://academic.oup.com/bi...) and RAxML (https://academic.oup.com/bi...) or FastTree (https://journals.plos.org/p...), or to manual curation using Arb (https://academic.oup.com/na...)?

      The references above do not link to tools that can be used to build databases; instead they represent tools that can be used to align/classify sequences (SINA) or infer the phylogenetic relationship of sequences (RAxML, FastTree, arb). We have previously used SINA for sequence alignment and calculation of percent identity; however, the algorithms in SINA are not suited for this purpose (see discussion with the author: https://github.com/epruesse...), which is why we decided to use usearch.

      line 109- Instead of (or in addition to) citing the original field guide (your ref #22), please cite TaxAss as the reference for the FreshTrain. The original Newton citation is appropriate when referencing the phylogeny, vocabulary, or ecology of the included Freshwater bacteria, but the TaxAss citation is appropriate when referencing the current updated version of the database that can be used for taxonomy assignment. (https://msphere.asm.org/con...)

      We will add this reference along with the one for the original FreshTrain database.

      line 123, line 384- As you re-make your custom AutoTax taxonomy with a new version of SILVA, how does AutoTax prevent changes to the ecosystem-specific names in the first AutoTax version? It is clear in line 384 that you can add more custom sequences without messing up the existing taxonomy, but what if some of your sequences are added to SILVA itself? Couldn't that then shift all of your ESV centroids? And if you avoid that shift by leaving the duplicated sequence in the custom set, wouldn't AutoTax then ignore any added phylogenetic information that came from the additional SILVA curation, since preference is given to ESV centroids in the case of conflicts?

      Firstly, we do not include any of the sequences from SILVA in our ecosystem-specific database. When our own full-length sequences are added to SILVA, this will therefore not have any influence on our ESV numbering and identification of ESV centroids. However, when the sequences are added we will get better support for the taxonomy assignment by SILVA, which will improve classification. So with time, when our and others' full-length high-quality sequences are added to SILVA, this will help to improve the taxonomic classification provided by SILVA and thus also improve our ecosystem-specific database.

      line 129- I found mixing the abbreviations ASV and ESV very confusing. For example, ASV's could also be considered "exact" sequence variants, since they are unclustered, and to add to the confusion you DO cluster the ESVs later in the method. Choosing different terms would help clarify when you are talking about full-length vs. amplicon sequences.

      We understand that it may not be clear from the abbreviations that ESVs refer to full-length 16S rRNA sequences, whereas ASVs refer to shorter amplicons. To avoid confusion we will use the term full-length ESVs in the manuscript.

      line 200- This "in brief" description of methods is inadequate, especially for a paper that is introducing the method for the first time. What algorithm does the script use to identify an ESV's closest relative- RDP classifier, SINTAX, BLAST? What algorithm does the script use to obtain taxonomy? What algorithm is used to calculate sequence identity? These choices have major impacts on your results, and for a paper that is introducing a method the workings of the method should not be hidden in supplemental materials or within code.

      We use a comprehensive usearch (-maxrejects 0) for identification of the closest relatives in the SILVA database and for calculation of the percent identity. We do not use any classifiers. We will make this clearer in the next version of the manuscript.

      line 332- I would believe the results of a kmer-based method like the standard RDP/Wang classifier more than sequence identity. You are using the full SILVA database in your classification, so overclassification should not be a major problem. If you are worried overclassifications might be masking the gains of your method, you can double check by looking at how many classifications change when you use the new database. The main point of your reference #10 is that sequence identity is a poor predictor of taxonomic rank, and Edgar even states in the abstract that "95% identity was found to be a twilight zone where taxonomy is highly ambiguous because the probabilities that the lowest shared rank between pairs of sequences is genus, family, order or class are approximately equal."

      Sequence identity is actually a fairly good predictor of taxonomic ranks when full-length 16S rRNA sequences are analyzed (Yarza et al. 2014, Kim et al. 2014). The statement that "95% identity was found to be a twilight zone where taxonomy is highly ambiguous because the probabilities that the lowest shared rank between pairs of sequences is genus, family, order or class are approximately equal." relates to V4 amplicons and not full-length sequences. Another key point is that the NCBI was used as the ground truth, and we know that this database contains many errors according to both conserved marker gene phylogenetics and ANI (Parks et al. 2018, Ciufo et al. 2018).

      Yarza P, Yilmaz P, Pruesse E, Glöckner FO, Ludwig W, Schleifer K-H, et al. (2014). Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences. Nat Rev Microbiol 12: 635–645.

      Kim M, Oh H-SS, Park S-CC, Chun J. (2014). Towards a taxonomic coherence between average nucleotide identity and 16S rRNA gene sequence similarity for species demarcation of prokaryotes. Int J Syst Evol Microbiol 64: 346–351.

      Parks DH, Chuvochina M, Waite DW, Rinke C, Skarshewski A, Chaumeil P-A, et al. (2018). A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat Biotechnol 36: 996–1004.

      Ciufo S, Kannan S, Sharma S, Badretdin A, Clark K, Turner S, et al. (2018). Using average nucleotide identity to improve taxonomic assignments in prokaryotic genomes at the NCBI. Int J Syst Evol Microbiol 68: 2386–2392.

      line 367- What method specifically is used to "map" the ESV's?

      We used usearch11 -usearch_global -maxrejects 0 -id 0. This information is provided in the materials and methods.

      line 378- How do you know these classifications are correct when there is nothing to compare them to and no way to test for accuracy? Perhaps you could change it to "This approach will distinguish species-level classifications."

      Excellent point. This part does not include any taxonomy. We will rewrite or remove this sentence.

      line 387- When you defer to a centroid name over Silva, what happens to the Silva name? Is it lost in favor of the placeholder name? Could this result in a situation where known organisms are missed because they get classified as a new placeholder name instead?

      If the centroid of e.g. a species falls outside the criteria for a SILVA genus, all sequences within that species will obtain a de novo name, even though some sequences may fall within the threshold. However, there will most likely also be species whose centroids fall within that genus, in which case the genus name will remain. AutoTax creates a log of all these changes, which means that you can search for missing taxa if relevant.

      line 395- This "striking" finding is not novel. It is well supported in the literature that custom databases will dramatically improve classification and you should include that context when you describe your result. See for example:
      https://msphere.asm.org/con..., Fig 3
      http://journals.plos.org/pl..., Fig 5
      https://doi.org/10.1186/147..., Table 1
      http://dx.plos.org/10.1371/..., Table 4
      https://peerj.com/articles/494, Fig 3
      https://academic.oup.com/da..., Table 1
      http://www.sciencedirect.co..., Fig 6
      http://www.biomedcentral.co..., Fig 4
      https://peerj.com/articles/..., Fig 1

      Thanks for providing these references, they support why the method is relevant and will be included in the article.

      line 410- The improvement over MiDAS is one of your most compelling findings. You should elaborate on it more! For example, a little background on how MiDAS was created for your non-wastewater audience, and then you can emphasize how the volume of sequences is really important. This is the most compelling evidence for scientists to adopt your method when they already have small custom databases, and it is also the most appropriate test of improvement since MiDAS is the current standard for activated sludge community classifications.

      We will elaborate on this in the manuscript. The MiDAS database is actually a curated version of SILVA, which includes all SILVA sequences, so it is not a "small" custom database.

      line 417- This analysis is certainly useful to the wastewater research community, but it tests the resolution of primer regions, not the validity of database. To test how your database performs, you should use known amplicon sequences that do not already have an exact match in the database. For example, by creating amplicons from unincluded ESV's.

      We disagree with this. We show in the paper that we can make a database that includes near-perfect references for almost all bacteria in an ecosystem. Therefore, it makes sense to test the resolution of the primer regions when there is an exact match in the database.

      line 490- How can you state that the AutoTax databases are near-complete when you haven't performed any completeness estimates? They are certainly "more complete"...

      We actually determine the completeness of the database in Figure 1. The results are based on mapping amplicon data to the ESV database and calculating how many of the amplicons have high-identity references in the database. The analysis is of course biased by the amplicon primers, but it is the best we can do.
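      As a minimal sketch of the kind of completeness estimate described above, and not the authors' actual implementation: given the best-hit percent identity of each amplicon against the reference database, completeness can be read as the fraction of amplicons at or above an identity cutoff. The function name, the input format, and the 99% cutoff are all assumptions made for illustration.

```python
def completeness(best_hit_identities, threshold=99.0):
    """Fraction of amplicons whose best database hit meets the identity cutoff.

    best_hit_identities: one percent-identity value per amplicon (hypothetical input).
    threshold: identity cutoff in percent (assumed value, not taken from the manuscript).
    """
    if not best_hit_identities:
        raise ValueError("need at least one amplicon")
    # Count amplicons that have a high-identity reference in the database
    covered = sum(1 for identity in best_hit_identities if identity >= threshold)
    return covered / len(best_hit_identities)


# Four amplicons, three of which have a reference at >= 99% identity
fraction = completeness([100.0, 99.5, 97.0, 99.0])
```

      Such a fraction is still conditional on the amplicon primers, as the response notes, because sequences the primers fail to amplify never enter the denominator.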

      line 491- These public databases are not "much larger" than your database because your AutoTax database combines your new sequences with Silva, and this combination is therefore slightly bigger than Silva. This statement could be misleading to a less careful reader, because some custom databases are used alone, without being combined with Silva/Greengenes.

      We do not include any sequences from SILVA in our database. SILVA is only used for classification. The statement is therefore correct. We will make this clearer in the manuscript.

      line 496- How can you claim sub-species level classifications? You have explicitly stated that AutoTax uses a 7-level taxonomy.

      We are able to resolve multiple ASVs for each species. Therefore, we claim that the microbial community can be analyzed at the sub-species level.

      line 506- I like seeing a time estimate, but it is meaningless without some broad description of the computational platform used. "A few hours," is great, but was that on a standard laptop or a high throughput computing center?

      This is important and relevant information and we will add it to the manuscript.

      line 513- "Although the sequence similarity clustering does not necessarily reflect the true evolutionary history of the microbes or the phenotypic characteristics..." This is the biggest weakness of your method, and a major concern. It deserves more in-depth discussion. For example, you show improvement over a smaller custom database (MiDAS), but you define improvement based on how many ASVs were named. Is it really an improvement to end up with more names if those names are less meaningful? How valuable is a placeholder name when it lacks phylogenetic context? Also, you need to discuss the limits of sequence identity for defining taxonomic rank. You cite a paper by Edgar (ref # 10) multiple times, yet do not discuss its main finding that sequence identity is a poor predictor of taxonomic rank.

      Many of the points here have been discussed above. An important point is that AutoTax uses the most recent phylogenetic information when possible (the SILVA taxonomy). The placeholder names serve as robust reference points until true taxonomies are made. They have the same diversity as true taxa and are therefore good substitutes until the taxonomy has been curated by phylogenetic experts or by genome-based methods such as those used in the Genome Taxonomy Database (GTDB), which are also being used to curate the NCBI taxonomy. The denovo names will be replaced by true taxonomic names as the databases are curated. We can easily keep track of these changes when AutoTax-based databases are updated.

      line 530- "...will provide a common language for scientific communities..." How will AutoTax accommodate existing and future manually curated taxonomies that include phylogenetic information? How do you prevent dueling frameworks, the "more complete" vs. the "more correct." Can AutoTax be incorporated into existing manual curation efforts, or is it purely a separate approach?

      See comment above.

    2. On 2019-07-12 15:45:14, user Robin Rohwer wrote:

      This manuscript describes a new method, AutoTax, to create databases out of full-length 16S rRNA gene sequences that can be used for taxonomy assignment of 16S rRNA gene amplicon data. This is a valuable contribution, because availability of full-length 16S rRNA gene sequences is increasing, and ecosystem-specific databases improve classification of amplicon data. Combining high-throughput generation of full-length 16S rRNA gene references with high-throughput database creation would be a major advancement. This manuscript also describes the application of AutoTax databases to the design of FISH primers, but I will focus my review on taxonomy assignment because that is my area of expertise.

      I believe this tool will be a valuable resource, which is why I have taken the time to review it carefully. I hope these comments will be addressed in the published version of this paper. I have three main concerns with this manuscript.

      First, it needs more detailed descriptions of the underlying methods that AutoTax uses for alignment and identity calculations. The manuscript refers readers to a github repo and supplemental documents for details, but this manuscript is introducing a new method so the algorithms/tools it employs should be clear in the main text.

      Second, the taxon levels in the AutoTax database are assigned based on identity thresholds. The authors acknowledge briefly that this means their AutoTax databases lack phylogenetic information, but this is a major limitation so they should justify in more detail why they chose this method and why the resulting taxon names are still meaningful. They also cite a 2018 paper by Edgar (https://peerj.com/articles/...) several times to support other claims, but do not mention that the main finding of this paper is that percent identity is a poor predictor of taxon level.

      Third, the importance of AutoTax would be more clear if the manuscript discussed how it fits in with similar work on improving taxonomy classifications. Many previous studies (examples below) have also found improvements in taxonomy assignment when databases are improved, yet here this is presented as a novel finding. Many other methods also exist for creating custom databases, yet they are also not discussed. I believe AutoTax is a novel contribution, but its value will only be clear and meaningful in the context of previous work.

      Specific Comments:

      line 68- Great explanation of why taxonomy is needed for cross-study comparisons, there is a common confusion that ASV's have solved the problem.

      line 102- Since you are introducing an alternative tool to build databases, you should include more detail on the existing ways that databases are created and the comparative improvements and drawbacks when using AutoTax. (Perhaps in discussion instead of right here in introduction.) For example, how does AutoTax compare to the results from SINA (https://academic.oup.com/bi...) and RAxML (https://academic.oup.com/bi...) or FastTree (https://journals.plos.org/p...), or to manual curation using Arb (https://academic.oup.com/na...)?

      line 109- Instead of (or in addition to) citing the original field guide (your ref #22), please cite TaxAss as the reference for the FreshTrain. The original Newton citation is appropriate when referencing the phylogeny, vocabulary, or ecology of the included Freshwater bacteria, but the TaxAss citation is appropriate when referencing the current updated version of the database that can be used for taxonomy assignment. (https://msphere.asm.org/con...)

      line 123, line 384- As you re-make your custom AutoTax taxonomy with a new version of SILVA, how does AutoTax prevent changes to the ecosystem-specific names in the first AutoTax version? It is clear in line 384 that you can add more custom sequences without messing up the existing taxonomy, but what if some of your sequences are added to SILVA itself? Couldn't that then shift all of your ESV centroids? And if you avoid that shift by leaving the duplicated sequence in the custom set, wouldn't AutoTax then ignore any added phylogenetic information that came from the additional SILVA curation, since preference is given to ESV centroids in the case of conflicts?

      line 129- I found mixing the abbreviations ASV and ESV very confusing. For example, ASV's could also be considered "exact" sequence variants, since they are unclustered, and to add to the confusion you DO cluster the ESVs later in the method. Choosing different terms would help clarify when you are talking about full-length vs. amplicon sequences.

      line 200- This "in brief" description of methods is inadequate, especially for a paper that is introducing the method for the first time. What algorithm does the script use to identify an ESV's closest relative- RDP classifier, SINTAX, BLAST? What algorithm does the script use to obtain taxonomy? What algorithm is used to calculate sequence identity? These choices have major impacts on your results, and for a paper that is introducing a method the workings of the method should not be hidden in supplemental materials or within code.

      line 332- I would believe the results of a kmer-based method like the standard RDP/Wang classifier more than sequence identity. You are using the full SILVA database in your classification, so overclassification should not be a major problem. If you are worried overclassifications might be masking the gains of your method, you can double check by looking at how many classifications change when you use the new database. The main point of your reference #10 is that sequence identity is a poor predictor of taxonomic rank, and Edgar even states in the abstract that "95% identity was found to be a twilight zone where taxonomy is highly ambiguous because the probabilities that the lowest shared rank between pairs of sequences is genus, family, order or class are approximately equal."

      line 367- What method specifically is used to "map" the ESV's?

      line 378- How do you know these classifications are correct when there is nothing to compare them to and no way to test for accuracy? Perhaps you could change it to "This approach will distinguish species-level classifications."

      line 387- When you defer to a centroid name over Silva, what happens to the Silva name? Is it lost in favor of the placeholder name? Could this result in a situation where known organisms are missed because they get classified as a new placeholder name instead?

      line 395- This "striking" finding is not novel. It is well supported in the literature that custom databases will dramatically improve classification and you should include that context when you describe your result. See for example:
      https://msphere.asm.org/con..., Fig 3
      http://journals.plos.org/pl..., Fig 5
      https://doi.org/10.1186/147..., Table 1
      http://dx.plos.org/10.1371/..., Table 4
      https://peerj.com/articles/494, Fig 3
      https://academic.oup.com/da..., Table 1
      http://www.sciencedirect.co..., Fig 6
      http://www.biomedcentral.co..., Fig 4
      https://peerj.com/articles/..., Fig 1

      line 410- The improvement over MiDAS is one of your most compelling findings. You should elaborate on it more! For example, a little background on how MiDAS was created for your non-wastewater audience, and then you can emphasize how the volume of sequences is really important. This is the most compelling evidence for scientists to adopt your method when they already have small custom databases, and it is also the most appropriate test of improvement since MiDAS is the current standard for activated sludge community classifications.

      line 417- This analysis is certainly useful to the wastewater research community, but it tests the resolution of primer regions, not the validity of database. To test how your database performs, you should use known amplicon sequences that do not already have an exact match in the database. For example, by creating amplicons from unincluded ESV's.

      line 490- How can you state that the AutoTax databases are near-complete when you haven't performed any completeness estimates? They are certainly "more complete"...

      line 491- These public databases are not "much larger" than your database because your AutoTax database combines your new sequences with Silva, and this combination is therefore slightly bigger than Silva. This statement could be misleading to a less careful reader, because some custom databases are used alone, without being combined with Silva/Greengenes.

      line 496- How can you claim sub-species level classifications? You have explicitly stated that AutoTax uses a 7-level taxonomy.

      line 506- I like seeing a time estimate, but it is meaningless without some broad description of the computational platform used. "A few hours," is great, but was that on a standard laptop or a high throughput computing center?

      line 513- "Although the sequence similarity clustering does not necessarily reflect the true evolutionary history of the microbes or the phenotypic characteristics..." This is the biggest weakness of your method, and a major concern. It deserves more in-depth discussion. For example, you show improvement over a smaller custom database (MiDAS), but you define improvement based on how many ASVs were named. Is it really an improvement to end up with more names if those names are less meaningful? How valuable is a placeholder name when it lacks phylogenetic context? Also, you need to discuss the limits of sequence identity for defining taxonomic rank. You cite a paper by Edgar (ref # 10) multiple times, yet do not discuss its main finding that sequence identity is a poor predictor of taxonomic rank.

      line 530- "...will provide a common language for scientific communities..." How will AutoTax accommodate existing and future manually curated taxonomies that include phylogenetic information? How do you prevent dueling frameworks, the "more complete" vs. the "more correct." Can AutoTax be incorporated into existing manual curation efforts, or is it purely a separate approach?

    1. On 2019-07-08 04:08:39, user Fraser Lab wrote:

      This work describes the implementation of a data processing pipeline for acquiring high-resolution maps of microtubules (MTs) from cryo-electron microscopy (cryoEM) data using the RELION software. As in other pipelines for processing microtubule EM data, this implementation requires extensive custom processing because of the pseudosymmetric nature of most MTs assembled in vitro (also observed in vivo): a "seam" down the length of the assembly disrupts the otherwise helical symmetry. The broken symmetry means that existing methods for processing purely helical particles equate nonequivalent positions and produce low-quality reconstructions. The authors implement a treatment of these particles that accounts for the seam and produces high-resolution structures of the MT α and β asymmetric units. It builds on implementations of similar pipelines for the same purpose using other software, with the key advantage of conducting all steps in a single program that most cryoEM users are already familiar with. The pipeline consists of a set of scripts and a series of steps the user should complete in the RELION graphical user interface (gui) in order to obtain the asymmetric unit reconstructions. The authors test their pipeline on three example datasets with different decorators on the α or β subunits that aid in initial alignment and discrimination between the two, and note that they have successfully (with minor modifications) applied the pipeline to a more challenging dataset with both subunits decorated.

      The major success of this paper is the clear and thorough description of the steps necessary to produce high-resolution α and β subunit reconstructions, complete with clear justification for each step and descriptions of expected results so that an advanced user can intervene when intermediate results deviate from expectations. This tool meets an immediate need in the structural biology community for analysis of MTs, which are inadequately reconstructed from cryoEM data by existing helical or strictly single-particle methods, and which play an important role in the cell interacting with a variety of other molecular machines. Ideally, it would be benchmarked against the other pipelines mentioned in the paper (e.g. https://github.com/nogalesl... from: https://www.ncbi.nlm.nih.go...), but as one of us (JSF) knows from personal experience, it can be tricky to set up the necessary EMAN and frealign environments correctly to do such benchmarking properly. Here, the ability to complete this analysis without exporting steps to other programs could be a major boost in accessibility. Moreover, as the authors have built upon a popular cryoEM image processing program that has a gui highly accessible to novice and intermediate users as well as command-line tools that expert users may use for more advanced customizations and interventions, we anticipate this pipeline will be enthusiastically adopted by many users.

      We also applaud the authors' choice to make the scripts open-source and publicly available on github, which will facilitate the active conversation between users and developers (and sometimes software developers who did not author the original work) that leads to major breakthroughs and advancements in later versions. However, we cannot comment on the scripts themselves yet as a full path to the source code is not provided in the manuscript and a search for “MiRP relion github” didn’t yield anything informative. We would like to request the authors include it in the revised manuscript and also provide it to us during the review process so we may evaluate this important component of the work. We recommend using Zenodo (https://zenodo.org/) to generate a DOI for a snapshot of the repository, which will also produce a timestamp and facilitate formal versioning.

      We identify a few major weaknesses in the manuscript in its current form, all of which we hope the authors can address in a revision. First, the final, high-resolution reconstruction is the αβ dimer, not the C1 reconstruction of a full helical turn, which may not serve the goals of all users. The authors identify the final averaging step that disrupts the density for all but the αβ dimer directly opposite the seam and describe alternative approaches, including one implemented by the Carter group while the manuscript was in preparation. We would strongly encourage them to implement one of these approaches so that biological questions that require examination of the whole MT can also be addressed. We are also unclear on how the present implementation would process both the microtubule and fiducial protein for datasets with dynein or EB3 bound, and would like to see this explicitly discussed (or better, tested if EMPIAR datasets are available).

      Second, at times the authors describe what they expect from data they have not processed, for example on page 14 lines 1-4. Given that they have the necessary tools in-hand and this work describes the method, we would press them to test this type of claim and describe the supporting evidence. They have also described processing a dataset with fiducial markers on both α and β monomers but not described the modifications they had to make to the pipeline for this dataset, and they have not yet (to our knowledge) used the method to process undecorated MTs. As they cite successful processing of undecorated MTs by the Nogales group, proof-of-concept processing of undecorated MTs would be an important component of making this pipeline at least as useful as existing methods. The other case where we would strongly prefer to see the authors test their claims is on page 12 lines 46-48, where they speculate on the effects of excluding some MTs from further analysis, and although the authors do not make any claims about performance using automated MT picking, we would be very interested to see this tested (even if the result is that it is discouraged).

      Third, while we agree with the statement: “strikingly, although application of MiRP compared to standard helical processing has a negligible effect on the reported reconstruction resolution by FSC (Fig. 6c), the structural details are clearly superior in quality,” based on the snapshots of density shown at a specific contour in Fig 6 and Supp Fig 3, it is possible to use tubulin model refinement and other quantitative evaluations of the map to validate this statement. For example, compared to the standard helical map we would expect a higher quality reconstruction to have a smaller convergence rmsd when multiple independent models are rebuilt into the map (as in: https://www.ncbi.nlm.nih.go...); a better EMRinger score (https://www.ncbi.nlm.nih.go...) when evaluated with the same starting model rigid body fitted into the map; and lower B-factors/better model geometry when an atomic model is refined.

      Fourth, reiterating that this is a methods paper, we find it critical that the raw data be made available on EMDB for all datasets described in the manuscript, not just the C1 reconstructions and symmetrized asymmetric units. This is important for reproducibility, open science, and the development of exciting new methods like this one using publicly available test data.

      In sum, we find this an important piece of work that will immediately improve the ability of groups working on MTs to recover high-resolution structures, pending our several major reservations that we hope the authors will resolve in a revision. We also identify several minor points that could be improved, mostly regarding readability, and a few suggestions for alternative implementations of some steps in this or a future version of the pipeline. As the line numbers do not appear to be spaced the same as the text, for these points, we have indicated the closest line number to the line when printed.

      Some figures or tables referenced in the main text are not referenced correctly, such as on page 9 line 31, Fig. 6aii (referencing a panel that does not exist).

      Table 1 should include accession codes in the EMDB.

      The authors might comment on the biological relevance of MTs with seams, given that these are more often encountered in vitro and much less often in vivo, preferably in the introduction.

      We suggest a figure that visually highlights the symptoms of misalignment of the seam and/or helical averaging of MTs with seams. Including correlation coefficients with this figure could help illustrate the challenge this pipeline overcomes. This could also illustrate the signal boost of the superaverage and the symptoms of out-of-register units. The figure could be referenced at several points later in the text to explain why certain steps are necessary.

      At the end of the first paragraph on page 6, the authors describe "structural constraints of MT polymers" but apply restraints in orientational and translational searches. It would be helpful to expand on the rigidity of these restraints and whether it varies with distance to further neighbors, if applicable. Ideally (or possibly in a future version), the authors could consider restraining each α or β monomer relative to its immediate neighbors and using this approach in combination with variable restraint rigidity to aid in reconstructions of monomers at the seam and in distorted regions.

      Several points regarding resolution starting at the end of page 6 and continuing in the first paragraph on page 7 describe increments of resolution or changing pixel size by binning, which have no meaning in isolation (e.g. a difference of 0.2 Å or binning x 4). The starting or ending absolute quantities should be included (e.g. improvement of the resolution to 3.2 Å or a final pixel size of 3 Å). This is repeated on page 11 lines 11-12.

      The authors describe that "there is a clear bias towards a certain range of Rot angles" on page 8 line 16, but as this is the expected behavior and not a ground truth, it should be described as such.

      Similarly, on page 9 line 19, the authors intermix behavior on their test data ("As expected") with description of the method, and should more clearly separate these.

      On page 8 lines 18-22, the authors describe their approach for using the bias toward one Rot angle to select the correct seam location. We recommend testing the alternative method of a grid search over correlation coefficients, or describing how this is effectively accomplished during the global search step.

      On page 8 lines 21-22, the result of the Rot search is described as an approximation. It would be helpful to clarify whether this result is precise but sometimes inaccurate, or accurate but known to be imprecise.

      On page 8 in the section on X-Y shift smoothing, the authors describe a remedy for out-of-register asymmetric units involving resetting excessively large shifts to zero and re-refining. We propose an alternative method by analysis of the distribution of X-Y shifts that identifies the out-of-register shift vector and adjusts excessively large shifts by modulo arithmetic. This would reduce the error in the reset shifts.
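      A minimal sketch of the modulo idea proposed above, under stated assumptions: the out-of-register vector is taken as already known (here passed in as `register_vec`), and the function name and the 41 Å value in the usage line are placeholders rather than anything taken from the MiRP scripts. Instead of resetting a large shift to zero, the nearest whole multiple of the register vector is subtracted, so the in-register remainder of the shift is kept.

```python
def wrap_shift(shift, register_vec):
    """Remove whole out-of-register steps from an (x, y) shift.

    shift: refined (x, y) shift of one particle (hypothetical input, in Å).
    register_vec: (x, y) vector of one out-of-register step (assumed to be known).
    """
    sx, sy = shift
    vx, vy = register_vec
    norm_sq = vx * vx + vy * vy
    if norm_sq == 0:
        raise ValueError("register_vec must be non-zero")
    # Nearest integer number of register steps contained in the shift
    steps = round((sx * vx + sy * vy) / norm_sq)
    # Subtract those steps; the small residual shift is preserved
    return (sx - steps * vx, sy - steps * vy)


# A shift that is one full step out of register along y
corrected = wrap_shift((1.0, 42.0), (0.0, 41.0))
```

      Shifts that are already in register pass through unchanged, because the nearest integer number of steps is zero.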

      As part of the same description on page 8 lines 47-48, the authors describe enforcing all X/Y shifts in a MT to follow a single slope and intercept, and should clarify whether this is constrained or restrained.

      The authors could expand on the process of 'segment average' image generation on page 4 line 29 and the source of known helical parameters on page 4 line 33.

      The sample preparation for cryoEM section starting on page 3 could include greater detail, e.g. Vitrobot parameters during blotting and freezing.

      The authors could clarify the difference between defects and switches in PF number on page 7, lines 27-28.

      The description of a 'clean' seam on page 10 line 48 is confusing. Describing this as a MT with no seam might be clearer, if that is the correct interpretation.

      There are a couple creative uses of the word "allocation" — on page 9 line 6 we suggest substituting with "positioning" and on page 13 line 22 we suggest substituting with "assignment".

      The wording on page 5 line 52 seems to imply the RELION nomenclature preceded the Euler angle nomenclature used in many other applications, so we recommend dropping the modifier "former".

      The wording on page 7 line 5 implies the previously implemented approaches are lacking in some way, and we recommend dropping the modifier "albeit". This is repeated on page 12 line 42.

      The authors could describe which of the operations through the gui could alternatively be run on the command line on page 5, lines 59-60.

      On page 13 line 21, the authors claim their implementation is "the only way to avoid introducing artefacts." This may be an overly bold claim.

      We are unsure what the authors mean to do with the "41 Å shifted positions" on page 9 line 7.

      Some of the word choices could be made more accessible to all readers, for example by using the more common and equivalent "while" instead of "whilst" in several instances.

      The sentence "Initial Tilt ... picking coordinates" on page 6 lines 54-56 is unwieldy and could be rephrased.

      We find the sentence beginning "In other words" on page 8 lines 2-4 redundant and unnecessary.

      The qualifier "data collection parameters" should not accompany ice thickness on page 9 line 42.

      We would prefer sticking to one set of units on page 10 line 1 and substituting "sub-10 Å" for "subnanometer".

      The abbreviation "(DQE)" on page 9 line 43 is not used again and may be omitted.

      There are several spacing errors throughout the text. Between a numeral and a unit, there should be a space, except where the next character is ° or %.

      On page 8 line 4 and in several other instances, where "however" is an interjection, it should also be preceded by a comma, e.g. "this register is, however, very error prone".

      There is a typo on page 3 line 52 (1 mg/ml --> 1 mg/mL), an incorrect abbreviation on page 4 line 6 (sec --> s), an overly dense abbreviation on page 4 line 59 (4xbin), a typo on page 4 line 53 (smoothened --> smoothed), a typo on page 5 line 14 (psi/tilt/ranges --> Psi/Tilt ranges), use of redundant "around" and "~" modifiers on the same quantities on page 6 line 53, incorrect pluralisation on page 7 lines 33-34 (confidence --> confidences), an unnecessary word "score" on page 8 line 7, a missing word "good" in "good signal to noise and good angular distribution" on page 9 line 33, an unnecessary hyphen in "ice-thickness" on page 9 line 42, and an unnecessary comma in "reconstructions, remains" on page 9 line 56.

      We review non-anonymously, Iris Young and James Fraser (UCSF).

    1. On 2019-06-15 15:51:16, user Mitchell Thompson wrote:

      We are currently getting an Open Access License for the code described in our preprint as required by DOE. As soon as the license has been approved we will provide the repository link.

    1. On 2019-06-07 20:30:44, user J Wallace wrote:

      Is anyone else concerned that Equation 2 puts the PCAs _after_ the SNP? (I checked the R code; it does it too, then appears to take simple Type I SS for p-values.) That seems like a basic methodological error. SNP effects should always be fit _after_ population structure. The github repository was missing the PCA file so I couldn't check it on the supplied code, but I bet fixing that would resolve most/all of the "problems" mentioned in this paper. (If the issue persists, I'd like to see if/how it relates to using kinship matrices, which are just as common as PCs.)
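      The ordering concern can be illustrated with a toy simulation. This is a sketch on made-up data with a single principal component, not the preprint's R code: the phenotype depends only on population structure, the SNP merely correlates with that structure, and the sequential (Type I) sum of squares credited to the SNP is computed with the SNP entered first and then with the SNP entered after the PC. All variable names and simulation settings are assumptions.

```python
import math
import random


def centered(values):
    """Subtract the mean, which is equivalent to fitting an intercept."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def residualize(values, covariate):
    """Residuals of values after regression on an intercept and one covariate."""
    v, c = centered(values), centered(covariate)
    slope = sum(a * b for a, b in zip(v, c)) / sum(b * b for b in c)
    return [a - slope * b for a, b in zip(v, c)]


def explained_ss(response, predictor):
    """Sum of squares explained by one predictor; inputs are already residualized."""
    sxy = sum(a * b for a, b in zip(response, predictor))
    sxx = sum(b * b for b in predictor)
    return sxy * sxy / sxx


random.seed(1)
n = 200
pc = [random.gauss(0, 1) for _ in range(n)]  # one structure axis
snp = []
for value in pc:
    freq = 0.5 + 0.3 * math.tanh(value)  # allele frequency tracks the PC
    snp.append(sum(random.random() < freq for _ in range(2)))  # genotype 0/1/2
pheno = [2.0 * value + random.gauss(0, 1) for value in pc]  # no true SNP effect

# SNP entered first: it absorbs variance that really belongs to structure
ss_snp_first = explained_ss(centered(pheno), centered(snp))
# SNP entered after the PC: only variance not explained by structure is credited
ss_snp_last = explained_ss(residualize(pheno, pc), residualize(snp, pc))
```

      In this setup the SNP has no causal effect, so any sum of squares it receives when entered first comes from its correlation with structure, which is the pattern the comment above warns about.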

    1. On 2019-05-30 08:21:59, user Martin Modrák wrote:

      Liked the preprint and great that all the code is available. Good job! Should my main takeaway be that amplicon sequencing of the V4 (or other regions) in 16S is not the best investment of time and money and we should change to other marker genes? Or would you caution against that interpretation of your data?

      A few nitpicky suggestions follow:

      • I would use heatmaps or contour plots in Figure 1 instead of the 3D surface which (to me) was very hard to decipher.

      • I feel that when you combine recall and precision as in Figure 3b, it might be better to use F1 than their sum (not that it would likely change your conclusions much) - if that was a conscious decision, might be worth discussing in the paper.

      • One thing that is not clear is whether you used the whole 16S rRNA as a marker gene or just the region used in common amplicon sequencing protocols (from Figure 3c I would guess you used whole length).

      • If the regions of 16S used in amplicon sequencing are not included separately, it would IMHO be nice to add them - I can see reasons why they would be better than full-length 16S as well as reasons why they would be worse.
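      On the F1 suggestion in the list above, a small sketch of why the two summaries can disagree (the precision and recall values are invented for illustration and are not from the preprint): F1 is the harmonic mean of precision and recall, so it penalizes an imbalance between them that a plain sum hides.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Two hypothetical markers with the same precision + recall sum of 1.0
balanced = f1_score(0.5, 0.5)
lopsided = f1_score(0.9, 0.1)
```

      Both hypothetical markers tie on the sum, yet the balanced one has the higher F1, so rankings based on the sum and on F1 need not agree.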
    1. On 2019-05-29 23:23:37, user Liang Huang wrote:

      A heavily-revised version of this paper has been accepted to the Proceedings of ISMB 2019 and will be published by the journal Bioinformatics in July 2019 as:

      Liang Huang, He Zhang, Dezhong Deng, Kai Zhao, Kaibo Liu, David Hendrix, and David Mathews (2019). LinearFold: Linear-Time Approximate RNA Folding by 5'-to-3' Dynamic Programming and Beam Search. Bioinformatics, Vol. 35, July 2019, Special Issue of ISMB 2019 Proceedings.

      https://www.iscb.org/cms_ad...

      Stay tuned -- the final PDF will be online on the publisher's website (Oxford University Press) in July 2019.

      In the meantime, please try out our LinearFold server and code:

      http://linearfold.org<br /> https://github.com/LinearFo...

    1. On 2019-05-27 14:19:11, user bli wrote:

      Dear authors, maybe I haven't searched thoroughly enough, but I haven't found any indication regarding the availability of tRNAscan-SE 2.0. What is the official site hosting the source code and usage documentation?

      Thanks in advance.

    1. On 2019-05-23 13:30:52, user MohanBalasubramanian wrote:

      This work will soon be published in Int J Mol Sciences in a special issue edited by Kensaku Sakamoto on "Expanding and Reprogramming the Genetic Code", which carries ~15 articles on this subject.

    1. On 2019-04-25 15:28:19, user Pedro Madrigal wrote:

      Hi,

      Thank you very much for posting the ChIA-DropBox preprint. Unfortunately, it looks like the GitHub repositories mentioned in the preprint (see below) are not available.

      Best regards,<br /> Pedro


      Code availability<br /> The ChIA-DropBox data processing pipeline is publicly available at: https://github.com/TheJacks.... The ChIA-view visualization tool for multiplex interactions is publicly available at: https://github.com/TheJacks.... Both repositories contain README files with further details on how to get started using the software.

    1. On 2019-04-24 11:57:00, user daniele marinazzo wrote:

      Thanks!

      My comments here

      https://pubpeer.com/publica...

      and pasted below

      %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

      (partial) review for the preprint.

      Disclaimer - COI statement: Matteo Fraschini invited me as guest professor for two weeks in Cagliari in the summer of 2017, and we are regularly in contact, even if no collaboration is ongoing at the moment.

      %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

      Dear colleagues,

      thanks for sharing this work. Here follow some ideas and comments.

      Let's start with a conceptual issue: from the evidence that PSD and network metrics are very much correlated, you "conclude that it may represent good practice to report the findings from the two approaches in conjunction to have a more comprehensive view of the results." One might argue that this should be even more true for two methods that are not connected to each other, rather than for two methods which mostly reflect the same thing.

      Now, one might want to argue why the two measures are correlated, and whether this is specific to the brain. I will provide some indications below.

      I will go through two steps:

      1) PLV and PSD are correlated (as also observed in "A comparison of spectral magnitude and phase-locking value analyses of the frequency-following response to complex tones").<br /> 2) PLV and network measures are correlated.<br /> Let's start with 1. As clearly stated in "Phase locking value revisited: teaching new tricks to an old dog", by Bruña, Maestú, and Pereda, "PLV in a given band is related to the average coherence in the frequency range encompassed by the band weighted by their relative amplitude". In particular, they propose a computationally efficient, but more importantly illuminating, way to define PLV without using the Hilbert transform and phase extraction, coded in Matlab as

      [nc, ns, nt] = size(data);<br /> ndat = data ./ abs(data);<br /> plv = zeros(nc, nc, nt);<br /> for t = 1:nt<br /> plv(:, :, t) = abs(ndat(:, :, t) * ndat(:, :, t)') / ns;<br /> end

      which makes the relation between PSD and PLV evident. From the code that I pull requested to your repository

      https://github.com/danielem...

      https://users.ugent.be/~dma...
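
      For readers without Matlab, the matrix-product definition of PLV quoted above can be sketched in plain Python. This is a single-trial illustration under stated assumptions: the input is already a complex analytic signal per channel with no zero-valued samples, and the three synthetic channels (`base`, `shifted`, `other`) are hypothetical test signals, not EEG data.

```python
# Sketch: PLV as the normalised inner product of unit-modulus analytic signals,
# mirroring abs(ndat * ndat') / ns from the quoted Matlab code for one trial.
import cmath
import math

def plv_matrix(signals):
    """signals: list of channels, each a list of nonzero complex samples."""
    n_samples = len(signals[0])
    # Normalise every sample to unit modulus so that only the phase remains.
    unit = [[z / abs(z) for z in channel] for channel in signals]
    n_channels = len(unit)
    return [
        [
            abs(sum(a * b.conjugate() for a, b in zip(unit[i], unit[j]))) / n_samples
            for j in range(n_channels)
        ]
        for i in range(n_channels)
    ]

n = 200
# 5 full cycles over the window.
base = [cmath.exp(1j * 2 * math.pi * 5 * k / n) for k in range(n)]
# Same phase dynamics with a fixed lag and a larger amplitude.
shifted = [2.0 * z * cmath.exp(1j * 0.7) for z in base]
# 8 full cycles: the phase difference to `base` sweeps whole circles.
other = [cmath.exp(1j * 2 * math.pi * 8 * k / n) for k in range(n)]

plv = plv_matrix([base, shifted, other])
print(f"PLV(base, shifted) = {plv[0][1]:.3f}")
print(f"PLV(base, other)   = {plv[0][2]:.3f}")
```

      In this toy case a constant phase lag gives a PLV of one regardless of amplitude, while signals at different frequencies average out towards zero.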

      Now, you used unthresholded, weighted, undirected networks and compute Strength, Clustering Coefficient, and Betweenness Centrality on them.

      Even taking a random matrix as "network" it's evident that higher values of the entry correlate with Strength and Clustering Coefficient

      https://users.ugent.be/~dma...

      https://users.ugent.be/~dma...

      On the other hand the Betweenness Centrality is constant.

      https://users.ugent.be/~dma...

      This mainly reflects your finding, except for Betweenness Centrality, which is where one might argue that the nonrandom structure of the functional network is slightly more modulated.

      So, to summarize, the results reported on real EEG data reflect these simple associations, and that's a comforting confirmation.

      Then one might wonder how to interpret network measures when the links are statistical dependencies across brain areas. This is of course a topic which is much bigger than a paper :). But indeed the link with a more interpretable local measure is helpful.

      For the moment you could expand on what this association can tell us, and if we need network measures of this kind.

      Thanks!

    1. On 2019-04-16 18:20:04, user Charles Warden wrote:

      Hi,

      Thank you for sharing this preprint (and the code for analysis).

      I hope I am not overlooking something, but I had a few questions:

      1) I apologize, but I may be unclear about the experiment design. At first, I saw the “1,011 yeast isolates,” and guessed that you were doing an experiment for one generation for the 16 x 16 crosses (with 3-4 replicates each). However, I think I am supposed to be focusing on “13,950 individual recombinant haploid yeast segregants by crossing each parental strain to two different strains and collecting an average of 872 progeny per cross (Fig. 1, Supplementary Table 1).” Is there some sense of “generations” in the experiment, or are you looking for the effect of mutations in the parental strains in the original 16 x 16 crosses?

      2) If you are looking for inherited mutations (from the parental strain), should the minimum MAF for a given variant be closer to 6% (1/16) or 3% (1/32, for haploid segregants from diploid parents, if I remember the yeast biology right)? If so, I’m sorry, but I’m a little confused about the <1% (inherited / heritable) variants.

      3) If you have a novel mutation causing a clearer phenotype (such as an acquired somatic mutation that is selected for as resistance to your treatments?), are you testing additional crosses/replicates with that isolate (to see if you can reproduce the phenotype in other crosses/segregants that retain that mutation)? Or, does that have something to do with the 100s of segregants per cross (for example, testing the phenotype for multiple isolates at X weeks, 2X weeks, 3X weeks, etc.)?

      4) Is there an accession number for deposited data?

      5) Also, it probably doesn’t need to be said for the yeast readers, but I was admittedly almost confused by chrX in Figure 4 (but I eventually figured out that was the 10th chromosome).

      Thank you very much for taking time to review these questions, particularly from somebody who doesn’t perform an experiment like this on a regular basis.

      Best of luck with your lab’s projects!

      Sincerely,<br /> Charles

    1. On 2019-04-05 12:59:59, user Dasapta Erwin Irawan wrote:

      Dear authors,<br /> Thank you for sharing the paper. I am currently working on a similar case here in Indonesia. I am interested in learning more about the dataset and R code. Where can I find them?

    1. On 2019-02-05 20:19:36, user B. Arman Aksoy wrote:

      This is a really amazing demonstration of a label-free classification method in T cells.

      One thing that surprised me was that the size of the cell, compared to NAD(P)H signal, wasn't helping much with the classification of activation status (Figure 2d-f). In our case, when human primary T cells are fully activated, their diameter increases from, on average, 8-9 microns to 12-13 microns and these are mostly the cells that proliferate fast. This is also apparent from your Figure 1a, where cells that were activated, on average, are bigger than the unstimulated ones.

      I think if you activated these cells using anti-CD3/anti-CD28 beads and imaged them on day 3 instead of day 2, you would have a better chance of capturing the overall differences between unstimulated and activated cells. The reason I am saying this is that I recently compared these two activation methods side by side and cells that were activated with the tetramer had a different size distribution compared to the population activated with beads on day 3:<br /> https://uploads.disquscdn.c...

      Also, it is worth noting that Immunocult media, even without supplementing with tetrameric antibody complex, causes T cells to get activated (although the activation happens much more slowly). So I think if you use RPMI as the base media and use the beads for activation, your classification method should have an easier time.

      I can't wait to see if this technique can be used to classify tumor infiltrating T cells (without the need to disturb a tumor sample) and can further distinguish different functional subtypes: Th_1, Th_17, or T_reg... Thanks for sharing this work in preprint form -- I hope you will be able to share the annotated images and the classification code that you use soon, too, as I believe this technique would be much more valuable if applied on a larger segment (that contains hundreds or thousands of cells) and in an automated manner (i.e. the segmentation).

    1. On 2019-01-26 14:53:59, user Jingjing Liang wrote:

      Response to Dormann et al.

      Thank you for your comments on Liang et al. 2016. It is always stimulating when someone is discussing our findings. There are many interesting questions you raise, and others neither you nor we have yet wrestled with fully. Please find, below, our response to your comments as numbered on Page 1.

      (1) The authors computed “relative tree species richness” in such a way that it represents a gradient from boreal to tropical plots, rather than in local species richness. When instead computing species richness relative to the maximum value in the region the effect of species richness on productivity is dramatically reduced.

      Response: Thank you for your suggestion in your first sentence. However, confining our analysis strictly to the ecoregion level would render us unable to derive a true global biodiversity-productivity relationship (BPR), which should account for both intra- and inter-ecoregion variability. There are likely a variety of different ways of assessing this; ours and yours are just two. Considering mounting concerns about the delineation of ecoregion boundaries (e.g. Jepson and Whittaker 2002), an ecoregion-level study would create substantial problems of its own. Thus we believe both options (yours and ours) have strengths and weaknesses, and address the same overall question from different angles. There are many other issues that could, and should, be addressed in grappling with how best to do this. These include whether productivity should be standardized (i.e. the issues raised for richness might also apply in some way to productivity), and how best to standardize either richness or productivity (as there are a number of ways of doing this). We are working on delving further into these issues.

      Regarding the point in the second sentence, we disagree that the BPR relationship is dramatically reduced when examined at eco-regional scales. We will demonstrate below that even when we use relative tree species richness at an ecoregion-level, the trendline and standard error bands are similar to the global trend as reported by Liang et al. 2016.

      For this demonstration, we selected the three grassland biomes (i.e. Montane Grasslands and Shrublands, Flooded Grasslands and Savannas, and Temperate Grasslands, Savannas and Shrublands), because your graphs on Page 32 suggest that these biomes do not conform to the global trend of Liang et al. 2016. For this analysis, we combined the three biomes, because there are fewer than 2000 plots for Montane Grasslands and Shrublands and Flooded Grasslands and Savannas together, and almost half of the plots within these two ecoregions are monocultures.

      The combined grassland biomes have a total of 23,133 plots (including ~3000 monoculture plots). For simplicity, we ignored the spatial autocorrelation, and the result from a robust bootstrapping estimation (Efron and Tibshirani, 1993) is quite consistent with the global trend of Liang et al. 2016 (Fig. B1) (see the Appendix for the R script for estimating BPR for the grassland biomes). This is also generally true for most of the other ecoregions (not shown), as long as there are a sufficient number of plots and a sufficient number of mixed-species plots. In fact, the theta values we have produced to date across regions don’t systematically differ from the global one, although we are still working on making sure we are doing these appropriately. So we are unclear how you arrived at the values you did. Additionally, we also think that perhaps we (and you or anyone else working with these data) should eventually re-run everything eliminating data from the desert ecoregion. Looking at the total lack of productivity there it is hard to justify calling anything that went into that group a forest.

      We acknowledge that performing an ecoregion-level study would be a good supplement to Liang et al. 2016. We would be glad to collaborate with you or anyone else on this idea. Additionally we believe that examining alternative approaches, including non-parametric models, and different ways of standardizing either or both productivity and richness, to the global relationship would be worth doing.

      We also note that we have some residual questions about your approach. We are unable to understand how a global line like yours (your left panel, Figure 1) could average and max out around 2.5 for productivity when so many of the ecoregions with most of the data have means so much higher than that. Additionally, you call the x-axis of your first panel in Figure 1 “relative local species richness”, which confuses us. If your draws were across all data, then the ‘relative’ value is not ‘local’, even if you used the maximum value of each draw rather than the global maximum as we did (but we are not entirely sure what you did). If the maximum richness was taken from each draw, should your x-axis be “sample max” rather than “local max”? Are we misinterpreting what you did, or is this just unclearly labeled?

      https://uploads.disquscdn.c... <br /> Figure B1. Estimated BPR curve (with 95% confidence interval bands), using an ordinary least squares (OLS) model, based on the three grassland biomes (i.e. Montane Grasslands and Shrublands, Flooded Grasslands and Savannas, and Temperate Grasslands, Savannas and Shrublands). We converted species richness (S) to relative species richness (S_hat): S_hat = S *100 / 271.

      (2) Plots are overwhelmingly from temperate forest; indeed only some 2500 plots are from the tropics (equivalent to 0.4%), despite these forests representing around 30% of the world’s forest. Stratifying the plots accordingly weakens the TSR-P-relationship.

      Response: Thanks for the concern raised in your first sentence. We are well aware of that problem, and have even discussed it in our paper. Of course, this is just one more case of a general trend of under-documentation of all species (not just trees) from developing countries. This is a problem all researchers from developed countries should at least be aware of and try to mend as best we can; we at the GFBi are doing our part and are currently trying to collect more samples from the tropics for future research studies.

      Regarding your second point, we recognize that stratifying the plots may make the results more robust, but its effect would be limited and will not alter the overall global trend, because you already stated in your comments (Page 20) that “the (stratification) effect is moderate, with slightly lower values than the original non-stratified approach. This result suggests that also with non-stratified sampling always some tropical plots with high species richness are drawn, making the original Š robust to unrepresentative sampling.”

      Additionally, because the data are overwhelmingly temperate, roughly 3% boreal, and <1% tropical, and draws in Liang et al. 2016 were random across the globe, most of the 500-stand draws in our original 2016 paper were likely to have most data from non-tropical sites, so the influence of tropical high-diversity, high-productivity sites was likely modest, unless they had extremely high influence per datum on the overall fitted function because of their position in data space (which is possible). This is relevant to your concern (above) about our global result being influenced by the sharp gradient from boreal to tropical forests in both productivity and richness. Similarly, boreal stands would not have shown up very often: maybe 15 or so times on average in each 500-stand draw, with tropical stands drawn twice or so on average in each draw. In contrast, if our data had hypothetically been roughly equally representative of boreal, temperate and tropical forests, the global relationship might have been much more influenced by the gradient from low-diversity, low-productivity boreal to high-high tropical. In other words, our original data and fits were likely strongly temperate in flavor, despite our concerns about the undue influence of the boreal-tropical gradient. It may in fact be that we should have a different concern: not that the boreal-tropical gradient exerted too much influence on our published global fitted relationship of productivity-richness, but that our global analysis ‘undercounted’ the impact of tropical and boreal forests on the global relationship, given that the vast majority of stands in each 500-lot draw were temperate. We are not yet sure how best to check these issues.
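
      The per-draw counts mentioned above follow from simple expectation arithmetic, sketched below. The shares (about 3% boreal, about 0.4% tropical) are the figures quoted in this thread; treating the 500 picks as independent is a simplifying assumption (the actual draws were without replacement from a very large pool), and `draw_summary` is a hypothetical helper name.

```python
# Sketch: expected number of stands from a biome in one 500-stand draw, and the
# chance that a draw contains none, assuming independent picks (approximation).

def draw_summary(share, draw_size=500):
    """Return (expected count, probability of zero stands) for a biome share."""
    expected = draw_size * share
    p_none = (1.0 - share) ** draw_size
    return expected, p_none

shares = {"boreal": 0.03, "tropical": 0.004}  # shares quoted in the thread
for biome, share in shares.items():
    expected, p_none = draw_summary(share)
    print(f"{biome}: expected {expected:.1f} stands per draw, P(no stands) = {p_none:.3g}")
```

      Under these assumptions, roughly 13% of draws would contain no tropical stand at all, which is in line with the point that individual fits were mostly temperate in flavor.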

      (3) In the spatial regression model, distances between plots were computed without taking the spherical nature of earth into account. This had little effect on the slope estimate of the TSRP-relationship.

      Response: Thank you for sharing your insight into and findings about this. We appreciate it. We recognize that calculating distances between plots by taking the spherical nature of earth into account may slightly improve the accuracy of our estimated BPR. The magnitude of such improvement is yet to be determined by future research.

      (4) The computational burden of the spatial model required subsampling the data to 500 data points. The authors did not correctly compute confidence intervals for this approach, wrongly interpreting subsampling as bootstrapping and additionally incorrectly computing bootstrap standard errors. A correct subsampling-based estimation led to approximate tripling of the reported confidence interval.

      Response:

      Thank you for raising this concern. Bootstrapping is only efficient at depicting a global trend if the re-sampling size is close to the global sampling size (Efron and Tibshirani 1993). However, for our study, the 500-plot subsample is far from our global sampling size (>700,000). Considering that you used a minimalist approach, in which “while Liang et al. (2016) run 10000 bootstraps, we only do 50,” (p.8) your suggested global results in fact only represent ~50*500=25000 plots, or approximately 3 percent of the global sample. In other words, there is a 97% information loss in your approach.

      In the textbook description of bootstrapping by Efron and Tibshirani (1993), echoed by many (e.g. Hesterberg 2015), it is outlined that the bootstrap sample should be equal in size to the original sample, and that any smaller re-sampling size would lead to a biased estimate of the standard error. This is also the main reason why you did not find a significant global BPR, as you should have.

      Allow us to demonstrate, with R-code (in blue) and outputs, how we have derived our results. While there is well-established literature regarding the validity of the subsampling method we have taken, less is known about an appropriate choice of the size of a subsample and the number of subsamples. With a global sample size over 600,000, we have chosen the subsample size to be 500 and a total of 10,000 subsamples out of consideration for computational feasibility and adequate representation of the global sample. Our approach leading to these choices is indeed ad-hoc and the standard errors are at best approximations. We welcome ideas and possible collaboration to establish more rigorous approaches. On the other hand, with a large amount of data and thus information, statistical significance is not tenuous to attain.

      1. For each random subset of 500 plots, we consider this subset a separate study unit (one can regard this as equivalent to a subregion). In the Geospatial Random Forests model, we calibrate one biodiversity-productivity relationship (BPR) curve based on this subset. With a global sampling size of >700,000, we find that it takes more than 2,000 subsets of 500 samples in our global BPR analysis, so that any single plot would have been accounted for at least once in the analysis. To be safe, we used 10,000 subsets (i.e. iterations) (Fig. 1);

      https://uploads.disquscdn.c... <br /> Figure 1. A graphic demonstration of the Geospatial random forests model. We randomly select 500 plots from across the world as one study unit or “subregion” (yellow), calibrate one biodiversity-productivity relationship (BPR) using the model, and draw a ceteris paribus BPR curve. Repeating this 10,000 times provides sufficient global coverage, as each plot is covered ~7 times on average (500*10000/720000≈7). Note that actual subregions can be spatially discontinuous depending on the randomization.

      A major strength of this approach is that it does not require any a priori assumption on the population distribution or any a priori delineation of forest type units across the world, within which forests have similar conditions. This is especially useful because there is no universally accepted forest type delineation across the world (FAO 2015).

      1. Load the global data set, note that we did remove plots with extreme species richness or productivity values (i.e. those beyond 99.996th percentile), and plots with zero species richness or productivity.

      # Load packages

      library(nlme)

      # Load plot-level data

      # Download GFB1_data_figshare.xlxs from Figshare and convert to a csv file

      data<- read.csv("GFB1_data_figshare.csv")<br /> data <- subset(data, P>0)<br /> data <- subset(data, S>0)

      quantile(data$S,0.99996)<br /> quantile(data$P,0.99996)

      data1 <- subset(data,data$S<=270 & data$P<=533 & data$S >0 & data$P>0) # removed 894 plots with 0 or extreme S and P values

      1. For each subset of 500 plots (without replacement), we consider this subset a separate study unit (one can regard this as equivalent to a subregion). We draw one BPR curve based on this subset, using our geospatial random forests, by keeping other variables constant at their sample mean, only increasing species richness from 1 to 271 (the global maximum).
      #############################################################
      ############# Derive Global GeoRF Estimation #####################
      #############################################################

      logP <- log(data1$P)

      # jitter coordinates to avoid duplicated values

      Lon1 <- data1$Lon+ runif(length(data1$Lon),-0.0001,0.0001)<br /> Lat1 <- data1$Lat+ runif(length(data1$Lat),-0.0001,0.0001)<br /> data1 <- cbind.data.frame(data1, logP, Lat1, Lon1)

      ###### Loop ##################

      coef <- matrix(0, nrow=10000, ncol=20) # Coef Matrix

      for(i in 1: 10000) {<br /> tryCatch({<br /> training <- data1[sample(1:nrow(data1), 500, replace=FALSE),] # turn 'replace' off to maximize inclusion of new plots<br /> logS <- log(training$S)<br /> training <- cbind.data.frame(training, logS)<br /> gls1 <- gls(logP~ logS + G + T3 + C1 + C3 + PET + IAA + E, data=training, method="ML", corr= corSpher(form = ~ Lon1 + Lat1, nugget = TRUE), control=glsControl(singular.ok=TRUE))<br /> coef[i,3] <- i<br /> coef[i,4] <- logLik (gls1)<br /> coef[i,5] <- AIC (gls1)<br /> coef[i,6]<- BIC (gls1)<br /> #Generalized coefficient of determination<br /> gls0 <- gls(logP~ 1, data=training, method="ML") <br /> R2 <- 1-exp(logLik(gls0)-logLik(gls1))^(2/500)<br /> coef[i,7]<- R2<br /> coef[i,8] <- coef(gls1)[1] <br /> coef[i,9] <- coef(gls1)[2]<br /> coef[i,10] <- coef(gls1)[3]<br /> coef[i,11] <- coef(gls1)[4]<br /> coef[i,12] <- coef(gls1)[5]<br /> coef[i,13] <- coef(gls1)[6]<br /> coef[i,14] <- coef(gls1)[7]<br /> coef[i,15] <- coef(gls1)[8]<br /> coef[i,16] <- coef(gls1)[9]<br /> coef[i,17] <- 0<br /> # Baseline (S=1) productivity<br /> # logS + B1 + T3 + C1 + C3 + PET + IAA + E<br /> newdata <- data.frame(logS=0, G=mean(training$G), T3=mean(training$T3), C1=mean(training$C1), C3=mean(training$C3),PET=mean(training$PET), IAA=mean(training$IAA), E=mean(training$E))<br /> coef[i,20] <- exp(predict(gls1,newdata))<br /> #counter (total matches the 10000 iterations of the loop)<br /> cat(i, " of ", 10000, date(),"Theta=",coef(gls1)[2], "R2=", R2, "\n" )<br /> #remove files<br /> rm(training, newdata, gls1, R2)<br /> }, error=function(e){})<br /> }<br /> coef_df <- as.data.frame(coef)

      names(coef_df) <- c("0", "0", "i", "Loglik", "AIC", "BIC", "R2","const","theta", "B", "T3", "C1", "C3", "PET", "IAA", "E", "0", "0", "0", "P_1")

      write.csv(coef_df, "global_estimates.csv")

      1. Repeating the foregoing step 10,000 times, we get a combined set of subregions that covers the entire global forest range. Meanwhile, we have 10,000 curves (green in the following Fig. 2) that represent possible BPRs across the world. Treating each region as an independent study unit, instead of a bootstrapping re-sample, we can calculate and plot the mean and standard error (SE) of the predicted BPR curves across the world as shown in the figure below (mean: black line, with red curves representing 95% C.I.)
      ##############################################################

      # Draw estimated Biodiversity-Productivity Relationship (BPR) curves #########

      data<- read.csv("global_estimates.csv")

      theta <- data$theta<br /> mean(theta)<br /> P_base <- mean(data$P_1)

      # Predict P over an increased S from 1 to global max (271), which corresponds to S_hat from 100/271 to 100

      S <- seq(1,271,1)<br /> S_hat <- S*100/271

      P_est <- data.frame(matrix(0, 10000, ncol =273))<br /> P_est[,1] <- P_base<br /> P_est[,2] <- theta

      for (i in 1:10000){<br /> P_est[i,3:273] <- P_est[i,1] * S ^ P_est[i,2]<br /> }

      # demonstration plot only shows the first 18 iterations

      plot(S_hat,colMeans(P_est[,3:273]), ylim=c(0,20), type="l",col = "blue", ylab="P")<br /> for (i in 1:18){<br /> P_est[i,3:273] <- P_est[i,1] * S ^ P_est[i,2]<br /> lines(S_hat,P_est[i,3:273],col = "green")<br /> }

      # Confidence intervals

      lines(S_hat,colMeans(P_est[,3:273])+1.96*apply(P_est[,3:273], 2, sd)/sqrt(10000), ylim=c(0,20), type="l",col = "red")<br /> lines(S_hat,colMeans(P_est[,3:273])-1.96*apply(P_est[,3:273], 2, sd)/sqrt(10000), ylim=c(0,20), type="l",col = "red")

      https://uploads.disquscdn.c... <br /> Figure 2. Sample BPR curves from the 10,000 estimated curves from across the world. The figure is nearly identical to Fig. 3A of Liang et al. 2016, with some minor differences due to the random process. For easy comparison across the world, we set the base value of P as 2.5m3ha-1yr-1, and convert species richness (S) to relative species richness (S_hat): S_hat = S *100 / 271.

      1. To demonstrate that this estimated global mean and confidence interval from our Geospatial random forests model (Fig. 2) is a good proxy of the true global BPR trend, we compare this result with an outcome from an ordinary least squares model (OLS), of which the estimates are based on the entire sample (with >700,000 plots).

      ## A comparison with OLS model ##

      data <- read.csv("GFB1_data_figshare.csv")<br /> data1 <- subset(data,data$S<=270 & data$P<=533 & data$S >0 & data$P>0) # removed 894

      logS <- log(data1$S)<br /> ols1 <- lm(logP~ logS + G + T3 + C1 + C3 + PET + IAA + E, data=data1)

      theta <- coef(ols1)[2]<br /> summary(ols1)<br /> se_theta <- 2.100e-03

      S <- seq(1,271,1)<br /> S_hat <- S*100/271<br /> P_base <- 2.5

      P_est_ols <- P_base * S ^ theta # mean predicted BPR<br /> P_est_ols_ub <- P_base * S ^ (theta+1.96* se_theta) # upper bound of 95% CI<br /> P_est_ols_lb <- P_base * S ^ (theta-1.96* se_theta) # lower bound of 95% CI

      plot(S_hat, P_est_ols, ylim=c(0,20), type="l",col = "blue", ylab="P")

      # Confidence intervals

      lines(S_hat,P_est_ols_ub, ylim=c(0,20), type="l",col = "red")<br /> lines(S_hat,P_est_ols_lb , ylim=c(0,20), type="l",col = "red")

      The corresponding line plot is printed below. According to this graph, the BPR has the same curvature, but estimated productivity (P) is in general 10-20% lower than the estimated values from the Geospatial random forests, presumably due to the fact that spatial autocorrelation is not accounted for in the OLS model. Nevertheless, the confidence interval from the OLS model generally matches the confidence interval from the Geospatial random forests (Fig. 2).

      https://uploads.disquscdn.c... <br /> Figure 3. Estimated BPR curve (with 95% confidence interval bands), using an ordinary least squares (OLS) model, based on the entire GFB sample with >700,000 plots. For easy comparison across the world, we set the base value of P as 2.5m3ha-1yr-1, and convert species richness (S) to relative species richness (S_hat): S_hat = S *100 / 271.

      (5) As noted earlier (Schulze et al., 2018), some 4% of the plots had productivity values (far) beyond what is biologically plausible (Stape et al., 2010). The likely reason is that small plots with large inventory errors in the productivity may lead to erratically high values. Not taking this into account in the analysis, e.g. by down-weighting plots with productivities above 30 m2ha-1y-1 at least indicates an unreflected use of data.

      Response: <br /> Thank you for your concern. As shown in the R-code above, we have removed extremely high productivity values, above the top 0.004 percent quantile (P<=533). It is admittedly a difficult task to filter out the potentially biased values from such a large sample, but we are working with data scientists and data contributors to further improve the accuracy of our data.

      References

      Efron, B., and R. J. Tibshirani. 1993. An introduction to the bootstrap. Chapman & Hall, New York.<br /> FAO. 2015. Global Forest Resources Assessment 2015 - How are the world’s forests changing? Food and Agriculture Organization of the United Nations, Rome, Italy.<br /> Hesterberg, T. C. 2015. What teachers should know about the bootstrap: Resampling in the undergraduate statistics curriculum. The American Statistician 69:371-386.<br /> Jepson, P., and R. J. Whittaker. 2002. Ecoregions in Context: A Critique with Special Reference to Indonesia. Conservation Biology 16:42-57.

      Appendix: R script for estimating BPR for the grassland biomes

      # Estimate BPR curves by ecoregion

      # (C) Jingjing Liang 2018

      library(nlme)

      # Load plot-level data

      # Download GFB1_data_figshare.xlxs from Figshare and convert to a csv file

      data<- read.csv("GFB1_data_figshare.csv")<br /> data <- subset(data, P>0)<br /> data <- subset(data, S>0)<br /> attach(data)

      # Montane Grass and shrubs

      data1 <- subset(data, data$Ecoregion==10 | data$ Ecoregion ==9 | data$Ecoregion ==8)<br /> data1 <- subset(data1,data1$P<=quantile(data1$P,0.999))

      ######## BPR Estimation
      ###### Bootstrapping ##################

      coef <- matrix(0, nrow=50, ncol=101) # Coef Matrix

      for(i in 1: 50) {<br /> tryCatch({

      training <- data1[sample(1:nrow(data1), 23133, replace=TRUE),]<br /> logP <- log(training$P)

      Lat1 <- training$Lat + rnorm(length(training$Lat))<br /> Lon1 <- training$Lon + rnorm(length(training$Lon))<br /> training <- cbind(training, logP, Lat1, Lon1)

      S_max <- max(training$S)<br /> SR <- training$S/S_max*100

      logS <- log(SR)<br /> training <- cbind(training, logS)

      lm1 <- lm(logP~ logS + G + T3 + C1 + C3 + PET + IAA + E, data=training)

      Derive ceteris paribus BPR curve

      newdata <- data.frame(logS=log(seq(1,100,1)), G=mean(training$G),T3=mean(training$T3), C1=mean(training$C1), C3=mean(training$C3),PET=mean(training$PET), IAA=mean(training$IAA), E=mean(training$E))<br /> coef[i,1] <-coef(lm1)[2] #theta<br /> coef[i,2:101] <- exp(predict(lm1,newdata))<br /> plot(coef[i,])<br /> #counter<br /> cat(i, " of ", 50, date(), "\n" )

      remove files

      rm(training, newdata, gls1)

      }, error=function(e){})<br /> }

      coef_df <- as.data.frame(coef)

      write.csv(coef_df, "Ecoregion_Grasslands_BPR.csv")

      Plot mean and 95% CI of bootstrapping

      plot(seq(1,100,1),colMeans(coef_df[,2:101]), ylim=c(0,6), type="l",col = "blue", ylab="P",xlab="S_relative")

      Confidence interval

      lines(seq(1,100,1),colMeans(coef_df[,2:101])+1.96*apply(coef_df[,2:101], 2, sd), ylim=c(5,8), type="l",col = "red")<br /> lines(seq(1,100,1),colMeans(coef_df[,2:101])-1.96*apply(coef_df[,2:101], 2, sd), ylim=c(5,8), type="l",col = "red")

      End of the code
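
      For readers who do not use R, the final step of the script above (the bootstrap mean with a band of plus or minus 1.96 standard deviations per S value) can be sketched in Python. This is only an illustration under assumptions: the function name `bootstrap_band` and the toy numbers are hypothetical and not part of the authors' code, and the sketch covers the band computation only, not the regression.

```python
from statistics import mean, stdev


def bootstrap_band(curves, z=1.96):
    """Return (centre, lower, upper) lists, one value per column.

    Each row of `curves` is one bootstrap replicate of the predicted
    curve, playing the role of columns 2:101 of the `coef` matrix in
    the R script. The band is mean +/- z * sample standard deviation.
    """
    columns = list(zip(*curves))  # transpose: one tuple per S value
    centre = [mean(col) for col in columns]
    spread = [stdev(col) for col in columns]  # n - 1 denominator, like R's sd()
    lower = [m - z * s for m, s in zip(centre, spread)]
    upper = [m + z * s for m, s in zip(centre, spread)]
    return centre, lower, upper


# Toy input (hypothetical values): three replicates, two S values.
toy = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
centre, lower, upper = bootstrap_band(toy)
```

      Note that this is a normal-approximation band built from the spread of the replicates; a percentile interval taken directly from the replicates would be an alternative design choice.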

    1. On 2018-11-27 10:17:30, user Camila wrote:

      I find your paper very interesting and I appreciate that you implement Gene Ontologies to build your two dimensionality reduction models. It is a sensible idea to use the biological knowledge that has been collected so far. I would like to see how robust your method is on a bigger and more diverse dataset. Maybe you don’t mention it in your paper, but it would be nice if you had a GitHub repository with the documented implementation of your code. One last thing: instead of using ZIFA, you could compare the performance with SIMLR, a multi-kernel learning method built by some of the authors who developed ZIFA, which has less stringent conditions on the data and has already been shown to outperform ZIFA for dimensionality reduction of single-cell RNA-seq data.

    1. On 2018-11-10 06:51:22, user cabbageleek wrote:

      Nice work, but you need to improve the write up to get any traction!<br /> I hope my comments will help you improve the manuscript, please take them that way.<br /> The title will be meaningless to most people. species2vec is not the novel method, but it is the code that implements the model.<br /> What is "species representation"? This is not a commonly used term for anything ecological.<br /> The abstract should explain the research in much clearer language. Currently, it is only comprehensible to a tiny number of people.<br /> Say what you are trying to achieve in very simple language. For example, you should not start the abstract with the words "Word embeddings", because that is meaningless to most people who might be interested to read it. The explanation of word embedding is in the methods, but it should be moved to the introduction.<br /> You mention the skip-gram model in the abstract with no explanation at all. <br /> There is no point having the species names on figure 3, as they are unreadable. <br /> Figures should be understandable without reference to the text...<br /> Figure 4 needs a better legend. What do the colour codings mean?<br /> Figure 5 needs a better legend.<br /> The DOI of the GBIF download should be cited.<br /> I assume the words "substitude" and "continious" are spelling mistakes.<br /> I would avoid using the term subword without explanation, it is a rather obscure term and only known to a small clique of researchers and not to species distribution modellers.<br /> In figure 3 the cluster Africa is further from Europe than South America and Australia is closer to Europe than South America. This doesn't make sense from a biogeographic point of view. Can you explain this? Did you remove the introduced species before doing the analysis?

    1. On 2018-11-02 23:26:27, user Wouter De Coster wrote:

      Dear authors,

      Thank you for this interesting work. I'll read it later more in depth, but I have some quick feedback. It seems the URL to your tool is only available at the very end of the manuscript, although it would be convenient to have it earlier, preferably at the end of the abstract. Since your tool is the main deliverable of this work, you should put it more in the spotlight and make it easier to find. My second comment is about the use of Python 2.7. It seems your code is not compatible with Python 3, which is a problem. Python 2.7 is not going to be supported for much longer, and by publishing tools with an outdated Python interpreter now you are not writing future-proof software.

      Cheers,<br /> Wouter

    1. On 2018-09-07 17:18:34, user Alexander Predeus wrote:

      Congratulations to the authors on this preprint - very promising result, which could be useful for numerous applications. Would it be possible to publish the R code used for finding the associations on github? Thank you in advance!

    1. On 2018-04-08 06:58:41, user dante wrote:

      The yellow color code present in the pre-print and in https://public.tableau.com/... <br /> seems to have two components: one, V1, which looks like it is associated with West Siberian HG, and the other, V6, which looks South Asian. There are five colors as opposed to K=6 ADMIXTURE. Could you please provide separate color codes for V1 and V6?

    2. On 2018-04-03 03:01:35, user bennedose wrote:

      This paper has some of the following issues which I find interesting. I believe that geneticists are mistaken when any good work that they do is mixed with the unscientific hypotheses about language spread that linguists have established for the last 150 years. Genes do not code for language. Genetic fact should not be mixed up with tenuous linguistic hypotheses.

      1. The linguistic dates of IE spread are taken from David Anthony (ref no 46), whose assumptions linking the archaeology of Andronovo grave pits and Sintashta culture with grossly erroneous translations of the Rig Veda are untenable. In particular, Anthony’s linking of Rig Veda 10:18 with Kurgans is based on an incorrect translation, apart from other assumptions.

      2. This paper reinforces what was earlier suggested by Reich et al and Priya Moorjani (refs 44 and 45): that the ASI-ANI mix occurred within the last 2000-4000 years. Both this paper and the earlier Reich paper suggest that there was a lack of mixing of ASI and ANI before that. In fact this paper says that ASI and ANI were "largely unformed" 4000 years ago (page 14).

      Now here is what is odd. If ANI and ASI ancestry represent IE speakers and Dravidian language speakers (respectively) and if they started mixing as recently as in the last 4000 years when the migrations are said to have occurred, but the admixture had NOT occurred before that date - it tears down all assumptions of a "Caste system" having been created by the migration of ANI/IE speakers displacing ASI/Dravidian language speakers. If they were mixing, they were not remaining separate. It is as simple as that.

      If one assumes (probably wrongly) that IE languages came to India about 1500-2000 BCE then it appears that the real mixing of ASI and ANI started AFTER those people came, contrary to all assertions by linguists that endogamy commenced in India with the arrival of IE speakers in India.

      3. Of course all this ignores and fails to explain how the oldest IE language Sanskrit has references to hydronyms and the riverine geography of IVC area from 3000 BCE and the start of aridity in 2000 BCE - adding to the growing body of evidence that linguistic hypotheses have depended upon dubious assumptions.

      It is notable that German and Greek have perhaps 25-40 % Indo-European language content, the rest of the words having a non Indo-European or pre- IE origin. However Sanskrit, the oldest known IE language has 97% Indo-European word content. This anomaly cannot be explained by saying that no language was spoken in India before the arrival of IE languages in 1500 BCE.

      4. This paper makes statistical predictions about the entire Indian subcontinent with zero samples from IVC or anything east or south of that (as clearly admitted in the paper). The closest "south Asia" samples are from Swat in Pakistan. Swat is closer to "Turan" (Iran, Uzbekistan, Turkmenistan) than to major IE speaking areas of India like Bengal, south Gujarat, Maharashtra and the Eastern Gangetic plain. So linguistic conclusions connected to genetics need to be viewed with caution.
    1. On 2018-08-13 20:08:36, user Jeff Barrett wrote:

      I was asked to review this manuscript for biOverlay, and I am posting my review here as well.

      There has been a lot of attention to the role of de novo coding mutations in autism and other neurodevelopmental disorders, and a natural next question is whether it's possible to identify non-coding mutations that confer disease risk. One of the major challenges to this is how to understand the "regulatory code" and distinguish "synonymous" from "non-synonymous" mutations outside of genes. This paper applies a deep learning (as the authors note six times) algorithm, DeepSEA, to this problem.

      The major result is that DeepSEA successfully predicts a systematic difference between de novo mutations in individuals with autism and their unaffected siblings, which is an impressive achievement. One of my major suggestions for this paper is to provide more information about the input data (de novo mutations called from whole genome sequence), as this is notoriously tricky. I'm somewhat less concerned because the dataset provides a natural internal control between affected individuals and their siblings, but it would still be good to see more detail. For example:

      • How many de novos were called per proband and per sib? This is a basic QC metric, but I couldn't find it.
      • More info would be helpful: "Further filtering was then applied to remove variants that were called in more than one SSC families."
      • The paper refers to 127,140 de novo SNVs. I assume these are only non-coding (and coding mutations are stripped out), but it's not totally clear.

      DeepSEA predicts biochemical disruption, and these predictions were further trained on curated HGMD disease mutations and variants observed in 1000 Genomes. What happens if the predictions from DeepSEA are used directly in the autism data? The noncoding disease mutations in HGMD might be a problematic training set, as there are not that many known, and some may not be actually pathogenic, even in the curated set.

      Further analyses (e.g. of tissue specific expression and enriched biological functions) provide additional support for the main findings.

      For a follow-up paper, someone (the authors or another interested party) should run this analysis on the Deciphering Developmental Disorders dataset (which I was involved in), which should have good power to find specific causal mutations: https://www.nature.com/arti...

      Minor comments:<br /> The authors suggest that 30% of simplex ASD probands have a de novo coding cause (and their point is that this is not very much), but I think that's high. I'm not sure where the number comes from, as ref 3 finds diagnostic mutations for 11%, and their Sup Note says 2.4%.

      The comparison between versions of DeepSEA is described only fleetingly: "leading to significantly improved performance, p=6.7x10-123, Wilcoxon rank-sum test". More generally, one needs to read that paper to really understand what's going on. This is often the case, but a bit more of a summary of the method would help.

      On page 18 there's a repeated word in "40 SSC families families".

    1. On 2018-08-09 22:28:17, user bioinfo clemson wrote:

      Hi Colin,

      Thanks very much for your interest in DeepSignal! This DeepSignal is not ours. We have not released our source code yet, but will do so soon. I will keep you updated.

      Best,<br /> Feng

    1. On 2018-07-25 21:06:13, user Laurent Jacob wrote:

      A short version of this pre-print is now published as a Genome Biology<br /> correspondence<br /> (https://genomebiology.biome...).

      An updated version of the complete manuscript is available at<br /> ftp://pbil.univ-lyon1.fr/pu... (we<br /> are not allowed to update the bioRxiv pre-print now that the paper is<br /> published). The corresponding code is available at<br /> ftp://pbil.univ-lyon1.fr/pu...<br /> and the data at<br /> ftp://pbil.univ-lyon1.fr/pu....

      We thank Drs Timmons and Haibe-Kains for their comments on our<br /> analysis. Below are our answers to the main points raised in the<br /> discussion.

      • In our experiments on chronological age prediction, Dr Timmons points<br /> out the fact that we are (i) randomly sampling from the 25%<br /> (~13,500) probe-sets with largest median absolute deviation on<br /> training data rather than the full set of 54,000 probes and (ii)<br /> only applying each sampled set of 150 probe-sets on a single dataset<br /> as opposed to assessing its performance across datasets.

      Regarding (i), Affymetrix arrays are known to contain a large number<br /> of probe-sets with non-specific hybridization [1]. These probe-sets<br /> provide a noisy signal and filtering them out is a standard<br /> procedure. However as pointed out by Dr Timmons, chronological age<br /> explains a large proportion of the variance across the training<br /> dataset, so that retaining high variance probe-sets may enrich the<br /> sampling pool in genes associated with age (as opposed to just<br /> removing noisy probe-sets). In the updated manuscript, we avoid this<br /> problem by retaining probe-sets with high average intensity across the<br /> training dataset. We also remove the 670 probe-sets identified by [2]<br /> as being predictive of age from our sampling pool. These 670 include<br /> their 150 healthy ageing gene signature (HAGS).

      As to (ii), we ranked the AUCs of 10,000 random sets of 150 probe-sets<br /> plus the HAGS on each dataset, and represented each set by its median<br /> rank across datasets. Figure 1 in our report suggests that the most<br /> favorable case for the HAGS is a restriction to the four muscle<br /> samples using external validation (the HAGS was built on a dataset<br /> measuring gene expression from the same tissue, on the same platform).<br /> 9% of the randomly sampled sets of 150 probe-sets have a better median<br /> rank than the HAGS on external validation across these four<br /> datasets. In other words, the null hypothesis ``the performance of the<br /> HAGS for age prediction on muscle gene expression measured by<br /> Affymetrix arrays is sampled from the distribution of performances<br /> provided by sets of 150 probe-sets randomly sampled from the 25%<br /> probe-sets with largest average intensity across training data<br /> excluding probe-sets identified as having good performance to predict<br /> age on training data'', can only be rejected at level 9%. Adding the<br /> brain dataset (still on the same Affymetrix platform), this proportion<br /> goes up to 17%.

      More details on this point are given in Section 2.2 of the updated<br /> report.
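
      The median-rank comparison described in (ii) can be sketched as follows. This is a minimal illustration under assumptions, not the code from the report: the function names `ranks_desc` and `fraction_better` and the toy AUC values are hypothetical, ties are given the best shared rank, and the ranking helper is quadratic, so it is meant for reading rather than for 10,000 sets.

```python
from statistics import median


def ranks_desc(values):
    """Rank values so the highest gets rank 1 (ties share the best rank)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def fraction_better(signature_aucs, random_aucs):
    """Fraction of random sets whose median rank beats the signature's.

    signature_aucs: one AUC per dataset for the signature of interest.
    random_aucs: list of random sets, each a list with one AUC per dataset.
    """
    n_sets = len(random_aucs) + 1  # random sets plus the signature
    per_set_ranks = [[] for _ in range(n_sets)]
    for d in range(len(signature_aucs)):
        # Rank the signature (index 0) together with all random sets.
        column = [signature_aucs[d]] + [aucs[d] for aucs in random_aucs]
        for idx, rank in enumerate(ranks_desc(column)):
            per_set_ranks[idx].append(rank)
    medians = [median(r) for r in per_set_ranks]
    better = sum(1 for m in medians[1:] if m < medians[0])
    return better / len(random_aucs)


# Toy example (hypothetical AUCs): two datasets, four random sets.
signature = [0.80, 0.70]
random_sets = [[0.90, 0.60], [0.70, 0.65], [0.85, 0.75], [0.60, 0.50]]
proportion = fraction_better(signature, random_sets)
```

      The returned proportion plays the role of the 9% and 17% figures quoted above: it is the level at which the null hypothesis of "no better than a random set" could be rejected.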

      • In our analysis of the Alzheimer's Disease (AD) cohorts, Dr Timmons<br /> writes that we are incorrectly combining AD and mild cognitive<br /> impairment (MCI) patients.

      Dr Timmons made this observation to us by email earlier when we sent<br /> him a first version of our report. We offered to re-run this analysis<br /> if he specified precisely which samples and labels had been used<br /> for the ROC analysis in [2] but never received these<br /> specifications. Merging MCI and AD patients does not seem to be<br /> completely invalid as it leads to similar AUCs as reported in [2]<br /> (Figure 5 in the paper). We also notice that AD and MCI have very<br /> similar projections along PC1 (Figure 3 in our report). We would<br /> however be interested to know which grouping was used in [2] and<br /> whether it actually leads to different conclusions.

      More details on this point are given in Section 3.1 of the updated<br /> report.

      • Dr Timmons writes that we refused to answer any question or share<br /> code for the past year, and that we failed to provide a single set<br /> of 150 probe-sets working across all data.

      The code used to produce all figures in our report was always<br /> available in the supplementary material section of this preprint. We<br /> also sent a copy of this code along with the report to Dr Timmons upon<br /> its submission two years ago and answered the questions he sent us<br /> subsequently. We were asked to produce a single set of 150 probe-sets<br /> working across all data during the reviewing process of our<br /> correspondence. We accepted and asked for code which reproduced the<br /> results of [2]. We wanted to avoid working on a wrong version of the<br /> data, pre-processing or analysis, as the authors could then have<br /> (rightly) claimed that our results were invalid. We were told they had<br /> declined to provide their code.

      [1] Michael Barnes, Johannes Freudenberg, Susan Thompson, Bruce<br /> Aronow, and Paul Pavlidis. Experimental comparison and<br /> cross-validation of the Affymetrix and Illumina gene expression<br /> analysis platforms. Nucleic Acids Research, 33(18):5914, 2005. doi:<br /> 10.1093/nar/gki890. http://dx.doi.org/10.1093/n...

      [2] Sanjana Sood, Iain J Gallagher, Katie Lunnon, Eric Rullman, Aoife<br /> Keohane, Hannah Crossland, Bethan E Phillips, Tommy Cederholm,<br /> Thomas Jensen, Luc J C van Loon, Lars Lannfelt, William E Kraus,<br /> Philip J Atherton, Robert Howard, Thomas Gustafsson, Angela Hodges,<br /> and James A Timmons. A novel multi-tissue RNA diagnostic of healthy<br /> ageing relates to cognitive health status. Genome Biol, 16: 185,<br /> 2015. doi: 10.1186/s13059-015-0750-x. URL http://dx.doi.org/10.1186/s13059-015-0750-x.

    1. On 2018-07-20 17:38:50, user netelabs wrote:

      In this preprint, we have attempted to be responsive to helpful critique from reviewers of a previous version submitted to a journal. The underlying DIY approach that drove development of ERNIE was made possible by open source tools. The case studies in the main manuscript are confirmatory of prior work in one case and showcase an approach to historical tracing using a combination of relational and graph data in a second. Code in the Github repo referenced in the preprint should enable replication of the repo and extension. Parsing code for Web of Science and Derwent data is freely available and can also be downloaded from the Github site. We intend to add a module for Scopus data in coming months. Data from Clarivate Analytics requires a license and can, therefore, not be shared by us. All the other data can be made available. Plans for a public query interface, new applications and continued methods development are also being developed.

    1. On 2018-04-20 22:22:43, user Death In Venice wrote:

      Hello to you both,

      Did you ever make this code available? I'm looking to make a similar comparison and the only R package I've found thus far to implement the KS test on survival objects is a deprecated package with a NAMESPACE issue.

      Best,<br /> DiV

    1. On 2018-03-27 17:35:43, user Haley Dylewski wrote:

      Hello! We are graduate students at the University of Tennessee reading BioRxiv submissions and we enjoyed your paper! We have compiled our thoughts into a review and hope you find it helpful!

      The authors hypothesize that initial TLR4 expression and concentration changes dictate how sepsis syndrome is initiated. A model consisting of three ODEs was constructed to simulate initial TLR4 flux between relevant compartments of the cell. The three ODEs describe three regions in phase space, each representing a unique cellular compartment relevant to TLR4 activation. For the most part the authors do a good job explaining the physiological basis of their model as well as discussing their assumptions and reasoning.<br /> The paper starts off strong, providing detailed rationale for the study and a strong biological background for the process described. The description of the model and the biological processes it represents seem sound; however, the paper becomes weaker after the model is presented. The authors do not discuss the significance of the model’s outputs in a meaningful way: What information do the phase portraits offer about the system dynamics? What can be interpreted biologically and how exactly was the model applied to clinical situations? <br /> Overall, the authors seem to be addressing an impactful problem and I encourage them to pursue this work further.

      Major Comments:

      1. Alternative models were not considered in this paper. In the application of the model, the authors ultimately relate mRNA production rate to patient outcomes. Such a relationship seems like it could be represented by a linear regression; I think it is important that the authors compare their model to a simpler one to justify using their model and ultimately make their paper stronger.

      2. Though the model design is based on biology, the application of the output is not adequately addressed. The authors predict patient outcomes using the model but do not specify how they do so (it seems that the signal resolution of patient data was compared to the simulated systems). What is the rationale for choosing phase portraits as the output? Does the model provide insight into the system or is its purpose simply to be a predictor of patient outcomes.

      3. Some of the Greek letters are missing from the text. This may be due to the format of bioRxiv, but it makes some of the discussion unclear. We would suggest checking the produced PDF when approving the manuscript on bioRxiv.

      4. Phi is the crux of this model; however, the variable is not sound:

      a. There are no units associated with it. Without units it is hard to interpret what the variable means. <br /> b. The use of phi in the model is not adequately justified. Phi represents the rate at which tlr4 mRNA is produced. This parameter appears to be treated functionally as the TLR4 flux entering the system as a result of LPS stimulation. The authors did not clearly explain the simplifications assumed in this relationship. Eukaryotic protein production is complex; is the rate of mRNA production representative of the flux in this system? Or is this variable chosen for convenience? There needs to be some kind of mathematical relationship linking the rate of mRNA production to the rate of protein production; otherwise, justification for omitting such a critical relationship should be stated.<br /> c. Three different response systems are modeled, representing varying levels of LPS stimulation. The parameter differentiating these conditions is phi; however, how the values were chosen is not clear to me. How was the cut-off of each response determined? It is said that phi “has been determined experimentally”. There is a reference on the measurement, but perhaps, because of the importance of this parameter, there should be a short explanation of how the measurement was done.

      5. Similar to 4-a, none of the parameters have units. If the model was made dimensionless, this needs to be clearly illustrated. This is critical for physiological interpretation and reproducibility.

      6. The values for the parameters are said to have been “varied until a stable limit cycle was attained”. What does this mean? And does using this method choose parameter values that adequately describe the physical system? It would be good to provide a comparison between the experimentally quantified parameters and the parameters chosen based on the stable limit cycle.

      7. At the bottom of page 5, it is mentioned that the model leaves out additional populations of TLR4 in the cell. The reason for omitting this from the model should be discussed.

      8. Is it possible that the model predictions would become negative? It appears that in eq. 1 yz are subtracted independently of x - is that well justified?

      Minor Comments:<br /> 1. The Supplemental data file does not contain the Tables referenced. It does contain access to the actual model code which is helpful.

    1. On 2018-02-26 15:55:56, user raonyguimaraes wrote:

      Hi there,

      Do you have any plans to make the source code and/or the trained model and/or the predicted scores from this study publicly available so others could also try to use it?

      Kind Regards.

    1. On 2018-02-23 06:40:49, user elise_is wrote:

      For the LOD situation, you may want to check these two papers: 1) Wei, Runmin, et al. "GSimp: A Gibbs sampler based left-censored missing value imputation approach for metabolomics studies." PLoS computational biology 14.1 (2018): e1005973; 2) Wei, Runmin, et al. "Missing Value Imputation Approach for Mass Spectrometry-based Metabolomics Data." Scientific reports 8.1 (2018): 663. They evaluated the MCAR, MAR, and MNAR situations, and developed an iterative Gibbs sampler based left-censored missing value imputation approach called GSimp. All the code of this approach is available online. I think it would be worth trying on your data. Hope this helps :)

    1. On 2018-02-20 14:20:03, user Ruben Milla wrote:

      this paper might be useful: Li et al. 2014. New Phytologist 203: 63–69

      Also, I missed information on what the region Chr4@16.92-17.23 codes for.

    1. On 2018-01-27 01:19:32, user Casey Greene wrote:

      I reviewed this paper at a journal. I thought that the journal in question would make the review public, but perhaps that is only after the paper is accepted. In the interests of improving the discussion of papers before they become published, I'm posting my review here as well.


      Confidential Competing Interests (required):<br /> None

      Reveal reviewer identity to authors (required): Yes

      General assessment and major comments (Required):<br /> The authors describe HAWK, a k-mer based approach to association analysis. The idea is certainly clever, and I can imagine this work as a jumping off point for other approaches to the analysis of genetic variants that differ between groups.

      I have some concerns about how the work is presented. The method discusses k-mer association analysis as a technique for "sequencing data." Within the manuscript, the method is applied to simulated E. coli genomic data and to the 1k genomes dataset.<br /> - If the authors want to suggest that this works for E. coli, or other bacterial, data they should apply the method to real genome sequencing data from these organisms. It seems like plasmids that vary with the test condition could make the approach somewhat computationally expensive (they would need to be built from k-mers). It'd be nice to see A) if this works in practice; and B) how scaling is affected. If this is not intended to be used for real microbial data, then perhaps the authors should note this.<br /> - The only application in the manuscript is to whole genome data. It seems like this approach would be a relatively inefficient way to deal with RNA-Seq data. Should the domain be refined?<br /> - Is the approach expected to work with exome sequencing data? If so, it would be nice to see an example showing that the capture process doesn't introduce any systematic biases that affect the method's false positive rate.

      I downloaded the software and it compiled successfully. It is a bit difficult to use. The documentation is also sparse. It would be helpful to have a wrapper script that would handle the most common workflow as well as documentation with one fully worked example. The version of the source code associated with the published paper should be archived to figshare, zenodo, or a similar service.

      Some assertions are made with regard to computational cost of competing methods in the intro. It would be helpful to me to see some benchmarking of HAWK.

      "We provide scripts to lookup number... as future work." This is fine to leave for the future, but can you at least provide some documentation of these scripts in the repository's README?

      Lines 277-280: is it possible that certain samples have different contamination? I'm not disputing that this is one possible explanation, but it doesn't seem like other possibilities (contamination, etc) have been ruled out to this point.

      Minor Comments:<br /> In "Counting k-mers", what is a sample for "appear once in a sample." Is this once in a condition, or is there a first stage of sample filtering before the k-mers are aggregated?

      The github repo contains DS_Store files. This should be added to .gitignore

      Both the GPL v2 and v3 licenses appear to be included with the CPP source code.

      The source code on the website has a version number, but there are no tags in the github repository. Please tag with the version number.

      In "verification with 1k genomes data": I think line #169 is referring to significant differences between the YRI and TSI samples using the standard calling algorithm. This paragraph could be reworded for clarity.

      typo: "While upto 20%"

    1. On 2018-01-22 03:30:20, user BenjaminSchwessinger wrote:

      Thanks for posting this preprint. The detail of analysis and the availability of all code is great. It is excellent to see more plant pathogenic obligate biotrophic fungi sequenced. My 'feel' is that these genomes may well look pretty different to some of the better studied non-obligate oomycetes and fungi e.g. 'two speed' genome with effectors clustering to TEs. I could conceive that at least a subset of effectors may well be required in obligate biotrophs as they have to infect the host to complete the life-cycle.

      Some thoughts and questions:

      • Would be great to see some read length statistics on your PacBio sequencing to get a better understanding why the genome is still in a good number of contigs.

      l. 146 Instead of beginning and end of contig I would use 5' and 3' prime of the sequence.

      l. 185 ff. I got confused here as the numbers didn't add up for me 6039 single-copy groups give rise to 6,844 one-to-one mappings? I think I get it after reading it several times, yet some rephrasing may well help. Else proteinortho with the synteny flag may have also been an option for doing this analysis.

      l. 223ff: The observation of smaller parts of the genome being reshuffled in DH14 vs. RACE1 is pretty interesting. We saw something similar comparing the two haploid genomes in wheat stripe rust fungus (see Figure 2, https://www.biorxiv.org/con...). I wonder how this all happens. Also, http://assemblytics.com/ may be a useful tool in future to compare two genomes with each other with regard to structural variations.

      l. 265ff: Great analysis on paralogous. We still need to do this for our candidate effectors, yet we saw an overall 'clustering' of candidate effectors. I liked the part of looking if SPs are enriched on certain contigs. Does this also hold true if you consider gene content and not only contig length?

      Figure 4A would be easier to interpret if it were normalized to the number of genes analyzed and n given within the figure.

      l. 353 ff: Mirrors what we found in wheat stripe rust and others in P. coronata, where candidate effectors do not reside close to TEs in general and not in gene sparse regions. We also see that candidate effectors such as CSEPs in Figure S2 C have no really close neighbours. This is pretty intriguing to me. Any thoughts on this? Have you tested if CSEPs are somewhat linked to BUSCOs, following the idea that some effectors are necessary in obligate biotrophs? If that is the case for you guys as well, I would be happy to look into whether the BUSCOs or effectors for which this is true are conserved.

      l. 380 ff: The analysis of a TE burst in Bgh is very interesting indeed. I think it would profit from a bit more detail on what kinds of TEs were found and how much each family covered. Figure 5 also lacks some details about the usage of all these acronyms used in the figure eg. BOTR? Increasing font size and including a key in the legend would be great.

      What I wonder with Bgh is where did all the old TEs go? Wouldn't you expect to have some of the older TEs still present around the same age/%id as in the other Blumeria? Within the Blumeria how many TE families were specific to each species? Could it be that your database does not include the most recent TEs from other fungi?

      Supplemental figure[:-3]: Not sure that joyplots are the best representation here. A circos plot may be a better visualization.

      Great work. Gave me some good pointers for my own work.

    1. On 2018-01-11 20:30:48, user Christopher Ried wrote:

      In your supplemental code, you might want to add a separate file with your session details by running sessionInfo() in R (see https://stackoverflow.com/q...) and saving the printout in a separate attached file. That way you can give your readers the ability to more completely replicate your analysis.

    1. On 2017-12-21 22:01:25, user awcm0n wrote:

      Mollie, great work, but I think there's a problem with your code for simulating from a fitted model in Appendix B:

      simdatsums=lapply(simdatlist, function(x){ <br /> ddply(x, ~spp+mined, summarize, <br /> absence=mean(count==0), <br /> mu=mean(count))<br /> })

      Error in .fun(piece, ...) : argument "by" is missing, with no default

    1. On 2017-11-17 18:56:02, user Robin Rohwer wrote:

      Here are my responses to Pat Schloss's thoughtful review:

      1. You are correct: "blastn is only used to split the dataset and once split, the data are classified using the Wang method." We avoid classifying with BLAST because it doesn't take into account the phylogenetic structure of the reference database, which is why the Wang classifier is preferred for classification. We use BLAST to split the dataset because the Wang classifier's algorithm can cause misclassifications if the database is really small and you try to classify a sequence with no similar references (what we called "forcing" in the manuscript). We agree this distinction should be added more clearly to the Fig. 1 description!

      2. We didn't delve into the creation of custom databases in this paper because we felt it was outside the scope: TaxAss is only a tool to use them, not to make them. To learn more about how the FreshTrain was created, we suggest reading the Newton et al. 2011 citation (ref 12). We also cite papers about creating other ecosystem-specific databases, some of which also have really detailed methods descriptions. See text starting on lines 105 and 295 to find these references.

      3. "I am also not clear why the authors did not want to pool FreshTrain with one of the comprehensive databases. A simple cat command would pool the two files producing a file that could then be used as a single reference." <br /> This was also my first thought when faced with trying to use the FreshTrain for classification! But it wouldn't work to simply concatenate two databases because some reference sequences would end up duplicated (corresponding to the yellow bars in Figure 2). For example, in freshwater the major bacterial lineage acI has a detailed phylogeny in the FreshTrain that includes multiple clades and tribes. Reference sequences for acI also exist in Greengenes; however, they're called ACK-M1 and don't have detailed genera or species under them. If the FreshTrain and Greengenes were concatenated, the Wang classifier would find two equally good references with different names. Then (because it is a stochastic algorithm that bootstraps its classification many times) it would place the acI OTU 50% of the time into "acI" and 50% of the time into "ACK-M1." That would result in a bootstrap confidence of ~50% for each, and the OTU would be called "unclassified." We'll add this into the discussion section about Current FreshTrain Usage.
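The 50/50 split described above can be illustrated with a toy simulation. This is only a sketch and not the Wang classifier itself: it assumes a hypothetical classifier that, on each bootstrap iteration, picks uniformly at random among references that tie for the best match. The function names and the 100-iteration default are illustrative choices.

```python
import random
from collections import Counter


def toy_bootstrap_confidence(tied_labels, iterations=100, seed=0):
    """Fraction of bootstrap iterations assigned to each label when several
    references tie for best match (assumption: ties are broken at random)."""
    rng = random.Random(seed)
    counts = Counter(rng.choice(tied_labels) for _ in range(iterations))
    return {label: counts[label] / iterations for label in tied_labels}


def call_taxon(confidences, cutoff=0.80):
    """Return the best-supported label if it clears the cutoff,
    otherwise 'unclassified'."""
    label, conf = max(confidences.items(), key=lambda kv: kv[1])
    return label if conf >= cutoff else "unclassified"


# Concatenated databases: the same reference appears under two names.
merged = toy_bootstrap_confidence(["acI", "ACK-M1"])
# A single database: only one name exists for that reference.
single = toy_bootstrap_confidence(["acI"])

print(merged, call_taxon(merged))
print(single, call_taxon(single))
```

With two equally good names, neither can reach an 80% cutoff, so the toy OTU ends up "unclassified" even though both names refer to the same organism.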

      "Also, a downside of the Greengenes database is that the core reference appears to be mothballed going forward while RDP and SILVA are still actively being developed." <br /> Exactly! This is why TaxAss is great: it gives you the flexibility to pair your existing custom database with the most current version of whichever comprehensive database you prefer. It would take a lot of effort to incorporate the FreshTrain's phylogenetic structure into SILVA, and then that would need to be repeated every time SILVA or the FreshTrain is updated. That type of ARB work is beyond the scope of many 16S analyses, so TaxAss is a way for everyone to use the most up-to-date version. In regard to Greengenes being "mothballed," we agree that it doesn't look like it will be updated going forward. But we used it for this paper because it is still pretty comprehensive and it includes classifications to the species level (SILVA only goes to genus), which allowed for a better comparison since we were placing a lot of emphasis on improved fine-level classifications.

      4. "One motivation that the authors state for the method is the issue of 'forcing'. I would call these 'false positives', but I get their point." <br /> I like this suggestion for vocabulary and it seems like forcing is confusing to a lot of people. I'm also considering just calling it misclassification.

      "The authors raise this issue numerous times. Yet I was unable to find a citation that quantifies forcing and the authors do not appear to measure the amount of forcing in their data. Perhaps this is what they were getting at in Figure 3? If that is the case, then I am a bit troubled because they are accepting the FreshTrain data as the ground truth, when it has not been validated yet."<br /> You're right that this is what we're trying to get at in Figure 3! (Specifically, the red bars of figure 3 over named organisms.) To clarify, the ground truth in figure 3 is TaxAss (FreshTrain + Greengenes), and the forcing is observed through a FreshTrain-only classification. I don't know a way to quantify the exact amount of forcing/false positives because it is impossible to know the "truth" and it would be different with different datasets and different databases. However, we can see with some sequences that it is happening to an extent that it would negatively impact our interpretations, so I would argue that is evidence enough that it's a problem.

      How we know the forcing/false positive problem is real: We know that the FreshTrain database is limited in scope and does not contain all organisms in a dataset. And some organisms it doesn't contain have good references in Greengenes so we know what they are. So when we classify organisms we know are not contained in the FreshTrain using the FreshTrain, and we see some of them end up with a high-confidence FreshTrain classification, we know it's a problem! We can't say for sure exactly *how big* a problem it is, but we look at its practical impact on the datasets we tested. In Fig. 3 and Supp Fig. 3 we define organisms not included in the FreshTrain as those below the BLAST percent identity cutoff. When those are classified in the FreshTrain as part of a FreshTrain-only classification we see that most of these end up "unclassified", but that some end up with classifications that we know to be incorrect "forcing." The misclassified sequences seem like a small problem at the phylum level, but at finer taxonomic levels we see that they will negatively impact our analysis. Just based on the rank abundance curves, you can see that incorrect sequences are being lumped in with the major lineages adding error to our analyses. We are not the first to notice this problem, Woodhouse et al. (Ref. #23) also observed cyanos being forced/misclassified when trying to use the FreshTrain pre-TaxAss.
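The check described above (sequences below the BLAST percent identity cutoff that still receive a confident FreshTrain-only name) can be sketched roughly as follows. This is not the TaxAss code: the function names, the input dictionaries, the example sequence IDs, and the cutoff value of 98 are all hypothetical and chosen only for illustration.

```python
def split_by_identity(best_hit_pident, identity_cutoff):
    """Split sequence IDs by their best-hit percent identity against the
    ecosystem-specific database."""
    above = [s for s, p in best_hit_pident.items() if p >= identity_cutoff]
    below = [s for s, p in best_hit_pident.items() if p < identity_cutoff]
    return above, below


def flag_forced(below_ids, small_db_calls, bootstrap_cutoff=80):
    """Sequences below the identity cutoff that still get a confident name
    from the small database are candidate 'forced' classifications."""
    forced = []
    for seq_id in below_ids:
        name, bootstrap = small_db_calls[seq_id]
        if name != "unclassified" and bootstrap >= bootstrap_cutoff:
            forced.append(seq_id)
    return forced


# Hypothetical inputs: best-hit percent identity per sequence, and the
# (name, bootstrap %) each sequence gets in a small-database-only run.
pident = {"seq1": 99.6, "seq2": 91.0, "seq3": 88.5}
small_db_only = {
    "seq1": ("acI", 100),
    "seq2": ("acI", 92),           # dissimilar, yet confidently named
    "seq3": ("unclassified", 35),  # dissimilar and left unnamed
}

above, below = split_by_identity(pident, identity_cutoff=98)
print("classify with small database:", above)
print("classify with comprehensive database:", below)
print("candidate forced classifications:", flag_forced(below, small_db_only))
```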

      "I could also imagine that even with FreshTrain, there might be forcing if a taxonomic name is set for the full length sequence, but two variable region sequences are identical even though their parent sequences have different taxonomies." <br /> Actually, that scenario wouldn't cause forcing it would cause false negatives/underclassification. This scenario would effectively be like having duplicate reference sequences as in your comment #3, where Wang does 50/50 and ends unclassified.

      "More importantly, the source code indicates that the authors are using any confidence score without applying a filter. The suggested confidence score is 80%, not 0%. ... In offline conversations with the authors, they reassured me that they are applying an 80% threshold in separate scripts. It would probably be worth adding that they are using 80% as a threshold in the Methods section." <br /> We mention the bootstrap cutoff in passing in line 585, but I agree that perhaps it would be clearer if we add a bit more detail to the "How to use TaxAss" section. We left a lot of these details out because they're included in the step-by-step directions on the GitHub website ( https://htmlpreview.github.... ) but it sounds like we should add another paragraph to the methods summarizing these directions.

      5. "Related to this point, at L122 the authors state that 'In a large database an OTU dissimilar to any reference sequences will not be classified repeatably as any one taxon, resulting in a low bootstrap confidence.' This is correct, but is a bit misleading. I would suggest saying '...repeatedly as any one genus, resulting in a low bootstrap confidence and reclassification at a higher taxonomic level where there is sufficient bootstrap confidence'." <br /> I'm not sure I understand your point, but I think it's that most bacteria are classified to the phylum level at least by Greengenes so it's misleading to say they are unclassified. However, many phyla should be unclassified at the phylum level in the FreshTrain database because their references don't exist at all, so I was trying to say hypothetically if that happened in Greengenes (the phylum ref. of an OTU not included b/c it's super weird and unknown) the Wang classifier would still work and call it unclassified, but that forcing is an issue when the database is smaller. I'm not sure how to make that clearer; suggestions welcome!

      "I am concerned that the results and the discussion of forcing are based on not using a confidence threshold rather than the default 80% threshold."<br /> Pat's so worried because we applied the cutoff in an R script, not in the mothur call. But don't worry! The 80% bootstrap confidence threshold was applied prior to all the analyses, just not in the mothur command. This way we were able to change it easily without re-running the classify.seqs command, which was helpful to compare the bootstrap cutoff's impact during development. Details:

      • The R script that applies the bootstrap cutoff is find_classification_disagreements.R . The easiest way to see when it's called is using the step-by-step directions: <br /> https://htmlpreview.github.... .

      • It's first used in step 13 to define what counts as a "conflict" among coarse-level assignments (i.e. things where both gg and fw were above the bootstrap cutoff and were given different names. Since the FreshTrain doesn't differ from Greengenes above family-level, we consider these "forcing" and trust GG more.).

      • In step 13, an intermediate file of bootstrap values is also exported, which is used to apply the bootstrap cutoff in step 14 plot_classification_disagreements.R when it compares the percent reads classified. Step 14 is where Fig 4 is created.

      • The bootstrap cutoff is again applied in step 15 when the final taxonomy file is created (also within find_classification_disagreements.R)

      • Step 15.5.a is where Fig 2 is created, and its script plot_classification_improvement.R applies the bootstrap cutoff by incorporating an intermediate file of final.taxonomy.pvalues that is created in step 15.

      • Step 15.5.b is where Fig 3 is created, and it also uses the previously used scripts find_classification_disagreements.R and plot_classification_disagreements.R to apply the bootstrap cutoff.

      • The function that implements the cutoff within find_classification_disagreements.R is called do.bootstrap.cutoff(), and is defined on line 301 of the script.

      6. "To measure forcing, I would like to see the authors run the Greengenes and FreshTrain databases back through the classifier using a leave-one-out testing procedure and quantify how many times the incorrect classification is given, when using the 80% (or even the 0%) threshold. Again, I suspect the results would indicate that the problem isn't one of forcing, but of "holding back". ..."<br /> We do see "holding back" as a major problem with using the FreshTrain alone: most sequences not in the FreshTrain database ended up unclassified. However, if that were the only problem you could just do a sequential classification instead of TaxAss, i.e. classify everything with the FreshTrain, then take all the unclassifieds and classify them with Greengenes. That's why we stress forcing so much: it's why TaxAss is necessary.
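Returning to the bootstrap cutoff details listed above, the idea of storing the raw classifier output once and applying the cutoff afterwards can be sketched like this. It is written in Python rather than R and is not the authors' do.bootstrap.cutoff(); the function name, the input layout, and the example lineage with its bootstrap values are assumptions made for illustration.

```python
def apply_bootstrap_cutoff(lineage, cutoff=80):
    """lineage: list of (name, bootstrap_percent) ordered coarse to fine.
    The first level that falls under the cutoff, and every finer level
    after it, is replaced with 'unclassified'."""
    result = []
    failed = False
    for name, bootstrap in lineage:
        if failed or bootstrap < cutoff:
            failed = True
            result.append("unclassified")
        else:
            result.append(name)
    return result


# Hypothetical classifier output, kept together with its bootstrap values ...
raw = [("Firmicutes", 100), ("Bacilli", 95), ("Bacillales", 85), ("Bacillus", 20)]

# ... so several cutoffs can be compared without re-running classification.
for cutoff in (0, 80, 90):
    print(cutoff, apply_bootstrap_cutoff(raw, cutoff))
```

Keeping the cutoff outside the classification call makes it cheap to see how the choice of threshold changes the final taxonomy.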

      "It would be a really helpful contribution to show the percentage of forcing (false positives) and holding back (false negatives?) in a leave-one-out scheme and on a real dataset when classifying with (1) each of the comprehensive databases, (2) using TaxAss with the comprehensive databases and FreshTrain, (3) merging the comprehensive databases with FreshTrain and running them through the Wang classifier."<br /> Leave-one-out schemes are often used to test classification algorithms, but I'm not convinced they would actually add that much to our understanding here. That's because the quantitative amount of forcing would be different with different datasets and databases, so very careful quantification of one example wouldn't apply very well to other work. I think knowing qualitatively it's a problem at a level impacting analysis is reason enough, which we could see clearly in Fig. 3.

      7. "I am not sure what the authors mean by "maintaining richness" as they use it in the manuscript. Could the problem they are trying to address be described better? Also, I would ask whether they know what the *true* richness is and if not, why they think that one value of richness is better than another. Perhaps this corresponds to what I might call "underclassification" or "false negatives"."<br /> Yes, when we talk about maintaining richness we mean avoiding the underclassification (false negatives) of sequences not in the FreshTrain. We don't know the total "true" richness because that's impossible to know without a perfect *true* taxonomy database. But we know that when organisms with known references in Greengenes end up unclassified in a FreshTrain-only classification, that is an incorrect loss of richness. TaxAss avoids this by classifying those sequences in Greengenes and keeping their known classifications.

      It seems like Figure 3 was not explained well because the two confusing ideas here are: <br /> 1. forcing/misclassifications/false positives (red bars in Fig 3) and <br /> 2. lost richness/underclassification/false negatives (blue bars and red "unclassified" bars in Fig 3). <br /> I welcome any more ideas on how to explain this better!

      L25 - "why not include the RDP reference database in this list?" <br /> I did include RDP in the intro on line 103, but I didn't include in the abstract b/c I was trying to be more concise and I didn't realize it was still actually being used. I thought it was much smaller than both SILVA and Greengenes now?

      L49 - "Course" should be "Coarse"<br /> ahhh I'm so embarrassed!!

      Thanks again to Pat Schloss for sharing his thoughts and starting this discussion! I encourage other readers and users of TaxAss to weigh in with their thoughts, critiques, and questions too.

    2. On 2017-11-09 12:09:26, user Pat Schloss wrote:

      The preprint by Robin Rohwer and colleagues seeks to develop a workflow that complements methods for classifying 16S rRNA gene sequences with greater precision than is found in the Wang naive Bayesian classifier. This is an issue that many people have raised with me. A lack of classification for a sequence can be blamed on inadequacies of the taxa represented in the database, lack of taxonomic data (e.g. at the species level) within the database, and the selection of the region within the 16S rRNA gene to classify. This paper seems primarily concerned with the first problem by supplementing ecosystem-specific sequences and touches on the second problem by adding finer taxonomic information for the ecosystem-specific sequences. I felt like the authors were a bit conflicted over what they wanted this manuscript to be. Is it a description/announcement of a new method, TaxAss? Is it a validation study? Is it a benchmarking study? Overall, it is a description of a new method that is being used by the authors and others. However, I feel like it needs some help to improve the description as there are points in the manuscript that are not clear. Furthermore, I felt that the validation and benchmarking could use some help to quantify the need for the method and to demonstrate that the method overcomes that problem.

      General comments...

      1. As described in Figure 1, sequences that fall below an empirically determined threshold when compared to an ecosystem-specific database are classified using a comprehensive database and those that are above the threshold are classified against the ecosystem-specific database. Perhaps it's because I'm familiar with people using blastn to classify sequences, as I read the manuscript, it was not clear whether the sequences in the two arms were then classified using the Wang method or blastn. Reading through the source code, it looks like blastn is only used to split the dataset and once split, the data are classified using the Wang method. Perhaps this could be clarified in the text.

      2. It is not clear how the FreshTrain database was developed or how it is curated to add finer taxonomic names to the sequences. The authors have done this for the readers who are interested in fresh water bacteria, but what steps should someone interested in gut microbiota take to recreate the database to classify their data? More importantly, how did the authors decide on their taxonomic levels of lineage, clade, and tribe? Why not follow the phylogenetic approach used by the greengenes developers for defining family, genus, and species for environmental sequences?

      3. I am also not clear why the authors did not want to pool FreshTrain with one of the comprehensive databases. A simple cat command would pool the two files producing a file that could then be used as a single reference. The downside of this would be that they would need to add the same level of taxonomic detail that is in the FreshTrain database to the greengenes database. Also, a downside of the greengenes database is that the core reference appears to be mothballed going forward while RDP and SILVA are still actively being developed.

      4. One motivation that the authors state for the method is the issue of "forcing". I would call these "false positives", but I get their point. The authors raise this issue numerous times. Yet I was unable to find a citation that quantifies forcing and the authors do not appear to measure the amount of forcing in their data. Perhaps this is what they were getting at in Figure 3? If that is the case, then I am a bit troubled because they are accepting the FreshTrain data as the ground truth, when it has not been validated yet. I could also imagine that even with FreshTrain, there might be forcing if a taxonomic name is set for the full length sequence, but two variable region sequences are identical even though their parent sequences have different taxonomies. More importantly, the source code indicates that the authors are using any confidence score without applying a filter. The suggested confidence score is 80%, not 0%. I don't think that the problem with classifications from the Wang method is forcing; rather, it's that the classifications don't go deep enough. Something may classify as a Bacillus with 20% confidence and so researchers should work their way up the taxonomy until the classification is above 80%, which might be Firmicutes. In offline conversations with the authors, they reassured me that they are applying an 80% threshold in separate scripts. It would probably be worth adding that they are using 80% as a threshold in the Methods section.

      5. Related to this point, at L122 the authors state that "In a large database an OTU dissimilar to any reference sequences will not be classified repeatably as any one taxon, resulting in a low bootstrap confidence." This is correct, but is a bit misleading. I would suggest saying "...repeatedly as any one genus, resulting in a low bootstrap confidence and reclassification at a higher taxonomic level where there is sufficient bootstrap confidence". I am concerned that the results and the discussion of forcing are based on not using a confidence threshold rather than the default 80% threshold.

      6. To measure forcing, I would like to see the authors run the greengenes and FreshTrain databases back through the classifier using a leave-one-out testing procedure and quantify how many times the incorrect classification is given, when using the 80% (or even the 0%) threshold. Again, I suspect the results would indicate that the problem isn't one of forcing, but of "holding back". To be clear, this isn't necessarily a problem with the Wang method, but the databases. Addressing this point is where I think the authors could really do the field a service. It would be a really helpful contribution to show the percentage of forcing (false positives) and holding back (false negatives?) in a leave-one-out scheme and on a real dataset when classifying with (1) each of the comprehensive databases, (2) using TaxAss with the comprehensive databases and FreshTrain, (3) merging the comprehensive databases with FreshTrain and running them through the Wang classifier.

      7. I am not sure what the authors mean by "maintaining richness" as they use it in the manuscript. Could the problem they are trying to address be described better? Also, I would ask whether they know what the *true* richness is and if not, why they think that one value of richness is better than another. Perhaps this corresponds to what I might call "underclassification" or "false negatives".
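The leave-one-out test suggested in comment 6 could be sketched as below. The classifier here is a deliberately simple stand-in (nearest reference by k-mer Jaccard similarity, with a similarity threshold playing the role of a confidence cutoff), not the Wang method, and the five reference sequences and taxon names are made up for illustration.

```python
def kmers(seq, k=3):
    """Set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify(query, references, min_similarity):
    """Toy stand-in for a real classifier: name of the reference with the
    highest k-mer Jaccard similarity, or 'unclassified' below the threshold."""
    best_name, best_sim = "unclassified", 0.0
    q = kmers(query)
    for name, seq in references:
        r = kmers(seq)
        union = q | r
        sim = len(q & r) / len(union) if union else 0.0
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name if best_sim >= min_similarity else "unclassified"


def leave_one_out(references, min_similarity):
    """Classify each reference against all the others and tally outcomes."""
    tally = {"correct": 0, "forced": 0, "held_back": 0}
    for i, (true_name, seq) in enumerate(references):
        rest = references[:i] + references[i + 1:]
        call = classify(seq, rest, min_similarity)
        if call == true_name:
            tally["correct"] += 1
        elif call == "unclassified":
            tally["held_back"] += 1   # false negative
        else:
            tally["forced"] += 1      # false positive
    return tally


# Made-up references; taxonC has a single representative, so leaving it out
# mimics an organism that is missing from the database.
references = [
    ("taxonA", "ACGTACGTAC"),
    ("taxonA", "ACGTACGTAA"),
    ("taxonB", "TTGGCCTTGG"),
    ("taxonB", "TTGGCCTTGC"),
    ("taxonC", "ACGTACGGTT"),
]

for threshold in (0.5, 0.7):
    print(threshold, leave_one_out(references, threshold))
```

In this toy setup the lone taxonC reference is either forced into taxonA (lenient threshold) or held back as unclassified (stricter threshold), which is the trade-off the comment asks to have quantified on real databases.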

      L25 - why not include the RDP reference database in this list?

      L49 - "Course" should be "Coarse"

    1. On 2017-11-15 06:51:58, user Christoph Nowak wrote:

      Hi - great approach! I may have missed it - did you provide some (sample) code alongside the paper on github or similar? Thanks!<br /> Chris

    1. On 2017-11-13 08:54:50, user James Lloyd wrote:

      Thank you for posting this interesting article about short ORFs in moss onto a pre-print server. I think that it is great that you have been able to identify sORFs throughout the transcriptome, classify them as uORFs, dORF etc and examine conservation and even find evidence for protein products for some, and function for one.

      Comments and questions:

      In Fig 1B, it would be nice to see another column describing how many sORFs were found in each classification from the computational pipeline, without evidence of expression or translation. Also, it would be best to right-align the numbers in the columns so that it is more obvious the magnitude of the differences at a glance.

      I like how Figures are integrated with the main text in this pre-print, it is rarely done, but I think that it would be better to change the legend text so it does not get confused with the main text so easily.

      dORFs are not explicitly described in the main text of the manuscript. There are times when genic-sORF is used, and others when CDS-sORFs is used - I think that they refer to the same group, if not, make that clearer. If they are the same group, they should be consistent. It is not quite clear what “intergenic-sORFs” are when intergenic-sORFs are also a group.

      CDS-sORFs are an interesting group that I had not considered. I take it that they use internal start codons in a different frame from the main ORF. Is there much evidence from other systems that they are real (eg ribo-seq data focusing on start codons to suggest those types of start codons are used)? Are the peptides from these vastly different from the protein of the main ORF? This is unclear in the manuscript and I think that a clearer explanation would help the reader appreciate the importance and validity of this.

      Line 143 has "(REF)"; it was not clear initially that this was an abbreviation rather than shorthand for a citation. Perhaps it is just me, but changing this in some way might aid the reader, especially as it is most frequently used much later in the manuscript, after the initial definition. I'm not actually sure "randomly selected exon fragments" needs to be shortened; not doing so would aid readability.

      Lines 175-176 "significantly fewer uORFs and dORFs in the two closest species": do you think that this could relate to the poorer genome/transcriptome assembly of these organisms relative to the others in the study? Perhaps ends of transcripts are less reliably reconstructed?

      In Fig 5, it might be worth adding "WT", "OX" and "KO" to the respective panels to save the reader having to look between the legend and the images.

      Line 681 and other places: "custom-made python scripts (available upon request)", it would be ideal if these scripts could be placed on GitHub or deposited on something like Zenodo (or maybe FigShare)? With Zenodo you can get a DOI to reference. Even if the code has not been tidied up for maximum re-use, I still think that this would be the best thing to do.

      I could not see any list of global sORF classification (uORF, dORF etc), position (genomic coordinates and linked gene when possible) and whether it is conserved in other species or not. I think that such a list would be extremely useful. I know that I would use it to aid some of my future analysis in moss data.

    1. On 2017-10-25 16:29:10, user Oscar Puig wrote:

      Naive question: The code starts with an aligned BAM file, but standard BAM files are aligned to a reference without long repeats, so repeat reads are not aligned in the region. How do you overcome this limitation? Do you have to use a reference with long repeats in it to produce the BAM files?

    1. On 2017-10-11 15:20:36, user Jeremy Leipzig wrote:

      Good paper, but I wish it touched a bit more on reproducibility in the broad sense of being able to link publication with code and data in a discoverable and auditable manner.

    1. On 2017-10-06 03:45:22, user Steven Hart wrote:

      Really good walk through, including the history of CNNs. I'll make sure to require all my students to read this before working on one of these projects!

      I was however left with a couple of questions/ suggestions that I thought might add some value if you're interested in feedback.

      1. Methods state that whole slide images were only 2092x975 pixels. That seems way too small. Generally they are GB in size. Similarly, the algorithm was run on 20x20 pixels for basic-CNN and defaults for inception. Did you use 'tf.resize' for this?

      2. What were the total number of image patches evaluated for each study? That seems like a reasonable question since you are actually assessing patches instead of slides, right?

      3. When you say 'detect various cancer types by about 100% accuracy' , I think this is a little misleading. You are actually differentiating between lung, breast, and bladder tissue rather than cancer. That is, unless you are differentiating normal from cancer - but I don't see those results here.

      4. You might want to expand a bit more on what it is you are classifying with the IHC. Are you predicting what marker it is or the scoring of that marker? Some of the text makes it read as if the former, while Figure 2 suggests it's the latter.
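On questions 1 and 2, a back-of-the-envelope count of how many 20x20 patches fit in a 2092x975 image can be worked out as below. This is only a sketch: it assumes full tiles with no padding, and whether the paper used non-overlapping or overlapping patches is not stated here, so the stride is a hypothetical parameter.

```python
def count_patches(width, height, patch, stride=None):
    """Number of full patch x patch tiles in an image (no padding).
    stride defaults to the patch size, i.e. non-overlapping tiles."""
    stride = stride or patch
    if width < patch or height < patch:
        return 0
    across = (width - patch) // stride + 1
    down = (height - patch) // stride + 1
    return across * down


# Non-overlapping 20x20 tiles in a 2092x975 image.
print(count_patches(2092, 975, 20))
# The same image with a hypothetical stride of 10 pixels (50% overlap).
print(count_patches(2092, 975, 20, stride=10))
```

Under these assumptions a single image yields a few thousand patches without overlap, which is why the total patch count per study is a more informative sample size than the number of slides.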

      Also, work from the Google team showed improvement of patch classification by rotating the image multiple times and then averaging the logits to get a more robust measurement. I'd be happy to contribute the TF-Slim code I have for this (it fits in your train_image_classifier.py script).
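The rotate-and-average idea mentioned above can be sketched in plain Python. This is not the TF-Slim code offered in the comment: `toy_model` is a hypothetical stand-in for a trained network, and the patch is a tiny list of lists rather than an image tensor.

```python
def rotate90(patch):
    """Rotate a 2D grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*patch[::-1])]


def averaged_logits(patch, model):
    """Average a model's logits over the four 90-degree rotations of a patch."""
    rotated = patch
    totals = None
    for _ in range(4):
        logits = model(rotated)
        if totals is None:
            totals = list(logits)
        else:
            totals = [t + l for t, l in zip(totals, logits)]
        rotated = rotate90(rotated)
    return [t / 4 for t in totals]


def toy_model(patch):
    """Hypothetical stand-in for a CNN: two orientation-dependent 'logits'
    (sum of the top row and sum of the left column)."""
    return [float(sum(patch[0])), float(sum(row[0] for row in patch))]


patch = [[1, 2],
         [3, 4]]

print(toy_model(patch), toy_model(rotate90(patch)))   # orientation-dependent
print(averaged_logits(patch, toy_model))
print(averaged_logits(rotate90(patch), toy_model))    # same after averaging
```

Averaging over all four rotations makes the prediction independent of how the patch happened to be oriented, which is the robustness gain the comment refers to.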

    1. On 2017-08-08 09:09:42, user Martin Modrák wrote:

      So I've been testing the idea of using BUDS as a preprocessing step for further analysis of scRNA-seq data. Using the R package was straightforward, good work! While trying to understand the results, I noticed that there should AFAIK be a strong symmetry in the posterior distribution of tau - whenever tau[i] = s[i] for all i, then setting tau[i] = 1 - s[i] and swapping the shape parameters of the beta distribution should give exactly the same posterior probability. In other words, inverting the trajectory in time should give the same results. This would in turn mean that the posterior is bimodal / non-identifiable and should break the solver. However, this is not what happens: multiple runs of variational Bayes (ADVI) converge on almost exactly the same parameters, and full NUTS (only one run tried as it was pretty long) gives very similar results to ADVI and no divergences.

      Did you somehow explicitly break this symmetry in your model? From my reading of the code I could not find a place where this happens.
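The beta-distribution part of the symmetry described above can be checked numerically. This sketch only covers the prior term: the shape parameters and tau values are illustrative, and whether the likelihood in BUDS is also invariant under the flip depends on model details that are not reproduced here.

```python
import math


def beta_logpdf(x, a, b):
    """Log density of Beta(a, b) at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x)


a, b = 2.0, 5.0                 # illustrative shape parameters
taus = [0.1, 0.35, 0.6, 0.9]    # illustrative latent positions

# Log prior of the original ordering under Beta(a, b) ...
forward = sum(beta_logpdf(t, a, b) for t in taus)
# ... and of the time-reversed ordering with the shape parameters swapped.
mirrored = sum(beta_logpdf(1 - t, b, a) for t in taus)

print(forward, mirrored)
```

The two values agree up to floating-point error, since the Beta(a, b) density at x equals the Beta(b, a) density at 1 - x.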

      I have also not managed to get meaningful results with BUDS on the dataset I am working with, but it is well possible (maybe even likely) that the data actually does not have a clear time continuity as expected by BUDS. I could send you the dataset if you happen to be interested (it is the ILC3 cells identified in https://www.nature.com/ni/j... - GSE70580).

      I also want to reiterate that it's great you've shared your work as a preprint and code on GitHub, otherwise I wouldn't have been able to run any of those tests. <br /> Are you comfortable with discussing the paper here or would you prefer another channel? (e-mail / GitHub issue / ....)

    1. On 2017-07-14 13:35:54, user Rodolfo Jaffé wrote:

      An alternative approach - Maximum-likelihood Population Effects (MLPE)

      Maximum-likelihood population effects (MLPE) treats the residual for each pairwise distance as the sum of two random population-level effects and an observation-level error (Clarke et al. 2002). Because the MLPE correlation structure allows modelling the non-independence of pairwise distances within a likelihood framework, compatible with model selection, it is particularly appealing for landscape genetic studies (Jaffé et al. 2016). Code implementing the MLPE correlation structure within the R package NLME is provided at:<br /> https://github.com/nspope/c...
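As a compact restatement of the description above, the MLPE model for a pairwise distance between populations i and j can be written as follows. The notation (a single predictor x, fixed effects beta, population effects tau, error epsilon) is chosen here for illustration and is not copied from the cited papers or the linked code.

```latex
% Pairwise response: fixed effects plus two population effects and an error term
d_{ij} = \beta_0 + \beta_1 x_{ij} + \tau_i + \tau_j + \varepsilon_{ij},
\qquad \tau_i \sim \mathcal{N}(0, \sigma_\tau^2),
\quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma_\varepsilon^2)

% Implied covariance between two pairwise observations
\operatorname{Cov}(d_{ij}, d_{kl}) =
\begin{cases}
2\sigma_\tau^2 + \sigma_\varepsilon^2 & \text{if the two pairs are the same} \\
\sigma_\tau^2 & \text{if the two pairs share exactly one population} \\
0 & \text{if the two pairs share no population}
\end{cases}

% Correlation between distances that share one population
\rho = \frac{\sigma_\tau^2}{2\sigma_\tau^2 + \sigma_\varepsilon^2},
\qquad 0 \le \rho \le \tfrac{1}{2}
```

The shared population effects are what make distances involving a common population non-independent, and this is the correlation structure that the likelihood accounts for.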

      Clarke RT, Rothery P, Raybould AF (2002) Confidence limits for regression relationships between distance matrices: estimating gene flow with distance. Journal of agricultural, biological, and environmental statistics, 7, 361–372.<br /> Jaffé R, Pope N, Acosta AL et al. (2016) Beekeeping practices and geographic distance, not land use, drive gene flow across tropical bees. Molecular Ecology, 25, 5345–5358.