42 Matching Annotations
  1. Jun 2024
    1. As phenotypic changes during keratinocyte differentiation span across both space and time, this application perfectly showcases the power of ESPRESSO spatiotemporal omics in identifying not only the presence of distinct phenotypes, but also providing insights about their spatiotemporal evolution.

      Again, I think a baseline here would make this claim more convincing. In other words, what aspects of the differentiation dynamics described here could only be captured by ESPRESSO?

    2. As shown in Figures 1c and 1d, GMM clustering easily identified the cell type-specific phenotypes and allowed the quantification of properties of interest in their organelle network

      It would be helpful to compare this result to some baseline obtained from an established method like cell painting. In other words, can existing techniques also readily distinguish these cell types?

    3. to increase the acquisition speed 16-fold

      it would be helpful to also provide some absolute measures of throughput here, such as how many FOVs of a given size and resolution can be imaged per unit time.

    4. organelle properties are normalized, selected and reduced in dimensionality by PacMAP [35], generating low-dimensional embeddings that encode the high-dimensional organelle properties of each cell. A Gaussian Mixture Model (GMM [36]) clustering algorithm is then applied

      It sounds like the clustering was done after the embedding step; that is, using the low-dimensional embeddings from PacMAP, rather than the original feature matrix. If so, I'm worried that this will result in inaccurate clusters, as PacMAP (like all such methods) does not perfectly preserve the relationships between the original high-dimensional feature vectors.
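
      As a concrete way to address this concern, one could cluster the full standardized feature matrix and the PaCMAP embedding separately and check how well the two partitions agree. A minimal sketch of what I mean (assuming the `pacmap` and `scikit-learn` packages; the feature-matrix file name and number of clusters are hypothetical):

      ```python
      import numpy as np
      import pacmap
      from sklearn.preprocessing import StandardScaler
      from sklearn.mixture import GaussianMixture
      from sklearn.metrics import adjusted_rand_score

      X = np.load("organelle_features.npy")   # hypothetical (cells x organelle-properties) matrix
      Xz = StandardScaler().fit_transform(X)

      # low-dimensional embedding, as described in the quoted passage
      emb = pacmap.PaCMAP(n_components=2).fit_transform(Xz)

      n_clusters = 4                           # e.g. the number of expected phenotypes
      labels_full = GaussianMixture(n_components=n_clusters, random_state=0).fit_predict(Xz)
      labels_emb = GaussianMixture(n_components=n_clusters, random_state=0).fit_predict(emb)

      # an ARI close to 1 would suggest the embedding step is not distorting the clusters
      print("ARI (full-dim vs embedding):", adjusted_rand_score(labels_full, labels_emb))
      ```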

  2. May 2024
    1. This suggests that genes not annotated by eggNOG-mapper are probably proteins that either catalyze some protein, RNA, or DNA chemical modification, or bind to other molecules, form macromolecular complexes, and are involved in the regulation of essential processes for animals

      This is a bit confusing; it's so vague and general that it sounds like it could describe almost any protein.

    2. We therefore considered this evidence as supportive for not filtering

      I'm not sure that two examples can constitute evidence for or against filtering. Is it possible to use a ground-truth dataset to make this kind of filtering/no-filtering decision with more confidence?

    3. We show that protein language model-based annotations outperformed deep learning-based ones

      This is a bit confusing, because protein language models are a kind of deep learning model. It would help to clarify what "deep-learning-based models" refers to in this context.

    4. with a reliability index of 1

      What does a reliability index value of "1" mean?

  3. Mar 2024
    1. The search starts with a cytochrome from corn (Zea Mays), and within the first 50 hits, we find similar structures originating from various animals (fish, eagle, mouse, cat, horse, etc.)

      The phrase "within the first 50 hits" feels tantalizing. What else appeared among the top hits? Were there hits that were surprising or potentially false positives? And were there proteins that should have appeared among the top hits, but didn't?

    2. Here, AlphaFind shows us (Figure 2) that highly similar hemoglobin structures can also be found in other species.

      Again, it would be really great to quantify what "highly similar" means here.

    3. in an average of 7 seconds with negligible back-end load

      It would be helpful to mention details about the hardware here, as the time cost is hard to interpret without that information.

    4. Therefore, high occurrence of unstructured regions in the input structure can bias the search. This phenomenon is more prevalent in coiled-coil structures but can be also observed in some small structures

      Again, it would be great to quantify this and/or to discuss some examples of proteins for which this is a real problem.

    5. We tested AlphaFind on a diverse set of proteins varying in size, complexity, and quality. AlphaFind provided biologically relevant results even for small, large and lower quality structures. When AlphaFind did not offer structures with high TM-scores, the results remained biologically relevant.

      I think these claims would be more convincing if they could be quantified and if the performance of AlphaFind could be compared to other existing tools, if possible.

    6. The latter two methods in conjunction with (10) establish the basis of the indexing solution presented in here

      What is the relationship between this approach and approaches to indexing or similarity-based lookup used by common vector databases?

    7. In the offline phase, we first extract semantic information from raw cif files into vector embeddings,

      It would be helpful to explain in more detail how this is done, since it seems like a crucial step.

  4. Feb 2024
    1. Every chromosome group is combined into a single sequence, with chromosome order randomly determined.

      It's surprising to me that chromosomes are randomly ordered; this feels a bit like the equivalent of randomly shuffling the clauses of a sentence. It would be helpful to explain this choice or discuss reasons why it might or might not be a concern.

    2. Start tokens are unique to each chromosome and species

      This feels confusing: if start tokens are unique to species, how is UCE able to generate embeddings for datasets from species it was not trained on?

    3. However, beyond that, the effect levels off (Supplementary Fig. 6). This is expected due to the curse of dimensionality in high-dimensional spaces and the variability in the level of ontological refinement in different branches of the ontology

      This feels awfully hand-wavy. I can understand that a leveling off is expected at some distance, but why at 5 hops in particular?

    4. For all three species we observed very high agreement between independent annotations of the novel species’ data and the nearest cell type centroids in the IMA

      It would be helpful to mention here what these three species were and how distantly related they are to the eight species on which UCE was trained.

    5. We train a simple logistic classifier on the UCE embeddings of the Immune Cell Atlas [38], and then apply the classifier to B cell embeddings from Tabula Sapiens v2. This classifier accurately classifies the Tabula Sapiens v2 cells as memory and naive B cells

      This result feels hard to interpret without a comparison to other approaches or models. In other words, are embeddings from UCE uniquely able to capture the information required for this classification task?
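
      For instance, fitting the same classifier on a naive baseline representation (e.g. a PCA of log-normalized expression) and comparing held-out accuracy would make the claim easier to interpret. A rough sketch of the comparison I have in mind, with hypothetical file names and shapes:

      ```python
      import numpy as np
      from sklearn.decomposition import PCA
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_score

      y = np.load("b_cell_subtype_labels.npy")              # memory vs naive labels (hypothetical)
      uce = np.load("uce_b_cell_embeddings.npy")            # UCE embeddings (hypothetical)
      lognorm = np.load("log_normalized_expression.npy")    # baseline representation (hypothetical)

      representations = {
          "UCE": uce,
          "PCA of log-normalized counts": PCA(n_components=50).fit_transform(lognorm),
      }
      for name, X in representations.items():
          acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
          print(f"{name}: mean CV accuracy = {acc:.3f}")
      ```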

    6. UCE embeddings distinctly separate cell types more effectively than other methods tested in zero-shot

      This feels a bit subjective; I think this claim would be more convincing if it were grounded in a quantitative measure of clustering accuracy.
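
      For example, clustering each method's zero-shot embedding and reporting agreement with the annotated cell types would make the comparison concrete; a short sketch, with hypothetical inputs:

      ```python
      import numpy as np
      from sklearn.cluster import KMeans
      from sklearn.metrics import adjusted_rand_score, silhouette_score

      emb = np.load("zero_shot_embeddings.npy")       # (cells x dims), hypothetical
      labels = np.load("cell_type_annotations.npy")   # hypothetical annotations

      pred = KMeans(n_clusters=len(np.unique(labels)), random_state=0).fit_predict(emb)
      print("ARI vs annotations:", adjusted_rand_score(labels, pred))
      print("silhouette by annotation:", silhouette_score(emb, labels))
      ```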

    7. We compared several methods and found that UCE substantially outperforms the next best method Geneformer by 9.0% on overall score, 10.6% on biological conservation score, and 7.4% on batch correction score

      If possible, it would be helpful to contextualize these relative increases in performance, particularly given that none of the models listed in Supp Table 1 appear to significantly outperform using the log-normalized raw data (the "overall score" is 0.74 for UCE and 0.72 for "log-normalized expression"). Without more context, it's hard to know what this means, whether it should be surprising, whether it reflects limitations of the metrics or of the models, etc.

      Also, I think it would be more transparent to mention here that there are two metrics for which UCE does not outperform other models (the ARI score and the "ASW (batch) score").

    8. Genes belonging to the same chromosome are grouped together by placing them in between special tokens and are then sorted by genomic location

      It would be helpful to understand the context and motivation for this design decision. In other words, what aspects of UCE's performance depend (or are suspected to depend) on including information about genomic position?

    9. This allows UCE to meaningfully represent any gene, from any species, regardless of whether the species had appeared in the training data

      It would be good to clarify here if "training data" refers to the data used to train the protein language model or UCE itself.

    1. When calculating doubling times based on mitotic events in the remaining cells that were not undergoing apoptosis (Figure 6D), the doubling times are similar to those for unexposed cells

      Again, it's great to see something like this quantified so carefully!

    2. Higher intensities of excitation light exposure led to significant cell death that was apparent by manual inspection of images, and by the reduced relative cell numbers as shown by the green lines

      It seems surprising that there is such a big difference from 1x to 1.4x. Is this by design? (was the 1x intensity chosen from prior experience or experiments to be as high as possible without inhibiting cell division?)

    3. As shown in Figure 6A, exposure of cells to the minimal intensity of fluorescence excitation light (56 mJ/cm2 referred to as 1x)

      It's super helpful that an absolute measure of intensity is provided here, but it would be great to also include the wavelength (or range of wavelengths) of the excitation light.

    4. Individual cells in the center of the colony tend to move less than cells near the edge

      Is it possible to correct this for the fact that, as the colony itself expands, cells near the edge necessarily must move more than cells in the center (which will not move at all, if the colony as a whole is stationary)?
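
      One possible correction, sketched below under the (strong) assumption that the colony expands roughly uniformly in the radial direction; all names are hypothetical and this is only meant to illustrate the idea:

      ```python
      import numpy as np

      def corrected_displacement(pos_t0, pos_t1, center, radius_t0, radius_t1):
          """Subtract the outward drift expected from uniform colony expansion.

          pos_t0, pos_t1: (n_cells x 2) positions at consecutive frames
          center: colony centroid; radius_t0, radius_t1: colony radii at the two frames
          """
          r = np.linalg.norm(pos_t0 - center, axis=1, keepdims=True)
          radial_dir = (pos_t0 - center) / np.clip(r, 1e-9, None)
          # a cell at fractional radius r/R is carried outward by (dR) * r/R under uniform expansion
          expected_drift = radial_dir * (radius_t1 - radius_t0) * (r / radius_t0)
          return (pos_t1 - pos_t0) - expected_drift
      ```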

    5. Average mitotic rates do not appear to depend on distance from the colony edge (Figure 5D) and do not correlate with the increased cell motion

      It's great to see a subtle detail like this quantified so carefully! Is this consistent with prior work (if there is any)?

    6. The manual data was paired to the 3D U-Net inferenced results using a linear sum assignment routine with the cost function being proportional to the distance between mitosis events in space with an empirically determined spatial cutoff of 15 pixels and a time cutoff of 6 frames.

      This is a bit hard to understand. How is distance in time measured? (i.e. the difference between the time of mitosis onset in the manual annotations and the segmentation results)
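
      To make my question concrete, here is how I imagined the pairing might work; this is purely a guess at the procedure, and the event arrays, cost definition, and the way the 6-frame cutoff enters are my assumptions, not the authors' code:

      ```python
      import numpy as np
      from scipy.optimize import linear_sum_assignment
      from scipy.spatial.distance import cdist

      # hypothetical arrays: one row per mitosis event, columns (x, y, z, frame)
      manual = np.load("manual_events.npy")
      predicted = np.load("unet_events.npy")

      spatial_cost = cdist(manual[:, :3], predicted[:, :3])       # pixels
      time_diff = np.abs(manual[:, 3:4] - predicted[:, 3:4].T)    # frames

      # forbid pairings outside the cutoffs (15 px, 6 frames) with a large cost
      cost = spatial_cost.copy()
      cost[(spatial_cost > 15) | (time_diff > 6)] = 1e6

      rows, cols = linear_sum_assignment(cost)
      matched = [(i, j) for i, j in zip(rows, cols) if cost[i, j] < 1e6]
      print(f"{len(matched)} matched events out of {len(manual)} manual annotations")
      ```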

  5. Jan 2024
    1. are 1 to 2 to 20 to 20.

      how were these weights chosen? And is it correct to think of these weights as a kind of correction for the class imbalance between non-mitotic and mitotic nuclei?
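
      For reference, this is how I read the weighting: the rare mitotic classes are up-weighted in the loss so they contribute more to the gradient. A PyTorch-style sketch; the class ordering is my assumption, only the 1:2:20:20 values come from the quoted text:

      ```python
      import torch
      import torch.nn as nn

      # assumed class order: 0 = background, 1 = non-mitotic nucleus, 2 = mitotic nucleus, 3 = daughter cells
      class_weights = torch.tensor([1.0, 2.0, 20.0, 20.0])
      criterion = nn.CrossEntropyLoss(weight=class_weights)

      logits = torch.randn(8, 4, 64, 64)            # (batch, classes, H, W), dummy values
      targets = torch.randint(0, 4, (8, 64, 64))    # dummy per-pixel labels
      loss = criterion(logits, targets)
      ```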

    2. in each of the 5 frames before division, and as class 3 (one or two daughter cells) in each of the 3 frames after division.

      how were these numbers of frames chosen?

    3. The binary masks are created by inferencing with 3 instances of the same model and thresholded by 2 (as explained in more detail in Supplemental Figure 1B and 1C.)

      This is a little confusing, especially the "thresholded by 2" part, and I didn't find the caption in Supp Fig 1 to be that much clearer. It would help to explain the origin of the variability in the predictions (in other words, what is an "instance of the same model"?)
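
      Here is my best guess at what "thresholded by 2" means, written out as a sketch: three independently trained copies of the same architecture each predict a binary mask, and a pixel is kept only if at least two of the three agree. The model objects and probability threshold below are placeholders:

      ```python
      import numpy as np

      def ensemble_mask(models, image, prob_threshold=0.5):
          # each model instance returns a per-pixel foreground probability
          votes = [(m.predict(image) > prob_threshold).astype(np.uint8) for m in models]
          # majority vote: foreground if >= 2 of the 3 instances agree
          return (np.sum(votes, axis=0) >= 2).astype(np.uint8)
      ```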

    4. We trained a 2D U-Net to segment single-cell nuclei from phase contrast images starting with a pre-trained U-Net (14) as our initial network

      It would be great to mention what kind of images the pre-trained model was trained on. Do you have a sense for how important it is that pre-training be done on similar images? (and what kinds of similarity are most important: cell type, imaging modality, magnification, etc)

  6. Dec 2023
    1. We report the number of gene KOs with an AU-ROC > 0.55

      Why 0.55 and not 0.5?

    2. We trained a ViT-small model with patch size = 8, number of global crops = 2, number of local crops = 8 on 4 nodes x 8 NVIDIA-V100 GPUs per node (32 GPUs) for 100 epochs

      would it be possible (and meaningful) to mention how many GPU hours this required? Also, some more details would be helpful for non-ML experts; e.g., why the choice of 100 epochs, was a stopping criterion used, which epoch was used for the final analysis/results, etc.

    3. we re-parameterized the first layer of the model as:

      This equation is a bit opaque; it would be helpful to explain what the superscripts and subscripts of theta mean.

    4. (both ~1-1.5 million cell tile images)

      Does the 1-1.5m figure mean single-cell images? or FOVs? It would also be super helpful to comment on how this dataset size was chosen. Was it the minimum amount of data required for this level of performance? More generally, did you do any experiments varying the quantity or diversity of the training data?

    5. The superior performance of CP-DINO 1640 is unlikely a result of trivial memorization, as the 1640-genes druggable genome library and 300-genes MoA library share similar numbers of overlapping genes with the 124 PoC library (30 and 26 genes respectively).

      I think to make this claim more convincing, it would be important to show how many genes in the 1640 library are very similar to (rather than merely identical to) genes in the 124 PoC library ("very similar" is obviously subjective but I'm thinking of homologs/paralogs or genes that are components of the same complex or pathway)

    6. Anti-phospho-S6 (pS6) antibody with AlexaFluor 750-conjugated secondary antibody was used in the 6th channel as an established biomarker

      it would be helpful to mention here what cellular structures or features the pS6 antibody labels, and also (for the non-biologists among us) what mTORC1 is

    7. Nevertheless, CP-DINO 300 trained on bioimaging data yielded a more informative embedding that has higher median prediction accuracy than the other two models (Fig. S4a-b), and correctly classified more perturbations with better accuracy (Fig. 4c). CP-DINO 300 also recovered more known biological relationships from StringDB as measured by cosine similarity of the aggregate gene KO embeddings (Methods) than the other two models (Fig. 4d)

      It's awesome to see such an explicit and direct comparison of classic feature engineering with modern unsupervised ML models!

      If possible it would be great to quantify how much better the DINO-based approach is; Figures 4a-d are a bit hard to understand at first and obscure the relative differences; Fig 4d in particular doesn't give the impression that DINO is that much better than the CellStats approach (even though the 0.12 of DINO vs the 0.09 of CellStats is actually a 30% improvement!). Also, some measure of statistical significance would be helpful; in particular, how likely is it that the 0.09 vs 0.12 in Fig 4d is reproducible?
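
      For example, a bootstrap over gene KOs would give a sense of how stable the 0.09 vs 0.12 gap is. A rough sketch, assuming a per-gene "recovered or not" indicator, which may not be exactly what Fig 4d reports:

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      hits_dino = np.load("dino_hits.npy")            # hypothetical: 1 if a gene's known partners were recovered
      hits_cellstats = np.load("cellstats_hits.npy")  # hypothetical, same genes in the same order

      deltas = []
      for _ in range(10000):
          idx = rng.integers(0, len(hits_dino), len(hits_dino))
          deltas.append(hits_dino[idx].mean() - hits_cellstats[idx].mean())
      deltas = np.array(deltas)
      print("95% CI for the difference:", np.percentile(deltas, [2.5, 97.5]))
      ```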

    8. phenotypic clustering of genes by their annotated mechanism of action,

      It feels like there's a typo here somewhere, since genes don't really have a "mechanism of action" and the screen here does not involve compounds but rather gene KOs. Is the idea to use the phenotype of the KOs to cluster genes by the MoA of the compounds that target them? In any case, the reference to MoAs here is doubly confusing because the clustering shown in Fig 4E appears to capture cellular localization (and also pathway membership?), but I couldn't see any discussion of the clustering relative to the MoAs of the compounds used to select the 300 genes