On 2021-02-20 19:31:38, user Ekaterina Shelest wrote:
Further major concerns.
FunOrder is positioned as a tool for “automated identification of essential genes in a BGC” (for people who deal with BGCs, this means all cluster genes, because clusters are usually compact and spare genes are rare). But the input is already a set of BGC genes, so, first of all, the clusters are not really identified. We can only speak of a refined annotation, and since the emphasis is on biosynthetic genes rather than all BGC genes, it is only partly refined. This makes all the statements in the introduction about the importance of better cluster annotation irrelevant. Secondly, where do the input BGC genes come from? In the case of a new genome, will this be a set of genes in some vicinity of the PKSs and NRPSs (if yes, how large a vicinity)? Or the result of a preliminary BGC annotation with antiSMASH and/or CASSIS? This should be specified. For known genomes and BGCs, again, what is the source of the BGC information: MIBiG, antiSMASH, other databases, literature? Where were the examples used in this study taken from? Table 2 provides MIBiG IDs, but not for all clusters; where do the others come from?
MATERIAL AND METHODS
FunOrder - Workflow
- Practically the only part of the tool that deals with evolutionary questions is treeKO. This is fine. But if the authors of treeKO show that the “speciation history” is less significant for detecting co-evolution, it is not clear to me why you consider it at all. What is the point of a combined measure that includes something less reliable and less informative (the “speciation history”, in this case)? The examples are not convincing; if you want to use a measure, you should show that it is useful.
- I did not understand the point of making a curated proteome database. In what sense is it curated? Did you filter something out? If yes, what, and on which principles? Or is it just a collection of 134 proteomes from JGI and NCBI? Could you please explain the principle on which they were selected? One can blast against all ascomycetes in JGI and get many more hits for the query genes. Why limit yourselves to just 134, many of which are of the same genera? If the reason is just to rename the sequences by assigning a species identifier, this can be done for any genome/proteome with a simple script; there is no need to keep the proteomes in a special database.
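To illustrate what I mean by a simple script, here is a minimal sketch that prefixes every FASTA header with a species identifier. The function name, the `Tree1` tag and the `|` separator are my own assumptions for illustration; I do not know which naming scheme FunOrder actually uses.

```python
def tag_fasta_headers(lines, species_id):
    """Prefix every FASTA header line with a species identifier.

    `lines` is any iterable of FASTA lines (e.g. an open file).
    `species_id` is a hypothetical short tag; the ">tag|header"
    layout is an assumption, not FunOrder's documented format.
    """
    tagged = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Keep the original header text after the species tag.
            tagged.append(f">{species_id}|{line[1:]}")
        else:
            # Sequence lines are passed through unchanged.
            tagged.append(line)
    return tagged


# Made-up two-protein example.
example = [
    ">protein_001 hypothetical protein",
    "MKTAYIAKQR",
    ">protein_002",
    "MSLLTEVETY",
]
renamed = tag_fasta_headers(example, "Tree1")
print("\n".join(renamed))
```

The same function can be applied to any downloaded proteome on the fly, which is why a dedicated database seems unnecessary for this purpose alone.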
Performance evaluation.
Hmm… I was puzzled by the effort of manually comparing 102 control BGCs, each with at least 3 genes. Did I understand correctly that it was literally manual? Why did you do that? (Was it a practical assignment for a class of students?) I had the feeling that this manual assessment was then used as a gold standard to set a threshold for the tool. But why? Why not simply select treeKO parameters that would allow the true positive BGC genes to be re-identified? Eventually, this is what was done by setting the treeKO parameters; I don’t understand the sense of the manual evaluation step.
Measures of the performance.
Here we come to an interesting part.
The worries start with this: “we calculated three measures (two measures for the positive control BGCs and one for the negative control BGCs)”. In general, positive and negative controls are treated identically. Otherwise, they are not controls. Or did you mean something different?
Speaking about the proposed measures themselves, they are confusing. To start with, TP, TN, FP, FN already have clear standard definitions and there is no need to redefine them. What you measure in your experiment and put in a confusion matrix ARE already TP, FP, and so on. A phrase like “obtained values for FCGM and ERM were classified as true positives (TP) or false negatives (FN), and the values for NCV were classified as true negative (TN) or false positives (FP).” is bewildering. You cannot classify ERM or FCGM, or anything based on them, into TP, FN, etc., because you use the real (measured) TP, FN, FP to calculate ERM, FCGM, and NCV! It seems that you are going in circles.
You have probably not noticed that your notations “a”, “b”, “c” correspond to FN, FP, P. The “number of genes necessary for the biosynthesis of a SM, that did not cluster with the other necessary genes in the FunOrder analysis” translates, to me, into “genes that we expected to be there but haven’t found”, which is a typical FN. So, your “a” from equation 1 is the FN. Moreover, your FCGM is not a new measure but just the sensitivity, or true positive rate (TPR), or recall; this is evident if you use standard notation:
a=FN; c=P; c-a=P-FN=TP; => (c-a)/c=TP/P=TPR.
What’s the point of inventing new notations?
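The identity between FCGM and sensitivity can also be checked numerically. This is a minimal sketch, assuming FCGM = 1 - a/c as I read equation 1; the (P, FN) counts are made up for illustration and are not taken from the manuscript.

```python
from fractions import Fraction


def fcgm(a, c):
    """FCGM as I read equation 1: 1 - a/c.

    a = necessary genes that did not cluster (i.e. FN),
    c = all necessary genes (i.e. P). This reading is an assumption.
    """
    return 1 - Fraction(a, c)


def sensitivity(tp, fn):
    """Standard true positive rate: TP / (TP + FN)."""
    return Fraction(tp, tp + fn)


# Made-up (P, FN) pairs, including the edge cases FN = 0 and FN = P.
cases = [(5, 0), (5, 2), (8, 3), (3, 3)]
identical = []
for p, fn in cases:
    tp = p - fn
    identical.append(fcgm(a=fn, c=p) == sensitivity(tp, fn))

print(identical)  # [True, True, True, True]
```

Exact fractions are used instead of floats so that the comparison is an equality test and not a rounding question.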
ERM is nothing other than accuracy:
By definition ACC=(TP+TN)/(P+N)
ERM=1-(a+b)/d; a=FN; b=FP (if there were no other genes that should not belong to the cluster); d=P+N; =>
ERM=1-(FN+FP)/(P+N)=(P+N-FN-FP)/(P+N)=(TP+TN)/(P+N)=ACC
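The same kind of numerical check works for ERM and accuracy. Again, this is only a sketch under my reading of the manuscript (a = FN, b = FP, d = P + N); the counts below are invented.

```python
from fractions import Fraction


def erm(a, b, d):
    """ERM as I read it: 1 - (a + b)/d, with a = FN, b = FP, d = P + N (assumed)."""
    return 1 - Fraction(a + b, d)


def accuracy(tp, tn, p, n):
    """Standard accuracy: (TP + TN) / (P + N)."""
    return Fraction(tp + tn, p + n)


def erm_equals_accuracy(p, n, fn, fp):
    # Derive TP and TN from the totals and the errors.
    tp = p - fn
    tn = n - fp
    return erm(a=fn, b=fp, d=p + n) == accuracy(tp, tn, p, n)


# Made-up (P, N, FN, FP) tuples.
cases = [(5, 4, 1, 2), (6, 3, 0, 0), (4, 4, 4, 4), (7, 2, 3, 1)]
checks = [erm_equals_accuracy(*case) for case in cases]
print(checks)  # [True, True, True, True]
```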
I must also point out that the way the equations are written is… a bit strange. There is something of a bracket obsession there. There is no need for brackets in an expression like 1-a/c; division comes before subtraction anyway. The same goes for a/d+b/d; moreover, you are allowed to add the fractions. The scary expression for NCV actually looks like this:
1-g/(2d(d-1))
No need for three classes of brackets, especially between the factors of the multiplication.
Regarding the NCV, I did not fully understand what is meant by g. It is defined as a “number of … distances in all matrices”, but this does not make sense. Is it the number of genes of the considered cluster on strict and combined distances at selected thresholds, in other words, genes that fulfil the condition to be considered as clustered? If yes, then this is just TP. If no, what is it, then? It is also not clear why the denominator is 2d(d-1). In general, could you please explain how this NCV measure was defined and derived, and why?
Results and discussion:
“In our experience, evaluating only the numerical values is not enough for a thorough analysis of a BGC and it is necessary to consider all provided visualisations for a thorough data interpretation” – Usually visualisations are used for illustration or as supporting material. The idea of computational tools is to switch from human interpretation, which may be biased, to something more systematic, isn’t it? There are ways to extract the results of cluster analyses and operate with numbers.
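As an example of operating with numbers rather than inspecting a plot, here is a minimal sketch that extracts groups from a pairwise distance matrix at a fixed threshold (single linkage via union-find). The gene names, the distances and the threshold are invented, and this is not claimed to be how FunOrder clusters its distance matrices.

```python
def clusters_at_threshold(names, dist, threshold):
    """Group items linked by pairwise distances <= threshold (single linkage)."""
    parent = list(range(len(names)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if dist[i][j] <= threshold:
                # Merge the two groups.
                parent[find(i)] = find(j)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    # Largest group first; ties broken by member names.
    return sorted(groups.values(), key=lambda g: (-len(g), g))


# Invented symmetric distance matrix for four hypothetical genes.
genes = ["geneA", "geneB", "geneC", "geneD"]
dist = [
    [0.00, 0.10, 0.20, 0.90],
    [0.10, 0.00, 0.15, 0.80],
    [0.20, 0.15, 0.00, 0.85],
    [0.90, 0.80, 0.85, 0.00],
]
result = clusters_at_threshold(genes, dist, threshold=0.3)
print(result)  # [['geneA', 'geneB', 'geneC'], ['geneD']]
```

Once the group membership is available as a list like this, it can be compared with the expected cluster genes automatically, without a human reading a dendrogram or heat map.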
By the way, the Fig. 3 legend is mixed up.
Performance evaluation
As I think that all metrics are calculated incorrectly, further discussion of the results is senseless. But even if the metrics were correct, the values could hardly be considered good.
This is not surprising because, as I said, we shouldn’t expect that all genes in the clusters are co-evolving.
More comments to come!