4 Matching Annotations
  1. Jul 2018
    1. On 2013 Dec 12, Adam Eyre-Walker commented:

      I thank Dr. Cherry for another set of insightful comments.

      He points out that we may have overestimated the stochasticity associated with the accumulation of citations. He correctly notes that if assessors tend to err in the same direction in their judgments, then the errors associated with their assessments will be correlated. One might imagine, for example, that assessors tend to over-rate papers in high impact factor journals, by particular authors, or from a particular institution. Such correlated errors will mean that the correlation between assessor scores underestimates the error associated with making an assessment, and this will in turn imply that the stochasticity associated with the accumulation of citations is less than we have estimated. However, if the error associated with the accumulation of citations is also correlated to the error associated with the assessment, then the stochasticity associated with the accumulation of citations may have been underestimated. The errors associated with assessments and citations might be correlated given that citations depend, to some extent, on post-publication subjective assessment.

      A likely bias, and hence a source of correlated errors, is a tendency for assessors to over-rate papers in high-ranking journals. As we showed, the partial correlations between assessor scores, and between assessor score and the number of citations, controlling for the impact factor, are very weak (r < 0.20). This suggests that within journals, subjective estimates of merit and the accumulation of citations are dominated by error. The weakness of these correlations might be because there is little variation in merit within journals, with most of the variance in merit being between journals. However, it seems unlikely that journals are a perfect arbiter of merit because their judgments are based on subjective assessment, which we have demonstrated to be poor. Furthermore, as noted above, it is quite likely that the errors associated with assessments are correlated to errors associated with the accumulation of citations. The system is clearly complex and it may prove very difficult to accurately estimate the variance associated with the accumulation of citations.

      We argued in our original paper that the impact factor might be the best of the methods currently available for assessing merit, though we emphasized that it was likely to be very error prone. We argued that it might be a reasonable measure because the IF is a form of pre-publication review – in accepting a paper for a particular journal, the scientific community has decided that the paper is of sufficient merit to be published where it is accepted. This decision is likely to be the consensus of several individuals, with some individuals, such as editors, having a greater say than others.

      Dr. Cherry points out that using the IF as a measure of merit might potentially be matched by combining the post-publication assessments of several individuals. He shows that if we ignore any potential biases (i.e. correlated errors), for example assessors being influenced by the IF, then the estimated correlation between assessor score and merit is 0.60, and the correlation between the IF and merit is expected to be 0.80.

      Dr. Cherry arrives at these estimates in the following manner (he elaborated upon this in a subsequent email). If the errors are uncorrelated, then the correlation between two variables that are each correlated to X is expected to be the product of their correlations to X; e.g. if the correlation between variable 1 and X is r1 and the correlation between variable 2 and X is r2, then the correlation between 1 and 2 is expected to be r1*r2. The correlation between assessor scores in the Wellcome Trust data is 0.36, which implies that the correlation between a single assessor score and merit is SQRT(0.36) = 0.60. The square of this is the proportion of the variance in score explained by merit, which is equivalent to our equation 1 (the square of the correlation between a single assessor and merit is the expected correlation between two assessors). From this equation we can estimate the ratio of the error to merit variance, which is 1.78 from the Wellcome Trust data. If we have n independent assessors we expect this ratio to be reduced by a factor of n; hence the expected correlation between the mean score from n assessors and merit is 0.60 (n=1), 0.73 (n=2) and 0.80 (n=3).

      The inferred correlation between the IF and merit is 0.80; this comes from noting that the correlation between assessor score and IF is expected to be the product of the correlation between assessor score and merit, and the correlation between IF and merit. Given that the correlation between assessor score and merit has been estimated to be 0.60, and the observed correlation between assessor score and IF is 0.48, we estimate that the correlation between IF and merit is 0.48/0.60 = 0.80. Hence we would need 3 independent assessors to match the correlation between IF and merit. For comparison, the correlation between the number of citations and merit is inferred to be 0.69 if we use the correlation between IF and the number of citations, and 0.63 if we use the correlation between assessor score and the number of citations to make the estimate. Hence the IF is the best measure of merit, and would only be rivaled by subjective assessment if we engaged 3 independent reviewers; the number of citations is estimated to be better than a single reviewer, but worse than two reviewers. However, I would emphasise that these estimates all assume that errors are uncorrelated.
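
      For readers who want to check the arithmetic, here is a short script that simply re-traces the calculation described above (a sketch only, assuming uncorrelated errors; the 0.36 and 0.48 figures are the Wellcome Trust values quoted in this thread):

      ```python
      # Re-traces Dr. Cherry's calculation under the uncorrelated-errors assumption.
      from math import sqrt

      r_assessors = 0.36   # observed correlation between two assessors' scores (WT data)
      r_score_if  = 0.48   # observed correlation between assessor score and impact factor

      r_score_merit  = sqrt(r_assessors)                 # 0.60
      error_to_merit = (1 - r_assessors) / r_assessors   # ratio of error to merit variance, ~1.78

      # Expected correlation between the mean of n independent assessors and merit
      for n in (1, 2, 3):
          r_n = sqrt(1 / (1 + error_to_merit / n))
          print(n, round(r_n, 2))                        # 0.60, 0.73, 0.79 (quoted as 0.80 above)

      # Inferred correlation between IF and merit: r_score_if = r_score_merit * r_if_merit
      print(round(r_score_if / r_score_merit, 2))        # 0.80
      ```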

      Finally, Dr. Cherry points out a problem with using the correlation coefficient on a bounded scale. The correlation coefficient is typically scale independent – i.e. if you add a constant to one of the variables, or multiply it by a positive constant, the correlation coefficient remains unchanged. However, this is only true if the scale is unbounded; if there is a maximum or minimum value then the correlation may be poor, even if the reviewers agree on the ranking of the papers. For example, if one assessor tends to rate harshly and another generously, then the correlation may be poor because most of the harsh reviewer’s scores are the lowest mark and most of the generous reviewer’s scores are the highest mark. The solution to this problem is to offer an essentially unbounded scale. However, as we pointed out in our original article, the tendency for reviewers to differ in their average mark could potentially have serious consequences in an assessment exercise, particularly if individuals or universities are assessed by a limited number of individuals.
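
      A minimal simulation may make this concrete (illustrative only: the 1–4 scale and the generosity offsets are assumptions for the example, not the scoring scheme used in any real assessment exercise):

      ```python
      # Two assessors who agree perfectly on the ranking of papers, but differ in
      # generosity, can show a weak Pearson correlation once scores are clipped
      # to a bounded scale.
      import numpy as np

      rng = np.random.default_rng(1)
      merit = rng.normal(size=5000)                            # latent merit of each paper

      harsh    = np.clip(np.round(2.5 + merit - 1.5), 1, 4)    # mostly the lowest marks
      generous = np.clip(np.round(2.5 + merit + 1.5), 1, 4)    # mostly the highest marks

      print(round(np.corrcoef(harsh, generous)[0, 1], 2))          # well below 1
      print(round(np.corrcoef(merit - 1.5, merit + 1.5)[0, 1], 2)) # 1.0 on an unbounded scale
      ```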


      This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY.

    2. On 2013 Nov 11, Joshua L Cherry commented:

      I thank Dr. Eyre-Walker for a wonderful response that should serve as a model for the rest of us.

      I have some additional comments about the paper's interpretation of the reported correlation coefficients. I would question the conclusion that journal impact factor (IF) is the "least bad" measure of article merit.

      1) The authors conclude that citation is a highly stochastic process and that citation number is a poor measure of article merit. This interpretation rests on assumptions that preclude the possibility that citation information contains any unique wisdom. In technical terms, the problematic assumption is that the errors made by different assessors in judging merit are uncorrelated, so that citation number can at best provide a less noisy estimate of what assessors estimate. It seems quite reasonable that, contrary to this assumption, different assessors tend to systematically err in the same ways, and that citation number is comparatively free of such errors.

      Suppose that it had turned out that assessor scores correlated perfectly with each other, but still only moderately well with citation number. The authors' reasoning would compel us to believe that the assessors were correct, and that the low correlation with citation number was entirely due to stochasticity of citation. This is not, however, the only reasonable interpretation. The assessors might simply agree in over-rating the merit of some papers and under-rating that of others, while the citation number came closer to the truth.

      This point applies even when, as in reality, between-assessor correlation is far from perfect. Imagine, to take an extreme case, that citation number is a flawless measure of merit. Suppose further that different assessors tend to err in the same way in judging merit, but also disagree with each other to some extent. What, then, would the correlation coefficients look like? They might have exactly the values that were observed in this study. Thus, there seems to be no basis for dismissing citation number as a measure of merit.
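
      A small simulation can show that this scenario is arithmetically consistent with correlations of the size discussed in this thread (an illustrative sketch only: the variance components below are chosen by hand so that the between-assessor correlation comes out near 0.36, and citation number is deliberately made a perfect measure of merit as the extreme case; none of this is a claim about the real data):

      ```python
      # Correlated assessor errors plus citations that track merit perfectly can
      # still yield a modest assessor-assessor correlation and a modest
      # score-citation correlation.
      import numpy as np

      rng = np.random.default_rng(0)
      n = 100_000

      merit  = rng.normal(scale=np.sqrt(0.14), size=n)        # true merit
      shared = rng.normal(scale=np.sqrt(0.22), size=n)        # bias shared by all assessors
      e1, e2 = rng.normal(scale=np.sqrt(0.64), size=(2, n))   # individual assessor noise

      score1 = merit + shared + e1
      score2 = merit + shared + e2
      citations = merit                                       # citations as a flawless measure (extreme case)

      print(round(np.corrcoef(score1, score2)[0, 1], 2))      # ~0.36
      print(round(np.corrcoef(score1, citations)[0, 1], 2))   # ~0.37
      ```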

      2) It is enlightening, nonetheless, to consider performance of the metrics under the assumption that assessor errors are uncorrelated. Under this assumption, the average of scores given by a large number of assessors approaches perfect correlation with merit. It is true that IF correlates better with that average (equivalently, with merit) than does a single assessor score (a correlation of ~0.8 as compared to ~0.6 for the WT data). However, we can also calculate that, for the WT data, the mean of two assessor scores would do about as well as IF, and the mean of three or more assessor scores would be superior to IF as a measure of merit.

      3) Perhaps the most important observation in the paper is the weakness of the correlation between scores given to the same paper by different post-publication assessors. This is at the heart of the conclusion that assessors do a poor job of assessing merit and that their ratings should not be used.

      One possible source of assessor disagreement is that some assessors tend to rate all papers more highly than others do. If a generous assessor is paired with a harsh assessor, the two will often rate a paper differently even if they agree on its merit relative to that of other papers. In principle, even if all assessors agreed perfectly on their ranking of all papers, a low between-assessor correlation could result. The correlation would be high if we considered just two assessors and maintained assessor identity in the calculations. However, the data analyzed in the paper involved many assessors, and assessor order was deliberately randomized, so the correlation would be diminished if assessors effectively use different scales.

      This type of disagreement would not be troubling with respect to assessors' ability to discern merit. It would present a practical problem, but this problem would be amenable to correction, which might be extremely valuable. One can imagine schemes of assessor assignment and score analysis that largely remove this effect. This might yield much improved estimates of merit that, with just two assessors per paper, greatly outperform the suggested use of IF. Even with a single assessor per paper, the position of a paper in the assessor's ranking might be superior to IF as a measure of merit.
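
      One such scheme, offered purely as a sketch of the kind of correction envisaged here (it is not proposed in the paper), is to standardize each assessor's scores before combining or comparing them:

      ```python
      # Sketch of a per-assessor correction: convert each assessor's raw scores to
      # z-scores before averaging, so that systematic generosity or harshness no
      # longer contributes to between-assessor disagreement.
      import numpy as np

      def standardize_by_assessor(scores, assessor_ids):
          """Return scores with each assessor's mean removed and s.d. divided out."""
          scores = np.asarray(scores, dtype=float)
          ids = np.asarray(assessor_ids)
          out = np.empty_like(scores)
          for a in np.unique(ids):
              mask = ids == a
              s = scores[mask]
              out[mask] = (s - s.mean()) / s.std(ddof=1)
          return out

      # Hypothetical example: assessor "A" is harsher than assessor "B", but their
      # standardized scores agree on the ordering of the papers.
      raw = [1, 2, 3, 3, 4, 5]
      who = ["A", "A", "A", "B", "B", "B"]
      print(standardize_by_assessor(raw, who))   # [-1. 0. 1. -1. 0. 1.]
      ```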


      This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY.

    3. On 2013 Nov 01, Adam Eyre-Walker commented:

      We thank the author for his insightful comments. Unfortunately Dr. Cherry is correct (see below); controlling for merit, using a noisy measure such as the number of citations, will leave a correlation between assessor score and the impact factor whether or not there is a tendency for assessors to over-rate papers in high impact journals. Since we did not previously appreciate this problem, and it was not caught by any of the referees who looked at our paper prior to publication, it might be worth elaborating why this is the case. Imagine that we consider all papers that have received 100 citations; we assumed in our analysis that these represented papers of equal merit. However, because the number of citations is a noisy measure of merit, some of these papers will be papers of poor merit that by chance received more than their fair share of citations, and others will be papers of good merit that received less than their fair share of citations. As a consequence there is variation in merit amongst papers that received 100 citations. Hence, if assessors tend to rate better papers more highly and better journals publish better papers, then there will be a correlation between assessor score and the impact factor even when the number of citations is controlled for. Furthermore, the decrease in the correlation between assessor scores, and between assessor score and the number of citations, when the impact factor is controlled for, may simply reflect the decrease in the variance in merit within journals.

      So, as Dr. Cherry concludes, there is no evidence from our analysis that assessors over-rate science in high impact journals. This tendency may exist, but our analysis simply provides no evidence for it. However, the majority of our conclusions are unaffected by this insight; there is a rather poor correlation between assessor scores, and between assessor score and the number of citations, whether or not the impact factor is controlled for. These correlations demonstrate that either assessors do not agree on what constitutes merit or they are not good at identifying merit, and that the accumulation of citations is highly stochastic.

      Finally, we note that the correlation between assessor score and impact factor is stronger than either the correlation between assessor scores or the correlation between assessor score and the number of citations. These correlations therefore suggest that the impact factor is the best measure of merit, provided there is no tendency for assessors to be influenced by the journal in which a paper is published.

      There are perhaps two approaches to determining whether assessors over-rate papers in high-ranking journals: developing a mathematical model of the relationship between assessor score, the number of citations and the impact factor of a journal (so far our attempts to do this have failed), or independently assessing a range of papers both before and after publication.


      This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY.

    4. On 2013 Oct 29, Joshua L Cherry commented:

      This article claims to have demonstrated that post-publication assessors are strongly influenced by their knowledge of the journal in which a paper was published. Specifically, it is claimed that they "tend to over-rate papers published in journals with high impact factors". Furthermore, it is suggested that "scientists have little ability to judge...the intrinsic merit of a paper".

      These conclusions, which are based on coefficients of correlation between article metrics, do not follow from the data. The inferences involved are akin to taking correlation as proof of causation.

      The authors first observe that journal impact factor (IF) correlates with assessor score even when citation number is controlled for. They interpret this as evidence that (assessor knowledge of) IF directly influences assessor score. The observed partial correlation is in fact expected for imperfect measures of a latent variable even in the absence of causal effects among the measures. If, for example, all three variables (assessor score, IF, and citation number) are noisy measures of article merit with uncorrelated noise, any two are necessarily correlated with each other even when the third is controlled for. Even if we took citation number as a perfect measure of merit (which we have no reason to do), the correlations would show only that assessor score and IF tend to err in the same way, not that one of them influences the other. Note that controlling for citation number does not eliminate the correlation between the scores given by two assessors, but it would be erroneous to conclude that one affected the other. The authors even tell us that citation number is "a very poor measure of the underlying merit of the science, because the accumulation of citations is highly stochastic". Controlling for such a variable could not possibly eliminate the correlation between assessor score and IF, so the authors' reasoning would suggest a strong assessor bias even if no such bias existed.
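
      The point is easy to check numerically. In the following sketch (illustrative assumptions only: each measure is modelled as merit plus its own independent noise, with arbitrary noise levels), the partial correlation between assessor score and IF, controlling for citation number, remains clearly positive even though no measure influences any other:

      ```python
      # Three noisy, causally unrelated measures of the same latent merit still
      # show a positive partial correlation after controlling for one of them.
      import numpy as np

      rng = np.random.default_rng(42)
      n = 200_000
      merit = rng.normal(size=n)

      score     = merit + rng.normal(size=n)   # assessor score
      impact    = merit + rng.normal(size=n)   # journal impact factor (stand-in)
      citations = merit + rng.normal(size=n)   # citation number (noisy, per the paper)

      def partial_corr(x, y, z):
          """Pearson correlation of x and y after controlling for z."""
          rxy = np.corrcoef(x, y)[0, 1]
          rxz = np.corrcoef(x, z)[0, 1]
          ryz = np.corrcoef(y, z)[0, 1]
          return (rxy - rxz * ryz) / np.sqrt((1 - rxz**2) * (1 - ryz**2))

      print(round(np.corrcoef(score, impact)[0, 1], 2))        # ~0.50
      print(round(partial_corr(score, impact, citations), 2))  # ~0.33: positive, with no causal link
      ```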

      The authors also point out that controlling for IF significantly reduces the correlation between scores given by different assessors. They conclude that much of the correlation between assessors is due to their both being influenced by their knowledge of IF, rather than reflecting assessment based directly on intrinsic merit. This inference, too, is unfounded. Controlling for one of three intercorrelated variables can reduce the correlation between the other two under a range of conditions that do not involve causal connections. In fact, when two positively correlated variables positively correlate to the same extent with a third, as expected for assessor scores (since ordering of assessors is arbitrary), controlling for the third variable necessarily decreases the correlation between the first two. Thus, the observed reduction in correlation will occur whenever assessor scores correlate positively with IF, which certainly does not require the posited causal effect.
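
      For completeness, the algebra behind this claim, using the standard formula for a first-order partial correlation (the notation is mine, not the paper's); with r denoting the common correlation of each assessor score with the third variable:

      ```latex
      r_{12\cdot 3}
        = \frac{r_{12} - r_{13}\, r_{23}}{\sqrt{(1 - r_{13}^{2})(1 - r_{23}^{2})}}
        \;\stackrel{r_{13} = r_{23} = r}{=}\;
        \frac{r_{12} - r^{2}}{1 - r^{2}},
      \qquad
      \frac{r_{12} - r^{2}}{1 - r^{2}} < r_{12}
        \iff r^{2}\,(1 - r_{12}) > 0,
      ```

      which holds whenever r is non-zero and the between-assessor correlation is below 1.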

      This is not to say that the claimed effect can be disproven or is implausible, but only that it has not been demonstrated. The observed correlation structure is entirely consistent with the complete absence of such an effect, and in the presence of such an effect the authors' reasoning would likely overestimate it drastically.


      This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY.
