1 Matching Annotation
  1. Jul 2018
    1. On 2014 Nov 20, Bernard Carroll commented:

This report breaks some new ground in design – the between-site train-and-test exercise and the cross-validation of case assignment to unipolar or bipolar groups. As the authors stated, the data are not sufficiently strong for clinical use. My comments are intended to improve the quality of reporting of this and similar studies.

1. There is selective highlighting of some results and a failure to present all the important findings clearly. In particular, the performance of the classification algorithms in distinguishing patients from normal control subjects was relegated to the Supplementary Material. It can be calculated from eTable 7 that between 27.5% and 38% of controls would be misclassified in the 1-way comparison with unipolar depressed cases. The corresponding Kappa coefficients of concordance would be fair at 0.48 for the SVM method and poor at 0.28 for the GPC method. Results for the bipolar contrast with control subjects were similarly weak. If the method cannot do better than this with normal subjects, then clinical use is a very long way away. These sobering data properly belong in the main body of the paper.
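
      As a rough illustration of the Kappa calculation referred to above, the following Python sketch recomputes Cohen's Kappa from a 2 x 2 classification table. The counts are hypothetical placeholders chosen only to land near the 0.48 figure; they are not the actual eTable 7 values.

      ```python
      # Minimal sketch: Cohen's kappa from a 2 x 2 classification table.
      # The counts below are hypothetical, NOT the values in eTable 7.

      def cohens_kappa(table):
          """Cohen's kappa for a square confusion matrix given as nested lists."""
          n = sum(sum(row) for row in table)
          observed = sum(table[i][i] for i in range(len(table))) / n
          expected = sum(
              (sum(table[i]) / n) * (sum(row[i] for row in table) / n)
              for i in range(len(table))
          )
          return (observed - expected) / (1 - expected)

      # Hypothetical counts: rows = true class (case, control),
      # columns = algorithm assignment (case, control).
      example = [[30, 10],
                 [11, 29]]
      print(f"kappa = {cohens_kappa(example):.3f}")  # about 0.475 for these made-up counts
      ```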

2. The cross-site train-and-test results for the algorithms were described as “highly significant” (page 1226). Actually, for the Munster-train, Pittsburgh-test exercise the Kappa values that can be calculated from Table 3 were weak: 0.28 for the SVM method and 0.24 for the GPC method. They were only slightly better for the Pittsburgh-train, Munster-test exercise (Kappa 0.38 for each).

3. P-values were given in Table 3 and in eTables 5, 6, and 7, but there is no statement of what statistical analyses generated these P-values. Were they goodness-of-fit chi-squared tests? Standard tradecraft requires that such analyses be clearly described.
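
      In the absence of such a statement one can only guess. The sketch below shows, for illustration only, two chi-squared analyses that could in principle produce P-values for a 2 x 2 classification table; the counts are hypothetical and the paper does not say which test was actually used.

      ```python
      # Illustration only: two chi-squared analyses that could generate P-values
      # for a 2 x 2 classification table. Counts are hypothetical placeholders.
      from scipy.stats import chi2_contingency, chisquare

      table = [[30, 10],   # hypothetical: rows = true class, columns = assigned class
               [11, 29]]

      # Pearson chi-squared test of independence on the full table
      chi2, p, dof, _ = chi2_contingency(table, correction=False)
      print(f"test of independence: chi2 = {chi2:.2f}, df = {dof}, P = {p:.2g}")

      # Goodness-of-fit variant: observed correct/incorrect assignments vs. chance
      observed = [59, 21]                       # hypothetical correct vs. incorrect
      chi2_gof, p_gof = chisquare(observed, f_exp=[40, 40])
      print(f"goodness of fit:      chi2 = {chi2_gof:.2f}, P = {p_gof:.2g}")
      ```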

      4. No correction of P-values was made for multiple comparisons. That is another aspect of standard tradecraft.
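
      For what it is worth, such a correction is straightforward to apply. The sketch below uses made-up P-values (not those reported in Table 3 or eTables 5-7) and the statsmodels multipletests helper simply to show the form the adjustment would take.

      ```python
      # Sketch of a multiple-comparison adjustment using made-up P-values
      # (not those reported in Table 3 or eTables 5-7).
      from statsmodels.stats.multitest import multipletests

      p_values = [0.001, 0.004, 0.02, 0.03, 0.21]   # hypothetical placeholders

      # Bonferroni: each P-value is multiplied by the number of comparisons (capped at 1)
      reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

      # Holm's step-down procedure: a uniformly less conservative alternative
      reject_holm, p_holm, _, _ = multipletests(p_values, alpha=0.05, method="holm")

      for raw, pb, ph in zip(p_values, p_bonf, p_holm):
          print(f"raw P = {raw:.3f}  Bonferroni = {pb:.3f}  Holm = {ph:.3f}")
      ```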

      5. No data were shown for test-retest reliability of the algorithm-derived group assignments.

      6. All the analyses were predicated on the untenable assumption that the clinical diagnoses were 100% accurate. As the DSM-5 field trials taught us, that is far from the case in the real world of clinical assessment – the Kappa value for major depressive disorder diagnoses averaged over 4 sites was poor at 0.28 (Regier et al 2013). The authors failed to consider whether this confound degraded the strength of their findings (see a discussion of this issue in Carroll BJ 1989). At the very least, a statement of diagnostic reliability for the cases in this study is needed.

      References

Carroll BJ. Diagnostic validity and laboratory studies: rules of the game. In: Robins LN, Barrett JE, eds. The Validity of Psychiatric Diagnosis. New York: Raven Press; 1989. pp. 229-245.

Regier DA, et al. DSM-5 Field Trials in the United States and Canada, Part II: Test-Retest Reliability of Selected Categorical Diagnoses. Am J Psychiatry 2013; 170: 59-70.


      This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY.
