On 2014 Sep 13, Vahid Rakhshan commented:
The formatted version of this unpublished letter to the editor with tables can be found at this address (LINK).
Sufficient test powers are necessary for validating non-significant results
The sample size was not based on power calculations. This might be partly justified by the difficulty of finding surgery patients; however, compounded with the issues below, it can become a serious problem.
Except for one, no pairwise comparison was significant, which is why power becomes a serious concern. Given the rather small sample size (2×13 for each t-test), many of those non-significant results could simply be type II errors (false-negative results).
No exact P values were reported for the comparisons; it was only stated that “P>0.05”. Reporting exact P values is necessary to distinguish borderline cases such as P=0.053 [which still point to some difference] from cases such as P=0.999.
Regarding the interclass correlations, only the correlation coefficients were reported. It was not mentioned whether those coefficients were significant or non-significant (not even as a “P>0.05” or “P≤0.05” statement). Statistical significance is needed to verify whether a correlation result generalizes to the population. No matter how large a correlation coefficient is, unless it is verified statistically (P≤α), it remains unsubstantiated and should not be used as a finding.
The authors have stated as a limitation “the few subjects (6) with Le Fort advancement surgery”. However, a more serious (and ignored) limitation was the lack of power calculations, even post-hoc ones. Certain studies [like this one] might favor the absence of significant differences between a new approach and the gold standard (i.e. P>0.05). Power calculations are essential, especially in such studies: without sufficient power, the non-significant results are not reliable, as they may simply be false-negative errors.
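As an illustration only (not the authors' analysis), the following minimal Python sketch shows how a post-hoc power check for a paired t-test with 13 pairs could be carried out; the effect sizes are assumed benchmark values rather than values taken from the paper, and the statsmodels package is used.

```python
# Minimal post-hoc power sketch for a paired t-test with n = 13 pairs.
# The effect sizes (Cohen's d of the paired differences) are illustrative
# benchmarks; real values would come from each comparison's mean/SD of differences.
from statsmodels.stats.power import TTestPower

n_pairs = 13   # reported sample size per comparison
alpha = 0.05
for d in (0.2, 0.5, 0.8):   # small / medium / large assumed effects
    power = TTestPower().solve_power(effect_size=d, nobs=n_pairs,
                                     alpha=alpha, alternative='two-sided')
    print(f"d = {d:.1f}: power = {power:.2f}")
```

With only 13 pairs, such a check typically shows that small-to-moderate effects have little chance of reaching significance, which is exactly why the non-significant results need this kind of justification.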
The Discussion/Conclusion/Abstract are full of strong conclusions regarding the accuracy of the evaluated programs. With a design of unknown (and possibly low) power and many non-significant results, I could not be so confident about the accuracy of the programs.
The sample size (2×13) seemed rather small for a paired t-test. The authors might need to justify the use of a parametric test in this sample through proper normality assessments, for example along the lines sketched below.
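A minimal sketch of such a normality assessment is given here, with placeholder data only; scipy's Shapiro-Wilk test is one common choice, not necessarily what the authors would use.

```python
# Sketch of a normality assessment for the paired differences (n = 13).
# The data below are placeholders; the real differences would come from the
# program-derived vs. reference measurements of each landmark.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
differences = rng.normal(loc=0.1, scale=0.5, size=13)   # placeholder values

w_stat, p_value = stats.shapiro(differences)
print(f"Shapiro-Wilk W = {w_stat:.3f}, P = {p_value:.3f}")
# If P <= 0.05, normality is questionable and a non-parametric alternative
# (e.g., the Wilcoxon signed-rank test) might be more defensible.
```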
In Tables II–V [last columns], the “(maximum error)” was calculated incorrectly for some rows (Table I of this letter).
The 95% CIs are critical and were also used in the authors’ interpretations. At first look, some of the 95% CIs did not accord with the reported means. For example, in Table II, the mean error for “Stomion Superior” is +0.10±0.004 (minimum: –0.30, maximum: +1.02), yet its 95% CI was –0.36 to +0.13, which does not accord with the given information. In the case of “Bridge of Nose” in Table V, the 95% CI was not even around the mean (mean = –0.063, CI = –0.20 to –0.57)! Hence, I re-calculated the CIs using the provided means, SDs, and n=13, based on two distributions: normal and t (Tables II–III of this letter). The t distribution produced completely different CIs, while the normal distribution showed some similarity to the original report; even then, many of the original CIs were extremely different from the CIs calculated from the means/SDs. I think many of the originally reported CIs are seriously erroneous. Any clarifications are appreciated.
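To make the recalculation transparent, a minimal sketch is given below, assuming the reported intervals were intended as confidence intervals of the mean (mean ± critical value × SD/√n); the mean and SD are the “Stomion Superior” values quoted above, used purely as an example.

```python
# Sketch: 95% CI of the mean from a reported mean, SD, and n = 13,
# under (a) the normal distribution and (b) the t distribution (df = 12).
# Assumes the reported CIs were meant as CIs of the mean: mean +/- crit * SD/sqrt(n).
import math
from scipy import stats

mean, sd, n = 0.10, 0.004, 13      # "Stomion Superior" values as quoted above
se = sd / math.sqrt(n)

z_crit = stats.norm.ppf(0.975)           # ~1.96
t_crit = stats.t.ppf(0.975, df=n - 1)    # ~2.18 for df = 12

print("normal:", (mean - z_crit * se, mean + z_crit * se))
print("t     :", (mean - t_crit * se, mean + t_crit * se))
```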
Table I. The corrected “(maximum error)” values alongside the originally reported, incorrect ones. Both columns show the original 95% CIs on their left side (identical in both columns and taken from Nadjmi et al.); on the right side of each column, the “(maximum error)” appears within the parenthesized block. In the left column this value is calculated incorrectly (red font); the corrected value is shown in the right column.
Table II. The 95% confidence intervals (based on the normal distribution) for the means reported in Tables II to IV of Nadjmi et al. Only the rows in bold blue font contain original CIs that match the CIs I computed from the means and SDs. The remaining original CIs were inconsistent with the mean/SD (highlighted in yellow). Moreover, among those questionable original CIs (highlighted in yellow), some were strangely different from what they should be according to the reported means and SDs; these are written in bold red font.
Table III. The 95% confidence intervals for the means reported in Tables II to V of Nadjmi et al., calculated based on the t distribution. None of these CI values were similar to the CI values reported by Nadjmi et al.
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY.