On 2016 Jan 13, Mike Fay commented:
Halsey, Curran-Everett, Vowler, and Drummond (2015) correctly point out that P values are variable and that estimates and confidence intervals often provide more useful information. I disagree with some of the implications they draw from these points. For example, they state “The P value is often used without the realization that in most cases the statistical power of a study is too low for P to assist the interpretation of the data.” The problem with this statement is that in practice we never know the statistical power of a study, so we never know whether it is too low. Statistical power is a function of the sample size, the statistical model, and the true values of the parameters. If the statistical model described the data-generating process and we knew the true values of the parameters, then we could know the power. But we collect data precisely because we do not know the true values of the parameters, and hence we do not know the power. So we never know whether the power of a study is “too low”. Nevertheless, P can assist in interpreting the data. The P value is the probability of observing data as extreme as, or more extreme than, those obtained, given that the null hypothesis is true. Small P values therefore suggest that the null hypothesis is unlikely to hold.
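Fay's point, that power depends on true parameter values the researcher never observes, can be illustrated with a short simulation. The Python sketch below is an editorial addition, not part of either comment; the sample size, effect sizes, and significance threshold are all illustrative assumptions. It estimates the power of a two-sample t-test at a fixed design under three hypothetical true effects, showing that the same experiment can have very different power depending on an unknown quantity.

```python
# A minimal sketch (illustrative values only): the power of one fixed
# design depends on the true effect size, which is unknown in practice.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 20          # sample size per group (assumed)
alpha = 0.05    # significance threshold (assumed)
reps = 10_000   # replicate experiments per scenario

for true_diff in (0.2, 0.5, 0.8):  # hypothetical true effects, in SD units
    rejections = 0
    for _ in range(reps):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(true_diff, 1.0, n)
        rejections += stats.ttest_ind(a, b).pvalue < alpha
    print(f"true difference = {true_diff} SD: estimated power = {rejections / reps:.2f}")
```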
Here is another quote from the paper: “Put simply, the P value is usually a poor test of the null hypothesis.” This statement raises the question: is there another single statistic that is better for testing the null hypothesis? If you want one statistic to test the null hypothesis, the P value is precisely the statistic designed to do that. So the P value is not “flawed” (see the last paragraph of the article); it is one statistic designed to perform one function. Certainly, P values should not be the whole statistical story for any data set, but they can work for what they are designed to do.
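What it means for P to “work for what it is designed to do” is that it is calibrated under the null: when the null hypothesis is true, P is (approximately) uniformly distributed, so P falls below alpha with probability alpha. A minimal sketch of this, again an editorial illustration with assumed values:

```python
# A minimal sketch: with a true null (both groups from the same
# distribution), P values are approximately uniform, so the
# false-positive rate matches the chosen alpha. Illustrative values only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps, alpha = 20, 10_000, 0.05
pvals = np.array([
    stats.ttest_ind(rng.normal(0.0, 1.0, n), rng.normal(0.0, 1.0, n)).pvalue
    for _ in range(reps)
])
print(f"fraction of P values below {alpha}: {(pvals < alpha).mean():.3f}")  # ~0.05
```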
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY.
On 2016 Jan 16, Lewis G Halsey commented:
Fay worries that in the real world of data collection, the power of a study is not known in advance. If true, this argument only compounds our own: unless power is very high (>90%), P has surprisingly low repeatability (Halsey et al., 2015), and the power of most studies, calculated after data analysis, is far lower than this (Button et al., 2013; Maxwell, 2004). On Fay's account, then, researchers could neither design their experiments to ensure very high power, nor rely on good fortune for their experiments to turn out that way.
In any case, an integral step in testing the null hypothesis generates an estimate of the variance of the pooled population. Using this estimate, obtained as the data are analysed, the researcher can immediately gauge the study’s power. The exact parameters of the population are never known: they are hypothesised and then estimated, with varying certainty, from the sample that we have. With a limited sample, these estimates can vary substantially each time an experiment is repeated. If our samples, and the estimates they generate, suggest that power is poor, then any P value we obtain, low or not, is untrustworthy. A small P value is of little import, because a repetition of the same study would likely give a different result (our study, figure 4). This is like looking at the world through a pinhole. When the theoretical power is 0.48, P values below 0.05 are no more likely than P values above 0.05. Why get excited if P is <0.01, when the next replicate experiment could give a P of 0.6?
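The pinhole claim is easy to reproduce in simulation. The sketch below is our illustration: the design (n = 30 per group, true effect 0.5 SD) is an assumption chosen so that theoretical power comes out near 0.48; it is not the exact setup behind figure 4 of Halsey et al. (2015).

```python
# A minimal sketch of the "fickle P value": replicate one underpowered
# experiment many times and examine the spread of P. The design here
# (n = 30 per group, true effect 0.5 SD) is assumed so that theoretical
# power is close to 0.48; it is not taken from the paper.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, reps = 30, 10_000
pvals = np.empty(reps)
for i in range(reps):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(0.5, 1.0, n)
    pvals[i] = stats.ttest_ind(a, b).pvalue

print(f"fraction with P < 0.05: {(pvals < 0.05).mean():.2f}")  # ~0.48 by design
lo, med, hi = np.percentile(pvals, [10, 50, 90])
print(f"10th / 50th / 90th percentiles of P: {lo:.4f} / {med:.4f} / {hi:.4f}")
```

In a typical run the replicate P values span from well below 0.01 to above 0.5, which is exactly the fickleness described here.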
Fay asks whether there is a single measure better than P for testing whether the null hypothesis is untenable. First, this brings us back to the nub of the problem: P is a good test of the null only in the ideal circumstance that study power is very high. Second, there are long-standing, serious concerns about the value of null hypothesis significance testing as a method for analysing and interpreting data (Cohen, 1994).
Lewis G Halsey and Gordon B Drummond
Button, K., Ioannidis, J., Mokrysz, C., Nosek, B., Flint, J., Robinson, E. & Munafò, M. (2013) Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14, 365-376.
Cohen, J. (1994) The earth is round (p < .05). American Psychologist, 49, 997-1003.
Halsey, L., Curran-Everett, D., Vowler, S. & Drummond, G. (2015) The fickle P value generates irreproducible results. Nature Methods, 12, 179-185.
Maxwell, S. (2004) The persistence of underpowered studies in psychological research: causes, consequences, and remedies. Psychological Methods, 9, 147-163.
This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY.