On 2013 Dec 23, Gregory Francis commented:

      A longer version of the following text was submitted to JPSP as a comment. The editor agreed with three reviewers who felt that even if the analysis was valid, it was not useful to publish such a critique.

      Based on six experiments, Wiltermuth and Gino (2012), henceforth WG, concluded that separating rewards into categories increased people’s motivation to complete reward-earning tasks. An analysis of the reported findings suggests that the experiments do not properly support the conclusion.

      The primary variable for experiments 3, 4, and 5 was analyzed with a two-sample t-test, and post-hoc power was estimated with standard calculations. The primary variable for experiments 1, 2, and 6 was analyzed with logistic regression, and post-hoc power for each of these analyses was estimated from 10,000 simulated data sets drawn from populations whose parameters matched the sample statistics of the experiments. Each simulated data set was analyzed in the same way as the original experiment, and the proportion of simulated data sets that produced statistical significance was taken as the estimate of power. For experiment 6, a successful experimental outcome required two different comparisons to be statistically significant, and this requirement was also imposed on the estimate of experimental power. In every case, power was measured as the probability of rejecting the null hypothesis in the direction consistent with the findings reported in WG.
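
      To make the simulation procedure concrete, the following is a minimal R sketch of that approach (it is not the posted code, and the group sizes and proportions n1, n2, p1, and p2 are hypothetical placeholders rather than values from WG):

        # Simulation-based post-hoc power for a two-group design with a binary
        # outcome analyzed by logistic regression.
        # n1, n2 = group sizes; p1, p2 = observed proportions (placeholders).
        estimate_power <- function(n1, n2, p1, p2, alpha = 0.05, nsim = 10000) {
          group <- factor(c(rep("control", n1), rep("treatment", n2)))
          hits <- 0
          for (i in seq_len(nsim)) {
            # draw a simulated data set from populations matching the sample statistics
            y <- c(rbinom(n1, 1, p1), rbinom(n2, 1, p2))
            fit <- glm(y ~ group, family = binomial)
            coefs <- summary(fit)$coefficients
            est  <- coefs["grouptreatment", "Estimate"]
            pval <- coefs["grouptreatment", "Pr(>|z|)"]
            # count only rejections in the direction of the reported effect
            # (here assumed to be a higher proportion in the treatment group)
            if (pval < alpha && est > 0) hits <- hits + 1
          }
          hits / nsim  # proportion of significant simulations = power estimate
        }
        # e.g., estimate_power(n1 = 40, n2 = 40, p1 = 0.55, p2 = 0.80)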

      Re-computing the p-values from the data presented in the paper shows that experiments 2, 3, and 4 produced p-values slightly above the standard significance criterion (.0546, .0516, and .0544, respectively). The deviations are too large to be due to rounding of the reported sample proportions or t-statistics. It appears that WG rounded some p-values down to .05 and then indicated statistical significance. In practice, this means that for these experiments WG used a significance criterion of .055 rather than the typical .05. Because experimental power increases with the significance criterion, the value .055 was used to estimate power, as this properly reflects the criterion WG actually used.
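
      The direction of this adjustment can be checked with base R's power.t.test, which gives slightly higher power at the looser criterion; the effect size and group size below are arbitrary placeholders, not values from WG:

        # power of a two-sample, two-sided t-test under the two criteria
        power.t.test(n = 50, delta = 0.4, sd = 1, sig.level = 0.050)$power
        power.t.test(n = 50, delta = 0.4, sd = 1, sig.level = 0.055)$power
        # the second call returns a slightly larger power value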

      Because most of the experiments only barely rejected the null hypothesis, the post-hoc power values tend to be close to one half. For the six experiments, the estimated power values are .618, .507, .509, .499, .677, and .561, respectively. Since the experiments are independent, the expected number of experiments that would reject the null hypothesis is the sum of the power values, which is 3.37. The probability that six experiments like these would all reject the null hypothesis is the product of the power values, which is .030. This probability is so low (an often-used criterion is .1) that it renders the reported experimental results “too good to be true.”
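
      These two summary numbers follow directly from the listed power estimates:

        pow <- c(.618, .507, .509, .499, .677, .561)  # per-experiment power estimates
        sum(pow)   # expected number of significant experiments, about 3.37
        prod(pow)  # probability that all six are significant, about .030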

      The analysis does not specify how such unusual results could have been produced, but there are three broad possibilities. First, the reported experiments might be a subset of a larger set of experiments that included unreported null results. Second, the reported experiments may have been subject to a verification bias that relied on improper sampling methods. Third, the reported statistical analyses may have been selected from several different methods (e.g., using different criteria to indicate significance) until one produced statistically significant results. All of these practices misrepresent the properties of the populations or the experimental outcomes. Thus, readers should be skeptical about the theoretical conclusions derived from the data reported in WG.

      R code for the simulated experiments is available at http://www1.psych.purdue.edu/~gfrancis/Publications/WiltermuthGino/


      This comment, imported by Hypothesis from PubMed Commons, is licensed under CC BY.
