- Sep 2019
I am writing this review for the Drummond and Sauer comment on Mathur and VanderWeele (2019). To note, I am familiar with the original meta-analyses considered (one of which I wrote), the Mathur and VanderWeele (henceforth MV2019) article, and I’ve read both Drummond and Sauer’s comment on MV2019 and Mathur’s review of Drummond and Sauer’s comment on MV2019 (hopefully that wasn’t confusing). On balance, I think Drummond and Sauer’s (henceforth DSComment) comment under review here is a very important contribution to this debate. I tended to find DSComment to be convincing and was comparatively less convinced by Mathur’s review or, indeed, MV2019. I hope my thoughts below are constructive.
It’s worth noting that MV2019 suffered from several primary weaknesses. Namely:
- It didn’t really tell us anything we didn’t already know, namely that near-zero effect sizes are common for meta-analyses in violent video game research.
- MV2019, aside from one brief statement as DSComment notes, neglected the well-known methodological issues that tend to spuriously increase effect sizes (unstandardized aggression measures, self-ratings of violent game content, identified QRPs in some studies such as the Singapore dataset, etc.). This resulted in a misuse of meta-analytic procedures.
- MV2019 naïvely interprets (as does Mathur’s review of DSComment) near-zero effect sizes as meaningful, despite numerous reasons not to do so given concerns of false positives.
- MV2019, for an ostensible compilation of meta-analyses, curiously neglects other meta-analyses, such as those by John Sherry or Furuyama-Kanamori & Doi (2016).
At this juncture, publication bias, particularly for experimental studies, has been demonstrated pretty clearly (e.g. Hilgard et al., 2017). I have two comments here. MV2019 offered a novel and not well-tested alternative approach (highlighted again by Mathur’s review) for bias. However, I did not find the arguments convincing, as this approach appears extrapolative and produces results that simply aren’t true. For instance, the argument that 100% of effect sizes in Anderson et al. (2010) are above 0 is quickly falsified merely by looking at the reported effect sizes in the studies included, at least some of which are below .00. This would appear to clearly indicate some error in the procedure of MV2019.
Further, we don't need statistics to speculate about publication bias in Anderson et al. (2010), as there are actual specific examples of published null studies missed by Anderson et al. (see Ferguson & Kilburn, 2010). Further, the publication of null studies in the years immediately following (e.g. von Salisch et al., 2011) indicates that Anderson's search for unpublished studies was clearly biased (indeed, I had unpublished data at that time but was not asked by Anderson and colleagues for it). So there's no need at all for speculation given we have actual examples of missed studies, and a fair number of them.
It might help to highlight also that traditional publication bias techniques probably are only effective with small-sample experimental studies. For large-sample correlational/longitudinal studies, effect sizes tend to be a bit more homogeneous, hovering close to zero. In such studies the accumulation of p-values near .05 is unlikely given the power of large samples: relatively simple QRPs can make p-values jump rapidly from non-significance to something well below .05. Thus, traditional publication bias procedures may return null results for this pool of studies despite QRPs, and thus publication bias, having taken place.
It might also help to note that meta-analyses with weak effects are very fragile to unreported null studies, which probably exist in greater numbers (particularly for large n studies) than would be indicated by publication bias techniques.
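The fragility point can be made concrete with a rough sketch. The numbers below are entirely hypothetical (they are not taken from any of the meta-analyses discussed), and the pooling is a simple fixed-effect average of Fisher-transformed correlations weighted by n - 3, chosen only for illustration. The function name `pooled_r` is likewise an assumption of this sketch.

```python
import math

def pooled_r(studies):
    """Fixed-effect pooled correlation via Fisher's z, weighting each study by n - 3."""
    num = sum((n - 3) * math.atanh(r) for r, n in studies)
    den = sum(n - 3 for _, n in studies)
    return math.tanh(num / den)

# Hypothetical published studies as (r, n) pairs; illustrative values only.
published = [(0.15, 80), (0.10, 120), (0.08, 300), (0.05, 1000)]

# Hypothetical unreported large-sample null studies.
missing_nulls = [(0.0, 1500), (0.0, 2000)]

before = pooled_r(published)
after = pooled_r(published + missing_nulls)

print(f"pooled r, published studies only: {before:.3f}")
print(f"pooled r, with missing nulls added: {after:.3f}")
```

Under these assumed values, two large null studies are enough to cut an already weak pooled estimate by more than half, which is the sense in which weak pooled effects are fragile.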
I agree with Mathur’s comment about experiments not always offering the best evidence, given lack of generalizability to real-world aggression (indeed, that’s been a long-standing concern). However, it might help DSComment to note that, by this point, probably the pool of evidence least likely to find effects are longitudinal studies. I’ve got two preregistered longitudinal analyses of existing datasets myself (here I want to make clear that citing my work is by no means necessary for my positive evaluation of any revisions on DSComment), and there are other fine studies (such as Lobel et al., 2017; Breuer et al., 2015; Kuhn et al., 2018; von Salisch et al., 2011, etc.). The authors may also want to note Przybylski and Weinstein (2019), who offer an excellent example of a preregistered correlational study.
Indeed, in a larger sense, as far as evidence goes, DSComment could highlight recent preregistered evidence from multiple sources (McCarthy et al., 2016; Hilgard et al., 2019; Przybylski & Weinstein, 2019; Ferguson & Wang, 2019, etc.). This would seem to be the most crucial evidence and, aside from one excellent correlational study (Ivory et al.), all of the preregistered results have been null. Even if we think the tiny effect sizes in existing metas provide evidence in support of hypotheses (and we shouldn’t), these preregistered studies suggest we shouldn’t trust even those tiny effects to be “true.”
The weakest aspect of MV2019 was the decision to interpret near-zero effects as meaningful. Mathur argues that tiny effects can be important once spread over a population. However, this is merely speculation, and there’s no data to support it. It’s kind of a truthy thing scholars tend to say defensively when confronted by the possibility that effect sizes don’t support their hypotheses. By making this argument, Mathur invites an examination of population data where convincing evidence (Markey, Markey & French, 2015; Cunningham et al., 2016; Beerthuizen, Weijters & van der Laan, 2017) shows that violent game consumption is associated with reduced violence in society. Granted, some may express caution about looking at societal-level data, but here is where scholars can’t have it both ways: One can’t make claims about societal-level effects, and then not want to look at the societal data. Such arguments make unfalsifiable claims and are unscientific in nature.
The other issue is that this line of argument makes effect sizes irrelevant. If we’re going to interpret effect sizes no matter how near to zero as hypothesis supportive, so long as they are “statistically significant” (which, given the power of meta-analyses, they almost always are), then we needn’t bother reporting effect sizes at all. We’re still basically slaves to NHST, just using effect sizes as a kind of fig leaf for the naked bias of how we interpret weak results.
Also, that’s just not how effect sizes work. They can’t be sprinkled like pixie dust over a population to make them meaningful.
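The remark that pooled estimates are almost always "statistically significant" can be illustrated with a little arithmetic. The sketch below uses a hypothetical near-zero correlation and hypothetical sample sizes, and it treats a pooled sample as if it were one large study tested with the Fisher z approximation. That is a simplification for illustration, not the procedure used in any of the meta-analyses discussed.

```python
import math

def p_value_for_r(r, n):
    """Two-sided p-value for H0: rho = 0, using the Fisher z approximation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

r = 0.04  # hypothetical effect; r squared is 0.0016, i.e. 0.16% of variance

p_single = p_value_for_r(r, 200)      # one typical-sized study (assumed n)
p_pooled = p_value_for_r(r, 20000)    # a large pooled sample (assumed n)

print(f"r = {r}, n = 200:   p = {p_single:.3f}")
print(f"r = {r}, n = 20000: p = {p_pooled:.2e}")
```

The effect size is identical in both cases; only the sample size changes. Significance at the pooled level therefore says nothing about whether the effect is large enough to matter.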
As DSComment points out, effect sizes that are this small have high potential for Type I error. Funder and Ozer (2019) recently contributed to this discussion in a way I think was less than helpful (to be very clear, I respect Funder and Ozer greatly, but disagree with many of their comments on this specific issue). Yet, as they note, interpretation of tiny effects is based on such effects being “reliable”, a condition clearly not in evidence for violent game research given the now extensive literature on the systematic methodological flaws in that literature.
In her comment Dr. Mathur dismisses the comparison with ESP research, but I disagree with (or dismiss?) this dismissal. The fact that effect sizes in meta-analyses for violent game research are identical to those for “magic” is exactly why we should be wary of interpreting such effect sizes as hypothesis supportive. Saying violent game effects are more plausible is irrelevant (and presumably the ESP people would disagree). However, the authors of DSComment might strengthen their argument by noting that some articles have begun examining nonsense outcomes within datasets. For example, in Ferguson and Wang (2019) we show that the (weak and in that case non-significant) effects for violent game playing are no different in predicting aggression than nonsense variables (indeed, the strongest effect was for the age at which one had moved to a new city). Orben and Przybylski (2019) do something similar and very effective with screen time. Point being, we have an expanding literature to suggest that the interpretation of such weak effects is likely to lead us to numerous false positive errors.
The authors of DSComment might also note that MV2019 commit a fundamental error of meta-analysis, namely assuming that the “average effect size wins!” When effect sizes are heterogeneous (as Mathur appears to acknowledge unless I misunderstood) the pooled average effect size is not a meaningful estimator of the population effect size. That’s particularly true given GIGO (garbage in, garbage out). Where QRPs have been clearly demonstrated for some studies in this realm (see Przybylski & Weinstein, 2019 for some specific examples of documentation involving the Singapore dataset), the pooled average effect size, however it is calculated, is almost certainly a spuriously high estimate of true effects.
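A small sketch may help show why a pooled average is a poor summary under heterogeneity. It computes a fixed-effect pooled correlation together with Cochran's Q and the I² statistic on Fisher-transformed correlations. The study values are hypothetical and deliberately constructed (two small studies with sizeable effects, three large studies near zero); they are not drawn from the literature under discussion, and the function name `heterogeneity` is an assumption of this sketch.

```python
import math

def heterogeneity(studies):
    """Return (pooled r, Cochran's Q, I^2) for a list of (r, n) pairs."""
    zs = [math.atanh(r) for r, _ in studies]   # Fisher z per study
    ws = [n - 3 for _, n in studies]           # inverse-variance weights
    z_bar = sum(w * z for w, z in zip(ws, zs)) / sum(ws)
    q = sum(w * (z - z_bar) ** 2 for w, z in zip(ws, zs))
    df = len(studies) - 1
    # I^2: share of total variation attributed to between-study differences.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return math.tanh(z_bar), q, i2

# Hypothetical (r, n) pairs; illustrative values only.
studies = [(0.30, 100), (0.25, 150), (-0.02, 2000), (0.00, 3000), (0.01, 2500)]

pooled, q, i2 = heterogeneity(studies)
print(f"pooled r = {pooled:.3f}, Q = {q:.2f}, I^2 = {i2:.0%}")
```

With these assumed inputs the pooled value sits near zero while most of the variation is between studies, so the single average describes neither the small studies nor the large ones well.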
DSComment could note that other issues such as citation bias are known to be associated with spuriously high effect sizes (Ferguson, 2015), another indication that researcher behaviors are likely pulling effect sizes above the actual population effect size.
Overall, I don’t think MV2019 were very familiar with this field, and they appear unaware of the serious methodological errors endemic in much of the literature, which pull effect sizes spuriously high. In the end, they really didn’t say anything we didn’t already know (the effect sizes across metas tend to be near zero), and their interpretation of these near-zero effect sizes was incorrect.
With that in mind, I do think DSComment is an important part of this debate and is well worth publishing. I hope my comments here are constructive.
Signed, Chris Ferguson