47 Matching Annotations
1. Oct 2021
2. psyarxiv.com
1. It is not I who wishes to ignore these issues in favor of a dichotomous and ritualized decision to act as if one hypothesis or the other is true, based on a statistical test; it is the NHST aficionados who wish this lack of nuance upon science

As explained before, the author misunderstands the philosophy of science underlying NP hypothesis testing - we have explained this here: https://psyarxiv.com/af9by/ and we recommend reading through this explanation to see the mistakes in this section.

2. ce, part 1).

This should link to part 2

3. it would be necessary to employ limited sample sizes so as decrease the probability of obtaining statistical significance

No, you should just perform range predictions, especially in huge sample sizes (see Lakens, 2021, where I explain why the rest of this paragraph is flawed).

4. Because the model is wrong, it should be obvious that provided a sufficiently large sample size is obtained, the null hypothesis will be rejected

And yet, this often does not happen in huge randomized controlled trials. So it seems the models are close enough to the truth that tiny differences are not detected even with tens of thousands of people. If they are detected, the effect size (which also needs to be interpreted) will show that the effect is trivially small. So this is a non-issue.
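As a minimal sketch of this point, using hypothetical numbers that do not come from any actual trial: for a two-group comparison with known unit variance, the two-sided p-value for an observed standardized difference d with n participants per group follows from z = d * sqrt(n / 2). The function name and the example values (d = 0.01, n = 10,000 and n = 1,000,000 per group) are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(d, n_per_group):
    """Two-sided p-value of a two-sample z-test (known unit variance)
    for an observed standardized mean difference d."""
    z = d * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


d = 0.01  # hypothetical, trivially small observed effect
p_large = two_sided_p(d, 10_000)     # 20,000 people in total
p_huge = two_sided_p(d, 1_000_000)   # 2,000,000 people in total

print(f"d = {d}: p = {p_large:.3f} with n = 10,000 per group")
print(f"d = {d}: p = {p_huge:.2e} with n = 1,000,000 per group")
```

Under these assumptions the same d = 0.01 is not significant with 10,000 people per group and is significant with 1,000,000 per group, yet the effect size is identical in both cases and still has to be interpreted.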

5. NHST is blatantly inadequate at the theory level,

Every approach to statistics is - it must be so. So this is not a problem - it's philosophy of science.

6. Therefore, there is no way to know whether evidence that is unlikely in light of the model is because the model is slightly wrong, extremely wrong, or somewhere in-between.

This is irrelevant, because the p-value is based on the assumption that the model is true. If you want to falsify that model, collect data and falsify it. You cannot claim that all models are wrong - that is unscientific from a methodological falsificationist perspective on science (which forms the basis of Neyman-Pearson hypothesis testing).

7. t cannot index how wrong M

You might want to link this to the well-known problem of underdetermination, which seems to be what you actually want to discuss. There is a long literature on this - engage with it.

8. null hypothesis, but between observed data and the whole model in which the null hypothesis is embedded

There is no difference here - the null hypothesis is a model of a data generating process.

9. Firstly, a p value is a measure, not of compatibility, but of incompatibility.

This seems like semantics - if p = 1, the data are perfectly compatible with the null. If it is .99, it is almost perfectly compatible with the null. Or rephrase it as only ever so slightly incompatible. More importantly, the author fails to see that I am citing Greenland and colleagues - the use of 'compatibility' is not my term, but theirs, as Greenland et al. write: "we will adopt a more general view of the P value as a statistical summary of the compatibility between the observed data and what we would predict or expect to see if we knew the entire statistical model". So please do not ascribe this mistake (which is not a mistake) to me - I am just citing Greenland.

10. they only have a 5% chance of having wrongly rejected the test hypothesis

This is a misinterpretation of the meaning of a p-value. In any single study, this probability is either 0% or 100% - and it is not known which of these 2 it is. We can only say that in the long run we will not be wrong more than 5% of the time. In this study, they could have a 100% probability of being wrong.
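The long-run reading can be illustrated with a small simulation. This is only a sketch under stated assumptions: every simulated study draws from a true null (mean 0, known standard deviation 1) and is analyzed with a two-sided z-test; the function name, the number of studies, the sample size, and the seed are all arbitrary choices of mine.

```python
import random
from statistics import NormalDist


def long_run_type1_rate(n_studies=4000, n=30, alpha=0.05, seed=1):
    """Fraction of simulated studies that reject a true null hypothesis."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_studies):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # z statistic for H0: mean = 0 with known sigma = 1
        z = (sum(sample) / n) * n ** 0.5
        # Within a single study the decision is simply right or wrong.
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_studies


rate = long_run_type1_rate()
print(f"Long-run proportion of wrong rejections: {rate:.3f}")
```

Each individual study in the loop either rejects wrongly or does not; only the proportion across many studies stays close to the 5% alpha level.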

11. has a known probability of wrongly rejecting the test hypothesis

No - they have a MAXIMUM probability in the long run.

12. Lakens (2021) committed that a p value is based on a hypothesis rather than on a mode

13. properly

This is not the correct title.


3. Jul 2021
4. osf.io
1. If so, the landscape of open data may not be the democratisation of knowledge, but just a further mechanism whereby the rich get richer

This paragraph is not convincing. The claim that open data is primarily accessible to institutes with more budget might be true - but this is irrelevant. You need to argue that the increase in accessibility has lower marginal utility for poorer institutes - but the opposite must be true. There is too much confirmation bias in this paragraph - you need to do a better job providing a fair cost-benefit analysis - it reads as if you were searching for arguments to support a preconceived notion - not like you tried to honestly weigh costs and benefits.

2. undermine their utility

This is a very strong statement - in monetary terms, it is important to note that the APC for people in some countries is 0 after a waiver, and in other countries it costs research budget. A strong statement requires a much better cost-benefit analysis here to be believable.

3. Open Science relies upon local training

Why? I learned everything I know about Open Science online, through free online courses, and open access articles. This seems to be a very strong and unreasonable assumption.

4. equity is one aim of Open Science amongst other

It would be useful to distinguish schools - there are people who do not believe in making 'Open Science' a container concept that means everything.

5. Open Science has been proposed at least in part as a corrective for some of these issues

By whom? Where? Can you provide references? Regardless of who said this, there must be many people who disagree with such a broad definition of open science - open science is typically used to refer to science that is open - not science that is equitable. Equitable science is an orthogonal goal - important, but unrelated to Open Science.


5. Nov 2020
6. psyarxiv.com
1. statistical powe

This is like saying that driving 200 miles per hour inside the city compares favorably on speed of getting to your destination. It is important, but meaningless if you do not also consider the number of people you kill along the way. The article misses a careful reflection on Type 1 error rates, and how these compare between tests - I think because Type 1 error rates look very bad for the Bayes factor approach, as it is well known to be biased towards the null - Figure 3 clearly shows Type 1 error rates are through the roof, and largely unacceptable for scientific purposes.

2. TOST and HDI-ROPE have no discriminatory power for m = 0.1

So you have simulated badly designed studies, because this would not happen in practice after a power analysis. Are your recommendations on which test to use conditional on the requirement of researchers to design bad studies, or do they hold more generally?
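To make the power-analysis point concrete, here is a rough sketch of the power of a two-sample TOST when the true difference is exactly zero. It assumes known unit variance (a z-approximation), symmetric equivalence bounds, and that m is the equivalence margin in standardized units; the function name and the sample sizes are hypothetical and are not taken from the manuscript.

```python
from math import sqrt
from statistics import NormalDist


def tost_power(margin, n_per_group, alpha=0.05):
    """Approximate power of a two-sample TOST when the true difference is 0,
    assuming known unit variance and bounds of -margin and +margin (d units)."""
    std_norm = NormalDist()
    se = sqrt(2 / n_per_group)             # standard error of the difference
    z_alpha = std_norm.inv_cdf(1 - alpha)  # one-sided critical value
    # Both one-sided tests reject only if the estimate lies inside
    # (-margin + z_alpha * se, margin - z_alpha * se); the interval is
    # empty, and power is zero, when margin / se < z_alpha.
    return max(0.0, 2 * std_norm.cdf(margin / se - z_alpha) - 1)


for n in (50, 500, 2000):
    print(f"margin = 0.1, n = {n} per group: power = {tost_power(0.1, n):.2f}")
```

Under these assumptions a margin of 0.1 gives zero power at 50 or 500 participants per group, and only reaches conventional power levels at around 2000 per group, which is the sample size a power analysis would have pointed to.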

3. Equivalence margins larger than this come close to what Cohen labeled medium-sized effects and are in most contexts unreasonable large to demarcate equivalence

Research on minimal clinically important differences suggests that on an individual level effects of d = 0.5 or smaller are typically too small to be subjectively noticed as an improvement. As this critical design choice lacks any cited justification or discussion of the literature, the authors might want to look into the literature on minimal clinically relevant effects. Also, for such larger effect sizes, it would be interesting to see whether the criticism that Bayes factors are biased towards the null leads to unacceptably high Type 1 error rates for an equivalence test.


7. Apr 2020
8. psyarxiv.com
1. positive evidence for successful replicatio

But the idea by Simonsohn is to combine an equivalence test with NHST - so this is not a real difference.

The small telescopes approach uses the effect size the original study had 33% power to detect; Lakens et al. (2018) suggest using what could have been detected in the original study (typically what the study had 50% power for). This is a very liberal criterion - what is the justification?
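The two criteria can be compared numerically. The sketch below assumes a two-group design analyzed with a two-sided z-test with known unit variance and ignores the negligible rejection probability in the opposite tail; the function name and the original sample size of 20 per group are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def detectable_d(power, n_per_group, alpha=0.05):
    """Standardized effect size a two-group study can detect with the given
    power (z-approximation; the opposite rejection tail is ignored)."""
    std_norm = NormalDist()
    z_crit = std_norm.inv_cdf(1 - alpha / 2)
    z_power = std_norm.inv_cdf(power)
    return (z_crit + z_power) / sqrt(n_per_group / 2)


n_original = 20  # hypothetical per-group sample size of the original study
d33 = detectable_d(0.33, n_original)  # small telescopes criterion
d50 = detectable_d(0.50, n_original)  # effect that is just significant

print(f"d with 33% power: {d33:.2f}")
print(f"d with 50% power: {d50:.2f}")
```

Both values are determined entirely by the original sample size, so under these assumptions either criterion only rescales whatever sample the original researchers happened to collect.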

3. theoretically

If the value is based on the sample size used by the original researchers, there is no good reason to call this 'theoretically interesting' - sample sizes are chosen based on feasibility, cost, and time - and only then to a certain extent based on what is expected.


9. Mar 2019
10. psyarxiv.com
1. Abstrac

In his commentary, Alex Holcombe makes the argument that only ‘one or two exemplars of a color category’ are typically examined in color studies, and this is problematic because a color such as ‘red’ is a category, not a single hue.

Although in some fields it is very important to examine a range of stimuli, and in general examining the generalizability of findings has an important place in research lines, I do not think that currently this issue is a pressing concern in color psychology. Small variations in hue and brightness naturally occur in online studies, and these are assumed not to matter for the underlying mechanism. Schietecat, Lakens, IJsselsteijn, and De Kort (2018) write: “In addition, we conducted Experiments 1 and 3 in a laboratory environment, but Experiments 2, 4, and 5 were conducted in participants’ homes with an internet-based method. Therefore, we could not be completely sure that the presentation of the stimuli on their personal computers was identical for every participant in those experiments. However, we expected that the impact of these variations on our results is not substantial. The labels of the IAT (i.e., red vs blue) increased the salience of the relevant hue dimension, and we do not expect our results to hold for very specific hues, but for colors that are broadly categorized as red, blue, and green. The similar associative patterns across Experiments 2 and 3 seem to support this expectation.”

We wrote this because there is nothing specific about the hue that is expected to drive the effects in association-based accounts of psychological effects of colors. If the color ‘red’ is associated with specific concepts (and the work by Schietecat et al. supports the idea that red can activate associations related to either activity or evaluation, such as aggression or enthusiasm, depending on the context), then the crucial role of the stimulus is to activate the association with ‘red’, not the perceptual stimulation of the eye in any specific way. The critical manipulation check would thus be whether people categorize a stimulus as ‘red’. As long as this is satisfied, we can assume the concept ‘red’ is activated, which can then activate related associations, depending on the context.

Obviously, the author is correct that there are benefits in testing multiple variations of the color ‘red’ to demonstrate the generalizability of observed effects. However, I fear the author is writing too much as a perception researcher. If there is a strong theoretical reason to assume slightly different hues and chromas will not matter (because as long as a color is recognized as ‘red’ it will activate specific associations) the research priority of varying colors is much lower than in other fields (e.g., research on human faces) where it is more plausible that the specifics of the stimuli matter. A similar argument holds for the question whether “any link is specifically to red, rather than extending to green, yellow, purple, and brown”. This is too a-theoretical, and even though not all color research has been replicable, and many studies suffered from problems identified during the replication crisis, the theoretical models are still plausible, and specific to predictions about certain hues. We know quite a lot about color associations for prototypical colors in terms of their associations with valence and activity (e.g., Russell & Mehrabian, 1977) and this can be used to make more specific predictions than to a-theoretically test the entire color spectrum.

Indeed, across the literature many slightly different variations of red are used, and some studies (e.g., Schietecat et al., 2018) have been performed online, where different computer screens will naturally lead to some variation in the exact colors presented. This doesn’t mean that more dedicated exploration of the boundaries of these effects cannot be worthwhile in the future. But currently, the literature is more focused on examining whether these effects are reliable to begin with, and on explaining basic questions about their context dependency, than on testing the range of hues for which effects can be observed. So, although in principle it is often true that the generalizability of effects is understudied and deserves more attention, it is not color psychology’s most pressing concern, because we have theoretical predictions about specific colors, and because theoretically as long as a color activates the concept (e.g., ‘red’), the associated concepts that influence subsequent psychological responses are assumed to be activated, irrespective of minor differences in for example hue or brightness.

Daniel Lakens

References

Russell, J. A., & Mehrabian, A. (1977). Evidence for a three-factor theory of emotions. Journal of Research in Personality, 11(3), 273–294. https://doi.org/10.1016/0092-6566(77)90037-X

Schietecat, A. C., Lakens, D., IJsselsteijn, W. A., & Kort, Y. A. W. de. (2018). Predicting Context-dependent Cross-modal Associations with Dimension-specific Polarity Attributions. Part 2: Red and Valence. Collabra: Psychology, 4(1). https://doi.org/10.1525/COLLABRA.126


11. Sep 2018
12. psyarxiv.com
1. This development may amuse those who, like me, have taken on board the lessons of the empirical turn in the study of science. One of those lessons after all was that science does not work the way Karl Popper thought it shoul

It is worth pointing out that the reform movement is in no way guided by simplistic and orthodox Popperian views on science - this transition to a less strict falsificationist view on research has already occurred (see Meehl, 1990), so the author seems to be amused by a straw man. The current reform movement is more diverse, and less single-mindedly focused on Popper, than the author argues here.


13. Aug 2018
14. psyarxiv.com
1. Equivalence Testing and the Second Generation P-Value


15. Jul 2018
16. f1000research.com
1. Amendments from Version 1

For a blog post commenting on an earlier draft of this article (many points raised remain relevant) see here.


17. psyarxiv.com
1. Making ‘Null Effects’ Informative:

We look forward to any comments you might have!


18. daniellakens.blogspot.com
1. The 20% Statistician


19. Apr 2015
20. learnbayes.org
1. confidence

This is closest to what modern CI advocates propose we use, correct? Perhaps you can explicitly mention this.

2. known triangular

It's really too bad you didn't choose an example with a normal distribution.

3. intentionally simple

Really? I doubt the average psychologist finds this simple.

4. statistics

I guess this shows you should always ask an engineer for help, who would just throw down 50 ropes. But ok. ;P

5. probability

So they do not follow the normal distribution, correct? This raises the question in my mind whether this matters or not. Is this criticism relevant when data is normally distributed? If not, please explicitly specify this. If it is, please use normally distributed data, which is closer to what psychologists deal with.

6. one

I guess just dropping down 50 lines at 20 cm distances did not occur to anyone? ;)

7. procedure

And this is not 'precision'?

8. precision of

Perhaps it is important to define 'precision' - I think the term means different things to different people.

9. 349

Again, Cumming & Maillardet (2006) explain this problem the best way, I think. They should be cited here as well.

10. value

There have been many, many papers explaining this to lay people (e.g., Lakens & Evers, 2014). It might be fair to acknowledge this.

11. FCF

You are not defining the mistake people make very clearly - instead, you use citations of statements by other people. Cumming & Maillardet also clearly show that a single 95% CI will, on average, contain the mean of a replication study only 83.4% of the time. It seems you try to judge Cumming on a common language statement, while he clearly would never make the FCF.
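The 83.4% figure can be reproduced with a short calculation. As I read Cumming & Maillardet (2006), it is the average probability that a 95% CI from one study captures the mean of a replication with the same sample size. The sketch assumes a z-based interval with known standard deviation, in which case the difference between the two sample means has a variance twice that of a single mean; the function name is my own.

```python
from math import sqrt
from statistics import NormalDist


def capture_percentage(confidence=0.95):
    """Average probability that a z-based CI from one study contains the
    mean of an equally sized replication (known standard deviation)."""
    std_norm = NormalDist()
    z = std_norm.inv_cdf(1 - (1 - confidence) / 2)
    # Var(mean_2 - mean_1) is twice the variance of one mean, so the CI
    # half-width corresponds to z / sqrt(2) standard deviations of the
    # difference between the two means.
    return 2 * std_norm.cdf(z / sqrt(2)) - 1


cp = capture_percentage()
print(f"Capture percentage of a 95% CI: {cp:.3f}")
```

Under this assumption the result is roughly 83.4%, which is a statement about replication means across pairs of studies and not about any single interval containing the true parameter.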

12. understanding

To be clear: you mean all the three interpretations you specify above are not correct, right?

13. CP

In the CI context, the abbreviation CP reminds me strongly of a Capture Probability (or Capture Percentage) of a single CI (see Cumming & Maillardet, 2006). Perhaps better not to abbreviate?

14. such

Which considerations?

15. suggest

Please start this paragraph by summarizing in one sentence how modern proponents suggest I use CI.

16. may

What exactly do you mean? Should not be used? You have not yet explained how modern proponents suggest we use CI, so I cannot understand this main point at this time.

17. o

Capital C