- Apr 2018
-
high-powered
A study is referred to as high-powered if the size of the sample from which data is collected is large enough that it becomes highly probable (at least 80% probability) that an effect of interest that exists in the population would actually be found in this data.
For example, let’s say we were interested in finding out whether cupcake consumption increases well-being. Because we cannot ask every person on the planet to please report their well-being, eat a cupcake, and then report their well-being again, we have to restrict our investigation to a certain sample of people.
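To get a feel for how power relates to sample size, here is a minimal R sketch using base R's power.t.test(); the numbers (a medium-sized effect of Cohen's d = 0.5 and a two-group comparison) are illustrative assumptions, not values from the paper.

```r
# How many people per group would we need so that a true, medium-sized
# difference in well-being (Cohen's d = 0.5) is detected with 80% probability
# at the usual 0.05 significance level? (Illustrative numbers only.)
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)
# The reported "n" (about 64 per group) is the sample size needed for this
# two-group study to count as high-powered for an effect of that size.
```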
-
- Jan 2018
-
experimental
A study is referred to as experimental if it contains random allocation of participants to experimental conditions or treatments in which a variable of interest is manipulated. Such experiments can allow claims that the manipulation has caused changes in outcomes.
For example, if we wanted to study the influence of rewards during class on students’ biology exam scores in an experimental study, we would randomly assign students to two conditions: In condition 1, students would receive candy bars for active participation in class, whereas in condition 2, students would not receive any candy bars.
Then we would observe the exam scores for each group of students, to judge if our candy-bar treatment improved the scores compared to the no-candy-bar control condition. We could then conclude whether rewards cause better exam scores in this context.
-
- Oct 2017
-
innovation versus verification
Innovation refers to coming up with new ideas for research—in other words, generating new hypotheses. Verification refers to checking if a certain idea holds up in subsequent research—in other words, confirming hypotheses.
-
preregistration
A preregistration is a document in which researchers compile information on how their study will be run and analyzed before it is conducted. The document often contains information on which research question will be pursued; which hypothesis will be tested; how the data will be collected and how the sample will be generated; which data will be excluded; and how the data will be prepared for analysis and ultimately analyzed. Documenting these decisions in advance helps separate confirmatory hypothesis testing from exploratory research.
-
(such as experience and expertise)
The expertise of researchers conducting the replication attempts has been the topic of much debate.
In a recent study, Protzko and Schooler have questioned whether researchers' "caliber" influences their success in reproducing studies.
Read more in PsyArXiv Preprints: https://osf.io/preprints/psyarxiv/4vzfs/
-
repeated measurement designs
A repeated measurement design assesses the same outcome variable at several points in time. For example, let’s say we want to find out whether jogging before class improves students’ ability to follow a class. We might ask 20 students to jog before class and 20 students not to jog before class, and then after class ask them how easy it was for them to follow the class. However, we might be unlucky and conduct our experiment on a day where a particularly difficult topic was covered in class. No one—neither the joggers nor the nonjoggers—could understand the lecture, so all our subjects report they absolutely couldn’t follow the class.
This problem could be ameliorated if we used a repeated measurement design instead. We would ask our 20 joggers and 20 nonjoggers to either jog or not jog before class on five days in a row, and then ask them each time how well they could follow the class. Now we would have not just one measurement from each student, but five measurements of their ability to follow the class at several points in time.
-
within-subjects manipulations
Within-subjects manipulations refer to situations in experiments where the same person is assigned to multiple experimental conditions.
For example, let’s say we want to find out which of two different learning techniques (A and B) is more effective in helping students prepare for a vocabulary test. In a within-subjects manipulation, each student would apply both learning techniques: every student might first apply learning technique A, then take a vocabulary test, and then a week later apply learning technique B for the next test. We could then compare the two test scores to see which learning technique led to better performance.
In contrast, in a between-subjects manipulation, each student would apply only one learning technique. We would split the group of students, so that half of them use learning technique A and then take the vocabulary test, while the other students use learning technique B and then take the vocabulary test. Again, we could compare the two groups to see which learning technique led to better performance.
-
fixed-effect model
A fixed-effect model is a statistical model that accounts for individual differences in the data that cannot be measured directly, by treating them as nonrandom, or “fixed,” at the individual level.
As an example, let’s say we wanted to study if drinking coffee makes people more likely to cross the street despite a red light. Our outcome variable of interest is how often each subject crosses a street despite a red light on a walk with 10 red traffic lights. The explanatory variable we manipulate for each participant is whether they had a cup of coffee before the experiment or a glass of water (our control condition), and we would use this variable to try to explain ignoring red lights. However, there are several other influences on ignoring red lights that we have not accounted for. Besides random and systematic error, we have also not accounted for individual characteristics of the person, such as their previous experience with ignoring red lights.
For instance, have the participants received a fine for this offense? If so, they might be less likely to walk across a red light in our experiment. Using a fixed-effect model makes it possible to account for these types of characteristics that rest within each individual participant. This, in turn, gives us a better estimate of the relationship between coffee drinking and crossing red lights, free of other individual-level influences.
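Here is a minimal simulated R sketch of the idea, not the authors' analysis; it assumes each participant takes several walks under both the coffee and the water condition, so that person-level fixed effects can be separated from the coffee effect.

```r
# Simulated data, illustrative only: 30 participants, 4 walks each,
# alternating between water (0) and coffee (1) before the walk.
set.seed(1)
n_people <- 30
habit <- rnorm(n_people)                        # unmeasured individual tendency
dat <- data.frame(
  person = factor(rep(1:n_people, each = 4)),
  coffee = rep(c(0, 1, 0, 1), times = n_people)
)
dat$crossings <- 2 + habit[as.integer(dat$person)] +
  0.5 * dat$coffee + rnorm(nrow(dat), sd = 0.5)

# Adding person as a factor gives each participant their own ("fixed")
# intercept, absorbing stable individual differences such as prior habits.
fixed_model <- lm(crossings ~ coffee + person, data = dat)
summary(fixed_model)$coefficients["coffee", ]
```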
-
Also, the replication “succeeds” when the result is near zero but not estimated with sufficiently high precision to be distinguished from the original effect size.
Here, the authors describe a problem with judging replication success by evaluating the replication effect against the original effect size. When the replication effect size is near zero, the data may in fact show no effect, which would suggest an unsuccessful replication attempt. However, the estimation of the effect size could be imprecise: there could be a lot of “noise” in the data, from random or systematic errors in the measurement. If the estimate is very imprecise, its range of plausible values can be so wide that it still includes the original effect size even though the estimate itself is near zero. In that case the replication would be counted as a success under this criterion, although the evidence is actually consistent with a true effect of zero, meaning that the replication could be falsely taken as a success.
-
A key weakness of this method is that it treats the 0.05 threshold as a bright-line criterion between replication success and failure (28).
Braver, Thoemmes, and Rosenthal (28) argue that judging the success of a replication only by whether it shows a significant effect (in the current study, at the 0.05 threshold) would be inappropriate. They argue that replication success depends a lot on the statistical power and therefore on the sample size used in the replication study. The replication study must have sufficiently many subjects so that it is probable enough that the effect in question, should it really exist in the population, can be found in this sample. If a replication study had low power, for example because the size of the original effect was overestimated and the replication sample size was consequently too small, it is less likely that the replication attempt will be successful and show a result that is statistically significant at the 0.05 threshold. For each individual replication study, replication success therefore depends on the sample size. If you assess several replication attempts individually, the replication success rate could therefore be distorted to underestimate how reproducible an effect really is.
To circumvent this problem, the authors suggest using a different technique than counting whether individual replications were significant at the 0.05 threshold. Their analysis is called “continuously accumulating meta-analysis.” The data of several replication attempts are combined, so that conclusions can be drawn about whether the data from all replication attempts, taken together, support the effect of interest. After a new replication attempt is conducted, its data is added to the pool of data from previous replication attempts, and a test is then run on the combined data to estimate the effect of interest.
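The core idea can be sketched with simple inverse-variance pooling in R; the effect estimates and standard errors below are made-up numbers, and this is only an illustration of accumulating evidence, not the specific procedure from reference (28).

```r
# Combine the effect estimates of several replication attempts, re-estimating
# the pooled effect each time a new attempt is added (made-up numbers).
effects <- c(0.42, 0.15, 0.08, 0.30)   # effect estimates from 4 replications
ses     <- c(0.20, 0.18, 0.15, 0.22)   # their standard errors
w <- 1 / ses^2                         # inverse-variance weights
for (k in seq_along(effects)) {
  pooled    <- sum(w[1:k] * effects[1:k]) / sum(w[1:k])
  pooled_se <- sqrt(1 / sum(w[1:k]))
  cat(sprintf("After %d studies: pooled effect = %.2f (SE = %.2f)\n",
              k, pooled, pooled_se))
}
```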
-
multivariate interaction effects
A multivariate interaction effect is an effect in which the influence of one variable on the outcome depends on the level of one or more other variables.
For example, we might be interested in finding out how water temperature (warm: 38°C; cold: 15°C) affects the body temperature of humans and sea lions. We might find that humans, on average, have a higher body temperature than sea lions, and that body temperature is higher when the body is immersed in warm compared to cold water. However, we might find that a human’s body temperature shows bigger differences between the warm and cold water conditions than the sea lion’s body temperature. Because sea lions have a substantial layer of protective fat, their body temperature does not change as much when water temperature changes, compared to humans. Here, species and water temperature show an interaction effect on body temperature.
-
standard error
When experiments are run using a sample instead of the entire population, each sample will show slightly different estimates of the true population parameter. The standard deviation of this range of estimates is called the standard error.
For example, if we wanted to know the average body mass of Chihuahuas, we couldn’t gather data from every single Chihuahua in the world. If we sampled 20 Chihuahuas, we might find that the average is 2.5 kg. If we sample 20 other Chihuahuas, their average weight might be 2.4 kg. Repeating this process, we would find a range of different average weights in the different samples. Taken together, these means are our estimates for the true average Chihuahua body mass in the population of all Chihuahuas. The dispersion, or the amount of variation in these means, is called standard error.
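A minimal R simulation of this idea, with made-up values for the "true" Chihuahua weight distribution: drawing many samples of 20 dogs shows that the spread of the sample means matches the textbook formula sd / sqrt(n).

```r
# Simulated Chihuahua weights (mean 2.5 kg, sd 0.4 kg; illustrative values).
set.seed(42)
sample_means <- replicate(5000, mean(rnorm(20, mean = 2.5, sd = 0.4)))
sd(sample_means)   # empirical standard error of the mean
0.4 / sqrt(20)     # theoretical standard error: sd / sqrt(n)
```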
-
Wilcoxon signed-rank test
The Wilcoxon signed-rank test is a statistical procedure used with two related samples. It assesses the differences between each data pair with regard to both direction and size.
For example, if we wanted to find out if students prefer pasta or salad served in the school cafeteria, we could run an experiment where on three consecutive days, we invite 20 students for lunch and observe how many of them chose pasta and how many chose the salad option. We end up with three pairs of data: On the first day, 18 students chose pasta and two chose salad; on the second day, 15 students chose pasta and five chose salad; on the third day, four students chose pasta and 16 chose salad. The test now calculates the differences between each data pair: On the first day, the difference is 18 – 2 = +16; on the second day, the difference is 15 – 5 = +10; on the third day, the difference is 4 – 16 = −12. Then, the differences are sorted by their absolute size (ignoring the sign: 10, 12, 16) and assigned a rank (10 gets rank 1, 12 gets rank 2, 16 gets rank 3). The sum of the ranks of the positive differences (1 + 3 = 4) is then compared to that of the negative differences (2). The smaller of the two sums of ranks (2) is then compared against a critical value, which informs us whether it is statistically different from zero. If we find a statistically significant result, we can conclude that students have a preference for pasta over salad.
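In R, the test is run with wilcox.test() and paired = TRUE. The sketch below uses made-up data and extends the cafeteria example to five observation days so the test has enough pairs to work with; the numbers are illustrative assumptions.

```r
# Paired counts of pasta vs. salad choices on five observation days
# (made-up data); the test works on the day-by-day differences.
pasta <- c(18, 15, 4, 17, 12)
salad <- c( 2,  5, 16,  3,  8)
wilcox.test(pasta, salad, paired = TRUE)
```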
-
df
Df is an abbreviation for the term “degrees of freedom.” The degrees of freedom are an important piece of information for a statistical test, describing the number of values in the analysis that are free to vary. It depends on how many values are considered (that is, how big the sample size is) and which statistical test is used.
-
exploratory analyses
An exploratory analysis is conducted in the absence of a specific hypothesis to be confirmed with the study. Such analyses are used to explore the data; that is, to see what patterns can be found, without trying to prove a specific point.
-
generalizability
When we conduct a scientific study, it is often not possible to collect data from every person in the population in the exact situation we want to study. Instead, we often have only a sample of subjects, which we observe in a certain, typical situation. For example, if we want to study adherence to red lights in traffic, we cannot check if every human being will stop at every red light, when driving cars, riding a bike, walking, skateboarding, or using any other means of transportation. We could, however, test 200 pedestrians’ behavior at the traffic light in front of a university.
Generalizability refers to whether a study’s findings, given its own restricted circumstances, can be extended to make statements about what will be true for the population in general, and for similar situations. For example, imagine we want to study adherence to red lights in traffic by observing 200 pedestrians’ behavior at the traffic light in front of a university. Given that our sample size is small and not representative (because there are mostly students in front of a university, a very specific sample of people), and that the situation we observe is only one facet of participation in traffic (we ignore driving, cycling, skateboarding, etc.), we could not make very good statements about adherence to red lights in general.
-
predictors
A predictor (sometimes also called a predictor variable or an independent variable) is a variable that represents the potential reasons why we see a certain result.
For example, if we wanted to study which factors increase students’ performance in their final exams, we could consider a number of different potential reasons, or predictors, such as how often they did their homework during the past school year, how much time they spent reviewing the materials before the exam, or how well they slept the night before the exam.
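As a sketch of how predictors enter an analysis, here is a small simulated R example in which exam scores are modeled as a function of three candidate predictors; the variable names and numbers are hypothetical.

```r
# Simulated data: 50 students with three candidate predictors (made-up values).
set.seed(7)
students <- data.frame(
  homework_done  = rbinom(50, size = 30, prob = 0.7),  # homeworks completed
  hours_reviewed = runif(50, 0, 40),
  sleep_quality  = runif(50, 1, 10)
)
students$exam_score <- 40 + 0.5 * students$homework_done +
  0.4 * students$hours_reviewed + 1.5 * students$sleep_quality +
  rnorm(50, sd = 5)

# Each predictor gets its own coefficient, estimating its relationship with
# the exam score while holding the other predictors constant.
summary(lm(exam_score ~ homework_done + hours_reviewed + sleep_quality,
           data = students))
```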
-
random or systematic error
There are two sources of error which can occur in scientific studies and distort their results.
Systematic errors are inaccuracies that can be reproduced. For example, imagine we wanted to measure a participant’s weight and we make our participant step on five different scales and measure her weight on each scale 10 times. Four scales report that she weighs 74 kg each time she steps on them. The last scale shows that she weighs 23 kg each time she steps on it. We would say there is a systematic error involved in our study of her weight, because the last scale consistently and erroneously reports her weight as too low.
Random errors are inaccuracies that occur because there are unknown influences in the environment. For example, imagine we wanted to measure a participant’s weight and had her step on the same scale three times in a row, within one minute. The first time, the scale reports 74.43 kg, the second time 74.34 kg, the third time 74.38 kg. We don’t think that the participant's weight has actually changed in this 1 minute, yet our measurement shows different results, which we would attribute to random errors.
-
correlational
A study is referred to as correlational if it investigates if there is a relationship between two factors without assigning subjects to conditions manipulating a variable of interest. A causal interpretation (that changes in factor A cause changes in factor B) is not possible in correlational studies.
For example, if we wanted to study the influence of intelligence on students’ biology exam scores in a correlational study, we would first observe students’ intelligence via an IQ test, and then measure their score in the exam. Then we could judge if there was a positive relationship between IQ and exam score: Smarter students might be shown to score better on the test. However, since we did not manipulate students’ IQ to be high or low, we could not say that a higher IQ causes better test scores, only that the two variables are positively related.
-
- Jul 2017
-
confidence intervals (CIs)
When studies are run, we aim at estimating values that are true for the population. However, we often cannot record data from everyone in the population, which is why we rely on drawing a random sample from the population. For example, while we may want to estimate the average difference in height between all men and all women in the world, we cannot possibly measure the height of all men and women in the world. Therefore, we draw a random sample of men and women. Let's say we collect data from 100 men and 100 women. The study reveals the average difference in height we find in this sample of 200 people, but it does not tell us what the true difference in height in the population of all men and women in the world is.
If we drew random samples of 200 people from the population of all men and women in the world again and again and again, and assessed their average difference in height each time, we would find a range of values. This range of values represents our estimates for the height difference in the population of all men and women in the world.
We refer to this range of values (interval) as the confidence interval. We want to make sure that it includes the true value of the variable we are estimating for the population sufficiently often. A 95% confidence interval (“CI”) means that, if we repeated the sampling procedure many times and calculated an interval each time, about 95% of those intervals would contain the true population value.
A CI calculated from a single study should be read in the same way: the procedure used to construct it captures the true population value in 95% of repeated samples; it does not mean that there is a 95% probability that the true value lies in this particular interval.
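A minimal R sketch with simulated heights (the means and standard deviations are made-up values, not real anthropometric data):

```r
# 95% confidence interval for the average height difference between a sample
# of 100 men and 100 women (simulated data, illustrative values in cm).
set.seed(3)
men   <- rnorm(100, mean = 178, sd = 7)
women <- rnorm(100, mean = 165, sd = 7)
t.test(men, women)$conf.int
# If this sampling were repeated many times, about 95% of the intervals
# produced this way would contain the true population difference.
```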
-
cumulative process
The term “cumulative process” here refers to taking an approach to research in which we try to gain insight not by interpreting strongly the results of one individual study at a time, but by integrating the results of several studies and broader research programs to gain an overview of the overall evidence.
-
validity
Validity refers to the degree to which a certain result or conclusion in research corresponds with reality. There are different aspects of a study which can improve or decrease its validity. For example, a study has high ecological validity if its results can be directly applied to real-life situations outside of the lab.
-
narrow-and-deep approach
This refers to results of studies that go into detail on a specific area, without covering a wide range of different topics.
-
broad-and-shallow evidence
This refers to results of studies that cover a wide range of different topics, without going into detail on a specific area.
-
upwardly biased effect sizes
Here, upwardly biased means that the effect sizes reported in the literature are distorted to appear bigger than they really are.
-
consistently
When results of several analyses point in the same direction, we say the results are consistent. For example, if we run three correlation analyses and find that enjoyment of hiking, self-assessed nature-lovingness, and number of times previously hiked all correlate positively with the probability that someone enjoys hiking holidays, we would say that the results are consistent. If we found that the number of times previously hiked was negatively correlated with the probability that someone enjoys hiking holidays, the results would be less consistent.
-
pre-analysis plans
A pre-analysis plan is a document that specifies which analyses will be run on the data, before these analyses are performed. This plan can specify which variables and analyses will be used, how data will be prepared for analyses, and in which cases data will be excluded from analyses. This tool helps researchers specify and commit to the way they want to run the analyses in their study.
-
confirmatory tests
A confirmatory test is a statistical analysis of a certain relationship which had previously been hypothesized to hold. The test tries to find out if the hypothesis is supported by the data.
-
publication bias
Publication bias is a type of distortion that can occur when academic research is made public. It is present when findings showing a statistically significant effect of interest are more likely to be published than findings showing no evidence for, or even evidence against, this effect. In this case, if you only read the published papers, you would find many papers showing support for an effect, while studies that do not support the same effect remain unpublished, giving you the impression that the effect is less disputed and more consistently found than it actually is.
-
population effect size
The population effect size is the strength of the effect in the population of all possible subjects (e.g., all humans); individual studies can only estimate it from samples.
-
goodness-of-fit χ2 test
A goodness-of-fit test indicates how well a statistical model fits the data. It shows whether the difference between the observed data and the predicted, expected values is too big, or whether the difference is small enough that we can assume the model captures reality sufficiently well. A goodness-of-fit χ2 (chi-squared) test is a specific type of goodness-of-fit test.
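A minimal R sketch with made-up counts, testing observed lunch choices against an assumed model in which three options are equally popular:

```r
# Do the observed counts deviate from the expected counts under the model
# that pasta, salad, and soup are chosen equally often? (Made-up data.)
observed <- c(pasta = 30, salad = 14, soup = 16)
chisq.test(observed, p = c(1/3, 1/3, 1/3))
# A small p-value indicates the observed counts deviate more from the
# expected counts than the assumed model allows.
```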
-
Spearman’s rank-order correlations
Spearman’s rank-order correlation is a specific type of correlation analysis, which assesses the relationship between two variables with regard to its strength and direction, based on the ranks of the values.
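In R, it is computed with cor.test() and method = "spearman"; the data below are made up for illustration.

```r
# Spearman's rank-order correlation between hours studied and exam scores
# (made-up data): the values are converted to ranks before correlating.
hours  <- c(2, 5, 1, 8, 4, 7, 3)
scores <- c(55, 70, 50, 90, 72, 85, 60)
cor.test(hours, scores, method = "spearman")
```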
-
standardizing
Standardizing refers to a procedure for preparing the data for analysis, in which all data are transformed so that their mean across participants is 0 and their standard deviation is 1.
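In R, this is what scale() does; a minimal sketch with arbitrary numbers:

```r
# Standardize a variable to mean 0 and standard deviation 1 (z-scores).
x <- c(4, 8, 15, 16, 23, 42)
z <- as.numeric(scale(x))      # equivalently: (x - mean(x)) / sd(x)
round(c(mean(z), sd(z)), 10)   # approximately 0 and exactly 1
```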
-
sample size
The sample size refers to the number of people from whom data is collected in a study.
-
R script
An R script is a document written in the programming language R which contains a number of commands that the computer should execute. For this study, all commands necessary to run the analyses reported here are compiled in such a script, which is available online, so that everyone who is interested in them can download the script and rerun all analyses on their own computer.
-
R statistical programming language
R is a programming language and software environment for statistical analyses. To run an analysis, scientists write commands in this language, which the program then executes.
-
accumulated evidence
Accumulated evidence refers to the results of several studies taken together.
-
- Jun 2017
-
Results
The authors used 5 measures for replication success to check to what extent the 100 original studies could be successfully replicated.
-
- Jan 2017
-
Abstract
Video recording of a symposium explaining the motivation and methodology of the reproducibility project, previewing preliminary results and offering discussion points on implications:
-
Such debates are meaningless, however, if the evidence being debated is not reproducible.
An example of a case in which psychologists currently face vivid debates about the replicability of an effect is the ego depletion literature, as explained in this video:
Why an Entire Field of Psychology Is in Trouble (by SciShow)
-
Reproducibility
Introductory video summarizing the Reproducibility Project:
Science 101: The Basics of Reproducibility/Replicability (by Public Domain TV)
-
T. M. Errington et al., An open investigation of the reproducibility of cancer biology research. eLife 3, e04333 (2014). doi: 10.7554/eLife.04333; pmid 25490932
Similarly to the reproducibility project in psychology, Errington and colleagues planned to conduct replication attempts on 50 important papers from the field of cancer biology. While the registered reports are already available online, the replication studies themselves are currently still being conducted.
Read more on eLife: https://elifesciences.org/collections/reproducibility-project-cancer-biology
-
B. A. Nosek, J. R. Spies, M. Motyl, Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspect. Psychol. Sci. 7, 615–631 (2012). doi: 10.1177/1745691612459058; pmid 26168121
Nosek and colleagues argue that scientists are often torn between "getting it right" and "getting it published": while finding out the truth is the ultimate goal of research, more immediately, researchers need to publish their work to be successful in their profession.
A number of practices, such as the establishment of journals emphasizing reports of non-significant results, are argued to be ill suited for improving research practices. To reconcile the two seemingly-at-odds motives, Nosek and colleagues suggest measures such as lowering the bar for publications and emphasizing scientific rigor over novelty, as well as openness and transparency with regard to data and materials.
-
L. K. John, G. Loewenstein, D. Prelec, Measuring the prevalence of questionable research practices with incentives for truth telling. Psychol. Sci. 23, 524–532 (2012). doi: 10.1177/ 0956797611430953; pmid 22508865
John, Loewenstein, and Prelec conducted a survey with over 2,000 psychologists to identify to what extent they used questionable research practices (QRPs). The respondents were encouraged to report their behavior truthfully, as they could increase donations to a charity of their choice by giving more truthful answers.
Results showed that a high number of psychologists admitted to engaging in QRPs: almost 70% of all respondents admitted to not reporting results for all dependent measures, and around 50% admitted to reporting only studies that showed the desired results. Moreover, researchers suspected that their peers also occasionally engaged in such QRPs, but psychologists thought that there was generally no good justification for engaging in them.
-
research community is taking action
An important part of taking action to advance psychological research is establishing an open discussion and dialogue about the directions the field could take. In the course of this movement, several researchers' blogs have become an increasingly popular medium for such debate.
Read more on the topic of reproducibility in Andrew Gelman's Blog: http://andrewgelman.com/?s=reproducibility and in Uri Simonsohn's Blog Data Colada: http://datacolada.org/?s=reproducibility .
-
Moreover, correlational evidence is consistent with the conclusion that variation in the strength of initial evidence (such as original P value) was more predictive of replication success than was variation in the characteristics of the teams conducting the research (such as experience and expertise).
Third, from this study we can conclude that the precautions the authors took to prevent replication success from depending on the team of researchers who conducted the replication study were quite successful: there is no evidence that characteristics of the replication team influenced the outcomes of the replication attempt.
Rather than differences between the replication teams, it was characteristics of the original studies that systematically distinguished successfully replicated from non-replicable studies. Therefore, the fourth conclusion of this paper is that original studies which showed stronger evidence for the effect they investigated were also more likely to be successfully replicated.
-
Meta-analysis combining original and replication effects
Moreover, the authors planned to combine the results of each original and replication study, to show if the cumulative effect size was significantly different from zero. If the overall effect was significantly different from zero, this could be treated as an indication that the effect exists in reality, and that the original or replication did not erroneously pick up on an effect that did not actually exist.
-
hypothesis that this proportion is 0.5
In this case, testing against the null hypothesis that half of the replication effects are stronger than the original study effects means assuming that any difference between the effect sizes is due to chance alone. The alternative hypothesis is that the replication effects are, on average, stronger or weaker than the original study effects.
-
Fisher’s method
Fisher's method is a statistical procedure for conducting meta-analyses, in which the results of all included studies are combined. The procedure combines the p-values of the individual studies and allows inferences on whether the null hypothesis (that there are, in fact, no effects) holds.
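A minimal R sketch of the calculation, assuming the combined studies are independent; the p-values below are made up.

```r
# Fisher's method: combine the p-values of several independent studies into
# one test of the joint null hypothesis that no effect exists (made-up values).
p  <- c(0.08, 0.20, 0.01, 0.15)
X2 <- -2 * sum(log(p))                               # test statistic
pchisq(X2, df = 2 * length(p), lower.tail = FALSE)   # combined p-value
```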
-
subjective assessments of replication outcomes
One of the indicators for whether a study was replicated successfully or not was a subjective rating: each team of replicators was asked if their study replicated the original effect (yes or no).
-
sampling frame and selection process
The authors wanted to make sure that the studies selected for replication would be representative of psychological research; that is, they would give a good picture of the kinds of studies psychologists typically run. Representativeness was important because it would mean that the conclusions drawn from the replication outcomes could be cautiously extended to assumptions about the state of the field overall.
At the same time, they had to make sure that the studies selected could also be conducted (that is, that one of the coauthors had the necessary skill or equipment to collect the data).
To achieve this goal, a step-wise procedure was used: starting from the first issue of 2008 of three important psychology journals, 20 studies were selected and matched with a team of replicators who would conduct the replication attempt. If articles were left over because no one could conduct the replication, but more replication teams were willing to conduct a study, another 10 articles were made available. In the end, out of 488 studies drawn from the population of studies, replication was attempted for 100 studies.
-
transparency
Transparency here means that the process by which a specific result was achieved is made as accessible to other researchers as possible, by explaining publicly, and in detail, everything that was done in a study to arrive at that result.
-
11
Prinz and colleagues comment on their experience as employees of a pharmaceutical company, which relies on preclinical research to decide whether to invest into the exploration and development of new drugs. Because companies find many preclinical research findings unreliable, they now often conduct their own research to reproduce the original findings before they decide to move on and invest large sums of money into the actual drug development. Only in about 20% to 25% of the cases did the company scientists report finding results of the reproduction that were in line with the originally reported findings.
-
10
Begley and Ellis are cancer researchers, who propose ways for research methods, publication practices and incentives for researchers to change so that research would yield more reliable results, such as more effective drugs and treatments. They argue that often new drugs and treatments enter clinical trials, which test their effectiveness to treat cancer in humans, before they reach sufficient standards in preclinical testing, leading to non-reproducible findings. To achieve more reliable preclinical results, they argue that more focus should be placed on reproducing promising findings in the preclinical phase.
-
8)
Schmidt argues that, although replication is critical for scientific progress, little systematic thought had been applied to how to go about replications.
He suggests to differentiate direct replication (the repetition of an experimental procedure) and conceptual replication (the repeated test of a hypothesis or result using different methods).
Moreover, he summarizes five main functions that replications serve: to control for sampling error, artifacts, or fraud; to extend results to larger or different populations; and to check the assumptions earlier experiments made.
Schmidt concludes that, although a scientific necessity, replications can be practically difficult to conduct, in particular because this type of work is not always easy to publish or highly regarded. He therefore recommends that studies addressing novel research questions could also include elements of replication of previous findings.
-
or other disciplines
Camerer and colleagues conducted a project aimed at evaluating the reproducibility of studies in experimental economics, using a somewhat different methodology.
Read more in Science: http://science.sciencemag.org/content/351/6280/1433.full.pdf+html
-
meta-analyses
Meta-analyses integrate the results of multiple studies to draw overall conclusions on the evidence.
-
P value
A p-value is the probability of obtaining a result at least as extreme as the one observed if the effect did not exist in reality (that is, if the null hypothesis were true). A result whose p-value falls below a chosen threshold (commonly 0.05) is considered statistically significant and compelling evidence, because such a result would be unlikely to manifest in the data if the effect did not exist.
-
eye tracking machines
Eye tracking machines are devices that can record eye-movements and make it possible to show what information people look at without asking them explicitly what they are attending to.
-
social psychology
Social psychology is a subdiscipline of psychology that studies how people interact with their social environment, and how their thoughts and behaviors are affected by others.
-
correlation coefficient (r)
A correlation coefficient describes the linear interdependence of two variables. It shows both the direction (positive coefficient: as A increases, B increases as well; negative coefficient: as A increases, B decreases) and the strength of the relationship (coefficient close to zero: weak relationship; coefficient close to +/- 1: strong relationship).
For example, there might be a positive correlation between years of attendance to school and crystallized intelligence: with increasing school attendance, people could acquire more knowledge. On the other hand, there could be a negative correlation between age and fluid intelligence: with increasing age, people could get worse at solving problems in new situations.
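A small simulated R sketch of the two examples (made-up numbers, only to show the sign of the coefficients):

```r
# Simulated data: a positive and a negative correlation (illustrative values).
set.seed(11)
schooling <- rnorm(200, mean = 12, sd = 2)
knowledge <- 10 + 2 * schooling + rnorm(200, sd = 4)
cor(schooling, knowledge)   # positive: more schooling, more acquired knowledge

age     <- runif(200, 20, 80)
fluidIQ <- 120 - 0.4 * age + rnorm(200, sd = 8)
cor(age, fluidIQ)           # negative: higher age, lower fluid-intelligence score
```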
-
9
Ioannidis conducted simulations to show that for most studies, it is more likely for a finding to be a false positive than true identification of an effect. Among the factors that make it more likely for research findings to be false are a small size of the sample or the underlying effect, and when designs, definitions and analyses are more flexible rather than rigorously objective.
-
13)
In this editorial, Pashler and Wagenmakers argue that doubts about the reproducibility of findings in psychology became increasingly critical after events such as the fraud case of Stapel in 2011, where fabricated and manipulated data resulted in numerous retractions of journal articles, or the debate around findings published by Bem in 2011, where claims that people had an ability to foresee the future were shown not to be replicable. The suspicion that researchers engaged in "questionable research practices" (QRPs) turned out to be more justified than the field had hoped, for instance based on survey findings that many psychologists admitted to engaging in some of these QRPs.
-
sufficient
Sufficient conditions are the circumstances that are enough to find a specific effect. If these conditions are not met, the effect could still be found in another way.
For example, to find the effect that people can be manipulated to be more prosocial, it is sufficient to ask participants to think about a time when someone was generous to them, and then ask them to make a donation. This circumstance could be enough to make people more prosocial (it would therefore be sufficient), but you could think of other circumstances that could achieve the same result.
-
necessary
Necessary conditions are the circumstances that must be met in order to find a specific effect. If these conditions are not met, the effect cannot be found.
For example, to find the effect that prosocial people are more likely to give change to a beggar, a necessary condition would be studying human subjects, not penguins.
-
moderate
In statistics, moderation refers to the dependence of the relationship between two variables on a third variable.
For example, the positive relationship between socioeconomic status and health (the higher one's status, the better one's health) could be moderated by one's sense of control: people in low income groups with high sense of control might show health levels comparable with people from high-income groups, whereas people in low income groups with low sense of control have worse health (Lachman & Weaver, 1998).
-
Editor's Introduction
Reproducibility in psychology: How much research holds up when run again?
The field of psychology has seen troubling news of researchers faking data or of questionable findings that wouldn't hold up when other researchers tried to run the studies again. A critical mass of such news was reached, and some argued that psychology was in crisis, that researchers were pushed to prioritize getting published over investigating the truth, and that faulty research reports could hinder the progress of psychological science. But how do you estimate whether the bad news came from only a few black sheep, and whether the field of psychology overall still does good work?
-
Reproducibility
Reproducibility is the feature of an experiment that speaks to whether it can be run again and whether the same results as in previous runs of this experiment can be found. If an experiment has been reproduced successfully, it has been conducted more than once, and the overall evidence suggests that the original findings hold in the reproducing studies (also referred to as replication studies, or replications) as well.
-
D. Fanelli, “Positive” results increase down the hierarchy of the sciences. PLOS ONE 5, e10068 (2010). doi: 10.1371/journal. pone.0010068; pmid 20383332
Fanelli assessed more than 2,000 papers from different scientific disciplines and found that the proportion of studies reporting support for their hypotheses was higher in disciplines such as psychology or economics than in disciplines such as space science. It is concluded that both the type of hypotheses tested and the rigor applied in these tests differ between fields.
-
K. S. Button et al., Power failure: Why small sample size undermines the reliability of neuroscience. Nat. Rev. Neurosci. 14, 365–376 (2013). doi: 10.1038/nrn3475; pmid 23571845
Button and colleagues study the average statistical power of studies in neuroscience and conclude that it is low. They highlight that low power does not only mean that studies have a lower chance of detecting a true effect, but also that a significant effect found in a low-powered study is less likely to reflect a true effect. They argue that using studies with low power may seem efficient at first glance, because less money is spent on subjects, but that low-powered studies are in fact inefficient in the long run, because future research could be building on an erroneous line of investigation.
-
J. K. Hartshorne, A. Schachner, Tracking replicability as a method of post-publication open evaluation. Front. Comput. Neurosci. 6, 8 (2012). doi: 10.3389/fncom.2012.00008; pmid 22403538
Hartshorne and Schachner suggest that replication success should be traced in a database connecting replication attempts with original studies. Based on this information, a replication success score could be computed, which could be used as a criterion for a journal's quality alongside other indicators such as citation counts.
-
G. S. Howard et al., Do research literatures give correct answers? Rev. Gen. Psychol. 13, 116–121 (2009). doi 10.1037/a0015468
Howard and colleagues examine how the file drawer problem affects a research literature. They compare "old," existing bodies of literature that could be suffering from the file drawer problem (i.e., which could include the few studies where an effect was found, while studies yielding non-significant results on the same effect were never published) with a newly constructed body of literature guaranteed to be free of the file drawer problem, which they achieved by conducting new studies. This investigation suggests that some bodies of literature are relatively file-drawer-free, while other bodies of literature raise concern and call for further studies on the effects they include.
-
A. G. Greenwald, Consequences of prejudice against the null hypothesis. Psychol. Bull. 82, 1–20 (1975). doi 10.1037/h0076157
Greenwald examines how research practices discriminate against accepting the null hypothesis (that an effect does not exist). Using a simulation, he suggests that too few publications accept the null hypothesis, and that the proportion of publications which falsely reject the null hypothesis although it would have been true is high.
Greenwald further debunks traditional arguments for why a null hypothesis should not be accepted, and suggests ways to improve research practices so that accepting the null hypothesis becomes more acceptable.
-
Many Labs replication projects (32)
Many Labs replication projects are studies in which multiple labs attempt to replicate the same effect. In this example, 36 teams of researchers from different countries attempted to replicate the same 13 effects, with more than 6000 participants.
The data revealed that 10 effects could consistently be replicated, while one effect showed only weak support for replication and two effects could not be replicated successfully.
-
J. P. Simmons, L. D. Nelson, U. Simonsohn, False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol. Sci. 22, 1359–1366 (2011). doi: 10.1177/0956797611417632; pmid 22006061
Simmons and colleagues conduct computer simulations and two experiments that show how significant results can easily be achieved for a research hypothesis that is false. They show that flexibility - or as they call it: researcher degrees of freedom - in four areas contributes to making it more likely to find significant effects for a false hypothesis:
- Flexibility in choosing the dependent variables reported: When researchers flexibly analyze two related dependent variables, this already almost doubles the probability of finding a positive result for a false hypothesis.
- Flexibility in choosing the sample size: When researchers stop data collection, find no significant result, and collect additional data before checking for the same effect, this increases the probability of finding a positive result for a false hypothesis by 50%.
- Flexibility in the use of additional variables included in the analyses: When researchers include additional variables in the analyses, false positive rates more than double.
- Flexibility in the number of experimental conditions reported: When researchers collect data in three experimental conditions and flexibly decide whether to report the result of comparisons between any two conditions or all three, this more than doubles the false positive rate.
If researchers used research practices where they used all four flexibilities, they would, overall, be more likely than not to find positive results although the underlying hypothesis was indeed false.
-
R. Rosenthal, The file drawer problem and tolerance for null results. Psychol. Bull. 86, 638–641 (1979). doi: 10.1037/0033- 2909.86.3.638
Rosenthal addresses the 'file drawer problem', a questionable research practice where only studies that showed the desired result would be published and all other studies would land in the 'file drawer' and would not be known to the scientific community.
In the extreme case, this could mean that, if a specific effect did not exist in reality, the 5% of studies that find this effect (due to the accepted rate of statistical error) get published and discussed as if the effect were true, whereas the 95% of studies that do not (and rightly so) find the effect are tucked away in a file drawer. This problem hinders scientific progress, as new studies would build on old, but false, effects.
Rosenthal introduces a way to assess the size of the file drawer problem, the tolerance for future null results: calculating the number of studies with null results that would have to be in a file drawer before the published studies on this effect would be called into question.
-
Transparency and Openness Promotion (TOP) Guidelines (http://cos.io/top) (37)
Nosek and colleagues summarize eight standards for transparency and openness in research, covering citation standards, data transparency, transparency of analytic methods (code), availability of research materials such as participant instructions, transparency of design and analysis, preregistration of studies, preregistration of analysis plans, and replication. They argue that journals should require and enforce adherence to transparency guidelines, and that the submission of replication studies, in particular in the Registered Report format, should be an option.
-
Scientific incentives
Incentives for working in scientific research often differ greatly by country and institution. In the UK, for instance, the allocation of research funding and institutional positions depends on the number of published papers that are rated as highly original, significant, and rigorously conducted.
Read more in The Guardian: https://www.theguardian.com/higher-education-network/2016/oct/17/why-is-so-much-research-dodgy-blame-the-research-excellence-framework
-
procedures that are more challenging to execute may result in less reproducible results
Making an experimental setup transparent and reproducible can be quite difficult, because undetected but theoretically relevant variations to the study protocol could produce different results. However, there are some new ideas about how such transparency could be achieved.
Read more in The New Yorker: http://www.newyorker.com/tech/elements/how-methods-videos-are-making-science-smarter
-
improve the quality and credibility of the scientific literature
Improving the quality and credibility of scientific literature can be accomplished through improving the daily practices involved in the research process. Improved reporting and registering hypotheses and sample sizes are some ideas for such improvements.
Read more in Nature Human Behavior: http://www.nature.com/articles/s41562-016-0021
-
Other investigators may develop alternative indicators to explore further the role of expertise and quality in reproducibility on this open data set.
In a later approach to estimating how researchers assess the reproducibility of science, a large-scale survey was conducted with more than 1500 researchers answering questions such as "Have you failed to reproduce an experiment?".
Read more in Nature: http://www.nature.com/polopoly_fs/1.19970!/menu/main/topColumns/topLeftColumn/pdf/533452a.pdf
-
The present results suggest that there is room to improve reproducibility in psychology. Any temptation to interpret these results as a defeat for psychology, or science more generally, must contend with the fact that this project demonstrates science behaving as it should.
The fifth and final conclusion of the paper concerns the question of what psychologists and other researchers should take from these results for their overall research practices. The conclusion is mixed. On the one hand, the authors recognize that research is a process in which new ideas have to be explored and sometimes might turn out not to be true. Maximum replicability is therefore not desirable, because it would mean that no more innovations are being made. On the other hand, the authors also conclude that there is room for improvement: stronger original evidence and better incentives for replications would put progress in psychological research on a stronger foundation.
-
Nonetheless, collectively, these results offer a clear conclusion: A large portion of replications produced weaker evidence for the original findings
Because there is some uncertainty about how exactly the replication success rate in psychological research should be determined, the authors go about the interpretation of the results of this study very conservatively. This very careful interpretation of the data, and the second conclusion of this study, is that the replication studies yielded largely weaker evidence for the effects studied than the original studies.
-
No single indicator sufficiently describes replication success, and the five indicators examined here are not the only ways to evaluate reproducibility.
The discussion section begins with a cautionary sentence reminding the reader that it is difficult to say exactly how many original studies were successfully replicated. The precise conclusions drawn from this paper depend a lot on which of the 5 measures used to determine replication success you think is the most appropriate. The results of one measure indicate a replication success rate as low as 36%, while another measure suggests a success rate of 68%. Perhaps some researchers would even say that another measure not included in this study would have made it possible to draw more meaningful conclusions. The scientific community has so far not agreed on which measure should be used to evaluate replication success rates.
Moreover, there are other limitations to this approach to studying reproducibility (see paragraph "Implications and limitations"), which make it difficult to generalize the findings of this study to psychological research in general, or even to other disciplines. It is also difficult to evaluate from the findings in this study whether the evidence indicates a specific effect is true or does not exist.
Therefore, the first conclusion of this paper is that all interpretations of the data are only an estimation of how reproducible psychological research is, not an exact answer.
-
Subjective assessment of “Did it replicate?”
Finally, the authors used the subjective rating of whether the effect replicated as an indicator of replication success. Out of 100 replication teams, only 39 reported that they thought they had replicated the original effect.
-
Comparing original and replication effect sizes
With this third measure for replication success, the authors further compared the sizes of the original and replicated effects. They found that the original effect sizes were larger than the replication effect sizes in more than 80% of the cases.
-
Evaluating replication effect against original effect size
In a second way of looking at replication success, the authors checked whether the effect sizes of the original studies were too far off from those of the replication studies. Using this measure, they found that less than half of the replications showed an effect size close enough to the original effect size to count as a successful replication.
-
Evaluating replication effect against null hypothesis of no effect
First, the authors used the 5 measures for replication success to check to what extent the 100 original studies could be successfully replicated.
As a first look at the results, the authors checked how many replications "worked" by analyzing how many replication studies showed a significant effect with the same direction (positive or negative) as the original studies. Of the 100 original studies, 97 showed a significant effect. Because no replication study has a 100% probability of finding a positive result even if the investigated effect is true, at most around 89 successful replications could have been expected even if all original effects were true. However, results showed that only 35 studies were successfully replicated.
-
Analysis of moderators
Last, the authors wanted to know whether successfully replicated studies differed systematically from studies that could not be replicated. For this, they checked whether a number of characteristics of the original studies were systematically related to replication success.
-
Subjective assessment of “Did it replicate?”
Finally, the authors included a last measure for replication success: a subjective rating. All researchers who conducted a replication were asked if they thought their results replicated the original effect successfully. Based on their yes or no answers, subjective replication success would be calculated.
-
“coverage,” or the proportion of study-pairs in which the effect of the original study was in the CI of the effect of the replication study
In this test for replication success, the authors will compare the size of the original study effect and the effect of the replication study to identify if there are indications that they are not too different, so that it is likely that the effect sizes in both samples correspond to the same effect size in the population.
-
Correlates of reproducibility
Finally, the authors wanted to know if successfully replicable studies differed from studies that could not be replicated in a systematic way. As the criterion for replication success, they used their first analysis (significance testing).
They found that studies from the social psychology journal were less likely to replicate than those from the two journals publishing research in cognitive psychology. Moreover, studies were more likely to replicate if the original study reported a lower p-value and a larger effect size, and if the original finding was subjectively judged to be less surprising. However, successfully replicated studies were not judged to be more important for the field, or to have been conducted by original researchers or replicators with higher expertise than failed replications.
-
The last measure for the success of the replications was a subjective rating from the replication teams. Each team was asked if they thought they had replicated the original effect. Out of 100 studies, 39 were judged to be successful replications.
-
Combining original and replication effect sizes for cumulative evidence
Fourth, the authors combined the original and replication effect sizes and calculated a cumulative estimation of the effects. They wanted to see how many of the studies that could be analyzed this way would show an effect that was significantly different from zero if the evidence from the original study and that of the replication study was combined.
Results showed that 68% of the studies analyzed this way indicated that an effect existed. In the remaining 32% of the studies, the effect found in the original study, when combined with the data from the replication study, could no longer be detected.
-
Statistical analyses
Because the large-scale comparison of original and replication studies is a new development in the field of psychology, the authors had to formulate a plan for their analysis that could not rely much on previous research. They decided to use 5 key indicators for evaluating the success of the replications. They will compare the original and the replicated studies in terms of the number of significant outcomes, p-values, and effect sizes, and they will assess how many studies were subjectively judged to replicate the original effect. Finally, they will also run a meta-analysis of the effect sizes.
-
Aggregate data preparation
After each team had completed the replication attempt, independent reviewers checked that their procedure was well documented and followed the initial replication protocol, and that the statistical analyses of the effects selected for replication were correct.
Then, all the data were compiled to conduct analyses not only on the individual studies, but about all replication attempts made. The authors wanted to know if studies that replicated and those that did not replicate would be different. For instance, they investigated if studies that replicated would be more likely to come from one journal than another, or if studies that did not replicate would be more likely to have a higher p-value than studies which could be replicated.
-
constructed a protocol for selecting and conducting high-quality replications
Before collecting data for the replication studies, the authors produced a detailed protocol describing how they would select the studies available for replication, how they would decide which effect within each study to attempt to replicate, and which principles would guide all replication attempts. Importantly, this protocol was made public, and every individual replication attempt had to adhere to it.
-
1–6)
These articles provide an overview of arguments, from the perspective of philosophy of science, that scientific theory and explanation require reproducibility to enable scientific progress.
-
t test for dependent samples
The t-test for dependent samples is a statistical procedure used on paired data: it compares the means of two related sets of measurements, for example the same participants measured twice.
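As a small illustration, the Python sketch below runs a dependent-samples t test on made-up paired values (imagine each study contributing one original and one replication p-value); the numbers are invented and only show how the test is applied.

```python
from scipy import stats

# Made-up paired data: each position pairs an original study with its replication
original    = [0.01, 0.03, 0.04, 0.02, 0.05]
replication = [0.20, 0.04, 0.30, 0.01, 0.45]

# Dependent-samples (paired) t test on the differences within each pair
t_stat, p_value = stats.ttest_rel(original, replication)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```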
-
24
The Open Science Collaboration published its plan for the Reproducibility Project. They announced how they would select the studies to be replicated, basic principles for how the replications would be conducted and how the results would be evaluated, and invited researchers to join the team conducting the replications.
-
7
Nosek and Lakens argue in this editorial that registered reports are a partial solution to the problem that researchers have few incentives to conduct replications. A registered report is an article format in which a proposal for a replication is peer-reviewed before data are collected, and the preregistered report of the replication is published regardless of what the data show.
-
- Oct 2016
-
scienceintheclassroom.org scienceintheclassroom.org
-
z, F, t, and χ2
z, F, t, and χ2 test statistics are values calculated from a sample and compared with what would be expected under the null hypothesis (that no effect exists in reality). The test statistic allows an inference about whether the data let us reject the null hypothesis and conclude that an effect is present.
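The general logic is the same for all of these statistics; the minimal Python sketch below illustrates it for a t statistic, using simulated difference scores. The sample mean is expressed in standard-error units and compared with the t distribution expected if the null hypothesis were true; all numbers are made up.

```python
import numpy as np
from scipy import stats

# Simulated sample of 20 difference scores (e.g., post minus pre measurements)
rng = np.random.default_rng(0)
diffs = rng.normal(loc=0.5, scale=1.0, size=20)

# t statistic: how far the sample mean is from 0, in standard-error units
t_stat = diffs.mean() / (diffs.std(ddof=1) / np.sqrt(len(diffs)))

# Compare with the t distribution expected under the null hypothesis
p_value = 2 * stats.t.sf(abs(t_stat), df=len(diffs) - 1)
print(f"t = {t_stat:.2f}, two-tailed p = {p_value:.3f}")
```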
-
We transformed effect sizes into correlation coefficients whenever possible.
For the third indicator of replication success, the effect sizes of the original and replication studies, the authors calculate correlation coefficients to express effect sizes. In a single study, when the means of two groups are very different relative to their variability, the correlation coefficient will be close to 1, and when the means of the two groups are similar, the correlation coefficient will be close to 0.
The effect size of original studies was always coded as positive (values between 0 and 1). When the effect in the relevant replication study went in the same direction, the effect size was also coded as positive (values between 0 and 1), but when the effect in the replication went in the other direction, the effect size was coded as negative (values between -1 and 0).
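The exact conversion to a correlation depends on the original study's design; for a simple two-group comparison, one standard formula converts Cohen's d to r. The sketch below (with invented d values, not taken from the paper) shows that conversion together with the sign coding described above.

```python
import numpy as np

def d_to_r(d):
    """Convert Cohen's d to a correlation r (assumes two equal-sized groups)."""
    return d / np.sqrt(d**2 + 4)

# Hypothetical pair: original effect d = 0.60; replication effect d = -0.10,
# i.e., the replication went in the opposite direction.
r_original = d_to_r(0.60)       # originals are coded positive by convention
r_replication = d_to_r(-0.10)   # the negative sign marks the reversed direction

print(f"original r = {r_original:.2f}, replication r = {r_replication:.2f}")
```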
-
Cohen’s d
Cohen's d is a measure of the size of an effect, used to report the standardized difference between two means. By convention, it is used to judge whether an effect is roughly small (d ≈ 0.2), medium (d ≈ 0.5), or large (d ≈ 0.8).
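Here is a minimal sketch of how Cohen's d can be computed from the raw scores of two groups using a pooled standard deviation; the exam scores are invented for illustration.

```python
import numpy as np

def cohens_d(group1, group2):
    """Cohen's d: standardized difference between two group means."""
    g1, g2 = np.asarray(group1, float), np.asarray(group2, float)
    n1, n2 = len(g1), len(g2)
    pooled_var = ((n1 - 1) * g1.var(ddof=1) + (n2 - 1) * g2.var(ddof=1)) / (n1 + n2 - 2)
    return (g1.mean() - g2.mean()) / np.sqrt(pooled_var)

# Invented exam scores for a treatment and a control group
treatment = [78, 85, 90, 74, 88, 81]
control   = [70, 75, 80, 72, 77, 74]
print(f"d = {cohens_d(treatment, control):.2f}")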
-
Using only the nonsignificant P values of the replication studies and applying Fisher’s method (26), we tested the hypothesis that these studies had “no evidential value” (the null hypothesis of zero-effect holds for all these studies).
The first analysis run on the data assesses all replication studies that yielded nonsignificant results. By applying Fisher's method, the authors combined the p-values of these studies to test whether, taken together, they still contain evidence that some of the effects exist, or whether they are consistent with the null hypothesis that none of these effects exist in reality.
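Fisher's method simply sums the logarithms of the p-values and compares the result with a chi-square distribution. The sketch below applies it to a set of hypothetical nonsignificant replication p-values (not the project's actual values), both by hand and via SciPy's built-in implementation.

```python
import numpy as np
from scipy import stats

# Hypothetical nonsignificant replication p-values
p_values = [0.20, 0.45, 0.08, 0.60, 0.33]

# Fisher's method: chi2 = -2 * sum(ln p), with 2k degrees of freedom
chi2 = -2 * np.sum(np.log(p_values))
p_combined = stats.chi2.sf(chi2, df=2 * len(p_values))

# The same test via SciPy's built-in implementation
stat, p_scipy = stats.combine_pvalues(p_values, method="fisher")

print(f"chi2 = {chi2:.2f}, combined p = {p_combined:.3f} (scipy: {p_scipy:.3f})")
```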
-
Wilcoxon signed-rank test
The Wilcoxon signed-rank test is a statistical procedure used on paired data. The test computes the differences between paired data points, ranks these differences by size, and takes the direction of each difference into account by retaining its sign (+ or -). This allows an inference about whether the paired measurements differ systematically or not.
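The following short Python example runs the test on made-up paired values (again imagining each pair as an original p-value and its replication p-value); the data are purely illustrative.

```python
from scipy import stats

# Made-up paired p-values: one original and one replication per study
original    = [0.010, 0.030, 0.040, 0.020, 0.050, 0.001]
replication = [0.200, 0.040, 0.300, 0.010, 0.450, 0.080]

# Wilcoxon signed-rank test on the within-pair differences
w_stat, p_value = stats.wilcoxon(original, replication)
print(f"W = {w_stat:.1f}, p = {p_value:.3f}")
```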
-
Second, we compared the central tendency of the distribution of P values of original and replication studies using the Wilcoxon signed-rank test and the t test for dependent samples.
Moreover, the original and the replication studies are compared in terms of the p-values they yield: do the replication p-values tend to be similar to the original p-values, or are they systematically different?
-
central tendency
The central tendency of a distribution is captured by its central, or typical values. Central tendency is usually assessed with means, medians ("middle" value in the data) and modes (most frequent value in the data).
-
nominal data
Nominal data is the simplest form of data, since it implies no natural ordering between values. For instance, consider subject gender (male and female), which is a nominal variable: neither male nor female comes first, and neither male nor female is larger than the other.
-
McNemar test
McNemar's test is a statistical procedure for analyzing data that is measured on a nominal scale and where pairs of data points exist. In this example, we have pairs of data points when we consider that each original study and its replication belong together. The test assesses if the outcomes (proportion of significant vs. non-significant results) are the same in the original and the replication studies.
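To illustrate, the sketch below applies McNemar's test to an invented 2x2 table of paired outcomes (original significant or not vs. replication significant or not); the counts are hypothetical and do not come from the paper.

```python
from statsmodels.stats.contingency_tables import mcnemar

# Invented 2x2 table of paired outcomes:
# rows    = original study (significant, not significant)
# columns = replication study (significant, not significant)
table = [[35, 62],   # original significant
         [ 2,  1]]   # original not significant

result = mcnemar(table, exact=True)
print(f"statistic = {result.statistic}, p = {result.pvalue:.4f}")
```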
-
We tested the hypothesis that the proportions of statistically significant results in the original and replication studies are equal using the McNemar test for paired nominal data and calculated a CI of the reproducibility parameter.
Next, the authors conducted another test to find out whether the proportion of original studies that produced significant results was equal to or different from the proportion of replication studies that produced significant results.
-
confidence interval
A confidence interval is a range of values, calculated from the data, that is constructed so that if the experiment were repeated again and again, a certain proportion of the resulting intervals would contain the true value of the variable of interest. For a 95% confidence interval, that proportion is 95%. Confidence intervals are often referred to with the abbreviation "CI".
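As a small worked example, the sketch below computes a 95% confidence interval for a sample mean using the t distribution; the measurements are invented.

```python
import numpy as np
from scipy import stats

# Invented sample of 12 measurements
sample = np.array([4.1, 5.3, 4.8, 5.0, 4.6, 5.2, 4.9, 5.5, 4.4, 5.1, 4.7, 5.0])

mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))       # standard error of the mean
crit = stats.t.ppf(0.975, df=len(sample) - 1)        # two-sided 95% critical value

ci_low, ci_high = mean - crit * se, mean + crit * se
print(f"mean = {mean:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```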
-
However, original studies that interpreted nonsignificant P values as significant were coded as significant (four cases, all with P values < 0.06).
Here, the authors explain how they deal with the problem that some of the original studies reported results as significant, although in fact, they were non-significant. In each case, the threshold that is customarily set to determine statistical significance (p<0.05) was not met, but all reported p-values fell very close to this threshold (0.06>p>0.05). Since the original authors treated these effects as significant, the current analysis did so as well.
-
two-tailed test
A two-tailed test looks for a hypothesized relationship in two directions, not just one. For example, if we compare the means of two groups, the null hypothesis would be that the means are not different from each other. The alternative hypothesis for a two-tailed test would be that the means are different, regardless if the one is bigger or smaller than the other. For a one-tailed test, one would formulate a more specific alternative hypothesis, for instance that the mean of the first group is bigger than the mean of the second group.
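The made-up example below contrasts the two-tailed p-value of an independent-samples t test with the one-tailed p-value for the directional hypothesis that the first group's mean is larger; the simulated data are for illustration only.

```python
import numpy as np
from scipy import stats

# Simulated scores for two groups
rng = np.random.default_rng(1)
group1 = rng.normal(loc=5.5, scale=1.0, size=30)
group2 = rng.normal(loc=5.0, scale=1.0, size=30)

# Two-tailed test: "the means differ", in either direction
t_stat, p_two_tailed = stats.ttest_ind(group1, group2)

# One-tailed p-value for the directional hypothesis "mean of group1 > mean of group2"
p_one_tailed = stats.t.sf(t_stat, df=len(group1) + len(group2) - 2)

print(f"t = {t_stat:.2f}, two-tailed p = {p_two_tailed:.3f}, one-tailed p = {p_one_tailed:.3f}")
```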
-
cognitive psychology
Cognitive psychology is a subdiscipline of psychology that studies mental processes like perception, problem solving, attention or memory.
-
within-subjects designs
Within-subjects designs vary the predictor in question within each subject: each participant will complete all experimental procedures, in all different conditions. In contrast, between-subjects designs vary the predictor in question between the subjects: each participant completes only one experimental condition.
For example, if a study wanted to test how eating an apple or eating a banana impacts the performance in a subsequent math test, a within-subjects design would have all participants first eat one fruit and complete a test, and then eat the other fruit and complete an equivalent test. A between-subjects design would have half of the participants eat an apple and complete the test, and the other half of the participants eat a banana and complete the test. Some questions are better suited to be studied with a within-subjects design, others are better studied with a between-subjects design.
-
covaries
Covariation indicates how two variables change together, and is the basis needed to calculate a correlation.
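A tiny illustration with invented data: the covariance of two variables measures how they change together, and dividing by their standard deviations yields the correlation.

```python
import numpy as np

# Invented data: hours of sleep and test scores for six students
sleep  = np.array([5, 6, 7, 7, 8, 9])
scores = np.array([60, 65, 70, 72, 78, 85])

covariance = np.cov(sleep, scores)[0, 1]
correlation = np.corrcoef(sleep, scores)[0, 1]
print(f"covariance = {covariance:.2f}, correlation = {correlation:.2f}")
```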
-
predictors
Predictors are variables that could affect an outcome of interest.
-
null hypothesis
The null hypothesis is the assumption that a certain effect does not exist in reality, and that any observation of this effect in the data is due to chance.
-
citation impact
Citation impact refers to the importance of a finding, inferred from how much of the subsequent literature refers to and builds on it by citing the original paper.
-
functional magnetic resonance imaging
Functional magnetic resonance imaging is a procedure that detects the activity of areas in the brain by measuring blood flow.
-
macaques
Macaques are a type of monkey.
-
autism
Autism is a developmental disorder characterized by difficulties with social communication and interaction, as well as restricted and repetitive behaviors.
-
F test
An F-test is a statistical procedure that assesses whether the variances of two distributions are significantly different from each other.
-
t test
A t-test is a statistical procedure that assesses if the means of two distributions are significantly different from each other.
-
a priori
A priori means something was deduced or determined from theoretical considerations, before collecting data.
-
selection biases
Selection bias here refers to systematic error in the way studies are included or excluded in the sample of studies which would be replicated. An unbiased selection would be truly random, such that the sample of studies used for replication would be representative of the population of studies available.
-
false positive
A false positive is a result that erroneously indicates an effect exists: although the data suggests an effect exists, in reality, the effect does not exist.
-
false negative
A false negative is a result that erroneously indicates no effect exists: although the data do not suggest that an effect exists, in reality, this effect does exist.
-
bias
Bias refers to a systematic error or a process that does not give accurate results.
-
effect sizes
The size of an effect allows us to say whether an effect is big or small, compared to other effects.
-
effects
An effect is an observed phenomenon, where differences in one circumstance lead to observable differences in an outcome.
-
statistically significant results
Results are referred to as statistically significant when data as extreme as those observed would be very unlikely if the effect did not really exist; by convention, "very unlikely" means a probability below 5% (p < 0.05).
-