- Apr 2018
-
high-powered
A study is referred to as high-powered if the size of the sample from which data is collected is large enough that it becomes highly probable (at least 80% probability) that an effect of interest that exists in the population would actually be found in this data.
For example, let’s say we were interested in finding out whether cupcake consumption increases well-being. Because we cannot ask every person on the planet to please report their well-being, eat a cupcake, and then report their well-being again, we have to restrict our investigation to a certain sample of people.
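To get a feel for how power relates to sample size, here is a minimal R sketch using base R's power.t.test(); the numbers (a medium-sized effect of Cohen's d = 0.5 and a two-group comparison) are illustrative assumptions, not values from the paper.

```r
# How many people per group would we need so that a true, medium-sized
# difference in well-being (Cohen's d = 0.5) is detected with 80% probability
# at the usual 0.05 significance level? (Illustrative numbers only.)
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)
# The reported "n" (about 64 per group) is the sample size needed for this
# two-group study to count as high-powered for an effect of that size.
```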
-
- Jan 2018
-
experimental
A study is referred to as experimental if it contains random allocation of participants to experimental conditions or treatments in which a variable of interest is manipulated. Such experiments can allow claims that the manipulation has caused changes in outcomes.
For example, if we wanted to study the influence of rewards during class on students’ biology exam scores in an experimental study, we would randomly assign students to two conditions: In condition 1, students would receive candy bars for active participation in class, whereas in condition 2, students would not receive any candy bars.
Then we would observe the exam scores for each group of students, to judge if our candy-bar treatment improved the scores compared to the no-candy-bar control condition. We could then conclude whether rewards cause better exam scores in this context.
-
- Oct 2017
-
innovation versus verification
Innovation refers to coming up with new ideas for research—in other words, generating new hypotheses. Verification refers to checking if a certain idea holds up in subsequent research—in other words, confirming hypotheses.
-
preregistration
A preregistration is a document in which researchers compile information on how their study will be run and analyzed before it is conducted. The document often contains information on which research question will be pursued; which hypothesis will be tested; how the data will be collected and how the sample will be generated; which data will be excluded; and how the data will be prepared for analysis and ultimately analyzed. Documenting these decisions in advance helps separate confirmatory hypothesis testing from exploratory research.
-
(such as experience and expertise)
The expertise of researchers conducting the replication attempts has been the topic of much debate.
In a recent study, Protzko and Schooler have questioned whether researchers' "caliber" influences their success in reproducing studies.
Read more in PsyArXiv Preprints: https://osf.io/preprints/psyarxiv/4vzfs/
-
repeated measurement designs
A repeated measurement design assesses the same outcome variable at several points in time. For example, let’s say we want to find out whether jogging before class improves students’ ability to follow a class. We might ask 20 students to jog before class and 20 students not to jog before class, and then after class ask them how easy it was for them to follow the class. However, we might be unlucky and conduct our experiment on a day where a particularly difficult topic was covered in class. No one—neither the joggers nor the nonjoggers—could understand the lecture, so all our subjects report they absolutely couldn’t follow the class.
This problem could be ameliorated if we used a repeated measurement design instead. We would ask our 20 joggers and 20 nonjoggers to either jog or not jog before class on five days in a row, and then ask them each time how well they could follow the class. Now we would have not just one measurement from each student, but five measurements of their ability to follow the class at several points in time.
-
within-subjects manipulations
Within-subjects manipulations refer to situations in experiments where the same person is assigned to multiple experimental conditions.
For example, let’s say we want to find out which of two different learning techniques (A and B) is more effective in helping students prepare for a vocabulary test. In a within-subjects manipulation, each student would apply both learning techniques: every student might first apply learning technique A, then take a vocabulary test, and then a week later apply learning technique B for the next test. We could then compare the two test scores to see which learning technique led to better performance.
In contrast, in a between-subjects manipulation, each student would apply only one learning technique. We would split the group of students, so that half of them use learning technique A and then take the vocabulary test, while the other students use learning technique B and then take the vocabulary test. Again, we could compare the two groups to see which learning technique led to better performance.
-
fixed-effect model
A fixed-effect model is a statistical model that accounts for individual differences in the data that cannot be measured directly, by treating them as nonrandom, or “fixed,” at the individual level.
As an example, let’s say we wanted to study if drinking coffee makes people more likely to cross the street despite a red light. Our outcome variable of interest is how often each subject crosses a street despite a red light on a walk with 10 red traffic lights. The explanatory variable we manipulate for each participant is whether they had a cup of coffee before the experiment or a glass of water (our control condition), and we would use this variable to try to explain ignoring red lights. However, there are several other influences on ignoring red lights that we have not accounted for. Besides random and systematic error, we have also not accounted for individual characteristics of the person, such as their previous experience with ignoring red lights.
For instance, have the participants received a fine for this offense? If so, they might be less likely to walk across a red light in our experiment. Using a fixed-effect model makes it possible to account for these types of characteristics that rest within each individual participant. This, in turn, gives us a better estimate of the relationship between coffee drinking and crossing red lights, free of other individual-level influences.
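Here is a minimal simulated R sketch of the idea, not the authors' analysis; it assumes each participant takes several walks under both the coffee and the water condition, so that person-level fixed effects can be separated from the coffee effect.

```r
# Simulated data, illustrative only: 30 participants, 4 walks each,
# alternating between water (0) and coffee (1) before the walk.
set.seed(1)
n_people <- 30
habit <- rnorm(n_people)                        # unmeasured individual tendency
dat <- data.frame(
  person = factor(rep(1:n_people, each = 4)),
  coffee = rep(c(0, 1, 0, 1), times = n_people)
)
dat$crossings <- 2 + habit[as.integer(dat$person)] +
  0.5 * dat$coffee + rnorm(nrow(dat), sd = 0.5)

# Adding person as a factor gives each participant their own ("fixed")
# intercept, absorbing stable individual differences such as prior habits.
fixed_model <- lm(crossings ~ coffee + person, data = dat)
summary(fixed_model)$coefficients["coffee", ]
```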
-
Also, the replication “succeeds” when the result is near zero but not estimated with sufficiently high precision to be distinguished from the original effect size.
Here, the authors describe a problem with judging replication success by evaluating the replication effect against the original effect size. When the replication effect size is near zero, the data may in fact show no effect, which would suggest an unsuccessful replication attempt. However, the estimation of the effect size could be imprecise: there could be a lot of “noise” in the data, from random or systematic errors in the measurement. If the estimate is very imprecise, its range of plausible values can be so wide that it still includes the original effect size even though the estimate itself is near zero. In that case the replication would be counted as a success under this criterion, although the evidence is actually consistent with a true effect of zero, meaning that the replication could be falsely taken as a success.
-
A key weakness of this method is that it treats the 0.05 threshold as a bright-line criterion between replication success and failure (28).
Braver, Thoemmes, and Rosenthal (28) argue that judging the success of a replication only by whether it shows a significant effect (in the current study, at the 0.05 threshold) would be inappropriate. They argue that replication success depends a lot on the statistical power and therefore on the sample size used in the replication study. The replication study must have sufficiently many subjects so that it is probable enough that the effect in question, should it really exist in the population, can be found in this sample. If a replication study had low power, for example because the size of the original effect was overestimated and the replication sample size was consequently too small, it is less likely that the replication attempt will be successful and show a result that is statistically significant at the 0.05 threshold. For each individual replication study, replication success therefore depends on the sample size. If you assess several replication attempts individually, the replication success rate could therefore be distorted to underestimate how reproducible an effect really is.
To circumvent this problem, the authors suggest using a different technique than counting whether individual replications were significant at the 0.05 threshold. Their analysis is called “continuously accumulating meta-analysis.” The data of several replication attempts are combined, so that conclusions can be drawn about whether the data from all replication attempts, taken together, support the effect of interest. After a new replication attempt is conducted, its data is added to the pool of data from previous replication attempts, and a test is then run on the combined data to estimate the effect of interest.
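The core idea can be sketched with simple inverse-variance pooling in R; the effect estimates and standard errors below are made-up numbers, and this is only an illustration of accumulating evidence, not the specific procedure from reference (28).

```r
# Combine the effect estimates of several replication attempts, re-estimating
# the pooled effect each time a new attempt is added (made-up numbers).
effects <- c(0.42, 0.15, 0.08, 0.30)   # effect estimates from 4 replications
ses     <- c(0.20, 0.18, 0.15, 0.22)   # their standard errors
w <- 1 / ses^2                         # inverse-variance weights
for (k in seq_along(effects)) {
  pooled    <- sum(w[1:k] * effects[1:k]) / sum(w[1:k])
  pooled_se <- sqrt(1 / sum(w[1:k]))
  cat(sprintf("After %d studies: pooled effect = %.2f (SE = %.2f)\n",
              k, pooled, pooled_se))
}
```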
-
multivariate interaction effects
A multivariate interaction effect is an effect in which the influence of one variable on the outcome depends on the level of one or more other variables.
For example, we might be interested in finding out how water temperature (warm: 38°C; cold: 15°C) affects the body temperature of humans and sea lions. We might find that humans, on average, have a higher body temperature than sea lions, and that body temperature is higher when the body is immersed in warm compared to cold water. However, we might find that a human’s body temperature shows bigger differences between the warm and cold water conditions than the sea lion’s body temperature. Because sea lions have a substantial layer of protective fat, their body temperature does not change as much when water temperature changes, compared to humans. Here, species and water temperature show an interaction effect on body temperature.
-
standard error
When experiments are run using a sample instead of the entire population, each sample will show slightly different estimates of the true population parameter. The standard deviation of this range of estimates is called the standard error.
For example, if we wanted to know the average body mass of Chihuahuas, we couldn’t gather data from every single Chihuahua in the world. If we sampled 20 Chihuahuas, we might find that the average is 2.5 kg. If we sample 20 other Chihuahuas, their average weight might be 2.4 kg. Repeating this process, we would find a range of different average weights in the different samples. Taken together, these means are our estimates for the true average Chihuahua body mass in the population of all Chihuahuas. The dispersion, or the amount of variation in these means, is called standard error.
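A minimal R simulation of this idea, with made-up values for the "true" Chihuahua weight distribution: drawing many samples of 20 dogs shows that the spread of the sample means matches the textbook formula sd / sqrt(n).

```r
# Simulated Chihuahua weights (mean 2.5 kg, sd 0.4 kg; illustrative values).
set.seed(42)
sample_means <- replicate(5000, mean(rnorm(20, mean = 2.5, sd = 0.4)))
sd(sample_means)   # empirical standard error of the mean
0.4 / sqrt(20)     # theoretical standard error: sd / sqrt(n)
```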
-
Wilcoxon signed-rank test
The Wilcoxon signed-rank test is a statistical procedure used with two related samples. It assesses the differences between each data pair with regard to both direction and size.
For example, if we wanted to find out if students prefer pasta or salad served in the school cafeteria, we could run an experiment where on three consecutive days, we invite 20 students for lunch and observe how many of them chose pasta and how many chose the salad option. We end up with three pairs of data: On the first day, 18 students chose pasta and two chose salad; on the second day, 15 students chose pasta and five chose salad; on the third day, four students chose pasta and 16 chose salad. The test now calculates the differences between each data pair: On the first day, the difference is 18 – 2 = +16; on the second day, the difference is 15 – 5 = +10; on the third day, the difference is 4 – 16 = −12. Then, the differences are sorted by their absolute size (ignoring the sign: 10, 12, 16) and assigned a rank (10 gets rank 1, 12 gets rank 2, 16 gets rank 3). The sum of the ranks of the positive differences (1 + 3 = 4) is then compared to that of the negative differences (2). The smaller of the two sums of ranks (2) is then compared against a critical value, which informs us whether it is statistically different from zero. If we find a statistically significant result, we can conclude that students have a preference for pasta over salad.
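In R, the test is run with wilcox.test() and paired = TRUE. The sketch below uses made-up data and extends the cafeteria example to five observation days so the test has enough pairs to work with; the numbers are illustrative assumptions.

```r
# Paired counts of pasta vs. salad choices on five observation days
# (made-up data); the test works on the day-by-day differences.
pasta <- c(18, 15, 4, 17, 12)
salad <- c( 2,  5, 16,  3,  8)
wilcox.test(pasta, salad, paired = TRUE)
```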
-
df
Df is an abbreviation for the term “degrees of freedom.” The degrees of freedom are an important piece of information for a statistical test, describing the number of values in the analysis that are free to vary. It depends on how many values are considered (that is, how big the sample size is) and which statistical test is used.
-
exploratory analyses
An exploratory analysis is conducted in the absence of a specific hypothesis to be confirmed with the study. Such analyses are used to explore the data; that is, to see what patterns can be found, without trying to prove a specific point.
-
generalizability
When we conduct a scientific study, it is often not possible to collect data from every person in the population in the exact situation we want to study. Instead, we often have only a sample of subjects, which we observe in a certain, typical situation. For example, if we want to study adherence to red lights in traffic, we cannot check if every human being will stop at every red light, when driving cars, riding a bike, walking, skateboarding, or using any other means of transportation. We could, however, test 200 pedestrians’ behavior at the traffic light in front of a university.
Generalizability refers to whether a study’s findings, given its own restricted circumstances, can be extended to make statements about what will be true for the population in general, and for similar situations. For example, imagine we want to study adherence to red lights in traffic by observing 200 pedestrians’ behavior at the traffic light in front of a university. Given that our sample size is small and not representative (because there are mostly students in front of a university, a very specific sample of people), and that the situation we observe is only one facet of participation in traffic (we ignore driving, cycling, skateboarding, etc.), we could not make very good statements about adherence to red lights in general.
-
predictors
A predictor (sometimes also called a predictor variable or an independent variable) is a variable that represents the potential reasons why we see a certain result.
For example, if we wanted to study which factors increase students’ performance in their final exams, we could consider a number of different potential reasons, or predictors, such as how often they did their homework during the past school year, how much time they spent reviewing the materials before the exam, or how well they slept the night before the exam.
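As a sketch of how predictors enter an analysis, here is a small simulated R example in which exam scores are modeled as a function of three candidate predictors; the variable names and numbers are hypothetical.

```r
# Simulated data: 50 students with three candidate predictors (made-up values).
set.seed(7)
students <- data.frame(
  homework_done  = rbinom(50, size = 30, prob = 0.7),  # homeworks completed
  hours_reviewed = runif(50, 0, 40),
  sleep_quality  = runif(50, 1, 10)
)
students$exam_score <- 40 + 0.5 * students$homework_done +
  0.4 * students$hours_reviewed + 1.5 * students$sleep_quality +
  rnorm(50, sd = 5)

# Each predictor gets its own coefficient, estimating its relationship with
# the exam score while holding the other predictors constant.
summary(lm(exam_score ~ homework_done + hours_reviewed + sleep_quality,
           data = students))
```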
-
random or systematic error
There are two sources of error which can occur in scientific studies and distort their results.
Systematic errors are inaccuracies that can be reproduced. For example, imagine we wanted to measure a participant’s weight and we make our participant step on five different scales and measure her weight on each scale 10 times. Four scales report that she weighs 74 kg each time she steps on them. The last scale shows that she weighs 23 kg each time she steps on it. We would say there is a systematic error involved in our study of her weight, because the last scale consistently and erroneously reports her weight as too low.
Random errors are inaccuracies that occur because there are unknown influences in the environment. For example, imagine we wanted to measure a participant’s weight and had her step on the same scale three times in a row, within one minute. The first time, the scale reports 74.43 kg, the second time 74.34 kg, the third time 74.38 kg. We don’t think that the participant's weight has actually changed in this 1 minute, yet our measurement shows different results, which we would attribute to random errors.
-
correlational
A study is referred to as correlational if it investigates if there is a relationship between two factors without assigning subjects to conditions manipulating a variable of interest. A causal interpretation (that changes in factor A cause changes in factor B) is not possible in correlational studies.
For example, if we wanted to study the influence of intelligence on students’ biology exam scores in a correlational study, we would first observe students’ intelligence via an IQ test, and then measure their score in the exam. Then we could judge if there was a positive relationship between IQ and exam score: Smarter students might be shown to score better on the test. However, since we did not manipulate students’ IQ to be high or low, we could not say that a higher IQ causes better test scores, only that the two variables are positively related.
-
- Jul 2017
-
confidence intervals (CIs)
When studies are run, we aim at estimating values that are true for the population. However, we often cannot record data from everyone in the population, which is why we rely on drawing a random sample from the population. For example, while we may want to estimate the average difference in height between all men and all women in the world, we cannot possibly measure the height of all men and women in the world. Therefore, we draw a random sample of men and women. Let's say we collect data from 100 men and 100 women. The study reveals the average difference in height we find in this sample of 200 people, but it does not tell us what the true difference in height in the population of all men and women in the world is.
If we drew random samples of 200 people from the population of all men and women in the world again and again and again, and assessed their average difference in height each time, we would find a range of values. This range of values represents our estimates for the height difference in the population of all men and women in the world.
We refer to this range of values (interval) as the confidence interval. We want to make sure that it includes the true value of the variable we are estimating for the population sufficiently often. A 95% confidence interval (“CI”) means that, if we repeated the sampling procedure many times and calculated an interval each time, about 95% of those intervals would contain the true population value.
A CI calculated from a single study should be read in the same way: the procedure used to construct it captures the true population value in 95% of repeated samples; it does not mean that there is a 95% probability that the true value lies in this particular interval.
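A minimal R sketch with simulated heights (the means and standard deviations are made-up values, not real anthropometric data):

```r
# 95% confidence interval for the average height difference between a sample
# of 100 men and 100 women (simulated data, illustrative values in cm).
set.seed(3)
men   <- rnorm(100, mean = 178, sd = 7)
women <- rnorm(100, mean = 165, sd = 7)
t.test(men, women)$conf.int
# If this sampling were repeated many times, about 95% of the intervals
# produced this way would contain the true population difference.
```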
-
cumulative process
The term “cumulative process” here refers to taking an approach to research in which we try to gain insight not by interpreting strongly the results of one individual study at a time, but by integrating the results of several studies and broader research programs to gain an overview of the overall evidence.
-
validity
Validity refers to the degree to which a certain result or conclusion in research corresponds with reality. There are different aspects of a study which can improve or decrease its validity. For example, a study has high ecological validity if its results can be directly applied to real-life situations outside of the lab.
-
narrow-and-deep approach
This refers to results of studies that go into detail on a specific area, without covering a wide range of different topics.
-
broad-and-shallow evidence
This refers to results of studies that cover a wide range of different topics, without going into detail on a specific area.
-
upwardly biased effect sizes
Here, upwardly biased means that the effect sizes reported in the literature are distorted to appear bigger than they really are.
-
consistently
When results of several analyses point in the same direction, we say the results are consistent. For example, if we run three correlation analyses and find that enjoyment of hiking, self-assessed nature-lovingness, and number of times previously hiked all correlate positively with the probability that someone enjoys hiking holidays, we would say that the results are consistent. If we found that the number of times previously hiked was negatively correlated with the probability that someone enjoys hiking holidays, the results would be less consistent.
-
pre-analysis plans
A pre-analysis plan is a document that specifies which analyses will be run on the data, before these analyses are performed. This plan can specify which variables and analyses will be used, how data will be prepared for analyses, and in which cases data will be excluded from analyses. This tool helps researchers specify and commit to the way they want to run the analyses in their study.
-
confirmatory tests
A confirmatory test is a statistical analysis of a certain relationship which had previously been hypothesized to hold. The test tries to find out if the hypothesis is supported by the data.
-
publication bias
Publication bias is a type of distortion that can occur when academic research is made public. It is present when findings showing a statistically significant effect of interest are more likely to be published than findings showing no evidence for, or even evidence against, this effect. In this case, if you only read the published papers, you would find many papers showing support for an effect, while studies that do not support the same effect remain unpublished, giving you the impression that the effect is less disputed and more consistently found than it actually is.
-
population effect size
The population effect size is the strength of the effect in the population of all possible subjects (e.g., all humans); individual studies can only estimate it from samples.
-
goodness-of-fit χ2 test
A goodness-of-fit test indicates how well a statistical model fits the data. It shows whether the difference between the observed data and the predicted, expected values is too big, or whether the difference is small enough that we can assume the model captures reality sufficiently well. A goodness-of-fit χ2 (chi-squared) test is a specific type of goodness-of-fit test.
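A minimal R sketch with made-up counts, testing observed lunch choices against an assumed model in which three options are equally popular:

```r
# Do the observed counts deviate from the expected counts under the model
# that pasta, salad, and soup are chosen equally often? (Made-up data.)
observed <- c(pasta = 30, salad = 14, soup = 16)
chisq.test(observed, p = c(1/3, 1/3, 1/3))
# A small p-value indicates the observed counts deviate more from the
# expected counts than the assumed model allows.
```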
-
Spearman’s rank-order correlations
Spearman’s rank-order correlation is a specific type of correlation analysis, which assesses the relationship between two variables with regard to its strength and direction, based on the ranks of the values.
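In R, it is computed with cor.test() and method = "spearman"; the data below are made up for illustration.

```r
# Spearman's rank-order correlation between hours studied and exam scores
# (made-up data): the values are converted to ranks before correlating.
hours  <- c(2, 5, 1, 8, 4, 7, 3)
scores <- c(55, 70, 50, 90, 72, 85, 60)
cor.test(hours, scores, method = "spearman")
```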
-
standardizing
Standardizing refers to a procedure for preparing the data for analysis, in which all data are transformed so that their mean across participants is 0 and their standard deviation is 1.
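In R, this is what scale() does; a minimal sketch with arbitrary numbers:

```r
# Standardize a variable to mean 0 and standard deviation 1 (z-scores).
x <- c(4, 8, 15, 16, 23, 42)
z <- as.numeric(scale(x))      # equivalently: (x - mean(x)) / sd(x)
round(c(mean(z), sd(z)), 10)   # approximately 0 and exactly 1
```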
-
sample size
The sample size refers to the number of people from whom data is collected in a study.
-
R script
An R script is a document written in the programming language R which contains a number of commands that the computer should execute. For this study, all commands necessary to run the analyses reported here are compiled in such a script, which is available online, so that everyone who is interested in them can download the script and rerun all analyses on their own computer.
-
R statistical programming language
R is a programming language and software environment for statistical analyses. To run an analysis, scientists write commands in this language, which the program then executes.
-
accumulated evidence
Accumulated evidence refers to the results of several studies taken together.
-
- Jun 2017
-
Results
The authors used 5 measures for replication success to check to what extent the 100 original studies could be successfully replicated.
-
- Jan 2017
-
Abstract
Video recording of a symposium explaining the motivation and methodology of the reproducibility project, previewing preliminary results and offering discussion points on implications:
-
Such debates are meaningless, however, if the evidence being debated is not reproducible.
An example of a case in which psychologists currently face vivid debates about the replicability of an effect is the ego depletion literature, as explained in this video:
Why an Entire Field of Psychology Is in Trouble (by SciShow)
-
Reproducibility
Introductory video summarizing the Reproducibility Project:
Science 101: The Basics of Reproducibility/Replicability (by Public Domain TV)
-
T. M. Errington et al., An open investigation of the reproducibility of cancer biology research. eLife 3, e04333 (2014). doi: 10.7554/eLife.04333; pmid 25490932
Similarly to the reproducibility project in psychology, Errington and colleagues planned to conduct replication attempts on 50 important papers from the field of cancer biology. While the registered reports are already available online, the replication studies themselves are currently still being conducted.
Read more on eLife: https://elifesciences.org/collections/reproducibility-project-cancer-biology
-
B. A. Nosek, J. R. Spies, M. Motyl, Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspect. Psychol. Sci. 7, 615–631 (2012). doi: 10.1177/1745691612459058; pmid 26168121
Nosek and colleagues argue that scientists are often torn between "getting it right" and "getting it published": while finding out the truth is the ultimate goal of research, more immediately, researchers need to publish their work to be successful in their profession.
A number of practices, such as the establishment of journals emphasizing reports of non-significant results, are argued to be ill suited for improving research practices. To reconcile the two seemingly-at-odds motives, Nosek and colleagues suggest measures such as lowering the bar for publications and emphasizing scientific rigor over novelty, as well as openness and transparency with regard to data and materials.
-
L. K. John, G. Loewenstein, D. Prelec, Measuring the prevalence of questionable research practices with incentives for truth telling. Psychol. Sci. 23, 524–532 (2012). doi: 10.1177/ 0956797611430953; pmid 22508865
John, Loewenstein, and Prelec conducted a survey with over 2,000 psychologists to identify to what extent they used questionable research practices (QRPs). The respondents were encouraged to report their behavior truthfully, as they could increase donations to a charity of their choice by giving more truthful answers.
Results showed that a high number of psychologists admitted to engaging in QRPs: almost 70% of all respondents admitted to not reporting results for all dependent measures, and around 50% admitted to reporting only studies that showed the desired results. Moreover, researchers suspected that their peers also occasionally engaged in such QRPs, but psychologists thought that there was generally no good justification for engaging in them.
-
research community is taking action
An important part of taking action to advance psychological research is establishing an open discussion and dialogue about the directions the field could take. In the course of this movement, several researchers' blogs have become an increasingly popular medium for such debate.
Read more on the topic of reproducibility in Andrew Gelman's Blog: http://andrewgelman.com/?s=reproducibility and in Uri Simonsohn's Blog Data Colada: http://datacolada.org/?s=reproducibility .
-
Moreover, correlational evidence is consistent with the conclusion that variation in the strength of initial evidence (such as original P value) was more predictive of replication success than was variation in the characteristics of the teams conducting the research (such as experience and expertise).
Third, from this study we can conclude that the precautions the authors took to prevent replication success from depending on the team of researchers who conducted the replication study were quite successful: there is no evidence that characteristics of the replication team influenced the outcomes of the replication attempt.
Rather than differences between the replication teams, it was characteristics of the original studies that systematically distinguished successfully replicated from non-replicable studies. Therefore, the fourth conclusion of this paper is that original studies which showed stronger evidence for the effect they investigated were also more likely to be successfully replicated.
-
Meta-analysis combining original and replication effects
Moreover, the authors planned to combine the results of each original and replication study, to show if the cumulative effect size was significantly different from zero. If the overall effect was significantly different from zero, this could be treated as an indication that the effect exists in reality, and that the original or replication did not erroneously pick up on an effect that did not actually exist.
-
hypothesis that this proportion is 0.5
In this case, testing against the null hypothesis that half of the replication effects are stronger than the original study effects means assuming that any difference between the effect sizes is due to chance alone. The alternative hypothesis is that the replication effects are, on average, stronger or weaker than the original study effects.
-
Fisher’s method
Fisher's method is a statistical procedure for conducting meta-analyses, in which the results of all included studies are combined. The procedure combines the p-values of the individual studies and allows inferences on whether the null hypothesis (that there are, in fact, no effects) holds.
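A minimal R sketch of the calculation, assuming the combined studies are independent; the p-values below are made up.

```r
# Fisher's method: combine the p-values of several independent studies into
# one test of the joint null hypothesis that no effect exists (made-up values).
p  <- c(0.08, 0.20, 0.01, 0.15)
X2 <- -2 * sum(log(p))                               # test statistic
pchisq(X2, df = 2 * length(p), lower.tail = FALSE)   # combined p-value
```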
-
subjective assessments of replication outcomes
One of the indicators for whether a study was replicated successfully or not was a subjective rating: each team of replicators was asked if their study replicated the original effect (yes or no).
-
sampling frame and selection process
The authors wanted to make sure that the studies selected for replication would be representative of psychological research; that is, they would give a good picture of the kinds of studies psychologists typically run. Representativeness was important because it would mean that the conclusions drawn from the replication outcomes could be cautiously extended to assumptions about the state of the field overall.
At the same time, they had to make sure that the studies selected could also be conducted (that is, that one of the coauthors had the necessary skill or equipment to collect the data).
To achieve this goal, a step-wise procedure was used: starting from the first issue of 2008 of three important psychology journals, 20 studies were selected and matched with a team of replicators who would conduct the replication attempt. If articles were left over because no one could conduct the replication, but more replication teams were willing to conduct a study, another 10 articles were made available. In the end, out of 488 studies drawn from the population of studies, replication was attempted for 100 studies.
-
transparency
Transparency here means that the process by which a specific result was achieved is made as accessible to other researchers as possible, by explaining publicly, and in detail, everything that was done in a study to arrive at that result.
-
11
Prinz and colleagues comment on their experience as employees of a pharmaceutical company, which relies on preclinical research to decide whether to invest into the exploration and development of new drugs. Because companies find many preclinical research findings unreliable, they now often conduct their own research to reproduce the original findings before they decide to move on and invest large sums of money into the actual drug development. Only in about 20% to 25% of the cases did the company scientists report finding results of the reproduction that were in line with the originally reported findings.
-
10
Begley and Ellis are cancer researchers, who propose ways for research methods, publication practices and incentives for researchers to change so that research would yield more reliable results, such as more effective drugs and treatments. They argue that often new drugs and treatments enter clinical trials, which test their effectiveness to treat cancer in humans, before they reach sufficient standards in preclinical testing, leading to non-reproducible findings. To achieve more reliable preclinical results, they argue that more focus should be placed on reproducing promising findings in the preclinical phase.
-
8)
Schmidt argues that, although replication is critical for scientific progress, little systematic thought had been applied to how to go about replications.
He suggests to differentiate direct replication (the repetition of an experimental procedure) and conceptual replication (the repeated test of a hypothesis or result using different methods).
Moreover, he summarizes five main functions that replications serve: to control for sampling error, artifacts, or fraud; to extend results to larger or different populations; and to check the assumptions earlier experiments made.
Schmidt concludes that, although a scientific necessity, replications can be practically difficult to conduct, in particular because this type of work is not always easy to publish or highly regarded. He therefore recommends that studies addressing novel research questions could also include elements of replication of previous findings.
-
or other disciplines
Camerer and colleagues conducted a project aimed at evaluating the reproducibility of studies in experimental economics, using a somewhat different methodology.
Read more in Science: http://science.sciencemag.org/content/351/6280/1433.full.pdf+html
-
meta-analyses
Meta-analyses integrate the results of multiple studies to draw overall conclusions on the evidence.
-
P value
A p-value is the probability of obtaining a result at least as extreme as the one observed if the effect did not exist in reality (that is, if the null hypothesis were true). A result whose p-value falls below a chosen threshold (commonly 0.05) is considered statistically significant and compelling evidence, because such a result would be unlikely to manifest in the data if the effect did not exist.
-
eye tracking machines
Eye tracking machines are devices that can record eye-movements and make it possible to show what information people look at without asking them explicitly what they are attending to.
-
social psychology
Social psychology is a subdiscipline of psychology that studies how people interact with their social environment, and how their thoughts and behaviors are affected by others.
-
correlation coefficient (r)
A correlation coefficient describes the linear interdependence of two variables. It shows both the direction (positive coefficient: as A increases, B increases as well; negative coefficient: as A increases, B decreases) and the strength of the relationship (coefficient close to zero: weak relationship; coefficient close to +/- 1: strong relationship).
For example, there might be a positive correlation between years of attendance to school and crystallized intelligence: with increasing school attendance, people could acquire more knowledge. On the other hand, there could be a negative correlation between age and fluid intelligence: with increasing age, people could get worse at solving problems in new situations.
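A small simulated R sketch of the two examples (made-up numbers, only to show the sign of the coefficients):

```r
# Simulated data: a positive and a negative correlation (illustrative values).
set.seed(11)
schooling <- rnorm(200, mean = 12, sd = 2)
knowledge <- 10 + 2 * schooling + rnorm(200, sd = 4)
cor(schooling, knowledge)   # positive: more schooling, more acquired knowledge

age     <- runif(200, 20, 80)
fluidIQ <- 120 - 0.4 * age + rnorm(200, sd = 8)
cor(age, fluidIQ)           # negative: higher age, lower fluid-intelligence score
```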
-
9
Ioannidis conducted simulations to show that for most studies, it is more likely for a finding to be a false positive than true identification of an effect. Among the factors that make it more likely for research findings to be false are a small size of the sample or the underlying effect, and when designs, definitions and analyses are more flexible rather than rigorously objective.
-
13)
In this editorial, Pashler and Wagenmakers argue that doubts about the reproducibility of findings in psychology became increasingly critical after events such as the fraud case of Stapel in 2011, where fabricated and manipulated data resulted in numerous retractions of journal articles, or the debate around findings published by Bem in 2011, where claims that people had an ability to foresee the future were shown not to be replicable. The suspicion that researchers engaged in "questionable research practices" (QRPs) turned out to be more justified than the field had hoped, for instance based on survey findings that many psychologists admitted to engaging in some of these QRPs.
-
sufficient
Sufficient conditions are the circumstances that are enough to find a specific effect. If these conditions are not met, the effect could still be found in another way.
For example, to find the effect that people can be manipulated to be more prosocial, it is sufficient to ask participants to think about a time when someone was generous to them, and then ask them to make a donation. This circumstance could be enough to make people more prosocial (it would therefore be sufficient), but you could think of other circumstances that could achieve the same result.
-
necessary
Necessary conditions are the circumstances that must be met in order to find a specific effect. If these conditions are not met, the effect cannot be found.
For example, to find the effect that prosocial people are more likely to give change to a beggar, a necessary condition would be studying human subjects, not penguins.
-
moderate
In statistics, moderation refers to the dependence of the relationship between two variables on a third variable.
For example, the positive relationship between socioeconomic status and health (the higher one's status, the better one's health) could be moderated by one's sense of control: people in low income groups with high sense of control might show health levels comparable with people from high-income groups, whereas people in low income groups with low sense of control have worse health (Lachman & Weaver, 1998).
-
Editor's Introduction
Reproducibility in psychology: How much research holds up when run again?
The field of psychology has seen troubling news of researchers faking data or of questionable findings that wouldn't hold up when other researchers tried to run the studies again. A critical mass of such news was reached, and some argued that psychology was in crisis, that researchers were pushed to prioritize getting published over investigating the truth, and that faulty research reports could hinder the progress of psychological science. But how do you estimate whether the bad news came from only a few black sheep, and whether the field of psychology overall still does good work?
-
Reproducibility
Reproducibility is the feature of an experiment that speaks to whether it can be run again and whether the same results as in previous runs of this experiment can be found. If an experiment has been reproduced successfully, it has been conducted more than once, and the overall evidence suggests that the original findings hold in the reproducing studies (also referred to as replication studies, or replications) as well.
-
D. Fanelli, “Positive” results increase down the hierarchy of the sciences. PLOS ONE 5, e10068 (2010). doi: 10.1371/journal. pone.0010068; pmid 20383332
Fanelli assessed more than 2,000 papers from different scientific disciplines and found that the proportion of studies reporting support for their hypotheses was higher in disciplines such as psychology or economics than in disciplines such as space science. It is concluded that both the type of hypotheses tested and the rigor applied in these tests differ between fields.
-
K. S. Button et al., Power failure: Why small sample size undermines the reliability of neuroscience. Nat. Rev. Neurosci. 14, 365–376 (2013). doi: 10.1038/nrn3475; pmid 23571845
Button and colleagues study the average statistical power of studies in neuroscience and conclude that it is low. They highlight that low power does not only mean that studies have a lower chance of detecting a true effect, but also that a significant effect found in a low-powered study is less likely to reflect a true effect. They argue that using studies with low power may seem efficient at first glance, because less money is spent on subjects, but that low-powered studies are in fact inefficient in the long run, because future research could be building on an erroneous line of investigation.
-
J. K. Hartshorne, A. Schachner, Tracking replicability as a method of post-publication open evaluation. Front. Comput. Neurosci. 6, 8 (2012). doi: 10.3389/fncom.2012.00008; pmid 22403538
Hartshorne and Schachner suggest that replication success should be traced in a database connecting replication attempts with original studies. Based on this information, a replication success score could be computed, which could be used as a criterion for a journal's quality alongside other indicators such as citation counts.
-
G. S. Howard et al., Do research literatures give correct answers? Rev. Gen. Psychol. 13, 116–121 (2009). doi 10.1037/a0015468
Howard and colleagues examine how the file drawer problem affects a research literature. They compare "old," existing bodies of literature that could be suffering from the file drawer problem (i.e., which could include the few studies where an effect was found, while studies yielding non-significant results on the same effect were never published) with a newly constructed body of literature guaranteed to be free of the file drawer problem, which they achieved by conducting new studies. This investigation suggests that some bodies of literature are relatively file-drawer-free, while other bodies of literature raise concern and call for further studies on the effects they include.
-
A. G. Greenwald, Consequences of prejudice against the null hypothesis. Psychol. Bull. 82, 1–20 (1975). doi 10.1037/h0076157
Greenwald examines how research practices discriminate against accepting the null hypothesis (that an effect does not exist). Using a simulation, he suggests that too few publications accept the null hypothesis, and that the proportion of publications which falsely reject the null hypothesis although it would have been true is high.
Greenwald further debunks traditional arguments for why a null hypothesis should not be accepted, and suggests ways to improve research practices so that accepting the null hypothesis becomes more acceptable.
-
Many Labs replication projects (32)
Many Labs replication projects are studies in which multiple labs attempt to replicate the same effect. In this example, 36 teams of researchers from different countries attempted to replicate the same 13 effects, with more than 6000 participants.
The data revealed that 10 effects could consistently be replicated, while one effect showed only weak support for replication and two effects could not be replicated successfully.
-
J. P. Simmons, L. D. Nelson, U. Simonsohn, False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol. Sci. 22, 1359–1366 (2011). doi: 10.1177/0956797611417632; pmid 22006061
Simmons and colleagues conduct computer simulations and two experiments that show how significant results can easily be achieved for a research hypothesis that is false. They show that flexibility - or as they call it: researcher degrees of freedom - in four areas contributes to making it more likely to find significant effects for a false hypothesis:
- Flexibility in choosing the dependent variables reported: When researchers flexibly analyze two related dependent variables, this already almost doubles the probability of finding a positive result for a false hypothesis.
- Flexibility in choosing the sample size: When researchers stop data collection, find no significant result, and collect additional data before checking for the same effect, this increases the probability of finding a positive result for a false hypothesis by 50%.
- Flexibility in the use of additional variables included in the analyses: When researchers include additional variables in the analyses, false positive rates more than double.
- Flexibility in the number of experimental conditions reported: When researchers collect data in three experimental conditions and flexibly decide whether to report the result of comparisons between any two conditions or all three, this more than doubles the false positive rate.
If researchers used research practices where they used all four flexibilities, they would, overall, be more likely than not to find positive results although the underlying hypothesis was indeed false.
-
R. Rosenthal, The file drawer problem and tolerance for null results. Psychol. Bull. 86, 638–641 (1979). doi: 10.1037/0033- 2909.86.3.638
Rosenthal addresses the 'file drawer problem', a questionable research practice where only studies that showed the desired result would be published and all other studies would land in the 'file drawer' and would not be known to the scientific community.
In the extreme case, this could mean that, if a specific effect did not exist in reality, the 5% of studies that find this effect (due to the accepted rate of statistical error) get published and discussed as if the effect were true, whereas the 95% of studies that do not (and rightly so) find the effect are tucked away in a file drawer. This problem hinders scientific progress, as new studies would build on old, but false, effects.
Rosenthal introduces a way to assess the size of the file drawer problem, the tolerance for future null results: calculating the number of studies with null results that would have to be in a file drawer before the published studies on this effect would be called into question.
-
Transparency and Openness Promotion (TOP) Guidelines (http://cos.io/top) (37)
Nosek and colleagues summarize eight standards for transparency and openness in research, covering citation standards, data transparency, transparency of analytic methods (code), availability of research materials such as participant instructions, transparency of design and analysis, preregistration of studies, preregistration of analysis plans, and replication. They argue that journals should require and enforce adherence to transparency guidelines, and that the submission of replication studies, in particular in the Registered Report format, should be an option.
-
Scientific incentives
Incentives for working in scientific research often differ greatly by country and institution. In the UK, for instance, the allocation of research funding and institutional positions depends on the number of published papers that are rated as highly original, significant, and rigorously conducted.
Read more in The Guardian: https://www.theguardian.com/higher-education-network/2016/oct/17/why-is-so-much-research-dodgy-blame-the-research-excellence-framework
-
procedures that are more challenging to execute may result in less reproducible results
Making an experimental setup transparent and reproducible can be quite difficult, because undetected but theoretically relevant variations to the study protocol could produce different results. However, there are some new ideas about how such transparency could be achieved.
Read more in The New Yorker: http://www.newyorker.com/tech/elements/how-methods-videos-are-making-science-smarter
-
improve the quality and credibility of the scientific literature
Improving the quality and credibility of scientific literature can be accomplished through improving the daily practices involved in the research process. Improved reporting and registering hypotheses and sample sizes are some ideas for such improvements.
Read more in Nature Human Behavior: http://www.nature.com/articles/s41562-016-0021
-
Other investigators may develop alternative indicators to explore further the role of expertise and quality in reproducibility on this open data set.
In a later approach to estimating how researchers assess the reproducibility of science, a large-scale survey was conducted with more than 1500 researchers answering questions such as "Have you failed to reproduce an experiment?".
Read more in Nature: http://www.nature.com/polopoly_fs/1.19970!/menu/main/topColumns/topLeftColumn/pdf/533452a.pdf
-
The present results suggest that there is room to improve reproducibility in psychology. Any temptation to interpret these results as a defeat for psychology, or science more generally, must contend with the fact that this project demonstrates science behaving as it should.
The fifth and final conclusion of the paper concerns the question of what psychologists and other researchers should take from these results for their overall research practices. The conclusion is mixed. On the one hand, the authors recognize that research is a process in which new ideas have to be explored and sometimes might turn out not to be true. Maximum replicability is therefore not desirable, because it would mean that no more innovations are being made. On the other hand, the authors also conclude that there is room for improvement: stronger original evidence and better incentives for replications would put progress in psychological research on a stronger foundation.
-
Nonetheless, collectively, these results offer a clear conclusion: A large portion of replications produced weaker evidence for the original findings
Because there is some uncertainty about how exactly the replication success rate in psychological research should be determined, the authors go about the interpretation of the results of this study very conservatively. This very careful interpretation of the data, and the second conclusion of this study, is that the replication studies yielded largely weaker evidence for the effects studied than the original studies.
-
No single indicator sufficiently describes replication success, and the five indicators examined here are not the only ways to evaluate reproducibility.
The discussion section begins with a cautionary sentence reminding the reader that it is difficult to say exactly how many original studies were successfully replicated. The precise conclusions drawn from this paper depend a lot on which of the 5 measures used to determine replication success you think is the most appropriate. The results of one measure indicate a replication success rate as low as 36%, while another measure suggests a success rate of 68%. Perhaps some researchers would even say that another measure not included in this study would have made it possible to draw more meaningful conclusions. The scientific community has so far not agreed on which measure should be used to evaluate replication success rates.
Moreover, there are other limitations to this approach to studying reproducibility (see paragraph "Implications and limitations"), which make it difficult to generalize the findings of this study to psychological research in general, or even to other disciplines. It is also difficult to evaluate from the findings in this study whether the evidence indicates a specific effect is true or does not exist.
Therefore, the first conclusion of this paper is that all interpretations of the data are only an estimation of how reproducible psychological research is, not an exact answer.
-
Subjective assessment of “Did it replicate?”
Finally, the authors used the subjective rating of whether the effect replicated as an indicator of replication success. Out of 100 replication teams, only 39 reported that they thought they had replicated the original effect.
-
Comparing original and replication effect sizes
With this third measure for replication success, the authors further compared the sizes of the original and replicated effects. They found that the original effect sizes were larger than the replication effect sizes in more than 80% of the cases.
-
Evaluating replication effect against original effect size
In a second way of looking at replication success, the authors checked whether the effect sizes of the original studies were too far off from those of the replication studies. Using this measure, they found that less than half of the replications showed an effect size close enough to the original effect size to count as a successful replication.
-
Evaluating replication effect against null hypothesis of no effect
First, the authors used the 5 measures for replication success to check to what extent the 100 original studies could be successfully replicated.
As a first look at the results, the authors checked how many replications "worked" by analyzing how many replication studies showed a significant effect with the same direction (positive or negative) as the original studies. Of the 100 original studies, 97 showed a significant effect. Because no replication study has a 100% probability of finding a positive result even if the investigated effect is true, at most around 89 successful replications could have been expected even if all original effects were true. However, results showed that only 35 studies were successfully replicated.
-
Analysis of moderators
Last, the authors wanted to know whether successfully replicated studies differed systematically from studies that could not be replicated. For this, they checked whether a number of characteristics of the original studies were systematically related to replication success.
-
Subjective assessment of “Did it replicate?”
Finally, the authors included a last measure for replication success: a subjective rating. All researchers who conducted a replication were asked if they thought their results replicated the original effect successfully. Based on their yes or no answers, subjective replication success would be calculated.
-
“coverage,” or the proportion of study-pairs in which the effect of the original study was in the CI of the effect of the replication study
In this test for replication success, the authors will compare the size of the original study effect and the effect of the replication study to identify if there are indications that they are not too different, so that it is likely that the effect sizes in both samples correspond to the same effect size in the population.
-
Correlates of reproducibility
Finally, the authors wanted to know if successfully replicable studies differed from studies that could not be replicated in a systematic way. As the criterion for replication success, they used their first analysis (significance testing).
They found that studies from the social psychology journal were less likely to replicate than those from the two journals publishing research in cognitive psychology. Moreover, studies were more likely to replicate if the original study reported a lower p-value and a larger effect size, and if the original finding was subjectively judged to be less surprising. However, successfully replicated studies were not judged to be more important for the field, or to have been conducted by original researchers or replicators with higher expertise than failed replications.
-
The last measure for the success of the replications was a subjective rating from the replication teams. Each team was asked if they thought they had replicated the original effect. Out of 100 studies, 39 were judged to be successful replications.
-
Combining original and replication effect sizes for cumulative evidence
Fourth, the authors combined the original and replication effect sizes and calculated a cumulative estimation of the effects. They wanted to see how many of the studies that could be analyzed this way would show an effect that was significantly different from zero if the evidence from the original study and that of the replication study was combined.
Results showed that 68% of the studies analyzed this way indicated that an effect existed. In the remaining 32% of the studies, the effect found in the original study, when combined with the data from the replication study, could no longer be detected.
-
Statistical analyses
Because the large-scale comparison of original and replication studies is a new development in the field of psychology, the authors had to formulate a plan for their analysis that could not rely much on previous research. They decided to use 5 key indicators for evaluating the success of the replications. They will compare the original and the replicated studies in terms of the number of significant outcomes, p-values, and effect sizes, and they will assess how many studies were subjectively judged to replicate the original effect. Finally, they will also run a meta-analysis of the effect sizes.
-
Aggregate data preparation
After each team had completed the replication attempt, independent reviewers checked that their procedure was well documented and followed the initial replication protocol, and that the statistical analyses of the effects selected for replication were correct.
Then, all the data were compiled to conduct analyses not only on the individual studies, but about all replication attempts made. The authors wanted to know if studies that replicated and those that did not replicate would be different. For instance, they investigated if studies that replicated would be more likely to come from one journal than another, or if studies that did not replicate would be more likely to have a higher p-value than studies which could be replicated.
-
constructed a protocol for selecting and conducting high-quality replications
Before collecting data for the replication studies, the authors produced a detailed protocol describing how they would select the studies available for replication, how they would decide which effect within each study to attempt to replicate, and which principles would guide all replication attempts. Importantly, this protocol was made public, and every individual replication attempt had to adhere to it.
-
1–6)
These articles provide an overview of arguments, from the perspective of philosophy of science, that scientific theory and explanation require reproducibility to enable scientific progress.
-
t test for dependent samples
The t-test for dependent samples is a statistical procedure used on paired data: it compares the means of two related sets of measurements, for example the same participants measured twice.
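As a small illustration, the Python sketch below runs a dependent-samples t test on made-up paired values (imagine each study contributing one original and one replication p-value); the numbers are invented and only show how the test is applied.

```python
from scipy import stats

# Made-up paired data: each position pairs an original study with its replication
original    = [0.01, 0.03, 0.04, 0.02, 0.05]
replication = [0.20, 0.04, 0.30, 0.01, 0.45]

# Dependent-samples (paired) t test on the differences within each pair
t_stat, p_value = stats.ttest_rel(original, replication)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```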
-
24
The Open Science Collaboration published its plan for the Reproducibility Project. They announced how they would select the studies to be replicated, basic principles for how the replications would be conducted and how the results would be evaluated, and invited researchers to join the team conducting the replications.
-
7
Nosek and Lakens argue in this editorial that registered reports are a partial solution to the problem that researchers have few incentives to conduct replications. A registered report is an article format in which a proposal for a replication is peer-reviewed before data are collected, and the preregistered report of the replication is published regardless of what the data show.
-
- Oct 2016
-
scienceintheclassroom.org scienceintheclassroom.org
-
z, F, t, and χ2
z, F, t, and χ2 test statistics are values calculated from a sample and compared with what would be expected under the null hypothesis (that no effect exists in reality). The test statistic allows an inference about whether the data let us reject the null hypothesis and conclude that an effect is present.
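The general logic is the same for all of these statistics; the minimal Python sketch below illustrates it for a t statistic, using simulated difference scores. The sample mean is expressed in standard-error units and compared with the t distribution expected if the null hypothesis were true; all numbers are made up.

```python
import numpy as np
from scipy import stats

# Simulated sample of 20 difference scores (e.g., post minus pre measurements)
rng = np.random.default_rng(0)
diffs = rng.normal(loc=0.5, scale=1.0, size=20)

# t statistic: how far the sample mean is from 0, in standard-error units
t_stat = diffs.mean() / (diffs.std(ddof=1) / np.sqrt(len(diffs)))

# Compare with the t distribution expected under the null hypothesis
p_value = 2 * stats.t.sf(abs(t_stat), df=len(diffs) - 1)
print(f"t = {t_stat:.2f}, two-tailed p = {p_value:.3f}")
```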
-
We transformed effect sizes into correlation coefficients whenever possible.
For the third indicator of replication success, the effect sizes of the original and replication studies, the authors calculate correlation coefficients to express effect sizes. In a single study, when the means of two groups are very different relative to their variability, the correlation coefficient will be close to 1, and when the means of the two groups are similar, the correlation coefficient will be close to 0.
The effect size of original studies was always coded as positive (values between 0 and 1). When the effect in the relevant replication study went in the same direction, the effect size was also coded as positive (values between 0 and 1), but when the effect in the replication went in the other direction, the effect size was coded as negative (values between -1 and 0).
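The exact conversion to a correlation depends on the original study's design; for a simple two-group comparison, one standard formula converts Cohen's d to r. The sketch below (with invented d values, not taken from the paper) shows that conversion together with the sign coding described above.

```python
import numpy as np

def d_to_r(d):
    """Convert Cohen's d to a correlation r (assumes two equal-sized groups)."""
    return d / np.sqrt(d**2 + 4)

# Hypothetical pair: original effect d = 0.60; replication effect d = -0.10,
# i.e., the replication went in the opposite direction.
r_original = d_to_r(0.60)       # originals are coded positive by convention
r_replication = d_to_r(-0.10)   # the negative sign marks the reversed direction

print(f"original r = {r_original:.2f}, replication r = {r_replication:.2f}")
```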
-
Cohen’s d
Cohen's d is a measure of the size of an effect, used to report the standardized difference between two means. By convention, it is used to judge whether an effect is roughly small (d ≈ 0.2), medium (d ≈ 0.5), or large (d ≈ 0.8).
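Here is a minimal sketch of how Cohen's d can be computed from the raw scores of two groups using a pooled standard deviation; the exam scores are invented for illustration.

```python
import numpy as np

def cohens_d(group1, group2):
    """Cohen's d: standardized difference between two group means."""
    g1, g2 = np.asarray(group1, float), np.asarray(group2, float)
    n1, n2 = len(g1), len(g2)
    pooled_var = ((n1 - 1) * g1.var(ddof=1) + (n2 - 1) * g2.var(ddof=1)) / (n1 + n2 - 2)
    return (g1.mean() - g2.mean()) / np.sqrt(pooled_var)

# Invented exam scores for a treatment and a control group
treatment = [78, 85, 90, 74, 88, 81]
control   = [70, 75, 80, 72, 77, 74]
print(f"d = {cohens_d(treatment, control):.2f}")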
-
Using only the nonsignificant P values of the replication studies and applying Fisher’s method (26), we tested the hypothesis that these studies had “no evidential value” (the null hypothesis of zero-effect holds for all these studies).
The first analysis run on the data assesses all replication studies that yielded nonsignificant results. By applying Fisher's method, the authors combined the p-values of these studies to test whether, taken together, they still contain evidence that some of the effects exist, or whether they are consistent with the null hypothesis that none of these effects exist in reality.
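Fisher's method simply sums the logarithms of the p-values and compares the result with a chi-square distribution. The sketch below applies it to a set of hypothetical nonsignificant replication p-values (not the project's actual values), both by hand and via SciPy's built-in implementation.

```python
import numpy as np
from scipy import stats

# Hypothetical nonsignificant replication p-values
p_values = [0.20, 0.45, 0.08, 0.60, 0.33]

# Fisher's method: chi2 = -2 * sum(ln p), with 2k degrees of freedom
chi2 = -2 * np.sum(np.log(p_values))
p_combined = stats.chi2.sf(chi2, df=2 * len(p_values))

# The same test via SciPy's built-in implementation
stat, p_scipy = stats.combine_pvalues(p_values, method="fisher")

print(f"chi2 = {chi2:.2f}, combined p = {p_combined:.3f} (scipy: {p_scipy:.3f})")
```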
-
Wilcoxon signed-rank test
The Wilcoxon signed-rank test is a statistical procedure used on paired data. The test computes the differences between paired data points, ranks these differences by size, and takes the direction of each difference into account by retaining its sign (+ or -). This allows an inference about whether the paired measurements differ systematically or not.
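The following short Python example runs the test on made-up paired values (again imagining each pair as an original p-value and its replication p-value); the data are purely illustrative.

```python
from scipy import stats

# Made-up paired p-values: one original and one replication per study
original    = [0.010, 0.030, 0.040, 0.020, 0.050, 0.001]
replication = [0.200, 0.040, 0.300, 0.010, 0.450, 0.080]

# Wilcoxon signed-rank test on the within-pair differences
w_stat, p_value = stats.wilcoxon(original, replication)
print(f"W = {w_stat:.1f}, p = {p_value:.3f}")
```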
-
Second, we compared the central tendency of the distribution of P values of original and replication studies using the Wilcoxon signed-rank test and the t test for dependent samples.
Moreover, the original and the replication studies are compared in terms of the p-values they yield: do the replication p-values tend to be similar to the original p-values, or are they systematically different?
-
central tendency
The central tendency of a distribution is captured by its central, or typical values. Central tendency is usually assessed with means, medians ("middle" value in the data) and modes (most frequent value in the data).
-
nominal data
Nominal data is the simplest form of data, since it implies no natural ordering between values. For instance, consider subject gender (male and female), which is a nominal variable: neither male nor female comes first, and neither male nor female is larger than the other.
-
McNemar test
McNemar's test is a statistical procedure for analyzing data that is measured on a nominal scale and where pairs of data points exist. In this example, we have pairs of data points when we consider that each original study and its replication belong together. The test assesses if the outcomes (proportion of significant vs. non-significant results) are the same in the original and the replication studies.
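To illustrate, the sketch below applies McNemar's test to an invented 2x2 table of paired outcomes (original significant or not vs. replication significant or not); the counts are hypothetical and do not come from the paper.

```python
from statsmodels.stats.contingency_tables import mcnemar

# Invented 2x2 table of paired outcomes:
# rows    = original study (significant, not significant)
# columns = replication study (significant, not significant)
table = [[35, 62],   # original significant
         [ 2,  1]]   # original not significant

result = mcnemar(table, exact=True)
print(f"statistic = {result.statistic}, p = {result.pvalue:.4f}")
```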
-
We tested the hypothesis that the proportions of statistically significant results in the original and replication studies are equal using the McNemar test for paired nominal data and calculated a CI of the reproducibility parameter.
Next, the authors conducted another test to find out whether the proportion of original studies that produced significant results was equal to or different from the proportion of replication studies that produced significant results.
-
confidence interval
A confidence interval is a range of values, calculated from the data, that is constructed so that if the experiment were repeated again and again, a certain proportion of the resulting intervals would contain the true value of the variable of interest. For a 95% confidence interval, that proportion is 95%. Confidence intervals are often referred to with the abbreviation "CI".
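As a small worked example, the sketch below computes a 95% confidence interval for a sample mean using the t distribution; the measurements are invented.

```python
import numpy as np
from scipy import stats

# Invented sample of 12 measurements
sample = np.array([4.1, 5.3, 4.8, 5.0, 4.6, 5.2, 4.9, 5.5, 4.4, 5.1, 4.7, 5.0])

mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))       # standard error of the mean
crit = stats.t.ppf(0.975, df=len(sample) - 1)        # two-sided 95% critical value

ci_low, ci_high = mean - crit * se, mean + crit * se
print(f"mean = {mean:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```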
-
However, original studies that interpreted nonsignificant P values as significant were coded as significant (four cases, all with P values < 0.06).
Here, the authors explain how they deal with the problem that some of the original studies reported results as significant, although in fact, they were non-significant. In each case, the threshold that is customarily set to determine statistical significance (p<0.05) was not met, but all reported p-values fell very close to this threshold (0.06>p>0.05). Since the original authors treated these effects as significant, the current analysis did so as well.
-
two-tailed test
A two-tailed test looks for a hypothesized relationship in two directions, not just one. For example, if we compare the means of two groups, the null hypothesis would be that the means are not different from each other. The alternative hypothesis for a two-tailed test would be that the means are different, regardless if the one is bigger or smaller than the other. For a one-tailed test, one would formulate a more specific alternative hypothesis, for instance that the mean of the first group is bigger than the mean of the second group.
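The made-up example below contrasts the two-tailed p-value of an independent-samples t test with the one-tailed p-value for the directional hypothesis that the first group's mean is larger; the simulated data are for illustration only.

```python
import numpy as np
from scipy import stats

# Simulated scores for two groups
rng = np.random.default_rng(1)
group1 = rng.normal(loc=5.5, scale=1.0, size=30)
group2 = rng.normal(loc=5.0, scale=1.0, size=30)

# Two-tailed test: "the means differ", in either direction
t_stat, p_two_tailed = stats.ttest_ind(group1, group2)

# One-tailed p-value for the directional hypothesis "mean of group1 > mean of group2"
p_one_tailed = stats.t.sf(t_stat, df=len(group1) + len(group2) - 2)

print(f"t = {t_stat:.2f}, two-tailed p = {p_two_tailed:.3f}, one-tailed p = {p_one_tailed:.3f}")
```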
-
cognitive psychology
Cognitive psychology is a subdiscipline of psychology that studies mental processes like perception, problem solving, attention or memory.
-
within-subjects designs
Within-subjects designs vary the predictor in question within each subject: each participant will complete all experimental procedures, in all different conditions. In contrast, between-subjects designs vary the predictor in question between the subjects: each participant completes only one experimental condition.
For example, if a study wanted to test how eating an apple or eating a banana impacts the performance in a subsequent math test, a within-subjects design would have all participants first eat one fruit and complete a test, and then eat the other fruit and complete an equivalent test. A between-subjects design would have half of the participants eat an apple and complete the test, and the other half of the participants eat a banana and complete the test. Some questions are better suited to be studied with a within-subjects design, others are better studied with a between-subjects design.
-
covaries
Covariation indicates how two variables change together, and is the basis needed to calculate a correlation.
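A tiny illustration with invented data: the covariance of two variables measures how they change together, and dividing by their standard deviations yields the correlation.

```python
import numpy as np

# Invented data: hours of sleep and test scores for six students
sleep  = np.array([5, 6, 7, 7, 8, 9])
scores = np.array([60, 65, 70, 72, 78, 85])

covariance = np.cov(sleep, scores)[0, 1]
correlation = np.corrcoef(sleep, scores)[0, 1]
print(f"covariance = {covariance:.2f}, correlation = {correlation:.2f}")
```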
-
predictors
Predictors are variables that could affect an outcome of interest.
-
null hypothesis
The null hypothesis is the assumption that a certain effect does not exist in reality, and that any observation of this effect in the data is due to chance.
-
citation impact
Citation impact refers to the importance of a finding, inferred from how much of the subsequent literature refers to and builds on it by citing the original paper.
-
functional magnetic resonance imaging
Functional magnetic resonance imaging is a procedure that detects the activity of areas in the brain by measuring blood flow.
-
macaques
Macaques are a type of monkey.
-
autism
Autism is a developmental disorder characterized by difficulties with social communication and interaction, as well as restricted and repetitive behaviors.
-
F test
An F-test is a statistical procedure that assesses whether the variances of two distributions are significantly different from each other.
-
t test
A t-test is a statistical procedure that assesses if the means of two distributions are significantly different from each other.
-
a priori
A priori means something was deduced or determined from theoretical considerations, before collecting data.
-
selection biases
Selection bias here refers to systematic error in the way studies are included or excluded in the sample of studies which would be replicated. An unbiased selection would be truly random, such that the sample of studies used for replication would be representative of the population of studies available.
-
false positive
A false positive is a result that erroneously indicates an effect exists: although the data suggests an effect exists, in reality, the effect does not exist.
-
false negative
A false negative is a result that erroneously indicates no effect exists: although the data do not suggest that an effect exists, in reality, this effect does exist.
-
bias
Bias refers to a systematic error or a process that does not give accurate results.
-
effect sizes
The size of an effect allows us to say whether an effect is big or small, compared to other effects.
-
effects
An effect is an observed phenomenon, where differences in one circumstance lead to observable differences in an outcome.
-
statistically significant results
Results are referred to as statistically significant when data as extreme as those observed would be very unlikely if the effect did not really exist; by convention, "very unlikely" means a probability below 5% (p < 0.05).
-