The Replication Paradox
Guest blog for The Replication Network by Michèle Nuijten
Lately, there has been a lot of attention to the excess of false positive and exaggerated findings in the published scientific literature. Many different fields report an impossibly high rate of statistically significant findings, and studies of meta-analyses in various fields have shown overwhelming evidence for overestimated effect sizes.
The suggested solution for this excess of false positive findings and exaggerated effect size estimates in the literature is replication. The idea is that if we just keep replicating published studies, the truth will come to light eventually.
This intuition also surfaced in a small survey I conducted among psychology students, social scientists, and quantitative psychologists. I offered them different hypothetical combinations of large and small published studies that were identical except for the sample size – they could be considered replications of each other. I asked them how they would evaluate this information if their goal was to obtain the most accurate estimate of a certain effect. In almost all of the situations I offered, the answer was nearly unanimous: combine the information of both studies.
This makes a lot of sense: the more information the better, right? Unfortunately this is not necessarily the case.
The problem is that the respondents forgot to take into account the influence of publication bias: statistically significant results have a higher probability of being published than non-significant results. And only publishing significant effects leads to overestimated effect sizes in the literature.
But wasn’t this exactly the reason to take replication studies into account? To solve this problem and obtain more accurate effect sizes?
Unfortunately, there is evidence from multi-study papers and meta-analyses that replication studies suffer from the same publication bias as original studies (see below for references). This means that both types of studies in the literature contain overestimated effect sizes.
The implication of this is that combining the results of an original study with those of a replication study could actually worsen the effect size estimate. This works as follows.
Bias in published effect size estimates depends on two factors: publication bias and power (the probability that you will reject the null hypothesis, given that it is false). Studies with low power (usually due to a small sample size) contain a lot of noise, and the effect size estimate will be all over the place, ranging from severe underestimations to severe overestimations.
This in itself is not necessarily a problem; if you took the average of all these estimates (e.g., in a meta-analysis), you would end up with an accurate estimate of the effect. However, if because of publication bias only the significant studies are published, only the severe overestimations of the effect will end up in the literature. An average effect size calculated from these estimates alone will be an overestimation.
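The mechanism is easy to see in a small simulation. Below is a minimal sketch (not from the paper; the numbers are illustrative assumptions): many underpowered two-group studies of a true standardized effect of 0.3 are simulated, and "publication" keeps only the studies whose effect reaches significance. The full set of estimates averages out to the truth, but the published subset does not.

```python
import math
import random
import statistics

random.seed(1)
true_d = 0.3             # assumed true standardized mean difference
n = 20                   # per-group sample size: low power for an effect this size
se = math.sqrt(2 / n)    # standard error of the mean difference (within-group sd = 1)

all_estimates, published = [], []
for _ in range(20000):
    a = [random.gauss(true_d, 1) for _ in range(n)]  # "treatment" group
    b = [random.gauss(0, 1) for _ in range(n)]       # "control" group
    d_hat = statistics.mean(a) - statistics.mean(b)  # estimated effect size
    all_estimates.append(d_hat)
    if d_hat / se > 1.96:                            # publication bias: keep only significant results
        published.append(d_hat)

print(round(statistics.mean(all_estimates), 2))  # close to the true 0.3
print(round(statistics.mean(published), 2))      # substantially above 0.3
```

With n = 20 per group, an estimate only reaches significance when it is roughly double the true effect, so the published average is inflated by construction.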
Studies with high power do not have this problem. Their effect size estimates are much more precise: they will be centered more closely on the true effect size. Even when there is publication bias, and only the significant (maybe slightly overestimated) effects are published, the distortion would not be as large as with underpowered, noisier studies.
Now consider again a replication scenario such as the one mentioned above. In the literature you come across a large original study and a smaller replication study. Assuming that both studies are affected by publication bias, the original study will probably have a somewhat overestimated effect size. However, since the replication study is smaller and has lower power, it will contain an effect size that is even more overestimated. Combining the information of these two studies then basically comes down to adding bias to the effect size estimate of the original study. In this scenario you would obtain a more accurate estimate of the effect by evaluating only the original study and ignoring the replication.
In short: a replication will always increase the precision of the effect size estimate (a narrower confidence interval around it), but if the replication's sample size is smaller than the original's, publication bias is present, and power is not high enough, it will also add bias.
There are two main solutions to the problem of overestimated effect sizes.
The first solution would be to eliminate publication bias; if there is no selective publishing of significant effects, the whole “replication paradox” would disappear. One way to eliminate publication bias is to preregister your research plan and hypotheses before collecting the data. Some journals will even review this preregistration, and can give you an “in principle acceptance” – completely independent of the results. In this case, studies with significant and non-significant findings have an equal probability of being published, and published effect sizes will not be systematically overestimated. Another way is for journals to commit to publishing replication results independent of whether the results are significant. Indeed, this is the stated replication policy of some journals already.
The second solution is to only evaluate (and perform) studies with high power. If a study has high power, the effect size will be estimated more precisely and be less affected by publication bias. Roughly speaking: if you discard all studies with low power, your effect size estimate will be more accurate.
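How large does a study need to be? A quick back-of-the-envelope calculation (a normal-approximation sketch, not a substitute for proper power software) shows why small samples are hopeless for modest effects:

```python
import math
from statistics import NormalDist

def power(true_d, n):
    """Approximate power of a two-group z-test at alpha = .05 (two-sided),
    per-group size n, within-group sd = 1; the small negative tail is ignored."""
    se = math.sqrt(2 / n)
    return 1 - NormalDist().cdf(1.96 - true_d / se)

print(round(power(0.3, 20), 2))   # well under 0.2: badly underpowered
print(round(power(0.3, 175), 2))  # around 0.8: conventionally adequate
```

For a true effect of d = 0.3, a study with 20 participants per group has power well below 20%, while reaching the conventional 80% takes roughly 175 participants per group.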
A good example of an initiative that implements both solutions is the recently published Reproducibility Project, in which 100 psychological effects were replicated in studies that were preregistered and highly powered. Initiatives such as this one eliminate systematic bias in the literature and advance the scientific system immensely.
However, until preregistered, highly powered replications are the new standard, researchers who want to play it safe should change their intuition from "the more information, the higher the accuracy" to "the more power, the higher the accuracy."
This blog is based on the paper "The replication paradox: Combining studies can decrease accuracy of effect size estimates" by Nuijten, van Assen, Veldkamp, and Wicherts (2015), Review of General Psychology, 19(2), 172-182.
Literature on How Replications Suffer From Publication Bias
- Francis, G. (2012). Publication bias and the failure of replication in experimental psychology. Psychonomic Bulletin & Review, 19(6), 975-991.
- Ferguson, C. J., & Brannick, M. T. (2012). Publication bias in psychological science: Prevalence, methods for identifying and controlling, and implications for the use of meta-analyses. Psychological Methods, 17, 120-128.