When Good Tools Fail: The Missing Step in Scientific Reform

June 16, 2025 by Iris Willigers

This blogpost was written by Anouk Bouma. Anouk's PhD project focuses on investigating the trustworthiness of simulation studies from a meta-scientific perspective and is supervised by Marcel van Assen, Robbie van Aert & Lieke Voncken.

Prologue
If you're a researcher, like me, you've probably had an experience that goes something like this: say you want to conduct a power analysis—perhaps for a mediation model—so you head to the internet in search of a suitable tool to help you. You dig through pages of search results, and eventually, you find something promising: a Shiny app that claims to do exactly what you need. Perfect! You click on it, eager to get started.

But then… the enthusiasm quickly fades.

You try to input the required information, only to be met with cryptic labels and unclear fields. You wanted to calculate the required sample size, but the box labeled “N” seems to be asking you to enter a sample size instead. Wait…what?

Still, you don't give up easily. So, you try out different inputs to see what they do. Eventually, some output appears. Actually, a lot of output appears. But instead of clarity, you're left staring at a wall of unexplained numbers, unsure what any of it means. Feeling frustrated, you close the tab and return to your search.

Sure, maybe you haven’t had the experience of being frustrated over Shiny apps for power analyses. Maybe for you, frustration came from trying to navigate a confusing preregistration template, the OSF website, or an R package with cryptic, underexplained arguments. Whatever the specific case, the underlying issue is the same.

This isn’t just a story about power analyses or my personal struggle with conducting them. It’s a story about how researchers, especially metascientists, are increasingly stepping into a role they haven’t been trained for: that of product designers. In this blog I will argue that metascientists could benefit from adopting a marketing perspective once in a while—especially when developing tools, guidelines, and reforms aimed at improving scientific practices.

A little bit of backstory
First, let me tell you a little bit about myself. I joined the Tilburg University Meta-Research Center as a PhD candidate in September 2024, almost ten months ago. My background, like many in our group, is primarily in psychology (having completed both a bachelor’s and a master’s in this field). But in contrast to my fellow lab mates, I also obtained a bachelor’s in marketing management (at a university of applied sciences, for those familiar with the Dutch education system).

For a long time, I didn’t think the marketing part of my education was particularly meaningful or had much impact on my way of thinking—especially in the context of my goal to become a metascientist.

It turns out I was wrong.

I’ve come to see how surprisingly relevant that background can be, especially when thinking about one of metascience’s key challenges: not just understanding how science works but actually improving it. Which brings us to the bigger picture.

The two goals of metascience
Metascience, at its core, has two goals: to study science, and to improve it (Ioannidis et al., 2015). When we're in "improvement" mode, we are often developing 'products' (like a preregistration template, a data-sharing guideline, or a statistical tool) designed for specific 'users' (e.g., researchers, journal editors, or institutions). And for our 'products' to succeed they need to be used as intended by the users they are developed for.

This mindset, focusing first and foremost on the needs and perspective of your user, was a core lesson in my marketing education. You don’t just build something and assume it solves the user’s problem. You talk to your target audience, observe how they work, and check whether your solution actually fits their needs and constraints.

As scientists, we mostly focus on the correctness of our tools and on whether they are based on sound empirical evidence. But when developing a product, our responsibility doesn’t end with a solution that’s technically correct or theoretically justified. It also includes ensuring that our solutions are received, understood, and applied effectively in real-world settings. If the tools or reforms we create are confusing, time-consuming, or poorly aligned with researchers’ workflows, they risk being ignored, rejected, or misused. That doesn’t just limit our impact; it directly undermines our goal of improving scientific practice.

An example from the metascientific simulation literature
Let me share an example from the topic of my own PhD project: a metascientific investigation of simulation studies. There’s a growing body of work documenting concerns about simulation-based method evaluations. This work raises important issues, such as how questionable research practices in simulation studies can undermine the validity of their conclusions (Pawel et al., 2024). Various metascientists have proposed solutions such as preregistration templates (Siepe et al., 2024), guidelines (e.g., Kelter, 2024), and improved reporting standards (Morris et al., 2019). In doing so, they seem to tick off both of metascience’s core boxes: they study how simulation research is currently done, and they offer ways to improve it.

These papers make valuable contributions to evaluating and strengthening the robustness of simulation studies. At the same time, this line of research could benefit from systematic usability testing of proposed solutions, as well as more attention to the needs and perspectives of the intended users. This might mean focusing on understanding the challenges researchers face when conducting simulation studies, the tools they currently rely on, and the barriers that may prevent them from adopting new practices.

We should be careful not to design a key without ever having seen the lock.

Let’s think about our users
So, what would it look like if we developed metascientific solutions from a user-centered perspective? Let’s go back to the example of power analyses for mediation models.

1. Define your goal
Be specific. What change do you want to see? The actual end-goal of a project is often not to develop a working tool, but to solve a certain problem. Clearly defining your goal will help you evaluate if that goal is met in later stages.

Example: “I want more applied researchers to conduct power analyses when they run mediation models.”

2. Identify your target audience
Who are you designing your solution for? What field are they in? What type of research do they conduct?

It’s tempting to assume that your audience is “all researchers” or even “researchers like me,” but this often leads to tools that are too broad or mismatched. A clear definition of your audience helps guide design decisions and determines who to involve in the next steps.

Example: “Researchers in applied clinical psychological science who regularly conduct mediation analyses.”

3. Get to know your audience
Once you know who your target audience is, get to know them. What tools do they use? What’s their level of expertise? What’s stopping them from doing what you want them to do? Lack of time? Confusion? Skepticism? Overwhelm?

Conduct user interviews, survey potential users, or observe how they work. Don't assume. Ask.

If your audience struggles with statistical jargon, then improving the readability of your solution’s instructions might matter more than adding new features.

If they prefer SPSS, then creating an R package, even a brilliant one, will be a mismatch. A Shiny app with a simple interface, or a decision tree embedded in a tutorial, might be more effective.

Example: “My target audience primarily uses SPSS and is familiar with mediation analysis but not necessarily with programming. They generally know how to decide on the inputs required to conduct a power analysis, but are unsure about conducting the analysis itself.”

4. Design, test, and iterate
Build a prototype. Then test it! Don’t just test it with your collaborators, but with actual members of your target audience.

Sit next to them as they use your solution (e.g., using the cognitive interviewing technique; Balza et al., 2022). Watch where they get stuck. Do they understand what each input field requires, what buttons to click, and where to find information? Can they interpret the output correctly? Do they need external help to use it?

You might also need to produce supplementary materials, such as short explainer videos, example use cases, and FAQ sheets, to help new users get started.

Use that feedback to revise. Don’t stop until most users can use your tool successfully on their own.

5. Plan for dissemination
Even the best solution won’t help anyone if nobody knows it exists. Think carefully about how and where to promote it.

Which journals are most relevant to your audience? What conferences do they attend? Are they active on social media? Do they go to workshops where you can introduce your solution?

And when you produce output to disseminate, make sure it fits the language and level of technicality that your audience expects.

Some wonderful examples
The good news is that some projects have already used a user-centered approach in developing their solutions. One strong example is Spitzer et al. (2024), who ran two consecutive studies to evaluate the usability of their preregistration template and researchers’ intention to use it. This allowed them to iteratively refine the template based on real user feedback. Another example comes from Haven et al. (2020), led by a member of our lab group, who conducted a Delphi study to explore what a preregistration template for qualitative research should look like by consulting qualitative researchers themselves.

Conclusion
In the end, this isn’t just about power analysis apps, preregistration templates, or simulation guidelines. It’s about how we, as metascientists, approach the challenge of improving science. If we want our tools and reforms to actually make a difference, we need to stop thinking like toolmakers and start thinking like product designers. That means being curious about the people we’re trying to help, understanding their workflows, testing our assumptions, and iterating based on their feedback, not just our theory.

My hope is that one day, when a researcher sits down to calculate the power for a mediation model, the process will be straightforward. Not confusing or frustrating. That they’ll have access to a tool that’s clear, intuitive, and designed with them in mind. Because when we remove unnecessary hurdles, we make space for better research.

References

Balza, J. S., Cusatis, R. N., McDonnell, S. M., Basir, M. A., & Flynn, K. E. (2022). Effective questionnaire design: How to use cognitive interviews to refine questionnaire items. Journal of Neonatal-Perinatal Medicine, 15(2), 345–349. https://doi.org/10.3233/NPM-210848

Haven, T. L., Errington, T. M., Gleditsch, K. S., van Grootel, L., Jacobs, A. M., Kern, F. G., Piñeiro, R., Rosenblatt, F., & Mokkink, L. B. (2020). Preregistering Qualitative Research: A Delphi Study. International Journal of Qualitative Methods, 19, 1609406920976417. https://doi.org/10.1177/1609406920976417

Ioannidis, J. P. A., Fanelli, D., Dunne, D. D., & Goodman, S. N. (2015). Meta-research: Evaluation and Improvement of Research Methods and Practices. PLoS Biology, 13(10), e1002264. https://doi.org/10.1371/journal.pbio.1002264

Kelter, R. (2024). The Bayesian simulation study (BASIS) framework for simulation studies in statistical and methodological research. Biometrical Journal, 66(1), 2200095. https://doi.org/10.1002/bimj.202200095

Morris, T. P., White, I. R., & Crowther, M. J. (2019). Using simulation studies to evaluate statistical methods. Statistics in Medicine, 38(11), 2074–2102. https://doi.org/10.1002/sim.8086

Pawel, S., Kook, L., & Reeve, K. (2024). Pitfalls and potentials in simulation studies: Questionable research practices in comparative simulation studies allow for spurious claims of superiority of any method. Biometrical Journal, 66(1), 2200091. https://doi.org/10.1002/bimj.202200091

Siepe, B. S., Bartoš, F., Morris, T. P., Boulesteix, A.-L., Heck, D. W., & Pawel, S. (2024). Simulation studies for methodological research in psychology: A standardized template for planning, preregistration, and reporting. Psychological Methods. Advance online publication. https://doi.org/10.1037/met0000695

Spitzer, L., Bosnjak, M., & Mueller, S. (2024). Testing the Usability of the Psychological Research Preregistration-Quantitative (PRP-QUANT) Template. Meta-Psychology, 8. https://doi.org/10.15626/MP.2023.4039

Multi-Stage Registration: A Better* Way to Pre-Register

May 16, 2025 by Iris Willigers

This blogpost was written by Cas Goos. Cas's PhD project focuses on studying and enhancing various interventions for improving scientific robustness at the journal level and is supervised by Michèle Nuijten and Jelte Wicherts.

Pre-registration is a well-known way for researchers to limit their ability to opportunistically analyze their data. I believe an alternative way to register can prove more versatile and realistic: multi-stage registration, where initial registrations can be updated transparently with minimal risk of bias.

The story of Reggie
While pre-registration has been a great step in the right direction, in practice things don’t always go as planned, and a researcher might by necessity have to deviate from the initial registration. But with each deviation, the pre-registration’s guarantee that data were not analyzed opportunistically becomes less and less convincing.

To illustrate, let’s say a researcher, Reggie, wants to test whether:

Mindfulness-based therapy’s effectiveness at reducing burnout is moderated by trait mindfulness.

Wanting to do things right, he pre-registers his hypotheses, experimental design, sample size of 100, and the moderation analysis. However, after Reggie completes his study and submits his article for review, a diligent reviewer notices some key differences from his pre-registration:

The sample size was larger than planned.
Several participants were removed.
Simple effects comparisons were added to further examine the statistically significant interaction effect.

Are the deviations cause for alarm? Will they ruin Reggie’s study as a confirmatory test? Well, that depends. First, on the extent of the deviation. Was the new sample size 104 or 140? Were 4 or 40 participants removed afterwards? Are the simple effects treated as confirmatory or not?

But perhaps even more important is when and why the deviation occurred. Were participants added before or after initial results were observed to be non-significant? Were the participants removed because of invalid data patterns observed before any analyses, or as “outliers” after initial results were not as expected? And were simple effect analyses included before or after Reggie knew the moderation was statistically significant?

Reggie, wanting to do good, reports all his deviations. Still, he can kiss the label “confirmatory test” goodbye. After all, if anyone could deviate as much as desired and declare it afterwards, pre-registrations wouldn’t limit opportunistic data manipulation or analyses. In Reggie’s case, both the additional data collection and the data removal happened before any analyses, and he would argue that neither was opportunistic. Unfortunately, people can only take his word for it, as standard pre-registration offers no way to transparently document the when, why, and how of deviations. Multi-stage registration does.

What is multi-stage registration?
Multi-stage registration, just like pre-registration, starts with a registration before the study is conducted. The researcher may leave parts open if those are challenging to predict without seeing the data first. The researcher then conducts their research in stages: for example, starting with data collection, then data cleaning, then hypothesis testing, and finally exploratory analyses. After completing a stage, the researcher can choose to revise subsequent stages based on what they can or should do with their data.

Multi-stage registration is comparable to incremental pre-registration as proposed elsewhere (Lindsay, Simons, & Lilienfeld, 2016; Waldron & Allen, 2022; Section 2.13 of the PCI Guide for Authors). However, the focus of incremental pre-registration lies on follow-up studies to an initially pre-registered study. Furthermore, to my knowledge, incremental pre-registration comes with few guiding principles on how to conduct it.

Regardless, you may still have concerns about multi-stage registration, after all…

Adjusting your analyses based on what you find in the data, that’s exactly what pre-registration is supposed to prevent!

That is a realistic risk, which is why the following guidelines are important:

Register the project’s content (writings, code, (meta)data, supplementary materials) before you start a stage.
Document each deviation from the registration in a publicly available logbook as you complete each stage.
If an analysis or test is meant as confirmatory, then the information available before registering the final version of that analysis or test should not be enough to predict the outcome reliably. (This is unfortunately a subjective judgement call, and it is therefore important to clarify this to your readers, so they can judge whether they agree.)

Hence, in our example, Reggie can start registration just as before. Then, after collecting 40 more participants than planned and removing 4 participants, he can record these deviations from his initial registration and document how, why, and when they occurred, all before registering his analysis for the final stage. The moderation analysis in that stage remains confirmatory, as long as all deviations from the initial registration up to that point were documented before results were known. The simple-effects analyses, however, were included after the moderation results were known and should therefore, as with pre-registration, be reported as exploratory.
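The staged logic of this example can be sketched in a few lines of Python. This is a minimal illustration, not an existing tool: the `LogEntry` and `Analysis` structures, the helper functions, and all dates are hypothetical and chosen only to mirror Reggie's case.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass
class LogEntry:
    """One documented deviation in the public logbook (hypothetical structure)."""
    logged_on: date
    stage: str
    what: str
    why: str


@dataclass
class Analysis:
    """An analysis and the date on which its final version was registered."""
    name: str
    registered_on: date


def label(analysis: Analysis, results_known_on: date) -> str:
    """Confirmatory only if the final version was registered before results were known."""
    return "confirmatory" if analysis.registered_on < results_known_on else "exploratory"


def logged_before_results(logbook: list[LogEntry], results_known_on: date) -> bool:
    """Check that every deviation was documented before any results were known."""
    return all(entry.logged_on < results_known_on for entry in logbook)


# Illustrative dates only; they are assumptions, not taken from the example above.
results_known_on = date(2025, 3, 10)
logbook = [
    LogEntry(date(2025, 2, 1), "data collection",
             "collected 40 more participants than planned", "reason recorded here"),
    LogEntry(date(2025, 2, 20), "data cleaning",
             "removed 4 participants", "reason recorded here"),
]
moderation = Analysis("moderation analysis", registered_on=date(2025, 3, 1))
simple_effects = Analysis("simple-effects comparisons", registered_on=date(2025, 3, 15))

print(logged_before_results(logbook, results_known_on))
print(label(moderation, results_known_on))
print(label(simple_effects, results_known_on))
```

The date comparison is of course only as trustworthy as the dates themselves, which is why the timestamps should come from a public, third-party record rather than from the researcher's own notes.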

Still, if we look critically, someone could claim that results were not known at the time of a deviation even when that is not true. However, I don’t believe that any registration’s goal is to make it impossible to lie or make a mistake. Instead, more realistic aims for registration are to make undeclared deviations run directly counter to the chosen workflow, to sufficiently address the consequences of these deviations, and to allow others to check this. If no deviations occur, pre-registration and multi-stage registration work essentially the same. When deviations do occur, however, pre-registration doesn’t come with a systematic way to document and manage them. By maintaining a logbook of the when and why of every deviation, multi-stage registration makes deviating at the correct stage an integral part of your workflow, with the logbook allowing others to check you.

I am game, so how do I start?
If you are interested, you may still be wondering how to conduct multi-stage registration, especially in a world where pre-registration (if any registration at all) is what journals, reviewers, and readers expect.

My advice: when in Rome, do as the Romans do.

If people expect pre-registration as part of a transparent research process, then pre-registration is what they will get. In practice this means that your first-stage registration will function as your pre-registration too. Therefore, you should also make sure to register the first stage on its own on a trusted third-party platform for pre-registration like the OSF, and make sure that it meets relevant pre-registration standards. You can then share your logbook alongside this registration as a supplement. This way you also showcase multi-stage registration within your pre-registration to those interested.

That covers satisfying the pre-registration demand, but how do you set up a multi-stage registration workflow and reap its unique benefits, when registries like the OSF, Zenodo, and clinicaltrials.gov consider updates to the registration an exception rather than the rule?

Currently, I believe we can use version control systems (VCSs) like Git for multi-stage registration. VCSs are often used in software development to maintain multiple (earlier) versions of software, track changes across these versions, document each version, and more. But VCSs are not limited to software; they can work for research projects too. To make it more intuitive, think of collaboratively writing an article. Instead of fumbling through documents like article_final_version.docx, article_final_final_version.docx, article_definitely_final_version.docx and so on, a VCS provides an automated system that stores all the versions of every file in a research project, with the changes across versions marked (like Track Changes in Word). VCSs are especially suitable for multi-stage registration because:

Old versions of project files can be stored. Different versions can even be labelled as “releases”, which we can use to label stages.
Changes across versions are tracked.
On platforms like GitHub, researchers can publicly share all their project files in a single repository.
Finally, and most importantly, updating a repository is permanent and timestamped, so it is not possible to go back and change what you had registered after the fact without deleting the entire repository.
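To give an intuition for why a chained, timestamped history makes after-the-fact edits detectable, here is a toy sketch in Python. It is not how Git is implemented in detail; it only mimics the idea that each version's identifier depends on its own content and on the identifier of the previous version. The function names and example entries are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def commit(history, content, message, timestamp=None):
    """Append a version whose ID depends on its content and on the previous ID."""
    parent = history[-1]["id"] if history else ""
    timestamp = timestamp or datetime.now(timezone.utc).isoformat()
    payload = f"{parent}|{timestamp}|{message}|{content}".encode("utf-8")
    history.append({
        "id": hashlib.sha256(payload).hexdigest(),
        "parent": parent,
        "timestamp": timestamp,
        "message": message,
        "content": content,
    })
    return history


def verify(history):
    """Recompute every ID; any edit to an earlier version breaks the chain."""
    parent = ""
    for version in history:
        payload = (
            f"{parent}|{version['timestamp']}|{version['message']}|{version['content']}"
        ).encode("utf-8")
        if version["parent"] != parent or hashlib.sha256(payload).hexdigest() != version["id"]:
            return False
        parent = version["id"]
    return True


history = []
commit(history, "planned sample size: 100", "stage 1: initial registration")
commit(history, "sample size deviation documented", "stage 2: logbook entry")
print(verify(history))  # the untouched chain verifies

history[0]["content"] = "planned sample size: 140"  # quietly rewrite the first registration
print(verify(history))  # the rewritten chain no longer verifies
```

In Git itself, commit identifiers play this role: changing an earlier version changes the identifier of every version that follows it, so a copy of the history held by a third party would no longer match.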

Below is a diagram overview of the VCS registration workflow:

Figure 1. Diagram of a multi-stage registration implemented through a VCS. After each contribution to the project, the changed files are uploaded to the VCS servers with all changes tracked, timestamped, and accompanied by a logbook for explanation. The start of each stage is indicated by a release label.

If you want to know how to implement Git in your workflow, you can look at the following page: https://happygitwithr.com/, or use the WORCS R Package, created by a member of our group: https://cjvanlissa.github.io/worcs/ to make reproducible R projects maintained with Git.

Finally, just like pre-registration, multi-stage registration takes time to learn and implement. But in the long run, it will not only make your research more transparent; tracked changes and a logbook can also be a lifesaver when you come back to a project.

The asterisk in the title
Throughout this blogpost, I have proposed multi-stage registration as a “better” alternative to pre-registration. But these are uncharted waters. To my knowledge, there is little to no evidence about the effectiveness of multi-stage registration or anything similar. I am, however, confident that deviations are almost inevitable in practice and that pre-registration provides little aid in navigating deviations within confirmatory research. It remains to be seen whether multi-stage registration will be more helpful, while not succumbing to a host of new problems. We will only learn, though, if enough of us try multi-stage registration. I know I will.

Acknowledgements
I want to give my thanks to Marcel van Assen and Ben Kretzler for their comments on the initial draft of this blogpost, and to my supervisors Michèle Nuijten and Jelte Wicherts for eliciting my interest in the topic.

References

Lindsay, D. S., Simons, D. J., & Lilienfeld, S. O. (2016). Research Preregistration 101. APS Observer, 29. https://www.psychologicalscience.org/observer/research-preregistration-101

Peer Community In. (n.d.). Guide for Authors. Retrieved April 8, 2025, from https://rr.peercommunityin.org/PCIRegisteredReports/help/guide_for_authors#h_22556996329061613309583773

Waldron, S., & Allen, C. (2022). Not all pre-registrations are equal. Neuropsychopharmacology, 47, 2181–2183. https://doi.org/10.1038/s41386-022-01418-x

Should meta-scientists hold themselves to higher standards?

March 19, 2025 by Iris Willigers

This blogpost was written by Michèle Nuijten. Michèle is an assistant professor of our research group who currently studies reproducibility and replicability in psychology. She is also the developer of the tool “statcheck”. This tool automatically checks whether reported statistical results are internally consistent.

As a meta-scientist, I research research itself. I systematically examine the scientific literature to identify problems and apply the scientific method to design and test solutions. Inherent to meta-science is that it can also include critique of other people’s research and advice on how to improve.

But what if meta-scientists don’t always follow the very best practices they promote?

This question came up recently when a high-profile meta-scientific paper was retracted due to misrepresentations of what had and hadn’t been preregistered. The backlash on social media included concerns that the authors had lost (at least some) credibility as advocates of responsible and transparent research.

That reaction stuck with me because it seemed to suggest that, for a meta-scientist to be credible and for their advice to be taken seriously, their own work must be flawless.

So I began to wonder: Should meta-scientists hold themselves to higher standards to maintain credibility? Should I? And how would it affect my own credibility as a meta-researcher if I dropped the ball somewhere along the way?

Dropping the ball

And I definitely did drop the ball. More than once.

For example, I discovered that my dissertation contained statistical reporting errors. The same dissertation in which I studied statistical reporting errors. And for which I developed statcheck, a tool I specifically designed to detect and prevent them.

In that same dissertation, I also examined bias in effect size estimates and advocated for preregistration: publishing a research plan before data collection to reduce analytical flexibility. But, as one of my opponents subtly pointed out during my defense, I hadn’t preregistered that study myself.

Criticizing the field for its high prevalence of statistical errors while making those same errors, or failing to preregister a study that promotes preregistration; what does that say about the validity of my recommendations? And what does it mean for my credibility as a meta-scientist?

Practice what you preach

Technically, you could make strong claims about the benefits of, say, data sharing without ever having shared a single data point yourself, but it’s not a great look.

To some extent, if you want your advice to be taken seriously, you need to practice what you preach. If you keep insisting that everyone should share their raw data, your credibility takes a hit if you never do it yourself. If you argue that all studies should have high statistical power, you’re probably less convincing if your own sample sizes are consistently tiny. And if you’re a vocal advocate for preregistration, it helps to have preregistered at least one of your own studies—if only to truly understand what you’re talking about.

Following your own advice doesn’t just strengthen your credibility; it also gives you firsthand experience with the challenges that can arise in practice. What looks great on paper isn’t always easy to implement. More importantly, by leading by example, you can show others what these practices look like in action and inspire them to follow suit. In this spirit, striving for the highest standards—whether through a flawless preregistration or a meticulously documented open dataset—makes perfect sense.

What is “perfect science”?

While I believe it’s important for meta-scientists to make a genuine effort to implement the changes they advocate, I don’t think it’s necessary—or even possible—to maintain a perfect track record.

One challenge is that best practices evolve over time. Early preregistrations, for example, seem rudimentary compared to today’s standards. What once qualified as “data sharing” may no longer meet current expectations. These shifts reflect a growing understanding of these practices: how they work in practice, their intended and unintended effects, when they are effective, and in some cases, whether they should be abandoned altogether. And both standards and practices are still evolving.

Another issue is that not all best practices are feasible for every project. Some conflict with each other (e.g., data sharing vs. privacy concerns), while others may be ethically or practically impossible in certain contexts (e.g., large sample sizes in rare populations). I’ve advocated for a wide range of best practices: open data and materials, preregistration, Bayesian statistics, using statcheck to detect errors, high statistical power, multi-lab collaborations, verification of claims through reanalysis and replication, and more. While I’ve applied all of these practices in at least some of my work, I haven’t (and often couldn’t) apply them all at once. And that’s just the best practices I’ve personally written about; the broader literature contains many more good recommendations.

Credibility of claims, not people

So what does it mean for the credibility of meta-scientists if their own projects can’t even adhere to all best practices? In my view, that’s not the right question to ask. In science, what matters most is not the credibility of a person, but the credibility of a claim. If a claim—whether in applied science or meta-science—is built on shaky evidence, it should carry less weight. The key is to evaluate which best practices are essential for assessing the trustworthiness of a claim and which are less relevant.

For example, the statistical reporting error in my dissertation appeared in a secondary test, buried in a footnote, and had no impact on the main findings. Missing it was an unfortunate oversight, but it didn’t weaken my core claim that a large share of published papers contain reporting errors. On the other hand, one could argue that the claims from my unregistered study should be interpreted with more caution than if it had been preregistered.

Shared standards for good science

At its core, meta-science is simply science, just like applied research. The fundamental principles are the same; the main difference is that our "participants" are often research articles, and our recommendations aim not at populations of patients, adolescents, or consumers, but at the scientific process itself.

With that in mind, both meta-researchers and applied researchers should strive to follow best practices as they currently stand and as they are feasible for their projects. But perfection isn’t the goal. What matters most is transparency: acknowledging when best practices weren’t followed, explaining why, and adjusting how claims are interpreted accordingly. After all, the credibility of a claim should rest on the strength of its evidence, not on whether the person making it has a spotless record.

Science is always evolving, and so are its standards. What matters is not stubbornly trying to check all the boxes of an ever-changing checklist, but a commitment to honesty, critical reflection, and continuous improvement. The best way to build trust in meta-science, or any science, is not to appear flawless, but to openly engage with the challenges, trade-offs, and limitations of our own work. Good science isn’t about being perfect; it’s about being transparent, adaptable, and striving to do better.

Low(er) precision of estimators of selection models: intuition and a preliminary analysis

February 14, 2025 by Iris Willigers

This blogpost was written by Marcel van Assen. Marcel is a professor in our research group whose research focuses on statistical methods for combining studies, publication bias, questionable research practices, fraud, reproducibility, and improving pre-registration and registered reports.

Estimates of random-effects meta-analysis are negatively affected by publication bias, generally leading to overestimation of effect size. Selection models address publication bias and can yield unbiased estimates of effects, but the precision of their estimates is lower than that of random-effects meta-analysis. Why and by how much precision is lowered has been unclear, and here I provide an intuition and an analysis to preliminarily answer these questions. If you are not interested in the details of the analysis, please go directly to the conclusions below.

1. Accurately estimating effect size with selection models in the context of publication bias

Meta-analysis is used to statistically synthesize information of different studies on the same effect. Each of these studies yields one or more effect sizes, and random-effects meta-analysis combines these effect sizes into one estimate of the average true effect size and an estimate of the variance or heterogeneity of the effect size.

A well-known and serious problem of both the scientific literature and meta-analyses is publication bias. Typically, publication bias amounts to the overrepresentation of statistically significant findings in the literature, or alternatively, the underrepresentation of nonsignificant findings. Because of publication bias, meta-analyses generally overestimate the true effect size, and in case of a true zero effect size publication bias likely results in a false positive. Both these consequences are undesirable; we need an accurate estimate of the effectiveness of an intervention for adequate cost-benefit analyses, and implementing interventions that do not work is very harmful.

One solution to the problem is applying meta-analysis using models that account for the possible effects of publication bias, for example selection models (e.g., Hedges & Vevea, 2005), including p-uniform* (van Aert & van Assen, 2025). Characteristic of these models is that they categorize effect sizes into at least two different intervals, and that estimation occurs by treating the intervals independently. In the most parsimonious variant of these models, two intervals of effect sizes are distinguished, for instance “effect sizes statistically significant at p < .025, right-tailed” and “other effect sizes”. The critical assumption in all these models is then that the probability of publication of effect sizes within one interval is constant but may differ across intervals. Indeed, if this assumption holds, and all intervals contain at least some effect sizes, and there is publication bias, selection models and p-uniform* accurately estimate average effect size as well as heterogeneity of effect size (e.g., Hedges & Vevea, 2005; van Aert & van Assen, 2025). Problem solved?

2. Accurate estimation with selection models comes at a price: less precision

The price that we pay for accurate estimation with selection models is lower precision of the estimates. The following two examples help to decide whether we are willing to pay this price. The field of both examples is plagued with publication bias. Hence it can be expected that random-effects meta-analysis overestimates effect size.

In example A the random-effects meta-analysis yields an estimated Hedges’ g = 0.7 with SE = 0.1, whereas a selection model yields an estimate equal to 0.6 with SE = 0.2. Both models strongly suggest that the true effect is positive, although the selection model’s estimate is somewhat lower and considerably less precise.

In example B we obtain Hedges’ g = 0.25 with SE = 0.1 using random-effects meta-analysis and with a selection model we obtain estimate -0.12 with SE = 0.3. Here, random-effects meta-analysis suggests a small and positive true non-zero effect size (z = 2.5, p = .012), whereas the selection model doesn’t lead to a rejection of the null hypothesis.
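The test statistics behind these two examples can be reproduced with a few lines of R. This is only a sketch: it assumes a simple Wald z-test with a two-sided p-value, which matches the reported z = 2.5 and p = .012, and the vector names are made up for illustration.

```r
# Estimates and standard errors as reported in examples A and B
est <- c(A_random_effects = 0.70, A_selection = 0.60,
         B_random_effects = 0.25, B_selection = -0.12)
se  <- c(0.1, 0.2, 0.1, 0.3)

z <- est / se             # Wald z statistic for H0: effect size = 0
p <- 2 * pnorm(-abs(z))   # two-sided p-value under a standard normal

print(round(data.frame(estimate = est, se = se, z = z, p = p), 3))
```

Only the selection model in example B gives a z-value that is far from the conventional significance threshold.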

The performance of estimators is evaluated using different criteria. One criterion is bias. Concerning bias, selection models outperform random-effects meta-analysis. But as we also prefer precision, we may use another criterion that combines bias and precision. A criterion that does that is the mean squared error (MSE),

$$\mathrm{MSE}(X) = E\left[(X - \mu)^2\right] = \mathrm{Var}(X) + \left(E(X) - \mu\right)^2, \tag{1}$$

with X being the estimator of parameter µ, Var(X) its sampling variance, and the squared bias equal to (E(X) - µ)².

Let’s apply the MSE to both our examples. For example A, assume that µ = .6 and that random-effects meta-analysis overestimates µ by 0.1 because of publication bias. Then, the MSE of random-effects meta-analysis equals .1² + .1² = .02, and the MSE of the unbiased selection model equals .2² + 0² = .04, with random-effects meta-analysis being the “winner” with the lowest value of MSE. Concerning example B, assume that µ = .2 and the bias equals .2. Then, MSE = .1² + .2² = .05 for the regular meta-analysis, whereas it equals .3² + 0² = .09 for the selection model, with again random-effects meta-analysis being the “winner”.
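The arithmetic above is easy to check in R. The helper `mse()` below is a made-up name for illustration; it simply adds the squared standard error (the sampling variance) and the squared bias, using the values assumed in the text.

```r
# MSE = sampling variance + squared bias
mse <- function(se, bias) se^2 + bias^2

mse_values <- c(
  A_random_effects = mse(se = 0.1, bias = 0.1),
  A_selection      = mse(se = 0.2, bias = 0.0),
  B_random_effects = mse(se = 0.1, bias = 0.2),
  B_selection      = mse(se = 0.3, bias = 0.0)
)
print(mse_values)
```

In both examples the smaller MSE belongs to random-effects meta-analysis, because the squared bias is smaller than the extra sampling variance of the selection model.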

I, however, want to argue that the MSE is not a good criterion to evaluate the performance of estimators of meta-analytic effect size in the context of publication bias. Where it is relatively inconsequential to pick one estimator over the other in example A, as both point at a positive effect of considerable size, it is surely consequential in cases like example B. If µ = 0, random-effects meta-analysis overestimates effect size with type I error rates (another relevant performance criterion) close to 1, rather than α, in case of publication bias (e.g., Carter et al., 2019); it provides a precise but very wrong estimate, leading to harmful conclusions about the effectiveness of interventions. Thus, the precision of estimators is important, but their accuracy is sometimes (much) more important. Perhaps we should use another performance criterion, MSE_MA, for effect size estimators in meta-analysis, such as

$$\mathrm{MSE}_{MA}(X) = \left(a + b\mu^2\right)\mathrm{Var}(X) + \left(E(X) - \mu\right)^2, \tag{2}$$

with a and b being nonnegative constants. Parameter a signifies the extent to which precision is taken into account in the calculation of MSE anyway, whereas b signifies how much precision is taken into account depending on true effect size. For a = 1 and b = 0, MSE_MA = MSE. Consider MSE_MA with a = 0 and b = 4. Then, for µ = .5 it holds that MSE_MA = MSE, but precision gets less emphasis relative to accuracy for µ < .5, with only accuracy being relevant for µ = 0. For the latter MSE_MA, the random-effects estimator is preferred in example A (1.44 × 0.1² + 0.1² = 0.0244 versus 1.44 × 0.2² = 0.0576), but the selection model estimator is preferred in example B (0 × 0.1² + 0.2² = 0.04 versus 0 × 0.3² + 0² = 0). I argue that more research is needed in developing sensible alternatives to MSE in the context of meta-analysis with publication bias. Estimators belonging to class MSE_MA may be a start in that direction.
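The weighted criterion can be sketched in R as well. The sketch assumes the criterion has the form (a + bµ²) × variance + squared bias, which is the form implied by the worked numbers in the text (a = 1, b = 0 gives the MSE; a = 0, b = 4 gives a weight of 1.44 at µ = .6). The function name `mse_ma()` is made up for illustration.

```r
# Weighted MSE: precision counts more as the true effect mu moves away from 0
mse_ma <- function(se, bias, mu, a, b) (a + b * mu^2) * se^2 + bias^2

# Example A with mu = 0.6, a = 0, b = 4 (precision weight 4 * 0.36 = 1.44)
A_re  <- mse_ma(se = 0.1, bias = 0.1, mu = 0.6, a = 0, b = 4)
A_sel <- mse_ma(se = 0.2, bias = 0.0, mu = 0.6, a = 0, b = 4)

# Second example: the text's calculation uses a precision weight of 0,
# which corresponds to mu = 0 under this form of the criterion
B_re  <- mse_ma(se = 0.1, bias = 0.2, mu = 0, a = 0, b = 4)
B_sel <- mse_ma(se = 0.3, bias = 0.0, mu = 0, a = 0, b = 4)

print(c(A_re = A_re, A_sel = A_sel, B_re = B_re, B_sel = B_sel))

# With a = 1 and b = 0 the criterion reduces to the ordinary MSE
print(mse_ma(se = 0.1, bias = 0.1, mu = 0.6, a = 1, b = 0))
```

Under this weighting the preferred estimator switches from random-effects meta-analysis in the first example to the selection model in the second.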

3. Why the selection model’s effect size estimate is less precise: an intuitive explanation

To my knowledge, no intuition has been provided and no examination has been conducted of the reasons for the lower precision of the selection model’s estimate. In this section I hope to provide some intuition; in the next section, the results of a preliminary analysis.

Consider a random draw of four observations from a normal distribution with mean 0 and standard deviation 1:

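set.seed(38)        # seed used for the values reported below
draws <- rnorm(4)   # four draws from N(0, 1)
x <- sort(draws)    # sorted in ascending order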

I initially selected seed 37, but with that seed I did not end up with two positive and two negative observations. Hence, I increased the seed by one, to obtain the following values of x:

-1.0556027 -0.2535911  0.0251569  0.6864966

Now, let us think about estimating µ, assuming a normal distribution with a variance equal to 1, that is, N(µ,1). We can estimate µ using regular maximum likelihood estimation or using the approach of a selection model with two intervals, say one below and one above x = 0. For illustrative purposes, rather than estimating µ, we compare the fit or likelihood of the data for three values of µ (-1, 0, 1) for both approaches (regular, selection model).

Figure 1 shows the four x-values and their likelihoods for the three values of µ for the regular approach, which are also presented in the following table:

x	f(mu = 0)	f(mu = -1)	f(mu = 1)
-1.0556027	0.2285302	0.39832606	0.04823407
-0.2535911	0.3863186	0.30194764	0.18182986
0.0251569	0.3988161	0.23588477	0.24805667
0.6864966	0.3151907	0.09622425	0.37981130

Figure 1: Likelihoods of four x-values (vertical lines) for models N(-1,1) in red, N(0,1) in green, N(1,1) in blue.

Note that particularly x = .686 is unlikely under µ = -1 (the red curve), and x = -1.056 is unlikely under µ = 1 (the blue curve). Because of this, the likelihood of all four observations (i.e., simply the product of all four observations’ likelihoods) is much higher for µ = 0 than for the other two values. The likelihood ratio for µ = 0 compared to µ = -1 is 4.07, and 13.43 compared to µ = 1. To conclude, we have clear evidence in favor of µ = 0 relative to these two other values of µ.

This result is not surprising. We know from standard statistical theory that the standard error of the mean equals σ/√N = 1/√4 = 0.5, meaning that the other values of µ are two units of standard error away from the true value of µ = 0. Hence it would be rather unlikely to obtain strong evidence in favor of a wrong model in this case.
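The densities and likelihood ratios for the regular approach can be recomputed from the four reported x-values. This is a minimal sketch; the x-values are typed in from the text rather than regenerated, and the object names are made up for illustration.

```r
x  <- c(-1.0556027, -0.2535911, 0.0251569, 0.6864966)  # values reported above
mu <- c(0, -1, 1)

# Rows: observations; columns: density of each x under N(mu, 1)
dens <- sapply(mu, function(m) dnorm(x, mean = m, sd = 1))
colnames(dens) <- paste0("f(mu = ", mu, ")")
print(cbind(x, dens))

lik <- apply(dens, 2, prod)   # joint likelihood of the four values under each mu
lr  <- lik[1] / lik[-1]       # mu = 0 versus mu = -1 and versus mu = 1
print(round(lr, 2))
```

The two ratios printed at the end correspond to the 4.07 and 13.43 mentioned in the text.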

Let us now consider the likelihood of the same data under a selection model based on two intervals, one for negative and one for positive values of x. In this approach the likelihood is considered for each interval independently. In p-uniform* this means that an observation’s likelihood is conditional on the probability of being in that interval, given the parameters. This means that the likelihoods of the two positive observations are divided by P(X > 0), which is .841, .5, .159 for µ equal to 1, 0, -1, respectively, and that the likelihoods for the two negative observations are divided by P(X < 0), or .159, .5, .841. As the two conditional densities together integrate to 2, and we want to compare the likelihoods to those under the regular approach, without loss of generality we divided the resulting likelihoods by 2 to obtain:

x	f(mu = 0)	f(mu = -1)	f(mu = 1)
-1.0556027	0.2285302	0.2367199	0.1520090
-0.2535911	0.3863186	0.1794435	0.5730345
0.0251569	0.3988161	0.7433878	0.1474168
0.6864966	0.3151907	0.3032495	0.2257168

These densities or likelihoods are also shown in Figure 2.

Figure 2: Likelihoods of four x-values (vertical lines) for models N(-1,1) in red, N(0,1) in green, N(1,1) in blue, under selection model p-uniform* based on two intervals (until and from 0).
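The conditional densities in the table can be recomputed as follows. This is a sketch under the description given in the text: each density is divided by the probability of the interval the observation falls in (below or above 0) and then halved. The function name `cond_dens()` is made up for illustration, and the columns are ordered as µ = 0, -1, 1, which is the order in which the computed values match the table.

```r
x  <- c(-1.0556027, -0.2535911, 0.0251569, 0.6864966)  # values reported above
mu <- c(0, -1, 1)

# Density of x conditional on its interval, halved so that the densities of
# the two intervals together integrate to 1
cond_dens <- function(x, m) {
  p_interval <- ifelse(x < 0,
                       pnorm(0, mean = m),                      # P(X < 0)
                       pnorm(0, mean = m, lower.tail = FALSE))  # P(X > 0)
  dnorm(x, mean = m) / p_interval / 2
}

tab <- sapply(mu, function(m) cond_dens(x, m))
colnames(tab) <- paste0("f(mu = ", mu, ")")
print(round(cbind(x, tab), 7))
```

For µ = 0 both interval probabilities equal .5, so dividing by .5 and then by 2 leaves the regular density unchanged.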

Note that the shape of the density for µ = 0 is unaffected in Figure 2 and equal to that in Figure 1, whereas the density for below (above) 0 is “inflated” for the normal model with µ = 1 (µ = -1). Note that “inflation” is misleading; µ is estimated merely based on observations in these two independent intervals.

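Computing the likelihood ratio from the values in this table for µ = 0 compared to µ = -1 yields 1.16, and 3.83 compared to µ = 1. That is, the evidence in favor of µ = 0 has all but vanished relative to µ = -1 (from 4.07 to 1.16) and has weakened considerably relative to µ = 1 (from 13.43 to 3.83). This is remarkable, as the smallest of the four observations is just below -1 (i.e., -1.056) and the other three values are well above -1, yet µ = -1 now fits almost as well as µ = 0.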

The example serves well to provide an intuition of why the effect size estimator of the selection approaches is less precise than under the regular approach. Foremost, recall that p-uniform*’s estimate of µ is unbiased, although in this example p-uniform*’s estimate of µ will be further off the mark than the regular estimate. The estimate is (very) imprecise for two related reasons. First, information on the likelihood of an observation in an interval is lost, or not considered anymore. For instance, the fact that the probability of having an observation below 0 only equals .159 if µ = 1 no longer enters the likelihood calculations. Note that not considering this probability is good if it is incorrect, as it is incorrect to consider the regular density in case of publication bias; in our example it is suboptimal, as there is no selection bias.

Second, as the two intervals are smaller than the complete interval, the likelihood function of these intervals is more sensitive to changes in the values of the observations, which also leads to less precision. For instance, consider the two positive observations in our example. As seen from Figure 2, x = 0.025 is most likely under µ = -1, whereas x = 0.686 is about equally likely under all three models. As the two positive observations are, by chance, close to 0, they pull the estimate of µ towards (very) negative values. That is, conditional on x > 0, the x-values are more likely to be just above 0 for strongly negative values of µ than for positive values of µ. For instance, for µ = -2, the likelihood for both positive observations equals 1.128 × 0.238 = 0.268, which is higher than the likelihood for µ = -1 (0.743 × 0.303 = 0.225). The likelihood thus increases, although the probability of an observation in this interval decreases from .159 (for µ = -1) to .023 (for µ = -2). But note that the probability of an observation in this interval is ignored.
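The comparison between µ = -1 and µ = -2 for the two positive observations can be checked with a short sketch, using the same scaling as before (conditional density divided by 2). The helper name `cond_pos()` is made up for illustration.

```r
x_pos <- c(0.0251569, 0.6864966)   # the two positive observations

# Density conditional on X > 0, halved as in the table above
cond_pos <- function(x, m) {
  dnorm(x, mean = m) / pnorm(0, mean = m, lower.tail = FALSE) / 2
}

mu_values <- c(-1, -2)
lik_pos <- sapply(mu_values, function(m) prod(cond_pos(x_pos, m)))
p_pos   <- pnorm(0, mean = mu_values, lower.tail = FALSE)  # P(X > 0), ignored by the model

print(round(rbind(mu = mu_values, likelihood = lik_pos, p_interval = p_pos), 3))
```

The likelihood of the two positive values goes up when µ moves from -1 to -2, while the probability of observing a positive value at all drops sharply.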

4. How much does precision of estimators decrease in selection models?

We conducted a small simulation study to examine the precision of estimators for μ and σ² of selection models, relative to the precision of a regular model. We again used the N(0,1) distribution as in our example, and estimated both the mean and variance of the distribution based on N = 10, 100, 1,000, or 100,000 observations and the following selection models, which vary both the number of intervals and the positioning of the intervals:

2_eq: (-∞, 0) and [0, ∞), each with probability 1/2
3_eq: (-∞, -0.43), [-0.43, 0.43), [0.43, ∞), each with probability 1/3
4_eq: (-∞, -0.67), [-0.67, 0), [0, 0.67), [0.67, ∞), each with probability 1/4
2_un: (-∞, 1.96), [1.96, ∞), with probability .025 in the last interval
3_un: (-∞, -1.96), [-1.96, 1.96), [1.96, ∞), with probability .025 in the first and last interval
4_un: (-∞, -1.96), [-1.96, 0), [0, 1.96), [1.96, ∞), splitting the middle interval of 3_un at 0

The “_eq” scenarios create equally large intervals with respect to the expected number of observations, whereas the “_un” scenarios correspond to selection models with unequal intervals with regions for statistical significance.
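The cut points of the six scenarios follow from standard normal quantiles, which can be verified with the sketch below. The list names mirror the scenario labels; everything else is made up for illustration and is not taken from the simulation code.

```r
# Cut points of each scenario, expressed as quantiles of N(0, 1)
cuts <- list(
  "2_eq" = qnorm(1 / 2),
  "3_eq" = qnorm(c(1, 2) / 3),
  "4_eq" = qnorm(c(1, 2, 3) / 4),
  "2_un" = qnorm(0.975),
  "3_un" = qnorm(c(0.025, 0.975)),
  "4_un" = c(qnorm(0.025), 0, qnorm(0.975))
)

# Probability of each interval under N(0, 1), the data-generating model
probs <- lapply(cuts, function(b) diff(pnorm(c(-Inf, b, Inf))))

print(lapply(cuts, round, 2))
print(lapply(probs, round, 3))
```

Under the null model the "_eq" scenarios spread the observations evenly, whereas the "_un" scenarios leave only 2.5% of the observations in each tail interval.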

See here for the code that my colleague Robbie van Aert from the meta-research center wrote for this small simulation study, and for all the resulting tables with results as well. Here I only briefly discuss the most important results.

First, the parameters could not be estimated well for three and four intervals, in case of N=10. Hence selection models with more than two intervals are not recommended in case of a small number of observations. Clear guidelines on data requirements for selection models still need to be developed, based on research as presented here.

Second, precision of estimators decreases in the number of intervals, and precision is lower in the scenarios with equal intervals. The table below shows the ratio of the sampling variance of the selection model and of a regular model (1/N) for estimating µ.

	2_eq	2_un	3_eq	3_un	4_eq	4_un
n = 100	2.906	1.281	4.711	1.462	7.527	4.701
n = 1,000	2.887	1.132	5.099	1.256	7.589	4.439
n = 100,000	2.767	1.193	5.153	1.302	7.324	3.911

For instance, in scenario 2_eq the variance of the estimate of µ is a bit less than 3 times as large (2.767 for N = 100,000) as the variance under a regular model with one interval. Note that precision is not much worse for two or three unequal intervals. However, data were simulated under the null hypothesis, resulting in almost all observations (95% or 97.5%) ending up in the largest interval. This may not occur in an application where the null is false; hence the estimator’s precision can be expected to be lower in most applications.
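To get a feel for where a ratio of this size comes from, here is a small stand-alone sketch of the 2_eq scenario. It is not the simulation code referred to above and it simplifies in one important way: the standard deviation is fixed at 1 and only µ is estimated. The function name `loglik_sel()`, the seed, and the numbers of observations and replications are all made up for illustration.

```r
set.seed(1)
N    <- 100    # observations per simulated data set
reps <- 2000   # number of simulated data sets

# Log-likelihood of mu in the two-interval model (cut point 0, sd fixed at 1):
# each observation's density is conditioned on the interval it falls in
loglik_sel <- function(mu, x) {
  n_pos <- sum(x >= 0)
  n_neg <- length(x) - n_pos
  sum(dnorm(x, mean = mu, log = TRUE)) -
    n_neg * pnorm(0, mean = mu, log.p = TRUE) -
    n_pos * pnorm(0, mean = mu, lower.tail = FALSE, log.p = TRUE)
}

est <- replicate(reps, {
  x <- rnorm(N)
  c(regular   = mean(x),   # regular maximum likelihood estimate
    selection = optimize(loglik_sel, interval = c(-5, 5), x = x,
                         maximum = TRUE)$maximum)
})

# Sampling variance of the selection estimate relative to 1/N
var_ratio <- var(est["selection", ]) / (1 / N)
print(var_ratio)
```

Under these simplifying assumptions the large-sample ratio should be about 1/(1 - 2/π) ≈ 2.75, because the information about µ that remains within an interval is the variance of a half-normal distribution. That is in the neighbourhood of the 2_eq column of the table.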

The next table shows that the precision of the estimator of the variance parameter σ² also suffers from estimation in intervals, but compared to the estimation of µ (i) precision suffers less, and (ii) precision suffers most in the case of unequal intervals.

	2_eq	2_un	3_eq	3_un	4_eq	4_un
n = 100	1.048	1.437	1.358	2.567	1.743	2.581
n = 1,000	1.052	1.547	1.391	2.292	1.722	2.292
n = 100,000	1.063	1.507	1.316	2.162	1.602	2.163

5. Conclusions

Selection models, including p-uniform*, provide unbiased estimates of µ and τ² in the context of publication bias, as opposed to random-effects meta-analysis. However, selection models need sufficient data, and currently there are no clear guidelines concerning data requirements.

Accurate estimation comes at a price of lower precision, particularly for estimating µ. As precision also suffers from adding intervals to the model, intervals should only be added in case of a strong suggestion of differential publication for these intervals.

More research is needed on how much the precision of estimators of selection models suffers from adding one or more intervals to the model. This is important as we want to select the appropriate estimators for meta-analysis, estimators that balance accuracy and precision. When examining this balance, a better performance measure than the MSE should be developed, for instance from the MSE_MA family in (2).

References

Carter, E. C., Schönbrodt, F. D., Gervais, W. M., & Hilgard, J. (2019). Correcting for bias in psychology: A comparison of meta-analytic methods. Advances in Methods and Practices in Psychological Science, 2(2), 115–144. https://doi.org/10.1177/2515245919847196

Hedges, L. V., & Vevea, J. L. (2005). Selection method approaches. In H. R. Rothstein, A. J. Sutton, & M. Borenstein (Eds.), Publication bias in meta-analysis: Prevention, assessment, and adjustments. Chichester, UK: Wiley.

van Aert, R. C. M., & van Assen, M. A. L. M. (2025). Correcting for publication bias in a meta-analysis with the p-uniform* method. Manuscript submitted for publication. https://doi.org/10.31222/osf.io/zqjr9

A Brief Overview of Spin: The Twists and Turns of Scientific Writing

February 04, 2025 by Iris Willigers

This blogpost was written by Tijn van Hoesel. Tijn is a PhD student of our meta-research group and started his PhD in September 2024. During his PhD, he will be working on investigating the impact of spin and other reporting practices in scientific research with his supervisors Marjan Bakker and Bennett Kleinberg.

We have all been there: you are reading an abstract describing an interesting study that seems very convincing and has found some promising and (of course, most importantly!) significant results. However, after having read the rest of the paper, it all seems a lot less convincing, promising, and significant. Maybe the abstract only states the significant results, while in the full text five more outcomes are described for which no significant effect was found. Or maybe, after reading the sample details, you realize that the recommendations for practice stated in the abstract are not as ‘widely applicable’ as they are made out to be. Either way, it seems like you have just fallen victim to spin.

The word ‘spin’ in a social and behavioural context is commonly associated with the world of politics and its interaction with the media (Gaber, 1999; Grattan, 1998). Spin, in the political context, can generally be defined as “a favourable bias” (Andrews, 2006, p. 32). Moreover, spin can be seen as a part of propaganda and as a conscious, deliberate strategy of communication applied to achieve a certain goal (Macnamara, 2022). Often, in politics and public communication, the goal is to influence public opinion about a given situation/event, topic, person, or organization. Usually, the person who puts a favourable bias on the information (i.e., ‘spins’ the information) is referred to as a spin doctor. They may use various spin strategies like cherry picking, misrepresenting facts/numbers/quotes, presenting speculations as facts, burying bad news with other news, or reporting only to like-minded journalists.

Spin in Scientific Writing
Although better known in politics, the use of spin is a communication strategy that may be applied, whether deliberately or not, by people in all kinds of contexts. One such context, in which proper communication of information is crucial, is scientific research. Allegedly, the first mention of spin in scientific writing was in a paper by Horton (1995), who described the use of hyperbole and “the conscious and unconscious tricks of authorial rhetoric” (p. 985) in scientific papers. More specifically, Horton mentions “the manipulation of language to convince the reader of the likely truth of a result” (p. 985). In his paper, Horton breaks down the discussion section of a paper and focusses on its linguistic features and the structure of the argumentation. His idea of spin seems to mostly revolve around the specific use of language.

About 15 years later, Boutron and colleagues (2010) conducted what is now the most cited investigation into spin in medical literature and defined it as: “specific reporting that could distort the interpretation of results and mislead readers” (p. 2058). Examples of such spin practices are (1) selective/strategic reporting of results throughout the report, (2) focussing on secondary analyses or sub-/within-group analyses, (3) claiming equivalence for statistically nonsignificant results, (4) use of (hype) words like “important”, “novel”, or “crucial” (i.e., linguistic spin), and (5) unsupported extrapolation of findings to other situations and/or populations. Compared to Horton (1995), Boutron and colleagues (2010) widen the concept of spin to include non-linguistic elements. Here, it is important to note that there are different ideas about what constitutes spin in scientific writing and how it should be defined.

The spin practices in scientific writing have some similarities with the spin strategies used in politics and public communications. Both involve the selective presentation and/or misrepresentation of information and making unsupported claims. However, an important difference between the two is that the use of spin in politics is generally considered a conscious and planned effort, while the use of spin in scientific writing is believed to not necessarily be a conscious decision. To indicate this important difference, I prefer the term spin ‘practices’ when talking about scientific writing as opposed to spin ‘strategies’, which is often used in politics and public communication contexts.

Context of Spin Research
The use of spin practices in scientific writing has mostly been of interest to (meta-)scientists in the field of (bio)medicine. Most of their research focusses on the presence of spin practices in two types of studies: (1) randomized controlled trials (e.g., Arunachalam et al., 2017; Gewandter et al., 2015; Guo et al., 2023) or (2) systematic reviews and/or meta-analyses (e.g., Balcerak et al., 2021; Corcoran et al., 2022; Flores et al., 2021). I believe this focus can partially be explained by the existence of well-known and widely applied reporting guidelines for these types of studies, which are the CONSORT (Moher et al., 2010) and the PRISMA (Page et al., 2021) guidelines, respectively. These guidelines provide a structured way to evaluate the quality of reporting for a particular type of study, allowing deviations from those guidelines to be labelled as ‘spin practices’. Additionally, a clear and extensive classification of spin in systematic reviews (SRs) and meta-analyses (MAs) was developed by Yavchitz and colleagues (2016), making it easy for other researchers to evaluate spin in SRs and MAs in their own sub-field of interest.

Despite this focus, it is good to note that other types of studies are not entirely neglected. A number of studies investigated spin in nonrandomized trials (e.g., Lazarus et al., 2015), diagnostic accuracy studies (e.g., Ochodo et al., 2013), and clinical prediction model studies (e.g., Andaur Navarro et al., 2023). A very recent development with regard to clinical prediction model studies is a framework for identifying and evaluating spin that has been developed by Andaur Navarro and colleagues (2024). In their framework, the authors identified several spin practices and facilitators. Some of these are specific to prediction model research (e.g., “Ignoring the risk of optimism in model performance” p. 5), while others are also applicable to a wider range of study types (e.g., “Unsubstantiated claims of clinical usefulness are reported” p. 8).

Although spin can occur in all parts of a paper, it is the abstract that has gotten a lot, if not most, of the attention in spin research. One of the main interests lies in the discrepancies between what has been reported in the results section of the full text and what is reported and concluded in the abstract. It is argued that abstracts play an important role in science communication, which justifies the focus on abstracts found in spin research. This justification is supported by a recent study which found that 98.6% of health academics and researchers read the abstract first and over 80% of researchers rated the abstract as important or very important (Shiely et al., 2024). Furthermore, it has been found that clinicians also heavily rely on abstracts for information due to a lack of time to read the full article or the fact that the full article is behind a paywall (Khaliq et al., 2012; Saint et al., 2000). It goes without saying that the possible consequences of misinterpreted results and unsubstantiated claims of effectiveness can be severe, especially considering RCTs and applications in clinical practice.

Spin Research Findings
Most studies investigating spin practices are mainly interested in measuring the prevalence of these practices. In a systematic review across 31 studies, it was found that the prevalence of spin in abstracts ranged from 9.6% to 83.6% and that the prevalence of spin in the main text ranged from 18.9% to 100% (Chiu et al., 2017). These wide ranges of prevalence are most likely due to the varying definitions and operationalisations of spin used and the varying sub-fields investigated across the different studies. More recent studies, not captured by this systematic review, have found comparable prevalence rates: 70% of papers evaluating ovarian cancer biomarkers (Ghannad et al., 2019), 46% of abstracts and 38% of full-text reports of systematic reviews of diagnostic accuracy studies in high-impact journals (McGrath et al., 2020), 67% of abstracts of systematic reviews and meta-analyses on cannabis use disorder (Corcoran et al., 2022), and 78% of abstracts of papers describing RCTs in sleep medicine (Guo et al., 2023).

Besides studies investigating the prevalence of spin, there have also been a couple of studies investigating other phenomena in relation to spin practices. For example, it was found that spin practices were not significantly related to either non-financial conflict of interest or industry funding (Jellison et al., 2019; Lieb et al., 2016). There also has been some interest in the interplay between spin practices and citation bias, where it is suggested that citation bias is less severe for negative studies that are positively spun (De Vries et al., 2016, 2017; Duyx et al., 2017). Other studies have explored the effects that spin practices might have on readers and their interpretation of the presented findings. For example, there is mixed evidence on the effect of spin on the perception of findings of RCTs. Where some studies find that spin in abstracts significantly increases the reader’s perceived effectiveness of a treatment (Boutron et al., 2014; Jankowski et al., 2022), other studies find no such effect (Shinohara et al., 2017; Van Hoesel & Bakker, 2024). These same studies found similarly mixed results regarding the effect of spin on readers’ interest in reading the full-text article, and their interest in extending the line of research for the investigated treatment.

What’s next?
You may have noticed that few firm conclusions can be reached from the current state of spin research and that those which can be reached are usually applicable only to specific situations (e.g., RCTs with non-significant primary outcomes). Needless to say, more research is needed in order to establish the effects of spin and its relation to other (meta-scientific) concepts. I think that, within (bio)medicine, research on spin practices should more often consider other types of studies and that an effort should be made to come to a clear definition of spin. Furthermore, I personally believe a lot can also be gained from previous meta-scientific research in other disciplines, such as the social sciences. These disciplines investigate other meta-scientific concepts that have obvious overlap with the concept of spin, like questionable research practices (John et al., 2012; Nagy et al., 2024). This way, hopefully, we can get more insight into spin and its consequences for science and practice. Until then, we are probably best off not taking abstracts at face value and remaining critical.

References

Andaur Navarro, C. L., Damen, J. A. A., Ghannad, M., Dhiman, P., Van Smeden, M., Reitsma, J. B., Collins, G. S., Riley, R. D., Moons, K. G. M., & Hooft, L. (2024). SPIN-PM: A consensus framework to evaluate the presence of spin in studies on prediction models. Journal of Clinical Epidemiology, 170, 111364. https://doi.org/10.1016/j.jclinepi.2024.111364

Andaur Navarro, C. L., Damen, J. A. A., Takada, T., Nijman, S. W. J., Dhiman, P., Ma, J., Collins, G. S., Bajpai, R., Riley, R. D., Moons, K. G. M., & Hooft, L. (2023). Systematic review finds “spin” practices and poor reporting standards in studies on machine learning-based prediction models. Journal of Clinical Epidemiology, 158, 99–110. https://doi.org/10.1016/j.jclinepi.2023.03.024

Andrews, L. (2006). Spin: From tactic to tabloid. Journal of Public Affairs, 6(1), 31–45. https://doi.org/10.1002/pa.37

Arunachalam, L., Hunter, I. A., & Killeen, S. (2017). Reporting of Randomized Controlled Trials With Statistically Nonsignificant Primary Outcomes Published in High-impact Surgical Journals. Annals of Surgery, 265(6), 1141–1145. https://doi.org/10.1097/SLA.0000000000001795

Balcerak, G., Shepard, S., Ottwell, R., Arthur, W., Hartwell, M., Beaman, J., Lu, K., Zhu, L., Wright, D. N., & Vassar, M. (2021). Evaluation of Spin in the Abstracts of Systematic Reviews and Meta-Analyses of Studies on Opioid use Disorder. Substance Abuse, 42(4), 543–551. https://doi.org/10.1080/08897077.2021.1904092

Boutron, I., Altman, D. G., Hopewell, S., Vera-Badillo, F., Tannock, I., & Ravaud, P. (2014). Impact of Spin in the Abstracts of Articles Reporting Results of Randomized Controlled Trials in the Field of Cancer: The SPIIN Randomized Controlled Trial. Journal of Clinical Oncology, 32(36), 4120–4126. https://doi.org/10.1200/JCO.2014.56.7503

Boutron, I., Dutton, S., Ravaud, P., & Altman, D. G. (2010). Reporting and Interpretation of Randomized Controlled Trials With Statistically Nonsignificant Results for Primary Outcomes. JAMA, 303(20), 2058–2064. https://doi.org/10.1001/jama.2010.651

Chiu, K., Grundy, Q., & Bero, L. (2017). ‘Spin’ in published biomedical literature: A methodological systematic review. PLOS Biology, 15(9), e2002173. https://doi.org/10.1371/journal.pbio.2002173

Corcoran, A., Neale, M., Arthur, W., Ottwell, R., Roberts, W., Hartwell, M., Cates, S., Wright, D. N., Beaman, J., & Vassar, M. (2022). Evaluating Spin in the Abstracts of Systematic Reviews and Meta-Analyses on Cannabis use Disorder. Substance Abuse, 43(1), 380–388. https://doi.org/10.1080/08897077.2021.1944953

De Vries, Y. A., Roest, A. M., De Jonge, P., Cuijpers, P., Munafò, M. R., & Bastiaansen, J. A. (2017). The cumulative effect of reporting and citation biases on the apparent efficacy of treatments: The case of depression. Psychological Medicine, 48(15), 2453–2455. https://doi.org/10.1017/S0033291718001873

De Vries, Y. A., Roest, A. M., Franzen, M., Munafò, M. R., & Bastiaansen, J. A. (2016). Citation bias and selective focus on positive findings in the literature on the serotonin transporter gene (5-HTTLPR), life stress and depression. Psychological Medicine, 46(14), 2971–2979. https://doi.org/10.1017/S0033291716000805

Duyx, B., Urlings, M. J. E., Swaen, G. M. H., Bouter, L. M., & Zeegers, M. P. (2017). Scientific citations favor positive results: A systematic review and meta-analysis. Journal of Clinical Epidemiology, 88, 92–101. https://doi.org/10.1016/j.jclinepi.2017.06.002

Flores, H., Kannan, D., Ottwell, R., Arthur, W., Hartwell, M., Patel, N., Bowers, A., Po, W., Wright, D. N., Chen, S., Miao, Z., & Vassar, M. (2021). Evaluation of spin in the abstracts of systematic reviews and meta-analyses on breast cancer treatment, screening, and quality of life outcomes: A cross-sectional study. Journal of Cancer Policy, 27, 100268. https://doi.org/10.1016/j.jcpo.2020.100268

Gaber, I. (1999). Government by spin: An analysis of the process. Contemporary Politics, 5(3), 263–275. https://doi.org/10.1080/13569779908450008

Gewandter, J. S., McKeown, A., McDermott, M. P., Dworkin, J. D., Smith, S. M., Gross, R. A., Hunsinger, M., Lin, A. H., Rappaport, B. A., Rice, A. S. C., Rowbotham, M. C., Williams, M. R., Turk, D. C., & Dworkin, R. H. (2015). Data Interpretation in Analgesic Clinical Trials With Statistically Nonsignificant Primary Analyses: An ACTTION Systematic Review. The Journal of Pain, 16(1), 3–10. https://doi.org/10.1016/j.jpain.2014.10.003

Ghannad, M., Olsen, M., Boutron, I., & Bossuyt, P. M. (2019). A systematic review finds that spin or interpretation bias is abundant in evaluations of ovarian cancer biomarkers. Journal of Clinical Epidemiology, 116, 9–17. https://doi.org/10.1016/j.jclinepi.2019.07.011

Grattan, M. (1998). The Politics of Spin. Australian Studies in Journalism, 7, 32–45.

Guo, F., Zhao, T., Zhai, Q., Fang, X., Yue, H., Hua, F., & He, H. (2023). “Spin” among abstracts of randomized controlled trials in sleep medicine: A research-on-research study. SLEEP, 46(6), zsad041. https://doi.org/10.1093/sleep/zsad041

Horton, R. (1995). The rhetoric of research. BMJ, 310(6985), 985–987. https://doi.org/10.1136/bmj.310.6985.985

Jankowski, S., Boutron, I., & Clarke, M. (2022). Influence of the statistical significance of results and spin on readers’ interpretation of the results in an abstract for a hypothetical clinical trial: A randomised trial. BMJ Open, 12(4), e056503. https://doi.org/10.1136/bmjopen-2021-056503

Jellison, S., Roberts, W., Bowers, A., Combs, T., Beaman, J., Wayant, C., & Vassar, M. (2019). Evaluation of spin in abstracts of papers in psychiatry and psychology journals. BMJ Evidence-Based Medicine, 25(5), 178–181. https://doi.org/10.1136/bmjebm-2019-111176

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the Prevalence of Questionable Research Practices With Incentives for Truth Telling. Psychological Science, 23(5), 524–532. https://doi.org/10.1177/0956797611430953

Khaliq, M. F., Noorani, M. M., Siddiqui, U. A., & Anwar, M. (2012). Physicians reading and writing practices: A cross-sectional study from Civil Hospital, Karachi, Pakistan. BMC Medical Informatics and Decision Making, 12(1), 76. https://doi.org/10.1186/1472-6947-12-76

Lazarus, C., Haneef, R., Ravaud, P., & Boutron, I. (2015). Classification and prevalence of spin in abstracts of non-randomized studies evaluating an intervention. BMC Medical Research Methodology, 15(85), 1–8. https://doi.org/10.1186/s12874-015-0079-x

Lieb, K., Osten-Sacken, J. V. D., Stoffers-Winterling, J., Reiss, N., & Barth, J. (2016). Conflicts of interest and spin in reviews of psychological therapies: A systematic review. BMJ Open, 6(4), e010606. https://doi.org/10.1136/bmjopen-2015-010606

Macnamara, J. (2022). Persuasion, promotion, spin, propaganda? In J. Falkheimer & M. Heide (Eds.), Research Handbook on Strategic Communication (pp. 46–61). Edward Elgar Publishing. https://doi.org/10.4337/9781800379893.00009

McGrath, T. A., Bowdridge, J. C., Prager, R., Frank, R. A., Treanor, L., Dehmoobad Sharifabadi, A., Salameh, J.-P., Leeflang, M., Korevaar, D. A., Bossuyt, P. M., & McInnes, M. D. F. (2020). Overinterpretation of Research Findings: Evaluation of “Spin” in Systematic Reviews of Diagnostic Accuracy Studies in High–Impact Factor Journals. Clinical Chemistry, 66(7), 915–924. https://doi.org/10.1093/clinchem/hvaa093

Moher, D., Hopewell, S., Schulz, K. F., Montori, V., Gotzsche, P. C., Devereaux, P. J., Elbourne, D., Egger, M., & Altman, D. G. (2010). CONSORT 2010 Explanation and Elaboration: Updated guidelines for reporting parallel group randomised trials. BMJ, 340(mar23 1), c869–c869. https://doi.org/10.1136/bmj.c869

Nagy, T., Hergert, J., Elsherif, M. M., Wallrich, L., Schmidt, K., Waltzer, T., Payne, J. W., Gjoneska, B., Seetahul, Y., Wang, Y. A., Scharfenberg, D., Tyson, G., Yang, Y.-F., Skvortsova, A., Alarie, S., Graves, K. A., Sotola, L. K., Moreau, D., & Rubínová, E. (2024). Bestiary of Questionable Research Practices in Psychology. PsyArXiv. https://doi.org/10.31234/osf.io/fhk98

Ochodo, E. A., De Haan, M. C., Reitsma, J. B., Hooft, L., Bossuyt, P. M., & Leeflang, M. M. G. (2013). Overinterpretation and Misreporting of Diagnostic Accuracy Studies: Evidence of “Spin.” Radiology, 267(2), 581–588. https://doi.org/10.1148/radiol.12120527

Page, M. J., McKenzie, J. E., Bossuyt, P. M., Boutron, I., Hoffmann, T. C., Mulrow, C. D., Shamseer, L., Tetzlaff, J. M., Akl, E. A., Brennan, S. E., Chou, R., Glanville, J., Grimshaw, J. M., Hróbjartsson, A., Lalu, M. M., Li, T., Loder, E. W., Mayo-Wilson, E., McDonald, S., … Moher, D. (2021). The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ, n71. https://doi.org/10.1136/bmj.n71

Saint, S., Christakis, D. A., Saha, S., Elmore, J. G., Welsh, D. E., Baker, P., & Koepsell, T. D. (2000). Journal reading habits of internists. Journal of General Internal Medicine, 15(12), 881–884. https://doi.org/10.1046/j.1525-1497.2000.00202.x

Shiely, F., Gallagher, K., & Millar, S. R. (2024). How, and why, science and health researchers read scientific (IMRAD) papers. PLOS ONE, 19(1), e0297034. https://doi.org/10.1371/journal.pone.0297034

Shinohara, K., Aoki, T., So, R., Tsujimoto, Y., Suganuma, A. M., Kise, M., & Furukawa, T. A. (2017). Influence of overstated abstract conclusions on clinicians: A web-based randomised controlled trial. BMJ Open, 7(12), e018355. https://doi.org/10.1136/bmjopen-2017-018355

Van Hoesel, T. G. L., & Bakker, M. (2024). The Impact of Spin on the Interpretations of Abstracts of Randomized Controlled Trials in the Field of Clinical Psychology: An Online Randomized Controlled Trial. PsyArXiv. https://doi.org/10.31234/osf.io/rh9vg

Yavchitz, A., Ravaud, P., Altman, D. G., Moher, D., Hrobjartsson, A., Lasserson, T., & Boutron, I. (2016). A new classification of spin in systematic reviews and meta-analyses was developed and ranked according to the severity. Journal of Clinical Epidemiology, 75, 56–65. https://doi.org/10.1016/j.jclinepi.2016.01.020

The Measurement Crisis: A Hidden Flaw in Psychology.

January 13, 2025 by Iris Willigers

This blogpost was written by Iris Willigers. Iris is a PhD student of our meta-research group and started her PhD in September 2024. During her PhD, she will be working on Jelte’s Vici project: Examining the Variation in Causal Effects in Psychology with her supervisors Jelte Wicherts and Marjan Bakker.

In August 2015, one of the best-known papers in psychology was published: “Estimating the reproducibility of psychological science” by the Open Science Collaboration (1). In this paper, the authors reported that only 36% of 100 selected studies published in top psychology journals could be replicated. The paper was part of the Reproducibility Project, in which numerous researchers collaborated with the goal of estimating the reproducibility of published scientific findings (2). Up until now, suggestions to improve the reproducibility of psychological science have mostly focused on the correct use and reporting of methods and statistics (3,4). However, even when methods and statistics are used and reported correctly, the conclusions of a study may be invalid and unreliable if the operationalization of the measured construct is invalid. A less discussed but related threat to the reproducibility of psychology is the Measurement Crisis (5).

Psychology relies heavily on the operationalization of abstract constructs. Operationalization (6) can be described as the process of translating abstract constructs (e.g., anxiety) into observable and measurable variables (e.g., the Beck Anxiety Inventory). However, the time and thinking that need to go into this process are often underestimated. When the operationalization is poor, the observed variables lack construct validity. Construct validity describes the ability of a measurement instrument to measure the construct it is supposed to measure (7). An example of poor construct validity is asking the question “Have you played tennis before?” with the goal of measuring implicit social cognition. In this example, I think we can agree that the face validity (5) of the operationalization of implicit social cognition is poor, as the question has nothing to do with an attitude towards something or someone. However, it becomes less clear when I tell you that I used the Implicit Association Test (IAT) to measure implicit social cognition in my study. In this case, we would need to collect evidence for all components of construct validity to be able to decide whether the operationalization was successful. Examples of such evidence are conceptualizing the construct (8) (substantive component), investigating Cronbach’s alpha (9) (structural component), or checking for correlations with other scales that measure the same or different constructs (9) (external component).

Let’s look at the example I mentioned before, the Implicit Association Test (IAT). The IAT (10) aims to measure implicit social cognition (often attitudes) by showing a participant two different conditions of pictures with words. If researchers want to measure attitudes towards sexuality with this test, they will ask participants to sort photos of heterosexual or gay couples on a computer together with either ‘good’ or ‘bad’ words, while measuring the participants’ reaction time. An example of the IAT for attitudes towards race can be seen in Figure 1. Implicit attitudes are derived under the assumption that participants respond faster when the association between the paired categories is stronger.

Figure 1
An example of conditions of the Implicit Association Test (IAT).

*Note*. Participants are asked to sort the words and pictures that are displayed in the middle into the categories white/black patient or bad/good word. In each of the four conditions, the location of the words and the race is changed. The reaction time of participants is measured in each condition. Taken from “Implicit Bias Among Physicians” by Dawson and Arkes (11).

Although the face validity of the IAT looks okay, several studies have reported conflicting outcomes for its construct validity. The first problem is that it remains unclear in the literature what exactly the IAT measures (12). Four possibilities have been proposed, ranging from the IAT measuring implicit attitudes that cannot be captured with explicit measures to the IAT not being a valid measure at all because there are no stable attributes to measure (13).

Another reason why the construct validity of the IAT is unclear is that its reliability depends on the type of reliability considered. The test-retest reliability is moderate (r = .50), whereas the internal consistency is high (alpha = .80) (14). If the goal of the IAT is to measure one specific attitude that remains consistent over time, the validity evidence provided by this reliability information is not sufficient.
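To see how a test can combine high internal consistency with only moderate test-retest reliability, here is a minimal sketch in Python. It uses simulated data, not IAT data, and rests on a toy assumption: every item score is the sum of a stable trait, an occasion-specific state, and item noise. The function names and all variances are hypothetical and chosen for illustration only.

```python
import random

def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def sample_variance(x):
    """Unbiased sample variance."""
    m = sum(x) / len(x)
    return sum((a - m) ** 2 for a in x) / (len(x) - 1)

def cronbach_alpha(rows):
    """Cronbach's alpha; rows holds one list of k item scores per respondent."""
    k = len(rows[0])
    item_vars = sum(sample_variance([r[j] for r in rows]) for j in range(k))
    total_var = sample_variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - item_vars / total_var)

def simulate(n=2000, k=4, seed=1):
    """Toy model (assumption): item = stable trait + session state + item noise."""
    rng = random.Random(seed)
    session1, session2 = [], []
    for _ in range(n):
        trait = rng.gauss(0, 1)          # identical in both sessions
        for session in (session1, session2):
            state = rng.gauss(0, 1)      # redrawn at every session
            session.append([trait + state + rng.gauss(0, 2 ** 0.5) for _ in range(k)])
    return session1, session2

s1, s2 = simulate()
alpha = cronbach_alpha(s1)
retest = pearson([sum(r) for r in s1], [sum(r) for r in s2])
print(f"Cronbach's alpha (session 1): {alpha:.2f}")
print(f"Test-retest correlation:      {retest:.2f}")
```

With these assumed variances, the expected alpha is about .80 and the expected test-retest correlation about .40. The items agree with each other within a session because they share both trait and state, but only the trait carries over to the second session.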

Additionally, we cannot draw a conclusion about the convergent and discriminant validity of the IAT (15) following the approach of Campbell and Fiske (9). Convergent validity means that the measure correlates highly with other measures designed to measure the same or a theoretically similar construct, whereas discriminant validity means that the measure correlates only weakly to moderately with measures designed to measure a different construct. Evidence for discriminant validity requires that convergent validity is well established: logically, one should first demonstrate that the measure represents the intended construct before one can show that another construct differs from it. Thus, to be able to provide evidence for the convergent and discriminant validity of the IAT, it should be clear whether the IAT measures implicit social cognition, explicit social cognition, or something in between. This brings us back to the first problem we discussed, the conceptualization problem in the construct validation of the IAT.

Our example of the construct validity of the IAT illustrates how difficult it is to determine whether a measure is valid. Even though the construct validity of the IAT is unclear, the test is still used to measure or study implicit social cognition. The article introducing the IAT (10) has been cited 18,012 times (as of 13-12-2024). But how can you blame the people citing this article when there is so much conflicting literature to keep up with?

Although operationalization and construct validity are essential for establishing the robustness of study findings, scientific manuscripts often do not contain sufficient information to validate the measured construct(s). Several studies have reported a lack of construct validity evidence in manuscripts on general psychology (16), educational behavior (17), emotion (18), and social cognitive ability (19). Studies of reporting practices for reliability and validity show that researchers often invoke the reliability and validity evidence of previous studies without testing it in their own sample (17, 18). Reliability, however, is a characteristic of how a test functions within a particular sample, not of the test alone (22). Current reporting practices also show that researchers still assume the reliability and validity found in previous studies even though they adjusted the test by adding or deleting questions (18). Missing or incomplete reporting of the validity or reliability of measurement instruments can lead authors to rely too heavily on their own reported results. It can also mislead readers of the paper, as the reported conclusions cannot be evaluated based on the reported measurement information.

Part of the measurement problem in psychology is the lack of standardization of measurement instruments. For example, in their study of the current state of emotion assessment, Weidman et al. found that only 8.4% of the 356 measurement instruments were taken from an existing scale without modification for the paper at hand (18). In this study, 69% of the measurement instruments were developed without systematic scale development or reference to earlier literature.

The problem with this unstandardized way of measuring is that different measurements of the same abstract construct can lead to different conclusions (23). This also has implications for evaluating and comparing the parts of a scientific theory that are tested in the literature. For example, consider two constructs, academic success and physical activity, each of which can be operationalized in multiple ways. Two operationalizations of academic success are someone’s actual grade point average (GPA) and someone’s self-reported GPA. People often report a higher GPA than their actual GPA (24,25). Physical activity can be operationalized as the number of minutes of physical activity per day, measured by an accelerometer over 5 days, or as self-reported physical activity in minutes per week. The correlation between academic success and physical activity has been studied using these different operationalizations. One study used actual GPA and an accelerometer and found a strong correlation of 0.87 (N = 20) (26). Another study used self-reported academic success and physical activity and found a correlation of -0.12 (N = 104) (27). Different operationalizations are probably not the only reason why these correlations differ so much (e.g., the studies’ sample sizes were small), but they are an important part of this variation. The example makes clear that we need standardized tests and other measures to build strong psychological theories (28,29).
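To get a feeling for how much sampling uncertainty these two sample sizes imply, here is a minimal sketch that computes approximate 95% confidence intervals for both correlations with the Fisher z transformation. This is my own illustration, not an analysis from either study. It assumes simple random samples and roughly bivariate normal data, and the function name fisher_ci is made up for this example.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist

def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for a correlation via Fisher's z."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    z = atanh(r)                       # transform r to the z scale
    half_width = z_crit / sqrt(n - 3)  # standard error on the z scale is 1/sqrt(n - 3)
    return tanh(z - half_width), tanh(z + half_width)

studies = [
    ("actual GPA + accelerometer", 0.87, 20),
    ("self-reported GPA + activity", -0.12, 104),
]
for label, r, n in studies:
    low, high = fisher_ci(r, n)
    print(f"{label}: r = {r:+.2f}, N = {n}, 95% CI [{low:+.2f}, {high:+.2f}]")
```

Under these assumptions, the interval for the first study runs from roughly 0.70 to 0.95 and the interval for the second from roughly -0.31 to 0.07. Both are wide, but they do not overlap, which fits the idea that sampling error alone is unlikely to account for the difference. The sketch cannot tell us how much of the gap is due to the operationalizations rather than to other differences between the studies.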

How can we overcome the measurement crisis? I think it starts with prioritizing standardized measures to operationalize variables and ensuring transparency in reporting evidence for their construct validity. Furthermore, when feasible in terms of time and costs, researchers should offer evidence of construct validity within their specific sample to enhance the credibility of their findings. When designing a new measure to operationalize variables, it is essential to systematically develop it and assess its validity. The process of instrument development should be reported transparently, enabling readers to critically evaluate the validity of the study. A clear overview of how to avoid so-called ‘Questionable Measurement Practices’ can be found in the paper by Flake & Fried (30). They provide a list of questions to consider when thinking about measurement. By prioritizing sound measurement practices, we can make psychological theory more robust.

References

1. Open Science Collaboration. Estimating the reproducibility of psychological science. Science. 2015 Aug 28;349(6251):aac4716.
2. Open Science Collaboration. An Open, Large-Scale, Collaborative Effort to Estimate the Reproducibility of Psychological Science. Perspect Psychol Sci. 2012 Nov 1;7(6):657–60.
3. Hales AH, Wesselmann ED, Hilgard J. Improving Psychological Science through Transparency and Openness: An Overview. Perspect Behav Sci. 2019 Mar 1;42(1):13–31.
4. Asendorpf JB, Conner M, De Fruyt F, De Houwer J, Denissen JJA, Fiedler K, et al. Recommendations for Increasing Replicability in Psychology. Eur J Personal. 2013 Mar 1;27(2):108–19.
5. Devine S. The Four Horsemen of the Crisis in Psychological Science [Internet]. Trial and Error. 2020 [cited 2025 Jul 1]. Available from: https://blog.trialanderror.org/the-four-horsemen-of-the-crisis-in-psychological-science
6. Jhangiani RS, Chiang IA, Cuttler C, Leighton DC. Research Methods in Psychology [Internet]. 4th ed. Surrey, B.C.: Kwantlen Polytechnic University; 2019. Available from: https://doi.org/10.17605/OSF.IO/HF7DQ
7. Cronbach LJ, Meehl PE. Construct validity in psychological tests. Psychol Bull. 1955;52(4):281–302.
8. Gehlbach H, Brinkworth ME. Measure twice, cut down error: A process for enhancing the validity of survey scales. Rev Gen Psychol. 2011;15(4):380–7.
9. Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika. 1951;16:297–334.
10. Greenwald AG, McGhee DE, Schwartz JL. Measuring individual differences in implicit cognition: the implicit association test. J Pers Soc Psychol. 1998;74(6):1464–80.
11. Dawson N, Arkes H. Implicit Bias Among Physicians. J Gen Intern Med. 2008 Nov 1;24:137–40.
12. Schimmack U. The Implicit Association Test: A Method in Search of a Construct. Perspect Psychol Sci. 2021 Mar 1;16(2):396–414.
13. Payne BK, Vuletich HA, Lundberg KB. The Bias of Crowds: How Implicit Bias Bridges Personal and Systemic Prejudice. Psychol Inq. 2017 Oct 2;28(4):233–48.
14. Greenwald AG, Lai CK. Implicit Social Cognition. Annu Rev Psychol. 2020 Jan;71:419–45.
15. Epifania OM, Anselmi P, Robusto E. Implicit social cognition through the years: The Implicit Association Test at age 21. Psychol Conscious Theory Res Pract. 2022;9(3):201–17.
16. Maassen E, D’Urso D, Assen M, Nuijten M, De Roover K, Wicherts J. The Dire Disregard of Measurement Invariance Testing in Psychological Science. Psychol Methods. 2023 Dec 25.
17. Barry AE, Chaney B, Piazza-Gardner AK, Chavarria EA. Validity and Reliability Reporting Practices in the Field of Health Education and Behavior: A Review of Seven Journals. Health Educ Behav. 2014 Feb 1;41(1):12–8.
18. Weidman AC, Steckler CM, Tracy JL. The jingle and jangle of emotion assessment: Imprecise measurement, casual scale usage, and conceptual fuzziness in emotion research. Emotion. 2017;17(2):267–95.
19. Higgins WC, Kaplan DM, Deschrijver E, Ross RM. Construct validity evidence reporting practices for the Reading the Mind in the Eyes Test: A systematic scoping review. Clin Psychol Rev. 2024 Mar 1;108:102378.
20. Barry AE, Chaney B, Piazza-Gardner AK, Chavarria EA. Validity and Reliability Reporting Practices in the Field of Health Education and Behavior: A Review of Seven Journals. Health Educ Behav. 2014 Feb 1;41(1):12–8.
21. Slaney KL, Tkatchouk M, Gabriel SM, Maraun MD. Psychometric assessment and reporting practices: Incongruence between theory and practice. J Psychoeduc Assess. 2009;27(6):465–76.
22. Revelle W. Chapter 7: Classical Test Theory and the Measurement of Reliability. In: An Introduction to Psychometric Theory with Applications in R [Internet]. Springer; Available from: http://personality-project.org/r/book
23. Breznau N, Rinke EM, Wuttke A, Nguyen HHV, Adem M, Adriaans J, et al. Observing many researchers using the same data and hypothesis reveals a hidden universe of uncertainty. Proc Natl Acad Sci. 2022 Nov 1;119(44):e2203150119.
24. Kuncel NR, Credé M, Thomas LL. The Validity of Self-Reported Grade Point Averages, Class Ranks, and Test Scores: A Meta-Analysis and Review of the Literature. Rev Educ Res. 2005;75(1):63–82.
25. Rosen JA, Porter SR, Rogers J. Understanding Student Self-Reports of Academic Performance and Course-Taking Behavior. AERA Open. 2017 May 1;3(2):2332858417711427.
26. Ðurić S, Bogataj Š, Zovko V, Sember V. Associations Between Physical Fitness, Objectively Measured Physical Activity and Academic Performance. Front Public Health [Internet]. 2021;9. Available from: https://www.frontiersin.org/journals/public-health/articles/10.3389/fpubh.2021.778837
27. Gonzalez EC, Hernandez EC, Coltrane AK, Mancera JM. The Correlation between Physical Activity and Grade Point Average for Health Science Graduate Students. OTJR Occup Ther J Res. 2014 Jun 1;34(3):160–7.
28. Goodhew SC, Dawel A, Edwards M. Standardizing measurement in psychological studies: On why one second has different value in a sprint versus a marathon. Behav Res Methods. 2020 Dec 1;52(6):2338–48.
29. Loevinger J. Objective Tests as Instruments of Psychological Theory. Psychol Rep. 1957 Jun 1;3(3):635–94.
30. Flake JK, Fried EI. Measurement Schmeasurement: Questionable Measurement Practices and How to Avoid Them. Adv Methods Pract Psychol Sci. 2020 Dec 1;3(4):456–65.

What about meta-arts?

December 09, 2024 by Iris Willigers in Blogposts

This blogpost was written by Ben Kretzler. Ben is a PhD student of our meta-research group and started his PhD in September 2024. During his PhD, he will be working on Jelte’s Vici project: Examining the Variation in Causal Effects in Psychology with his supervisors Jelte Wicherts, Marcel van Assen and Robbie van Aert.

According to Wikipedia, we use the term "metascience" for the application of scientific methodology to study science itself. But there's perhaps another reason to talk about meta-science: within the arts and sciences, it seems that primarily the latter have a substantial number of researchers dedicated to scrutinizing research practices and assessing the confidence we can have in our knowledge.

To explain this, we could put forward several reasons: First, theories from the sciences often yield statements whose content is easier to falsify compared to the statements we can derive from theories from the arts.¹ Therefore, overconfidence in or flaws of theories from the sciences might be more easily detectable than those of theories from the arts. Second, and relatedly, meta-research movements in fields like medicine and psychology often arise as reactions to "crises of confidence"—when results don't replicate or scientific misconduct is uncovered (Nelson et al., 2018; Rennie & Flanagin, 2018). Since evaluating the arts' function can be more challenging, such confidence crises may simply occur less often, perhaps reducing the pressure for self-evaluation.²

Still, even if these reasons help explain why meta-research in the arts has not reached the same intensity as its counterparts in the sciences in the past, they are insufficient to explain why such meta-research is not happening in the present. In this post, we will argue that meta-research in the arts is not only possible but necessary, exemplified by cases from quantitative history and cultural studies.¹

Quick Detour: What Is the Current State of Meta-Arts?
As pointed out above, the meta-researcher-to-researcher ratio in the arts seems to be far below that in psychology or medicine. Consequently, evidence regarding publication bias, selective reporting, or analysis heterogeneity is sparse. Still, there are some individual projects that (directly or indirectly) addressed the replicability and robustness of research in the arts:

The X-Phi Replicability Project tested the reproducibility of experimental philosophy (Cova et al., 2018) by conducting high-powered replications of two samples of popular and randomly drawn studies. It yielded a replication rate of 78.4% for original studies presenting significant results (for comparison, the replication rate for psychological research from 2008 seems to be around 37%; Open Science Collaboration, 2015).

A part of the June 2024 issue of Zygon was devoted to a direct and a conceptual replication of John Hedley Brooke's account of whether religion helped or hindered the rise of modern science, as explored in his book Science and Religion. While the replicators mentioned a few minor inconsistencies in how Brooke presented the theses of some other researchers and interpreted some original and newly added source material differently than Brooke, they acknowledged that his work was of high quality and did not challenge his general account. Thus, although this historical work and its underlying sources bore some reliability issues and researcher degrees of freedom, these did not necessarily undermine the production of a robust and credible account of the relationship between religion and early science.

Finally, a project assessing the robustness and reproducibility of publications in the American Economic Review (Campbell et al., 2024) also reanalyzed some cliometric papers (e.g., Angelucci et al., 2022; Ashraf & Galor, 2013; Berger et al., 2013). At the very least, these papers were no exception to the general observation that the analyses conducted by the original authors tended to yield higher effect sizes and were more often significant than those conducted by the replication teams.

The latter notion is reinforced by several research controversies over the past two decades, where commentaries analyzing the same research question in different ways contradicted the original findings (e.g., Albouy, 2012, cf. Acemoglu et al., 2012; Guinnane & Hoffman, 2022, cf. Voigtländer & Voth, 2022). Thus, there seems to be some analysis heterogeneity in some individual cases.

What should we conclude from this short overview? On the one hand, it demonstrates that different research designs and analyses can induce interpretation-changing differences in results, and that some publication bias and selective reporting are going on in quantitative historical or cultural research. On the other hand, these observations do little more than refute universal statements that such problems do not exist in the arts. Because they are anecdotal, they do not allow for any statements about the extent of such heterogeneity or bias.

Researcher Degrees of Freedom in Cliometrics and Cultural Research
Adding to our (weak) conclusion that researcher degrees of freedom can also affect topics associated with the arts, we will introduce two degrees of freedom specific to cliometric and cross-cultural research (and not included in enumerations of researcher degrees of freedom in other disciplines, such as psychology; cf. Wicherts et al., 2016): the selection of (growth) control variables and a reference year.

(Growth) Control Variables
Apparently, cross-cultural researchers like growth and GDP regressions (e.g., Acemoglu et al., 2005; Berggren et al., 2011; Gorodnichenko & Roland, 2017). However, they can hardly ever assume that the relationship between their predictor of interest and growth or GDP is unaffected by confounders, so a set of control variables has to be determined. Defining such a set is not easy (for instance, because many controls, such as education and income, are highly correlated with one another), and the resulting sets differ widely: some papers control for geographical and religious factors (e.g., Gorodnichenko & Roland, 2017), while others exclude these factors and instead focus on economic variables such as inflation rates, openness to foreign trade, or government expenditures (e.g., Berggren et al., 2011), and still others add historical variables such as the year of independence or war history (Acemoglu et al., 2005).

Thus, researchers can choose from a bunch of reasonable combinations of control variables. Does this affect the outcomes? To test this, we ran multiple analyses about the relationship between general government debt and growth rates across a sample of countries worldwide.³ Working with a set of nine widely used control variables,⁴ we ran one analysis with all control variables, and nine additional analyses where we removed one of the controls. The distribution of the p-values is displayed below.

First, the black bar shows the p-value for the analysis using the complete set of control variables. Here, the relationship between debt and growth rates was insignificant (p = .458). Yet, when removing one of the nine control variables, the results can change drastically, as demonstrated by the grey bars: Two analyses (one without life expectancy and the other without inflation rates) found that higher debt levels were highly significantly associated with lower growth, with p-values of .003 and < .001, respectively. Additionally, another analysis (this time without investment levels) detected a significant negative relationship, too, p = .023.⁵ All remaining analyses were, however, not even close to being significant.

Why do p-values change when we include or exclude different control variables? Generally, there are two main reasons for this:

Control variables might reduce noise in the outcome variable: By including control variables, we might explain some of the variation in the outcome (here: growth rates). This reduces the "noise" in its values, so it is easier to detect the effect of the predictor of interest (here: debt).

They might, however, also account for relationships between variables: Control variables may be related to both the predictor of interest and the outcome. By including these controls, we isolate the unique contribution of debt to growth rates. Without them, we might mistakenly attribute some of the control variable's effect to debt.

The second case is particularly interesting because it changes how we (should) interpret the regression results. For example, if we do not control for inflation rates, the observed relationship between debt and growth might not be due to debt itself reducing growth. Instead, it could reflect the fact that higher debt levels are often associated with high inflation, which in turn hampers growth. In this case, failing to control for inflation could lead us to a misleading conclusion about the causal relationship between debt and growth. However, not many papers reporting growth regressions seem to discuss how their composition of control variables affects the outcomes; instead, it appears more common to choose a particular set based on previous research (e.g., a popular paper by Barro, 1991) that might be more or less appropriate for different regressions.

This quick example demonstrates that the set of control variables heavily influences whether a predictor for economic growth will be significant or not.⁶ It also shows that, given the lack of consensus about which variables to control for, researchers have a fair chance of generating positive results by playing with the controls.
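A leave-one-out check of this kind can be sketched in a few lines. The example below is a minimal illustration on simulated data, not a reconstruction of the analysis above: the three control names, the effect sizes, and the sample size are assumptions, and the p-value uses a normal approximation to the t distribution, which is only reasonable for large samples. It fits the full specification once and then refits it with each control removed in turn.

```python
import random
from statistics import NormalDist


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols_p_value(y, columns):
    """Two-sided p-value for the first predictor in `columns`.

    An intercept is added automatically. Uses a normal approximation,
    so treat the result as a large-sample approximation only.
    """
    n = len(y)
    X = [[1.0] + [c[i] for c in columns] for i in range(n)]
    k = len(X[0])
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
           for a in range(k)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = solve(xtx, xty)
    resid = [y[i] - sum(b * x for b, x in zip(beta, X[i])) for i in range(n)]
    sigma2 = sum(r * r for r in resid) / (n - k)
    # Var(beta[1]) = sigma2 * [(X'X)^-1]_{11}; get that column by solving.
    unit = [1.0 if j == 1 else 0.0 for j in range(k)]
    se = (sigma2 * solve(xtx, unit)[1]) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(beta[1] / se)))


# Simulated data (assumption): inflation is related to both debt and growth,
# investment only to growth, life expectancy to neither.
rng = random.Random(7)
n = 1000
controls = {
    name: [rng.gauss(0, 1) for _ in range(n)]
    for name in ("inflation", "investment", "life_expectancy")
}
debt = [0.7 * controls["inflation"][i] + rng.gauss(0, 1) for i in range(n)]
growth = [
    -0.6 * controls["inflation"][i] + 0.4 * controls["investment"][i]
    + rng.gauss(0, 1)
    for i in range(n)
]

results = {"all controls": ols_p_value(growth, [debt] + list(controls.values()))}
for dropped in controls:
    kept = [v for name, v in controls.items() if name != dropped]
    results[f"without {dropped}"] = ols_p_value(growth, [debt] + kept)

for label, p in results.items():
    print(f"{label}: p = {p:.3f}")
```

With real data one would normally rely on an established statistics library and exact t-based p-values; the hand-rolled solver here only keeps the sketch self-contained.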

Year
Another standard research design in cliometrics or cultural research is the cross-section, where we score countries on a predictor and then examine whether this predictor is significantly related to an outcome: Does an individualistic (vs. collectivistic) culture relate to higher productivity (Gorodnichenko & Roland, 2017)? Do countries with low, medium, or high genetic diversity have a higher GDP per capita (Ashraf & Galor, 2013)? For such comparisons, we must select a reference year: does an individualistic culture relate to higher productivity in 2000, 2010, or 2020?

To demonstrate that the year matters, we set up a quick example analysis: Is indulgence vs. restraint (i.e., the degree to which relatively free gratification of basic human needs is restricted by, for instance, social norms; Hofstede, 1980) associated with GDP per capita?⁷ The graphic below shows the p values for the years between 2005 and 2022:

The analyses reveal a consistent positive relationship between indulgence and GDP per capita across all years. However, this relationship is significant only between 2005 and 2012 (and marginally significant until 2015) but becomes insignificant in later years. This shift could reflect short-term developments during the study period: for instance, some highly restrained countries, like China and Pakistan, experienced relatively high economic growth, while the economies of more indulgent countries, such as Argentina and Brazil, struggled at the start of the millennium. Alternatively, the fluctuations might also indicate that the relationship between indulgence/restraint and economic performance has weakened over time.

In either case, relying on data from a single year seems problematic for this kind of analysis. A snapshot from one year could be heavily influenced by events specific to that period that determine the answer we receive to our broader research question. It would be more meaningful to examine how this relationship evolves over time. By considering variations across multiple years, researchers can not only reduce the risk of false positives (or negatives) but also uncover long-term trends that might inform theory development (see, e.g., Maseland, 2021). Such an approach could help identify persistent patterns or shifts in the relationship, providing valuable insights into the dynamics between cultural traits and economic performance.
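Looking at the relationship across years, rather than in one snapshot, can be sketched as follows. The example is purely illustrative: the data are simulated, and the number of countries, the noise level, and the assumption that the true relationship fades linearly from 0.5 to 0 between 2005 and 2022 are all hypothetical. It estimates one slope per year and then summarises how those yearly estimates drift over time.

```python
import random
import statistics


def slope(x, y):
    """OLS slope of y on a single predictor x (with intercept)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sum((a - mx) ** 2 for a in x)


def yearly_slopes(years, true_slope_for, n_countries=150, seed=3):
    """Estimate the indulgence-GDP slope separately for each year."""
    rng = random.Random(seed)
    # Country scores on the cultural dimension stay fixed across years.
    indulgence = [rng.gauss(0, 1) for _ in range(n_countries)]
    estimates = {}
    for year in years:
        gdp = [true_slope_for(year) * s + rng.gauss(0, 1) for s in indulgence]
        estimates[year] = slope(indulgence, gdp)
    return estimates


years = range(2005, 2023)
# Hypothetical assumption: the true slope fades from 0.5 (2005) to 0 (2022).
estimates = yearly_slopes(years, lambda year: 0.5 * (2022 - year) / 17)

# Summarise the development: slope of the yearly estimates on the year.
trend = slope(list(estimates.keys()), list(estimates.values()))

for year, est in estimates.items():
    print(f"{year}: slope = {est:.2f}")
print(f"change in slope per year: {trend:.3f}")
```

A single-year analysis would report only one of the printed slopes; the trend line makes the weakening relationship visible regardless of which year happens to be picked.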

Conclusion
This blog post aimed to establish two fundamental notions: First, quantitative analysis in the arts (e.g., history, cultural research) also involves researcher degrees of freedom, which can lead to meaningful variations in results. Second, these degrees of freedom can be strategically utilized to generate significant findings.

Together, these two notions could lead to an inflated number of false-positive results. Indeed, the limited evidence we have so far suggests the existence of at least some publication bias and/or selective reporting in the quantitative humanities. Finally, while research in the humanities may not share the same topics or degrees of freedom as fields like psychology or medicine, the approaches that meta-researchers have developed in recent years (e.g., multiverse analyses, p-curves) could provide a good starting position for addressing publication bias and selective reporting in the arts as well.

References

Acemoglu, D., Johnson, S., & Robinson, J. (2005). The Rise of Europe: Atlantic Trade, Institutional Change, and Economic Growth. American Economic Review, 95(3), 546-579. https://doi.org/10.1257/0002828054201305

Acemoglu, D., Johnson, S., & Robinson, J. A. (2012). The Colonial Origins of Comparative Development: An Empirical Investigation: Reply. American Economic Review, 102(6), 3077-3110. https://doi.org/10.1257/aer.102.6.3077

Albouy, D. Y. (2012). The Colonial Origins of Comparative Development: An Empirical Investigation: Comment. American Economic Review, 102(6), 3059-3076. https://doi.org/10.1257/aer.102.6.3059

Angelucci, C., Meraglia, S., & Voigtländer, N. (2022). How Merchant Towns Shaped Parliaments: From the Norman Conquest of England to the Great Reform Act. American Economic Review, 112(10), 3441-3487. https://doi.org/10.1257/aer.20200885

Ashraf, Q., & Galor, O. (2013). The “Out of Africa” Hypothesis, Human Genetic Diversity, and Comparative Economic Development. American Economic Review, 103(1), 1-46. https://doi.org/10.1257/aer.103.1.1

Astington, J. W. (1999). The language of intention: Three ways of doing it. In P. D. Zelazo, J. W. Astington, & D. R. Olson (Eds.), Developing theories of intention. Erlbaum.

Bargh, J. A., Chen, M., & Burrows, L. (1996). Automaticity of social behavior: Direct effects of trait construct and stereotype activation on action. Journal of Personality and Social Psychology, 71(2), 230-244. https://doi.org/10.1037/0022-3514.71.2.230

Barro, R. J. (1991). Economic growth in a cross section of countries. The Quarterly Journal of Economics, 106(2), 407-443. https://doi.org/10.2307/2937943

Berger, D., Easterly, W., Nunn, N., & Satyanath, S. (2013). Commercial Imperialism? Political Influence and Trade During the Cold War. American Economic Review, 103(2), 863-896. https://doi.org/10.1257/aer.103.2.863

Berggren, N., Bergh, A., & Bjørnskov, C. (2011). The growth effects of institutional instability. Journal of Institutional Economics, 8(2), 187-224. https://doi.org/10.1017/s1744137411000488

Bratman, M. E. (1987). Intention, plans, and practical reason. MIT Press.

Campbell, D., Brodeur, A., Dreber, A., Johannesson, M., Kopecky, J., Lusher, L., & Tsoy, N. (2024). The Robustness Reproducibility of the American Economic Review (124). https://www.econstor.eu/bitstream/10419/295222/1/I4R-DP124.pdf

Cova, F., Strickland, B., Abatista, A., Allard, A., Andow, J., Attie, M., Beebe, J., Berniūnas, R., Boudesseul, J., Colombo, M., Cushman, F., Diaz, R., N’Djaye Nikolai van Dongen, N., Dranseika, V., Earp, B. D., Torres, A. G., Hannikainen, I., Hernández-Conde, J. V., Hu, W.,…Zhou, X. (2018). Estimating the Reproducibility of Experimental Philosophy. Review of Philosophy and Psychology, 12(1), 9-44. https://doi.org/10.1007/s13164-018-0400-9

De Rijcke, S., & Penders, B. (2018). Resist calls for replicability in the humanities. Nature, 560(7716), 29. https://doi.org/10.1038/d41586-018-05845-z

Gorodnichenko, Y., & Roland, G. (2017). Culture, Institutions, and the Wealth of Nations. The Review of Economics and Statistics, 99(3), 402-416. https://doi.org/10.1162/REST_a_00599

Guinnane, T. W., & Hoffman, P. (2022). Medieval Anti-Semitism, Weimar Social Capital, and the Rise of the Nazi Party: A Reconsideration. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.4286968

Hofstede, G. (1980). Culture's Consequences: International Differences in Work-Related Values. Sage Publications.

Knobe, J. (2003). Intentional action in folk psychology: An experimental investigation. Philosophical Psychology, 16(2), 309-324. https://doi.org/10.1080/09515080307771

Latour, B. (1991). We have never been modern. Harvard University Press.

Maseland, R. (2021). Contingent determinants. Journal of Development Economics, 151. https://doi.org/10.1016/j.jdeveco.2021.102654

Nelson, L. D., Simmons, J., & Simonsohn, U. (2018). Psychology's Renaissance. Annual Review of Psychology, 69, 511-534. https://doi.org/10.1146/annurev-psych-122216-011836

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716

Peels, R., Van Den Brink, G., Van Eyghen, H., & Pear, R. (2024). Introduction: Replicating John Hedley Brooke’s work on the history of science and religion. Zygon, 59(2). https://doi.org/10.16995/zygon.11255

Rennie, D., & Flanagin, A. (2018). Three Decades of Peer Review Congresses. JAMA, 319(4), 350-353. https://doi.org/10.1001/jama.2017.20606

Voigtländer, N., & Voth, H.-J. (2022). Response to Guinnane and Hoffman: Medieval Anti-Semitism, Weimar Social Capital, and the Rise of the Nazi Party: A Reconsideration. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.4316007

Wicherts, J. M., Veldkamp, C. L., Augusteijn, H. E., Bakker, M., van Aert, R. C., & van Assen, M. A. (2016). Degrees of Freedom in Planning, Running, Analyzing, and Reporting Psychological Studies: A Checklist to Avoid p-Hacking. Frontiers in Psychology, 7, 1832. https://doi.org/10.3389/fpsyg.2016.01832

Footnotes
1. For example, the theory of unconscious priming was originally corroborated by verifying the hypothesis “People walk slower when they are shown words they associate with the elderly” (Bargh et al., 1996). Compared to that, it is very hard to establish (inter-subjective) falsification of, say, the central hypothesis of Bruno Latour’s We Have Never Been Modern (Latour, 1991): “What we call the modern world is based on an ill-defined separation between nature and society.”

2. Also, an interesting account by de Rijcke and Penders (2018) suggests that the arts are more about the search for meaning than chasing after truth and perform “evaluation and assessment according to different quality criteria — namely, those that are based on cultural relationships and not statistical realities.” In this case, the problem of overconfidence in or flaws of the theoretical state of the art(s) is irrelevant and any efforts to detect such issues redundant. Still, as Peels et al. (2024) note, the arts are not entirely off the hook when it comes to truth-seeking, as they also include research questions such as whether European colonies that were poorer at the end of the Middle Ages developed better than richer colonies because they were not subject to extractive institutions (Acemoglu et al., 2002). Therefore, this blog post at least concerns questions of this type, without explicitly including or excluding any other research question from the arts.

3. Growth rates were calculated using the data from the Maddison Project Database 2023 (Bolt & van Zanden, 2023). Data about general government debt came from the Global Debt Database of the International Monetary Fund (Mbaye et al., 2018). The sources of the control variables were the Penn World Tables 10.01 (Feenstra et al., 2015), the World Development Indicators (World Bank, 2024), ILOSTAT (International Labour Organization, 2024), and Barro and Lee (2013). We used data from 2005 to 2019.

4. The control variables were GDP per capita (for convergence), population growth, investment levels relative to the GDP, government share relative to the GDP, sum of imports and exports relative to the GDP, education level, inflation level, life expectancy, and labor force growth.

5. The coefficients indicate that a 25% increase in general government debt (similar to the increase in the United States during the first year of the COVID-19 pandemic) decreases yearly growth rates by 0.3% to 0.4%.

6. Interestingly, the effect size estimates are rather close to one another, ranging from 0.0% to 0.4% for all (significant and insignificant) analyses. The underlying multiverse variability is 0.014 for Cohen’s f².

7. GDP data came from the Maddison Project Database 2023 (Bolt & van Zanden, 2023), and data for power distance from the Hofstede (1980) data. We performed a linear regression for each year, controlling for a standard set of geographical and religious variables already used by previous studies on the relationship between culture and economic performance (e.g., Gorodnichenko & Roland, 2017).

Improving the Quality and Specificity of Preregistration

January 25, 2021 by M. Bakker

Marjan Bakker, writing for the Center for Open Science:

Preregistration, that is, specifying a research plan in advance of the study, is seen as one of the most important ways to improve the credibility of research findings. With preregistration, a clear distinction between planned and unplanned analyses can be made, thereby eliminating the possibility of making data-contingent decisions (Nosek, Ebersole, DeHaven, & Mellor, 2018). In recent years, preregistration has become increasingly popular. For example, the number of preregistrations at OSF has approximately doubled yearly from 38 in 2012 to 36,675 by the end of 2019 (http://osf.io/registries).

Why I Think Open Peer Review Benefits PhD Students

January 10, 2020 by Olmo van den Akker

This blog post was part of an initiative by Nature Human Behaviour called 'Publish or Perish' where early career researchers give their views on the pressure to publish in academia. The original blog post can be found here.

Doing scientific research is my dream job. Unfortunately, it’s not at all certain that I can keep doing research after getting my PhD degree. Research jobs are scarce and every year the academic job market is flooded with freshly minted PhDs. In practice, this means that only the most prolific PhD students will land a job. In other words, you either ‘publish or perish’. In this blog post I will argue that the culture of ‘publish or perish’, although not a problem in theory, is a problem in practice because of the unfairness of the peer review system. In my view, opening up this system would make it fairer for all researchers, but especially for PhD students.

Based on discussions with colleagues as well as my own experiences I’ve become aware that the peer review system can be random and biased. This intuition is supported by scientific studies of peer review that find that the interrater reliability of reviewers is low, which means that an editor’s (often arbitrary) choice of reviewers plays a big part in whether your manuscript will be accepted (Bornmann, Mutz, & Daniel, 2010; Cicchetti, 1991; Cole, Cole, & Simon, 1981; Jackson, Srinivasan, Rea, Fletcher, & Kravitz, 2011). In addition, studies have found that reviewers are more likely to value manuscripts including positive results (Mahoney, 1977; Emerson et al., 2010) and results consistent with their theoretical viewpoints (Mahoney, 1977). These structural biases as well as the random element make the peer review system unfair as it is unable to consistently distinguish good quality research from bad quality research. This is especially concerning for PhD students who only have a few years to accrue publications to get funding for an academic job. One unfair negative review could nip their career in the bud.

In my view, the solution to the unfairness of the peer review system is straightforward: Switch from a closed peer review system to an open peer review system. Here, I define open peer review as a peer review system in which authors and reviewers are aware of each other’s identity, and review reports are published alongside the relevant article. Ross-Hellauer (2017) found that these two aspects together account for more than 95% of the mentions of ‘open peer review’ in the recent literature. Note that open peer review may also refer to a situation where the wider community can comment on a manuscript, but I do not use that definition here. Below, I list the potential benefits and downsides of switching to an open peer review system.

Potential benefits of open peer review for PhD students

1) In an open peer review system reviewers’ names are linked to their public reviews, which increases accountability.

This accountability may cause reviewers to be more conscientious and thorough when reviewing a manuscript. Indeed, a transparent peer review process has been linked to higher-quality reviews in several studies (Kowalczuk et al., 2015; Mehmani, 2016; Walsh, Rooney, Appleby, & Wilkinson, 2000; Wicherts, 2016), although a sequence of studies by Van Rooyen (Van Rooyen, Delamothe, & Evans, 2010; Van Rooyen, Godlee, Evans, Smith, & Black, 1999) failed to find any difference in quality between open and closed reviews. For PhD students, higher-quality peer reviews are especially important because they are at a stage where feedback on their work is crucial for their development. Moreover, high-quality reviews are fairer for PhD students as such reviews can distinguish more accurately between good and bad research (and thus good and bad PhD students).

2) If the identities of reviewers are made public, PhD students can get credit for the reviews they conduct.

McDowell, Knutsen, Graham, Oelker, & Lijek (2019) found that many PhD students do not find their names on peer review reports submitted to journal editorial staff even though they had co-written the report with a more senior researcher. In such instances of “ghostwriting” the PhD student usually does most of the work while the senior researcher is the only one who profits by gaining appreciation from the editor. An open review system would provide public credit to reviewing PhD students (for example by making reviews citable, Hendricks & Lin, 2017) but would also provide less tangible rewards like senior researchers acknowledging their skills as high-quality scientists (see Tweet 1 below).

3) The fact that reviews are made open may also create a motivation for reviewers to be more friendly and constructive in their reviews.

Of course, this would greatly benefit PhD students because given their status they are likely influenced most severely by scathing or harsh reviews. Indeed, some research shows that reviews are potentially more courteous and constructive when they are open (Bravo, Grimaldo, López-Iñesta, Mehmani, & Squazzoni, 2019; Walsh, Rooney, Appleby, & Wilkinson, 2000).

4) Open peer review may lessen the risk of PhD students publishing in predatory journals.

In a situation with open peer review, journals with no or substandard peer review will be identified quickly and will become known as low-quality journals. Predatory journals can no longer hide behind the closed peer review system and will eventually disappear. This makes life easier for PhD students as it is often difficult to navigate the publishing landscape if you are inexperienced with it.

5) Open peer review can help to prevent a practice called citation manipulation (Baas & Fennell, 2019), whereby a reviewer suggests large numbers of citations of their own work to be added to a submitted manuscript.

These are often unwarranted citations, but researchers (especially PhD students) are often coerced into adding them because they desperately want to publish their paper. Of course, only researchers who have a reasonable number of citable papers under their belt would engage in citation manipulation, making it harder for PhD students to compete on the academic job market. Indeed, a prominent case of citation manipulation spurred a group of early career researchers to write an open letter to voice their concern. Open peer review would clearly help here as reviewers thinking of engaging in this unethical practice would think twice if their name and review were public.

6) Open peer review provides PhD students with insight into the mechanics of science.

For example, it allows PhD students to see how other papers have developed over time or to see that landmark papers have been rejected multiple times before being published. Such insights into the peer review process are very valuable for PhD students as they can get more comfortable with the peer review system and can see that rejections are the norm rather than the exception.

7) Open peer review (or streamlined review, see Collabra, 2019) could save PhD students (and other researchers) time.

Once a manuscript is rejected it is usually sent out to another journal to undergo a new round of review. It is likely that the arguments used by the first set of reviewers and the second set of reviewers are similar because the first set of reviews was done behind closed doors and authors often change little in between submissions. It is estimated that 15 million hours are spent every year on restating arguments while reviewing rejected papers (The AJE Team, 2019). In open peer review, researchers can build on previous reviews, and see the development of the paper, which can free up many hours for valuable research. Of course, not all of the wasted review time is accounted for by PhD students, but because they are likely taking longer than the mean 8.5 hours for a review (Ware, 2008) an open peer review system would be especially time-saving for them.

Potential downsides of open peer review for PhD students

1) The main argument put forward against open peer review is that PhD students who write negative reviews may frustrate other researchers who could then retaliate. For example, vindictive researchers could provide negative reviews of the PhD student’s future work or could speak badly about them to their colleagues during a conference or in personal e-mails. This is plausible, but it is unclear whether a blind review system would prevent such practices, as anonymity is by no means guaranteed. Many authors at least think they are able to correctly identify their reviewers (see Tweets 2 and 3), and a review found that masking reviewers’ identities was only successful about half of the time (Snodgrass, 2006). In any case, open peer review at least makes situations of power abuse easier to identify.

2) Whether PhD students will be retaliated against or not, a fear of retaliation does exist in the academic community (see Tweets 4 and 5). This fear could cause PhD students to shy away from criticizing senior researchers in reviews, or could even cause PhD students to decline review requests for work authored by senior researchers. The first scenario would cause suboptimal work by senior researchers to be published more often, reinforcing the academic status quo and decreasing the quality of the scientific literature. The second scenario would prevent PhD students from gaining valuable review experience and would cause the scientific process to slow down. The second scenario seems unlikely, though, in light of findings by Bravo et al. (2019) and Ross-Hellauer, Deppe, & Schmidt (2017) that more junior scholars are more willing to engage in open peer review than more senior scholars.

3) Power dynamics can also play a problematic role when the reviewer is a senior researcher and the manuscript’s author is a PhD student. When the manuscript involves findings that run counter to the senior researcher’s self-interest they may decide to write a condemning review to deter the PhD student from pursuing the work further (see Tweet 6). However, this can also happen in a system of closed peer review. At least in open peer review unfairly harsh and power-abusive reviews can be identified and followed up on. Although there is currently no system for reprimanding power abuse in peer reviews, Bastian (2018) argues that there are ways to do this effectively. For example, we could explicitly label power abuse in peer review as professional misconduct or even harassment in the relevant codes of conduct.

4) In my view, the most problematic downside of open peer review (as I have defined it) is that all kinds of biases could creep into the peer review system. For example, it could be the case that papers from PhD students are rejected more often because PhD students do not have enough prestige or because PhD students more often come up with ideas that challenge the status quo in the literature. And indeed, studies have shown that open peer review may be associated with disproportionate rejections of researchers with low prestige, like PhD students (Seeber & Bacchelli, 2017; Tomkins, Zhang, & Heavlin, 2017). These findings are worrying and should be taken seriously. Importantly, open peer review should not be a goal in itself but should only be implemented when the benefits outweigh the costs. In this case, the benefits of unmasking the identities of authors (e.g., less hassle with masking your manuscripts) are marginal while the potential costs (discrimination against low prestige researchers) are likely high. An open peer review system where the identities of authors are masked therefore seems like the best solution.

Conclusion

My hope is that I won’t be the one to perish, but the simple fact is that there’s not enough funding available to accommodate every PhD student aspiring to a job in academia. That does not need to be a problem as a little academic competition is fine. After all, it only seems fair that the best of the best are tasked with expanding our scientific knowledge. However, the best of the best are only selected as long as the peer review system is fair. Currently, that does not seem to be the case.

In this blog post I have therefore argued for an open peer review system. Implementing this system across the board could increase the quality and tone of peer reviews, could provide PhD students with credit for their reviews, could root out predatory journals, could prevent citation manipulation, could provide PhD students with insight into the mechanics of science, and could lessen the peer review burden for PhD students. Even though the arguments against open peer review should be taken seriously (for example by masking the identities of authors) I am convinced open peer review will create a fairer system. And, as you can see below, the European Journal of Neuroscience, one of the journals that already practices open peer review, wholeheartedly agrees.

Excerpt from the summary report of the European Journal of Neuroscience about their new open peer review system. Retrieved from https://www.wiley.com/network/researchers/being-a-peer-reviewer/transparent-review-at-the-european-journal-of-neuroscience-experiences-one-year-on

References

Baas, J., & Fennell, C. (2019, May). When peer reviewers go rogue-Estimated prevalence of citation manipulation by reviewers based on the citation patterns of 69,000 reviewers. SSRN Working Paper. Retrieved from https://ssrn.com/abstract=3339568.
Bastian, H. (2018). Signing critical peer reviews & the fear of retaliation: What should we do? https://blogs.plos.org/absolutely-maybe/2018/03/22/signing-critical-peer-reviews-the-fear-of-retaliation-what-should-we-do.
Bornmann, L., Mutz, R., & Daniel, H. D. (2010). A reliability-generalization study of journal peer reviews: A multilevel meta-analysis of inter-rater reliability and its determinants. PLoS ONE, 5(12), e14331.
Bravo, G., Grimaldo, F., López-Iñesta, E., Mehmani, B., & Squazzoni, F. (2019). The effect of publishing peer review reports on referee behavior in five scholarly journals. Nature Communications, 10(1), 322.
Cicchetti, D. V. (1991). The reliability of peer review for manuscript and grant submissions: A cross-disciplinary investigation. Behavioral and Brain Sciences, 14(1), 119-135.
Cole, S., Cole, J. R., & Simon, G. A. (1981). Chance and consensus in peer review. Science, 214(4523), 881-886.
Collabra (2019). Editorial Policies. Retrieved from https://www.collabra.org/about/editorialpolicies/#streamlined-review.
Emerson, G. B., Warme, W. J., Wolf, F. M., Heckman, J. D., Brand, R. A., & Leopold, S. S. (2010). Testing for the presence of positive-outcome bias in peer review: a randomized controlled trial. Archives of Internal Medicine, 170(21), 1934-1939.
Hendricks, G., & Lin, J. (2017). Making peer reviews citable, discoverable, and creditable. Retrieved from https://www.crossref.org/blog/making-peer-reviews-citable-discoverable-and-creditable.
Jackson, J. L., Srinivasan, M., Rea, J., Fletcher, K. E., & Kravitz, R. L. (2011). The validity of peer review in a general medicine journal. PLoS ONE, 6(7), e22475.
Kowalczuk, M. K., Dudbridge, F., Nanda, S., Harriman, S. L., Patel, J., & Moylan, E. C. (2015). Retrospective analysis of the quality of reports by author-suggested and non-author-suggested reviewers in journals operating on open or single-blind peer review models. BMJ Open, 5(9), e008707.
Mahoney, M. J. (1977). Publication prejudices: An experimental study of confirmatory bias in the peer review system. Cognitive Therapy and Research, 1(2), 161-175.
McDowell, G. S., Knutsen, J., Graham, J., Oelker, S. K., & Lijek, R. S. (2019). Co-reviewing and ghostwriting by early career researchers in the peer review of manuscripts. BioRxiv, 617373.
Mehmani, B. (2016). Is open peer review the way forward? Retrieved from https://www.elsevier.com/reviewers-update/story/innovation-in-publishing/is-open-peer-review-the-way-forward.
Ross-Hellauer, T. (2017). What is open peer review? A systematic review. F1000Research, 6. https://doi.org/10.12688/f1000research.11369.2
Ross-Hellauer, T., Deppe, A., & Schmidt, B. (2017). Survey on open peer review: Attitudes and experience amongst editors, authors and reviewers. PLoS ONE, 12(12), e0189311.
Seeber, M., & Bacchelli, A. (2017). Does single blind peer review hinder newcomers? Scientometrics, 113(1), 567-585.
Snodgrass, R. (2006). Single-versus double-blind reviewing: an analysis of the literature. ACM Sigmod Record, 35(3), 8-21.
The AJE Team (2019). Peer Review: How We Found 15 Million Hours of Lost Time. Retrieved from https://www.aje.com/arc/peer-review-process-15-million-hours-lost-time.
Tomkins, A., Zhang, M., & Heavlin, W. D. (2017). Reviewer bias in single-versus double-blind peer review. Proceedings of the National Academy of Sciences, 114(48), 12708-12713.
Van Rooyen, S., Delamothe, T., & Evans, S. J. (2010). Effect on peer review of telling reviewers that their signed reviews might be posted on the web: Randomised controlled trial. BMJ, 341, c5729.
Van Rooyen, S., Godlee, F., Evans, S., Smith, R., & Black, N. (1999). Effect of blinding and unmasking on the quality of peer review. Journal of General Internal Medicine, 14(10), 622-624.
Walsh, E., Rooney, M., Appleby, L., & Wilkinson, G. (2000). Open peer review: A randomised controlled trial. The British Journal of Psychiatry, 176(1), 47-51.
Ware, M. (2008). Peer review in scholarly journals: Perspective of the scholarly community–Results from an international study. Information Services & Use, 28(2), 109-112.
Wicherts, J. M. (2016). Peer review quality and transparency of the peer-review process in open access and subscription journals. PLoS ONE, 11(1), e0147913.

A Recap of the Tilburg Meta-Research Day

January 02, 2020 by Olmo van den Akker in Conference

On Friday November 22, 2019, the Meta-Research Center at Tilburg University organized the Tilburg Meta-Research Day. Around 90 interested researchers attended this day that involved three plenary lectures, by John Ioannidis (who received an honorary doctorate from Tilburg University a day earlier), Ana Marušić, and Sarah de Rijcke, and seven parallel sessions on meta-research.

Below you can find the links to the video footage of the three plenary sessions as well as summaries of all seven parallel sessions. The full program of the Tilburg Meta-Research Day can be found here. If you have any questions or comments, please contact us at metaresearch@uvt.nl.

Next up at Tilburg: The 1st European Conference on Meta-Research (July 2021).

Recordings of plenary talks:

Plenary talk by Sarah de Rijcke: Research on Research Evaluation: State-of-the-art and practical insights

Plenary talk by Ana Marušić: Reviewing Reviews: Research on the Review Process at Journals and Funding Agencies

Plenary talk by John Ioannidis: Meta-research in different scientific fields: What lessons can we learn from each other?

Parallel sessions (see below for summaries):

How can meta-research improve research evaluation? (Session leaders: Sarah de Rijcke & Rinze Benedictus)
How can we ensure the future of meta-research? (Session leader: Olmo van den Akker)
How can meta-research improve statistical practices? (Session leader: Judith ter Schure)
How can meta-research improve the Psychological Science Accelerator (PSA) and how can the PSA improve meta-research? (Session leaders: Peder Isager & Marcel van Assen)
How can meta-research improve peer review? (Session leader: Ana Marušić)
How can meta-research improve our understanding of the effects of incentives on the efficiency and reliability of science? (Session leaders: Sophia Crüwell, Leonid Tiokhin, & Maia Salholz-Hillel)
Many Paths: A new way to communicate, discuss, and conduct (meta-)research (Session leaders: Hans van Dijk & Esther Maassen)

How can meta-research improve research evaluation?

Session leaders: Sarah de Rijcke & Rinze Benedictus

The evaluation of research and researchers is currently based on biased metrics like the H-index and the journal impact factor. Several new initiatives have been launched in favor of indicators that correspond better to actual research quality. One of these initiatives is “Redefine excellence” from the University Medical Center (UMC) Utrecht. In this session, Rinze Benedictus briefly outlined the innovations that are being implemented at the UMC Utrecht, after which Sarah de Rijcke led a discussion on how we can properly evaluate whether these innovations are effective.

The session stimulated a productive discussion about differences and similarities between sociology of science and meta-research. Both fields could be termed ‘research on research’, but they appear to be rather distinct, using very different languages, concepts and maybe even springing from different concerns. However, the feeling in the session was that a lot could be gained by more interaction between the fields.

Promising ways to build bridges seem to be:

Shared conferences to share concepts, language and maybe even research questions. A thematic approach (as opposed to method-based) to research questions could also facilitate interaction.
Identification of stakeholders: why are we doing research? For whom?
Shared teaching, e.g. through setting up a joint workshop by CWTS and Tilburg University/Department of Methodology.

How can we ensure the future of meta-research?

Session leader: Olmo van den Akker

In this session, we set out to identify how we can ensure that the field of meta-research will remain vital in the upcoming years. Although the original focus of the session was to identify grant opportunities for meta-research projects, the discussion quickly developed into identifying journals that are open to submissions of meta-research studies. We aimed to draft a list of such journals, which can be found here. The list is far from exhaustive so please add journals if you can. The list mainly pertains to journals and journal collections specifically catered to meta-research, but there are of course also general journals that welcome meta-research submissions. In that sense, we are lucky as meta-researchers that our studies are often suitable for a wide variety of different journals.

That being said, one sentiment that arose in our discussion is that we still feel that we are missing a broad meta-research journal purely for meta-research papers. Such a journal would increase the visibility of our field, but there’s also the danger that more substantive researchers would engage less with meta-research studies published in a journal like this (as opposed to journals in their substantive field). However, we concluded that this might not be so problematic given that the majority of researchers use Google Scholar or other databases to look for papers and are less and less committed to only reading their papers from a few of their favorite journals. Below you can find a list of things that we thought would be valuable to consider when launching a specific meta-research journal.

The journal should be broad and welcome submissions from all areas of meta-research (and even meta-meta-research), as long as the study of the process and outcomes of science is central.
It would be good to have the journal link meta-research to the philosophy of science and science and technology studies (STS) as it appears that these related fields currently do not work together as much as they could.
It would be great if this journal would incorporate the latest meta-research on the effectiveness of journals as journal policy.
The journal could even be a trial ground for journal innovations. For example, the journal could try out whether a designated statistical reviewer for each submission would work (as is customary in medicine) or try out technological innovations facilitating SMART preregistration or multiverse analyses.
Initiating a Meta-Research Society with a dedicated conference could help fund the journal through society fees and conference fees.
The journal would do well to implement the CRediT authorship guidelines.
Preregistration, open data, open code, and open materials should be required, unless authors can convince the editorial team that it is not necessary in their case.
The editorial board should be paid, because a committed editorial board is crucial for the longevity and credibility of the journal. Preferably also reviewers would be paid, but this would require substantially more funding.

In the summer of 2021 Tilburg University will organize another Meta-Research Conference. This one will probably consist of two days and will focus more on the dissemination of meta-research studies. This conference could be a great place to launch a meta-research society and an accompanying meta-research journal.

How can meta-research improve statistics?

Session leader: Judith ter Schure

How can meta-research improve statistics? The conclusion we reached is that it varies a lot per field whether scientists in their experimental design actually feel like they contribute to an accumulating series of studies. In some fields awareness exists that the results of an experiment will someday end up in a meta-analysis with existing experiments, while in others scientists aim to design experiments as 'refreshingly new' as possible. In a table that shows series of studies together in one column if they could be meta-analyzed, this latter approach shows scientists who mainly aim to initiate new columns. This pre-experimental perspective might be different from the meta-analysis perspective, in which a systematic search and inclusion criteria might still force those experiments together in one column, even though they weren't intended that way. This practice might erode trust in meta-analyses that try to synthesize effects from too different experiments.

The discussion was very hesitant towards enforcing rules (e.g. by funders or universities) on scientists in priority setting, such as whether a field needs more columns of 'refreshingly new' experiments, or needs replications of existing studies (extra rows) so a field can settle on a specific topic in one column with a meta-analysis.

In terms of statistical consequences, sequential processes might still be at play if scientists designing experiments know about the results of other experiments that might end up in the same meta-analysis. Full exchangeability in meta-analysis means that no-one would have decided differently on the feasibility or design of an experiment had the results of others been different. If that assumption cannot be met, we should consider studies as part of series in our statistical meta-analysis, even without forcing this approach in the design phase.

Meta-research and the Psychological Science Accelerator

Session leaders: Marcel van Assen & Peder Isager

The Psychological Science Accelerator (PSA) is a standing network of more than 500 laboratories that collect large-scale, non-WEIRD data for psychology studies (see https://psysciacc.org and https://osf.io/93qpg). The PSA is currently running six many-lab projects, and a number of proposed future projects are under review. Importantly, the PSA has established a meta-research working group that is examining both how the PSA can best interface with the meta-research community, and how meta-research can help bolster the quality of research projects conducted at the PSA (see https://docs.google.com/document/d/1D-NmvFE4qaC-dXAWQn16SBLsY9AABCrm8jDDy3-cD8w/edit?usp=sharing).

The session began with an overview of PSA’s organization, presented by Peder, and a discussion of the importance of many-lab studies, presented by Marcel. The slides for these presentations can be found at https://osf.io/wnyga. Afterwards, the majority of the session was devoted to discussing seven predetermined topics related to how the meta-research field and the PSA may learn from each other. Participants could independently provide their suggestions on the seven topics either in a Google Doc (https://bit.ly/2KIUHTW) or on paper. After about half an hour of working independently on the topics, we discussed the participants’ suggestions in the remainder of the session.

The following conclusions can be drawn from our discussion:

There are multiple ways in which the PSA could contribute to meta-research (e.g. by providing access to lab data and project-level data for conducted studies, and by allowing researchers to vary properties of research designs - like the measurement tools - to study effect size heterogeneity, and advance theory by examining boundary conditions).
There are multiple issues within the meta-research field that seem relevant to the PSA. Issues related to theory, measurement and sample size determination were emphasized in particular.
Meta-researchers seem interested in contributing to the PSA research endeavor, but emphasize a lack of both general information about the PSA organization and specific information about what contributions could/would entail (e.g. what volunteer efforts one could contribute to and what studies would be relevant for the “piggy-back” submission policy).

In summary, there seems to be much enthusiasm for the PSA within the meta-research community, and there are many overlapping interests between the PSA and the meta-research community. The points raised in this session will be communicated to the PSA network of researchers, with the hope that it will help facilitate more communication between the two research communities in the future.

Other resources

PSA Data & methods committee bylaws: https://osf.io/p65qe/

Proposing a theory committee at the PSA (blog post): https://pedermisager.netlify.com/post/psa-theory-committee/

How can meta-research improve peer review?

Session leader: Ana Marušić

The session started with a discussion about research approaches to different types of peer review: single blind, double blind, consultative, results free, open, and post-publication peer review. In post-publication peer review, the system that was pioneered by F1000Research, peer review is completely open to study, as all steps in the peer review process and editorial decision making are transparent and available in the public domain. This is not possible for other types of peer review, which remain elusive to researchers. Even in journals that publish the prepublication history of an article (like BioMed Central journals in biomedicine), the information on the review process is available only for published articles, but not for those that were rejected (and which represent the majority of articles submitted to a journal). This is a serious hindrance to meta-research on journal peer review.

The participants discussed the possibilities of having access to complete peer review data, as well as the recent activities of the COST Action PEERE – New Frontiers in Peer Review. PEERE brought together researchers and publishers to establish a database on peer review in journals from different disciplines in order to study all aspects of peer review.

The participants in the session also discussed differences in peer review in different disciplines, as well as the need for qualitative studies on peer review. This methodological approach would be particularly important in understanding the preferences and habits of peer reviewers. Recent findings, both from surveys and from analyses of peer review in journals, show that researchers prefer double blind peer review when they are invited to review for a journal. A qualitative approach would be useful to understand this phenomenon and to build hypotheses for testing with quantitative methods.

How can meta-research improve research incentives?

Session leaders: Sophia Crüwell, Leo Tiokhin, & Maia Salholz-Hillel

Everyone’s talking about “the incentives,” but what does that mean? How can we move beyond our intuitions and towards a deeper understanding of how incentives affect the efficiency and reliability of science? The aim of this session was to explore the role of incentives in science, with the goal of facilitating a broader discussion of what important questions remain unanswered.

Some conclusions from our discussion are outlined below. We would like to invite both session participants and the wider community to contribute to the following library of resources on (meta)research relevant to incentives in science: https://www.zotero.org/groups/2421057/incentives_in_academic_science.

Some conclusions from our discussion:

We need to split incentives, stakeholders, behaviors, and outcomes.
- Should we be focusing on predictors of career success rather than on incentives? However, career success is the outcome, which incentivizes the behaviors (e.g. publications).
We need to understand the parameters within which each incentive operates, i.e., a cost-benefit assessment towards outcomes. We could create a mapping or taxonomy to move the conversation forward. We could do this through an iterative, cross-stakeholder process that would then allow us to decide on next steps.
- Rational choice theory
- Delphi method: cyclical process for circulating solutions between
We should consider both intrinsic and extrinsic incentives.
- Intrinsic incentives include what a person values, such as a desire to help patients, discover something about the world, etc. Extrinsic incentives include tenure and other career payoffs, prestige, etc. Extrinsic incentives may crowd out intrinsic ones.
- Is it possible to separate them? For example, proximate/ultimate from biology. However, intrinsic vs. extrinsic may be a false dichotomy. Extrinsic incentives shape intrinsic ones.
- From a Mertonian sociology of science perspective, the drive to make a discovery is as strong as the drive to refute a discovery. But this doesn’t seem to be the case. So, what are researchers trying to optimize?
Why do incentives exist? They are used as a proxy to measure who is a good scientist. E.g., measured by papers, publication, citation.
- Why do people leave science?
Possible definitions of incentives
- An ontology/framework of types of incentives & what questions you should ask about them; is it a positive or negative incentive?
- Approach & avoidance approach
- Incentive can also be the purpose
- Lots of theories of behavior change already exist; do we need to reinvent the wheel?
- Should we be talking about specific incentives?
- Do incentivized behaviors have to be intentional?
- Knowledge deficiency approach

Many Paths: A new way to conduct, discuss, and communicate (meta-)research

Session leaders: Hans van Dijk & Esther Maassen (in collaboration with Liberate Science)

Slides: https://github.com/emaassen/talks/blob/master/191122-mrd-many-paths.pdf

In Many Paths, we invite researchers from multiple disciplines to participate in a collaborative project to answer one research question, and we allow an emergent process to occur in the theory, data, results, and conclusion steps thereafter. Given that results are often path dependent, and *many paths* can be taken in a research process, we aim to examine what paths a research project initiates, prunes, and merges. The Many Paths model offers insight into how researchers from different disciplines approach and study the same question. We conduct and communicate the Many Paths research process in steps ("as-you-go"), instead of after the research is completed ("after-the-fact"). During our session, we also discussed the relationship of Many Paths to previous Many Projects (i.e., the Reproducibility Project Psychology, Many Labs, and Many Analysts).

Our goal of the session was to introduce the Many Paths model and to gather feedback and suggestions on the project. Reactions to the proposed model and the new way of communication were generally positive. Many Paths appears to provide the opportunity to gather a large amount of data from various disciplines in a transparent manner. It also allows for diversity and inclusivity. It would be interesting to find out if and how researchers decide to collaborate across disciplines. However, they might be hesitant to do so because of the notable difference in what they are used to now (i.e., competition) compared to what they could do (i.e., collaboration). Whereas some people claimed a project such as Many Paths would provide clear answers to the proposed research question, some expressed concerns about the possibility of excessive fragmentation or disintegration of paths, and difficulties with combining information from various conclusions and paths. Another possible issue that was mentioned relates to the quality assurance for the research output of Many Paths; a threshold should be in place to ensure contributions adhere to a certain quality. It should also be clear how the code of conduct would be enforced.

Meta-research at the Psychological Science Accelerator

December 16, 2019 by Amir in Conference

On Friday, November 22, 2019, the Meta-Research Center at Tilburg University (https://metaresearch.nl/) organized the Meta-Research Day. Around 90 researchers attended the day, which involved three plenary lectures, by John Ioannidis (who received an honorary doctorate from Tilburg University a day earlier), Ana Marušić, and Sarah de Rijcke, and seven parallel sessions on meta-research. One of these sessions was titled How can meta-research improve the Psychological Science Accelerator (PSA) and how can the PSA improve meta-research?, and was led by Peder Isager and Marcel van Assen. Nineteen participants attended this session.

Meta-Research Center at ICPS Paris

March 19, 2019 by Amir in Conference

On March 7-9, 2019, the International Convention of Psychological Science (ICPS) of the Association for Psychological Science (APS) was held in Paris, France. The Meta-research group Tilburg (co-)organized three sessions at the ICPS. Here is a short overview of the three sessions and their presentations, including links to the presentations.

Preregistration: Common Issues and Best Practices (Chair: Marjan Bakker)

Preregistration has been lauded as one of the key solutions to the many issues in the field of psychology (Nosek, Ebersole, DeHaven, & Mellor, 2018). For example, researchers have argued that preregistration tackles the problems of publication bias, reporting bias, and the opportunistic use of researcher degrees of freedom in data analysis (also called questionable research practices or p-hacking). However, skeptics have put forward a broad list of issues concerned with preregistration. For example, they have argued that preregistration stifles researchers’ creativity, is not effective in the case of secondary data or qualitative data, and is only intended for confirmatory research. In this symposium we aimed to touch upon some of these issues.

Andrea Stoevenbelt, in her talk “Challenges to Preregistering a Direct Replication - Experiences from Conducting an RRR on Stereotype Threat”, described the challenges surrounding the preregistration of direct replication studies, drawing on her experience of conducting a registered replication report of the seminal study by Johns, Schmader, and Martens (2005) on stereotype threat.

Olmo van den Akker, in his talk “The Do’s and Don'ts of Preregistering Secondary Data Analyses”, presented a tutorial for a template that can be used to preregister secondary data analyses. Preregistering secondary data analysis differs from preregistering primary data analysis mainly because researchers already have some knowledge about the data (through their own work using the data or through reading other people's work using the data). Olmo's take-home message from this talk is: "Specify your prior knowledge of the data set from your own previous use of the data and from other researchers’ previous use of the data, preferably for each author separately."

In all, this symposium touched upon many of the issues that have been raised about preregistration and hopefully encouraged researchers from a wide range of fields to give preregistration a try.

Issues with Meta-Analysis: Bias, Heterogeneity, Reproducibility (Chair: Jelte Wicherts)

The popularity of meta-analysis has been increasing over the last decades, which is reflected by the rapid increase in the relative number of published meta-analyses. One question of meta-research is what we learn from all these meta-analyses about a certain research topic, systematic biases, meta-analytic outcomes, or quality of coding. All talks in this symposium correspond to these meta-questions on meta-analysis.

Jelte Wicherts, in his talk “Effect Sizes, Power, and Biases in Intelligence Research: A Meta-Meta-Analysis”, presented the results of a meta-meta-analysis to estimate the average effect size, median power, and evidence of bias (publication bias, decline effect, early extremes effect, citation bias) in the field of intelligence research.

Anton Olsson Collentine presented on the “Limited evidence for widespread heterogeneity in psychology”. He examined the heterogeneity of all meta-analyses of Many Labs studies and registered multi-lab replication studies, both of which are presumably not affected by publication bias or other biases. This research is important as many researchers stress the potential effect of moderators when trying to explain the failure of replication studies.

Esther Maassen, in her talk “Reproducibility of Psychological Meta-analyses”, systematically assessed the prevalence of reporting errors and inaccuracy of computations within meta-analyses. She documented whether coding errors affected meta-analytic effect sizes and heterogeneity estimates, as well as how issues related to heterogeneity, outlying primary studies, and signs of publication bias were dealt with.

Meta-analysis: Informative Tools (Chair: Marcel van Assen)

Meta-analysis is a technique that statistically combines effect sizes from independent primary studies on the same topic, and is now seen as the “gold standard” for synthesizing and summarizing the results from multiple primary studies. The main research objectives of a meta-analysis are (i) estimating the average effect, (ii) assessing heterogeneity of the true effect size, and, if the true effect size differs across studies, (iii) incorporating moderator variables in the meta-analysis to explain this heterogeneity. Many different tools, visual (e.g., the funnel plot) or purely statistical (e.g., techniques to estimate heterogeneity or adjust for publication bias), have been developed to reach these objectives.
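To make objectives (i) and (ii) concrete, here is a minimal sketch of a random-effects meta-analysis in Python. It uses the DerSimonian-Laird estimator of the between-study variance, which is one common choice and not necessarily the method used by any of the speakers. The function name and the example numbers are hypothetical, and moderators (objective iii) and publication bias corrections are deliberately left out.

```python
import math


def random_effects_meta(effects, variances):
    """Illustrative DerSimonian-Laird random-effects meta-analysis.

    effects   -- observed effect sizes, one per primary study
    variances -- sampling variances of those effect sizes
    """
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")

    # Fixed-effect step: inverse-variance weights and weighted mean.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w

    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))

    # DerSimonian-Laird estimate of between-study variance, truncated at 0.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects step: weights now include the between-study variance.
    w_re = [1.0 / (v + tau2) for v in variances]
    sum_w_re = sum(w_re)
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum_w_re
    se = math.sqrt(1.0 / sum_w_re)

    # I^2: share of total variability attributed to heterogeneity.
    i2 = max(0.0, (q - (k - 1)) / q) if q > 0 else 0.0

    return {"pooled": pooled, "se": se, "tau2": tau2, "Q": q, "I2": i2}


if __name__ == "__main__":
    # Hypothetical effect sizes and sampling variances for five studies.
    result = random_effects_meta(
        effects=[0.10, 0.35, 0.22, 0.60, 0.05],
        variances=[0.02, 0.03, 0.01, 0.05, 0.02],
    )
    for name, value in result.items():
        print(f"{name}: {value:.3f}")
```

The pooled estimate addresses objective (i), while Q, tau2, and I2 summarize heterogeneity for objective (ii). In practice, a dedicated meta-analysis package would be a safer choice than hand-rolled code like this.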

In this symposium, four speakers explain visual and statistical tools helping researchers to make sense of information in the meta-analysis and provide recommendations for applying these tools in practice. The focus is more on application than on the statistical background of the tools. Xinru Li from Leiden University will explain how classification and regression trees (CART) can be used to explain heterogeneity in effect size in a meta-analysis. The current meta-analysis methodology lacks appropriate methods to identify interactions between multiple moderators when no a priori hypotheses have been specified. The proposed meta-CART approach has the advantage that it can deal with many moderators and is able to identify interaction effects between them.

Hilde Augusteijn, in her talk “Posterior Probabilities in Meta-Analysis: An Intuitive Approach of Dealing with Publication Bias”, introduced a new meta-analytical method that makes use of both Bayesian and frequentist statistics. This method evaluates the probability of the true effect size being zero, small, medium or large, and the probability of true heterogeneity being zero, small, medium or large, while correcting for publication bias. The approach, which intuitively provides an evaluation of uncertainty in the estimates of effect size and heterogeneity, is illustrated with real-life examples.

Robbie van Aert, in his talk “P-uniform*: A new meta-analytic method to correct for publication bias”, presented a new method to correct for publication bias in a meta-analysis. In contrast to the vast majority of existing methods to correct for publication bias, the proposed p-uniform* method can also be applied if the true effect size in a meta-analysis is heterogeneous. Moreover, the method enables meta-analysts to estimate and test for the presence of heterogeneity while taking into account publication bias. An easy-to-use web application will be presented for applying p-uniform* and recommendations for assessing the impact of publication bias will be given.

Marcel van Assen, in his talk “The Meta-plot: A Descriptive Tool for Meta-analysis”, explained and illustrated the meta-plot using real-life meta-analyses. The meta-plot improves on the funnel plot and shows in one figure the overall effect size and its confidence interval, the quality of primary studies with respect to their power to detect small, medium, or larger effects, and evidence of publication bias.

Presentation on Teaching Open Science: Turning Students into Skeptics, not Cynics (Presenter: Michèle Nuijten)

Michèle Nuijten, in her presentation “Teaching Open Science: Turning Students into Skeptics, not Cynics”, focused on strategies to teach undergraduates about replicability and open science. Psychology’s “replication crisis” has led to many methodological changes, including preregistration, larger samples, and increased transparency. Nuijten argued that psychology students should learn these open science practices from the start. They should adopt a skeptical attitude – but not a cynical one.

Michèle Nuijten was also discussant at two sessions:

“What can you do with nothing? Informative null results in hard-to-reach populations” (discussant). In hard-to-reach populations, it is especially difficult and time consuming to collect data, resulting in smaller sample sizes and inconclusive results. Therefore it is particularly important to understand what null results can mean. In this symposium, we discussed results from our own experimental data and how meta-analyses and Bayes factors can increase informativeness.
“Improving the transparency of your research one step at a time” (chair & discussant). Many solutions have been proposed to increase the quality and replicability of psychological science. All these options can be a bit overwhelming, so in this symposium, we focused on some easy-to-implement, pragmatic strategies and tools, including preprints, Bayesian statistics, and multi-lab collaboration.

Plan S: Are the Concerns Warranted?

February 14, 2019 by Michele Nuijten

Blog by Olmo van den Akker. A Dutch version has been published by ScienceGuide.

Plan S is the ambitious plan of eleven national funding agencies together with the European Commission (cOAlition S) to make all research funded by these organisations publicly accessible from 2020 onward. Since its announcement on September 4th 2018 the plan’s contents and consequences have been widely debated. When the guidelines for the implementation of the plan were presented at the end of November some aspects were clarified, but it also became apparent that a lot of details are still unclear. Here, I will give my thoughts on four main themes surrounding Plan S: early career researchers, researchers with less financial backing, scholarly societies, and academic freedom.

The consequences of Plan S for early career researchers

Because of the low job security in the early stage of an academic career it is possible that early career researchers will be negatively affected by Plan S. Plan S currently involves 14 national funding agencies (including India, which announced its participation on January 12th) and draws support from big private funds like the Wellcome Trust and the Bill & Melinda Gates Foundation. Combined, these funds represent not more than 15% of the available research money in the world.

This relatively small market share could hurt young researchers dependent on Plan S funders as they will not be allowed to publish in some prestigious, but closed access journals. When researchers funded by other agencies can put these publications on their CV they would have an unfair advantage on the academic labour market. Only when Plan S or similar initiatives would cover a critical mass of the world’s research output would the playing field be levelled.

A crucial assumption underlying this reasoning is the continuation of the prestige model of scientific journals. However, Plan S specifically expresses the ambition to change the way researchers are being evaluated. Instead of looking at the number of publications in prestigious journals researchers should be evaluated on the quality of their work. This point has been emphasized in the San Francisco Declaration on Research Assessment (DORA).

DORA has been signed by more than 1,000 research organizations and more than 13,500 individuals worldwide, indicating that the scientific community wants to get rid of classical quality indicators like the impact factor and the h-index in favour of a new system of research assessment. One way to evaluate researchers is to look at the extent to which their work is open and reproducible. Plan S strongly supports open science and could therefore even be beneficial to early career researchers. However, it should be noted that cOAlition S should play a proactive role in this culture change. The fact that so many people signed DORA does not mean that they will act on its principles.

The consequences of Plan S for researchers with less financial backing

It is expected that Plan S will cause many journals that currently have a closed subscription model to transition to an author-pays model where the author pays so-called article processing charges (APCs) to get their work published open access. Many researchers have raised concerns that Plan S would make publishers increase their profits by increasing their APCs. Because researchers are forced to publish open access they are also forced to pay these higher APCs. For researchers with less financial backing (for example from smaller institutions or developing countries) the increased APCs may be unaffordable, which would crowd them out of science. However, there are several counterpoints to this scenario.

First, Plan S involves the condition that journals make their APCs reasonable and transparent. If this condition is met, it is expected that journal APCs go down. This is illustrated by the fact that many open access journals have no or very low APCs. It is also underscored by a white paper of the Max Planck Society that shows that an open access system with APCs comes with significantly lower cost than the current system. To attain this scenario, it is important that cOAlition S monitors that journal APCs are indeed reasonable and transparent. Commercial publishers have a lot of market power and will undoubtedly try to artificially increase their APCs. cOAlition S has already announced that they will develop a database like the Directory of Open Access Journals, in which researchers can find journals that comply with the demands set out in Plan S. Hopefully, the necessity for journals to be included in that database will make sure that they set affordable APCs.

Second, representatives of cOAlition S have already clarified that they will instate a fund that can help researchers pay due APCs. This fund will be available for funded researchers as well as non-funded researchers that cannot reasonably be expected to pay APCs. The way this APC fund will be financed is as of yet unclear, but it is clear that individual researchers do not need to come up with the costs of open access themselves.

The consequences of Plan S for scholarly societies

Like regular journals, journals from scholarly societies will have to move from a subscription model to an author-pays model. Representatives of scholarly societies fear that this will be the end of them. Societies would face high investments to make the open access transition. For example, to be Plan S compliant, journals need to make their articles fully machine-readable by transforming them into a JATS XML format. In addition, they need to create an Application Programming Interface (API). Developing a digital infrastructure like this is costly and can be problematic given that societies lose their subscription fees from January 1st 2020.

Therefore, it is essential that cOAlition S plays a proactive role and tries to facilitate the open access transition for society journals on a case-by-case basis. A starting point for cOAlition S could be the results of a study by Wellcome Trust that will investigate how scholarly societies can transition to a Plan S compliant model as efficiently as possible. One possibility is that cOAlition S (partly) subsidizes the transition costs of journals and guides them in developing the required digital infrastructure.

The consequences of Plan S for academic freedom

One common concern about Plan S is that it restricts the freedom of researchers to determine what and how they do research, and how they disseminate their research results. This academic freedom is guaranteed by governments and academic institutions with the aim of insulating researchers from censorship and other negative consequences of their work. In this way, researchers can focus on their research without having to worry about any outside influence. When Plan S is implemented, researchers can no longer publish in paywalled journals. This would hamper researchers’ freedom to disseminate their research in the way they see fit.

However, one can raise doubts about the extent to which researchers currently do have the freedom to choose where and how to publish their work as researchers’ hands are generally tied by demands from scientific journals. They must abide by strict word limits and specific layout standards, and usually have to hand over their copyright to the commercial publisher. Moreover, to move up in academia, they are almost forced to publish in prestigious journals. Therefore, appealing to academic freedom to criticize Plan S is unconvincing, especially given that Plan S does not place any restrictions on research contents and on the methods researchers employ.

A more ideological point against the academic freedom argument is that academic freedom is part of an unofficial reciprocal arrangement between researchers and society. Researchers receive funding and freedom from society, but in return they should incorporate the interests of society into their decision-making. Publishing in a prestigious but closed journal does not fit with this reciprocal arrangement. Currently, many researchers have access to closed journals because university libraries pay a subscription fee to the publishers of those journals. However, not all researchers can take advantage of these subscriptions because their organisation cannot afford them or because the negotiations about subscription fees were unsuccessful.

Because of the limited access to research results, scientific progress slows down. This is problematic in itself, but can have major consequences for research about climate change or contagious diseases. In addition, the subscription fees demanded by publishers are disproportionately high. In 2018, The Netherlands paid more than 12 million euros to one of the main scientific publishers, Elsevier. A big chunk of that money ended up as profit for Elsevier and was not reinvested into science. Obviously, this practice does not fit with the reciprocal arrangement between researchers and society either.

Conclusion

After their call for feedback, cOAlition S was flooded by a wave of comments and ideas about Plan S, of which the main ones are outlined above. Even alternative plans were proposed with names like Plan U and Plan T, which were often even more radical than Plan S. Although such initiatives are very valuable to the scientific community, it is hard to create a new infrastructure for scholarly communication without a large budget and without the support of a critical mass. cOAlition S does have a large budget and is getting increasing support from the scientific community. That’s why I think that Plan S is currently the most efficient way forward, especially because the potential issues with the plan are relatively straightforward to prevent. I have faith that cOAlition S will take the responsibility that follows from initiating this ambitious plan. Let us, as a research community, place our trust in cOAlition S and back it on the way toward a more open science.

Open Science: The Way Forward

November 29, 2018 by Michele Nuijten

Blog by Michèle Nuijten for Tilburg University on the occasion of World Digital Preservation Day.

We have all seen headlines about scientific findings that sounded too good to be true. Think about the headline “a glass of red wine is the equivalent of an hour in the gym”. A headline like this may make you skeptical right away, and rightly so. In this particular case, it turned out that several journalists got carried away, and the researchers never made such claims.

However, sometimes the exaggeration of an effect already takes place in the scientific article itself. Indeed, increasing evidence shows that many published results might be overestimated, or even false.

This excess of overestimated results is probably caused by a complex interaction of different factors, but there are several leads as to what the most important problems might be.

The first problem is publication bias: studies that “find something” have a larger probability to be published than studies that don’t find anything. You can imagine that if we only present the success stories, the overall picture gets distorted and overly optimistic.

This publication bias may lead to the second problem: flexible data analysis. Scientists can start showing strategic behavior to increase their chances to publish their findings: “if I leave out this participant, or if I try a different analysis, maybe my data will show me the result I was looking for.” This can even happen completely unconsciously: in hindsight, all these decisions may seem completely justified.

The third problem that can distort scientific results is statistical errors. Unfortunately, it seems that statistical errors in publications are widespread (see, e.g., the prevalence of errors in psychology).

The fact that we make mistakes and have human biases doesn’t make us bad scientists. However, it does mean that we have to come up with ways to avoid or detect these mistakes, and that we need to protect ourselves from our own biases.

I believe that the best way of doing that is through open science.

One of the most straightforward examples of open science is sharing data. If raw data are available, you can see exactly what the conclusions in an article are based on. This way, any errors or questionable analytical choices can be corrected or discussed. Maybe the data can even be used to answer new research questions.

Sharing data can seem as simple as posting them on your own personal website, but this has proven to be rather unstable: URLs die, people move institutions, or they might leave academia altogether. A much better way to share data is via certified data repositories. That way, your data are safely stored for the long run.

Open data is only one example of open science. Another option is to openly preregister research plans before you actually start doing the research. You can also make materials and analysis code open, publish open access, or write public peer reviews.

Of course, it is not always possible to make everything open in every research project. Practical issues such as privacy can restrict how open you can be. However, you might be surprised by how many other things you can make open, even if you can’t share your data.

I would like to encourage you to think about ways to make your own research more open. Maybe you can preregister your plans, maybe you can publish open access, maybe you can share your data. No matter how small the change is, opening things up will make our science better, one step at a time.

This blog has been posted on the website of Tilburg University: https://www.tilburguniversity.edu/current/news/blog-michele-nuijten-open-science/

statcheck – A Spellchecker for Statistics

February 28, 2018 by Michele Nuijten

Guest blog for LSE Impact Blog by Michèle Nuijten

If you’re a non-native English speaker (like me), but you often have to write in English (like me), you will probably agree that the spellchecker is an invaluable tool. And even when you do speak English fluently, I’m sure that you’ve used the spellchecker to filter out any typos or other mistakes.

When you’re writing a scientific paper, there are many more things that can go wrong than just spelling. One thing that is particularly error-prone is the reporting of statistical findings.

Statistical errors in published papers

Unfortunately, we have plenty of reasons to assume that copying the results from a statistical program into a manuscript doesn’t always go well. Published papers often contain impossible means, coefficients that don’t add up, or ratios that don’t match their confidence intervals.

In psychology, my field, we found a high prevalence of inconsistencies in reported statistical test results (although these problems are by no means unique to psychology). Most conclusions in psychology are based on “null hypothesis significance testing” (NHST) and look roughly like this:

“The experimental group scored significantly higher than the control group, t(58) = 1.91, p < .05”.

This is a t-test with 58 degrees of freedom, a test statistic of 1.91, and a p-value that is smaller than .05. A p-value smaller than .05 is usually considered “statistically significant”.

This example is, in fact, inconsistent. If I recalculate the p-value based on the reported degrees of freedom and the test statistic, I would get p = .06, which is not statistically significant anymore. In psychology, we found that roughly half of papers contain at least one inconsistent p-value, and in one in eight papers this may have influenced the statistical conclusion.

Even though most inconsistencies we found were small and likely to be the result of innocent copy-paste mistakes, they can substantively distort conclusions. Errors in papers make results unreliable, because they become “irreproducible”: if other researchers would perform the same analyses on the same data, a different conclusion would roll out. This, of course, affects the level of trust we place in these results.

statcheck

The inconsistencies I’m talking about are obvious. Obvious, in the sense you don’t need raw data to see that certain reported numbers don’t match. The fact that these inconsistencies do arise in the literature means that peer review did not filter them out. I think it could be useful to have an automated procedure to flag inconsistent numbers. Basically, we need a spellchecker for stats. To that end, we developed statcheck.

statcheck roughly works as follows. First, it converts articles to plain-text files. Next, it searches the text for statistical results. This is possible in psychology, because of the very strict reporting style (APA); stats are always reported in the same way. When statcheck detects a statistical result, it uses the reported degrees of freedom and test statistic to recompute the p-value. Finally, it compares the reported p-value with the recalculated one, to see if they match. If not, the result is flagged as an inconsistency. If the reported p-value is significant and the recalculated one is not, or vice versa, it is flagged as a gross inconsistency. More details about how statcheck works can be found in the manual.
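The steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not statcheck's actual code (statcheck is an R package): the function names, the regular expression, and the simplified rules for "inconsistency" and "gross inconsistency" are all assumptions made for this example, and it handles only t-tests with two-tailed p-values.

```python
import math
import re

# Hypothetical pattern for APA-style t-tests such as "t(58) = 1.91, p < .05".
APA_T = re.compile(r"t\((\d+)\)\s*=\s*(-?\d*\.?\d+),\s*p\s*([<>=])\s*(\d*\.\d+)")


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_norm) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value, integrating the density on [0, |t|] with Simpson's rule."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)


def check(text, alpha=0.05):
    """Recompute each reported p-value and flag (gross) inconsistencies."""
    results = []
    for df_s, t_s, comp, p_s in APA_T.findall(text):
        df, t, p_rep = int(df_s), float(t_s), float(p_s)
        p_new = two_tailed_p(t, df)
        if comp == "<":
            consistent = p_new < p_rep
        elif comp == ">":
            consistent = p_new > p_rep
        else:
            # Simplified rounding rule: compare at the reported number of decimals.
            decimals = len(p_s.split(".")[1])
            consistent = abs(round(p_new, decimals) - p_rep) < 1e-9
        reported_sig = comp != ">" and p_rep <= alpha
        recomputed_sig = p_new < alpha
        results.append({
            "df": df,
            "t": t,
            "reported": f"p {comp} {p_s}",
            "recomputed_p": p_new,
            "consistent": consistent,
            # Gross: the statistical conclusion itself changes.
            "gross": (not consistent) and reported_sig != recomputed_sig,
        })
    return results


sentence = ("The experimental group scored significantly higher than "
            "the control group, t(58) = 1.91, p < .05")
for result in check(sentence):
    print(result)
```

Run on the example sentence from earlier in this post, the recomputed p-value comes out at about .06, so the result is flagged both as an inconsistency and as a gross inconsistency.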

statcheck’s accuracy

It is important that we know how accurate statcheck is in flagging inconsistencies. We don’t want statcheck to mark large numbers of correct results as inconsistent, and, conversely, we also don’t want statcheck to wrongly classify results as correct when they are actually inconsistent. We investigated statcheck’s accuracy by running it on a set of articles for which inconsistencies were also manually coded.

When we compared statcheck’s results with the manual codings, we found two main things. First, statcheck detects roughly 60% of all reported stats. It missed the statistics that were not reported completely according to APA style. Second, statcheck did a very good job in flagging the detected statistics as inconsistencies and gross inconsistencies. We found an overall accuracy of 96.2% to 99.9%, depending on the specific settings. (There has been some debate about this accuracy analysis. A summary of this discussion can be found here.)

Even though statcheck seems to perform well, its classifications are not 100% accurate. But, to be fair, I doubt whether any automated algorithm could achieve this (yet). And again, the comparison with the spellchecker still holds; mine keeps telling me I misspelled my own name, and that it should be “Michelle” (it really shouldn’t be).

One major advantage of using statcheck (or any algorithm) for statistical checks is its efficiency. It will take only seconds to flag potential problems in a paper, rather than going through all the reported stats and checking them manually.

An increasing number of researchers seem convinced of statcheck’s merits; the R package has been downloaded more than 8,000 times, while the web app has been visited over 23,000 times. Additionally, two flagship psychology journals have started to use statcheck as a standard part of their peer review process. Testimonies on Twitter illustrate the ease and speed with which papers can be checked before they’re submitted:

Just statcheck-ed my first co-authored manuscript. On my phone while brushing my teeth. Great stuff @MicheleNuijten @SachaEpskamp @seanrife!
— Anne Scheel (@annemscheel) October 22, 2016

Automate the error-checking process

More of these “quick and dirty spellchecks” for stats are being developed (e.g. GRIM to spot inconsistencies in means; or p-checker to analyse the consistency and other properties of p-values), and an increasing number of papers and projects make use of automated scans to retrieve statistics from large numbers of papers (e.g. here, here, here, and here).
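To give a feel for how simple such checks can be, here is a rough sketch of the idea behind a GRIM-style check. It assumes the underlying scores are integers (for example, single Likert items), so that the reported mean must equal some whole-number total divided by the sample size. The function name and rounding rule are assumptions made for this illustration, not the published procedure.

```python
import math


def grim_consistent(reported_mean, n, decimals=2):
    """Return True if some integer total divided by n rounds to the reported mean."""
    half_unit = 0.5 * 10 ** (-decimals)
    target = f"{reported_mean:.{decimals}f}"
    # Only totals close to mean * n can possibly round to the reported mean.
    lowest = math.floor((reported_mean - half_unit) * n)
    highest = math.ceil((reported_mean + half_unit) * n)
    for total in range(lowest, highest + 1):
        if f"{total / n:.{decimals}f}" == target:
            return True
    return False


print(grim_consistent(5.18, 28))  # 145 / 28 rounds to 5.18
print(grim_consistent(5.19, 28))  # no integer total gives 5.19
```

With two reported decimals a check like this is only informative for fairly small samples; once n is large, almost every mean can be produced by some integer total.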

In an era where scientists are pressed for time, automated tools such as statcheck can be very helpful. As an author you can make sure you didn’t mistype your key results, and as a peer reviewer you can quickly check if there are obvious problems in the statistics of a paper. Reporting statistics can just as easily go wrong as grammar and spelling; so when you’re typing up a research paper, why not also check your stats?

More information about statcheck can be found at: http://statcheck.io

Journal Policies that Encourage Data Sharing Prove Extremely Effective

September 05, 2017 by Michele Nuijten

Guest blog for LSE Impact Blog by Michèle Nuijten

For science to work well we should move towards opening it up. That means sharing research plans, materials, code, and raw data. If everything is openly shared, all steps in a study can be checked, replicated, or extended. By sharing everything we let the facts speak for themselves, and that’s what science is all about.

Unfortunately, in my own field of psychology, raw data are notoriously hard to come by. Statements in papers such as “all data are available upon request” are often void, and data may get lost if a researcher retires, switches university, or even buys a new computer. We need to somehow incentivise researchers to archive their data online in a stable repository. But how?

Currently it is not in a scientist’s interests to put effort into making data and materials available. Scientists are evaluated based on how much they publish and how often they’re cited. If they don’t receive credit for sharing all details of their work, but instead run the risk colleagues will criticise their choices (or worse: find errors!), why would they do it?

So now for the good news: incentivising researchers to share their data may be a lot easier than it seems. It could be enough for journals to simply ask for it! In our recent preprint, we found journal policies that encourage data sharing are extremely effective. Journals that require data sharing showed a steep increase in the percentage of articles with open data from the moment these policies came into effect.

In our study we looked at five journals. First, we compared two journals in decision making research: Judgment and Decision Making (JDM), which started to require data sharing from 2011; and the Journal of Behavioral Decision Making (JBDM), which does not require data sharing. Figure 1 shows a rapidly increasing percentage of articles in JDM sharing data (up to 100%!), whereas nothing happens in JBDM. The same pattern holds for psychology articles from open access publisher PLOS (with its data-sharing policy taking effect in 2014) and the open access journal Frontiers in Psychology (FP; no such data policy).

Similarly, the journal Psychological Science (PS) also contained increasing numbers of articles with open data after it introduced its Open Practice Badges in 2014. You can earn a badge for sharing data, sharing materials, or preregistering your study. A badge is basically a sticker for good behaviour on your paper. Although this may sound a little kindergarten, believe me: you don’t want to be the one without a sticker!

Figure 1: Percentage of articles per journal to have open data. A solid circle indicates no open-data policy; an open circle indicates an open-data policy. Source: Nuijten, M. B., Borghuis, J., Veldkamp, C. L. S., Alvarez, L. D., van Assen, M. A. L. M., & Wicherts, J. M. (2017) “Journal Data Sharing Policies and Statistical Reporting Inconsistencies in Psychology”, PsyArXiv Preprints. This work is licensed under a CC0 1.0 Universal license.

The increase in articles with available data is encouraging and has important consequences. With raw data we are able to explore different hypotheses from the same dataset, or combine information of similar studies in an Individual Participant Data (IPD) meta-analysis. We could also use the data to check if conclusions are robust to changes in the analyses.

The availability of research data would increase the quality of science as a whole. With raw data we have the possibility to find and correct mistakes. On top of that, the probability of making a mistake is likely to be lower once you have gone to the effort of archiving your data in such a way that another person can understand it. The process of archiving data for future users could also provide a barrier to taking advantage of the flexibility in data analysis that could lead to false positive results. Enforcing data sharing might even deter fraud.

Of course, data-sharing policy is not a “one-size-fits-all” solution. In some fields of psychological research (e.g. sexology or psychopathology) data can be very personal and sensitive, and can’t simply be posted online. Luckily there are increasingly sophisticated techniques to anonymise data, and often materials and analysis plans can still be shared to increase transparency.

It is also important to acknowledge the time and effort it took to collect the original data. One way to do this is to set a fixed period of time during which only the original researchers have access to the data. That way they get a head start in publishing studies based on the data. When this period is over and others can also use the data, the original authors should, of course, be properly acknowledged through citations, or even, in some cases, co-authorship.

There are many different ways to encourage openness in science. My hope is that more journals will soon follow and start implementing an open-data policy. But aside from merely requiring data sharing, journals should also check if the data are actually available. To illustrate the importance of this, our study found that one third of PLOS articles claiming to have open data actually did not deliver (for similar numbers, see the data by Chris Chambers).

And many (including myself) would even like to go one step further. Datasets should not only be available, they should also be stored in such a way that others can use them (see the FAIR Data Principles). A good way to influence the usability of open data might be the use of the Open Practice Badges. It turned out that in PS, the badges not only increased the availability of data, but also the relevance, usability, and completeness of the data. Another way of ensuring data quality, but also recognition for your work, is to publish your data in a special data journal, such as the Journal of Open Psychology Data.

Even though data sharing in psychology is not yet the status quo, several journals are already helping our field take a step in the right direction. As a matter of fact, the American Psychological Association (APA) has recently announced it will give its editors the option of awarding badges. It is very encouraging that journal policies on data sharing, or even an intervention as simple as a badge to reward good practice can cause such a surge in open data. Therefore, I hereby encourage all editors in all fields to start requiring data. And while we’re at it, why not ask for research plans, materials, and analysis code too?

I would like to thank Marcel van Assen for his helpful comments while drafting this blog.

This blog post is based on the author’s co-written article, “Journal Data Sharing Policies and Statistical Reporting Inconsistencies in Psychology”, available at http://doi.org/10.1525/collabra.102

BayesMed and statcheck

March 20, 2017 by Michele Nuijten

Read more on Association for Psychological Science

Michèle wrote a guest post for the APS Observer about her R packages "BayesMed" and "statcheck": two packages that deal with problems related to p-values, but in very different ways. Read the full piece here.

The Replication Paradox

January 05, 2016 by Michele Nuijten

Guest blog for The Replication Network by Michèle Nuijten

Lately, a lot of attention has been paid to the excess of false positive and exaggerated findings in the published scientific literature. In many different fields there are reports of an impossibly high rate of statistically significant findings, and studies of meta-analyses in various fields have shown overwhelming evidence for overestimated effect sizes.

The suggested solution for this excess of false positive findings and exaggerated effect size estimates in the literature is replication. The idea is that if we just keep replicating published studies, the truth will come to light eventually.

This intuition also showed in a small survey I conducted among psychology students, social scientists, and quantitative psychologists. I offered them different hypothetical combinations of large and small published studies that were identical except for the sample size – they could be considered replications of each other. I asked them how they would evaluate this information if their goal was to obtain the most accurate estimate of a certain effect. In almost all of the situations I offered, the answer was almost unanimously: combine the information of both studies.

This makes a lot of sense: the more information the better, right? Unfortunately this is not necessarily the case.

The problem is that the respondents forgot to take into account the influence of publication bias: statistically significant results have a higher probability of being published than non-significant results. And only publishing significant effects leads to overestimated effect sizes in the literature.

But wasn’t this exactly the reason to take replication studies into account? To solve this problem and obtain more accurate effect sizes?

Unfortunately, there is evidence from multi-study papers and meta-analyses that replication studies suffer from the same publication bias as original studies (see below for references). This means that both types of studies in the literature contain overestimated effect sizes.

The implication of this is that combining the results of an original study with those of a replication study could actually worsen the effect size estimate. This works as follows.

Bias in published effect size estimates depends on two factors: publication bias and power (the probability that you will reject the null hypothesis, given that it is false). Studies with low power (usually due to a small sample size) contain a lot of noise, and the effect size estimate will be all over the place, ranging from severe underestimations to severe overestimations.

This in itself is not necessarily a problem; if you would take the average of all these estimates (e.g., in a meta-analysis) you would end up with an accurate estimate of the effect. However, if because of publication bias only the significant studies are published, only the severe overestimations of the effect will end up in the literature. If you would calculate an average effect size based on these estimates, you will end up with an overestimation.

Studies with high power do not have this problem. Their effect size estimates are much more precise: they will be centered more closely on the true effect size. Even when there is publication bias, and only the significant (maybe slightly overestimated) effects are published, the distortion would not be as large as with underpowered, noisier studies.

Now consider again a replication scenario such as the one mentioned above. In the literature you come across a large original study and a smaller replication study. Assuming that both studies are affected by publication bias, the original study will probably have a somewhat overestimated effect size. However, since the replication study is smaller and has lower power, it will contain an effect size that is even more overestimated. Combining the information of these two studies then basically comes down to adding bias to the effect size estimate of the original study. In this scenario you would obtain a more accurate estimate of the effect if you evaluated only the original study and ignored the replication study.

In short: even though a replication will increase precision of the effect size estimate (a smaller confidence interval around the effect size estimate), it will add bias if the sample size is smaller than the original study, but only if there is publication bias and the power is not high enough.
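A small simulation can make this concrete. The sketch below is not the analysis from the paper; it rests on several simplifying assumptions chosen for illustration: a true standardized effect of 0.2, a large study with 200 participants per group and a small one with 20, a normal approximation for the observed effect size (standard error roughly sqrt(2/n)), and a publication filter that lets through only significant results in the expected direction.

```python
import random
import statistics

TRUE_D = 0.2       # assumed true standardized effect
N_LARGE = 200      # participants per group, large original study
N_SMALL = 20       # participants per group, small replication


def mean_published_effect(n_per_group, n_studies, rng):
    """Average observed effect among simulated studies that reach significance."""
    se = (2 / n_per_group) ** 0.5          # approximate standard error of d
    published = []
    for _ in range(n_studies):
        d_obs = rng.gauss(TRUE_D, se)      # noisy estimate of the true effect
        if d_obs / se > 1.96:              # publication bias: significant only
            published.append(d_obs)
    return statistics.mean(published)


rng = random.Random(2016)
large = mean_published_effect(N_LARGE, 20000, rng)
small = mean_published_effect(N_SMALL, 20000, rng)

# Weight each study by its sample size (inverse-variance weights under this model).
combined = (N_LARGE * large + N_SMALL * small) / (N_LARGE + N_SMALL)

print(f"true effect:                 {TRUE_D:.2f}")
print(f"published large studies:     {large:.2f}")
print(f"published small studies:     {small:.2f}")
print(f"large + small combined:      {combined:.2f}")
```

Under these assumptions the published large studies overestimate the effect somewhat, the published small studies overestimate it much more, and the combined estimate ends up further from the true effect than the large study alone.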

There are two main solutions to the problem of overestimated effect sizes.

The first solution would be to eliminate publication bias; if there is no selective publishing of significant effects, the whole “replication paradox” would disappear. One way to eliminate publication bias is to preregister your research plan and hypotheses before collecting the data. Some journals will even review this preregistration, and can give you an “in principle acceptance” – completely independent of the results. In this case, studies with significant and non-significant findings have an equal probability of being published, and published effect sizes will not be systematically overestimated. Another way is for journals to commit to publishing replication results independent of whether the results are significant. Indeed, this is the stated replication policy of some journals already.

The second solution is to only evaluate (and perform) studies with high power. If a study has high power, the effect size estimate will be estimated more precisely and less affected by publication bias. Roughly speaking: if you discard all studies with low power, your effect size estimate will be more accurate.
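To get a rough sense of what “high power” means in practice, power can be approximated for a simple two-group comparison. The sketch below uses a normal approximation to the two-sided test and the same approximate standard error of sqrt(2/n) for a standardized mean difference. The function name and the example numbers are assumptions for illustration, and a dedicated power analysis tool will give slightly different values.

```python
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group test for standardized effect d."""
    z = NormalDist()
    se = (2 / n_per_group) ** 0.5
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d / se
    # Probability of landing in either rejection region when the true effect is d.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


for n in (20, 64, 200):
    print(n, round(approx_power(0.5, n), 2))
```

For a medium effect (d = 0.5), 20 participants per group gives power well below one half, while about 64 per group gets close to the conventional 80%.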

A good example of an initiative that implements both solutions is the recently published Reproducibility Project, in which 100 psychological effects were replicated in studies that were preregistered and high powered. Initiatives such as this one eliminate systematic bias in the literature and advance the scientific system immensely.

However, before preregistered, highly powered replications are the new standard, researchers that want to play it safe should change their intuition from “the more information, the higher the accuracy,” to “the more power, the higher the accuracy.”

This blog is based on the paper “The replication paradox: Combining studies can decrease the accuracy of effect size estimate” by Nuijten, van Assen, Veldkamp, & Wicherts (2015). Review of General Psychology, 19(2), 172-182.

Literature on How Replications Suffer From Publication Bias

Francis, G. (2012). Publication bias and the failure of replication in experimental psychology. Psychonomic Bulletin & Review, 19(6), 975-991.
Ferguson, C. J., & Brannick, M. T. (2012). Publication bias in psychological science: Prevalence, methods for identifying and controlling, and implications for the use of meta-analyses. Psychological Methods, 17, 120-128.

Data sharing not only helps facilitate the process of psychology research, it is also a reflection of rigour

October 24, 2013 by Michele Nuijten

Guest blog for LSE Impact Blog by Jelte Wicherts

Data sharing in scientific psychology has not been particularly successful and it is high time we change that situation. Before I explain how we hope to get rid of the secrecy surrounding research data in my field of psychology, let me explain how I got here.

Ten years ago, I was working on a PhD thesis for which I wanted to submit old and new IQ data from different cohorts to novel psychometric techniques. These techniques would enable us to better understand the remarkable gain in average IQ that has been documented in most western countries over the course of the 20th century. These new analyses had the potential to shed light on why it is that more recent cohorts of test-takers (say, folks born between 1975-1985) scored so much higher on IQ tests than older cohorts (say, baby boomers). In search of useful data from the millions of yearly IQ test administrations, I started emailing psychologists in academia and the test-publishing world. Although my colleagues acknowledged that indeed there must be a lot of data around, most of their data were not in any useful format or could no longer be found.

Image: Raven Matrix IQ test. Credit: Life of Riley [CC-BY-SA-3.0]

After a persistent search I ended up with five useful data sets that had been lying in a nearly destroyed filing cabinet at some library in Belgium, were saved on old floppy disks, were reported as a data table in published articles, or were in a data repository (because data collection had been financed by the Dutch Ministry of Education under the assumption that these data would perhaps be valuable for future use). Our analyses of the available data showed that the gain in average IQ was in part an artefact of testing. So a handful of psychologists back in the 1960s kept their data, which decades later helped show that their rebellious generation was not simply less intelligent than generations X (born 1960-1980) or Y (born 1980-2000). The moral of the story is that often we do not know about all potential uses of the data that we as researchers collect. Keeping the data and sharing them can be scientifically valuable.

Psychologists used to be quite bad at storing and sharing their research data. In 2005, we contacted 141 corresponding authors of papers that had been published in top-ranked psychology journals. In our study, we found that 73% of corresponding authors of papers published 18 months earlier were unable or unwilling to share data upon request. They withheld their data despite the fact that they had signed a form stipulating that they would share data for verification purposes. In a follow-up study, we found that researchers who failed to share data upon request reported more statistical errors and less convincing results than researchers who did share data. In other words, sharing data is a reflection of rigor. We in psychology have learned a hard lesson when it comes to researchers being secretive about their data. Secrecy enables all sorts of problems, including biases in reporting of results, honest errors, and even fraud.

So it is high time that we as psychologists become more open with our research data. For this reason, an international group of researchers from different subfields in psychology and I have established an open access journal, published by Ubiquity Press, that rewards the sharing of psychological research data. The journal is called Journal of Open Psychology Data and in it we publish so-called data papers. Data papers are relatively short, peer-reviewed papers that describe an interesting and potentially useful data set that has been shared with the scientific community in an established data repository.

We aim to publish three types of data papers. First, a data paper in the Journal of Open Psychology Data may describe the data from research that has been published in traditional journals. For instance, our first data paper reports raw data from a study of cohort differences in personality factors over the period 1982-2007, which was previously published in the Journal of Personality and Social Psychology. Second, we seek data papers from unpublished work that may be of interest for future work because they can be submitted to alternative analyses or can be enriched later. Third, we publish papers that report data from replications of earlier findings in the psychological literature. Such replication efforts are often hard to publish in traditional journals, but we consider them to be important for progress. So the Journal of Open Psychology Data helps psychologists to find interesting data sets that can be used for educational purposes (learning of statistical analyses), data sets that can be included in meta-analyses, or data sets that can be submitted to secondary analyses. More information can be found in the editorial I wrote for the first issue.

In order to remain open access, the Journal of Open Psychology Data charges authors a publication fee. But our article processing charge is currently only 25 pounds or 30 euros. So if you are a psychologist and have data lying around that will probably vanish as soon as your new computer arrives, don’t hesitate. Put your data in a safe place in a data repository, download the paper template, describe how the data were collected (and/or where they were previously reported), explain why they are interesting, and submit your data paper to the Journal of Open Psychology Data. We will quickly review your data paper, determine whether the data are interesting and useful, and check the documentation and accessibility of the data. If all is well, you can add a data paper to your resume and let the scientific community know that you have shared your interesting data. Who knows how your data may be used in the future.

This post is part of a wider collection on Open Access Perspectives in the Humanities and Social Sciences (#HSSOA) and is cross-posted at SAGE Connection. We will be featuring new posts from the collection each day leading up to the Open Access Futures in the Humanities and Social Sciences conference on the 24th October, with a full electronic version to be made openly available then.