The Measurement Crisis: A Hidden Flaw in Psychology
This blog post was written by Iris Willigers. Iris is a PhD student in our meta-research group and started her PhD in September 2024. During her PhD, she will be working on Jelte’s Vici project: Examining the Variation in Causal Effects in Psychology.
In August 2015, one of the best-known papers in psychology was published: “Estimating the reproducibility of psychological science” by the Open Science Collaboration (1). The paper reported that, of 100 studies selected from top psychology journals, only 36% of the replication attempts yielded statistically significant results. It was part of the Reproducibility Project, a collaboration of numerous researchers aiming to estimate the reproducibility of published scientific findings (2). Up to now, suggestions to improve the reproducibility of psychological science have mostly focused on the correct use and reporting of methods and statistics (3,4). However, even when methods and statistics are used and reported correctly, an invalid operationalization of the measured construct can make the conclusions of a study invalid and unreliable. A threat to the reproducibility of psychology is therefore a related but less-discussed crisis: the Measurement Crisis (5).
Psychology relies heavily on the operationalization of abstract constructs. Operationalization (6) is the process of translating abstract constructs (e.g., anxiety) into observable, measurable variables (e.g., scores on the Beck Anxiety Inventory). However, the time and thought this process requires are often underestimated. When the operationalization is poor, the observed variables lack construct validity: the ability of a measurement instrument to measure the construct it is supposed to measure (7). An example of poor construct validity is asking the question “Have you played tennis before?” with the goal of measuring implicit social cognition. In this example, I think we can agree that the face validity (5) of the operationalization of implicit social cognition is poor, as the question has nothing to do with an attitude towards anything or anyone. However, it becomes less clear when I tell you that I used the Implicit Association Test (IAT) to measure implicit social cognition in my study. In that case, we would need to collect evidence for all components of construct validity to decide whether the operationalization was successful. Examples of such evidence are conceptualizing the construct (8) (substantive component), investigating Cronbach’s alpha (9) (structural component), or checking correlations with other scales that measure the same and different constructs (9) (external component).
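For readers less familiar with the structural component mentioned above: Cronbach’s alpha summarizes how consistently a set of items hangs together as a measure of one construct. Below is a minimal sketch, in Python with toy, entirely hypothetical data, of how alpha is computed from an item-score matrix; it is meant purely as an illustration of the coefficient, not as an endorsement of alpha over other reliability estimates.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents x n_items) matrix of item scores."""
    items = np.asarray(items, dtype=float)
    n_items = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)        # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)    # variance of the sum score
    return n_items / (n_items - 1) * (1 - item_variances.sum() / total_variance)

# Toy data: 5 respondents answering 3 anxiety items on a 1-5 scale
scores = [[1, 2, 1],
          [3, 3, 4],
          [2, 2, 2],
          [5, 4, 5],
          [4, 4, 3]]
print(round(cronbach_alpha(scores), 2))
```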
Let’s look at the example I mentioned before, the Implicit Association Test (IAT). The IAT (10) aims to measure implicit social cognition (often attitudes) by having participants categorize pictures and words under two different pairing conditions. If researchers want to measure attitudes towards sexuality with this test, they ask participants to sort photos of heterosexual and gay couples on a computer together with ‘good’ or ‘bad’ words, while their reaction times are measured. An example of the IAT for attitudes towards race can be seen in Figure 1. To derive participants’ implicit attitudes, the assumption is that a participant responds faster when the paired categories are more strongly associated (a simplified scoring sketch follows Figure 1).
Figure 1
An example of the two pairing conditions of the Implicit Association Test (IAT).
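To make the scoring logic concrete, here is a minimal sketch of how reaction times from the two pairing conditions can be turned into a single implicit-attitude score. The data and function are hypothetical, and this is a deliberately simplified version of the commonly used D-score approach; real IAT scoring involves additional steps such as trial exclusions and error penalties.

```python
import numpy as np

def iat_effect(rt_congruent, rt_incongruent):
    """Simplified IAT effect: mean latency difference between the two pairing
    conditions, scaled by the standard deviation of all latencies."""
    rt_congruent = np.asarray(rt_congruent, dtype=float)
    rt_incongruent = np.asarray(rt_incongruent, dtype=float)
    mean_diff = rt_incongruent.mean() - rt_congruent.mean()
    all_sd = np.concatenate([rt_congruent, rt_incongruent]).std(ddof=1)
    return mean_diff / all_sd

# Hypothetical reaction times (in seconds) of one participant per pairing condition
congruent = [0.61, 0.72, 0.58, 0.66, 0.70]
incongruent = [0.83, 0.95, 0.77, 0.88, 0.91]
print(round(iat_effect(congruent, incongruent), 2))  # larger = faster in the congruent pairing
```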
Although the face validity of the IAT looks acceptable, several studies have reported conflicting findings on its construct validity. The first problem is that it remains unclear in the literature what exactly the IAT measures (12). Four possibilities have been proposed, ranging from the IAT capturing implicit attitudes that cannot be measured with explicit measures, to the test not being a valid measure at all because there is no stable attribute to capture (13).
Another reason that the construct validity of the IAT is unclear is that its reliability depends on the type of reliability assessed. The test-retest reliability is moderate (r = .50), whereas the internal consistency is high (alpha = .80) (14). If the goal of the IAT is to measure one specific attitude that remains consistent over time, this reliability evidence is not sufficient to support its construct validity.
Additionally, we cannot draw a conclusion about the convergent and discriminant validity of the IAT (15) following the approach of Campbell and Fiske (9). Convergent validity means that a measure correlates highly with other measures designed to assess the same or a theoretically similar construct, whereas discriminant validity means that it correlates only weakly to moderately with measures designed to assess different constructs. To provide evidence for discriminant validity, convergent validity first needs to be well established: logically, one should demonstrate that the measure captures the intended construct before one can show that it is distinct from other constructs. Thus, to provide evidence for the convergent and discriminant validity of the IAT, it should be clear whether the IAT measures implicit social cognition, explicit social cognition, or something in between. This brings us back to the first problem we discussed: the conceptualization of what the IAT measures.
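In practice, this style of check comes down to comparing correlations: a measure should correlate more strongly with other measures of the same construct (convergent validity) than with measures of different constructs (discriminant validity). The sketch below illustrates that logic with simulated, entirely hypothetical data; it is not a full multitrait-multimethod analysis.

```python
import numpy as np

# Simulated, hypothetical scores for 100 participants on three measures:
# two intended to capture the same construct and one intended to capture another.
rng = np.random.default_rng(0)
implicit_a = rng.normal(size=100)                              # measure A of construct 1
implicit_b = 0.7 * implicit_a + 0.3 * rng.normal(size=100)     # measure B of construct 1
explicit = 0.2 * implicit_a + 0.8 * rng.normal(size=100)       # measure of construct 2

convergent = np.corrcoef(implicit_a, implicit_b)[0, 1]    # expected to be high
discriminant = np.corrcoef(implicit_a, explicit)[0, 1]    # expected to be low to moderate
print(f"convergent r = {convergent:.2f}, discriminant r = {discriminant:.2f}")
```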
Our example of the construct validity of the IAT illustrates how difficult it is to determine whether a measure is valid. Even though the construct validity of the IAT is unclear, the test is still used to measure and study implicit social cognition. Currently, it has been cited 18,012 times (as of 13 December 2024). But how can you blame the people citing this article when there is so much conflicting literature to keep up with?
Although operationalization and construct validity are essential to establishing the robustness of study findings, scientific manuscripts often do not contain sufficient information to validate the measured construct(s). Several studies have reported a lack of construct validity evidence in manuscripts on general psychology (16), educational behavior (17), emotion (18), and social cognitive ability (19). Studies of reporting practices show that researchers often invoke the reliability and validity evidence of previous studies without testing it in their own sample (17, 18). Yet reliability is a characteristic of how a test functions within a particular sample, not a property of the test alone (22). Current reporting practices also show that researchers assume previous studies’ reliability and validity even when they have modified the test by adding or deleting questions (18). Absent or incomplete reporting of the validity and reliability of measurement instruments can lead authors to over-rely on their own reported results. It can also mislead readers, as the reported conclusions cannot be evaluated against the reported measurement information.
Part of the measurement problem in psychology is the lack of standardization of measurement instruments. An example comes from Weidman et al., who examined the current state of emotion assessment and found that only 8.4% of 356 measurement instruments were taken from an existing scale without modification for the paper at hand (18). In the same study, 69% of the measurement instruments had been developed without systematic scale development or reference to earlier literature.
The problem with this unstandardized way of measuring is that different measurements of the same abstract construct can yield different conclusions (23). This also has implications for evaluating and comparing the parts of a scientific theory that are tested across the literature. For example, consider two constructs, academic success and physical activity, each of which can be operationalized in multiple ways. Academic success can be operationalized as someone’s actual grade point average (GPA) or as their self-reported GPA; people tend to overreport their academic success compared to their actual GPA (24,25). Physical activity can be operationalized as the number of minutes of activity per day measured with an accelerometer over 5 days, or as self-reported physical activity in minutes per week. The correlation between academic success and physical activity has been studied using these different operationalizations. One study used actual GPA and accelerometer data and found a strong correlation of 0.87 (N = 20) (26). Another study used the self-reported operationalizations of academic success and physical activity and found a correlation of -0.12 (N = 104) (27). Although different operationalizations are probably not the only reason for the large variation between these correlations (e.g., both sample sizes were small), they are an important part of it. The example makes clear that we need standardized tests and other measures to build strong psychological theories (28,29).
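To see how the choice of operationalization alone can move a correlation around, consider a small simulation (all numbers hypothetical): the same underlying relationship between two constructs comes out noticeably weaker when both are measured with noisy self-reports than when both are measured precisely. This only illustrates how measurement error attenuates correlations; it does not reproduce the specific designs of the two cited studies.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Hypothetical "true" construct values with a modest underlying relationship
activity_true = rng.normal(size=n)
gpa_true = 0.6 * activity_true + rng.normal(size=n)

# Operationalization 1: relatively precise measures (accelerometer, registrar GPA)
activity_measured = activity_true + 0.2 * rng.normal(size=n)
gpa_measured = gpa_true + 0.2 * rng.normal(size=n)

# Operationalization 2: noisy self-reports; the constant reflects overreporting
# of GPA (it shifts the mean but does not itself change the correlation)
activity_selfreport = activity_true + 1.5 * rng.normal(size=n)
gpa_selfreport = gpa_true + 0.5 + 1.5 * rng.normal(size=n)

r_measured = np.corrcoef(activity_measured, gpa_measured)[0, 1]
r_selfreport = np.corrcoef(activity_selfreport, gpa_selfreport)[0, 1]
print(f"precise operationalizations:     r = {r_measured:.2f}")
print(f"self-report operationalizations: r = {r_selfreport:.2f}")
```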
How can we overcome the measurement crisis? I think it starts with prioritizing standardized measures to operationalize variables and with transparently reporting evidence for their construct validity. Furthermore, when feasible in terms of time and costs, researchers should provide evidence of construct validity within their specific sample to enhance the credibility of their findings. When designing a new measure to operationalize variables, it is essential to develop it systematically and assess its validity. The process of instrument development should be reported transparently, enabling readers to critically evaluate the validity of the study. A clear overview of how to avoid so-called ‘Questionable Measurement Practices’ can be found in the paper by Flake and Fried (30), who provide a list of questions to consider when thinking about measurement. By prioritizing sound measurement practices, we can strengthen the robustness of psychological theory.
References
Open Science Collaboration. Estimating the reproducibility of psychological science. Science. 2015 Aug 28;349(6251):aac4716.
Open Science Collaboration. An Open, Large-Scale, Collaborative Effort to Estimate the Reproducibility of Psychological Science. Perspect Psychol Sci. 2012 Nov 1;7(6):657–60.
Hales AH, Wesselmann ED, Hilgard J. Improving Psychological Science through Transparency and Openness: An Overview. Perspect Behav Sci. 2019 Mar 1;42(1):13–31.
Asendorpf JB, Conner M, De Fruyt F, De Houwer J, Denissen JJA, Fiedler K, et al. Recommendations for Increasing Replicability in Psychology. Eur J Personal. 2013 Mar 1;27(2):108–19.
Devine S. The Four Horsemen of the Crisis in Psychological Science [Internet]. Trial and Error. 2020 [cited 2025 Jul 1]. Available from: https://blog.trialanderror.org/the-four-horsemen-of-the-crisis-in-psychological-science
Jhangiani RS, Chiang IA, Cuttler C, Leighton DC. Research Methods in Psychology [Internet]. 4th ed. Surrey, B.C.: Kwantlen Polytechnic University; 2019. Available from: https://doi.org/10.17605/OSF.IO/HF7DQ
Cronbach LJ, Meehl PE. Construct validity in psychological tests. Psychol Bull. 1955;52(4):281–302.
Gehlbach H, Brinkworth ME. Measure twice, cut down error: A process for enhancing the validity of survey scales. Rev Gen Psychol. 2011;15(4):380–7.
Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika. 1951;16:297–334.
Greenwald AG, McGhee DE, Schwartz JL. Measuring individual differences in implicit cognition: the implicit association test. J Pers Soc Psychol. 1998;74(6):1464–80.
Dawson N, Arkes H. Implicit Bias Among Physicians. J Gen Intern Med. 2008 Nov 1;24:137–40.
Schimmack U. The Implicit Association Test: A Method in Search of a Construct. Perspect Psychol Sci. 2021 Mar 1;16(2):396–414.
Payne BK, Vuletich HA, Lundberg KB. The Bias of Crowds: How Implicit Bias Bridges Personal and Systemic Prejudice. Psychol Inq. 2017 Oct 2;28(4):233–48.
Greenwald AG, Lai CK. Implicit Social Cognition. Annu Rev Psychol. 2020 Jan;71:419–45.
Epifania OM, Anselmi P, Robusto E. Implicit social cognition through the years: The Implicit Association Test at age 21. Psychol Conscious Theory Res Pract. 2022;9(3):201–17.
Maassen E, D’Urso D, van Assen MALM, Nuijten M, De Roover K, Wicherts J. The Dire Disregard of Measurement Invariance Testing in Psychological Science. Psychol Methods. 2023 Dec 25.
Barry AE, Chaney B, Piazza-Gardner AK, Chavarria EA. Validity and Reliability Reporting Practices in the Field of Health Education and Behavior: A Review of Seven Journals. Health Educ Behav. 2014 Feb 1;41(1):12–8.
Weidman AC, Steckler CM, Tracy JL. The jingle and jangle of emotion assessment: Imprecise measurement, casual scale usage, and conceptual fuzziness in emotion research. Emotion. 2017;17(2):267–95.
Higgins WC, Kaplan DM, Deschrijver E, Ross RM. Construct validity evidence reporting practices for the Reading the Mind in the Eyes Test: A systematic scoping review. Clin Psychol Rev. 2024 Mar 1;108:102378.
Barry AE, Chaney B, Piazza-Gardner AK, Chavarria EA. Validity and Reliability Reporting Practices in the Field of Health Education and Behavior: A Review of Seven Journals. Health Educ Behav. 2014 Feb 1;41(1):12–8.
Slaney KL, Tkatchouk M, Gabriel SM, Maraun MD. Psychometric assessment and reporting practices: Incongruence between theory and practice. J Psychoeduc Assess. 2009;27(6):465–76.
Revelle W. Chapter 7: Classical Test Theory and the Measurement of Reliability. In: An Introduction to Psychometric Theory with Applications in R [Internet]. Springer; Available from: http://personality-project.org/r/book
Breznau N, Rinke EM, Wuttke A, Nguyen HHV, Adem M, Adriaans J, et al. Observing many researchers using the same data and hypothesis reveals a hidden universe of uncertainty. Proc Natl Acad Sci. 2022 Nov 1;119(44):e2203150119.
Kuncel NR, Credé M, Thomas LL. The Validity of Self-Reported Grade Point Averages, Class Ranks, and Test Scores: A Meta-Analysis and Review of the Literature. Rev Educ Res. 2005;75(1):63–82.
Rosen JA, Porter SR, Rogers J. Understanding Student Self-Reports of Academic Performance and Course-Taking Behavior. AERA Open. 2017 May 1;3(2):2332858417711427.
Ðurić S, Bogataj Š, Zovko V, Sember V. Associations Between Physical Fitness, Objectively Measured Physical Activity and Academic Performance. Front Public Health [Internet]. 2021;9. Available from: https://www.frontiersin.org/journals/public-health/articles/10.3389/fpubh.2021.778837
Gonzalez EC, Hernandez EC, Coltrane AK, Mancera JM. The Correlation between Physical Activity and Grade Point Average for Health Science Graduate Students. OTJR Occup Ther J Res. 2014 Jun 1;34(3):160–7.
Goodhew SC, Dawel A, Edwards M. Standardizing measurement in psychological studies: On why one second has different value in a sprint versus a marathon. Behav Res Methods. 2020 Dec 1;52(6):2338–48.
Loevinger J. Objective Tests as Instruments of Psychological Theory. Psychol Rep. 1957 Jun 1;3(3):635–94.
Flake JK, Fried EI. Measurement Schmeasurement: Questionable Measurement Practices and How to Avoid Them. Adv Methods Pract Psychol Sci. 2020 Dec 1;3(4):456–65.