
Shall we do it Again? Wishful Thinking and the Current State of Credibility in Psychology

March 27, 2024

Around the turn of the millennium, concerns about the replicability of medical and social scientific findings began to grow. Amusingly implausible results, including claims about mind-reading, as well as some clear cases of fraud, pointed to a severe lack of verification and falsification within the social sciences. As for the state of psychology, a large-scale replication study of 100 psychological research papers found replication effect sizes that were, on average, only half the size of the original effects, and that only 39 of the 100 original effects were successfully replicated (Open Science Collaboration, 2015). More than a decade has passed since the term replication crisis was coined, but how much has changed? Have we fixed psychology as a science?

About the Authors

Thomas is a psychology research master’s student at the University of Amsterdam. His research focuses on the socialization processes that contribute to stigma, prejudice, and radical beliefs. He is an advocate for open science and ethical accountability in psychological research, and values cross-cultural collaboration and equitable and inclusive research funding.

Jeroen is a psychology research master’s student and the coordinator of the Student Initiative for Open Science at the University of Amsterdam. Passionate about formalizing psychological theory and open software, he is currently working on evidence accumulation modelling with censored data.


Several recent articles would lead you to believe that we are well on our way. Even at the undergraduate thesis level, simply pre-registering a study promotes rigour, increases good research practices (GRPs), and enhances understanding of open science principles (Pownall et al., 2023). In July 2023, Korbmacher and colleagues published a review of developments in scientific practice since the replication crisis that have led to positive changes in academia at all levels. Discussing, among other things, the effects of incentivising researchers to pre-register their work, engage in thorough statistical assessments, and collaborate with scholars holding opposing views, they conclude by arguing that failed replications are not the only, or even the most important, component of the crisis (Korbmacher et al., 2023). Following that, Protzko et al. (2023) conducted a replication study of 16 recent social-behavioural studies that used four rigour-enhancing practices: confirmatory tests, large sample sizes, pre-registration, and methodological transparency. They found that when these practices were implemented in the original study, replication effect sizes were more consistent with the original effect size (replicated effect sizes were 97% of the original effect size), and 86% of the replications had a significant effect in the hypothesised direction.

Despite the apparent good news, some researchers believe that progress has been overstated. One of the major concerns lies in Protzko’s formula for the replication rate, which critics claim inflates the metric. In a recent publication, Bak-Coleman and Devezer (2023) point out that past replication studies have used the ratio of directly replicated effects to the total number of effects tested, whereas Protzko and colleagues used the proportion of significant tests across four replications of each of 16 different effects. Under this formula, even effects whose replicated sizes differed from the original count towards the replication rate, as long as the replications produced statistically significant results. This is a clear and heavily biased departure from traditional methods of calculating replication rates, and it inevitably inflates the estimate. Reassuringly, Bak-Coleman and Devezer point to another measure found in Protzko’s supplemental materials, the proportion of independent replications that fall within the confirmatory confidence interval, as a superior approach. By this measure, only 71% of Protzko’s replications were successful. Although only two behavioural science replication projects are cited (cf. Camerer et al., 2016, 2018), the authors note that even this metric “has produced some of the highest estimates of replicability”.
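To make the difference between these two scoring rules concrete, here is a minimal sketch in Python with entirely invented outcomes; none of the numbers below come from Protzko et al. or from Bak-Coleman and Devezer. It illustrates how a replication can be statistically significant yet still fall outside the confirmatory confidence interval, so that the pooled significance rate sits well above the CI-coverage rate for the very same set of replications.

```python
# Hypothetical sketch of the two replication metrics discussed above.
# All outcomes are invented for illustration only.

# Four hypothetical effects, each with four independent replications. Every
# replication records whether its test was significant in the hypothesised
# direction ("sig") and whether its estimate fell inside the confirmatory
# study's confidence interval ("in_ci").
replications = [
    # effect 1: replicates cleanly
    [{"sig": True, "in_ci": True}] * 4,
    # effect 2: always significant, but smaller than the original effect,
    # so half of the estimates fall outside the confirmatory CI
    [{"sig": True, "in_ci": True}, {"sig": True, "in_ci": False},
     {"sig": True, "in_ci": True}, {"sig": True, "in_ci": False}],
    # effect 3: mixed results
    [{"sig": True, "in_ci": True}, {"sig": False, "in_ci": False},
     {"sig": True, "in_ci": True}, {"sig": True, "in_ci": False}],
    # effect 4: replicates cleanly
    [{"sig": True, "in_ci": True}] * 4,
]

all_tests = [rep for effect in replications for rep in effect]

# Rate reported in the main text (as characterised by Bak-Coleman & Devezer):
# the proportion of ALL replication tests that were significant, pooled
# across effects.
pooled_significance = sum(rep["sig"] for rep in all_tests) / len(all_tests)

# Rate from the supplemental materials: the proportion of replications whose
# estimate falls within the confirmatory confidence interval.
ci_coverage = sum(rep["in_ci"] for rep in all_tests) / len(all_tests)

print(f"Pooled significance rate: {pooled_significance:.0%}")  # 94%
print(f"CI-coverage rate:         {ci_coverage:.0%}")          # 75%
```

On these made-up numbers, 15 of the 16 replication tests are significant (94%), yet only 12 of the 16 estimates land inside the confirmatory interval (75%): the pooled rate counts effect 2 as a full success even though its replicated size drifted away from the original, which is exactly the kind of inflation the critique describes.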

“Despite the apparent good news, some researchers believe that progress has been overstated.”

Yet another possible issue with Protzko’s paper stems from one of the hallmark rigour-enhancing methods championed by virtually all academic reformers, including many of the paper’s authors: pre-registration. In this case, the critique lodged against Protzko’s pre-registration is that the paper’s core claim (i.e., rigour-enhancing methodology improves replicability) was not a pre-registered hypothesis, and thus is indistinguishable from selective reporting and hypothesising after the results are known (HARKing; Bak-Coleman & Devezer, 2023). Indeed, as of March 13th, 2024, the Open Science Framework (OSF) repository for Protzko’s paper included a pre-registration and analysis plan that described how replications would be analysed, but we were unable to find a pre-registration for the causal claims of the paper.

Given these apparent issues, it is perhaps unsurprising that Protzko’s paper has generated enough buzz that it is now under editorial investigation. For some, the controversy has even created a malaise towards certain open science practices. In a blogpost, Associate Prof. Jessica Hullman points out the irony of some of the most vocal proponents of open science apparently forgetting to pre-register their own paper. Hullman humorously questions the credibility of papers that have not been pre-registered, and seems to suggest that expecting the same result every time we run the same study does not make much sense. In an article published by Northwestern University’s Institute for Policy Research, she says: “There may be subtle differences in the sample of people that explain the difference, which doesn’t necessarily mean there isn’t an effect—it’s just not as simple as estimating an average effect on one sample and expecting it to transfer outside of that particular set of experimental participants.” This questioning of a paradigm that emphasises the importance of replicability is not a novel phenomenon, yet it harbours rather dangerous attitudes about the credibility of social scientific research.

Perhaps more constructively, Prof. Andrew Gelman suggests that while the practices described by Protzko are “fine”, five other practices matter more for increasing scientific rigour: (1) full, clear descriptions of methods, (2) increasing effect sizes, (3) focusing studies on the most appropriate populations and contexts, (4) improving outcome measurements, and (5) improving baseline measurements. Debating whether these practices are better or more effective does not seem a beneficial use of time, but employing them alongside other GRPs can only help improve the rigour of psychological research. In fact, Brian Nosek, a leader in the replication movement and a co-author of the Protzko manuscript, agrees with Gelman’s suggestions, while pointing out that some of them can only be effective with a more robust foundation of evidence.

“Rather than a bygone moment accomplished by heroic figures, reforming psychological research is a continuous effort to be undertaken by the entire research community”

If the shortcomings of Protzko’s study are innocuous, they highlight how human error will always remain a risk, even among prolific researchers. If instead certain analytical methods were prioritised because their results were flashier and more exciting, it serves as a reminder that the heralded credibility revolution is an ongoing process, and that the dangers of the replication crisis remain a reality. Rather than a bygone moment accomplished by heroic figures, reforming psychological research is a continuous effort to be undertaken by the entire research community, and it will require vigilance, honesty, and humility to come to fruition.

In spite of these flaws, there is no reason to think that psychological research should abandon rigour-enhancing practices. If anything, persisting issues point to the need to continue employing and refining these practices, and perhaps pursue a higher level of rigour.

If the psychological research community wants the credibility movement to succeed, and rigour-enhancing methods to contribute to it, then reform-minded senior researchers need to put their money where their mouth is. At this point, most early-career researchers are well aware of the importance of GRPs. At the same time, we are shown that there are, apparently, exceptions to these rules. While such lapses do not invalidate the inherent value of GRPs, senior academics should be careful not to harm the credibility of the credibility revolution. This might mean a bit of extra time and effort to teach and model best practices, but it is also the best path forward. For change to occur at the institutional and field levels, those with the most sway need to use their influence in a meaningful way. In the discourse surrounding transparency and credibility, actions speak louder than words.

Student Initiative for Open Science

This article has been written as part of an ongoing collaborative project with the Student Initiative for Open Science (SIOS). The Amsterdam-based initiative is focused on educating undergraduate- and graduate-level students about good research practices.

SIOS is also looking for new members – if the article spoke to you and you would like to be part of the process of implementing the open science framework in the academic world, join today!

References

Gelman, A. (2023, November 19). Brian Nosek on “Here are some ways of making your study replicable”. Statistical Modeling, Causal Inference, and Social Science. https://statmodeling.stat.columbia.edu/2023/11/19/brian-nosek-on-here-are-some-ways-of-making-your-study-replicable/
‘An existential crisis’ for science. (n.d.). Institute for Policy Research, Northwestern University. https://www.ipr.northwestern.edu/news/2024/an-existential-crisis-for-science.html
Bak-Coleman, J. B., & Devezer, B. (2023, November 20). Causal claims about scientific rigor require rigorous causal evidence. https://doi.org/10.31234/osf.io/5u3kj
Camerer, C. F., Dreber, A., Forsell, E., Ho, T.-H., Huber, J., Johannesson, M., Kirchler, M., Almenberg, J., Altmejd, A., Chan, T., Heikensten, E., Holzmeister, F., Imai, T., Isaksson, S., Nave, G., Pfeiffer, T., Razen, M., & Wu, H. (2016). Evaluating replicability of laboratory experiments in economics. Science, 351(6280), 1433–1436. https://doi.org/10.1126/science.aaf0918
Camerer, C. F., Dreber, A., Holzmeister, F., Ho, T.-H., Huber, J., Johannesson, M., Kirchler, M., Nave, G., Nosek, B. A., Pfeiffer, T., Altmejd, A., Buttrick, N., Chan, T., Chen, Y., Forsell, E., Gampa, A., Heikensten, E., Hummer, L., Imai, T., … Wu, H. (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour, 2(9), 637–644. https://doi.org/10.1038/s41562-018-0399-z
Hullman, J. (2023, November 21). Of course its preregistered. Just give me a sec. Statistical Modeling, Causal Inference, and Social Science. https://statmodeling.stat.columbia.edu/2023/11/21/of-course-its-preregistered-just-give-me-a-sec/
Korbmacher, M., Azevedo, F., Pennington, C. R., Hartmann, H., Pownall, M., Schmidt, K., Elsherif, M., Breznau, N., Robertson, O., Kalandadze, T., Yu, S., Baker, B. J., O’Mahony, A., Olsnes, J. Ø.-S., Shaw, J. J., Gjoneska, B., Yamada, Y., Röer, J. P., Murphy, J., … Evans, T. (2023). The replication crisis has led to positive structural, procedural, and community changes. Communications Psychology, 1(1), 3. https://doi.org/10.1038/s44271-023-00003-2
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716
Pownall, M., Talbot, C. V., Kilby, L., & Branney, P. (2023). Opportunities, challenges and tensions: Open science through a lens of qualitative social psychology. British Journal of Social Psychology, 62(4), 1581–1589. https://doi.org/10.1111/bjso.12628
Protzko, J., Krosnick, J., Nelson, L., Nosek, B. A., Axt, J., Berent, M., Buttrick, N., DeBell, M., Ebersole, C. R., Lundmark, S., MacInnis, B., O’Donnell, M., Perfecto, H., Pustejovsky, J. E., Roeder, S. S., Walleczek, J., & Schooler, J. W. (2023). High replicability of newly discovered social-behavioural findings is achievable. Nature Human Behaviour, 8(2), 311–319. https://doi.org/10.1038/s41562-023-01749-9
