Critical analysis of the general methods, procedures, and forms of doing science, as well as the identification of fraudulent and questionable conduct in research, is not something new (Barber, 1976). However, in the last 12 years the detailed review of scientific work has been extended to other fields of knowledge and other research designs and, particularly in psychology and related sciences, it has been resumed with great force (Chin et al., 2023). These ideas have been discussed from the perspective of the so-called replicability crisis (Nosek et al., 2022; Simmons et al., 2011). Revisions and proposals of viable explanations and solutions are still being produced; yet the scientific literature, and in particular meta-scientific studies (i.e., studies on the way science is done), has been offering concrete recommendations regarding questionable and responsible research conduct.
One of the pending objectives is to adjust these suggestions to the specific contexts of each discipline and subdiscipline within psychological science (Chin et al., 2023; Kirtley et al., 2022; Tackett et al., 2017). Correctly identifying questionable research conduct and disseminating responsible research conduct adapted to each specific area of research is essential in order to achieve more generalized knowledge and adherence within the scientific community (Bosma & Granger, 2022; Waldman & Lilienfeld, 2016; Steneck, 2006). Some of these proposals have recently been published, with examples in the field of clinical-psychological assessment (Tackett et al., 2019) and in the context of more general measurement practices (Flake & Fried, 2020; Lilienfeld & Strother, 2020). Although these relevant antecedents have undoubtedly helped to identify questionable research conduct and promote responsible research practice in the field of psychological assessment and measurement use in general, here we want to focus on a set of questionable research conducts in psychometrics that we believe merit further attention: fit-hacking, model-HARKing, and emphasizing new (measurement or estimation) models. We believe that the focus on these questionable research conducts in psychometric studies is relevant because the more general validity of our research results depends on the validity of the interpretations we can make of our measurements (Flake et al., 2022; Lilienfeld & Strother, 2020). In this sense, the focus on measurement is fundamental, since it is at the base of scientific progress and of the (valid) interpretation of research results (Clark & Watson, 2019; Flake et al., 2017; Flake & Fried, 2020). All of this has consequences for the applied field: If the evidence behind theories is not based on properly validated measurements, those theories cannot be correctly translated into practical applications (Bosma & Granger, 2022; Lewis, 2021). In addition to focusing on questionable research conduct in psychometrics, the aim of this paper is to provide resources that enable psychometric researchers to protect themselves against questionable research conduct, with a focus on practices related to transparency and the open science framework. All in all, we believe that in the context of psychometric studies it is important to continue to identify questionable research conduct that is specific to this type of research, and that it is also necessary to adapt the recommendations on responsible research conduct to this area.
Summarizing, the purpose of this paper is twofold: (a) To identify questionable research conduct specifically linked to psychometric studies; and (b) to promote greater awareness and widespread application of responsible research conduct in psychometrics research. To this end, we will first develop the more general concepts of research conduct and associated variables. We will then focus specifically on the identification of questionable research conduct in psychometrics, differentiating between questionable conduct linked to practices and questionable conduct linked to reporting. Finally, we will address the topic of transparency practices and the use of the open science framework as inherent actions of responsible research conduct, focusing here also on their applicability and relevance to psychometrics.
Research Conduct in Psychometrics
Behavior in psychometrics research can be analyzed based on more general models of research conduct. Steneck (2006) proposes one such model, pointing out that research conduct can be understood as a continuum: From the ideal conduct, Responsible Research Conduct (RRC), to the worst conduct, characterized by practices of Fabrication, Falsification and Plagiarism (FFP). Questionable Research Conduct (QRC) falls in the middle of this continuum. It is also important to differentiate research practices from reporting practices (Manapat et al., 2022; Munafò et al., 2017). Research and reporting practices have a direct impact on the reliability of science, that is, on the replicability, robustness, and reproducibility of scientific findings. While FFP describes unequivocal, easily documented actions deserving severe sanctions (Steneck, 2006), QRC tend to be more difficult to define (i.e., they are not unequivocal), occur more frequently (Munafò et al., 2017), and are more difficult to identify as bad practices by the researchers and institutions involved (i.e., researchers and institutions disagree on whether these practices are actually harmful, or engage in QRC without being aware of their deleterious effects). QRC is a general term referring to the misuse or non-optimal use of methodological-statistical procedures, from which non-robust (e.g., overfitted), invalid, and biased results are more likely to be obtained (Antonakis, 2017; Munafò et al., 2017; Nelson et al., 2018; Waldman & Lilienfeld, 2016). Most of the literature on QRC also highlights the problem of flexibility, or researcher degrees of freedom. In the context of QRC, the researcher faces a “garden of forking paths” (Gelman & Loken, 2014) of decisions about the method to be applied at each step, which can be exploited (intentionally or not) to achieve the desired results. Thus, QRC “often involve hidden research decisions” (Chin et al., 2023), and these practices distort the accuracy of research reports when they are not reported transparently. Moreover, “Such practices produce biases because undisclosed flexibility … allows researchers to selectively under- or over-fit models and exploit noise in a way that goes uncorrected…” (Chin et al., 2023, p. 3). In this sense, QRC can impact the reliability and validity of scientific research in a large and widespread manner (Chin et al., 2023), and this impact can be even greater than that of FFP (Munafò et al., 2017).
This is why the international literature has focused primarily on identifying questionable research conduct and promoting responsible research conduct that serves as a preventive vaccine against the adverse effects of QRC (Chin et al., 2023).
Examples of very widespread QRC, which have been the focus of study internationally, are the following:
1. Making multiple comparisons by, for example, excluding atypical cases, including covariates, or adding more cases to the sample, with the aim of finding statistically significant results (i.e., p-hacking; Nelson et al., 2018); see the simulation sketch after this list.
2. Generating hypotheses and theoretical explanations from the results obtained and presenting these explanations and theories as if they had been proposed prior to data collection; in other words, presenting exploratory research as confirmatory (i.e., HARKing; Munafò et al., 2017).
3. Emphasizing new and statistically significant results (i.e., selectively reporting positive results) while not mentioning results that have not reached statistical significance (i.e., negative results) (Antonakis, 2017).
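To make the first of these behaviors concrete, the following minimal Python simulation (an illustration of our own, not taken from the cited sources; the sample sizes and stopping rule are arbitrary) shows how repeatedly adding cases and re-testing whenever the result is not yet significant inflates the false-positive rate well above the nominal 5% level.

```python
# Illustrative simulation: optional stopping ("adding more cases") as p-hacking.
# Both groups come from the same population, so every "significant" result is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2023)

def phacked_test(n_start=20, n_max=100, step=10, alpha=.05):
    """Test after n_start cases per group; if p >= alpha, add `step` more cases and re-test."""
    a = list(rng.normal(size=n_start))
    b = list(rng.normal(size=n_start))
    while True:
        p = stats.ttest_ind(a, b).pvalue
        if p < alpha or len(a) >= n_max:
            return p < alpha
        a.extend(rng.normal(size=step))
        b.extend(rng.normal(size=step))

n_sims = 2000
false_positives = sum(phacked_test() for _ in range(n_sims))
print(f"Nominal alpha: .05 | observed false-positive rate: {false_positives / n_sims:.3f}")
# The observed rate typically ends up well above the nominal level (often more than double it).
```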
In the following, we will focus on QRC specific to psychometric studies and offer recommendations for promoting RRC in this area. Based on the taxonomy previously presented, we will separate QRC linked to practices from those linked to reporting.
Questionable Research Conduct in Psychometrics (QRCΨmetrics)
In psychometrics we also have at our disposal a “garden of forking paths” that can be exploited in our favor to achieve the desired results. Let us look at some of these “tricks” in more detail.
QRC: Practice-Related Research (QRC-PΨmetrics)
We consider it important to highlight the following QRC-PΨmetrics, which we refer to as follows:
Fit-hacking: Using different types of strategies in model specification and estimation with the aim of finding an acceptable or optimal fit. In this context, the publication of over-adjusted models (i.e., overfitting) is a very common practice. One example is the specification, within confirmatory factor analyses, of measurement models that incorporate error covariances (correlations between errors), as many as necessary to exceed the cut-off points for model fit (Flores-Kanter et al., 2021). Another example is the application of unjustified and inappropriately complex models, such as bifactor confirmatory models, which facilitate obtaining acceptable or optimal fit indicators (Flores-Kanter et al., 2022; Haywood et al., 2021; Reise et al., 2016). Studies of the Positive and Negative Affect Schedule (PANAS) serve as a good example of both of these QRC-Ψmetrics. In the case of incorporating error covariances, most studies on the PANAS have implemented this strategy to reach the minimum cut-off points for considering the fit of the measurement model acceptable (e.g., CFI > .90) (Flores-Kanter et al., 2022). In the case of bifactor models, mathematically equivalent to specifying covariances between all pairs of errors (Matsunaga, 2008), researchers have found a model that hardly ever presents indicators of poor fit (Flores-Kanter et al., 2018). Other good examples of the misuse of bifactor models can be found in psychometric analyses of measures of depression (Heinrich et al., 2018) and psychopathology (Bonifay et al., 2016). These fit-hacking practices undermine the external validity and reliability of the findings and are very similar to the behavior described as p-hacking in other disciplinary contexts.
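The following sketch illustrates the logic of fit-hacking in miniature. The chi-square values are invented for the example, and the loop stands in for the common routine of freeing error covariances suggested by modification indices until the CFI crosses a conventional cut-off; only the standard CFI formula is assumed.

```python
# Illustrative sketch of fit-hacking: freeing error covariances one at a time
# until CFI crosses a cut-off. All chi-square values below are invented; only
# the CFI formula and the logic of the loop matter.

def cfi(chi2_model, df_model, chi2_baseline, df_baseline):
    """Comparative fit index from target- and baseline-model chi-square values."""
    num = max(chi2_model - df_model, 0.0)
    den = max(chi2_baseline - df_baseline, chi2_model - df_model, 0.0)
    return 1.0 - num / den

chi2_baseline, df_baseline = 2400.0, 190.0   # independence model (illustrative, 20 items)
chi2, df = 520.0, 169.0                      # initial two-factor PANAS-like model (illustrative)

# Hypothetical chi-square drops suggested by modification indices for successive
# error covariances (each freed parameter costs one degree of freedom).
chi2_drops = [60.0, 48.0, 40.0, 34.0, 30.0]

print(f"initial model: CFI = {cfi(chi2, df, chi2_baseline, df_baseline):.3f}")
step = 0
while cfi(chi2, df, chi2_baseline, df_baseline) < .90 and step < len(chi2_drops):
    chi2 -= chi2_drops[step]
    df -= 1
    step += 1
    print(f"after {step} error covariance(s): CFI = "
          f"{cfi(chi2, df, chi2_baseline, df_baseline):.3f}")

# Each freed covariance mechanically improves fit; nothing in the loop asks whether
# the added parameters are theoretically defensible.
```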
Does the aforementioned mean that specifying covariances between pairs of uniquenesses, or bifactor models, is per se bad practice? We state categorically that these models do not, per se, constitute bad practice. Indeed, there are concrete situations where the use of correlated errors, as well as the specification of traditional bifactor models, can be justified and recommended (see, for example, Eid et al., 2017). Rather, the downside of such practices lies in how these models are used, interpreted, and reported, not in the models themselves (Box, 1979; McElreath, 2020). In the case at hand, what is questionable is the indiscriminate and unjustified use of covariances between pairs of errors, or of bifactor models, with the sole objective of reaching the cut-off points typically established for model fit indicators. Added to this is the tendency to generate a discourse, after the results are known, that persuades the reader that the model is theoretically valid and procedurally sound. We will return to this point later when discussing model-HARKing.
Emphasizing new (measurement or estimation) models: There is a clear tendency to massively apply a new estimation method or measurement model without a clear justification for the choice and without a critical use of it. This has been seen in the case of the bifactor models mentioned previously (Bonifay et al., 2016; Flores-Kanter et al., 2018), but a similar trend can also be observed in the case of psychometric network models (Burger et al., 2022). The emphasis on publishing a new measurement model or a new estimation method, at the expense of generating a critical view of the measure in question, prevents the proper advancement of psychometrics (Flake & Fried, 2020) and generates unnecessary noise in applied studies (Lewis, 2021). Previous studies have also warned about the tendency to create new scales without considering their overlap with existing scales and without assessing their incremental value in relation to previous measures of the same or similar constructs (Rosenbusch et al., 2020). All of the above can be linked primarily to the research bias that Antonakis (2017) called the neophilia disease (i.e., a tendency to show novel or spectacular results, which are likely to be wrong); but it is also associated with the theorrhea disease, in the sense that many psychometric researchers do not engage critically with the theoretical aspects of the measures they assess, focusing almost exclusively on a single source of evidence of construct validity, usually the structural or external source (Flake & Fried, 2020; Lilienfeld & Strother, 2020). Here again, is applying novel psychometric/measurement models or proposing a new measure a QRC-PΨmetrics? Of course not; what is questionable is (a) the uncritical and unjustified application of models, selecting them not on the basis of critical and theoretically relevant reasoning but on the basis of their novelty and associated publication advantage; and (b) the generation of new measures in a superficial manner, that is, without adequately considering their overlap with previous measures and their incremental value.
There are many other practices in psychometrics that can be included in the category we have here called QRC-PΨmetrics. Here we have made an arbitrary selection of two sets of practices, fit-hacking and emphasizing new (measurement or estimation) models, which we consider to be widespread and to which we believe more attention should be given. However, we encourage the reader to delve deeper into the extensive literature on other forms of questionable practices in psychometrics. We mention below some examples of these QRC-PΨmetrics that have been identified in previous contributions:
1. Inferring the measurements derived from an instrument, and basing the choice of measures, solely on the name given to the instrument (Lilienfeld & Strother, 2020).
2. Misapplication of internal consistency indicators and the exclusive use of Cronbach’s alpha coefficient (Cho, 2021).
3. The elimination of items in order to achieve acceptable internal consistency (Ulrich & Miller, 2018); see the sketch following this list.
4. Inappropriate use of factor estimation methods and procedures associated with exploratory factor analysis (Ferrando et al., 2022; Lloret-Segura et al., 2014).
5. The debatable use of sum scores (Widaman & Revelle, 2023) and of the item-parceling approach (Matsunaga, 2008).
6. Using fit indices arbitrarily and with different cutoffs to support or reject the fit of a model (McNeish & Wolf, 2021).
7. Exclusive consideration of the structural or external phase as final evidence of construct validity (Flake et al., 2017; Lilienfeld & Strother, 2020).
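As a small illustration of points 2 and 3 above, the following Python sketch (simulated data and a helper of our own; nothing here is taken from the cited sources) computes Cronbach’s alpha and then greedily deletes the items whose removal raises it the most. Iterating this procedure until a conventional cut-off is exceeded capitalizes on chance and can quietly narrow the construct being measured.

```python
# Illustrative sketch: "alpha-hacking" by deleting items until internal consistency
# looks acceptable. Data and helper function are invented for the example.
import numpy as np

rng = np.random.default_rng(7)

def cronbach_alpha(items):
    """Classical alpha: k/(k-1) * (1 - sum of item variances / variance of the sum score)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Simulated 8-item scale: six coherent items plus two weak, noisy ones.
n = 400
factor = rng.normal(size=n)
loadings = np.array([.80, .80, .75, .75, .70, .70, .30, .25])
data = factor[:, None] * loadings + rng.normal(scale=.9, size=(n, 8))

kept = list(range(8))
print(f"full scale: alpha = {cronbach_alpha(data):.3f}")
for _ in range(2):
    # Drop whichever remaining item raises alpha the most ("alpha if item deleted").
    drop = max(kept, key=lambda j: cronbach_alpha(data[:, [i for i in kept if i != j]]))
    kept.remove(drop)
    print(f"dropped item {drop}: alpha = {cronbach_alpha(data[:, kept]):.3f}")
# Repeating this until a conventional cut-off (e.g., alpha > .80) is exceeded is
# exactly the questionable practice described in point 3 above.
```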
In sum, in psychometrics we have a great deal of methodological flexibility available that can be exploited to our advantage to achieve the desired result; but not only that, we also have ways to convince editors, reviewers, and readers of the relevance of our (questionable) approach and of the novelty, relevance, and necessity of our (forced) findings (i.e., their theoretical and practical implications). In the following, we turn to the QRC linked to reporting.
QRC: Report-Related Research (QRC-RΨmetrics)
QRC-RΨmetrics refer to the non-transparent or inaccessible reporting of the steps and procedures applied in the investigation, as well as of the data and other research materials. Although such reporting does not necessarily imply that QRCΨmetrics, in the sense of misuse or non-optimal use of methodological-statistical procedures, have occurred, the lack of transparency and of access to the steps, procedures, and research materials makes it difficult to evaluate or review the research as a whole.
A second group of QRC-RΨmetrics has recently been identified in psychological assessment in general. Authors such as Flake and Fried (2020) have indicated certain uses and behaviors as questionable practices in measurement and have especially highlighted the need to promote a more open and transparent reporting methodology in the area. Here, we are interested in emphasizing another particular QRC-RΨmetrics, which we have named model-HARKing. Originally, the acronym HARKing was used to refer to the behavior of “hypothesizing after the results are known” (Munafò et al., 2017). In psychometrics it is possible to identify similar conduct, which is mainly evidenced in the way the report is presented. There, the overlapping of exploratory and confirmatory objectives and/or analyses is common and widespread. The biggest problem is that, as happens in other fields of knowledge (Fife & Rodgers, 2022), an approach that is entirely exploratory is presented as confirmatory. Thus, with the term model-HARKing we try to draw attention to those forms of reporting that, after it is known which model presents the best fit (generally achieved through the behaviors we have described as fit-hacking), aim to assemble the whole document in coherence with this result, without making visible the fact that the model emerged not from a confirmatory approach but from an exploratory one (e.g., from the modification indices resulting from the specification of a given measurement model). This has led, for example, to innumerable factorial solutions being proposed for the same measure, all of which find a “reasonable” explanation within a given body of theory (see the examples presented in Flores-Kanter et al., 2021, and Fried et al., 2022). The latter is also associated with the theorrhea disease (Antonakis, 2017), in that psychometric researchers often seem more concerned with presenting a line of argumentation consistent with the best-fitting model than with treating an indicator of poor fit as an opportunity to reflect critically on the theoretical aspects of the measurements they assess, and as an opportunity to focus on all sources of construct validity (i.e., substantive, structural, and external) rather than exclusively on the structural (in the majority of cases, factor-analytic) or external phase.
Let us now consider what vaccines are available to prevent the emergence of these diseases in psychometric research.
Responsible Research Conduct in Psychometrics (RRCΨmetrics)
As psychometric researchers we must do the best we can, trying to apply the best practices suggested and available at the time. However, what is recommended or conceived as good practice at one time may no longer be recommendable later; and no matter how well-intentioned we may be, we will always be susceptible to mistakes. Moreover, there will surely always be alternative ways of modelling our problem psychometrically (i.e., methodological flexibility; see Manapat et al., 2022). As we tried to express in the previous paragraphs, the criticism is not of statistical models per se, but of their unjustified and inappropriate use, and of the way in which the procedure followed is detailed and the research report is written up. At this point, there are two components that we consider key, and mutually related, to the achievement of RRCΨmetrics: achieving greater transparency and aligning with the principles of open science.
Be Transparent
Psychometric studies should be explicit, clear, and systematic in the procedures and steps followed throughout the research process. Transparency of information, thought of in terms of cooperation, contributes to strengthening the idea of peer control in the scientific community, not only as a way to legitimize research results but also as a way to produce scientific advances. Promoting transparency is an essential step, as it facilitates the detection of errors, makes it possible to make pertinent corrections, and enables the reproducibility of scientific findings.
In the context of qualitative-applied research in psychology, considerations regarding explicitness about assumptions and the justification of methodological choices, as well as recommendations on transparency, have been addressed, developed, and refined for decades; in quantitative science, interest in these factors is only recent (Lewis, 2021). Tuval-Mashiach (2017) has proposed a model of transparency for qualitative research that summarizes these qualitative-applied contributions. According to this model, authors should be able to clearly express, in general terms, the “what, how, and why” of their research procedures. We will now extend this model to psychometric research.
There are three basic questions that must be answered to achieve an adequate level of transparency in psychometric research reports. We speak of a report in a broad sense, including not only the paper but also all the annexed material (e.g., supplementary material in an external open-access repository) that develops each of these questions in adequate detail:
“What I did”: The procedure, method, or approach used must be named with correct language (i.e., using the statistical-methodological term that is widely used and supported by the scientific community of reference). This may seem trivial at first glance, but the ambiguous use of certain terms in psychometrics undermines the replicability, robustness, and reproducibility of psychometric findings (see, for example, Cho, 2021, on the denomination of reliability coefficients). Researchers should also follow international guidelines and standards for reporting each specific design. Although the American Psychological Association has not yet presented a standard format for all types of psychometric studies, there is currently a guide for reporting studies that use structural equation models (Appelbaum et al., 2018, p. 18, Table 7). This guide can be extended to, and serve as a model for, the reporting of other types of psychometric studies (e.g., exploratory factor analysis). Using this guide is strongly recommended, as it helps ensure that the information and details necessary to understand the investigation (the “What I did”) are present in the report.
“How I did it”: The necessary degree of transparency is achieved when an external or independent researcher can repeat, and clearly understand, the steps and procedures described in a report. Correctly reporting the “How I did it” can be achieved by making good use of open science tools, which will be described in detail later. We will simply say here that using open science tools is a good way to answer the “How I did it”, given that they integrate a wide variety of resources for open and reproducible science, providing the options needed to express the research workflow and procedure in a transparent, clear, and complete way.
“Why I did it”: The researcher must be able to explain why a given method was chosen and justify that choice by comparing alternative methods. This is extremely important in the field of psychometrics, where the decision about the methods applied (e.g., factor estimation methods, rotation methods, coefficients considered) often depends on the software usually used by, or available to, the researcher (Lloret-Segura et al., 2014). Given the manifest tendency to use certain procedures repeated in the literature without justification, simply because they are cited by a respected authority in the field (McNeish & Wolf, 2021), it is relevant to reflect upon the motivation for choosing a given method. A clear example of this is the interpretation made (i.e., the cut-off points considered) of commonly applied fit indicators such as the comparative fit index (CFI) and the root mean square error of approximation (RMSEA) (McNeish & Wolf, 2021).
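A small sketch of this last point, with invented fit values: the same set of fit statistics can be judged acceptable or unacceptable depending on which cut-off convention the report happens to invoke, which is precisely why the chosen convention should be justified, and ideally fixed, in advance.

```python
# Illustrative sketch: the same fit statistics "pass" or "fail" depending on which
# cut-off convention the report invokes. All values below are invented.

fit = {"CFI": .93, "TLI": .91, "RMSEA": .07, "SRMR": .075}

conventions = {
    # Lenient thresholds often cited in applied papers.
    "lenient":  {"CFI": (">=", .90), "RMSEA": ("<=", .08), "SRMR": ("<=", .08)},
    # Stricter thresholds also widely cited in the methodological literature.
    "stricter": {"CFI": (">=", .95), "RMSEA": ("<=", .06), "SRMR": ("<=", .08)},
}

def passes(fit_values, rules):
    """Check a set of fit statistics against one convention's cut-offs."""
    ok = []
    for index, (op, cut) in rules.items():
        value = fit_values[index]
        ok.append(value >= cut if op == ">=" else value <= cut)
    return all(ok)

for name, rules in conventions.items():
    verdict = "acceptable" if passes(fit, rules) else "not acceptable"
    print(f"{name} convention: model judged {verdict}")
# Unless the convention is chosen and justified beforehand, the researcher is free
# to pick whichever verdict suits the narrative.
```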
This model of transparency is not only applicable to the method procedures but must also be transversal to the other facets of the research (or empirical cycle, see Tijdink et al., 2021), which correspond to the sections of the report or paper (e.g., introduction and methods). For example, authors should report in a transparent manner the procedure used to review the background literature on the subject. In psychometric studies, this is fundamental to understanding the state of the art of the proposed measurement models, as well as of the applied psychometric procedures. In this sense, we recommend following the guidance offered in the paper “Conducting a Meta-Analysis in the Age of Open Science”, particularly with regard to documenting the search and review of antecedents in a transparent manner and within an open science framework (Moreau & Gamble, 2020).
Be Open Science
In the context of analyzing and responding to QRC, the international literature has called for greater use of so-called open science practices (OSPs; Munafò et al., 2017; Nelson et al., 2018). These include a diverse set of practices, including the sharing of data and code, pre-registration, and preprints, among others. As stated in the case of transparency practices, it is important to note that OSPs are not, on their own, a guarantee of validity and robustness in the reported procedures and findings. Instead, the value of OSPs lies in the fact that they facilitate scrutiny and evaluation of the entire research process, also making it easier to detect and correct honest errors in research (Chin et al., 2023). External repositories for OSPs are a great tool for psychometric research, and their use should be widely encouraged. There are highly valuable technological resources, among which we strongly recommend the Open Science Framework (OSF), since it is free and open source. In addition, it integrates a wide variety of resources for open and reproducible science, giving all the options needed to express the research workflow in a transparent, clear, and complete way. Among OSF resources, we suggest:
Pre-registration of projects/research plans: With this, the hypotheses and the planned analytical methods and procedures can be shared in advance. Objectives, hypotheses, and procedures planned in advance (i.e., prior to the execution of the investigation) are thereby clearly and transparently differentiated from objectives, procedures, and hypotheses derived from the course of the investigation itself (i.e., during or after its execution). An example of this is the difference between confirmatory and exploratory objectives or analyses, which is closely linked to the model-HARKing behavior mentioned above. The pre-registration of projects/research plans is, therefore, a good antidote to fit-hacking and model-HARKing behaviors, given that psychometric researchers can use this resource to clearly delimit, prior to the actual execution of the research, fundamental aspects such as the measurement and structural models to be estimated, the estimation method, the fit indicators to be considered, and the cut-off points to be applied. This prior delimitation, which restricts and makes transparent the researcher degrees of freedom, provides a clear baseline that then allows the parts of the report that correspond to a confirmatory approach to be distinguished from those that are exploratory in nature. Of course, if psychometric researchers have published a pre-registration prior to conducting their research, it will be more difficult for them to publish a report in which the inherent methodological flexibility is exploited in their favor to achieve the desired results (i.e., fit-hacking) and in which research decisions are hidden (i.e., model-HARKing).
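As a hedged illustration of what such prior delimitation might look like in practice (the field names and values below are hypothetical, not a prescribed template), a psychometric pre-registration can simply freeze these analytic decisions in a machine-readable file that is then deposited, time-stamped, as part of the registration:

```python
# Illustrative sketch of the analytic decisions a psychometric pre-registration can
# fix in advance. Field names and values are hypothetical; the point is that the plan
# is written down and frozen before the data are analyzed.
import json
from datetime import datetime, timezone

analysis_plan = {
    "instrument": "PANAS (20 items)",
    "competing_measurement_models": [
        "two correlated factors (PA, NA)",
        "two factors + method factor for reverse-keyed items",
    ],
    "estimator": "WLSMV (ordinal items)",
    "fit_indices_and_cutoffs": {"CFI": ">= .95", "RMSEA": "<= .06", "SRMR": "<= .08"},
    "model_modifications": "none planned; any post hoc respecification will be "
                           "reported explicitly as exploratory",
    "reliability": "categorical omega per factor",
}

plan_record = {
    "registered_at": datetime.now(timezone.utc).isoformat(),
    "plan": analysis_plan,
}

with open("preregistration_analysis_plan.json", "w", encoding="utf-8") as f:
    json.dump(plan_record, f, indent=2)

print(json.dumps(plan_record, indent=2))
```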
Open Database and Open Code: The platform allows uploading both the data (i.e., raw and processed data) and the code or syntax that was used to carry out the analyses in the given software. Publishing the code-syntax is useful for promoting reproducibility, as it allows the same analytical steps to be followed, but it also promotes quality control by increasing the opportunities to find bugs in the code (Laurinavichyute et al., 2022). All of this is especially important in psychometrics, given the many analytical options and software packages available. We also suggest taking into account the guidance for a correct presentation (one that promotes reproducibility and replicability) of the software’s syntax and information (Buchanan et al., 2021; Epskamp, 2019), and following the TIER protocol. Also, the publication of the database should respect the conditions known by the acronym FAIR (Buchanan et al., 2021; Levenstein & Lyle, 2018). FAIR refers to the conditions of findability (with adequate metadata), accessibility, interoperability (adaptation to systems), and reusability (open licenses). It is important to mention that the open data movement acknowledges that there may always be valid restrictions (concerning personal data, for example), and it is important to declare the restrictions applied to open data (Meyer, 2018).
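In the same spirit, and as a minimal sketch rather than a required procedure, the shared code can be accompanied by a small record of the software environment (interpreter and package versions), which is one simple way of supporting the reproducibility that the guidance cited above encourages:

```python
# Illustrative sketch: record the software environment next to the shared code and
# data, so that the reported analyses can be reproduced later.
import sys
import platform
from importlib import metadata

packages = ["numpy", "pandas", "scipy"]   # list whatever packages the analysis actually uses

lines = [f"python: {sys.version.split()[0]} ({platform.platform()})"]
for pkg in packages:
    try:
        lines.append(f"{pkg}: {metadata.version(pkg)}")
    except metadata.PackageNotFoundError:
        lines.append(f"{pkg}: not installed")

# Write the record alongside the analysis outputs (e.g., in the shared repository).
with open("session_info.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")

print("\n".join(lines))
```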
Preprint research report: Preprints are open access documents (i.e., research reports) published on a specific server and made available to receive comments from peers in a given discipline (Moshontz et al., 2021). A preprint may be a preliminary version of a document, for example one in a state prior to a peer review process, or it may be the accepted version of a paper to be published in a scientific journal. In the latter case, a version not edited by the journal in which the document has been accepted is uploaded, letting the reader know about the differences between this preprint version and the version formally published in the scientific journal. The psychology-specific preprint server hosted by the OSF is called PsyArxiv. A preprint in PsyArxiv may be integrated as part of a larger project in OSF, meaning that all associated data, protocols, and other study materials can be published along with it. The use of this resource in psychometrics, as in all disciplines, broadens access to many publications in scientific journals by reducing restrictions.
Conclusions
As scientists in the psychological field, we are witnessing a present full of positive changes. Every day the movement that seeks to promote the credibility and replicability of psychological research on the basis of transparency is becoming stronger (Mellor et al., 2018). The evidence suggests that the majority of researchers agree with the principles of transparency and open science in research. However, it has also been shown that the concrete application of these principles and practices is not homogeneous across all scientific disciplines or across all subdisciplines of a given knowledge area. While we have presented relevant background information that certainly helps to promote responsible conduct in research in the field of psychological assessment in general, these analyses and recommendations have not yet focused specifically on psychometric studies proper. This paper has begun to fill this gap.
Given all the above, we believe it is important to give some recommendations considering the different levels involved, all of which have an influence on questionable research practices (Tijdink et al., 2021). At the individual level, it is important for psychometric researchers to be aware of these questionable research practices and to be able to identify the biases associated with these trends (Antonakis, 2017). At a more general level, it is essential that journals begin, as a first step, to adapt their editorial processes to models of responsible conduct in research regarding transparency and open science. It is also important that research ethics committees, or centers in charge of the ethical evaluation of psychometric projects, incorporate, promote, and adhere to these practices.
The present article only offers an initial approach to the problem of questionable research practices in psychometrics. It is necessary to carry out meta-scientific investigations and systematic reviews that help to establish the frequency of QRCΨmetrics, the factors associated with them, and the uses of, and factors that contribute to, RRCΨmetrics (Chin et al., 2023). These future investigations should take into account the context in which research work is carried out, in order to examine variability across the countries involved. For example, we think that QRCΨmetrics are especially widespread in our South-Central American region, and that RRCΨmetrics are not yet widespread in these countries. However, future studies should provide empirical data in this regard. Tutorials showing the steps needed to generate a transparent report in psychometrics should also be published (Luong & Flake, 2022; PsyTeachR Team et al., 2022) to facilitate the use of external open-access repositories (e.g., OSF), as well as to promote good practices associated with information access (e.g., recommendations for providing open access to databases; how to present and organize the information in the external repository; how to organize open-access code). Lastly, reporting standards for psychometric studies should be promoted. These should consider not only the points to be presented in the paper but also the aspects that should be developed in supplementary materials. In addition, this guideline or set of guidelines should cover the entire spectrum of approaches in the psychometric field, from more exploratory or less restrictive approaches to confirmatory or more restrictive approaches (e.g., Exploratory Factor Analysis [EFA], Ferrando et al., 2022; Exploratory Structural Equation Modeling [ESEM], Marsh et al., 2014; Confirmatory Factor Analysis [CFA]; see also Morin et al., 2020), as well as new approaches in psychometrics (e.g., Exploratory Graph Analysis [EGA], Golino & Epskamp, 2017) and other types of psychometric approaches such as item response theory (IRT; Raykov et al., 2017). The specific objectives of the psychometric study (e.g., construction, adaptation, validation; Ferrando et al., 2022) and the type of test applied (e.g., experimental manipulations; Chester & Lasko, 2021) should also be considered.
To summarize, we believe that the final objective pursued by the contributions on the replicability crisis, particularly in the behavioral sciences, should not be underestimated; the same is true of the meta-scientific studies that identify questionable research conduct, as well as of the insights developed to promote responsible conduct in research. The objective is “to construct reliable and valid knowledge about how the mind works, and how the mind influences our behavior and vice versa” (Lewis, 2021, p. 10). Bearing this general objective in mind, and following Lewis (2021), it is necessary not only to promote good methodological practices and transparency in individual studies, but also to promote greater heterogeneity and integration between diverse methodologies, different populations, and research groups (see also Wagenmakers et al., 2022). Only then shall the behavioral sciences overcome this moment of crisis and achieve more valid and reliable knowledge. It is also important to note that while transparency and open science practices can enhance the evaluation of research validity and robustness, it is crucial to supplement them with critical appraisal that can distinguish between strong (i.e., robust and valid) and weak research practices (Chin et al., 2023). As pointed out by Antonakis (2017), a useful science is one that, in addition to accounting for rigor (i.e., the robustness, accuracy, and reliability of the research), can respond to the following generic questions: (a) So what?, which asks whether the theoretical or empirical contribution adds to cumulative research efforts; and (b) Will it make a difference?, which refers to the extent to which the finding can inform basic or applied research, so that we can better understand the phenomenon and/or inform policy or practice. This is in line with the conclusion of Rohrer et al. (2022, p. 11), with which we agree and which we consider relevant for psychometric studies: “Our vision is one in which psychological research is inherently transparent and collaborative, collectively striving toward greater robustness and culmination of knowledge.”
Finally, we would like to emphasize that this manuscript has not been written with the aim of blaming anyone in particular but, on the contrary, in the spirit of constructive criticism of a field to which we, as psychometric researchers and authors of this proposal, are no strangers. Indeed, we recognize that we ourselves identify with many of the biases outlined above, that we have engaged in some of these QRCΨmetrics, and that we have only recently begun to incorporate tools consistent with OSPs. As Rohrer et al. (2022, p. 10) put it, “This unfortunate situation can occur without any ill intention on the part of researchers, and we do not mean to imply that researchers who use these models are bad at their job or (even worse) do not care about the truthfulness of their claims—they are simply implementing practices that they have been taught and that often result in interesting sounding empirical claims.” We believe that the identification and recognition of these conducts is important and will help us to improve our daily work as psychometricians. All in all, we hope that this work will help generate greater awareness of QRCΨmetrics and adherence to RRCΨmetrics.