Machine learning has achieved substantial success in natural language processing (NLP). For example, large language models have empirically proven capable of producing text of high complexity and cohesion. At the same time, however, they are prone to inaccuracies and hallucinations. As these systems are increasingly integrated into real-world applications, ensuring their safety and reliability becomes a primary concern. In safety-critical contexts, such models must be robust to variability or attack and give guarantees over their output. Computer vision pioneered the use of formal verification of neural networks for such scenarios and developed common verification standards and pipelines, leveraging precise formal reasoning about geometric properties of data manifolds. In contrast, NLP verification methods have only recently appeared in the literature. While presenting sophisticated algorithms in their own right, these papers have not yet crystallised into a common methodology: they are often light on the pragmatic issues of NLP verification, and the area remains fragmented. In this paper, we attempt to distil and evaluate the general components of an NLP verification pipeline emerging from the progress in the field to date. Our contributions are twofold. First, we propose a general methodology for analysing the effect of the embedding gap, the discrepancy between verification of geometric subspaces and the semantic meaning of the sentences those subspaces are supposed to represent, and we propose a number of practical NLP methods that can help to quantify its effects. Second, we give a general method for training and verification of neural networks that leverages a more precise geometric estimation of semantic similarity of sentences in the embedding space and helps to overcome the effects of the embedding gap in practice.
The gift-exchange game is a form of sequential prisoner's dilemma, developed by Fehr et al. (1993), and popularized in a series of papers by Ernst Fehr and co-authors. While the European studies typically feature a high degree of gift exchange, the few U.S. studies provide some conflicting results. We find that the degree of gift exchange is surprisingly sensitive to an apparently innocuous change—whether or not a comprehensive payoff table is provided in the instructions. We also find significant and substantial time trends in responder behavior.
Coordination games with Pareto-ranked equilibria have attracted major attention over the past two decades. Two early path-breaking sets of experimental studies were widely interpreted as suggesting that coordination failure is a common phenomenon in the laboratory. We identify the major determinants that seem to affect the incidence, and/or emergence, of coordination failure in the lab and review critically the existing experimental studies on coordination games with Pareto-ranked equilibria since that early evidence emerged. We conclude that there are many ways to engineer coordination successes.
The random effects meta-analysis model is an important tool for integrating results from multiple independent studies. However, the standard model assumes normal distributions for both random effects and within-study errors, making it susceptible to outlying studies. Although robust modeling using the t distribution is an appealing idea, existing work, which explores the use of the t distribution only for the random effects, involves complicated numerical integration and numerical optimization. In this article, a novel robust meta-analysis model using the t distribution is proposed (tMeta). The novelty is that the marginal distribution of the effect size in tMeta follows the t distribution, enabling tMeta to simultaneously accommodate and detect outlying studies in a simple and adaptive manner. A simple and fast EM-type algorithm is developed for maximum likelihood estimation. Owing to the mathematical tractability of the t distribution, tMeta is free from numerical integration and allows for efficient optimization. Experiments on real data demonstrate that tMeta compares favorably with related competitors in situations involving mild outliers. Moreover, in the presence of gross outliers, while related competitors may fail, tMeta continues to perform consistently and robustly.
A dataset does not speak for itself, and model assumptions can drive results just as much as the data. Limited transparency about model assumptions creates a problem of asymmetric information between analyst and reader. This chapter shows why we need better methods for producing robust results.
Multiverse analysis is not simply a computational method but also a philosophy of science. In this chapter we explore its core tenets and historical foundations. We discuss the foundational principle of transparency in the history of science and argue that multiverse analysis brings social science back into alignment with this core founding ideal. We make connections between this framework and multiverse concepts developed in cosmology and quantum physics.
This chapter advocates a simple principle: Good analysis should be easier to publish than bad analysis. Multiverse methods promote transparency over asymmetric information and emphasize robustness, countering the fragility inherent in single-path analysis. In an era when the credibility of scientific results is often challenged, the use of multiverse analysis is crucial for bolstering both the credibility and persuasiveness of research findings.
Despite the versatility of generalized linear mixed models in handling complex experimental designs, they often suffer from misspecification and convergence problems. This makes inference on the values of coefficients problematic. In addition, the researcher’s choice of random and fixed effects directly affects statistical inference correctness. To address these challenges, we propose a robust extension of the “two-stage summary statistics” approach using sign-flipping transformations of the score statistic in the second stage. Our approach efficiently handles within-variance structure and heteroscedasticity, ensuring accurate regression coefficient testing for 2-level hierarchical data structures. The approach is illustrated by analyzing the reduction of health issues over time for newly adopted children. The model is characterized by a binomial response with unbalanced frequencies and several categorical and continuous predictors. The proposed approach efficiently deals with critical problems related to longitudinal nonlinear models, surpassing common statistical approaches such as generalized estimating equations and generalized linear mixed models.
A method for robust canonical discriminant analysis via two robust objective loss functions is discussed. These functions are useful for reducing the influence of outliers in the data. Majorization is used at several stages of the minimization procedure to obtain a monotonically convergent algorithm. An advantage of the proposed method is that it allows for optimal scaling of the variables. A simulation study shows that, in the presence of outliers, the robust functions outperform the ordinary least squares function, both when the underlying structure is linear in the variables and when it is nonlinear. Furthermore, the method is illustrated with empirical data.
Tukey's scheme for finding separations in univariate data strings is described and tested. It is found that one can use the size of a data gap, coupled with its ordinal position in the distribution, to determine the likelihood of its having arisen by chance. It is also shown that the scheme is relatively robust for fatter-tailed-than-Gaussian distributions and has some interesting implications in multidimensional situations.
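The scheme's core quantity can be sketched as follows: sort the data, take successive gaps, and weight each gap by its ordinal position so that gaps near the middle of the distribution count more than gaps in the tails. A minimal sketch, assuming the common sqrt(i*(n-i)) position weighting for Tukey's gapping; that weighting and the function name are illustrative choices, not details given in the abstract:

```python
import math

def weighted_gaps(data):
    """Sort the data and weight each successive gap by its ordinal
    position: the gap between order statistics i and i+1 is scaled by
    sqrt(i*(n-i)), so central gaps receive larger weights.
    (The sqrt(i*(n-i)) weighting is an illustrative assumption.)"""
    xs = sorted(data)
    n = len(xs)
    return [math.sqrt(i * (n - i)) * (xs[i] - xs[i - 1])
            for i in range(1, n)]

# A clear separation shows up as one dominant weighted gap:
g = weighted_gaps([1.0, 1.1, 1.2, 9.0, 9.1])
```

Here the third weighted gap dwarfs the others, flagging the separation between the two clusters; the paper's contribution is calibrating how often a gap of a given weighted size arises by chance.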
A test is proposed for the equality of the variances of k ≥ 2 correlated variables. Pitman's test for k = 2 reduces the null hypothesis to zero correlation between their sum and their difference. Its extension, eliminating nuisance parameters by a bootstrap procedure, is valid for any correlation structure between the k normally distributed variables. A Monte Carlo study for several combinations of sample sizes and number of variables is presented, comparing the level and power of the new method with previously published tests. Some nonnormal data are included, for which the empirical level tends to be slightly higher than the nominal one. The results show that our method is close in power to the asymptotic tests which are extremely sensitive to nonnormality, yet it is robust and much more powerful than other robust tests.
Item response theory (IRT) models are now in common use for the analysis of dichotomous item responses. This paper examines the sampling theory foundations for statistical inference in these models. The discussion includes: some history on the “stochastic subject” versus the random sampling interpretations of the probability in IRT models; the relationship between three versions of maximum likelihood estimation for IRT models; estimating θ versus estimating θ-predictors; IRT models and loglinear models; the identifiability of IRT models; and the role of robustness and Bayesian statistics from the sampling theory perspective.
In the framework of a robustness study on maximum likelihood estimation with LISREL, three types of problems are dealt with: nonconvergence, improper solutions, and the choice of starting values. The purpose of the paper is to illustrate why and to what extent these problems are of importance for users of LISREL. The ways in which these issues may affect the design and conclusions of robustness research are also discussed.
Taxicab correspondence analysis is based on the taxicab singular value decomposition of a contingency table, and it shares some similar properties with correspondence analysis. It is more robust than the ordinary correspondence analysis, because it gives uniform weights to all the points. The visual map constructed by taxicab correspondence analysis has a larger sweep and clearer perspective than the map obtained by correspondence analysis. Two examples are provided.
This paper addresses methodological issues that concern the scaling model used in the international comparison of student attainment in the Programme for International Student Assessment (PISA), specifically with reference to whether PISA’s ranking of countries is confounded by model misfit and differential item functioning (DIF). To determine this, we reanalyzed the publicly accessible data on reading skills from the 2006 PISA survey. We also examined whether the ranking of countries is robust to the errors of the scaling model. This was done by studying invariance across subscales, and by comparing ranks based on the scaling model with ranks based on models in which some of the flaws of PISA’s scaling model are taken into account. Our analyses provide strong evidence of misfit of the PISA scaling model and very strong evidence of DIF. These findings do not support the claim that the country rankings reported by PISA are robust.
We study the robustness of Krupka and Weber's (2013) method for eliciting social norms. In two online experiments with more than 1200 participants on Amazon Mechanical Turk, we find that participants’ response patterns are invariant to differences in the salience of the monetarily incentivized coordination aspect. We further demonstrate that asking participants for their personal first- and second-order beliefs without monetary incentives yields qualitatively identical responses when beliefs and social norms are well aligned. Overall, Krupka and Weber's method produces remarkably robust response patterns.
Corrections for restriction in range due to explicit selection assume the linearity of regression and homoscedastic array variances. This paper develops a theoretical framework in which the effects of some common forms of violation of these assumptions on the estimation of the unrestricted correlation can be investigated. Simple expressions are derived for both the restricted and corrected correlations in terms of the target (unrestricted) correlation in these situations.
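The classical correction for explicit selection that the abstract builds on (Thorndike's Case II) makes the two assumptions at issue, linear regression and homoscedastic arrays, directly visible: it maps the restricted correlation back through the ratio of unrestricted to restricted standard deviations on the selection variable. A minimal sketch of that classical formula (the function name is illustrative; the paper's contribution, expressing the corrected correlation under violated assumptions, is not reproduced here):

```python
import math

def corrected_correlation(r, sd_unrestricted, sd_restricted):
    """Thorndike Case II correction for direct range restriction.
    r: correlation observed in the selected (restricted) sample.
    Assumes linearity of regression and homoscedastic array
    variances -- the very assumptions whose violation the paper studies."""
    u = sd_unrestricted / sd_restricted
    return r * u / math.sqrt(1 - r * r + r * r * u * u)

# Selection halved the SD of the selection variable (u = 2):
rc = corrected_correlation(0.5, 2.0, 1.0)  # ≈ 0.756, disattenuated upward
```

When the SD ratio is 1 (no restriction), the formula returns r unchanged; as the ratio grows, the corrected value rises toward 1, which is why violations of the underlying assumptions can bias the correction substantially.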
This paper discusses the issue of differential item functioning (DIF) in international surveys. DIF is likely to occur in international surveys. What is needed is a statistical approach that takes DIF into account, while at the same time allowing for meaningful comparisons between countries. Some existing approaches are discussed and an alternative is provided. The core of this alternative approach is to define the construct as a large set of items, and to report in terms of summary statistics. Since the data are incomplete, measurement models are used to complete the incomplete data. For that purpose, different models can be used across countries. The method is illustrated with PISA’s reading literacy data. The results indicate that this approach fits the data better than the current PISA methodology; however, the league tables are nearly identical. The implications for monitoring changes over time are discussed.
The importance of appropriate test selection for a given research endeavor cannot be overemphasized. Using samples drawn from eleven populations (differing in shape, peakedness, and density in the tails), this study investigates the small-sample empirical powers of nine k-sample tests against ordered location alternatives under completely randomized designs. The results are intended to aid the researcher in selecting a procedure appropriate for a given endeavor. To highlight this, an industrial psychology application involving work productivity is presented.
When some of the observed variates do not conform to the model under consideration, they can have a serious effect on the results of statistical analysis. In factor analysis, a model with inconsistent variates may result in improper solutions. In this article, a useful method for identifying a variate as inconsistent in factor analysis is proposed. The procedure is based on the likelihood principle. Several statistical properties, such as the effect of misspecified hypotheses, the problem of multiple comparisons, and robustness to violation of distributional assumptions, are investigated. The procedure is illustrated with some examples.