Research in judgment and decision-making often observes clear deviations from the predictions of normative models of choice under risk and uncertainty like expected utility theory. This has led to the development of many so-called descriptive models, meant to describe how people actually make choices in risky and uncertain situations (e.g., Kahneman & Tversky, Reference Kahneman and Tversky1979; Tversky & Kahneman, Reference Tversky and Kahneman1992; Savage, Reference Savage1954; Bell, Reference Bell1982). As these models proliferate, determining their relevance in various decision contexts emerges as a pivotal challenge, underscoring the need for systematic model evaluations and comparisons.
One effective way to perform such systematic evaluations is to compare models based on their predictive accuracy in large sets of human choice problems, preferably answered by large samples of participants. The methodology of comparing models based on prediction accuracy on common data draws from a large literature in computer and data science, facilitates comparison between models with different numbers of parameters, and increases the chances that diverse models will be developed (Plonsky & Erev, Reference Plonsky and Erev2021a).
In a recent impressive study, He, Analytis, and Bhatia (Reference He, Analytis and Bhatia2022), hereafter HAB, performed a large-scale comparison of dozens of models of decision-making under risk. They grouped 19 datasets from multiple published studies with more than 1800 choice tasks, each a one-shot decision between two fully described gambles with up to two possible outcomes. They analyzed both datasets with tasks involving only potential gains (hereafter the gain domain) and datasets with tasks involving both gains and losses (hereafter the mixed domain). This paradigm of choice between gambles has been a prevalent research tool in behavioral economics since its inception, enabling researchers to gain valuable insights into human preferences and attitudes in a wide range of decision-making contexts (e.g., Allais, Reference Allais1953; Erev et al., Reference Erev, Ert, Plonsky, Cohen and Cohen2017, Ert & Erev, Reference Ert and Erev2013; Kahneman & Tversky, Reference Kahneman and Tversky1979; Stewart et al., Reference Stewart, Reimers and Harris2015). HAB compared 58 published models of risky choice to examine which of these offers the best predictions of individual decision makers’ choices for this large data. The results revealed that a variant of cumulative prospect theory (CPT; Prelec, Reference Prelec1998) was the best predictive model in both the gain domain and the mixed domain.Footnote 1
However, in two other recent large-scale model comparison studies, Choice Prediction Competitions (CPC) 2015 (Erev et al., Reference Erev, Ert, Plonsky, Cohen and Cohen2017) and 2018 (Plonsky et al., Reference Plonsky, Apel, Ert, Tennenholtz, Bourgin, Peterson and Erev2024), CPT did not fare as well. In these competitions, any model, including CPT, could be independently submitted and evaluated for its prediction accuracy. The results of both competitions showed that the model BEAST (Best Estimate and Sampling Tools; Erev et al., Reference Erev, Ert, Plonsky, Cohen and Cohen2017), which is very different from mainstream models like CPT, emerged as the one with the most accurate predictions.Footnote 2 It is thus of interest to investigate the reasons behind these differing results.
1.1. Overlooking nonanalytic models
Aside from the different winning models, several differences exist between the study conducted by HAB and the two CPCs. The most significant difference is HAB's decision to exclude a certain category of models from their large-scale analysis. Specifically, they chose to exclude models that could not be fitted easily using analytical likelihood functions (hereafter nonanalytic models), like those that require running simulations to make predictions. This choice, incidentally, implied the exclusion of the model BEAST from the analysis. The decision to exclude nonanalytic models reflects a general practice in the field that tends to focus on models that are amenable to estimation and those with easily identifiable parameters. The focus on such models diminishes modeling effort, allows building directly on previous classical models (like expected utility and prospect theory), and is likely more easily justifiable to reviewers and readers (Plonsky & Erev, Reference Plonsky and Erev2021a). However, there is no a priori reason to assume that a theoretical "ideal model" of decision-making must necessarily fall within the space of models that are easily estimable. Ignoring nonanalytic models may hinder progress and suppress our understanding of human decision-making (Bugbee & Gonzalez, Reference Bugbee and Gonzalez2022). This potential problem may be particularly concerning as models that utilize simulations to generate predictions—and that are therefore not easily estimable using traditional fitting practices—have a strong track record of providing highly useful predictions of behavior (Erev et al., Reference Erev, Ert and Roth2010; Erev et al., Reference Erev, Ert, Plonsky, Cohen and Cohen2017; Plonsky et al., Reference Plonsky, Apel, Ert, Tennenholtz, Bourgin, Peterson and Erev2024).
Furthermore, nonanalytic models are often implemented in ways that are not amenable to easy estimation because they assume psychological processes that are hard (or impossible) to implement analytically. Disregarding the potential of nonanalytic models may thus lead to wrong conclusions about the underlying psychological processes that are important for choice prediction. For example, HAB concluded that subjective nonlinear payoff and probability transformations are essential mechanisms for choice prediction accuracy. However, BEAST does not assume either of these mechanisms. Rather, it uses simulations to derive the predictions of a process of mental sampling of potential outcomes that is sensitive to anticipated regret. Hence, evaluating BEAST can help shed light on the usefulness of assuming psychological processes that are very different from those mainstream models assume.
In this paper, we seek to reconcile the inconsistent results between the two CPCs, where BEAST emerged as the superior predictive model, and HAB's comparison, where CPT was identified as the leading model. To do so, we apply methods identical to those used by HAB and examine the predictive power of BEAST, which was excluded from their analysis, on their data. Notably, HAB's data includes only one-shot binary decisions under risk with up to two outcomes, whereas BEAST was originally developed to capture choice in a much wider class of tasks (including decisions under ambiguity and decisions under risk with repeated feedback). The wide applicability of BEAST has led its developers, concerned with overfitting issues, to introduce several arbitrary implementation assumptions that restrict the model in ways that are not necessarily implied by the underlying theory, but save free parameters. In the simple domain of choice under risk with up to two outcomes, BEAST requires fewer free parameters. This allowed us to also develop and test a version of BEAST that relaxes some of the original restrictive implementation assumptions. This version, which we call AdaBEAST, maintains the underlying psychological rationale of BEAST but allows for increased adaptability to different study contexts. Our study thus investigates how two nonanalytic models, BEAST and AdaBEAST, fare in comparison to dozens of analytic choice models in one of the most basic decision tasks, and explores what can be learned from this comparison.
2. The structure of BEAST and its distinction from classical models
Before presenting the analysis, it is useful to clarify the main theoretical underpinnings of BEAST, its main mechanistic structure, and the main differences from mainstream choice under risk models like CPT. (Implementation details are left for the Methods section and the Supplementary Material.)
BEAST belongs to a class of models that rely on the conjecture that judgment and decision-making reflect cognitive strategies or tools that have been effective in past experiences perceived as similar to the current situation (e.g., Plonsky et al., Reference Plonsky, Teodorescu and Erev2015). A key assumption, which builds on Skinner’s notion of “contingencies of reinforcement” (Skinner, Reference Skinner1953), is that people act as “intuitive classifiers” (Erev & Marx, Reference Erev and Marx2023). They use environmental cues to intuitively determine the class that the current situation belongs to, and then rely on strategies that previously worked well in this class of situations. This subjective imperfect classification process is often effective, but can also lead to behavioral biases: When people misclassify a situation to a class that only appears similar, they rely on strategies that can be counter-productive in the current context (Erev et al., Reference Erev, Ert, Plonsky, Cohen and Cohen2017).
This cognitive process that relies on potentially intricate subjective similarity relationships can be highly complex, and extant models (including BEAST) often do not represent it explicitly. Rather, models that rely on this conjecture aim to approximate the main implications of the complex process. Past research shows that this can be done by surprisingly simple models (Erev & Roth, Reference Erev and Roth2014; Erev et al., Reference Erev, Ert, Plonsky and Roth2023) that assume people behave as if they take small mental samples of potential outcomes of the possible actions and tend to choose the action with the best average outcome in the sample (cf., e.g., Juslin et al., Reference Juslin, Winman and Hansson2007; Vul et al., Reference Vul, Goodman, Griffiths and Tenenbaum2014; Zhu et al., Reference Zhu, Sanborn and Chater2020). Many of the previous models in this class were developed to capture repeated choice behavior. Hence, it was natural to assume that the sampled outcomes are drawn from the payoffs observed in previous experiences with the same task. BEAST, in contrast, aims to (also) capture initial (pre-feedback) choice behavior. The sampled outcomes are thus assumed to reflect the results of cognitive strategies that worked well in situations outside the current experimental context. These strategies are implemented as potentially biased “sampling tools.”
To evaluate a choice option, BEAST agents mentally sample possible payoffs from the objectively described payoff distribution (using the so-called unbiased sampling tool), from transformed (i.e., biased) versions of that distribution, or from both. The three biased sampling tools in BEAST (contingent pessimism, uniform, and sign) aim to capture reliance on past experiences in which the information provided was not accurate and objective but biased or irrelevant. Specifically, the contingent pessimism tool implies sampling (with certainty) the worst payoff described. Its use may reflect a subjective perception of a task as similar to situations where an adversary could influence the agent's realized payoff. A tendency to perceive tasks pessimistically can help explain behavioral phenomena like loss aversion (Samuelson, Reference Samuelson1963), the Allais paradox (Allais, Reference Allais1953), and the St. Petersburg paradox (Bernoulli, Reference Bernoulli1954). The uniform sampling tool implies an equiprobable sampling of one of the described payoffs. Its use may reflect subjectively classifying a task as similar to situations where ignoring probability information (which, unlike payoff information, is often unverifiable even after the fact) was beneficial. A tendency to rely only on payoff information in this manner can help explain phenomena like overweighting of rare events (Friedman & Savage, Reference Friedman and Savage1948) and the Allais paradox. Finally, the sign sampling tool involves unbiased sampling of payoffs after they have been transformed using the sign function. Its use may reflect a subjective perception of a task as similar to contexts where the main goal was to avoid losses (e.g., the initial rounds of survival tournaments where the worst performers are eliminated). A tendency to focus on the payoff sign can help explain phenomena like the reflection effect (Markowitz, Reference Markowitz1952).Footnote 3
The mental sampling mechanism in BEAST implies a very different process than mainstream models of choice under risk, like CPT, and has very different implications. Mainstream models, like CPT, capture deviations from expected value (EV) maximization like those mentioned above by assuming they reflect subjective transformations of each possible payoff. These are then weighted together by subjective transformations of their respective probabilities, thus creating a subjective utility of a given choice option. Most commonly, these transformations are nonlinear but monotonic, such that higher and more probable payoffs necessarily contribute more to the subjective utility. In contrast, in BEAST, due to the inherent stochasticity in the mental sampling process, it is plausible that some payoffs may be overweighted, underweighted, or even neglected entirely. Furthermore, depending on the sampling tool that is applied, higher or more probable payoffs might contribute similarly or less to the decision than other payoffs.
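For concreteness, a simplified (non-cumulative) sketch of this mainstream approach can be written as follows, using the Tversky–Kahneman value function and a one-parameter Prelec weighting function as one common parameterization (the exact functional forms vary across the CPT variants HAB evaluate):

$$V = \sum_i w(p_i)\,v(x_i),\qquad v(x) = \begin{cases} x^{\alpha} & x \ge 0\\ -\lambda \left(-x\right)^{\beta} & x < 0\end{cases},\qquad w(p) = e^{-\left(-\ln p\right)^{\gamma}}.$$

Because $v$ and $w$ are monotonic, a higher or more probable payoff can never contribute less to $V$ than a lower or less probable one; this is precisely the property that BEAST's stochastic sampling process does not share.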
Moreover, in most mainstream choice models, including CPT, the evaluation of each choice option, and its subjective utility, is formed independently of the alternative options. In contrast, the use of sampling tools in BEAST is a function of the choice task, not of particular choice options. For example, if a task is perceived as (also) similar to adversarial situations, then the contingent pessimism tool will be used on all options. Furthermore, the outcomes are sampled from the choice options in a correlated manner and are often directly influenced by the properties of the alternative choice options. This correlated sampling process implies that BEAST in essence includes a mechanism of anticipated regret from choosing one option over the other: The attractiveness of one choice option is a function of the expected attractiveness of the other choice option. This type of context dependence was recently suggested to be crucial for useful choice prediction (Peterson et al., Reference Peterson, Bourgin, Agrawal, Reichman and Griffiths2021; Plonsky & Erev, Reference Plonsky and Erev2021b), but is absent in many classical models.
3. Method
The current study includes four sets of analyses that aim to replicate and extend the analyses conducted by HAB. First, we replicate HAB's use of an ensemble of choice datasets to examine how well dozens of models of decisions under risk predict the decisions made by familiar individuals in new (out-of-sample) choice tasks. The term "familiar" is used here to highlight that the models were evaluated based on choice data from the same individual decision makers on whom the models were also estimated (but using different choice tasks). Importantly, we now add to this examination two nonanalytic models, BEAST and its modified version, AdaBEAST. Second, like HAB, we analyze the psychological mechanisms that are embedded in the most successful models, aiming to better understand the underlying choice processes. Third, we investigate the predictive power of the different models in each of the different datasets that are part of the ensemble used in the main analysis. The aim here is to identify whether specific features of the dataset are associated with greater success of specific models. Finally, we extend the analysis by evaluating the predictive power of the models on "unknown" (rather than "familiar") out-of-sample individuals. Specifically, we assess the models' ability to predict the choices—in new choice tasks as before—of decision makers whose data was not included in the model estimation phase. Under this approach, neither the participants nor the tasks that models are required to predict are seen by the models during model fitting, testing the models' ability to generalize to new decision makers and new decision contexts.
3.1. Models
Details for all 58 "analytic" models in the model comparison and their estimation procedures can be retrieved from He et al. (Reference He, Analytis and Bhatia2022). As mentioned, we add to the comparison two nonanalytic models: BEAST and AdaBEAST.
3.1.1. BEAST
The following describes the main implementation assumptions of BEAST that are relevant to the current investigation. More complete details can be found in the supplementary material (SM) as well as in Erev et al. (Reference Erev, Ert, Plonsky, Cohen and Cohen2017).
In binary decision under risk problems, BEAST implies option A is preferred over option B if:

$$\left[E{V}_A-E{V}_B\right]+\left[S{T}_A-S{T}_B\right]+e>0,$$

where $E{V}_A-E{V}_B$ is the advantage of option A over option B based on their EVs, $S{T}_A-S{T}_B$ is the advantage of option A over option B based on mental sampling using sampling tools, and $e$ is a normally distributed error term with mean 0 and standard deviation ${\sigma}_i$, where $i$ indexes the individual. If one option stochastically dominates the other, it is assumed that $e = 0$.
ST is the average of ${\kappa}_i$ outcomes that are each mentally sampled using one of four possible sampling tools. In each of the ${\kappa}_i$ independent sampling instances, two outcomes, one from each option, are sampled using the same sampling tool, and under the assumption that the payoff distributions from which sampling takes place are positively correlated (a “luck level” procedure – see SM for details). This implies high sensitivity to the anticipated regret, defined as the difference between the outcome sampled in one option and the one sampled from its alternative.
The unbiased sampling tool implies an unbiased draw from the options' described distributions. The remaining three sampling tools imply biased sampling. The uniform sampling tool ignores the described probabilities and samples as if all potential outcomes are equally likely (Thorngate, Reference Thorngate1980). The contingent pessimism sampling tool yields the worst possible outcome (Edwards, Reference Edwards1954) under some lexicographic conditions (Brandstätter, Gigerenzer & Hertwig, Reference Brandstätter, Gigerenzer and Hertwig2006) that depend on a value ${\gamma}_i$ and on the ratio between the worst outcomes of the two options (see SM for details). The sign sampling tool is similar to the unbiased tool, but samples only the sign of the outcome, ignoring magnitudes (Payne, Reference Payne2005). BEAST assumes that the probability of using each of the three biased sampling tools is the same and that the probability of using the unbiased tool is $1-\frac{\beta_i}{\beta_i+1}$.
Finally, an individual decision maker i's parameters are assumed to be drawn from uniform distributions as follows: ${\sigma}_i\sim U\left[0,\sigma \right]$, ${\kappa}_i\sim U\left\{1,2,\dots, \kappa \right\}$, ${\gamma}_i\sim U\left[0,\gamma \right]$, and ${\beta}_i\sim U\left[0,\beta \right]$, with $\sigma, \kappa, \gamma, \beta$ free parameters to be estimated.
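Because BEAST generates predictions by simulation rather than through a closed-form likelihood, a minimal code sketch may help fix ideas. The following Python snippet is an illustrative simplification and not the authors' implementation: it omits, among other things, the lexicographic conditions governing contingent pessimism (the role of ${\gamma}_i$), the $e = 0$ rule under stochastic dominance, and the exact "luck level" procedure described in the SM, and all function names and default parameter values are hypothetical.

```python
import numpy as np

def draw_outcome(payoffs, probs, u):
    """Return the payoff whose cumulative-probability interval contains u."""
    idx = min(np.searchsorted(np.cumsum(probs), u), len(payoffs) - 1)
    return payoffs[idx]

def sample_pair(A, B, tool, rng):
    """One correlated mental sample (one payoff per option) using a given tool.

    A and B are dicts with keys 'x' (payoffs) and 'p' (probabilities).
    A single shared draw u (a simplified 'luck level') induces the positive
    correlation between the two samples that drives sensitivity to regret.
    """
    u = rng.uniform()
    if tool == "unbiased":                       # use the described distributions
        return draw_outcome(A["x"], A["p"], u), draw_outcome(B["x"], B["p"], u)
    if tool == "uniform":                        # ignore described probabilities
        pA = np.full(len(A["x"]), 1 / len(A["x"]))
        pB = np.full(len(B["x"]), 1 / len(B["x"]))
        return draw_outcome(A["x"], pA, u), draw_outcome(B["x"], pB, u)
    if tool == "pessimism":                      # worst described payoff, with certainty
        return min(A["x"]), min(B["x"])
    if tool == "sign":                           # keep only the payoff signs
        return (np.sign(draw_outcome(A["x"], A["p"], u)),
                np.sign(draw_outcome(B["x"], B["p"], u)))

def beast_p_choose_A(A, B, sigma=2.0, kappa=3, beta=2.0, n_agents=5000, seed=0):
    """Simulated probability that option A is chosen over option B."""
    rng = np.random.default_rng(seed)
    ev_diff = np.dot(A["x"], A["p"]) - np.dot(B["x"], B["p"])   # EV_A - EV_B
    tools = ["unbiased", "uniform", "pessimism", "sign"]
    choices = []
    for _ in range(n_agents):
        # individual-level parameters drawn from uniform distributions
        sigma_i = rng.uniform(0, sigma)
        kappa_i = rng.integers(1, kappa + 1)                    # discrete uniform on 1..kappa
        beta_i = rng.uniform(0, beta)
        p_unbiased = 1 / (beta_i + 1)                           # = 1 - beta_i / (beta_i + 1)
        p_tools = [p_unbiased] + [(1 - p_unbiased) / 3] * 3     # biased tools equally likely
        samples = [sample_pair(A, B, rng.choice(tools, p=p_tools), rng)
                   for _ in range(kappa_i)]
        st_diff = np.mean([a - b for a, b in samples])          # ST_A - ST_B
        e = rng.normal(0, sigma_i)
        choices.append(ev_diff + st_diff + e > 0)
    return float(np.mean(choices))

# Example: 3 with certainty vs. a gamble paying 4 with p = .8 (0 otherwise)
A = {"x": np.array([3.0]), "p": np.array([1.0])}
B = {"x": np.array([4.0, 0.0]), "p": np.array([0.8, 0.2])}
print(beast_p_choose_A(A, B))
```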
3.1.2. AdaBEAST
The original BEAST was designed to capture behavior under diverse conditions, including decisions under risk involving multi-outcome gambles, decisions under ambiguity, and decisions from experience. To deal with this complexity and avoid overfitting, its developers made several rather arbitrary implementation assumptions that heavily restrict the model but save free parameters. Since here we focus on one-shot decisions under risk with up to two outcomes without feedback (the setting investigated by HAB), we developed a modified, more adaptable version of the model, AdaBEAST, which relaxes some arbitrary restrictions yet preserves the main logic and mechanisms underlying BEAST. The following details the changes from BEAST.
One highly restrictive (and theoretically arbitrary) assumption embedded in BEAST is that each of the biased sampling tools is used with the same probability. Clearly, however, different choice contexts may lead decision makers to perceive the choice tasks as more similar to some situations than to others, which implies different likelihoods for using different sampling tools. To capture possible idiosyncratic contextual effects of different datasets, AdaBEAST relaxes this restrictive assumption. In AdaBEAST, the probability of using each of the biased sampling tools is a free parameter. Specifically, ${W}_{uf}$, ${W}_s$, and ${W}_{cp}$ represent the probabilities of using the uniform, sign, and contingent pessimism tools, respectively. The probability of using the unbiased sampling tool is then simply ${W}_{ub} = 1-\left({W}_{uf}+{W}_s+{W}_{cp}\right)$.
Another restrictive assumption originally implemented in BEAST is that the difference in averages of the mental samples taken ($S{T}_A-S{T}_B$) is equally weighted with the difference between the options' EVs ($E{V}_A-E{V}_B$). Sensitivity to the difference between EVs was originally introduced to BEAST based on empirical evidence that choice is highly correlated with EV maximization. Yet, it is not clear how it fits within the framework of intuitive subjective classification from which BEAST derives. Further, even if good models of choice under risk should reflect sensitivity to the EV difference, the choice to weight it equally with the output of the mental sampling process is highly restrictive and rather arbitrary (but saves a free parameter). To preserve the model's ability to account for high rates of EV maximization, remain within the general framework of mental sampling as a reflection of intuitive subjective classification, and increase the model's adaptability, we made two changes when developing AdaBEAST.
First, the size of the mental sample taken from each option, ${\kappa}_i$, is assumed to be drawn from a geometric distribution with parameter $p$ (a free parameter). That is, $\Pr\left({\kappa}_i = k\right) = {\left(1-p\right)}^{k-1}p$. This change is based on the observation that most decision makers behave as if they rely on small samples (e.g., Plonsky et al., Reference Plonsky, Teodorescu and Erev2015), but allows for some to behave as if they rely on large ones (Erev et al., Reference Erev, Ert, Plonsky and Roth2023). Second, we completely removed the fixed dependence on the EV difference. That is, AdaBEAST implies option A is preferred over option B if $\left[S{T}_A-S{T}_B\right]+e>0$. Note that because the weight of the unbiased sampling tool in AdaBEAST (${W}_{ub}$) can now range from 0 to 1 and ${\kappa}_i$ can be large (which is more likely when p is small), it is possible for AdaBEAST to rely on an approximation of the EV: a large unbiased sample from the outcome distribution. Hence, AdaBEAST can still capture behavior that appears as sensitivity to the differences between EVs without having to assume such sensitivity explicitly.
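The corresponding changes to the simulation are small. The sketch below reuses the sample_pair() helper from the illustrative BEAST sketch above and assumes, as in BEAST, that ${\sigma}_i\sim U\left[0,\sigma \right]$; again, names and default values are hypothetical rather than taken from the actual implementation (which is available in the SM).

```python
import numpy as np

def adabeast_p_choose_A(A, B, sigma=2.0, p_geo=0.5, w_uf=0.2, w_s=0.2, w_cp=0.2,
                        n_agents=5000, seed=0):
    """Simulated probability that option A is chosen, under AdaBEAST's assumptions."""
    rng = np.random.default_rng(seed)
    tools = ["unbiased", "uniform", "pessimism", "sign"]
    p_tools = [1 - (w_uf + w_s + w_cp), w_uf, w_cp, w_s]   # tool weights are free parameters
    choices = []
    for _ in range(n_agents):
        sigma_i = rng.uniform(0, sigma)
        kappa_i = rng.geometric(p_geo)                     # Pr(kappa_i = k) = (1-p)^(k-1) * p
        samples = [sample_pair(A, B, rng.choice(tools, p=p_tools), rng)
                   for _ in range(kappa_i)]
        # decision rule: [ST_A - ST_B] + e > 0, with no explicit EV term
        choices.append(np.mean([a - b for a, b in samples]) + rng.normal(0, sigma_i) > 0)
    return float(np.mean(choices))
```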
The differences between BEAST and AdaBEAST are at the level of implementation assumptions, not at the level of the underlying mechanisms and logic. AdaBEAST preserves the idea of subjective intuitive classification explained above, as well as the idea that in choice under risk, the implications of this intuitive classification process can be summarized by the use of four sampling tools. Yet, AdaBEAST arguably implements the process in a more natural and cognitively plausible manner as it allows the environment to influence the likelihood for each classification and avoids explicit computations of the EVs. Python code for AdaBEAST can be found in the SM (see https://osf.io/ca6bn/).
3.2. Data
The original data analyzed by HAB included 19 distinct datasets. However, upon inspection, we found that four of these were not usable for this analysis. Three of the datasets, all from the same experiment (Pachur et al., Reference Pachur, Schulte-Mecklenbeck, Murphy and Hertwig2018), had discrepancies between the raw data used by HAB and the actual choice rates as reported in the original article by Pachur et al. (Reference Pachur, Schulte-Mecklenbeck, Murphy and Hertwig2018; Table A3). The source of these discrepancies lies in inconsistencies between task IDs in the raw data and the original task IDs, which, unfortunately, distorted the computed model performances in HAB's analysis. In a fourth dataset (from Stewart et al., Reference Stewart, Reimers and Harris2015), participants faced a substantial proportion of the choice tasks twice in the same session, which meant those tasks appeared in both the training and test samples, leading to data leakage. We thus excluded these four datasets from our analysis, leaving 15 datasets that include a total of 1565 choice tasks published in: Erev et al. (Reference Erev, Ert, Plonsky, Cohen and Cohen2017), Fiedler & Glöckner (Reference Fiedler and Glöckner2012), Pachur et al. (Reference Pachur, Mata and Hertwig2017; Reference Pachur, Schulte-Mecklenbeck, Murphy and Hertwig2018), Rieskamp (Reference Rieskamp2008), and Stewart et al. (Reference Stewart, Reimers and Harris2015; Reference Stewart, Hermens and Matthews2016).Footnote 4
Each dataset includes choice data from a different experimental context. Participants (sample sizes range from 30 to 208) in each context made multiple one-shot binary choices between lotteries with up to two possible outcomes each. The lotteries' payoff distributions were fully and accurately described, and participants did not receive any feedback on their choices. The number of tasks per dataset (and per participant) ranged from 50 to 150. Figure S1 in the SM shows an example of a choice task from one experiment.
3.3. Estimation and cross-validation
Fitting the models to the new data requires the estimation of the parameters $\sigma, \kappa, \gamma, \beta$ for BEAST and the parameters $p,\sigma, {W}_{uf},{W}_s,{W}_{cp}$ for AdaBEAST. Because the models are simulation-based and do not have a differentiable likelihood function, we performed a grid search to find the best set of parameters. Specifically, in each dataset, we first generated the models' predictions for each choice task and each profile of parameters implied by the grid. We then estimated the models using a cross-validation technique similar to that used for the other 58 models in HAB's original study. Each participant's choice data was split into the same exact 10 folds of choice tasks as in HAB. In each cross-validation iteration, we chose the profile of parameters that best fit 9 of these folds (training data, representing 90% of the choice tasks in each dataset), based on the maximum likelihood criterion (Cousineau & Allen, Reference Cousineau and Allen2015), and then elicited the fitted models' predictions for the held-out fold (test data, 10% of the choice tasks in each dataset). This process was repeated 10 times for each participant, with each of the 10 folds serving as the held-out fold once, so that each observation is predicted exactly once out of sample. The SM includes further details on the grid search fitting procedure.
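To make the procedure concrete, the following Python sketch shows the grid-search cross-validation logic for a single participant. The parameter grids, tie-breaking rules, and data handling of the actual analysis follow the SM; the function and variable names here are illustrative.

```python
import numpy as np

def fit_and_predict(pred_grid, choices, folds, eps=1e-4):
    """Grid-search cross-validation for one participant (illustrative sketch).

    pred_grid : dict mapping a parameter profile (a tuple) to an array of
                simulated P(choose A) values, one per choice task.
    choices   : binary array, 1 if the participant chose option A in a task.
    folds     : integer array of fold labels (0-9), same length as `choices`.
    Returns an array of out-of-sample predictions, one per task.
    """
    preds_out = np.empty(len(choices), dtype=float)
    for f in np.unique(folds):
        train = folds != f
        best_profile, best_ll = None, -np.inf
        for profile, p_A in pred_grid.items():
            p = np.clip(p_A[train], eps, 1 - eps)                # avoid log(0)
            ll = np.sum(choices[train] * np.log(p)               # log-likelihood of
                        + (1 - choices[train]) * np.log(1 - p))  # the 9 training folds
            if ll > best_ll:
                best_profile, best_ll = profile, ll
        preds_out[folds == f] = pred_grid[best_profile][folds == f]
    return preds_out
```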
3.4. Analysis
3.4.1. Prediction error
Like HAB, we focus on the prediction of the choices of individual decision makers in each task. In the main analysis, for each individual i, each model m is fitted to the training choice data of that individual and generates a prediction ${\widehat{y}}_{i,m,t}$ for each out-of-sample task t. We then compute, for each individual, the mean squared error (MSE) of the individually fitted model across all tasks: ${MSE}_{m,i} = \frac{1}{N}\sum_{t = 1}^N{\left({\widehat{y}}_{i,m,t}-{y}_{i,t}\right)}^2$, where ${y}_{i,t}$ represents the observed choice of individual i in task t and N is the number of tasks the individual faced. Finally, we compare models based on their average MSE across all individuals (i.e., giving each individual equal weight regardless of the number of tasks he or she faced).
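In code, this error measure reduces to a few lines (an illustrative sketch; the array names are hypothetical):

```python
import numpy as np

def mean_mse_across_individuals(y_hat, y, participant_ids):
    """Average the per-individual MSE, giving every individual equal weight."""
    mses = [np.mean((y_hat[participant_ids == i] - y[participant_ids == i]) ** 2)
            for i in np.unique(participant_ids)]
    return float(np.mean(mses))
```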
To extend the previous analysis, we performed an additional comparison aimed at assessing the prediction error models make for an unknown, out-of-sample individual ${i}_{new}$ . Here, models are not trained on any of the choice data produced by this individual. To create a model m’s prediction for ${i}_{new}$ in task t, we average the predictions the model makes for all other participants in the dataset who faced the same task, excluding the target individual ${i}_{new}$ . That is: ${\widehat{y}}_{i_{new},m,t} = \frac{1}{n-1}\sum \nolimits_{i\ne {i}_{new}}{\widehat{y}}_{i,m,t}$ , with n the number of participants in the dataset.Footnote 5 The MSE for each out-of-sample individual is then calculated as before, and we report the average of these MSEs.
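The corresponding leave-one-participant-out prediction is simply the average of the other participants' fitted-model predictions for the same task (again an illustrative sketch with hypothetical names):

```python
import numpy as np

def prediction_for_new_individual(y_hat, participant_ids, task_ids, i_new, t):
    """Prediction for individual i_new in task t: the mean prediction of the
    fitted model for all other participants who faced the same task."""
    mask = (task_ids == t) & (participant_ids != i_new)
    return float(np.mean(y_hat[mask]))
```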
To test whether the prediction errors of any two behavioral models differ significantly, we implemented (using packages lme4, Bates et al., Reference Bates, Mächler, Bolker and Walker2014, and lmerTest, Kuznetsova et al., Reference Kuznetsova, Brockhoff and Christensen2017, in R) a linear mixed-effects statistical model with a fixed effect for the behavioral model and random intercepts for participants and for cross-validation fold of a dataset. We use the Satterthwaite approximation (Satterthwaite, Reference Satterthwaite1946) to compute degrees of freedom.
3.4.2. Psychological mechanism classification
As part of their large-scale comparison of risky choice models, HAB classified the evaluated models according to whether or not they include each of nine different psychological mechanisms: payoff transformation, probability transformation, attention, sampling, regret, disappointment, ranking, threshold, and dispersion. We use HAB's classification for all models that they evaluated. We classify BEAST and AdaBEAST as involving both sampling and regret but none of the other mechanisms. The inclusion of sampling and regret follows from the description of the models provided above. Concerning the exclusion of other mechanisms, it may be argued that since the uniform sampling tool assumes outcomes are sampled uniformly, the models include a mechanism of probability transformation (e.g., to 0.5 in 2-outcome gambles). Indeed, this tool allows BEAST to capture behavior that appears as if small probabilities are overweighted nonlinearly. Yet, the essence of a nonlinear probability transformation mechanism, as understood in almost every case, is the consistent nonlinear treatment of described probabilities. In contrast, in BEAST and AdaBEAST, the uniform sampling tool, which may not even be used in the decision process, never even considers the objective probabilities, let alone transforms them (indeed, it operates identically even when probabilities are unknown to the agents). Hence, in our analysis, we do not consider these models as involving a nonlinear probability transformation mechanism.
4. Results
4.1. Replicating He, Analytis, & Bhatia (Reference He, Analytis and Bhatia2022)
To assess the effectiveness of BEAST, we first repeated HAB's primary analysis, comparing the predictive performance of the models on datasets containing only choice tasks in the gain domain separately from datasets containing tasks in the mixed domain (Figure 1). Our results indicate that relative to all other behavioral models, the original BEAST model (thin arrow) demonstrated decent predictive performance in the mixed domain (Figure 1a) but poor predictive performance in the gain domain (Figure 1b).
Analysis of the distribution of fitted parameters (see Figure S2 in SM) suggests that in the gain domain, the best fit of BEAST often reflects a maximal attempt to account for deviations from maximization (β ≈ 0), under the original constraint that the difference between EVs receives considerable weight (50% weight in the original BEAST). AdaBEAST relaxes this extreme constraint (as detailed in the Methods section) and improves the prediction accuracy. Linear mixed-effects modeling (see Table S3 in the SM) showed that the difference in MSEs between AdaBEAST and BEAST is significant in both the gain domain, $\beta = -0.0892$, ${t}_{(5{,}536)} = -37.6$, $p<.001$, and the mixed domain, $\beta = -0.0084$, ${t}_{(6{,}841)} = -5.478$, $p<.001$. This improvement can be attributed to the relaxation of the stringent constraints in the original BEAST, which provides AdaBEAST with greater adaptability to context. For example, the distributions of fitted parameters (Figure S3) for AdaBEAST show that the weights given to the different biased sampling tools are often very different from one another, an aspect that cannot be accommodated in the original BEAST.Footnote 6
Overall, AdaBEAST is ranked 5th amongst the behavioral models in the mixed domain and 10th amongst the models in the gain domain (Figure 1, thick arrow). Statistical comparison of AdaBEAST with the CPT variant that uses Prelec (Reference Prelec1998) functions, the best behavioral model in this analysis, shows that CPT predicts significantly better than AdaBEAST in the mixed domain, $\beta = 0.0054$, ${t}_{(6{,}842)} = 3.129$, $p = .002$, but the difference in the gain domain is only marginally significant, $\beta = 0.0029$, ${t}_{(5{,}532)} = 1.712$, $p = .087$.
4.2. Psychological mechanisms
A major focus of the analysis done by HAB was the exploration of the assumed psychological mechanisms embedded in successful predictive models. The assumption was that if an assumed mechanism consistently appears in the models that predict best, then it is likely an essential mechanism for useful predictions of choice behavior, possibly because it reflects an actual human choice mechanism. In their analysis, HAB observed a clear pattern concerning two specific psychological mechanisms: nonlinear payoff transformation and nonlinear probability transformation. Specifically, they found that all top-performing models integrate both payoff and probability transformations, while those at the lower end of the performance spectrum generally exclude them. This pattern is evident in Figure 1. These results might suggest that for a model to attain top-tier predictive performance of choice under risk, the integration of both of these mechanisms is crucial.
However, this clear pattern is challenged when the nonanalytic BEAST models are included in the comparative assessment. While these models do not assume nonlinear transformations of either payoff or probability, AdaBEAST exhibits strong performance and is ranked among the top models. Intriguingly, AdaBEAST stands as the only model to achieve such top performance without assuming these mechanisms, thereby putting in question the previous pattern that these transformations are indispensable for accurate prediction.
To gain further insight into the success of BEAST, we repeated this qualitative analysis, this time focusing on the psychological mechanisms BEAST incorporates (see Figure S3 in SM). Intriguingly, we found that BEAST and AdaBEAST are the only models among those examined that assume both a regret and a sampling mechanism, pointing to a possible reason that may help explain why AdaBEAST is the only model that performs well despite not assuming either payoff or probability transformations.
Specifically, models that include regret but not sampling normally go over all the possible states of the world, compute the expected regret in each state of the world, and then compute a weighted average of these. BEAST, in contrast, incorporates regret within a sampling framework. In each sample, a single state is “realized” and is incorporated into the decision. Hence, it is quite plausible that not all states of the world will be considered. Consequently, in models with regret but no sampling, it is possible that high regret in a single state of the world will result in an extreme prediction for choice: all decisions are influenced by all states of the world. In BEAST, behavior that does not consider even extreme regret in some states is plausible and expected.
4.3. Dataset-specific analysis
To discern the specific settings in which BEAST performs well or poorly, we analyzed the model’s effectiveness across various individual datasets (see Figure 2). Our findings showed that the poor performance of the original model in the gain domain was largely driven by datasets published by Stewart et al. (Reference Stewart, Reimers and Harris2015; Reference Stewart, Hermens and Matthews2016). The sets of choice tasks in these studies were specifically designed to elicit within-individual-context effects and aimed to demonstrate how careful task design can alter decision-making behavior in predicted ways. Conversely, most other datasets (e.g., Rieskamp, Reference Rieskamp2008) derive from studies that mostly incorporated choice tasks that were randomly selected from a large space of tasks, and arguably provide a less biased framework to evaluate models more broadly. Our examination yielded an intriguing revelation: AdaBEAST outperformed all 58 behavioral models in HAB’s comparison in the only two datasets (Erev et al., Reference Erev, Ert, Plonsky, Cohen and Cohen2017; Rieskamp, Reference Rieskamp2008) that exclusively involved randomly selected choice tasks.
Overall, the nonanalytic BEAST models seem to display a bimodal performance profile. They tend to falter in cases where choice tasks are systematically generated to elicit idiosyncratic context effects but excel in cases where choice tasks are sampled randomly to cover some large space. This trend underscores the models' potential aptitude for predicting decisions in broader settings.
4.4. Predicting out-of-sample individual decision makers
One of the goals in comparing the predictive performance of models is to enhance our ability to predict the behavior of people in the real world. For example, highly accurate predictive models of human choice can be used to simulate humans when training artificial agents that would later be deployed in real-world environments (e.g., Moisan & Gonzalez, Reference Moisan and Gonzalez2017; Raifer et al., Reference Raifer, Rotman, Apel, Tennenholtz and Reichart2022). However, in HAB's study, models were trained and tested on the same sample of participants (although they predicted behavior in choice tasks on which they were not trained). This raises the question of the generalizability of the results to new samples or the population at large. Hence, our subsequent analysis focuses on assessing the predictive power of these models for "unknown", out-of-sample individuals (see Methods).
Upon extending our analytical focus to assess predictive accuracy for out-of-sample individuals, the performance landscape showed notable changes (Figure 3; see also Table S4 in the SM). Our analysis yielded not only a remarkable enhancement for the original BEAST model but also a notable change in the rank of AdaBEAST. Specifically, BEAST moved up substantially in the ranking and, together with AdaBEAST, is among the top three models on HAB's data in the mixed domain (Figure 3a). A linear mixed-effects model with random factors for participants and cross-validation fold of the dataset shows that the performance of both AdaBEAST, $\beta = 0.0016$, ${t}_{(6{,}841)} = 1.147$, $p = .251$, and the original BEAST model, $\beta = 0.0014$, ${t}_{(6{,}841)} = 0.986$, $p = .324$, does not differ from that of the most accurate model in the mixed domain (CPT with Prelec functions). Similarly, the difference between AdaBEAST and the most predictive model in the gain domain under this analysis (Salience; Bordalo et al., Reference Bordalo, Gennaioli and Shleifer2012) was also not significant, $\beta = 0.0021$, ${t}_{(5{,}528)} = 1.455$, $p = .146$.
5. Discussion
The current research is motivated by inconsistent results observed in prior large-scale decision-making model comparisons and aims to understand the underlying reasons for these differences. While BEAST emerged as the leading predictive model in two CPCs (Erev et al., Reference Erev, Ert, Plonsky, Cohen and Cohen2017; Plonsky et al., Reference Plonsky, Apel, Ert, Tennenholtz, Bourgin, Peterson and Erev2024), the recent analysis by HAB identified a variant of CPT as superior. Our research examines several key methodological differences between these studies that may explain the divergent results. First, in the CPCs, competition participants could submit any model they wished, whereas in HAB the set of competitor models was limited to those adhering to specific criteria set by the authors, which incidentally excluded nonanalytic models. Second, the CPCs featured a broader range of choice tasks, including decisions involving multi-outcome gambles, decisions under ambiguity, and decisions from experience, whereas HAB focused exclusively on decisions under risk with gambles of up to two outcomes. Third, models in the CPCs were evaluated based on their predictive accuracy in choice tasks randomly sampled from a large space, whereas in HAB models were evaluated on mostly systematically hand-crafted choice tasks. Lastly, models in the CPCs were required to predict the choice rates of new samples of decision makers, whereas, in HAB, models were required to predict the behavior of individuals already "familiar" to them from the training data. Here, through multiple analyses, we aim to deepen our understanding of the usefulness and predictive efficacy of different models of decisions under risk and offer insights for future studies.
While the CPCs did not apply strict criteria for participation, welcoming all models (including CPT), HAB chose to exclude nonanalytic models because of their complex fitting process. This is an understandable and legitimate criterion since the computational and time resources required for fitting nonanalytic models can be very demanding. However, the exclusion of nonanalytic models may hinder our understanding and can lead to overstated conclusions about the psychological mechanisms that help predict decision-making behavior. In their paper, HAB mark nonlinear payoff and probability transformations as essential mechanisms for predictive performance. For example, they write: "Subjective payoff and subjective probability transformation mechanisms stood out as key mechanisms for improving predictive performance in risky choice." (p. 3656). Yet, our analysis demonstrates that accurate predictions of risky choices are possible without relying on nonlinear payoff and probability transformations. BEAST, and even more so AdaBEAST, predict choice well despite not including these transformations, relying instead on mechanisms like sampling and regret.
The success of our models that combine sampling and regret raises the question of how these models differ from other models that involve either of these mechanisms. According to HAB's classification, none of the other evaluated models includes both sampling and regret, making BEAST's unique combination of these mechanisms potentially key to its strong performance. Specifically, unlike some other sampling models (e.g., PRT, Viscusi, Reference Viscusi1989), BEAST assumes the mental sampling process is a property of a choice task, rather than of the choice options, allowing for context-dependent decisions. Unlike other models that include regret (e.g., SEP, Mellers et al., Reference Mellers, Schwartz and Ritov1999), BEAST's context dependence is also influenced by the sampled state of the world. Most regret-based models assume decisions are influenced by all possible states, which may lead to extreme predictions due to high regret in a single state. BEAST, however, mentally "realizes" only one state per sample, and as a result, behavior that overlooks even extreme regret in some states is plausible and expected.
When considering datasets primarily comprising randomly selected choice tasks, rather than choice tasks manually crafted by the researchers, the implications of our study become increasingly relevant. While most of the datasets in our analysis included only systematically crafted choice tasks, others also incorporated randomly generated tasks. Analysis of specific datasets revealed that while BEAST tends to perform poorly in contrived, context-specific scenarios, it has marked proficiency in randomly sampled environments; indeed, its adaptable version outperforms all other models in the only two datasets that contain strictly randomly sampled tasks. This divergence in performance foregrounds BEAST’s enhanced aptitude for predictive fidelity in settings that potentially contain a broader coverage of the spectrum of possible tasks. The success of our models under such conditions aligns with the original intent for which BEAST was developed. The model was specifically designed to capture a broad spectrum of phenomena in human choice behavior and predict human decision-making in wide sets of environments (Erev et al., Reference Erev, Ert, Plonsky, Cohen and Cohen2017).
Despite the success of the nonanalytic models in specific datasets with exclusively randomly sampled choice tasks, our analysis still showed that in the original analysis as conducted by HAB, which combines all datasets, CPT performs better. One possible reason for this gap may be the fact that CPT is a highly flexible model that, when fitted on some choice data by an individual in some context, can capture many of the idiosyncratic individual-context interactions well. Indeed, in a recent study, Fudenberg et al. (Reference Fudenberg, Gao and Liang2023) asserted that CPT is so flexible that it "would have performed well out-of-sample given sufficient data from almost any underlying data-generating process that respects first-order stochastic dominance" (Fudenberg et al., Reference Fudenberg, Gao and Liang2023, p. 21). In contrast, BEAST was designed with the intent to predict behaviors of new unknown individuals and is congruently far less flexible. To deal with BEAST's rigidity issues, we developed AdaBEAST, which allows for increased adaptation across contexts without fundamentally changing the underlying theory and its main assumptions. Indeed, the results showed that AdaBEAST performs better than BEAST in both the mixed and gain domains; in the gain domain, AdaBEAST's predictive power was statistically indistinguishable from that of CPT. However, the results may also suggest that AdaBEAST remains less flexible than models like CPT and therefore does not predict the choices of familiar individuals as well.
Accurate prediction models of human choice can be highly useful in many practical applications. In some cases, like when gauging a patient's adherence to treatment, developing a personalized prediction model for a specific individual is the appropriate approach. But in many other cases, prediction models are most useful when they can predict the behavior of unknown decision makers. This is particularly true for policy decisions, where broad population-level insights are required, for example when assessing the population response to a planned sugary drink tax or to pricing strategies in public transportation. In light of this, in our work, we found it beneficial to also assess the predictive capabilities and generalizability of the models presented by HAB, as well as BEAST and AdaBEAST, when applied to new unknown individuals. Indeed, we found that under this analysis the remaining gap between AdaBEAST and CPT (or Salience) is eliminated. These findings accentuate the importance of BEAST's foundational design and its robustness across varied participants.
To predict unknown individuals, we essentially create a prediction for an average new person drawn from the population implied by the observed sample. It may thus be argued that the relative additional success of our nonanalytic models in this task could reflect the lower sensitivity of the models to individual differences. It is probably true that when a model’s predictions in a task are not very sensitive to the values of the individual-level parameters its prediction for an average person in the population (based on a sample) would be less noisy and thus better when applied to new individuals. Yet, note that a relatively low (vs. high) sensitivity of the predictions to the choice of parameters is not necessarily evidence against the model’s validity. That a model is highly flexible and allows any behavior given different parameters (like CPT; Fudenberg et al., Reference Fudenberg, Gao and Liang2023) does not mean that it is necessarily a “proper” model.
The aforementioned analysis compared the prediction capabilities of the models for new decision-makers in new choice tasks, but it still involved the prediction of behavior in the same experiment and context. That is, to the extent that choice behavior in one task might be influenced by the other tasks that people face in the same experiment (e.g., Ert & Erev, Reference Ert and Erev2013; Schneider et al., Reference Schneider, Kauffman and Ranieri2016; Stewart et al., Reference Stewart, Reimers and Harris2015), training data of models in this analysis involves access to contextual features that may impact behavior within an experiment but will be irrelevant outside of it. Going forward, aligning with methodologies akin to the CPCs, it would be insightful to train models on datasets of specific experiments and participant groups, followed by prediction for entirely new participant groups in new experiments. Such an approach would further our understanding of model generalizability, bringing both theoretical clarity and practical applicability to the fore.
The present study underscores the importance of considering a range of models, including those that may be perceived as more difficult to estimate, as it can add valuable insights into the underlying mechanisms of human behavior. The relative success of BEAST challenges the research community to venture beyond traditional strategies when building models to achieve even better results.
Supplementary material
The supplementary material for this article can be found at http://doi.org/10.1017/jdm.2024.34.
Data availability statement
The data used in this study is publicly available from the OSF website: https://osf.io/ca6bn/.
Funding statement
Small parts of this article were previously reported in Agassi & Plonsky (Reference Agassi and Plonsky2023). Ori Plonsky acknowledges support from the Israel Science Foundation (grant no. 2390/22).