
Muddling-Through and Deep Learning for Managing Large-Scale Uncertain Risks

Published online by Cambridge University Press:  02 September 2019

Tony Cox*
Affiliation:
Cox Associates and University of Colorado, 503 N. Franklin Street, Denver, CO 80218, USA, e-mail: [email protected]

Abstract

Managing large-scale, geographically distributed, and long-term risks arising from diverse underlying causes – ranging from poverty to underinvestment in protecting against natural hazards or failures of sociotechnical, economic, and financial systems – poses formidable challenges for any theory of effective social decision-making. Participants may have different and rapidly evolving local information and goals, perceive different opportunities and urgencies for actions, and be differently aware of how their actions affect each other through side effects and externalities. Six decades ago, political economist Charles Lindblom viewed “rational-comprehensive decision-making” as utterly impracticable for such realistically complex situations. Instead, he advocated incremental learning and improvement, or “muddling through,” as both a positive and a normative theory of bureaucratic decision-making when costs and benefits are highly uncertain. But sparse, delayed, uncertain, and incomplete feedback undermines the effectiveness of collective learning while muddling through, even if all participant incentives are aligned; it is no panacea. We consider how recent insights from machine learning – especially, deep multiagent reinforcement learning – formalize aspects of muddling through and suggest principles for improving human organizational decision-making. Deep learning principles adapted for human use can not only help participants in different levels of government or control hierarchies manage some large-scale distributed risks, but also show how rational-comprehensive decision analysis and incremental learning and improvement can be reconciled and synthesized.

Type
Symposium on Analysis for Uncertain Futures
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© Society for Benefit-Cost Analysis, 2019

1 Introduction

Traditional benefit-cost analysis and decision analysis typically involve multiple steps such as the following (Raiffa, Reference Raiffa1968; Clemen & Reilly, Reference Clemen Robert and Reilly2014; Howard & Abbas, Reference Howard and Abbas2016):

  1. Identify alternative feasible choices, decision rules, or courses of action. This “choice set,” or set of decision alternatives, may be specified explicitly as a discrete set of alternatives, such as whether or not to fund a public project, or implicitly via constraints on the allowed values of decision variables, such as quantities of limited resources available to be allocated.

  2. Identify preferences and value trade-offs for possible outcomes. These may be formally represented via a net benefit function or via a (possibly multiattribute) von Neumann-Morgenstern utility function or social utility function to be maximized (Keeney & Raiffa, Reference Keeney and Raiffa1976).

  3. If the outcomes for each choice are uncertain, estimate the probabilities of different outcomes for each choice (e.g., its risk profile); and

  4. Optimize choices subject to feasibility constraints (e.g., on available time, budget, or limited resources) to identify and recommend a feasible choice that maximizes expected net benefit, expected utility, or expected social utility of outcomes.

These steps are all well-established parts of prescriptive decision analysis for a single decision-maker and benefit-cost analysis for a social decision-maker (Howard & Abbas, Reference Howard and Abbas2016; Raiffa, Reference Raiffa1968).

In 1959, political economist Charles Lindblom of Yale University pointed out that almost none of these steps can be applied in practice to the decisions and uncertainties faced by real government decision-makers, or by decision-makers in other bureaucracies. Preferences and value trade-offs may be unknown and difficult or impossible to articulate, quantify, and justify. Lindblom (Reference Lindblom1959) wrote, “Typically the administrator chooses – and must choose – directly among policies in which [different] values are combined in different ways. He cannot first clarify his values and then choose among policies,” as multiattribute utility theory prescribes. Even identifying possible outcomes for each feasible choice may be impracticable if the number of possible choices is immense or possible outcomes are unknown. In addition, real-world bureaucratic and organizational decisions are almost never made by a single decision-maker. Rather than seeking to extend or refine normative decision analysis to overcome what he perceived as its fatal practical limitations for large-scale, multiperson organizational decision-making over time, Lindblom instead described a method of successive limited comparisons that he contrasts with the “rational-comprehensive” normative approach favored in benefit-cost analysis, decision analysis, operations research, and optimal control engineering. The rational-comprehensive approach seeks to solve decision optimization problems such as

(1) $$ \max_{a\in A}R(a) $$

where

  • a is a decision variable or policy (e.g., a vector or a time series of decision variables, or a feedback control decision rule mapping observations to actions)

  • A is the set of feasible alternative decisions (the “choice set”)

  • R(a) is the reward (expected utility or net benefit) from choosing a. In many traditional economic, policy, and operations research analyses, the reward function to be maximized is assumed to be known. In statistical design of experiments and machine learning, it may have to be discovered. If the reward received depends both on the decision-maker’s choice a and also on other variables not controlled by the decision-maker, collectively referred to as the state and modeled as a random variable s, then R(a) is the expected reward from choosing a given the probability distribution of s. When there are many players, R is often taken to be a weighted sum of individual utility functions (Gilboa et al., Reference Gilboa, Samet and Schmeidler2004).

  • max_{a∈A} indicates that an act a in A is to be selected to maximize R(a). (A small numerical sketch of this calculation follows this list.)
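
To make the notation in equation (1) concrete, the following minimal Python sketch works through a hypothetical two-act, two-state choice problem (all payoffs and probabilities are invented for illustration). It enumerates a small discrete choice set A and selects the act that maximizes expected reward R(a) under a known distribution for the uncontrolled state s.

```python
import numpy as np

# Minimal sketch of equation (1): choose the act in a small discrete choice set A
# that maximizes expected reward, given a known probability distribution over an
# uncontrolled state s. Acts, states, rewards, and probabilities are hypothetical.
acts = ["fund_project", "do_nothing"]            # choice set A
p_state = np.array([0.6, 0.4])                   # P(s = low_demand), P(s = high_demand)
r = np.array([[-2.0, 10.0],                      # rewards r(a, s) for fund_project
              [ 0.0,  0.0]])                     # rewards r(a, s) for do_nothing

expected_reward = r @ p_state                    # R(a) = sum_s P(s) * r(a, s)
best = int(np.argmax(expected_reward))           # argmax over a in A of R(a)
print("Expected rewards:", dict(zip(acts, expected_reward.round(2).tolist())))
print("Recommended act:", acts[best])
```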

Lindblom wrote that “the attention given to, and successes enjoyed by operations research, statistical decision theory, and systems analysis” have strengthened a “tendency to describe policy formulation even for complex problems as though it followed [this] approach,” emphasizing “clarity of objective, explicitness of evaluation, a high degree of comprehensiveness of overview, and, wherever possible, quantification of values for mathematical analysis. But these advanced procedures remain largely the appropriate techniques of relatively small-scale problem-solving where the total number of variables to be considered is small and value problems restricted.”

In contrast, for large-scale real-world decision problems faced by most bureaucracies, Lindblom considers the rational-comprehensive approach in equation (1) to be impracticable because the net benefit or reward function R is not known or agreed to; choice set A may be too large to enumerate or search effectively, or unknown and costly to develop; and often no single centralized authority is capable of, authorized to, or accountable for identifying and implementing the best choice in A. Instead of clarifying values and objectives in advance, goals and actions to achieve them are selected together as opportunities arise. The test of a “good” policy is not that it is the best means to desired ends, or that it maximizes some measure of expected net benefit, utility, or collective welfare, but that people will agree to it (possibly for different, and perhaps conflicting, private reasons). Important possible outcomes, feasible alternative policies, and affected values and trade-offs are neglected in favor of relatively simple comparisons between the current policy and a proposed incremental modification of it. A succession of such modifications may, if all goes well, produce gradually improving policies; this is the process that Lindblom refers to as successive limited comparisons, or, more colloquially, as muddling through. He states that “Making policy is at best a very rough process. Neither social scientists, nor politicians, nor public administrators yet know enough about the social world to avoid repeated error in predicting the consequences of policy moves. A wise policy maker consequently expects that his policies will achieve only part of what he hopes and at the same time will produce unanticipated consequences that he would have preferred to avoid. If he proceeds through a succession of incremental changes, he avoids serious lasting mistakes in several ways” including learning from experience and being able to correct missteps fairly quickly. Of course, this view is optimistic if a single misstep could lead to disaster, ruin, or the destruction of the decision-making organizations, but Lindblom does not dwell on these grim possibilities. To model and evaluate the muddling through approach more formally, however, we will have to consider possibilities for safe learning, i.e., surviving and avoiding disastrous decisions during learning (Garcia & Fernandez, Reference Garcia and Fernandez2015). Lindblom proposes muddling through not only as a descriptive theory of bureaucratic decision-making, but also as a normative one: “Why then bother to describe the method in all of the above detail? Because it is in fact a common method of policy formulation and is, for complex problems, the principal reliance of administrators as well as of other policy analysts. And because it will be superior to any other decision-making method available for complex problems in many circumstances, certainly superior to a futile attempt at superhuman comprehensiveness.” In short, muddling through by successive incremental adjustments of policy is proposed as both more desirable and more widely practiced than the rational-comprehensive approach.

Since Lindblom’s essay, revolutions have occurred in computer science, game theory, collective choice theory, automated and adaptive control, artificial intelligence, robust optimization and risk analysis, machine learning, computational statistics and data science, and the intersection of these fields with political economy, law-and-economics, and management science. It is timely to reexamine the extent to which Lindblom’s critique of rational-comprehensive techniques for risk management decision support still applies; the extent to which the ferment of ideas and technical developments in artificial intelligence and other fields dealing with multiagent control has overcome his objections; how both the strengths and the limitations of muddling through can be understood better, and the technique applied more successfully, in light of progress since 1959; and whether there are circumstances in which muddling through provides a viable alternative or complement to decision analysis. The following sections undertake such a reexamination.

2 Developments in rational-comprehensive models of decision-making

An individual, team, organization, or artificial intelligence that repeatedly makes decisions to achieve some overall purposes or goals must repeatedly decide what to do next – e.g., what subgoals or tasks to undertake next – and how to do it, e.g., which agents should do what, and how much planning should be local and autonomous instead of centralized or hierarchical. In teams with no central coordinator, such as robot soccer teams of cooperating autonomous agents, cooperating swarms of drones, or search-and-rescue teams with autonomous agents and limited communication, the agents may have to infer and adapt to each other’s plans on the fly as they observe each other’s behaviors and messages (Hunt et al., Reference Hunt, Meng, Hinde and Huang2014; Zhao et al., Reference Zhao, Meng and Chung2016). In bureaucracies or other organizations where policies are formulated and adapted via muddling through, success or failure in achieving stated goals may depend on who may propose what, when, and how decisions are made about which proposals to adopt, and how these changes and their consequences are linked to incentives and rewards for those participating in policy-making and administration.

In the face of such complexities, the simple prescriptive model of optimization-based rational-comprehensive decision-making in (1) has been generalized and extended in the following ways.

  • Noncooperative game theory (Luce & Raiffa, Reference Luce and Raiffa1957) replaces the reward function R(a) in (1) with a set of reward functions (also called “payoff functions”), one for each participant (called a “player” or “agent”). Each player has its own choice set of feasible alternatives to choose among, often called strategies in game theory, or policies in decision analysis, machine learning, and artificial intelligence. Player i now seeks to choose a_i from A_i to maximize R_i(a_i, a_i′), where a_i denotes the strategy selected from A_i by player i; a_i′ denotes the strategies selected by the other players; and R_i(a_i, a_i′) is the reward to player i from choosing strategy a_i when the other players choose a_i′. There is no single net benefit, social welfare, or public interest to be maximized. Rather, each player seeks to act to maximize its own reward, given the actions of the rest. A Nash equilibrium is a set of choices such that no player can improve its own reward by unilaterally modifying its own choice, given the choices of the other players. Each player’s choice is a best response to the choices of the rest. A set of choices by the players is Pareto-efficient if no other set of choices would give all players equal or greater rewards, and at least one of them a strictly greater reward. In practical applications such as deciding how to manage air pollution, antibiotic resistance, or climate change, a common challenge is that each player benefits if everyone else exercises restraint to avoid making the current problem worse, but each player also maximizes its own benefits by being unrestrained itself, whatever the other players are doing. In such cases, the unique Nash equilibrium is that no one exercises self-restraint, even though all would gain if all would do so; hence, it is not Pareto-efficient (a small worked example of this dilemma follows this list). A variety of “folk theorems” of game theory prove that both Pareto efficiency and multiperiod versions of Nash equilibrium can be achieved if players are sufficiently patient (i.e., they do not discount delayed rewards too steeply) in repeated games with discounted rewards and uncertain time horizons, where the players have a chance to observe each other’s behaviors and make choices repeatedly over time. The trick is to have players make choices that punish those who do not cooperate in sustaining a Pareto-efficient outcome (Fudenberg & Maskin, Reference Fudenberg and Maskin1986; Fudenberg et al., Reference Fudenberg, Levine and Maskin1994; Hörner & Olszewski, Reference Hörner and Olszewski2006).

  • Cooperative game theory further generalizes the multiplayer choice problem by allowing players to form coalitions and to bargain or negotiate with each other. For example, in the treaty participation game model of international cooperation (or lack of it) to limit emissions in hopes of limiting undesired climate change, a coalition of signatories might choose emissions levels to maximize their collective benefits, while nonsignatories choose emission levels to maximize their individual benefits (Barrett, Reference Barrett2013). The final levels of cooperation and emissions achieved in multistage games of coalition formation and decision-making about emissions depend on factors such as whether coalitions, once formed, are exclusive; whether players (e.g., countries) can make and enforce conditional agreements such as that some will reduce their emissions more if and only if others do; whether binding commitments can be made and enforced; how steeply participants discount future rewards and penalties compared to current ones; and whether the timing of catastrophic consequences from failure to muster sufficient cooperation is known or uncertain (Heitzig et al., Reference Heitzig, Lessmann and Zou2011; Wood, Reference Wood2011; Barrett, Reference Barrett2013).

  • Team theory (Marschak & Radner, Reference Marschak and Radner1972) focuses on design of costly communication and agent decision rules (and, in some versions, on allocation of limited resources among the agents) for the special case of cooperating agents in an organization where all of the agents have identical preferences and goals. That is, they all seek to maximize the same reward function of their joint choices, but local observations, actions, and communications are costly. Team theory has been applied to distributed control of systems by agents with sensors and actuators at different locations, as well as to organizational design, design of compensation systems, and dynamic allocation of tasks, roles, and responsibilities within teams of cooperating agents.

  • Mechanism design: Institutions, social and moral norms, legal constraints and liabilities, regulations and their enforcement, wages and contractual incentives, outcome-sharing rules in principal-agent relationships and investment syndicates, and reputations in repeated transactions and long-term relationships all help to shape the rewards (positive or negative) and feedback that players receive for their choices and behaviors. Game theory studies how agents make choices in response to incentives. Mechanism design theory (Nisan, Reference Nisan, Nisan, Roughgarden, Tardos and Vazirani2007) studies the inverse problem of how to design incentives, or the rules determining rewards in the games in which agents participate, to elicit choices that satisfy desired properties. These may include Pareto efficiency, self-enforcing stability (e.g., Nash equilibrium and its multiperiod extensions), implementability using information that can actually be obtained and incentives (e.g., payments) that can actually be provided, and voluntary participation. Although important impossibility theorems show that successful mechanism design satisfying most or all of these properties is impossible if preferences are arbitrary, many positive results are available when preferences satisfy restrictions (e.g., risk neutrality and “quasi-linear preferences” with utility linear in money) commonly assumed in traditional benefit-cost analyses.

  • Organizational design and law-and-economics: Within bureaucracies and other hierarchical organizations (e.g., principal-agent relationships), as well as in the more specialized contexts of designing contracts and auctions, mechanism design can be applied to design incentive systems to promote revelation of local information, elicit desired behaviors despite private information, and optimize delegation and trade-offs between centralization and decentralization, taking into account costs of communication, monitoring, and control and inefficiencies due to remaining private information (Mookherjee, Reference Mookherjee2006). As a prominent application of the mechanism design perspective, the modern theory of law and economics (Miceli, Reference Miceli2017) explains how systems of laws establishing tort liability rules for hazardous activities, remedies for breach of contracts, property rights to internalize externalities, product liability and implicit warranty principles, and so forth can be designed to maximize the expected net economic benefit from voluntary transactions, usually assuming risk-neutral participants with quasi-linear preferences. Practical designs that explain many aspects of observed legal practice account for market imperfections such as private and asymmetric information (e.g., a consumer may not know how much care a manufacturer has taken to keep a product safe, or the manufacturer may not know how much care the consumer will exercise in using the product safely), costs of litigation, misperceptions of risk by buyers, and incentives for socially valuable research and disclosure of information by sellers.
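
The social dilemma sketched in the noncooperative game theory bullet above can be made concrete with a small worked example. The sketch below uses hypothetical payoffs for a two-player “restrain versus do not restrain” emissions game and checks every strategy profile for the Nash and Pareto-efficiency properties just defined; in this game the unique Nash equilibrium is mutual non-restraint, which is not Pareto-efficient.

```python
from itertools import product

# Hypothetical two-player "restraint game" payoffs (player_1, player_2):
# each player prefers not to restrain itself whatever the other does,
# but mutual restraint beats mutual non-restraint for both.
payoffs = {
    ("restrain", "restrain"):         (3, 3),
    ("restrain", "no_restraint"):     (0, 4),
    ("no_restraint", "restrain"):     (4, 0),
    ("no_restraint", "no_restraint"): (1, 1),
}
strategies = ["restrain", "no_restraint"]

def is_nash(profile):
    """No player can gain by unilaterally deviating from this profile."""
    for i in range(2):
        for deviation in strategies:
            deviated = list(profile)
            deviated[i] = deviation
            if payoffs[tuple(deviated)][i] > payoffs[profile][i]:
                return False
    return True

def is_pareto_efficient(profile):
    """No other profile gives everyone at least as much and someone strictly more."""
    u = payoffs[profile]
    for other, v in payoffs.items():
        if other != profile and all(v[i] >= u[i] for i in range(2)) \
                and any(v[i] > u[i] for i in range(2)):
            return False
    return True

for profile in product(strategies, repeat=2):
    print(profile, payoffs[profile],
          "Nash" if is_nash(profile) else "",
          "Pareto-efficient" if is_pareto_efficient(profile) else "")
# Only (no_restraint, no_restraint) is a Nash equilibrium, and it is not Pareto-efficient.
```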

3 Modern algorithms for single- and multiagent decision-making

The intersection of computer science with decision models and algorithms has tremendously advanced the design and practical application of algorithms for solving large-scale single-person and team decision optimization problems, as well as games and collective choice problems, in recent decades. Current state-of-the-art algorithms are briefly described next.

  • Monte Carlo Tree Search (MCTS). Decision trees and game trees showing possible sequences of actions (choice nodes) and uncertainty resolutions (chance nodes, with probabilities for each branch) leading to rewards (utilities) at the ends (leaf nodes) of the tree are perhaps the best known rational-comprehensive models of normative decision analysis for small problems (Raiffa, Reference Raiffa1968; Luce & Raiffa, Reference Luce and Raiffa1957). For large problems, recent MCTS algorithms (Munos, Reference Munos2014; Silver et al., Reference Silver, Huang, Maddison, Guez, Sifre, Driessche and Schrittwieser2016, Reference Silver, Hubert, Schrittwieser, Antonoglou, Lai, Guez and Lanctot2018) sample possible future paths and rewards to avoid enumerating all possibilities. This decouples “rational” decision-making, based on optimizing current decisions using predicted future reward probabilities, from “comprehensive” modeling of the causal relationship between choices and reward probabilities, by selecting only the most promising choice nodes in a tree for further simulation and evaluation. MCTS can be combined with reinforcement learning (RL) techniques discussed next (Vodopivec et al., Reference Vodopivec, Samothrakis and Ster2017) and applied to more general settings, such as those in which it is costly to observe the reward (Schulze & Evans, Reference Schulze and Evans2018), as is the case for many social policy interventions.

  • Reinforcement learning (RL) of high-reward policies through trial-and-error learning (Sutton & Barto, Reference Sutton and Barto1998, Reference Sutton and Barto2018). Decision-makers (agents) often initially do not know how their choices affect reward probabilities, or expected benefits, but must discover the immediate and longer-term costs and benefits of alternative policies or choices from experience. Denote the true expected value of starting in state s and acting optimally thereafter by an (initially unknown) value function V(s), and let Q(a, s) denote an estimate of the value from taking each feasible action a when in each state s and then acting optimally (e.g., to maximize the discounted sum of future rewards) ever after. The initial estimates of these values may be random guesses, but they are updated in light of experience by adjusting current estimates by an amount proportional to the difference between expected and experienced rewards. The constant of proportionality is interpreted as the learning rate. For example, Q-learning uses the current estimate Q(a, s) to select which action to take next in the current state s. Then the resulting reward is used to update the estimate of Q(a, s) based on the difference between estimated and observed rewards (an executable sketch of this update appears after this list). In many settings, estimated Q(a, s) values converge and the policy of selecting a to maximize Q(a, s) is then the optimal policy, while the estimated value of Q(a, s) when that policy is used is the true value function, V(s). This procedure is similar to value iteration in classical stochastic dynamic programming, but without the requirement that the reward function and state transition probabilities be initially known. It converges to yield optimal policies under certain conditions for Markov decision processes (MDPs), in which the actions taken affect next-state probabilities as well as probability distributions of current rewards (Krishnamurthy, Reference Krishnamurthy2015). The main conditions are that learning rates be kept small enough and that the MDPs are ergodic, involving no irreversible choices or fatal outcomes that would limit or prevent future exploration and adaptation (Bloembergen et al., Reference Bloembergen, Tuyls, Hennes and Kaisers2015; Krishnamurthy, Reference Krishnamurthy2015; Xu et al., Reference Xu, Qin, Wang and Liu2017).

  • RL using policy gradient algorithms. RL can also be based on algorithms that emphasize adjusting policies directly rather than estimating values for different actions as in benefit-cost analysis. As usual, a policy in RL is a decision rule mapping observations (e.g., the current state) to actions. In most RL algorithms, however, this mapping is randomized: thus, an RL policy specifies the probability of taking each feasible action when in each state (or, more generally, given current information, which may include imperfect observations of the current state). Policies are updated to favor selecting actions with higher expected values. The tension between exploring further in hopes of finding a more valuable policy and exploiting what has been learned so far by selecting the actions with the highest expected values is managed carefully by choosing action-selection probabilities to avoid premature convergence to suboptimal policies. For example, a simple and effective policy in many settings is to select each action with a probability equal to the currently estimated probability that it is the best (value-maximizing) action; this is called Thompson sampling (Schulze & Evans, Reference Schulze and Evans2018). Such randomized sampling schemes prevent jumping to possibly erroneous conclusions about what works best in clinical trials and similar sequential decision optimization settings (Villar et al., Reference Villar, Bowden and Wason2015). Adjustments of policies continue until expected and experienced average rewards no longer differ. For large classes of adaptive decision problems under uncertainty, the policies arrived at by such successive incremental adjustments are the optimal policies that would be obtained by classical operations research methods (Bloembergen et al., Reference Bloembergen, Tuyls, Hennes and Kaisers2015; Krishnamurthy, Reference Krishnamurthy2015; Xu et al., Reference Xu, Qin, Wang and Liu2017). Table 1 lists important refinements and enhancements used in practice to make RL quicker and more robust to data limitations. Table 2 summarizes methods for safe learning that have proved effective in applications ranging from learning to control helicopters and quadcopters (e.g., allowing them to hover or navigate safely in cluttered environments) to learning to manage power grids and other networked infrastructures, without risking costly accidents and failures during learning. Table 3 summarizes variations and extensions of multiagent reinforcement learning (MARL) in which multiple agents act, learn, and perhaps communicate about how to control a system or accomplish a task. MARL can greatly increase the speed of learning and average rewards generated per unit time, under certain conditions (Omidshafiei et al., Reference Omidshafiei, Pazis, Amato, How and Vian2017; Gupta et al., Reference Gupta, Egorov and Kochenderfer2017).
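
The Q-learning update described above can be written in a few lines. The following is a minimal tabular sketch on a small hypothetical two-state, two-action MDP; the transition probabilities, rewards, learning rate, and exploration scheme are all invented for illustration, and practical applications typically replace the table with a neural-network approximation of Q.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-action MDP: states 0 ("degraded") and 1 ("healthy"),
# actions 0 ("business as usual") and 1 ("invest in protection").
# P[s, a] = probability of moving to state 1 next; R[s, a] = expected immediate reward.
P = np.array([[0.1, 0.6],
              [0.7, 0.95]])
R = np.array([[0.0, -0.5],
              [1.0,  0.8]])

n_states, n_actions = R.shape
gamma, alpha, epsilon = 0.95, 0.1, 0.1   # discount factor, learning rate, exploration rate
Q = np.zeros((n_states, n_actions))      # initial value estimates (here: zero guesses)

s = 0
for step in range(50_000):
    # Epsilon-greedy selection: mostly exploit current estimates, occasionally explore.
    a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
    # Simulated environment step drawn from the hypothetical dynamics above.
    reward = R[s, a] + 0.1 * rng.standard_normal()
    s_next = 1 if rng.random() < P[s, a] else 0
    # Q-learning update: move the estimate toward (reward + discounted best next value).
    td_error = reward + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * td_error
    s = s_next

print(np.round(Q, 2))
print("Greedy policy (action per state):", np.argmax(Q, axis=1))
```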

Table 1 Some enhancements to reinforcement learning (RL) algorithms.

Table 2 Some principles for safe learning, i.e., learning without risking catastrophic failures.

Table 3 Some MARL variations and extensions.

MARL algorithms and architectures that incorporate MCTS and enhancements to speed up convergence, safe learning, and communication and control hierarchies represent the current state of the art in machine learning models and methods for solving large-scale and distributed decision and control problems under uncertainty, including problems with sparse and delayed feedback. Although most MARL algorithms are designed for cooperating agents, Bowling and Veloso (Reference Bowling and Veloso2001) showed that convergence to Nash equilibria can also be achieved in a variety of noncooperative Markov games (generalizations of MDPs to multiple agents) if each agent uses RL but manages its learning rate to take large steps when the agent’s experienced rewards are less than expected (“learn fast when losing”) and small steps otherwise (when it is “winning” by receiving higher than expected rewards). The resulting WoLF (“win or learn fast”) principle has been incorporated into many subsequent MARL algorithms for cooperative learning. It gives agents who are lagging in learning to contribute to the team’s success time to catch up, while agents who are ahead of the rest continue to explore relatively cautiously (via relatively small incremental adjustment steps) for even better policies. In practice, MARL algorithms have been applied successfully to obtain high-reward policies for difficult distributed decision and control problems such as job shop scheduling among multiple agents (Gabel & Riedmiller, Reference Gabel and Riedmiller2007); coordination of military force attacks in increasingly large-scale and realistic war game simulations (e.g., StarCraft battles) (Usunier et al., Reference Usunier, Synnaeve, Lin and Chintala2016); and self-organizing control of swarms of drones to perform missions or to cooperate in choosing locations to obtain full visual coverage of a complex and initially unknown environment (Pham et al., Reference Pham, La, Feil-Seifer and Nefian2018). Safe MARL (SMARL) and Hierarchical MARL (HMARL) algorithms have demonstrated promising performance in controlling autonomous vehicles (Shalev-Shwartz et al., Reference Shalev-Shwartz, Shammah and Shashua2016) and teams of robots performing challenging tasks such as urban search and rescue in complex and uncertain environments (Cai et al., Reference Cai, Yang and Xu2013), respectively. Such results suggest the potential for MARL principles and their extensions to contribute to improved control of complex distributed systems in important practical business, military, and industrial engineering applications.
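
The WoLF idea just described can be expressed as a simple learning-rate schedule. The sketch below is illustrative only: it uses an exponential moving average of past rewards as the agent’s “expectation” and two arbitrary step-size constants, rather than the exact WoLF policy-hill-climbing algorithm of Bowling and Veloso.

```python
def wolf_step_size(recent_reward, expected_reward,
                   delta_losing=0.10, delta_winning=0.02):
    """Illustrative WoLF-style schedule: take large policy-adjustment steps when
    doing worse than expected ("learn fast when losing"), small steps otherwise."""
    return delta_losing if recent_reward < expected_reward else delta_winning

# Toy usage: the expectation is tracked as an exponential moving average of past rewards.
expected, smoothing = 0.0, 0.05
for recent in [0.2, -0.1, 0.4, 0.3, -0.5]:          # hypothetical reward stream
    step = wolf_step_size(recent, expected)
    print(f"reward={recent:+.1f}  expectation={expected:+.2f}  policy step size={step}")
    expected += smoothing * (recent - expected)      # update the expectation itself
```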

4 Discussion: implications of advances in rational-comprehensive decision theory for muddling through

A key insight from machine learning is that policy gradient algorithms and other RL and MARL techniques that take successive incremental steps guided by experience – and in this sense muddle through – end up solving dynamic optimization problems. This finding addresses the “rational” component of Lindblom’s critique by showing that muddling through and optimization are not opposed: muddling through provides one way to solve optimization problems. Likewise, RL’s ability to solve adaptive dynamic optimization problems without requiring initial knowledge of the optimization problems being solved – specifically of how different choices affect reward probabilities and next-state transition probabilities in dynamic systems or environments – renders the “comprehensive” knowledge requirement no longer necessary. Sampling-based approximate optimization algorithms such as MCTS further reduce the need for a comprehensive examination and evaluation of decision options. In short, rather than being thesis and antithesis, as Lindblom framed them, optimization and muddling through have undergone a useful synthesis in modern machine learning via RL and MARL.

However, fully automated RL and MARL techniques for quickly discovering optimal or near-optimal policies remain elusive. Computational complexity results for decentralized control of Markov decision processes (MDPs) and their generalizations suggest that some of these limitations are intrinsic for MARL (although not for single-agent RL with MDPs) (Papadimitriou & Tsitsiklis, Reference Papadimitriou and Tsitsiklis1985), and hence that discovery of high-reward policies will always be time-consuming unless there is some measure of centralized control (Bernstein et al., Reference Bernstein Daniel, Zilberstein and Immerman2000). Of course, real organizations do not simply implement computer science algorithms, and it would be simplistic to read into the complexities of human organizational design and behavior all the limitations (or only the limitations) of RL and MARL algorithms. Nonetheless, understanding how and why these algorithms fail in some settings suggests important pitfalls to avoid in organizations that rely on muddling through, insofar as they follow the same basic principles. Conversely, success factors that turn out to be necessary for effective RL or MARL machine learning of high-reward policies in relatively simple environments may help to suggest necessary (although not sufficient) conditions for effective organizational learning within and among human organizations. The following paragraphs summarize key lessons and some comparisons with observed real-world decision processes for human organizations.

  1. Collect accurate, relevant feedback data and use it to improve policies.

    After each new action is taken, RL evaluates the reward received and compares it to the reward that was expected so that the difference can be used to correct erroneous expectations and update the current policy. This requires that the effects of actions be evaluated and compared to prior expectations or predictions, and also that policies then be adjusted in light of the data. In the real world, policy-making and policy-administering bureaucracies frequently violate each of these requirements. For example, finding that investments in a costly course of action have yielded lower-than-expected returns may provoke those who originally chose it to escalate their commitment to it (Molden & Hui, Reference Molden and Hui2011; Schultze et al., Reference Schultze, Pfeiffer and Schulz-Hardt2012). Possible psychological and political explanations for escalating commitment range from loss aversion to seeking to manage the impressions of others, but clearly such resistance to modifying or abandoning previous choices in light of experience inhibits effective learning (Cox, Reference Cox2015; Tetlock & Gardner, Reference Tetlock and Gardner2015). In business as well as government, data needed to evaluate and compare actual to predicted performance of a policy are often not even collected, or are ignored or misinterpreted if they are collected (Russo & Schoemaker, Reference Russo and Schoemaker1989). In social policy application areas as diverse as education, criminal justice, and health care, changes in policy are often implemented without any clear predictions about expected changes in rewards or careful evaluations of actual changes in rewards (Tetlock & Gardner, Reference Tetlock and Gardner2015). These failures of design and analysis prevent the crucial learning from experience that is essential to effective muddling through. The remedy is to collect, retain, candidly communicate, and use accurate data on predicted and observed outcomes from implemented policies to improve them over time.

  2. Explore via experiments to discover how to cause desired changes in outcome probabilities.

    It is tempting for a policy analyst or policy maker steeped in the rational-comprehensive tradition criticized by Lindblom to create the best possible model of how one believes the world works and then to choose the action or policy that maximizes expected utility according to this model, as in equation (1). But in reality, the causal relationship between choices of policies and resulting conditional probabilities of different consequences and rewards is often initially highly uncertain. Prudent and effective policy-making requires acknowledging and coping with this model uncertainty, rather than selecting and using a single model. RL and MARL algorithms do this via randomized selection of actions (e.g., using Thompson sampling or other randomized sampling schemes) (Schulze & Evans, Reference Schulze and Evans2018) to discover which policies work best and to avoid becoming stuck in local optima; a short Thompson-sampling sketch appears after this list. Such deliberate randomization is counter-cultural among people who believe that one should know and not guess about the best course of action before taking it (Tetlock & Gardner, Reference Tetlock and Gardner2015), and among decision analysts who believe that one should solve an expected utility optimization problem and then make deterministic decisions based on the results. Neither belief fully acknowledges or responds constructively to the reality emphasized by Lindblom that current knowledge is often simply insufficient to permit confident identification of the best policy, and that experimentation is the only practical way to discover how to do better. Fortunately, the use of randomized controlled trials (RCTs) in social policy experimentation and evaluation of interventions has become increasingly accepted and practiced recently, in areas ranging from disrupting poverty (Tollefson, Reference Tollefson2015) to preventing delinquency (de Vries et al., Reference Vries, Hoeve, Asscher and Stams2018) to improving oral health of fifth grade students (Qadri et al., Reference Qadri, Alkilzy, Franze, Hoffmann and Splieth2018) to reducing child abuse by intervening with substance-abusing parents (Barlow et al., Reference Jane, Sembi, Parsons, Kim, Petrou, Harnett and Dawe2019). For collective learning and decision problems, such as controlling air pollution health effects, RCTs may not be practicable or ethical, but natural experiments and quasi-experiments provide valuable opportunities to learn from observed responses to unplanned or nonrandom interventions (Boogaard et al., Reference Hanna, Erp, Walker and Shaikh2017; Henneman et al., Reference Henneman, Liu, Mulholland and Russell2017).

  3. During collective learning, agents should advance slowly when doing better than expected, but retreat quickly when doing worse.

    The “win or learn fast” (WoLF) principle from MARL provides a useful heuristic for coordinating the rates at which agents on a team adjust their individual policies to prevent collective instability, so that they can eventually find and exploit a coordinated set of individual policies for maximizing team reward. In practice, destabilized policy-making processes in human organizations can manifest as “policy churn,” in which new policies are proposed before old ones are well implemented and evaluated by the teams of agents implementing them (Monios, Reference Monios2016). Teachers implementing education reform programs; bankers implementing new risk management regulations and requirements; medical staff implementing new infection control protocols in hospital wards; and workers in bureaucracies implementing policy changes have all been frustrated by policy churn that encourages costly activity and change without providing the opportunities for careful and thorough evaluation and improvement needed to improve outcomes. Perhaps fear of constant deflections and the resulting lack of progress explains some of the previously discussed reluctance to systematically collect and use feedback data to evaluate and improve policies. Conversely, the desire to show action and strong leadership, or to obscure the results of previous ineffective choices, might provide incentives for policy churn. In any case, the study of RL and MARL performance suggests that deliberately controlling step sizes and adjustment rates for policy updates might facilitate productive incorporation of feedback data into policy updates for a group of cooperating agents without destabilizing their learning and improvement process.

  4. Separate actors and critics.

    The RL idealization of frequent small adjustments made without significant costs, delays, or uncertainties in implementation is too simple to describe most real-world decision processes. Nonetheless, some RL and MARL principles may still be useful for human organizations. One of the most useful may be that decision-making and evaluation of decision performance should be kept distinct processes. Reasons abound in individual and group psychology for keeping those who make decisions about policy adjustments (analogous to “actors” in actor-critic RL algorithms) separate from those who evaluate the performance of the policies and provide feedback and suggestions for improving them (the “critics”). Among these reasons are confirmation bias, motivated reasoning, groupthink, and other heuristics and biases (Cox, Reference Cox2015). RL suggests an additional reason, rooted in statistics: in deep learning RL algorithms, training one network to decide what to do next and a separate one to evaluate how well it is working has been found to prevent overly optimistic assessments of policy performance due to overfitting, i.e., using the same data to both select estimated value-maximizing actions and estimate the values from taking those actions (van Hesselt et al., Reference Hesselt, Guez and Silver2015). The principle of separating the processes for choosing which changes to make and evaluating how well they perform can also be applied usefully to choice of learning rates (i.e., choosing how much to modify current policies in light of feedback) as well as to choice of policies (Xu et al., Reference Xu, Qin, Wang and Liu2017). Possible future advances include deliberately diversifying the learning rates of different agents on the same team to obtain the advantages of both rapid exploration of new policies and thorough exploitation and refinement of old ones. This is an old concept in organizational science (e.g., March, Reference March1991), but is still being developed in MARL research (Potter et al., Reference Potter, Meeden and Schultz2001). As a practical matter, separation of actors and critics can be applied fruitfully to major social learning and improvement initiatives, such as air pollution regulation, through accountability studies that revisit previous regulatory actions or other decisions to assess their results (Boogaard et al., Reference Hanna, Erp, Walker and Shaikh2017; Henneman et al., Reference Henneman, Liu, Mulholland and Russell2017). Use of such evaluation studies to evaluate and update previous policy decisions – ideally, in time to be useful in guiding policy decisions elsewhere – is clearly consistent with the principle of collecting and using relevant feedback data. Separation of actors and critics provides an additional principle for using feedback data to maximum advantage to improve policies and their results.

  5. Shape rewards to promote learning and improvement.

    Recently, it has been found that using causal (counterfactual) models to shape each agent’s reward to reflect the estimated difference it has made – the difference between what was actually achieved and what would have been expected without each agent’s contribution, or its marginal value, in microeconomic terms – can speed collective learning and optimization when each agent seeks to maximize its own reward (Devlin et al., Reference Devlin, Yliniemi, Kudenko and Tumer2014). This research uses mathematical rewards that are costless to implement, so that budget constraints, such as the requirement that the sum of agent rewards not exceed the collective reward of the team, do not apply. However, it seems plausible that, even in the presence of budget constraints, rewarding each agent according to its estimated marginal contribution (or its expected marginal contributions, or Shapley values in game theory) might promote joint learning about how to contribute more effectively, as well as having other properties of efficiency and fairness familiar from microeconomics and game theory. Of course, the asymmetric information about relative roles of chance and effort typical in principal-agent problems can inhibit accurate reward shaping in practice, and causal modeling of individual marginal contributions to team performance is challenging. Nonetheless, research on how best to use reward shaping to provide feedback and encourage effective learning, as well as to create incentives, may be useful for human organizations as well as for MARL algorithms.

  6. Learn from the experiences and expertise of others.

    Learning from each other by sharing valuable memories, experiences, and expertise (typically encoded as causal models or trained neural nets) helps teams of MARL agents discover high-reward joint policies for controlling large-scale systems and accomplishing tasks in complex, changing, uncertain environments. In applying such ideas to human organizations, it is valuable to recognize that the “agents” may themselves be organizations, such as different schools, hospitals, or companies; or similar government bureaucracies in different states or countries. States and counties implementing pollution-reducing regulations might learn from each other’s experiences about which combinations of interventions and conditions (possibly involving copollutants, weather variables, and sociodemographic characteristics of the exposed population) generate the greatest public health benefits from pollution reduction measures. As usual, effective learning in human organizations must overcome challenges from various types of learning aversion that have no clear counterparts in machine learning (Cox, Reference Cox2015). For example, human bureaucracies may reorganize to visibly mimic organizational structures in more successful organizations whose reputations they covet, but without corresponding learning of the more effective policies that drive improved performance (Monios, Reference Monios2016). Players preoccupied with managing the perceptions and impressions of others to shape allocations of collective efforts and rewards to their own individual advantages may be unable to achieve Pareto efficiency or to maximize any measure of collective success or reward. These threats do not arise for teams of agents trying to cooperate in maximizing the same reward function. Our recommendation that agents should learn from each other in order to speed mastery of joint policies for obtaining high rewards from the environment is primarily applicable to such teams of cooperating agents.
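
As an executable illustration of the randomized experimentation recommended in point 2 above, the following minimal Thompson-sampling sketch compares three hypothetical policy variants with unknown success probabilities: on each trial the variant whose posterior draw is highest is tried, and its Beta posterior is updated with the observed outcome, so that each variant is selected with probability equal to the current posterior probability that it is best. All numbers are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

true_success_prob = np.array([0.30, 0.45, 0.40])   # hypothetical, unknown to the learner
n_variants = len(true_success_prob)
alpha = np.ones(n_variants)   # Beta posterior parameters: observed successes + 1
beta = np.ones(n_variants)    # Beta posterior parameters: observed failures + 1

for trial in range(5_000):
    # Thompson sampling: draw one plausible success rate per variant from its posterior,
    # then act on the variant whose draw is highest.
    draws = rng.beta(alpha, beta)
    chosen = int(np.argmax(draws))
    outcome = rng.random() < true_success_prob[chosen]   # run the "experiment"
    alpha[chosen] += outcome
    beta[chosen] += 1 - outcome

posterior_mean = alpha / (alpha + beta)
print("Posterior mean success rates:", np.round(posterior_mean, 3))
print("Most-tried variant:", int(np.argmax(alpha + beta)))
```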

5 Conclusions

In 1973, two professors of design and city planning offered the following sober assessment of the prospects for scientifically based social policy:

“The search for scientific bases for confronting problems of social policy is bound to fail, because of the nature of these problems. They are ‘wicked’ problems, whereas science has developed to deal with ‘tame’ problems. Policy problems cannot be definitively described. Moreover, in a pluralistic society there is nothing like the undisputable public good; there is no objective definition of equity; policies that respond to social problems cannot be meaningfully correct or false; and it makes no sense to talk about ‘optimal solutions’ to social problems unless severe qualifications are imposed first. Even worse, there are no ‘solutions’ in the sense of definitive and objective answers.” (Rittel & Webber Reference Rittel and Webber1973)

We believe that subsequent developments warrant greater optimism. While it is true that sufficiently heterogeneous preferences may make it impracticable or impossible to define and measure a single indisputable public good to be optimized, it is also true that agents with at least some shared goals have already achieved impressive feats of cooperation and control using MARL principles, in applications as diverse as autonomous vehicle fleet and drone swarm control, search-and-rescue via teams of cooperating autonomous robots, distributed management of supply chains, and military gaming. Such applications are admittedly far less difficult than the wicked problems referred to by Rittel and Webber, but many of the differences are of scale rather than of kind: robot teams are already using RL, MARL, and HMARL to confront, with increasing competence, the difficulties of distributed decision-making with initially unclear roles and priorities, uncertain and changing environments, opportunistic revision of goals and plans, and local information that may be time-consuming and expensive to share. Multiple practical applications have demonstrated the advantages of improving via small steps rather than trying to optimize in one big decision, and this insight from Lindblom’s 1959 paper remains true for machine learning as well as for human organizations. It has been augmented by the discovery that successive incremental improvement based on feedback at each step and careful selection of step sizes is often an effective way to solve dynamic optimization problems when they can be clearly formulated, as well as an effective way to learn how to act when not enough is initially known to formulate a clear decision optimization problem.

As artificial intelligence and machine learning algorithms are tested and improved on increasingly challenging tasks, principles for learning how to manage risks and act effectively in a variety of centralized, decentralized, and hierarchical organizational structures have begun to emerge. We have discussed several based on recent work that uses deep neural networks to approximate value functions in RL, MARL, and HMARL algorithms. These principles are only the beginning of what may soon become a substantial flow from multiagent machine learning to human management science of useful principles for improving organizational design and performance in coping with realistically complex and uncertain collective decision and policy improvement challenges. These principles will doubtless require modifications and extensions for the human world, since human psychology for both individuals and groups differs greatly from RL and MARL agent programming. But the pace of discovery and progress in using machine learning to solve increasingly large, difficult, and important real-world problems of decision-making under uncertainty is now extremely rapid. Discovering how groups and teams of agents can organize, learn, decide, and adapt more effectively is becoming an experimental and applied science, as well as a theoretical one, in current artificial intelligence and machine learning. It seems likely that this research will produce insights and principles to help tame currently wicked problems and develop increasingly effective and beneficial policies in collective choice applications with high stakes for humans.

Acknowledgments:

I thank Vicki Bier and Warner North for valuable comments on an early draft that improved the framing, content and exposition in this paper. I thank Susan Dudley for inviting me to prepare a paper and presentation for panel discussion at the Society for Risk Analysis (SRA) 2018 Annual Meeting, with partial support from the Searle Foundation. This paper reflects the enthusiastic comments and suggestions of organizers and participants in the SRA session. I am grateful for the opportunity to share, discuss, and improve these ideas.

References

Marcin, Andrychowicz, Wolski, Filip, Ray, Alex, Schneider, Jonas, Fong, Rachel, Welinder, Peter, McGrew, Bob, Tobin, Josh, Abbeel, Pieter, and Zaremba, Wojciech. 2018. Hindsight Experience Replay. Available at https://arxiv.org/pdf/1707.01495.pdf (accessed May 27, 2019).Google Scholar
Kai, Arulkumaran, Deisenroth, Marc P., Brundage, Miles, and Bharath, Anil A.. 2017. A Brief Survey of Deep Reinforcement Learning IEEE Signal Processing Magazine, Special Issue On Deep Learning for Image Understanding (ArxivExtended Version). Available at https://arxiv.org/pdf/1708.05866.pdf (accessed May 27, 2019).Google Scholar
Jane, Barlow, Sembi, Sukhdev, Parsons, Helen, Kim, Sungwook, Petrou, Stavros, Harnett, Paul, and Dawe, Sharon. 2019. “A Randomized Controlled Trial and Economic Evaluation of the Parents Under Pressure Program for Parents in Substance Abuse Treatment.” Drug and Alcohol Dependence, 194: 184194. https://doi.org/10.1016/j.drugalcdep.2018.08.044.Google Scholar
Barrett, Scott. 2013. “Climate Treaties and Approaching Catastrophes.” Journal of Environmental Economics and Management, 66: 235250. https://doi.org/10.1016/j.jeem.2012.12.004.Google Scholar
Bloembergen, D, Tuyls, K, Hennes, D, and Kaisers, M. 2015. “Evolutionary Dynamics of Multi-Agent Learning: A Survey.” Journal of Artificial Intelligence Research, 53: 659697.Google Scholar
Felix, Bernkamp, Turchetta, Matteo, Schoellig, Angela P., and Krause, Andreas. 2017. “Safe Model-based Reinforcement Learning with Stability Guarantees.” In 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, California. Available at https://papers.nips.cc/paper/6692-safe-model-based-reinforcement-learning-with-stability-guarantees.pdf Google Scholar
Bernstein Daniel, S., Zilberstein, Shlomo, and Immerman, Neil. 2000. “The Complexity of Decentralized Control of Markov Decision Processes.” Uncertainty in Artificial Intelligence Proceedings, Stanford, California. Available at https://arxiv.org/ftp/arxiv/papers/1301/1301.3836.pdf Google Scholar
Bowling, M, and Veloso, M. 2001. “Rational and Convergent Learning in Stochastic Games.” In IJCAI’01 Proceedings of the 17th International Joint Conference on Artificial Intelligence, Seattle, WA, USA, August 04-10, 2001, Vol. 2: 1021-1026. San Francisco, CA: Morgan Kaufmann Publishers Inc. Google Scholar
Hanna, Boogaard, van Erp, Annemoon M., Walker, Katherine D., and Shaikh, Rashid. 2017. “Accountability Studies on Air Pollution and Health: The HEI Experience.” Current Environmental Health Reports, 4(4): 514522. https://doi.org/10.1007/s40572-017-0161-0.Google Scholar
Cai, Yifan, Yang, Simon X., and Xu, Xin. 2013. “A Combined Hierarchical Reinforcement Learning Based Approach for Multi-robot Cooperative Target Searching in Complex Unknown Environments.” In IEEE Symposium onAdaptive Dynamic Programming and Reinforcement Learning (ADPRL), Singapore, IEEE.Google Scholar
Clavira, I, Liu, Simin, Fearing, Ronald S., Abbeel, Pieter, Levine, Sergey, Finn, Chelsea. 2018. Learning to Adapt: Meta-Learning for Model-Based Control. Available at https://arxiv.org/abs/1803.11347 (accessed May 27, 2019).Google Scholar
Clemen Robert, T., and Reilly, Terence. 2014. Making Hard Decisions, with the Decision Tools Suite. 3rd ed. Pacific Grove, CA: Duxbury Press.Google Scholar
Cox, L.A. Jr. 2015. “Overcoming Learning Aversion in Evaluating and Managing Uncertain Risks.” Risk Analysis, 35(10):1892–910. https://doi.org/10.1111/risa.12511.Google Scholar
de Vries, S.L.A, Hoeve, M., Asscher, J.J, and Stams, G.J.J.M. 2018. “The Long-Term Effects of the Youth Crime Prevention Program "New Perspectives" on Delinquency and Recidivism.” International Journal of Offender Therapy and Comparative Criminology, 62(12): 36393661. https://doi.org/10.1177/0306624X17751161.Google Scholar
Devlin, S., Yliniemi, L., Kudenko, K., and Tumer, K. 2014. Potential-Based Difference Rewards for Multiagent Reinforcement Learning. In Proceedings of the 13th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), edited by Lomuscio, Scerri, Bazzan, Huhns, May 5–9, Paris, France. Available at http://web.engr.oregonstate.edu/~ktumer/publications/files/tumer-devlin_aamas14.pdf.Google Scholar
Fuji, T., Ito, K., Matsumoto, K., and Yano, K. 2018. "Deep Multi-Agent Reinforcement Learning Using DNN-Weight Evolution to Optimize Supply Chain Performance." In Proceedings of the 51st Hawaii International Conference on System Sciences, Hawaii.
Fudenberg, D., Levine, D., and Maskin, E. 1994. "The Folk Theorem with Imperfect Public Information." Econometrica, 62(5): 997–1040.
Fudenberg, D., and Maskin, E. 1986. "The Folk Theorem in Repeated Games with Discounting or with Incomplete Information." Econometrica, 54: 533–554.
Gabel, T., and Riedmiller, M. 2007. "On a Successful Application of Multi-Agent Reinforcement Learning to Operations Research Benchmarks." In Proceedings of the IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL), Honolulu. Available at http://ml.informatik.uni-freiburg.de/former/_media/publications/gabelriedmiller07a.pdf.
Garcia, J., and Fernandez, F. 2015. "A Comprehensive Survey on Safe Reinforcement Learning." Journal of Machine Learning Research, 16: 1437–1480. Available at http://www.jmlr.org/papers/volume16/garcia15a/garcia15a.pdf (accessed May 27, 2019).
Gilboa, I., Samet, D., and Schmeidler, D. 2004. "Utilitarian Aggregation of Beliefs and Tastes." Journal of Political Economy, 112(4): 932–938. https://doi.org/10.1086/421173.
Grondman, I., Busoniu, L., Lopes, G.A.D., and Babuska, R. 2012. "A Survey of Actor-Critic Reinforcement Learning: Standard and Natural Policy Gradients." IEEE Transactions on Systems, Man, and Cybernetics, Part C, 42(6): 1291–1307. Available at http://busoniu.net/files/papers/ivo_smcc12_survey.pdf.
Gupta, J.K., Egorov, M., and Kochenderfer, M. 2017. "Cooperative Multi-Agent Control Using Deep Reinforcement Learning." In International Conference on Autonomous Agents and Multiagent Systems, São Paulo, Brazil. Available at http://ala2017.it.nuigalway.ie/papers/ALA2017_Gupta.pdf.
Heitzig, J., Lessmann, K., and Zou, Y. 2011. "Self-Enforcing Strategies to Deter Free-Riding in the Climate Change Mitigation Game and Other Repeated Public Good Games." Proceedings of the National Academy of Sciences of the United States of America, 108(38): 15739–15744. https://doi.org/10.1073/pnas.1106265108.
Henneman, L.R., Liu, C., Mulholland, J.A., and Russell, A.G. 2017. "Evaluating the Effectiveness of Air Quality Regulations: A Review of Accountability Studies and Frameworks." Journal of the Air & Waste Management Association, 67(2): 144–172. https://doi.org/10.1080/10962247.2016.1242518.
Hörner, J., and Olszewski, W. 2006. "The Folk Theorem for Games with Private Almost-Perfect Monitoring." Econometrica, 74: 1499–1544. https://doi.org/10.1111/j.1468-0262.2006.00717.x.
Howard, R., and Abbas, A. 2016. Foundations of Decision Analysis. New York, NY: Pearson.
Hu, J., and Wellman, M.P. 1998. "Multiagent Reinforcement Learning: Theoretical Framework and an Algorithm." In Proceedings of the Fifteenth International Conference on Machine Learning (ICML), 242–250.
Hu, J., and Wellman, M.P. 2003. "Nash Q-Learning for General-Sum Stochastic Games." The Journal of Machine Learning Research, 4: 1039–1069.
Hu, Y., Gao, Y., and An, B. 2015. "Multiagent Reinforcement Learning with Unshared Value Functions." IEEE Transactions on Cybernetics, 45(4): 647–662.
Hunt, S., Meng, Q., Hinde, C., and Huang, T. 2014. "A Consensus-Based Grouping Algorithm for Multi-agent Cooperative Task Allocation with Complex Requirements." Cognitive Computation, 6(3): 338–350.
Keeney, R., and Raiffa, H. 1976. Decisions with Multiple Objectives: Preferences and Value Tradeoffs. Hoboken, NJ: John Wiley & Sons.
Krishnamurthy, V. 2015. Reinforcement Learning: Stochastic Approximation Algorithms for Markov Decision Processes. Available at https://arxiv.org/pdf/1512.07669.pdf.
Lemke, C., Budka, M., and Gabrys, B. 2015. "Metalearning: A Survey of Trends and Technologies." Artificial Intelligence Review, 44(1): 117–130. https://www.ncbi.nlm.nih.gov/pubmed/26069389.
Lindblom, C.E. 1959. "The Science of 'Muddling Through'." Public Administration Review, 19(2): 79–88. https://doi.org/10.2307/973677.
Luce, R.D., and Raiffa, H. 1957. Games and Decisions. New York: John Wiley & Sons.
Lütjens, B., Everett, M., and How, J.P. 2018. Safe Reinforcement Learning with Model Uncertainty Estimates. Available at https://arxiv.org/abs/1810.08700.
Mannion, P., Duggan, J., and Howley, E. 2017. "Analysing the Effects of Reward Shaping in Multi-Objective Stochastic Games." In Adaptive and Learning Agents Workshop, São Paulo. Available at http://ala2017.it.nuigalway.ie/papers/ALA2017_Mannion_Analysing.pdf.
March, J.G. 1991. "Exploration and Exploitation in Organizational Learning." Organization Science, 2(1): 71–87. Special Issue: Organizational Learning: Papers in Honor of (and by) James G. March. Available at http://www.jstor.org/stable/2634940.
Marschak, J., and Radner, R. 1972. Economic Theory of Teams. New Haven: Yale University Press.
Miceli, T.J. 2017. The Economic Approach to Law. 3rd ed. Stanford University Press.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., et al. 2015. "Human-Level Control Through Deep Reinforcement Learning." Nature, 518(7540): 529–533.
Molden, D.C., and Hui, C.M. 2011. "Promoting De-Escalation of Commitment: A Regulatory-Focus Perspective on Sunk Costs." Psychological Science, 22(1): 8–12. https://doi.org/10.1177/0956797610390386.
Monios, J. 2016. "Policy Transfer or Policy Churn? Institutional Isomorphism and Neoliberal Convergence in the Transport Sector." Environment and Planning A: Economy and Space, 49(2). https://doi.org/10.1177/0308518X16673367.
Mookherjee, D. 2006. "Decentralization, Hierarchies, and Incentives: A Mechanism Design Perspective." Journal of Economic Literature, 44(2): 367–390.
Munos, R. 2014. "From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning." Foundations and Trends in Machine Learning, 7(1): 1–129. https://doi.org/10.1561/2200000038.
Nisan, N. 2007. "Introduction to Mechanism Design (For Computer Scientists)." In Algorithmic Game Theory, edited by Nisan, N., Roughgarden, T., Tardos, E., and Vazirani, V. New York, NY: Cambridge University Press.
Omidshafiei, S., Pazis, J., Amato, C., How, J.P., and Vian, J. 2017. Deep Decentralized Multi-Task Multi-Agent Reinforcement Learning Under Partial Observability. Available at https://arxiv.org/abs/1703.06182 (accessed May 27, 2019).
Papadimitriou, C., and Tsitsiklis, J.N. 1985. The Complexity of Markov Decision Processes. Available at https://dspace.mit.edu/bitstream/handle/1721.1/2893/P-1479-13685602.pdf?sequence=1 (accessed May 27, 2019).
Petrik, M., Chow, Y., and Ghavamzadeh, M. 2016. Safe Policy Improvement by Minimizing Robust Baseline Regret. Available at https://arxiv.org/abs/1607.03842 (accessed May 27, 2019).
Pham, H.X., La, H.M., Feil-Seifer, D., and Nefian, A. 2018. Cooperative and Distributed Reinforcement Learning of Drones for Field Coverage. Available at https://arxiv.org/pdf/1803.07250.pdf (accessed May 27, 2019).
Potter, M., Meeden, L., and Schultz, A. 2001. "Heterogeneity in the Coevolved Behaviors of Mobile Robots: The Emergence of Specialists." In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI), Seattle.
Qadri, G., Alkilzy, M., Franze, M., Hoffmann, W., and Splieth, C. 2018. "School-Based Oral Health Education Increases Caries Inequalities." Community Dental Health, 35(3): 153–159. https://doi.org/10.1922/CDH_4145Qadri07.
Raiffa, H. 1968. Decision Analysis: Introductory Lectures on Choices Under Uncertainty. Reading, MA: Addison-Wesley.
Rittel, H.W.J., and Webber, M.M. 1973. "Dilemmas in a General Theory of Planning." Policy Sciences, 4(2): 155–169. Available at http://urbanpolicy.net/wp-content/uploads/2012/11/Rittel+Webber_1973_PolicySciences4-2.pdf.
Russo, J.E., and Schoemaker, P.J.H. 1989. Decision Traps: Ten Barriers to Brilliant Decision-Making and How to Overcome Them. New York: Doubleday.
Schultze, T., Pfeiffer, F., and Schulz-Hardt, S. 2012. "Biased Information Processing in the Escalation Paradigm: Information Search and Information Evaluation as Potential Mediators of Escalating Commitment." Journal of Applied Psychology, 97(1): 16–32. https://doi.org/10.1037/a0024739.
Schulze, S., and Evans, O. 2018. Active Reinforcement Learning with Monte-Carlo Tree Search. Available at https://arxiv.org/abs/1803.04926 (accessed May 27, 2019).
Shalev-Shwartz, S., Shammah, S., and Shashua, A. 2016. Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving. Available at https://arxiv.org/pdf/1610.03295.pdf (accessed May 27, 2019).
Shiarlis, K., Messias, J., and Whiteson, S. 2016. "Inverse Reinforcement Learning from Failure." In Proceedings of the 2016 International Conference on Autonomous Agents and Multiagent Systems (AAMAS), Richland, SC, 1060–1068. Available at http://www.cs.ox.ac.uk/people/shimon.whiteson/pubs/shiarlisaamas16.pdf.
Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., et al. 2016. "Mastering the Game of Go with Deep Neural Networks and Tree Search." Nature, 529(7587): 484–489. https://doi.org/10.1038/nature16961.
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., et al. 2018. "A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go Through Self-Play." Science, 362(6419): 1140–1144. https://doi.org/10.1126/science.aar6404.
Sutton, R.S., and Barto, A.G. 1998. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.
Sutton, R.S., and Barto, A.G. 2018. Reinforcement Learning: An Introduction. 2nd ed. Cambridge, MA: MIT Press.
Tang, H., Hao, J., Lv, T., Chen, Y., Zhang, Z., Jia, H., Ren, C., Zheng, Y., Fan, C., and Wang, L. 2018. Hierarchical Deep Multiagent Reinforcement Learning. Available at https://arxiv.org/pdf/1809.09332.pdf (accessed May 27, 2019).
Tetlock, P.E., and Gardner, D. 2015. Superforecasting: The Art and Science of Prediction. New York, NY: Penguin Random House LLC.
Thomas, P.S., Theocharous, G., and Ghavamzadeh, M. 2015. "High Confidence Policy Improvement." In Proceedings of the 32nd International Conference on Machine Learning, Lille, France. Available at https://people.cs.umass.edu/~pthomas/papers/Thomas2015b.pdf.
Tollefson, J. 2015. "Can Randomized Trials Eliminate Global Poverty?" Nature, 524(7564): 150–153. https://doi.org/10.1038/524150a.
Usunier, N., Synnaeve, G., Lin, Z., and Chintala, S. 2016. Episodic Exploration for Deep Deterministic Policies: An Application to StarCraft Micromanagement Tasks. Available at https://arxiv.org/pdf/1609.02993.pdf (accessed May 27, 2019).
van Hasselt, H., Guez, A., and Silver, D. 2015. Deep Reinforcement Learning with Double Q-Learning. Available at https://arxiv.org/pdf/1509.06461.pdf (accessed May 27, 2019).
Villar, S., Bowden, J., and Wason, J. 2015. "Multi-Armed Bandit Models for the Optimal Design of Clinical Trials: Benefits and Challenges." Statistical Science, 30(2): 199–215. https://doi.org/10.1214/14-STS504.
Vodopivec, T., Samothrakis, S., and Ster, B. 2017. "On Monte Carlo Tree Search and Reinforcement Learning." Journal of Artificial Intelligence Research, 60: 881–936. https://doi.org/10.1613/jair.5507.
Wood, P.J. 2011. "Climate Change and Game Theory." Annals of the New York Academy of Sciences, 1219: 153–170. https://doi.org/10.1111/j.1749-6632.2010.05891.x.
Xu, C., Qin, T., Wang, G., and Liu, T.-Y. 2017. Reinforcement Learning for Learning Rate Control. Available at https://arxiv.org/abs/1705.11159 (accessed May 27, 2019).
Zhang, C., Abdallah, S., and Lesser, V. 2008. "MASPA: Multi-Agent Automated Supervisory Policy Adaptation." UMass Computer Science Technical Report #08-03. Amherst, MA: Computer Science Department, University of Massachusetts. Available at https://pdfs.semanticscholar.org/418f/2ddfea52da09f21fea633e128ffccd00c8f6.pdf.
Zhang, K., Yang, Z., Liu, H., Zhang, T., and Basar, T. 2018. "Fully Decentralized Multi-Agent Reinforcement Learning with Networked Agents." In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80. Available at http://proceedings.mlr.press/v80/zhang18n/zhang18n.pdf.
Zhao, W., Meng, Q., and Chung, P.W. 2016. "A Heuristic Distributed Task Allocation Method for Multivehicle Multitask Problems and Its Application to Search and Rescue Scenario." IEEE Transactions on Cybernetics, 46(4): 902–915. https://doi.org/10.1109/TCYB.2015.2418052.
Table 1 Some enhancements to reinforcement learning (RL) algorithms.
Table 2 Some principles for safe learning, i.e., learning without risking catastrophic failures.
Table 3 Some MARL variations and extensions.