Safety is an essential requirement, and a major bottleneck, for legged robots in the real world. For learning-based methods in particular, their trial-and-error nature and opaque policies have raised widespread concerns. Existing methods usually treat this challenge as a trade-off between safety assurance and task performance, partly because the robot's safety is inferred inaccurately. In this paper, we re-examine the segmentation of the robot's state space in terms of safety. Based on the current state and the predicted state-transition trajectory, the states of legged robots are classified as safe, recoverable, unsafe, or failure, and a safety verification method is introduced to infer the robot's safety online. Task, recovery, and fall-protection policies are then trained to keep the robot safe in the corresponding states, forming a safety supervision framework that is independent of the learning algorithm. To validate the proposed method and framework, experiments are conducted both in simulation and on a real-world robot, showing improvements in both safety and efficiency.
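To make the supervision logic concrete, here is a minimal Python sketch of how such a supervisor could classify states and dispatch among the three policies. The helper names (predict_trajectory, is_viable, has_fallen), the state-to-policy mapping, and the policy dictionary keys are assumptions made for illustration, not the paper's implementation.

```python
from enum import Enum, auto

class SafetyClass(Enum):
    SAFE = auto()
    RECOVERABLE = auto()
    UNSAFE = auto()
    FAILURE = auto()

def classify_state(state, predict_trajectory, is_viable, has_fallen):
    """Classify the current state using the predicted state-transition trajectory."""
    if has_fallen(state):
        return SafetyClass.FAILURE
    if not is_viable(state):
        return SafetyClass.UNSAFE
    # Currently viable: check whether a short-horizon prediction stays viable.
    if all(is_viable(s) for s in predict_trajectory(state)):
        return SafetyClass.SAFE
    return SafetyClass.RECOVERABLE

def supervise(state, policies, predict_trajectory, is_viable, has_fallen):
    """Pick the action source: task, recovery, or fall-protection policy."""
    label = classify_state(state, predict_trajectory, is_viable, has_fallen)
    if label is SafetyClass.SAFE:
        return policies["task"](state)
    if label is SafetyClass.RECOVERABLE:
        return policies["recovery"](state)
    return policies["fall_protection"](state)   # unsafe or failure
```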
Demands in the Ultimatum Game in its traditional form, with one proposer and one responder, are compared with demands in an Ultimatum Game with responder competition. In this modified form, one proposer faces three responders who can accept or reject the split of the pie. Initial demands in both ultimatum games are quite similar; however, over the course of the experiment, demands in the ultimatum game with responder competition become significantly higher than in the traditional case with repeated random matching. Individual round-to-round changes of choices consistent with directional learning are the driving force behind the differences between the two learning curves and cannot be captured by an adjustment process that responds only to accumulated reinforcements. The importance of combining reinforcement and directional learning is addressed. Moreover, learning transfer between the two ultimatum games is analyzed.
We study the interaction of the effects of the strategic environment and communication on the observed levels of cooperation in two-person finitely repeated games with a Pareto-inefficient Nash equilibrium and replicate previous findings that point to higher levels of tacit cooperation under strategic complementarity than under strategic substitutability. We find that this is not because of differences in the levels of reciprocity as previously suggested. Instead, we demonstrate that slow learning coupled with noisy choices may drive this effect. When subjects are allowed to communicate in free-form online chats before making choices, cooperation levels increase significantly to the extent that the difference between strategic complements and substitutes disappears. A machine-assisted natural language processing approach then shows how the content of communication is dependent on the strategic environment and cooperative behavior, and indicates that subjects in complementarity games reach full cooperation by agreeing on gradual moves toward it.
Altered reinforcement learning (RL) and decision-making have been implicated in the pathophysiology of anorexia nervosa. To determine whether deficits observed in symptomatic anorexia nervosa are also present in remission, we investigated RL in women remitted from anorexia nervosa (rAN).
Methods:
Participants performed a probabilistic associative learning task that involved learning from rewarding or punishing outcomes across consecutive sets of stimuli to examine generalization of learning to new stimuli over extended task exposure. We fit a hybrid RL and drift diffusion model of associative learning to model learning and decision-making processes in 24 rAN and 20 female community controls (cCN).
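As a rough illustration of this model class (not the fitted model from this study), the sketch below couples a delta-rule value update with a drift-diffusion choice process, with the drift rate proportional to the value difference between options; all parameter names, the toy reward contingency, and the valence-dependent learning rates are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def rl_ddm_trial(q, stim_pair, alpha_gain, alpha_loss, drift_scale,
                 threshold, dt=0.001, non_decision=0.3):
    """One trial of a toy RL-DDM: Q-values set the drift rate, a noisy
    accumulator produces choice and response time, and the chosen option's
    Q-value is updated with a valence-dependent learning rate."""
    a, b = stim_pair
    drift = drift_scale * (q[a] - q[b])          # evidence favouring option a
    x, t = 0.0, 0.0
    while abs(x) < threshold:                    # diffusion to a decision boundary
        x += drift * dt + rng.normal(0.0, np.sqrt(dt))
        t += dt
    choice = a if x > 0 else b
    rewarded = rng.random() < (0.7 if choice == a else 0.3)   # toy contingency
    outcome = 1.0 if rewarded else -1.0
    delta = outcome - q[choice]                  # prediction error
    q[choice] += (alpha_gain if delta > 0 else alpha_loss) * delta
    return choice, t + non_decision, outcome
```

A call such as `q = np.zeros(2); rl_ddm_trial(q, (0, 1), alpha_gain=0.2, alpha_loss=0.3, drift_scale=2.0, threshold=1.0)` simulates one trial; in the study such parameters are estimated by fitting the model to behaviour rather than fixed by hand.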
Results:
rAN showed better learning from negative outcomes than cCN, and this difference increased over extended task exposure (p < .001, ηp² = .30). rAN demonstrated a reduction in the accuracy of optimal choices (p = .007, ηp² = .16) and in the rate of information extraction on reward trials from set 1 to set 2 (p = .012, ηp² = .14), as well as a larger reduction of response threshold separation from set 1 to set 2 than cCN (p = .036, ηp² = .10).
Conclusions:
rAN extracted less information from rewarding stimuli and their learning became increasingly sensitive to negative outcomes over learning trials. This suggests rAN shifted attention to learning from negative feedback while slowing down extraction of information from rewarding stimuli. Better learning from negative over positive feedback in rAN might reflect a marker of recovery.
We begin with the theoretical and empirical foundations of happiness economics, in which the aim of economic policy is to maximize the self-reported happiness of people in society. We also discuss the economic correlates of self-reported happiness. We outline some of the key insights from the literature on behavioral industrial organization, such as phishing for phools and the effects of limited attention on the pricing decisions of firms. When products have several attributes, we explain how some might be more salient than others. We also explain the effects of limited attention on economic outcomes. We introduce the basics of complexity economics, in which people use simple rules of thumb and simple adaptive learning models in the presence of true uncertainty. We show that the aggregate, system-wide outcomes are complex, characterized by chaotic dynamics and the formation of emergent phenomena. The observed fluctuations in the system arise endogenously, rather than from stochastic exogenous shocks. We introduce two kinds of learning models: reinforcement learning and beliefs-based learning. Finally, we critically evaluate the literature on competitive double auction experiments.
This study introduces an advanced reinforcement learning (RL)-based control strategy for heating, ventilation, and air conditioning (HVAC) systems, employing a soft actor-critic agent with a customized reward mechanism. This strategy integrates time-varying outdoor temperature-dependent weighting factors to dynamically balance thermal comfort and energy efficiency. Our methodology has undergone rigorous evaluation across two distinct test cases within the building optimization testing (BOPTEST) framework, an open-source virtual simulator equipped with standardized key performance indicators (KPIs) for performance assessment. Each test case is strategically selected to represent distinct building typologies, climatic conditions, and HVAC system complexities, ensuring a thorough evaluation of our method across diverse settings. The first test case is a heating-focused scenario in a residential setting. Here, we directly compare our method against four advanced control strategies: an optimized rule-based controller inherently provided by BOPTEST, two sophisticated RL-based strategies leveraging BOPTEST’s KPIs as reward references, and a model predictive control (MPC)-based approach specifically tailored for the test case. Our results indicate that our approach outperforms the rule-based and other RL-based strategies and achieves outcomes comparable to the MPC-based controller. The second scenario, a cooling-dominated environment in an office setting, further validates the versatility of our strategy under varying conditions. The consistent performance of our strategy across both scenarios underscores its potential as a robust tool for smart building management, adaptable to both residential and office environments under different climatic challenges.
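A minimal sketch of the kind of reward shaping described, in which an outdoor-temperature-dependent weight trades off comfort against energy, might look as follows; the band limits, normalization, weight range, and units are placeholder assumptions, not the paper's calibration.

```python
def hvac_reward(comfort_violation_kh, energy_use_kwh, outdoor_temp_c,
                t_low=10.0, t_high=25.0, w_min=0.3, w_max=0.9):
    """Hypothetical reward: the comfort weight grows as the outdoor temperature
    moves away from a mild band, trading off against energy use."""
    # Distance of the outdoor temperature from the mild band, clipped to [0, 1].
    if outdoor_temp_c < t_low:
        severity = min(1.0, (t_low - outdoor_temp_c) / 15.0)
    elif outdoor_temp_c > t_high:
        severity = min(1.0, (outdoor_temp_c - t_high) / 15.0)
    else:
        severity = 0.0
    w_comfort = w_min + (w_max - w_min) * severity   # time-varying weighting factor
    w_energy = 1.0 - w_comfort
    return -(w_comfort * comfort_violation_kh + w_energy * energy_use_kwh)
```

In a BOPTEST-style training loop, a reward of this shape would be evaluated at each control step from the measured comfort violation and energy consumption over that step.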
Expert drivers possess the ability to execute high sideslip angle maneuvers, commonly known as drifting, during racing to navigate sharp corners and execute rapid turns. However, existing model-based controllers encounter challenges in handling the highly nonlinear dynamics associated with drifting along general paths. While reinforcement learning-based methods alleviate the reliance on explicit vehicle models, training a policy directly for autonomous drifting remains difficult due to multiple objectives. In this paper, we propose a control framework for autonomous drifting in the general case, based on curriculum reinforcement learning. The framework empowers the vehicle to follow paths with varying curvature at high speeds, while executing drifting maneuvers during sharp corners. Specifically, we consider the vehicle’s dynamics to decompose the overall task and employ curriculum learning to break down the training process into three stages of increasing complexity. Additionally, to enhance the generalization ability of the learned policies, we introduce randomization into sensor observation noise, actuator action noise, and physical parameters. The proposed framework is validated using the CARLA simulator, encompassing various vehicle types and parameters. Experimental results demonstrate the effectiveness and efficiency of our framework in achieving autonomous drifting along general paths. The code is available at https://github.com/BIT-KaiYu/drifting.
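The randomization idea can be sketched as below, where each training episode draws perturbed physical parameters and noise scales; the parameter names and ranges are illustrative assumptions, not the values used in the paper.

```python
import random

def randomize_episode(base):
    """Hypothetical domain randomization for one training episode: perturb
    physical parameters and set observation/action noise scales."""
    cfg = dict(base)
    cfg["mass"]             = base["mass"] * random.uniform(0.9, 1.1)
    cfg["tire_friction"]    = base["tire_friction"] * random.uniform(0.85, 1.15)
    cfg["com_height"]       = base["com_height"] + random.uniform(-0.02, 0.02)
    cfg["obs_noise_std"]    = random.uniform(0.0, 0.02)   # added to sensor readings
    cfg["action_noise_std"] = random.uniform(0.0, 0.05)   # added to steering/throttle
    return cfg

base_vehicle = {"mass": 1500.0, "tire_friction": 1.0, "com_height": 0.5}
episode_cfg = randomize_episode(base_vehicle)
```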
In this note we provide an upper bound for the difference between the value function of a distributionally robust Markov decision problem and the value function of a non-robust Markov decision problem. The ambiguity set of probability kernels of the distributionally robust Markov decision process is described by a Wasserstein ball around some reference kernel, whereas the non-robust Markov decision process behaves according to a fixed probability kernel contained in the ambiguity set. Our derived upper bound for the difference between the value functions is dimension-free and depends linearly on the radius of the Wasserstein ball.
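Schematically, the result has the following shape (a sketch with a placeholder constant, not the paper's exact statement):

```latex
% Schematic form of the bound (C is a placeholder constant). For a reference
% kernel P_0, a Wasserstein ball B_eps(P_0) of radius eps, the robust value
% function V^rob, and the non-robust value function V^P with P in B_eps(P_0):
\[
  \sup_{s \in S} \bigl| V^{\mathrm{rob}}(s) - V^{P}(s) \bigr|
  \;\le\; C \, \varepsilon ,
  \qquad P \in B_{\varepsilon}(P_0),
\]
% where C is dimension-free, depending only on quantities such as the discount
% factor and regularity constants of the reward and transition data.
```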
Flexible flat cable (FFC) assembly is a prime challenge in electronic manufacturing. The cable's tendency to deform under external force, its tiny assembly tolerance, and its fragility impede the application of robotic assembly in this field. To achieve reliable and stable robotic assembly of FFCs, an efficient assembly skill acquisition strategy is presented that combines a parallel robot skill learning algorithm with adaptive impedance control. The parallel robot skill learning algorithm is proposed to enhance the efficiency of FFC assembly skill acquisition; it reduces the risk of damaging the FFC and handles the uncertainty caused by deformation during the assembly process. Moreover, FFC assembly is a complex, contact-rich manipulation task. An adaptive impedance controller is designed to implement force tracking during the assembly process without precise environment information, and its stability is analyzed using a Lyapunov function. Experiments on FFC assembly are conducted to illustrate the efficiency of the proposed method. The experimental results demonstrate that the proposed method is robust and efficient.
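As a hedged illustration of the control idea (not the paper's controller), the following sketch adapts the impedance parameters from the force-tracking error and integrates the resulting reference motion along a single insertion axis; the gains, the adaptation law, and the sign conventions are assumptions.

```python
def adaptive_impedance_step(x, x_dot, x_ref, f_meas, f_des, b, k,
                            m=1.0, gamma_b=0.01, gamma_k=0.01, dt=0.001):
    """One step of a hypothetical adaptive impedance law along the insertion
    axis: damping b and stiffness k are adjusted from the force-tracking error,
    and the motion follows m*x_dd + b*x_dot + k*(x - x_ref) = f_des - f_meas."""
    f_err = f_des - f_meas                  # force-tracking error
    b += gamma_b * f_err * dt               # adapt damping
    k += gamma_k * f_err * dt               # adapt stiffness
    x_dd = (f_err - b * x_dot - k * (x - x_ref)) / m
    x_dot += x_dd * dt                      # Euler integration of the admittance
    x += x_dot * dt
    return x, x_dot, b, k
```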
One in eight children experience early life stress (ELS), which increases risk for psychopathology. ELS, particularly neglect, has been associated with reduced responsivity to reward. However, little work has investigated the computational specifics of this disrupted reward response, particularly the neural response to reward prediction errors (RPEs), a critical signal for successful instrumental learning, and the extent to which these responses are augmented for novel stimuli. The goal of the current study was to investigate the associations of abuse and neglect with the neural representation of RPEs to novel and non-novel stimuli.
Methods
One hundred and seventy-eight participants (aged 10–18, M = 14.9, s.d. = 2.38) engaged in the Novelty task while undergoing functional magnetic resonance imaging. In this task, participants learn to choose novel or non-novel stimuli to win monetary rewards varying from $0 to $0.30 per trial. Levels of abuse and neglect were measured using the Childhood Trauma Questionnaire.
Results
Adolescents exposed to high levels of neglect showed reduced RPE-modulated blood oxygenation level-dependent response within medial and lateral frontal cortices, particularly when exploring novel stimuli (p < 0.05, corrected for multiple comparisons), relative to adolescents exposed to lower levels of neglect.
Conclusions
These data expand on previous work by indicating that neglect, but not abuse, is associated with impairments in neural RPE representation within medial and lateral frontal cortices. However, there was no association between neglect and behavioral impairments on the Novelty task, suggesting that these neural differences do not necessarily translate into behavioral differences within the context of the Novelty task.
This study proposes a novel hybrid learning approach for developing a visual path-following algorithm for industrial robots. The process involves three steps: data collection from a simulation environment, network training, and testing on a real robot. The actor network is trained using supervised learning for 500 epochs. A semi-trained network is then taken at the 250th epoch and trained for another 250 epochs using reinforcement learning within the simulation environment. Networks trained with supervised learning alone (500 epochs) and with the proposed hybrid learning method (250 epochs each of supervised and reinforcement learning) are compared. The hybrid learning approach achieves a significantly lower average error (30.9 mm) than supervised learning (39.3 mm) on real-world images. Additionally, the hybrid approach exhibits faster processing times (31.7 s versus 35.0 s for supervised learning). The proposed method is implemented on a KUKA Agilus KR6 R900 six-axis robot, demonstrating its effectiveness. Furthermore, the hybrid approach reduces the total power consumption of the robot's motors compared with the supervised learning method. These results suggest that the hybrid learning approach offers a more effective and efficient solution for visual path following in industrial robots than traditional supervised learning.
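The two-phase schedule can be summarized by the skeleton below; the step functions are placeholders standing in for the paper's supervised regression and reinforcement learning updates, which are not specified here.

```python
def hybrid_training(actor, supervised_step, rl_step, dataset, env,
                    supervised_epochs=250, rl_epochs=250):
    """Schematic two-phase schedule matching the description above: supervised
    pre-training, then RL fine-tuning in simulation. The step callables are
    user-supplied placeholders."""
    for _ in range(supervised_epochs):
        for batch in dataset:
            supervised_step(actor, batch)    # e.g., regression on demonstration images
    for _ in range(rl_epochs):
        rl_step(actor, env)                  # e.g., a policy-gradient update in simulation
    return actor
```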
Selective serotonin reuptake inhibitors (SSRIs) are first-line pharmacological treatments for depression and anxiety. However, little is known about how pharmacological action is related to cognitive and affective processes. Here, we examine whether specific reinforcement learning processes mediate the treatment effects of SSRIs.
Methods
The PANDA trial was a multicentre, double-blind, randomized clinical trial in UK primary care comparing the SSRI sertraline with placebo for depression and anxiety. Participants (N = 655) performed an affective Go/NoGo task three times during the trial, and computational models were used to infer reinforcement learning processes.
Results
There was poor task performance: only 54% of the task runs were informative, with more informative task runs in the placebo than in the active group. There was no evidence for the preregistered hypothesis that Pavlovian inhibition was affected by sertraline. Exploratory analyses revealed that in the sertraline group, early increases in Pavlovian inhibition were associated with improvements in depression after 12 weeks. Furthermore, sertraline increased how fast participants learned from losses and faster learning from losses was associated with more severe generalized anxiety symptoms.
Conclusions
The study findings indicate a relationship between aversive reinforcement learning mechanisms and aspects of depression, anxiety, and SSRI treatment, but these relationships did not align with the initial hypotheses. Poor task performance limits the interpretability and likely generalizability of the findings, and highlights the critical importance of developing acceptable and reliable tasks for use in clinical studies.
Funding
This article presents research supported by NIHR Program Grants for Applied Research (RP-PG-0610-10048), the NIHR BRC, and UCL, with additional support from IMPRS COMP2PSYCH (JM, QH) and a Wellcome Trust grant (QH).
Developing an artificial design agent that mimics human design behaviors through the integration of heuristics is pivotal for various purposes, including advancing design automation, fostering human-AI collaboration, and enhancing design education. However, this endeavor necessitates abundant behavioral data from human designers, posing a challenge due to data scarcity for many design problems. One potential solution lies in transferring learned design knowledge from one problem domain to another. This article aims to gather empirical evidence and computationally evaluate the transferability of design knowledge represented at a high level of abstraction across different design problems. Initially, a design agent grounded in reinforcement learning (RL) is developed to emulate human design behaviors. A data-driven reward mechanism, informed by the Markov chain model, is introduced to reinforce prominent sequential design patterns. Subsequently, the design agent transfers the acquired knowledge from a source task to a target task using a problem-agnostic high-level representation. Through a case study involving two solar system designs, one dataset trains the design agent to mimic human behaviors, while another evaluates the transferability of these learned behaviors to a distinct problem. Results demonstrate that the RL-based agent outperforms a baseline model utilizing the first-order Markov chain model in both the source task without knowledge transfer and the target task with knowledge transfer. However, the model’s performance is comparatively lower in predicting the decisions of low-performing designers, suggesting caution in its application, as it may yield unsatisfactory results when mimicking such behaviors.
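One simple way to realize a Markov-chain-informed reward of this kind is sketched below: first-order transition frequencies between design actions are estimated from human sequences and reused as a reward signal for the agent. The function and variable names are illustrative assumptions, not the article's implementation.

```python
from collections import defaultdict

def fit_transition_reward(design_sequences):
    """Hypothetical data-driven reward: estimate first-order transition
    probabilities between design actions from human sequences, then reward
    the agent for transitions that human designers take often."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in design_sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    probs = {
        prev: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
        for prev, nxts in counts.items()
    }
    def reward(prev_action, action):
        return probs.get(prev_action, {}).get(action, 0.0)
    return reward

# Example: reward = fit_transition_reward([["orient", "place", "connect"], ...])
```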
The increase in Electrical and Electronic Equipment (EEE) usage across sectors has given rise to repair and maintenance units. Disassembling parts requires careful planning, which is addressed by the Disassembly Sequence Planning (DSP) process; manual disassembly in particular is constrained by time and labor. Effective disassembly planning can encourage the reuse and recycling sector and thereby reduce raw-material mining, and an efficient DSP lowers both time and cost. To address the challenges in DSP, this research introduces an innovative framework based on Q-Learning (QL) within the domain of Reinforcement Learning (RL). Furthermore, an Enhanced Simulated Annealing (ESA) algorithm is introduced to improve the exploration-exploitation balance in the proposed RL framework. The proposed framework is extensively evaluated against state-of-the-art frameworks and benchmark algorithms using a diverse set of eight products as test cases. The findings reveal that the proposed framework outperforms benchmark algorithms and state-of-the-art frameworks in terms of time consumption, memory consumption, and solution optimality. Specifically, for complex large products, the proposed technique achieves a minimum reduction of 60% in time consumption and 30% in memory usage compared with other state-of-the-art techniques. Additionally, qualitative analysis demonstrates that the proposed approach generates sequences with high fitness values, indicating more stable and less time-consuming disassembly. The framework enables a range of real-world disassembly applications, thereby making a significant contribution to sustainable practices in EEE industries.
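A minimal sketch of how simulated-annealing-style exploration can be coupled with tabular Q-learning is shown below; it illustrates the general coupling rather than the proposed ESA algorithm, and all constants and data structures are assumptions.

```python
import math
import random

def sa_q_action(q_row, temperature):
    """Boltzmann (simulated-annealing-style) action selection: high temperature
    explores, low temperature exploits. q_row maps action -> Q-value."""
    actions = list(q_row)
    m = max(q_row[a] for a in actions)                      # for numerical stability
    weights = [math.exp((q_row[a] - m) / max(temperature, 1e-6)) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Standard tabular Q-learning update over a dict-of-dicts Q table."""
    best_next = max(q[next_state].values()) if q.get(next_state) else 0.0
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])

# A cooling schedule such as T_k = T_0 * decay**k would gradually shift the
# agent from exploring candidate disassembly moves toward exploiting the best ones.
```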
The use of machine learning in robotics is a vast and growing area of research. In this chapter we consider a few key directions: the use of deep neural networks, the application of reinforcement learning and especially deep reinforcement learning, and the rapidly emerging potential of large language models.
We often forgo a larger future reward in order to obtain a smaller reward immediately, a tendency known as impatient intertemporal choice. The current study investigated the role of Pavlovian-to-instrumental transfer (PIT) as a mechanism contributing to impatient intertemporal choice, following a theoretical framework proposing that cues associated with immediate gratification trigger a Pavlovian approach response, interfering with goal-directed (instrumental) inhibitory behavior. We developed a paradigm in which participants first learned to make instrumental go/no-go responses in order to win rewards and avoid punishments. Next, they learned the associations between Pavlovian cues and rewards varying in amount and delay. Finally, we tested whether these (task-irrelevant) cues exerted transfer effects by influencing instrumental actions while participants again completed the go/no-go task. Across two experiments, Pavlovian cues associated with larger (versus smaller) and immediate (versus delayed) rewards were evaluated more positively, reflecting the successful acquisition of Pavlovian cue–outcome associations. These findings replicated the previously reported classical transfer effect of reward amount on instrumental behavior, as cues associated with larger (versus smaller) rewards increased instrumental approach. In contrast, we found no evidence for the hypothesized transfer effects for reward delay, contrary to the proposed role of PIT in impatient intertemporal choice. These results suggest that although both reward amount and delay were important in the evaluation of cues, only the amount associated with cues influenced instrumental choice. We provide concrete suggestions for future studies, addressing instrumental outcome identity, competition between cue–amount and cue–delay associations, and individual differences in response to Pavlovian cues.
Individuals with cocaine use disorder or gambling disorder demonstrate impairments in cognitive flexibility: the ability to adapt to changes in the environment. Flexibility is commonly assessed in a laboratory setting using probabilistic reversal learning, which involves reinforcement learning, the process by which feedback from the environment is used to adjust behavior.
Aims
It is poorly understood whether impairments in flexibility differ between individuals with cocaine use and gambling disorders, and how this is instantiated by the brain. We applied computational modelling methods to gain a deeper mechanistic explanation of the latent processes underlying cognitive flexibility across two disorders of compulsivity.
Method
We present a re-analysis of probabilistic reversal data from individuals with either gambling disorder (n = 18) or cocaine use disorder (n = 20) and control participants (n = 18), using a hierarchical Bayesian approach. Furthermore, we relate behavioural findings to their underlying neural substrates through an analysis of task-based functional magnetic resonance imaging (fMRI) data.
Results
We observed lower ‘stimulus stickiness’ in gambling disorder, and report differences in tracking expected values in individuals with gambling disorder compared to controls, with greater activity during reward expected value tracking in the cingulate gyrus and amygdala. In cocaine use disorder, we observed lower responses to positive punishment prediction errors and greater activity following negative punishment prediction errors in the superior frontal gyrus compared to controls.
Conclusions
Using a computational approach, we show that individuals with gambling disorder and cocaine use disorder differed in their perseverative tendencies and in how they tracked value neurally, which has implications for psychiatric classification.
A fleet of aircraft can be seen as a set of degrading systems that undergo variable loads as they fly missions and require maintenance throughout their lifetime. Optimal fleet management aims to maximise fleet availability while minimising overall maintenance costs. To achieve this goal, individual aircraft, with variable age and degradation paths, need to operate cooperatively to maintain high fleet availability while avoiding mechanical failure by scheduling preventive maintenance actions. In recent years, reinforcement learning (RL) has emerged as an effective method to optimise complex sequential decision-making problems. In this paper, an RL framework to optimise the operation and maintenance of a fleet of aircraft is developed. Three case studies, with varying numbers of aircraft in the fleet, are used to demonstrate the ability of the RL policies to outperform traditional operation and maintenance strategies. As more aircraft are added to the fleet, the combinatorial explosion of the number of possible actions is identified as the main computational limitation. We conclude that the RL policy has potential to support fleet management operators and call for greater research on the application of multi-agent RL for fleet availability optimisation.
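To see the scale of the problem, suppose for illustration that each aircraft can, at each decision step, either fly a mission, stand by, or enter maintenance (a simplifying assumption, not the paper's action set); the joint action space then grows exponentially with fleet size.

```python
# Joint actions over a fleet when each aircraft has 3 individual options.
for n_aircraft in (3, 5, 10, 20):
    print(n_aircraft, 3 ** n_aircraft)   # 27, 243, 59049, 3486784401
```

This exponential growth is the combinatorial explosion identified above and one motivation for the multi-agent RL formulations the authors call for.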
Dive into the foundations of intelligent systems, machine learning, and control with this hands-on, project-based introductory textbook. Precise, clear introductions to core topics in fuzzy logic, neural networks, optimization, deep learning, and machine learning avoid the use of complex mathematical proofs and are supported by over 70 examples. Modular chapters built around a consistent learning framework enable tailored course offerings to suit different learning paths. Over 180 open-ended review questions support self-review and class discussion, over 120 end-of-chapter problems cement student understanding, and over 20 hands-on Arduino assignments connect theory to practice, supported by downloadable Matlab and Simulink code. Comprehensive appendices review the fundamentals of modern control and contain practical information on implementing the hands-on assignments using Matlab, Simulink, and Arduino. Accompanied by solutions for instructors, this is the ideal guide for senior undergraduate and graduate engineering students, and professional engineers, looking for an engaging and practical introduction to the field.
In this chapter, we introduce some of the more popular ML algorithms. Our objective is to provide the basic concepts and main ideas, show how to use these algorithms in Matlab, and offer some examples. In particular, we discuss essential concepts in feature engineering and how to apply them in Matlab. Support vector machines (SVM), K-nearest neighbor (KNN), linear regression, the Naïve Bayes algorithm, and decision trees are introduced, and the fundamental underlying mathematics is explained, while Matlab's corresponding Apps are used to implement each of these algorithms. A special section on reinforcement learning details the key concepts and basic mechanism of this third ML category. In particular, we showcase how to implement reinforcement learning in Matlab, make use of some of the Python libraries available online, and show how to use reinforcement learning for controller design.