Introduction
Rapid advancements in technologies, rise in population, and urbanization led to a high demand for electrical and electronic products. After usage due to their depreciation and damages, the products accumulate, which is a significant problem. The repair–reuse–recycle (RRR) process comes into action to manage this equipment’s wastage. To carry out this process smoothly, efficient planning has to be followed. Here, the disassembly planning comes into action. Disassembly planning is the process of forming an efficient plan to dismantle the products into separate entities. After that, the RRR process is carried out based on their quality and requirements.
The disassembly sequence planning (DSP) comes under the primary disassembly planning method. In DSP, a sequence is generated based on which the parts are disassembled. The sequence generation is based on the various input data of the product. Both humans and robots can be utilized to disassemble the products. But robot-based disassembly decreases manpower and time consumption. Also, it increases the safety of the laborers. DSP is considered a problem in the manufacturing sector because of its difficulty in planning sequences for complex products, time, and cost consumption.
After completing the assembling process in the manufacturing unit, the next stage is the repair and maintenance process. Based on the product model obtained from the assembling unit, the repair and maintenance of the products are done in factories and service centers. The disassembling process has three primary purposes. They are repairing, remanufacturing, and recycling the products. These three purposes help reduce the mining of new raw materials and encourage the re-usage of the same product/parts. However, in most cases, due to high time and cost consumption, the disassembling process is discouraged and not utilized by the manufacturers. So, to overcome this problem, an efficient method for generating a disassembly sequence must be planned. This particular disassembly sequence must reduce the respective time and cost consumption. DSP is classified based on the types of disassembly and levels of disassembly (Chand and Ravi, Reference Chand and Ravi2023). The type of DSP varies from product to product based on its structure and complexity.
Based on the type of disassembling process, DSP is divided into the following:
-
1. Sequential DSP – The disassembly of parts is sequentially done one by one.
-
2. Parallel DSP – The disassembly of parts is done in a parallel manner, where two or three parts are disassembled simultaneously at a time.
Similarly, DSP is divided into three types based on the levels of disassembly.
-
1. Complete DSP – Every part of the product is dismantled into individual parts.
-
2. Partial DSP – Dismantling is done up to a particular level. Based on the requirements, it can be the product’s initial, middle, or final levels.
-
3. Selective DSP – A particular part of the product is selectively disassembled based on its requirement.
Various algorithms have been employed for the DSP sequence generation process, including primary, traditional methods to advanced meta-heuristic techniques. In the DSP process, two primary stages are involved. Initially, the product model undergoes analysis, and disassembly attributes are extracted. Subsequently, algorithms analyze these disassembly attributes to generate both feasible and optimal/near-optimal disassembly sequences.
Disassembly attributes
The disassembly attributes are extracted from the computer-aided design (CAD) models of the product automatically. The various types of matrices used for product representation in this work are explained below using the ink–pen example shown in Figure 1.
-
1. Stability
-
2. Liaison
-
3. Geometric feasibility
-
4. Precedence
Stability matrix
The stability matrix (S) is generated based on the stability data between the parts. The stability data gives information about the product, whether the other parts of the product remain stable when one part is disassembled. The matrix shows the relationship between part i and part j. If a part i (Pi) faces no disturbance at any direction during the dismantling of part j (Pj), then this relation is considered as “completely stable” and denoted as “2” in the matrix S. In case Pi is stable in one direction and gets disturbed in another direction during the disassembly of Pj, then this relation is termed as “partially stable” and denoted as “1.” If the Pi has no possibility of proper stability during the disassembly of Pj, it is considered as “unstable” and denoted as “0” in the matrix S. This stability equation is given in Eq. (1), and the stability matrix for the considered ink–pen example is represented in Figure 2.
Liaison matrix
This matrix L denotes the contact relationship between the elements. If part i has contact or connection with part j, it is denoted by “1.” If not, it is denoted by “0.” The contact within the parts of a product is represented in this matrix. The liaison matrix equation is given in Eq. (2). The liaison matrix for the ink–pen example is given in Figure 3.
Disassembly feasibility matrix (geometric)
The parts’ geometric direction-based relationship is given in the disassembly feasibility matrix (D). A part can be disassembled in any direction based on the product structure. The ±XYZ directions (d) in which a part can be dismantled from another part are given as six matrices. If part i can be disassembled from part j in that direction, it is denoted as “1.” If it is not feasible for dismantling, it is denoted as “0.” The XY-axis represents the horizontal and vertical directions, respectively, whereas the Z-axis represents the gravitational direction. To explain the concept of geometric feasibility with a simple example, six directions are represented. In the case of more oversized products with complex connections, the number of directions is high. A part can be disassembled in different angles, which cannot be grouped into these six directions. In such cases, each possible angle will be considered as a direction (Anil Kumar et al., Reference Anil Kumar, Bahubalendruni, Prasad and Sankaranarayanasamy2021). This representation is termed as “oblique-directional interference matrix.” However, the objective will be to reduce the total number of directional changes that occurred during the disassembly process.
The disassembly feasibility matrix equation is given in Eq. (3). The disassembly feasibility matrix for ink–pen is given in Figure 4. The orientation change (Oc) score is calculated from the matrix (D) based on the number of possible directions between part connections i and j. It is given in Eq. (4).
Precedence matrix
The precedence matrix (Pr) gives information regarding the precedence relationship between the parts of a product. The matrix is constructed between any two parts i and j. It is given the value “1” if a particular part i has to be disassembled before part j. If the part has no dependency on the other part and can be disassembled freely, the matrix value is “0.” These matrix data are used to check the feasibility condition of the disassembly sequence. The precedence matrix for the ink–pen example is given in Figure 5, and its equation is given in Eq. (5).
Usually, a product comprises various parts with multiple connections, resulting in many possible disassembly sequences. As the product’s part count increases, the connections and its complexity also increase exponentially. In addition to this, DSP requires the processing of more product and materials’ data based on the objectives considered. This makes the DSP, a challenging and complex problem that requires an efficient and systematic framework to solve it and generate feasible and optimal sequences.
The objectives of DSP
The common objective of DSP is to minimize the time and cost taken for the disassembling of a complete product into separate parts. The primary disassembly time ( $ {D}_p $ ) is the essential time required to dismantle one part of the given product. A constant value is taken as the primary disassembly time (Luo et al., Reference Luo, Peng and Gu2016). The disassembly fitness function derived from the total time and cost incurred to do the disassembly process is used to evaluate the objective. The optimality of a solution is based on the minimization of disassembly time ( D t ) and disassembly cost ( D c ). The disassembly time ( D t ) depends on the number of directional changes. The disassembly cost is based on the cost consumed for various tools and labors used for the disassembly process.
In this work, due to the non-availability of actual cost data, the feasibility property between various parts is considered for the calculation of Dc. Here, the cost does not refer to the actual cost. Instead, it refers to how the parts’ connection affects the disassembly sequence’s feasibility. So, if a sequence has a high cost, the particular sequence is less feasible or non-feasible. Based on the liaison, stability attributes, and assumed weight factors, Dc is calculated. The liaison property is the basic need for a feasible sequence; hence, its weight factor must be high and taken as 10. If the connection of parts (Pi, Pj) that are to be disassembled has no liaison relationship between them, the initial cost $ ({D}_c=1 $ ) is increased by 10. The next important requirement is that the parts in the sequence must be either completely stable or partially stable, so their weight factor is given accordingly. If the connection of the parts in the sequence has low stability, the cost is increased by 3, and for no stability, it is 5. This is done in order to make the feasible sequence have a low-cost value. After calculating the cost for every part connection, the disassembly cost (Dc) data for all the part connections are generated. The disassembly time (Dt) is calculated based on the time penalty (T p ) and the primary disassembly time (Dp) as mentioned in Eq. (6). In T p , a time penalty of 3 is given for part connections with the number of feasible orientations less than or equal to 2. The Dp is assumed to be “1.” The DSP fitness function (Df) is based on the maximization of the inverse of Dt and Dc. It is given in Eq. (7).
Table 1 represents the disassembly sequences for the ink–pen with their respective directions and directional changes, which are generated through the optimization algorithms used for comparison purposes in this work.
Related works
Several research works have been published based on the DSP problem. For better clarity, the related works are divided based on their basic types and explained in this section.
Graph-based methods
At the beginning stage of research in DSP, graph-based methods were used by most of the researchers. The nodes denote the parts in the graph representation, and the edges represent the relationship between those parts. The most used graph-based approach is the AND/OR graph (De Floriani and Nagy, Reference De Floriani and Nagy1989), and an extended version of the AND/OR graph was introduced by Ma et al. (Reference Ma, Jun, Kim and Lee2011). Other graph methods include the extended process graph (Kim and Lee, Reference Kim and Lee2017; Tian et al., 2019a), the disassembly precedence graph (Han et al., Reference Han, Yu and Lee2013), and the graph cut-set method (Gunji et al., Reference Gunji, Pabba, Rajaram, Sorakayala, Dubey, Deepak, Biswal and Bahubalendruni2021). The graph-based approach is preferred because it generates feasible sequences from the product data. To process the uncertainties in dynamic DSP, Vongbunyong et al. (Reference Vongbunyong, Kara and Pagnucco2012) proposed a cognitive robot-based DSP approach. Later work (Vongbunyong et al., Reference Vongbunyong, Kara and Pagnucco2013) introduced a cognition-based robot’s basic and advanced behavior control strategies for disassembly planning. The graph-based method gives a proper representation of various disassembly possibilities, but the definition for graphs needs to be given manually; also, it has the problem of combinatorial explosion when generated in an automatic manner.
Matrix-based methods
Various matrix-based representations like contact (liaison), translational (Smith et al., Reference Smith, Smith and Chen2012), precedence (Azab et al., Reference Azab, Ziout and ElMaraghy2011; Ren et al., Reference Ren, Tian, Zhao, Yu and Zhang2017), directional, and interference (Kheder et al., Reference Kheder, Trigui and Aifaoui2017) matrices are proposed by researchers. These matrices depict the pairwise relationship between the parts of a product regarding their stability, priority, and space interference. Matrix-based data are the input in the optimization and sequence generation processes. Apart from graph and matrix-based approaches, Petri net-based (Petri and Reisig, Reference Petri and Reisig2008; Kuo, Reference Kuo2013) representation is also used. The matrix-based representation is the most suitable type for different computational processes. Another advantage is that the matrix data are obtained from the respective product’s CAD models automatically.
Mathematical and meta-heuristic methods
Various computational techniques and optimization algorithms are used to generate feasible or optimal disassembly sequences. Mathematics-based computational methods like branch-and-bound (Kim and Lee, Reference Kim and Lee2017), linear (Zhu et al., Reference Zhu, Sarigecili and Roy2013), and nonlinear (Ullerich, Reference Ullerich2014) methods are used to solve DSP. The mathematical models have the capability to find the optimal solutions, but the quality of the solution is entirely based on the product’s objective function and representation format. Meta-heuristic methods like genetic algorithm (GA) (Giudice and Fargione, Reference Giudice and Fargione2007; Hui et al., Reference Hui, Dong and Guanghong2008; Tseng and Lee, Reference Tseng and Lee2018), ant colony optimization (ACO) (Wang et al., Reference Wang, Liu, Li and Zhong2003; Kheder et al., Reference Kheder, Trigui and Aifaoui2017), particle swarm optimization (PSO) (Kheder et al., Reference Kheder, Trigui and Aifaoui2017), and artificial bee colony optimization (Ren et al., Reference Ren, Zhang, Zhao, Xiao and Tian2018; Tian et al., 2019a,b; Liu et al., Reference Liu, Zhou, Pham, Xu, Yan, Liu, Ji and Liu2018, Reference Liu, Zhou, Pham, Xu, Ji and Liu2020) are used by various researchers to solve the DSP problem. The meta-heuristic approaches can be applied to complex products to get near-optimal or optimal solutions, but the quality of solutions varies based on the constraints of that particular approach.
Hence, these approaches are enhanced and combined with other meta-heuristic methods to produce hybrid approaches (Tian et al., Reference Tian, Zhou and Li2018). For instance, Tseng et al. (Reference Tseng, Chang, Lee and Huang2019) introduced a hybrid technique based on ACO. A combined version of GA and artificial fish swarm algorithm was submitted by Guo et al. (Reference Guo, Zhong, Li, Du and Guo2019). The recent work of Xing et al. (Reference Xing, Wu and Qu2021) is based on an improved version of the max–min and ant colony system (IMMAS). Hybrid methods can generate better solutions when compared to single-heuristic or meta-heuristic methods. One main problem with these methods is that they are not in a straightforward manner. They use different techniques in different stages of the algorithm, which is not a generalized approach.
Other approaches in DSP
In addition to computational and optimization approaches, various advanced techniques and technologies have been explored for DSP. They are simulation-dependent techniques like CAD (Issaoui et al., Reference Issaoui, Aifaoui and Benamara2017), decision-making-based de-manufacturing (Anil Kumar et al., Reference Anil Kumar, Bahubalendruni, Prasad and Sankaranarayanasamy2021), virtual reality (Mitrouchev et al., Reference Mitrouchev, Wang and Chen2016, Reference Mitrouchev, Wang, Chen, Eynard, Nigrelli, Oliveri, Fajarnes and Rizzuti2017), and augmented reality (AR) (Osti et al., Reference Osti, Ceruti, Liverani and Caligiana2017; Chang et al., Reference Chang, Nee and Ong2020). The robot–human collaborative approach for DSP was implemented following the artificial bee colony algorithm (Liu et al., Reference Liu, Zhou, Pham, Xu, Yan, Liu, Ji and Liu2018, Reference Liu, Zhou, Pham, Xu, Ji and Liu2020; Xu et al., Reference Xu, Tang, Liu, Liu, Zhou and Pham2020). This work tries to minimize the labor process and utilizes human knowledge for both single-objective and multi-objective problems. All these advanced techniques can be used to generate disassembly sequences for specific products with special orientations so that they cannot be used as a general technique for other products, and benchmarking with standard algorithms is a difficult process.
Synthesis of literature study
Though many works are done in DSP, most are based on traditional optimization algorithms and nature-inspired heuristic algorithms. The human–robot collaborative approaches have been employed for various disassembly problems. But specifically, for DSP only a few works have been published. More research work can be done in those areas to solve the DSP problem. Machine learning approaches are used only to a limited extent in DSP due to its problems of slow convergence and local minima. To address these challenges, the utilization of adaptive parameter techniques has emerged as a viable solution. Numerous adaptive parameter techniques have been introduced across various learning problems, including learning automata-based approaches (Beigy and Meybodi, Reference Beigy and Meybodi2000, Reference Beigy and Meybodi2001; Meybodi and Beigy, Reference Meybodi and Beigy2000, Reference Meybodi and Beigy2002). Simulations of these techniques demonstrate their feasibility and effectiveness in various learning problems. In particular, the adaptation of parameters in the back-propagation (BP) algorithm has been applied to train multilayer neural networks. Inspired by this concept of parameter adaptation to overcome the problems of local minima and slow convergence rate, an enhanced simulated annealing (ESA) algorithm has been implemented in this work. By enhancing the convergence rates, a wide range of machine learning models can be explored to address the challenges in DSP (Syed Shahul Hameed and Rajagopalan, Reference Syed Shahul Hameed and Rajagopalan2022, Reference Syed Shahul Hameed and Rajagopalan2023).
Based on the study of related works, it is evident that there is a lack of exploration of machine learning methods, particularly reinforcement learning (RL), in solving DSP problems. Additionally, there is a significant need for an efficient framework capable of effectively handling various products.
Research contributions
To address these gaps, an optimized Q-learning (OQL)-based RL approach, specifically tailored for DSP (RL-DSP), is proposed in this research. The primary objective of this work is to overcome the challenges in DSP by utilizing the potential of RL technique. The contributions in this work are as follows:
-
1. A novel RL framework for DSP: An original RL framework has been contributed that generates optimal sequences for various products with diverse levels of complexity. While limited DSP attributes have been incorporated for processing in most existing works, all necessary attributes have been considered and appropriately weighted according to their influence in generating optimal disassembly sequences in this study.
-
2. Implementation of proposed ESA: An ESA is proposed in this work. Its innovative approach narrows the search space over time and optimally balances between exploration and exploitation, thus resulting in avoiding local optima and enhancing the likelihood of finding global optima.
-
3. The development of the OQL approach: The proposed OQL approach innovatively integrates an epsilon-greedy (EG) approach and the ESA. This approach improves the action selection process by introducing a flexible, adaptive learning model that increases the ability to solve problems more robustly. This leads to the OQL method outperforming the classic QL (CQL) technique in a variety of complex scenarios.
The proposed work is benchmarked against standard and state-of-the-art techniques, selected based on their demonstrated performance in current literature. The results show that the proposed RL-DSP approach provides better solutions in terms of optimality, as well as reduced time and memory consumption.
The organization of the remainder of this paper is as follows: Section “Background study” presents a background study of RL and QL in relation to the research. Section “Proposed methodology” outlines the proposed methodology, including the novel ESA approach and the OQL-based RL framework for DSP. Section “Experiments and analysis” analyzes the experimental study and discusses the results, highlighting the effectiveness of the proposed approach. Finally, the paper concludes with a summary of key findings, contributions, and suggestions for future research.
Background study
Reinforcement learning (RL)
The RL technique is explored because of its efficiency in handling nondeterministic polynomial time (NP)-hard problems (Sutton and Barto, Reference Sutton and Barto2018). This technique aims to make the agent select an appropriate action in the current state that produces the best results. Based on the reward (feedback) received for every action, the agent (decision-maker) analyzes the current state and environment (scenario) to take a particular action (step) at that time. The state of a given problem changes based on the actions picked by the agent. If the agent takes appropriate action to solve the problem, it is rewarded. The agent aims to get maximum reward points by taking the desired actions to maximize the reward based on its feedback experience. In this method, the state (S) denotes the group of all possible states in a problem, action (A(S)) represents the group of possible actions that can be taken at a particular state (S), and reward (R) denotes the reward points given to the agent for taking a desirable action; the penalty is the low reward or negative points given to the agent for taking an undesirable or wrong action, the cost is the measure that denotes the quality of the solution or state, and time indicates the period taken for the learning process. The general framework of RL is illustrated in Figure 6. The main task of an agent is policy $ (\pi $ ) learning, $ \pi $ : S → A. The policy (knowledge) learned must be able to generate a maximum of total rewards (M). Both learning rate (α) and discount factor (γ) must be 0 < α and γ < 1. The total rewards M are defined as r0 + γ*r1 + γ2*r2 + …
Classic Q-learning (CQL) approach
CQL is the standard QL technique, which is based on temporal difference (TD) learning. The updating of Q-values is based on the Q-value [Eq. (8)]. The calculated values are updated in the Q-table. From Eq. (8), s and a are the state and action at the current time (t). Qt (s, a) is the current Q-value considering the current state and the action taken. Qt + 1 is the Q-value of the following state (s’) and action (a’). R (s, a) denotes the reward obtained for that pair of (s, a). max Q’ (s’, a’) defines the maximum number of rewards gained given the new state (s’) and all possible actions at the new state. Algorithm 1 gives the step-by-step approach of the CQL algorithm. Figure 7 depicts the flowchart of the CQL process.
Algorithm 1. Classic Q-Learning (CQL) Algorithm.
-
1 Set the parameters: (learning rate α, discount factor γ)
-
2 For each pair of state (s) & action (a), Q-matrix (s,a) = 0
-
3 Observe the state s
-
4 repeat
-
5 Select the action a using Epsilon-greedy method
-
6 Take the action a
-
7 Receive immediate reward r(s, a)
-
8 Observe the new state s’
-
9 Update Q (s, a) with Eq. (8)
-
10 s = s’
-
11 until all the stopping criterion is satisfied
Epsilon-greedy (EG) approach
The exploitation and exploration trade-offs are the pivotal aspect of RL. The agent must choose the best from already exploited actions to maximize the rewards. Still, it is also required to explore more actions to find the other potentially best solutions for a given problem. For action selection, there are various strategies followed. One of them is the epsilon-greedy method. The epsilon $ (\varepsilon $ ) values should be in the range of $ 0<\varepsilon <1 $ . Initially, when the epsilon rates are higher (~1), the agents explore the environment more; eventually, the epsilon rate decreases, and consequently, the agent starts to exploit more.
Based on the increasing exploration process, the agent gets more knowledge about the environment and the required policy is built. The policy $ \pi $ (s) is applied according to the given Eq. (9) (Ottoni et al., Reference Ottoni, Nepomuceno, de Oliveira and de Oliveira2021). However, this EG approach lacks efficiency in its epsilon-decreasing structure.
$ \pi (s) $ denotes the decision policy for the current state s, a* denotes the best-estimated action for the state s at the current time, and ar denotes the random action selected with probability $ \varepsilon $ .
Proposed methodology
Proposed enhanced simulated annealing (ESA) algorithm
The proposed ESA algorithm is the enhanced version of the standard SA method. In the standard SA method, the new solution is found using the solution function and compared with the old for calculating the difference value (Kirkpatrick et al., Reference Kirkpatrick, Gelatt and Vecchi1983). The difference value is compared with the temperature (temp) factor, and the next decision is taken. For annealing, the temperature reduction factor (β) is used. The standard SA algorithm is given in Algorithm 2.
Algorithm 2. The Standard Simulated Annealing Algorithm.
-
1 Set the parameters: (temp, β)
-
2 solution_new = sol_function
-
3 diff = Solution_new - Solution_prev
-
4 repeat
-
5 if diff <0 or e-(−diff/temp) > random (0, 1));
-
6 solution = new_solution
-
7 temp = temp*β
-
8 until all the stopping criterion is satisfied
In SA, the temperature has to be reduced in a phased manner from its initial rate. This helps the algorithm achieve convergence. In ESA, the additional parameters ε, λ, and tr are used. The SA needs a more structured decreasing rate for the temperature. In order to solve this issue, λ and tr are added. λ and tr denote the temperature decay factor and the temperature regularizing criterion, respectively. Their values are given between 0 and 1. The temperature regularizing criterion is used to reduce the temperature rate in a slow-phased and structured manner, whereas the temperature decay factor is used to maintain the temperature value within the positive range of values.
Based on these parameters, the equation for ESA is formulated. The epsilon gets decreased for each run based on Eq. (10). Then, a decision is taken based on the comparison with random values as given in Algorithm 3.
Algorithm 3. Proposed Enhanced Simulated Annealing (ESA) Algorithm.
-
1 Set the parameters: (temp, $ \lambda $ , tr)
-
2 solution_new = sol_function
-
3 diff = Solution_new – Solution_prev
-
4 repeat
-
5 if diff <0 or (ε > random (0, 1));
-
6 solution = new_solution
-
7 $ \varepsilon =\varepsilon $ – (1 / (log (temp+𝜆 ) + tr)
-
8 until all the stopping criterion is satisfied
Proposed optimized Q-learning (OQL) approach
The existing EG approach in CQL uses immense randomness to decrease the epsilon rate, resulting in unstructured decay of epsilon values. A structured decrease in the epsilon rate is required to obtain more accurate results from the algorithm. The proposed ESA approach acts as a good optimization technique to gradually reduce the epsilon value. Its cooling schedule, which reduces the exploration rate over time, aligns well with the RL learning process. This approach is used in conjunction with the standard EG approach to make the eventual decay of epsilon value more controlled and organized. To define the standard values for the various parameters in this proposed OQL technique, the temperature (temp) is defined as the total number of iterations. This is because the more iterations the program runs, the less exploration is required to find the best solution.
Thus, 1/log(temp) decreases the epsilon value based on the number of iterations consumed by the problem. Additional parameters are used to add more regularization to the decaying epsilon values. The value of decay rate $ (\lambda $ ) is set as (1-α) due to the interdependence of exploration and learning factors. The decreasing rate of epsilon is also based on the learning rate value of the algorithm. As the agent learns more about the environment, less effort is required for exploration. There is a need for a regularizing value that further adds structure to the decreasing rate. For that, the temperature regularizing criterion (tr) is set to the range (0, 1) to maintain the same proportion as the other parameters used in RL. The algorithm for the proposed OQL technique is given in Algorithm 4.
Algorithm 4. Proposed Optimized Q-Learning Technique (OQL) Algorithm.
-
1 Set the parameters: (learning rate α)
-
2 For each pair of state (s) & action (a), Q-matrix (s,a) = 0
-
3 Observe the state s
-
4 repeat
-
5 Select the action a using OQL method
ε ∈ (0, 1); ¥ = random(0,1); tr = 0.1995
temp = no. of. Iterations; $ \lambda $ = (1-α);
ε - = 1 / (log (temp + $ \lambda $ ) + tr);
If ¥ > ε:
Select the best action
Else:
Select a random action
-
6 Take the action a
-
7 Receive immediate reward r(s, a)
-
8 Observe the new state’s
-
9 Update Q (s, a) with Eq. (6)
-
10 s = s’
-
11 until all the stopping criterion is satisfied
For the proposed ESA [Eq. (10)], the following values are taken: epsilon (ε) = 1.0; temperature (temp) = no. of iterations; $ \lambda $ = 1 - learning rate (α); and temperature regularizing criterion (tr) = 0.1995.
To determine the optimal value for tr, a range of values from 0.01 to 0.2 was tested through a series of experiments. It was observed that initially the values of (Df) showed significant variation, but they eventually converged when the value of tr exceeded 0.1995. These experiments were conducted across all eight different products, and the results consistently indicated that the OQL method achieved better convergence and higher quality solutions when tr was equal to or greater than 0.1995. Based on these findings, tr is chosen as 0.1995. Since the OQL technique offers better convergence and higher quality solutions by controlling the decay of epsilon values in a structured manner, this proposed OQL technique helps in obtaining the required results in less time compared to CQL.
Proposed RL framework for DSP
RL is preferred to solve DSP because of its efficiency in solving optimization problems. The supervised and unsupervised learning techniques are not used because of the unavailability of huge datasets for the DSP problem. DSP requires the processing of disassembly attributes and generating the solution based on the objectives but without any training process, which is different from the usual problems solved using supervised/unsupervised learning techniques. For applying RL to DSP, the precedence, stability, liaison, and geometric feasibility data are considered and taken as conditions to generate the reward and penalty values. The initial state is the complete product with all parts connected, and the terminal state is the disassembled product with individual parts. The disassembly sequence is the required output. The primary objective of the RL agent is to generate a sequence that consumes less disassembly time and cost. Every part of a product’s disconnection is taken as an action to perform. So, the state and action of DSP are formulated based on the parts disassembled and the remaining parts to be disassembled.
The main objective of this RL framework is to prepare the agent to predict an efficient DSP sequence that is feasible and optimal with minimal disassembly directions. The RL structure for DSP is given as follows:
States: The states are based on the total number (n) of parts of the product to be disassembled. Initially, the complete product with all the products is the starting state. As each part is removed from the product, the state gets changed eventually, and at the final state, only the last part will be left.
Action: Action is the disassembling of one part from the remaining parts of the product. Action is taken based on the information of the set of parts that can be possibly disassembled given the current state. In each state, an action is taken (a part is removed), and it is carried out till there is no part left to disassemble.
Rewards: These functions are defined to associate the disassembly time/cost with the dismantling process. For each action (disassembly of a part), a reward is given. This reward generation is based on the disassembly fitness (Df) function given in Eq. (11). The better the optimality, the higher the reward is obtained.
RL-DSP methodology
The RL-DSP methodology proposed in this work is built on four steps.
-
1. The stability, liaison, and geometric data are processed to generate the cost (Dc) and time values (Dt) of the disassembly. Then, the disassembly fitness (Df) value is calculated based on the Dc and Dt values.
-
2. The Df values are given as the reward matrix to the RL program; then, the actions are taken based on the OQL method. This is followed by the feasibility checking process using precedence data.
-
3. Once the feasibility conditions are satisfied, the Q-table is generated based on the states, actions, and the QL formula given in Eq. (8).
-
4. Based on the Q-table values, the RL agent generates the disassembly sequences. The parameters tr, α, γ, and λ are tuned to get better optimality.
The flowchart of the proposed RL-DSP framework is shown in Figure 8. The calculation process of fitness and reward values is given in Algorithm 5. The DSP attributes are considered, and the D f value is calculated based on the Dc and Dt values. The Dc and Dt values are allotted to each attribute based on their importance in generating feasible and optimal sequences. Reward values are declared as directly proportional to Df. So, the main objective of the agent will be to maximize the rewards, thereby increasing the fitness value.
Algorithm 5. Proposed RL-DSP Algorithm.
-
1 Initialize RL-DSP product data; (stability, liaison, geometric feasibility); Dp = 1;
-
2 Calculate Dc, Tp, Dt, D f
-
3 repeat
-
4 If L = 0,
Dc = 15;
Else,
Dc = 1;
-
5 If S = 0,
Dc = Dc + 10
Else if S = 1,
Dc = Dc + 5
Else Dc = Dc
-
6 If Ocij < 2, then Tp = Tp + 3
Else, Tp = 0
-
7 Dt = Tp + Dp
-
8 D f = 1/(Dc + Dt)
-
9 Reward = 1000*D f
-
10 until all the stopping criterion is satisfied
The rewards’ data are taken as the OQL algorithm’s input. Randomly, a part is selected, and Q-value calculation is initiated. It is followed by the generation of subsequent part connection’s Q-values. The OQL is followed for taking proper action after analyzing the exploration and exploitation factors. Based on that analysis, the algorithm either searches for new solutions in the entire search space or searches for the best solution within the explored space. Then, the action is checked for feasibility conditions based on the precedence matrix (Pr). If it satisfies the criteria, the action is selected, else another action is taken. This process is repeated until all the part connections are processed. Then, the Q-values are calculated for all the part disassembles, and the final sequence is generated. Finally, a disassembly sequence with the highest reward is chosen as the solution by the agent.
Experiments and analysis
Several algorithms presented in the current literature were analyzed for their performance, and the selection was based on the following factors:
-
1. This study preferred methods widely employed for solving DSP problems across various products, including traditional methods such as brute force (BF), dynamic programming (DP), and GM. In the case of meta-heuristic methods, ACO and GA are often preferred, either in their original, enhanced, or hybrid forms. Hence, these methods are considered for comparison purposes.
-
2. With this consideration, the recent advancement in DSP is the improved max–min ant system (IMMAS), which is included in the comparison (Xing et al., Reference Xing, Wu and Qu2021).
-
3. Some methods were excluded due to their inferior performance or unsatisfactory results (Alshibli et al., Reference Alshibli, El Sayed, Kongar, Sobh and Gupta2016; Ren et al., Reference Ren, Tian, Zhao, Yu and Zhang2017).
-
4. Additionally, many algorithms proposed in other research were product-specific, limiting their effectiveness across various products (Guo et al., Reference Guo, Zhong, Li, Du and Guo2019; Tseng et al., Reference Tseng, Yu and Huang2011; Yeh et al., Reference Yeh, Lin and Wei2012; Wu et al., Reference Wu, Cao and Wang2019).
-
5. The CQL method, which employs a standard QL approach, provides a necessary comparison to the proposed OQL method, an enhanced version of CQL.
Experimental setup
The RL-based DSP program is implemented as Python program and tested on different products using the PyCharm IDE running on a Windows 10 system. The results are analyzed and visualized using the Matplotlib library. The system used for testing has the following specifications: 8GB random access memory (RAM), 1 TB hard disk drive (HDD), and an Intel i5 6th Generation processor. For testing the efficiency of the proposed RL-DSP technique, eight different products with disassembly attributes are taken into consideration. The product information is given in Table 2.
Out of eight products, five are taken from the literature work from which the data are extracted manually, and for the remaining three products, the attributes are obtained from their CAD models. All these products are of different complexities with multidirectional disassembly feasibilities.
Influence of parameters
The proposed RL-DSP framework has been tested on eight different products with multiple parts across various episodes (100, 200, 500, and 1000). Along with the number of episodes, parameters such as temperature regularizing criterion (tr), alpha (α), gamma (γ), and epsilon (ε) also play a crucial role in generating solutions. Different parameter settings have been explored to generate optimal sequences for the diverse set of products. Considering the satisfactory results obtained within 100 episodes, the default number of episodes is set to 100. To determine the appropriate tr value, fitness values of various disassembly sequences are compared with their respective tr values across 10 runs. The experimental study reveals that tr values below 0.1895 exhibit significant variation in solution quality. However, for tr values of 0.1995 and 0.2095, the quality of solutions remains consistent. This can be attributed to tr, which facilitates the reduction of temperature and structured decay of epsilon rate. Consequently, the algorithm achieves a more organized and controlled epsilon decay, leading to improved convergence. Similarly, the values of $ \alpha =0.825 $ , $ \gamma =0.35 $ , and $ \lambda =0.0125 $ are chosen based on their ability to produce high-quality results.
Based on the experimentation, it is concluded that higher learning rates (α) and lower discount factors (γ) contribute to the generation of optimal solutions. On the other hand, higher values of λ result in increased randomness of epsilon (ε), which negatively affects the convergence rate, leading to higher iterations and time consumption.
Evaluation metrics
For the evaluation of the proposed RL framework, three metrics are considered. They are time consumption, memory consumption, and the optimality (fitness value) of the solution. The time consumption given in seconds (s) denotes the period required by the various optimization methods to give the results. The memory consumption given in terms of megabytes (MB) refers to the amount of RAM consumed by the application (optimization methods) to run the program and provide the results. The optimality of the solution is determined by fitness values, where higher values indicate better quality results. An optimal sequence with a high fitness value exhibits feasibility, stability, and minimal directional changes. In this work, the optimality criterion is categorized as no-results, infeasible, near-optimal, and best near-optimal results for clarity and explanation purposes. Based on these three metrics, the performance of the RL-DSP framework (CQL and OQL) and other algorithms is compared.
Results and discussion
A detailed analysis and a comprehensive discussion interpreting the results based on its performance metrics are given in this section.
Time and memory observation
The exact methods explore all the possible solutions. Hence, it produces correct results all the time but consumes more time for the huge number of parts. The optimization algorithms such as ACO, GA, and IMMAS process only a selected set of possible solutions based on their strategies. Being strategic in their processing of solutions, these algorithms give near-optimal results for products of high complexity and matched optimal solutions for smaller or less complex products. Thus, they demonstrate a lower consumption of time and memory compared to the exact methods. The gap in metrics such as time and memory between the tested algorithms is mainly due to the nature of their processing and searching mechanisms. Table 3 and Figure 9 show the time consumption comparisons of the various algorithms for the considered products. Similarly, the memory consumption comparison is given in Table 4 and Figure 10.
No Solution Infeasible Solution Near-Optimal Solution Best Near-Optimal Solution
No Solution Infeasible Solution Near-Optimal Solution Best Near-Optimal Solution
From Tables 3 and 4, it is evident that traditional algorithms exhibit increased time and memory consumption for products with larger part counts. The BF approach requires more than 500 seconds and 900 MB of memory for products with more than nine parts. The DP approach experiences longer execution time and memory usage for 21-part products and does not consistently yield optimal results. Similarly, the GM consumes more time and memory for products with 16 and 21 parts, failing to provide optimal solutions for products with fewer parts (8, 12, and 13). In terms of optimization algorithms, IMMAS demonstrates superior time performance compared to ACO and GA.
Additionally, IMMAS delivers best near-optimal solutions for products of up to 13 parts. Following IMMAS, GA exhibits satisfactory performance in terms of time and memory consumption for sequence generation compared to ACO. ACO performs better than all the traditional algorithms but falls short when compared to other optimization algorithms such as GA and IMMAS. In the case of RL methods, both the CQL and the proposed OQL outperform other algorithms in terms of solution quality. The advantage of RL methods is their efficient learning and decision-making processes. They offer best near-optimal solutions for all products by learning from the consequences of past actions, while minimizing time and memory consumption.
When comparing CQL and the proposed OQL, it is observed that OQL consumes less time and memory than CQL. From the observation, the OQL approach seems to give better performance for products with more parts. The graphical representation of these data in Figures 9 and 10 indicates that the overall time and memory consumption of RL methods are lesser than those of other algorithms. Moreover, the time and memory consumption for RL methods steadily increases as the number of parts in the product rises. When comparing CQL and OQL methods, the OQL technique demonstrates a slight advantage and performs better for products with more parts.
Fitness and reward value observation
The fitness values for the disassembly sequences generated by all the considered algorithms are presented in Table 5 and visualized in Figure 11. High fitness values indicate stable sequences with minimal directional changes. From Figure 11, it is observed that both RL methods (CQL and OQL) consistently generate sequences with high fitness values for all the products considered. On the other hand, traditional algorithms, due to their high time consumption, are limited to generating solutions within 900 seconds, resulting in poorer fitness values. Both RL methods, CQL and OQL, provide best near-optimal solutions with slight differences in fitness values for large products. Although this difference may seem small, it can have a significant impact on the performance of larger and more complex products.
No Solution Infeasible Solution Near-Optimal Solution Best Near-Optimal Solution
To compare the performance of CQL and OQL techniques in the RL aspect, the rewards obtained over 100 episodes by both RL methods for four products with 4, 8, 12, and 21 parts are analyzed and visualized in Figure 12. These four products are selected specifically to showcase the variation in rewards obtained in both RL methods.
Observing Figure 12, it is evident that for the 4-part product, the highest reward is achieved at the 40th episode and starts converging in the OQL method, while the CQL approach reaches its highest reward only at the end of 80 episodes. In the case of the 8-part product, convergence occurs within 50 episodes with OQL, while CQL starts converging only from the 80th episode. For the 12-part product, convergence begins at 50 episodes with OQL, whereas CQL achieves convergence only after 70 episodes. In the case of a large product with 21 parts, OQL produces the best result after 50 episodes, while CQL, with variations, fails to achieve high reward result within 100 episodes.
Summarizing the results, it is clear that OQL achieves good solutions within 100 episodes, with the rewards stabilizing beyond 60 episodes on average for all the considered products. In contrast, CQL exhibits more variations in rewards and only reaches good solutions toward the end of the 100 episodes in most of the cases. Based on this experimental study, it is evident that the OQL method shows significant improvements over CQL with the EG approach. The introduction of the ESA technique in OQL allows for a more structured and organized decay of epsilon values, resulting in faster convergence and higher-quality solutions compared to the CQL method. The results clearly indicate that the OQL method can provide better results faster than CQL, making it a more efficient approach for addressing the DSP problem.
Conclusion
The need for an efficient DSP method for effectively managing the repair–reuse–recycle (RRR) process has been addressed in this work. A RL framework has been proposed for the DSP problem. The QL technique has been employed for generating optimal disassembly sequences. To resolve the exploration–exploitation dilemma, an OQL method based on the proposed ESA technique has been introduced. The proposed RL framework with the OQL approach outperforms the standard benchmark algorithms and state-of-the-art frameworks in terms of time, memory consumption, and solution optimality. The optimality of the solution has been evaluated using the DSP objective function. The results have demonstrated that the proposed RL-DSP framework is effective for various products and yields best near-optimal results. In conclusion, this work has demonstrated that the DSP problem can be effectively solved using the RL approach. Moreover, when the proposed ESA method is incorporated into the QL technique, it has been shown to produce superior results compared to the CQL method. In future work, the employment of deep RL (DRL) techniques to handle large products with multiple parts, connections, and sub-connections is planned.
Data availability statement
The data that support the findings will be available upon valid request.
Funding statement
This work received no specific grant from any funding agency, commercial or not-for-profit sectors.
Competing interest
The authors declare none.
M.C. is Research Scholar in the Department of Computer Science and Engineering at the National Institute of Technology Puducherry, Karaikal. His research interests include artificial intelligence, machine learning, and optimization.
C.R. is Assistant Professor in the Department of Computer Science and Engineering at the National Institute of Technology Puducherry, Karaikal. His area of interests includes artificial intelligence, soft computing, and augmented reality.