
Optimal Wireless Information and Power Transfer Using Deep Q-Network

Published online by Cambridge University Press:  01 January 2024

Yuan Xing*
Affiliation:
Department of Engineering & Technology, University of Wisconsin-Stout, Menomonie, WI, USA
Haowen Pan
Affiliation:
Shanghai Legit Network Technology Co, Shanghai, China
Bin Xu
Affiliation:
Shanghai Legit Network Technology Co, Shanghai, China
Cristiano Tapparello
Affiliation:
Department of Electrical and Computer Engineering, University of Rochester, Rochester, NY 14627, USA
Wei Shi
Affiliation:
Department of Engineering & Technology, University of Wisconsin-Stout, Menomonie, WI, USA
Xuejun Liu
Affiliation:
Department of Engineering & Technology, University of Wisconsin-Stout, Menomonie, WI, USA
Tianchi Zhao
Affiliation:
Department of Electrical and Computer Engineering, University of Arizona, Tucson, AZ 85721, USA
Timothy Lu
Affiliation:
Department of Engineering & Technology, University of Wisconsin-Stout, Menomonie, WI, USA
*
Correspondence should be addressed to Yuan Xing; [email protected]

Abstract

In this paper, a multiantenna wireless transmitter communicates with an information receiver while radiating RF energy to surrounding energy harvesters. The channel between the transceivers is known to the transmitter, but the channels between the transmitter and the energy harvesters are unknown. By designing its transmit covariance matrix, the transmitter aims to fully charge the energy buffers of all energy harvesters in the shortest amount of time while maintaining the target information rate toward the receiver. At the beginning of each time slot, the transmitter determines which beam pattern to transmit with. Throughout the whole charging process, the transmitter never estimates the energy harvesting channel vectors. Because of the high complexity of the system, we propose a novel deep Q-network algorithm to determine the optimal transmission strategy. Simulation results show that the deep Q-network is superior to existing algorithms in terms of the time required to complete the wireless charging process.

Type
Research Article
Creative Commons
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
Copyright © 2021 Yuan Xing et al.

1. Introduction

For a wireless transceiver pair with multiple antennas, optimizing the transmit covariance matrix can achieve high data-rate communication over the multiple-input multiple-output (MIMO) channel. Meanwhile, the radiated radio frequency (RF) energy can be acquired by the nearby RF energy harvesters to charge the electronic devices [Reference Zhang and Ho1].

The problem of simultaneous wireless information and power transfer (SWIPT) has been widely discussed in recent years. SWIPT systems fall into two categories: (1) the receiver splits the received signal between information decoding and energy harvesting [Reference Lee, Liu and Zhang2, Reference Zong, Feng, Yu, Zhao, Yang and Hu3]; (2) separate, dedicated information decoders (ID) and RF energy harvesters (EH) exist in the system [Reference Xing, Qian and Dong4]. For the second type of system, different transmission strategies have been proposed to achieve good performance points in the rate-energy region [Reference Zhang and Ho1, Reference Lee, Liu and Zhang2, Reference Park and Clerckx5]. For multiple RF energy harvesters in the vicinity of the wireless transmitter, the covariance matrix at the transmitter is designed either to maximize the net energy harvesting rate or to distribute the radiated RF energy fairly among the harvesters [Reference Wu, Zhang, Wang and Wang6, Reference Thudugalage, Atapattu and Evans7], while the achievable information rate of the wireless transmitter-receiver pair is kept above a minimum requirement for reliable communication. Most existing works assume that the channel state information (CSI) is completely known. Given complete CSI, the transmitter designs the transmit covariance matrix to achieve the maximum information rate while satisfying the RF energy harvesting requirement [Reference Xing, Qian and Dong4, Reference Xing and Dong8].

However, in practice, it is difficult for the transmitter to obtain the channel state information of the nearby RF energy harvesters, because the scattered deployment of the hardware-limited energy harvesters makes channel estimation at the harvesters challenging [Reference Mekikis, Antonopoulos, Kartsakli, Lalos, Alonso and Verikoukis9, Reference Xu and Zhang10]. The analytic center cutting plane method (ACCPM) was proposed to let the transmitter approximate the channel information iteratively from a few bits of feedback sent by the RF energy receiver [Reference Xu and Zhang10]. Since this method requires solving a convex optimization problem at each iteration, it has high computational complexity. To reduce the complexity, channel estimation based on Kalman filtering was proposed [Reference Choi, Kim and Chung11]; its disadvantage, however, is a slow convergence rate. To deal with the CSI acquisition problem effectively, in this paper we use a deep learning algorithm to solve the optimization problem in the SWIPT system with only partial channel information. The partial CSI is easy to acquire and, as we show, is sufficient to achieve superior system performance with a deep Q-network. To the best of our knowledge, this is the first work to use a deep Q-network to optimize SWIPT system performance and validate its superiority.

In our model, the transmitter intends to fully charge the energy buffers of all surrounding energy harvesters in the shortest time while maintaining a target information rate toward the receiver. The communication link is a strong line-of-sight (LOS) transmission, which is assumed to be invariant, while the energy harvesting channel conditions vary over time. Due to current hardware limitations, we assume that estimation of the energy harvesting channel vectors cannot be implemented under fast-varying channel conditions. As a result, the wireless charging problem can be modeled as a high-complexity discrete-time stochastic control process with unknown system dynamics [Reference Mnih, Kavukcuoglu and Silver12]. A similar problem was explored in [Reference Xing, Qian and Dong13], where a multiarmed bandit algorithm is used to determine the optimal transmission strategy. In this paper, we apply a deep Q-network to solve the optimization problem, and the simulation results demonstrate that the deep Q-network algorithm outperforms the multiarmed bandit algorithm. Historically, the deep Q-network has a proven record of attaining mastery over complex games with a very large number of system states and unknown state transition probabilities [Reference Mnih, Kavukcuoglu and Silver12]. More recently, deep Q-networks have been applied to complex communication problems and have been shown to achieve good performance [Reference He, Zhang and Yu14–Reference Xu, Wang, Tang, Wang and Gursoy16]. For these reasons, we find the deep Q-network well suited to our model. We consider the accumulated energy of the energy harvesters as the system state, and we define the action as the transmit power allocation. At the beginning of each time slot, each energy harvester feeds back its accumulated energy level to the wireless transmitter; the transmitter collects this information to form the system state and inputs it into a well-trained deep Q-network. The deep Q-network outputs the Q values corresponding to all possible actions, and the action with the maximum Q value is selected as the beam pattern for transmission during the current time slot.

Building on the traditional deep Q-network, the double deep Q-network and dueling deep Q-network algorithms are applied in order to reduce the observed overestimation [Reference Van Hasselt, Guez and Silver17] and improve the learning efficiency [Reference Wang, Schaul, Hessel, Van Hasselt, Lanctot and De Freitas18]. Accordingly, we apply a dueling double deep Q-network to solve the multiple-energy-harvester wireless charging problem under time-varying channels.

The novelties of this paper are summarized as follows:

  (i) The simultaneous wireless information and power transfer problem is formulated as a Markov decision process (MDP) under unknown, time-varying channel conditions for the first time.

  (ii) The deep Q-network algorithm is applied to solve the proposed optimization problem for the first time. We demonstrate that, compared to the other existing algorithms, the deep Q-network achieves more efficient and more stable wireless power transfer.

  (iii) Multiple experimental scenarios are explored. By varying the number of transmit antennas and the number of energy harvesters in the system, the performance of the deep Q-network and of the other algorithms is compared and analyzed.

  (iv) The evaluation of the algorithms is based on real experimental data, which validates the effectiveness of the proposed deep Q-network in real-time simultaneous wireless information and power transfer systems.

The rest of the paper is organized as follows. In Section 2, we describe the simultaneous wireless information and power transfer system model. In Section 3, we model the optimization problem as a Markov decision process and present a deep Q-network algorithm to determine the optimal transmission strategy. In Section 4, we present our simulation results for different experimental environments. Section 5 concludes the paper.

2. System Model

As shown in Figure 1, an information transmitter communicates with its receiver while K nearby RF energy harvesters collect its radiated RF energy [Reference Xing and Dong8]. Both the transmitter and the receiver are equipped with M antennas, while each RF energy harvester is equipped with a single receive antenna. The baseband received signal at the receiver can be represented as

(1) $\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{z},$

where $\mathbf{H} \in \mathbb{C}^{M \times M}$ denotes the normalized baseband equivalent channel from the information transmitter to its receiver, $\mathbf{x} \in \mathbb{C}^{M \times 1}$ represents the transmitted signal, and $\mathbf{z} \in \mathbb{C}^{M \times 1}$ is the zero-mean circularly symmetric complex Gaussian noise with $\mathbf{z} \sim \mathcal{CN}(\mathbf{0}, \rho^2 \mathbf{I})$.

FIGURE 1: Wireless information transmitter and receiver surrounded by multiple RF energy harvesters.

The transmit covariance matrix is denoted by $\mathbf{Q}$, i.e., $\mathbf{Q} = \mathbb{E}[\mathbf{x}\mathbf{x}^H]$. The covariance matrix is Hermitian positive semidefinite, i.e., $\mathbf{Q} \succeq 0$. The transmit power is restricted by the transmitter's power constraint P, i.e., $\mathrm{Tr}(\mathbf{Q}) \le P$. For the information transmission, we assume that a Gaussian codebook with infinitely many code words is used for the symbols and that the expectation of the transmit covariance matrix is taken over the entire codebook. Therefore, $\mathbf{x}$ is zero-mean circularly symmetric complex Gaussian with $\mathbf{x} \sim \mathcal{CN}(\mathbf{0}, \mathbf{Q})$. With transmitter precoding and receiver filtering, the capacity of the MIMO channel is the sum of the capacities of the parallel noninterfering single-input single-output (SISO) channels (eigenmodes of channel $\mathbf{H}$) [Reference Biglieri, Calderbank, Constantinides, Goldsmith, Paulraj and Poor19]. We convert the MIMO channel to M eigenchannels for information and energy transfer [Reference Timotheou, Krikidis, Karachontzitis and Berberidis20, Reference Mishra and Alexandropoulos21]. A singular value decomposition (SVD) of $\mathbf{H}$ gives $\mathbf{H} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^H$, where $\boldsymbol{\Sigma} = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_M)$ contains the M singular values of $\mathbf{H}$. Since the MIMO channel is decomposed into M parallel SISO channels, the information rate is given by

(2) $r = \sum_{m=1}^{M} \log\left(1 + \frac{\sigma_m^2 \hat{q}_m}{\rho^2}\right),$

where $\hat{q}_m$ are the diagonal elements of $\hat{\mathbf{Q}}$ with $\hat{\mathbf{Q}} = \mathbf{V}^H \mathbf{Q} \mathbf{V}$.
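To make the eigenchannel decomposition concrete, the following Python sketch (illustrative only; the channel matrix, power allocation, and noise power are assumed values, not the measured ones of Section 4) computes the singular values of H and evaluates the rate of equation (2) for a given power allocation.

```python
import numpy as np

def information_rate(H, q_hat, noise_power):
    """Sum rate over the parallel SISO eigenchannels of H, as in equation (2)."""
    sigma = np.linalg.svd(H, compute_uv=False)   # singular values sigma_m of H
    return np.sum(np.log2(1.0 + sigma**2 * q_hat / noise_power))

# Example: a random 3x3 complex channel and an even allocation of P = 12 W.
rng = np.random.default_rng(0)
H = (rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))) / np.sqrt(2)
q_hat = np.array([4.0, 4.0, 4.0])                # power per eigenchannel (sums to P)
print(information_rate(H, q_hat, noise_power=1e-10))
```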

The received power at an RF energy harvester is defined as the harvested energy normalized by the baseband symbol period and scaled by the energy conversion efficiency. The received power at the ith energy harvester is

(3) $p_i = \mathbf{g}_i^H \mathbf{Q} \mathbf{g}_i,$

where $\mathbf{g}_i \in \mathbb{C}^{M \times 1}$ is the channel vector from the transmitter to the ith energy harvester. With the MIMO channel decomposition, the received power at energy harvester i can be written as

(4) $p_i = \sum_{m=1}^{M} |\hat{g}_{im}|^2 \hat{q}_m,$

where $\hat{g}_{im}$ are the elements of the vector $\hat{\mathbf{g}}_i$ with $\hat{\mathbf{g}}_i = \mathbf{V}^H \mathbf{g}_i$.

We define the simplified channel vector from the transmitter to the ith RF energy harvester as

(5) $\mathbf{c}_i = \left[|\hat{g}_{i1}|^2, |\hat{g}_{i2}|^2, \ldots, |\hat{g}_{iM}|^2\right]^T,$

for each $i \in \mathcal{K} = \{1, 2, \ldots, K\}$. The simplified channel vector contains no phase information. The K simplified channel vectors compose the matrix $\mathbf{C} \in \mathbb{R}^{M \times K}$ as

(6) $\mathbf{C} = [\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_K].$
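As an illustration of equations (4)-(6), the sketch below builds the simplified channel matrix C and computes the power received by each harvester for a given beam pattern (the channel values and the beam pattern are hypothetical, chosen only for this example):

```python
import numpy as np

def harvested_power(C, q_hat):
    """Received power p_i = sum_m |g_hat_im|^2 * q_hat_m for every harvester.

    C is the M x K matrix whose columns are the simplified channel vectors
    c_i = [|g_hat_i1|^2, ..., |g_hat_iM|^2]^T, so the power vector is C^T q_hat.
    """
    return C.T @ q_hat

# Two harvesters, three eigenchannels (values chosen for illustration only).
c1 = np.array([0.05, 0.59, 0.11])
c2 = np.array([0.04, 0.19, 0.51])
C = np.column_stack([c1, c2])
q_hat = np.array([2.0, 8.0, 2.0])    # one feasible beam pattern
print(harvested_power(C, q_hat))     # power at harvester 1 and harvester 2
```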

In what follows, we assume that time is slotted, that each time slot has a duration T, and that each energy harvester is equipped with an energy buffer of size $B_i \in [0, B_{\max}]$, $i \in \mathcal{K}$. Without loss of generality, we assume that, at t = 0, all harvesters' buffers are empty, which corresponds to the system state $s_0 = (0, 0, \ldots, 0)$. In a generic time slot t, the transmitter transmits with one of the designed beam patterns. Each harvester i harvests a specific amount of power $p_i$, and its energy buffer increases to $B_i(t+1) = B_i(t) + p_i T$. Therefore, each state of the system includes the accumulated harvested energy of all K harvesters, i.e.,

(7) $s_t = \left(B_1(t), B_2(t), \ldots, B_K(t)\right),$

where $B_i(t)$ denotes the ith energy harvester's accumulated energy up to time slot t.

Once all harvesters are fully charged, we assume that the system arrives at a final goal state denoted as $s_G = (B_{\max}, B_{\max}, \ldots, B_{\max})$. We note that the energy buffer level $B_{\max}$ also accounts for situations in which $B_i > B_{\max}$.
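A minimal sketch of the buffer dynamics described above, assuming the per-slot harvested powers are already known (the helper name and the numerical values are illustrative):

```python
import numpy as np

def step_buffers(B, p, T, B_max):
    """Advance the energy buffers by one time slot: B_i(t+1) = B_i(t) + p_i * T,
    clipped at B_max, and report whether the goal state s_G has been reached."""
    B_next = np.minimum(B + p * T, B_max)
    reached_goal = bool(np.all(B_next >= B_max))
    return B_next, reached_goal

B = np.zeros(2)                                  # s_0: both buffers empty
B, done = step_buffers(B, p=np.array([0.8e-3, 1.1e-3]), T=0.1, B_max=1.6e-3)
print(B, done)
```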

3. Problem Formulation for Time-Varying Channel Conditions

In this section, we suppose that the communication link is characterized by strong LOS transmission, which results in an invariant channel matrix H, while the energy harvesting channel vector g varies over time slots. We model the wireless charging problem as a Markov decision process (MDP) and show how to solve the optimization problem using reinforcement learning (RL). When the number of system states is very large, we apply a deep Q-network algorithm to acquire the optimal strategy at each particular system state.

3.1. Problem Formulation

In order to model our optimization problem as an RL problem, we define the beam pattern chosen in a particular time slot t as the action $a_t$. The set of possible actions A is obtained by generating L different beam patterns with power allocation vectors $\hat{\mathbf{q}} = (\hat{q}_1, \ldots, \hat{q}_M)$ that satisfy the power and information rate constraints, i.e., $\sum_{m=1}^{M} \hat{q}_m = P$ and $\sum_{m=1}^{M} \log\left(1 + \sigma_m^2 \hat{q}_m / \rho^2\right) \ge R$. Each beam pattern corresponds to a particular power level $p_i$, which depends not only on the action $a_t$ but also on the channel condition experienced by the harvester during time slot t.
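The action set can be generated offline by enumerating candidate power allocations over the eigenchannels and keeping those that use the full power budget and meet the rate constraint. The sketch below is one possible construction; the 2 W grid step and the singular values are assumptions chosen only to mirror the 13-action example of Section 4.

```python
import itertools
import numpy as np

def build_action_set(sigma, P, R, noise_power, step=2.0):
    """Enumerate power allocations on a grid that use the full budget P and
    satisfy sum_m log2(1 + sigma_m^2 q_m / rho^2) >= R."""
    M = len(sigma)
    levels = np.arange(0.0, P + step, step)
    actions = []
    for q in itertools.product(levels, repeat=M):
        q = np.array(q)
        if not np.isclose(q.sum(), P):
            continue
        rate = np.sum(np.log2(1.0 + sigma**2 * q / noise_power))
        if rate >= R:
            actions.append(q)
    return actions

sigma = np.array([1.2, 0.9, 0.6])        # example singular values of H
A = build_action_set(sigma, P=12.0, R=53.0, noise_power=1e-10)
print(len(A), "feasible beam patterns")
```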

Given the above, the simultaneous wireless information and energy transfer problem for a time-varying channel can be formulated as minimizing the time consumption n needed to fully charge all the energy harvesters while maintaining the information rate between the information transceivers:

(8) $\mathcal{P}1: \underset{a_t}{\text{minimize}} \; n$, subject to $a_m(t) \ge 0$, $\sum_{m=1}^{M} a_m(t) \le P$, $\sum_{m=1}^{M} \log\left(1 + \sigma_m^2 a_m(t)/\rho^2\right) \ge R$, and $\sum_{t=1}^{n} \sum_{m=1}^{M} |\hat{g}_{im}(t)|^2 a_m(t)\, T \ge B_{\max}$, $\forall i \in \mathcal{K}$.

In general, the action selected at each time slot will be different, to adapt to the current channel conditions and the current energy buffer state of the harvesters. Therefore, the evolution of our system can be described by a Markov chain, where the generic state s is identified by the current buffer levels of the harvesters, i.e., $s = (B_1, B_2, \ldots, B_K)$. The set of all states is denoted by S. Among all states, we are interested in the state in which all harvesters' buffers are empty, namely, $s_0 = (0, \ldots, 0)$, and the state $s_G$ in which all the harvesters are fully charged, i.e., $s_G = (B_{\max}, \ldots, B_{\max})$. If we suppose that we know all the channel coefficients at each time slot, problem $\mathcal{P}1$ can be seen as a stochastic shortest path (SSP) problem from state $s_0$ to state $s_G$. At each time slot, the system is in a generic state s, the transmitter selects a beam pattern or action $a \in A$, and the system moves to a new state s′. The dynamics of the system are captured by the transition probabilities $p_{s,s'}(a)$, $s, s' \in S$, $a \in A$, describing the probability that the harvesters' energy buffers reach the levels in s′ after a transmission with beam pattern a. We note that the goal state $s_G$ is absorbing, i.e., $p_{s_G, s_G}(a) = 1$, $\forall a \in A$.

Each transition also has an associated reward, $w(s, a, s')$, that denotes the reward when the current state is $s \in S$, action $a \in A$ is selected, and the system moves to state $s' \in S$. Since we aim at reaching $s_G$ in the fewest transmission time slots, we consider that the action entails a positive reward related to the difference between the current energy buffer levels and the full energy buffer level of all harvesters. When the system reaches state $s_G$, we set the reward to 0. In this way, the system not only tries to fully charge all harvesters in the shortest time but also charges all the harvesters uniformly. In detail, we define the reward function as

(9) $w(s, a, s') = \lambda \left( K B_{\max} - \sum_{i=1}^{K} \min\left(s'_i, B_{\max}\right) \right),$

where

(10) $s'_i = s_i + \sum_{m=1}^{M} |\hat{g}_{im}|^2 a_m T,$

and λ denotes the unit price of the harvested energy.

It is noted that different reward functions can also be selected. As an example, it is also possible to set a constant negative reward (e.g., a unitary cost) for each transmission in which the system does not reach the goal state and a large positive reward only for the states and actions that bring the system to the goal state $s_G$. This can be expressed as follows:

(11) $w(s, a, s') = \begin{cases} W, & s' = s_G, \\ -1, & \text{otherwise}, \end{cases}$

where W denotes a large positive constant. We note that the reward formulation of equation (11) is actually equivalent to minimizing the number of time slots required to reach state $s_G$ starting from state $s_0$.
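A sketch of the two reward choices discussed above, following equations (9) and (11) (the constant W and the example buffer values are illustrative):

```python
import numpy as np

def reward_buffer_gap(s_next, B_max, lam):
    """Equation (9): reward proportional to the remaining gap between the
    (clipped) buffer levels and the full-buffer level; it is 0 at the goal."""
    K = len(s_next)
    gap = K * B_max - np.sum(np.minimum(s_next, B_max))
    return lam * gap

def reward_goal_or_penalty(s_next, B_max, W=10.0):
    """Equation (11): large positive reward W at the goal state, -1 otherwise."""
    return W if np.all(s_next >= B_max) else -1.0

s_next = np.array([1.2e-3, 0.8e-3])            # example buffer levels in joules
print(reward_buffer_gap(s_next, B_max=1.6e-3, lam=0.25))
print(reward_goal_or_penalty(s_next, B_max=1.6e-3))
```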

Using the above formulation, the optimization problem $\mathcal{P} = (S, A, p, w, s_0, s_G)$ can then be seen as a stochastic shortest path search from state $s_0$ to state $s_G$ on the Markov chain with states S, transition probabilities $p_{s,s'}(a)$, actions $a \in A$, and rewards $w(s, a, s')$. Our objective is to find, for each possible state $s \in S$, an optimal action $a(s)$ so that the system reaches the goal state following the path with maximum average reward. A generic policy can be written as $\pi = \{a(s) : s \in S\}$.

Different techniques can be applied to solve problem $\mathcal{P}1$, as it represents a particular class of MDPs. In this paper, however, we assume that the channel conditions at each time slot are unknown, which corresponds to not knowing the transition probabilities $p_{s,s'}(a)$. Therefore, in the next section, we describe how to solve the above problem using reinforcement learning.

3.2. Optimal Power Allocation with Reinforcement Learning

Reinforcement learning is suitable for solving optimization problems in which the system dynamics follow a particular transition probability function but the probabilities $p_{s,s'}(a)$ are unknown. In what follows, we first show how to apply the Q-learning algorithm [Reference Barto, Bradtke and Singh22] to solve the optimization problem, and we then show how the reinforcement learning approach can be combined with a neural network that approximates the system model in the case of large state and action sets, using a deep Q-network [Reference Mnih, Kavukcuoglu and Silver12].

3.2.1. Q-Learning Method

If the number of system states is small, we can depend on the traditional Q-learning method to find the optimal strategy at each system state, as defined in the previous section.

To this end, we define the Q function of action a in system state s as $Q(s, a)$, with $s \in S$, $a \in A$. The algorithm initializes $Q(s, a) = 0$ and then updates the Q values using the following equation:

(12) $Q(s, a) = \left(1 - \alpha(s, a)\right) Q(s, a) + \alpha(s, a) \left[ w(s, a, s') + \gamma f(s, a) \right],$

where

(13) $f(s, a) = \min_{a' \in A} Q(s', a'),$

and $\alpha(s, a)$ denotes the learning rate. In each time slot, only one Q value is updated; all the other Q values remain the same.

At the beginning of the learning iterations, since the Q-table does not contain enough information to choose the best action in each system state, the algorithm randomly explores new actions. Hence, we first define a threshold $\varepsilon_c \in (0.5, 1)$, and we then randomly generate a probability $p \in (0, 1)$. If $p \ge \varepsilon_c$, we choose the action a as

(14) $a = \arg\max_{a' \in A} Q(s, a').$

On the contrary, if $p < \varepsilon_c$, we randomly select one action from the action set A.

When Q converges, the optimal strategy at each state is determined as

(15) $\pi(s) = \arg\max_{a \in A} Q(s, a),$

which corresponds to finding the optimal beam pattern for each system state during the charging process.
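A compact tabular sketch of ε-greedy Q-learning in the spirit of equations (12)-(15); here the bootstrap term uses the maximum next-state Q value, consistent with the reward-maximization objective, and integer state/action indices are assumed:

```python
import numpy as np

rng = np.random.default_rng(1)

def epsilon_greedy(Q, s, eps_c):
    """With probability eps_c explore a random action, otherwise act greedily."""
    if rng.random() < eps_c:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Q(s,a) <- (1-alpha) Q(s,a) + alpha (r + gamma * max_a' Q(s',a'))."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * target

# Toy usage: 5 states, 3 actions, a single synthetic transition.
Q = np.zeros((5, 3))
a = epsilon_greedy(Q, s=0, eps_c=0.9)
q_update(Q, s=0, a=a, r=0.4, s_next=1, alpha=0.1, gamma=0.9)
print(Q[0])
```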

3.2.2. Deep Q-Network

When considering a complex system with multiple harvesters, large energy buffers, and time-varying channel conditions, the number of system states dramatically increases. In order to learn the optimal transmit strategy at each system state, the Q-learning algorithm described before requires a Q-table with a large number of elements, making it very difficult for all the values in the Q-table to converge. Therefore, in what follows, we describe how to apply the deep Q-network (DQN) approach to find the optimal transmission policy.

The main idea of DQN is to train a neural network to approximate the Q function of a particular system state and action combination. When the system is in state s and action a is selected, the Q function is denoted as $Q(s, a; \theta)$, where θ denotes the parameters of the Q-network. The purpose of training the neural network is to make

(16) $Q(s, a; \theta) \approx Q(s, a).$

According to the DQN algorithm [Reference Van Hasselt, Guez and Silver17], two neural networks are used to solve the problem: the evaluation network and the target network, denoted as eval_net and target_net, respectively. Both the eval_net and the target_net are set up with several hidden layers. The inputs of the eval_net and the target_net are denoted as s and s′, which describe the current system state and the next system state, respectively. The outputs of the eval_net and the target_net are denoted as $Q_e(s, a; \theta)$ and $Q_t(s', a'; \theta')$, respectively. The evaluation network is continuously trained to update the value of θ, whereas the target network only copies the weight parameters from the evaluation network intermittently (i.e., $\theta' = \theta$). In each neural network learning epoch, the loss function is defined as

(17) $\mathrm{loss}(\theta) = \mathbb{E}\left[ \left( y - Q_e(s, a; \theta) \right)^2 \right],$

where y represents the target Q value and is calculated as

(18) $y = w(s, a, s') + \varepsilon \max_{a' \in A} Q_t(s', a'; \theta'),$

where ε is the reward discount factor. The loss is backpropagated through the neural network to update the weights of the eval_net.
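The target and loss of equations (17) and (18) can be computed for a mini-batch as in the framework-agnostic NumPy sketch below, where eval_q and target_q_next stand for the outputs of eval_net and target_net (the gradient step itself is omitted; shapes and values are illustrative):

```python
import numpy as np

def dqn_targets(rewards, target_q_next, discount):
    """y = w(s,a,s') + discount * max_a' Q_t(s',a'; theta')   (equation (18))."""
    return rewards + discount * np.max(target_q_next, axis=1)

def dqn_loss(eval_q, actions, targets):
    """Mean squared error between y and Q_e(s,a; theta)       (equation (17))."""
    q_sa = eval_q[np.arange(len(actions)), actions]
    return np.mean((targets - q_sa) ** 2)

# Toy mini-batch of 4 transitions with 13 possible actions.
rng = np.random.default_rng(2)
eval_q = rng.random((4, 13))
target_q_next = rng.random((4, 13))
actions = np.array([0, 3, 7, 12])
y = dqn_targets(np.array([0.2, 0.1, 0.0, 0.4]), target_q_next, discount=0.9)
print(dqn_loss(eval_q, actions, y))
```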

ALGORITHM 1: Deep Q-network algorithm training process.

In order to better train the neural network, we apply the experience replay method to remove the correlation between different training data. Each experience consists of the current system state s, the action a, the next system state s′, and the corresponding reward $w(s, a, s')$. The experience is denoted by the tuple $e_p = (s, a, w(s, a, s'), s')$. The algorithm records D experiences and randomly selects $D_s$ (with $D_s < D$) of them for training. After the training step is finished, the target_net clones all the weight parameters from the eval_net (i.e., $\theta' = \theta$).

The algorithm used for the DQN training process is presented in Algorithm 1. In each training iteration, we generate D usable experiences $e_p$ and select $D_s$ of them for training the eval_net. In total, we suppose there are U training iterations. We consider that, for both the eval_net and the target_net, there are $N_l$ layers in the neural network. In the learning process, we use $\mathbf{C}$ to denote the channel conditions of all energy harvesters in a particular time slot.
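A minimal experience replay buffer matching the description above, with capacity D and mini-batch size D_s (the deque-based implementation is just one possible choice):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores experiences e_p = (s, a, w(s,a,s'), s') and samples mini-batches."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)   # oldest experiences are dropped

    def store(self, s, a, reward, s_next):
        self.memory.append((s, a, reward, s_next))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation between training data.
        return random.sample(list(self.memory), batch_size)

buffer = ReplayBuffer(capacity=60000)
buffer.store(s=(0.0, 0.0), a=3, reward=0.4, s_next=(0.8e-3, 1.1e-3))
print(len(buffer.memory))
```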

3.2.3. Dueling Double Deep Q-Network

Since more harvesters and time-varying channel conditions generate more system states, it is hard for the transmitter to learn the transmission rules even with the original DQN. Therefore, we apply dueling double DQN in order to deal with the overestimation problem during the training process and improve the learning efficiency of the neural network. Double DQN is a technique that strengthens the traditional DQN algorithm by preventing overestimation [Reference Van Hasselt, Guez and Silver17]. In traditional DQN, as shown in equation (18), we utilize the target_net to predict the maximum Q value of the next state. However, the target_net is not updated at every training episode, which may increase the training error and therefore complicate the training process. In double DQN, we utilize both the target_net and the eval_net to predict the Q value. The eval_net is used to determine the optimal action to be taken in the next system state s′, and the target_net evaluates that action:

(19) $y = w(s, a, s') + \varepsilon \, Q_t\!\left(s', \arg\max_{a' \in A} Q_e(s', a'; \theta); \theta'\right).$

It can be shown that, following this approach, the training error considerably decreases [Reference Van Hasselt, Guez and Silver17].
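The double DQN target of equation (19) differs from equation (18) only in how the next-state action is chosen: the eval_net selects it and the target_net evaluates it. A sketch under the same NumPy conventions as above:

```python
import numpy as np

def double_dqn_targets(rewards, eval_q_next, target_q_next, discount):
    """y = w + discount * Q_t(s', argmax_a' Q_e(s', a'; theta); theta')."""
    best_actions = np.argmax(eval_q_next, axis=1)          # chosen by eval_net
    q_of_best = target_q_next[np.arange(len(best_actions)), best_actions]
    return rewards + discount * q_of_best                   # evaluated by target_net

rng = np.random.default_rng(3)
eval_q_next = rng.random((4, 13))
target_q_next = rng.random((4, 13))
print(double_dqn_targets(np.array([0.2, 0.1, 0.0, 0.4]), eval_q_next, target_q_next, 0.9))
```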

In traditional DQN, the neural network only has the Q value as its output. In order to speed up convergence, we apply dueling DQN by setting up two output streams in the neural network. The first stream is the state-value output $V(s; \theta, \beta)$, which represents the value of each system state. The second stream is the advantage output $A(s, a; \theta, \alpha)$, which describes the advantage of applying each particular action in the current system state [Reference Wang, Schaul, Hessel, Van Hasselt, Lanctot and De Freitas18]. Here, α and β are the parameters of the two streams, and the neural network output is given by

(20) $Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left( A(s, a; \theta, \alpha) - \frac{1}{|A|} \sum_{a' \in A} A(s, a'; \theta, \alpha) \right).$

Dueling DQN can efficiently eliminate the extra training freedom, which speeds up the training [Reference Wang, Schaul, Hessel, Van Hasselt, Lanctot and De Freitas18].
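The dueling aggregation of equation (20) acts directly on the two output streams; the sketch below assumes V and A have already been produced by the shared hidden layers (batch shapes are illustrative):

```python
import numpy as np

def dueling_q(V, A):
    """Q(s,a) = V(s) + (A(s,a) - mean_a' A(s,a'))   (equation (20))."""
    return V[:, None] + (A - A.mean(axis=1, keepdims=True))

# Toy batch: 2 states, 13 actions.
rng = np.random.default_rng(4)
V = rng.random(2)            # state-value stream
A = rng.random((2, 13))      # advantage stream
print(dueling_q(V, A).shape)
```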

4. Simulation Results

We simulate a MIMO wireless communication system with nearby RF energy harvesters. The wireless transmitter has at most M = 4 antennas. The 4 × 4 communication MIMO channel matrix $\mathbf{H}$ is measured with two Wireless Open-Access Research Platform (WARP) v3 boards. Both WARP boards are mounted with the FMC-RF-2X245 dual-radio module, which operates in the 5.805 GHz frequency band. The Xilinx Virtex-6 FPGA operates as the central processing system, and WARPLab, compiled in MATLAB, is used for rapid physical layer prototyping [Reference Dong and Liu23]. The two transceivers are deployed with line-of-sight transmission. The maximum transmit power is P = 12 W, and the noise power is $\rho^2 = -70$ dBm. The information rate requirement R is 53 bps/Hz. The average channel gain from the transmitter to the energy harvesters is -30 dB. The energy conversion efficiency is 0.1. The duration of one time slot is T = 100 ms.

DQN is trained to solve for the optimal transmit strategies for each system state. The simulation parameters used for DQN are presented in Table 1.

TABLE 1: DQN simulation parameters.

As described in Section 3.2, the exploration rate $\varepsilon_c$ determines the probability that the network selects an action randomly rather than following the values of the Q-table. Initially, we set $\varepsilon_c = 1$ because the experience pool has to accumulate a reasonable amount of data to train the neural network. $\varepsilon_c$ then decreases by 0.001 at each training interval and finally stops at $\varepsilon_{ch} = 0.1$, once the experience pool has collected enough training data.

The dueling double DQN structure used in our paper, shown in Figure 2, follows [Reference Xing, Pan, Xu, Zhao, Tapparello and Qian24]. The software environment for the simulation is TensorFlow 0.12.1 with Python 3.6 in Jupyter Notebook 5.6.0.

FIGURE 2: Dueling double deep Q-network structure.

For the energy harvesters' channels, to show an example of the performance achievable by the proposed algorithm, we consider the Rician fading model [Reference Cavers25]. We suppose that within each time slot t the channel is invariant, while it varies across time slots [Reference Wu and Yang26]. At the end of each time slot, each energy harvester feeds back its current energy level to the transmitter. For the Rician fading channel model, the total gain of the signal is denoted as $g = g^s + g^d$, where $g^s$ is the invariant LOS component and $g^d$ denotes a zero-mean Gaussian diffuse component. The channel between transmit antenna m and energy harvester i can be written as $g_{im} = g^s_{im} + g^d_{im}$. The magnitude of the faded envelope can be modeled using the Rice factor $K^r$ such that $K^r_{im} = \rho_{im}^2 / (2\sigma_{im}^2)$, where $\rho_{im}^2$ denotes the average power of the main LOS component between transmit antenna m and energy harvester i and $\sigma_{im}^2$ denotes the variance of the scatter component. We can derive the magnitude of the main LOS component as $g^s_{im} = \sqrt{2 K^r_{im}}\, \sigma_{im}$, since $\frac{1}{2}\mathbb{E}\left[|g^d_{im}|^2\right] = \sigma_{im}^2$. The mean and the variance of $g_{im}$ are denoted as $\mu_{g_{im}} = g^s_{im}$ and $\sigma^2_{g_{im}} = \sigma_{im}^2$, respectively. In polar coordinates, $g_{im} = r_{im} e^{j\theta_{im}}$.

First, we explore the optimal deep Q-network structure under fading channels. We suppose the number of antennas is M = 3 and the number of energy harvesters is K = 2. The channel between each antenna of the transmitter and each harvester is individually Rician distributed. The action set A contains 13 actions satisfying the information rate requirement: $[2,2,8]^T$, $[2,4,6]^T$, $[2,6,4]^T$, $[2,8,2]^T$, $[4,2,6]^T$, $[4,4,4]^T$, $[4,6,2]^T$, $[4,8,0]^T$, $[6,2,4]^T$, $[6,4,2]^T$, $[6,6,0]^T$, $[8,2,2]^T$, and $[8,4,0]^T$.

The LOS amplitude components of all channel links are set to $r_{im} = 0.5$, with i = 1, 2 and m = 1, 2, 3. The LOS phase components of all channel links are $\theta_{11} = \pi/4$, $\theta_{12} = \pi/2$, $\theta_{13} = \pi/4$, $\theta_{21} = \pi/2$, $\theta_{22} = 0$, and $\theta_{23} = 3\pi/4$. The standard deviations of the amplitude and phase of $g_{im}$ are $\sigma_{im}$ and $\sqrt{1/(2K^r_{im})}$, respectively. We suppose $\sigma_{im} = 0.05$, $\forall i, m$. Hence, $\sqrt{1/(2K^r_{im})} = (r_{im}/\sigma_{im})^{-1} = 0.1$, $\forall i, m$.
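For reference, the harvester channels of this scenario can be sampled as in the sketch below, which implements the Rician model described above with the LOS amplitudes, phases, and σ_im listed here (the sampling routine itself is illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

def sample_rician_channels(r_los, theta_los, sigma):
    """Draw g_im = g_im^s + g_im^d with a fixed LOS component and a zero-mean
    complex Gaussian diffuse component of per-dimension variance sigma^2."""
    g_los = r_los * np.exp(1j * theta_los)
    diffuse = sigma * (rng.standard_normal(r_los.shape)
                       + 1j * rng.standard_normal(r_los.shape))
    return g_los + diffuse

r_los = np.full((2, 3), 0.5)                      # K = 2 harvesters, M = 3 antennas
theta_los = np.array([[np.pi / 4, np.pi / 2, np.pi / 4],
                      [np.pi / 2, 0.0, 3 * np.pi / 4]])
g = sample_rician_channels(r_los, theta_los, sigma=0.05)   # one time slot realization
print(np.abs(g))
```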

Using the fading channel model above, Figure 3 shows how the structure of the neural network, together with the learning rate, affects the performance of the DQN for a fixed number of training episodes (i.e., 40000). The performance of the DQN is measured by the average number of time slots required to fully charge the two harvesters, obtained over 1000 testing runs. Figure 3 shows that if the deep Q-network has multiple hidden layers, a smaller learning rate is necessary to achieve better performance. When the learning rate is 0.1, the DQN with 4 hidden layers performs worse than a neural network with 2 or 3 hidden layers. On the other hand, when the learning rate decreases, the neural network with 4 hidden layers and a learning rate of 0.00005 achieves the best overall performance. We do not see a monotonic decrease in the average number of time slots because of the stochastic nature of the channel, which causes some fluctuations in the DQN optimization. After an initial improvement, decreasing the learning rate further results in a slight increase in the average number of charging steps for all three neural network structures; this is due to the fixed number of training episodes. As a result, for all the simulations presented in this section, we consider a DQN algorithm using a deep neural network with 4 hidden layers, 100 nodes in each layer, and a learning rate of 0.00005.

FIGURE 3: Deep Q-network performance on different learning rates and number of hidden layers for the neural network.

In Figure 4, we can observe that the size of the experience pool also affects the performance of the DQN (40000 training episodes). To eliminate the correlation between the training data, we select only part of the experience pool for training; in our simulation, this parameter, called the mini-batch, is set to 10. A larger experience pool contains more training data, so selecting the mini-batch from it for training better eliminates the correlation between the training data. However, we need to balance the size of the experience pool against the target_net weight replacement interval. If the experience pool is large but the replacement iteration interval is small, even if we address the correlation problem between the training data, the neural network does not have enough training episodes to reduce the training error before the weights of the target_net are replaced. From Figure 4, we can also observe that a large replacement iteration interval may not be the best choice either. Therefore, we determine that, for our problem, the DQN achieves the best performance when the size of the experience pool and the neural network replacement iteration interval are 60000 and 1000, respectively.

FIGURE 4: Deep Q-network performance for different values of neural network replacement iteration interval and experience pool.

Figure 5 shows the impact of the reward function (see Section 3.1) on the DQN performance. In this figure, we consider the following reward functions: Reward1: $w(s, a, s') = 0$ if $s' = s_G$ and $w(s, a, s') = \lambda\left(K B_{\max} - \sum_{i=1}^{K} \min(s'_i, B_{\max})\right)$ otherwise; Reward2: $w(s, a, s') = 10$ if $s' = s_G$ and $w(s, a, s') = -1$ otherwise; Reward3: $w(s, a, s') = 1$ if $s' = s_G$ and $w(s, a, s') = -1$ otherwise. Here, K = 2 and λ = 0.25. All three reward functions are designed to minimize the number of time slots required to fully charge all the harvesters. However, from Figure 5, we can observe that the best performance is obtained using Reward1. In this case, the energy level accumulated by each harvester increases uniformly, which allows the DQN to converge faster to the optimal policy. Both Reward2 and Reward3, instead, do not penalize states that unevenly charge the harvesters and therefore require more iterations to converge to the optimal solution (not shown in the figure) due to the large number of system states to explore. Therefore, in the following simulations, we use the reward function Reward1. In both Figures 5 and 6, we average the 40000 training steps over windows of 100 steps in order to better show the convergence of the algorithm.

FIGURE 5: The deep Q-network performance for different reward functions.

FIGURE 6: The deep Q-network performance for different energy buffer size B max.

Figure 6 shows that when each energy harvester in the system is equipped with a larger energy buffer, the number of system states increases and, therefore, the DQN requires a longer training period to converge to a stable transmit strategy for each system state. We can observe that when $B_{\max} = 1.6$ mJ, the system needs fewer than 5000 training episodes to converge to the optimal strategy, and when $B_{\max} = 3.2$ mJ, the system needs around 12000 training episodes. However, for $B_{\max} = 4.8$ mJ, the system needs as many as 20000 training episodes to converge to the optimal strategy.

In the following simulations, we explore the impact of the channel model on optimization problem $\mathcal{P}1$. For the Rician fading channel model, we consider $K^r_{im} \ge 10$ and the same for all i, m. In this way, we can approximate the Rician distribution by a Gaussian distribution. We fix $r_{im} = 0.5$, $\forall i, m$, but allow the standard deviation of both the amplitude and the phase of the channel to change in order to evaluate the performance of the system under different channel conditions. Since $r_{im} = 0.5$ and $\sqrt{2 K^r_{im}} = r_{im}/\sigma_{im}$, we have $\sqrt{1/(2 K^r_{im})} = 2\sigma_{im}$. We set $\sigma_{im} \le 0.1$ to guarantee $K^r_{im} \ge 10$.

In Figure 7, we vary the standard deviation $\sigma_{\mathrm{amp}} = \sigma_{im}$, $\forall i, m$, of the amplitude and phase of the channel, and we compare the performance attained by the optimal policy with that of several other algorithms. The multiarmed bandit (MAB) algorithm is implemented for comparison with the DQN. In MAB, each bandit arm represents a particular transmission pattern. The upper confidence bound (UCB) algorithm [Reference Slivkins27] is implemented to maximize the reward $w(s, a, s')$ and determine the optimal action; once the action is selected from the action space A, it is used for transmission continuously. The myopic algorithm is another machine learning algorithm compared with the DQN. The myopic solution has the same structure as the DQN, but the reward discount is set to γ = 0. As a result, the optimal strategy is determined only according to the current observation, without considering future consequences. The myopic solution has been widely used to solve complex optimization problems in wireless communications and achieves good system performance [Reference Wang, Liu, Gomes and Krishnamachari28]. Besides these two machine learning algorithms, two heuristic algorithms are also used for performance comparison: even power allocation, in which the transmit power P is evenly allocated across the parallel channels, and random action selection. The random action selection has the worst performance, while the DQN performs best. Compared to the best existing algorithm, the multiarmed bandit, the DQN consumes 20% fewer time slots to complete charging. In some channel conditions, the myopic solution can achieve a performance similar to that of the DQN; however, it does not perform stably. For example, when the standard deviation of the channel amplitude is $\sigma_{\mathrm{amp}} = 0.025$, the DQN outperforms the myopic solution by 45%. The instability arises because the myopic solution makes decisions based only on the current system state and reward, without considering future consequences; hence, the training effect cannot be guaranteed. Overall, the DQN is superior in both charging time consumption and stability across different channel conditions.

FIGURE 7: The comparison of average time consumption between DQN and other algorithms (myopic solution, multiarmed bandit, even power allocation, and random action selection) in the Rician fading channel model. The number of transmit antennas is M = 3. The number of energy harvesters is N = 2.

To better explain the performance of the optimal policy, in Figure 8 we plot the actions selected by the DQN at particular system states when $\sigma_{\mathrm{amp}} = 0.05$. When $\sigma_{\mathrm{amp}} = 0.05$, the optimal action selected by the multiarmed bandit is the third action $a_3 = [2, 6, 4]^T$, which finishes charging both harvesters in around 60 time slots, while the optimal policy determined by the DQN finishes charging in around 43 time slots. To this end, Figure 8 shows that the charging process can be divided into two parts: before harvester 1 accumulates 1.2 mJ and harvester 2 accumulates 0.8 mJ, mostly action 4, $a_4 = [2, 8, 2]^T$, is selected; after that, mostly action 1, $a_1 = [2, 2, 8]^T$, is selected. When both the amplitude and the phase of the channel are Gaussian distributed with zero standard deviation, $\hat{\mathbf{g}}_1^0 = [0.05, 0.59, 0.11]^T$ and $\hat{\mathbf{g}}_2^0 = [0.04, 0.19, 0.51]^T$. Hence, when both the amplitude and the phase of the channel change, the simplified channel state information is distributed around $\hat{\mathbf{g}}_1^0$ and $\hat{\mathbf{g}}_2^0$. As a result, a policy that selects either action 1 or action 4 with different probabilities can perform better than a policy that only selects action 3. Hence, the DQN consumes about 40% fewer time slots to fully charge the two energy harvesters.

FIGURE 8: The action selection process of two harvesters scenario when σ amp = 0.05 .

In Figure 9, the performance of the DQN is compared with that of the other four algorithms while varying the number of energy harvesters in the system. In general, as the number of energy harvesters increases, all algorithms consume more time slots to complete the wireless charging process. Compared to random action selection, the DQN consumes at least 58% fewer time slots to complete the charging. The performance of the multiarmed bandit and of the even power allocation is very similar, which can be explained by the fact that the optimal action determined by the multiarmed bandit algorithm is close to the even power allocation strategy. Compared with the two fixed action selection strategies (multiarmed bandit and even power allocation), the DQN reduces the time consumption by up to 72% (when the number of energy harvesters is N = 3). The myopic solution is still not the optimal strategy. From the figure, we can observe that the myopic solution outperforms the two fixed action selection algorithms. Even though in some conditions (N = 6) the performance difference between the DQN and the myopic solution is very small, the myopic solution consumes on average more than 15% more time slots than the DQN. Overall, the DQN is the algorithm that consumes the fewest time slots to fully charge all the energy harvesters, regardless of their number.

FIGURE 9: The comparison of average time consumption between DQN and other algorithms (myopic solution, multiarmed bandit, even power allocation, and random action selection) when the number of energy harvesters is N = 2, 3, 4, 5, 6. The number of transmit antennas is M = 3. The standard deviation of the channel amplitude is σ amp = 0.05 .

In Figure 10, the number of transmit antennas is increased from M = 3 to M = 4, and the number of energy harvesters varies from N = 2 to N = 6. Although the number of antennas increases and the channel conditions between the transmitter and the energy harvesters become more complicated, the DQN still outperforms the other four algorithms. Compared with the myopic solution, multiarmed bandit, even power allocation, and random action selection, the DQN consumes up to 27%, 54%, 55%, and 76% fewer time slots, respectively, to complete the wireless charging. As the number of energy harvesters increases, the superiority of the DQN over the two fixed action selection algorithms becomes more pronounced, which can be explained by the fact that selecting one fixed action becomes increasingly inefficient in a more complicated time-varying channel environment. Even though in some conditions the performance of the myopic solution and of the DQN is similar, the myopic solution is not stable across different energy harvester configurations. The results of both Figures 9 and 10 demonstrate the superiority of the DQN in optimizing the time consumption for wireless power transfer.

FIGURE 10: The comparison of average time consumption between DQN and other algorithms (myopic solution, multiarmed bandit, even power allocation, and random action selection) when the number of energy harvesters is N = 2, 3, 4, 5, 6. The number of transmit antennas is M = 4. The standard deviation of the channel amplitude is σ amp = 0.05 .

5. Conclusions

In this paper, we design the optimal wireless power transfer system for multiple RF energy harvesters. Deep learning methods are used to enable the wireless transmitter to fully charge the energy buffers of all energy harvesters in the shortest time while meeting the information rate requirement of the communication system.

As the channel conditions between the transmitter and the energy harvesters are time-varying and unknown, we model the problem as a Markov decision process. Due to the large number of system states in the model and the resulting difficulty of training, we adopt a deep Q-network approach to find the best transmit strategy for each system state. In the simulation section, multiple experimental environments are explored, and measured experimental data are used to run the simulations. The deep Q-network is compared with four existing algorithms, and the simulation results validate that it is superior to all of them in terms of the time consumption for completing the wireless power transfer.

Data Availability

The simulation data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

1. Zhang, R. and Ho, C. K., "MIMO broadcasting for simultaneous wireless information and power transfer," IEEE Transactions on Wireless Communications, vol. 12, no. 5, pp. 1989–2001, 2013.
2. Lee, S., Liu, L., and Zhang, R., "Collaborative wireless energy and information transfer in interference channel," IEEE Transactions on Wireless Communications, vol. 14, no. 1, pp. 545–557, 2015.
3. Zong, Z., Feng, H., Yu, F. R., Zhao, N., Yang, T., and Hu, B., "Optimal transceiver design for SWIPT in K-user MIMO interference channels," IEEE Transactions on Wireless Communications, vol. 15, no. 1, pp. 430–445, 2016.
4. Xing, Y., Qian, Y., and Dong, L., "Deep learning for optimized wireless transmission to multiple RF energy harvesters," in Proceedings of the 2018 IEEE 88th Vehicular Technology Conference (VTC-Fall), Chicago, IL, USA, August 2018.
5. Park, J. and Clerckx, B., "Joint wireless information and energy transfer in a two-user MIMO interference channel," IEEE Transactions on Wireless Communications, vol. 12, no. 8, pp. 4210–4221, 2013.
6. Wu, W., Zhang, X., Wang, S., and Wang, B., "Max-min fair wireless energy transfer for multiple-input multiple-output wiretap channels," IET Communications, vol. 10, no. 7, pp. 739–744, 2016.
7. Thudugalage, A., Atapattu, S., and Evans, J., "Beamformer design for wireless energy transfer with fairness," in Proceedings of the 2016 IEEE International Conference on Communications (ICC), pp. 1–6, Kuala Lumpur, Malaysia, May 2016.
8. Xing, Y. and Dong, L., "Passive radio-frequency energy harvesting through wireless information transmission," in Proceedings of the 2017 13th International Conference on Distributed Computing in Sensor Systems (DCOSS), pp. 73–80, Ottawa, ON, Canada, June 2017.
9. Mekikis, P.-V., Antonopoulos, A., Kartsakli, E., Lalos, A. S., Alonso, L., and Verikoukis, C., "Information exchange in randomly deployed dense WSNs with wireless energy harvesting capabilities," IEEE Transactions on Wireless Communications, vol. 15, no. 4, pp. 3008–3018, 2016.
10. Xu, J. and Zhang, R., "A general design framework for MIMO wireless energy transfer with limited feedback," IEEE Transactions on Signal Processing, vol. 64, no. 10, pp. 2475–2488, 2016.
11. Choi, K. W., Kim, D. I., and Chung, M. Y., "Received power-based channel estimation for energy beamforming in multiple-antenna RF energy transfer system," IEEE Transactions on Signal Processing, vol. 65, no. 6, pp. 1461–1476, 2017.
12. Mnih, V., Kavukcuoglu, K., Silver, D. et al., "Playing Atari with deep reinforcement learning," 2013, https://arxiv.org/abs/1312.5602.
13. Xing, Y., Qian, Y., and Dong, L., "A multi-armed bandit approach to wireless information and power transfer," IEEE Communications Letters, vol. 24, no. 4, pp. 886–889, 2020.
14. He, Y., Zhang, Z., Yu, F. R. et al., "Deep reinforcement learning-based optimization for cache-enabled opportunistic interference alignment wireless networks," IEEE Transactions on Vehicular Technology, vol. 66, no. 11, pp. 10433–10445, 2017.
15. Foerster, J., Assael, I. A., de Freitas, N., and Whiteson, S., "Learning to communicate with deep multi-agent reinforcement learning," 2016, https://arxiv.org/abs/1605.06676.
16. Xu, Z., Wang, Y., Tang, J., Wang, J., and Gursoy, M. C., "A deep reinforcement learning based framework for power-efficient resource allocation in cloud RANs," in Proceedings of the 2017 IEEE International Conference on Communications (ICC), pp. 1–6, IEEE, Paris, France, May 2017.
17. Van Hasselt, H., Guez, A., and Silver, D., "Deep reinforcement learning with double Q-learning," 2016, https://arxiv.org/abs/1509.06461.
18. Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M., and De Freitas, N., "Dueling network architectures for deep reinforcement learning," 2015, https://arxiv.org/abs/1511.06581.
19. Biglieri, E., Calderbank, R., Constantinides, A., Goldsmith, A., Paulraj, A., and Poor, H. V., MIMO Wireless Communications, Cambridge University Press, Cambridge, UK, 2007.
20. Timotheou, S., Krikidis, I., Karachontzitis, S., and Berberidis, K., "Spatial domain simultaneous information and power transfer for MIMO channels," IEEE Transactions on Wireless Communications, vol. 14, no. 8, pp. 4115–4128, 2015.
21. Mishra, D. and Alexandropoulos, G. C., "Jointly optimal spatial channel assignment and power allocation for MIMO SWIPT systems," IEEE Wireless Communications Letters, vol. 7, no. 2, pp. 214–217, 2018.
22. Barto, A. G., Bradtke, S. J., and Singh, S. P., "Learning to act using real-time dynamic programming," Artificial Intelligence, vol. 72, no. 1-2, pp. 81–138, 1995.
23. Dong, L. and Liu, Y., "Parallel sub-channel transmission for cognitive radios with multiple antennas," Wireless Personal Communications, vol. 79, no. 3, pp. 2069–2087, 2014.
24. Xing, Y., Pan, H., Xu, B., Zhao, T., Tapparello, C., and Qian, Y., "Multiuser data dissemination in OFDMA system based on deep Q-network," in Proceedings of the IEEE IEMTRONIC (International IOT, Electronics and Mechatronics Conference), Toronto, Canada, April 2021.
25. Cavers, J., Mobile Channel Characteristics, Springer Science & Business Media, Berlin, Germany, 2006.
26. Wu, T.-Q. and Yang, H.-C., "On the performance of overlaid wireless sensor transmission with RF energy harvesting," IEEE Journal on Selected Areas in Communications, vol. 33, no. 8, pp. 1693–1705, 2015.
27. Slivkins, A., "Introduction to multi-armed bandits," 2019, https://arxiv.org/abs/1904.07272.
28. Wang, S., Liu, H., Gomes, P. H., and Krishnamachari, B., "Deep reinforcement learning for dynamic multichannel access in wireless networks," IEEE Transactions on Cognitive Communications and Networking, vol. 4, no. 2, pp. 257–265, 2018.