1. Introduction
In conventional sensor-level tracking, the objective is typically to estimate the hidden state $ {\mathbf{x}}_t $ of an object of interest (e.g., pointing apparatus, pedestrian, vehicle, vessel, airplane, etc.), where $ {\mathbf{x}}_t $ is the target location, orientation, velocity, higher order kinematics, or other spatio-temporal characteristics. This state is assumed to be related to the available noisy sensory measurements (e.g., from camera, radar, inertial measurement units, radio frequency transmissions, global navigation satellite system, acoustic signals, etc.) as per a defined observation model. A plethora of well-established algorithms for estimating $ {\mathbf{x}}_t $ exists, including from multiple data sources; see Bar-Shalom et al. (Reference Bar-Shalom, Willett and Tian2011) and Haug (Reference Haug2012). They often implicitly assume that the object moves in an unpremeditated manner, and suitable motion models are employed accordingly.
In this paper, the aim is not to estimate the state $ {\mathbf{x}}_t $ , but instead to infer the underlying intent that is driving the object motion, namely its destination. This capitalizes on the premise that the target motion (e.g., the trajectory followed by a pointing finger while interacting with a display) is dictated by the intended endpoint (e.g., the sought interface item), and that the destination influence on the target movements can be modeled. Therefore, the sought probabilistic modeling and destination predictor(s) belong to a higher system level compared with the sensor-level tracking techniques, and are hence dubbed meta-level tracking algorithms. They have several applications, such as in surveillance, human–computer interaction (HCI), robotics, and others, since such meta-level approaches can facilitate automated decision-making, resource allocation and informed future action planning. They offer a more integrated viewpoint of a scene where intents can be automatically learnt and conflicts or opportunities can be identified in a timely manner. The HCI technology dubbed predictive touch is used here as an application of, and motivation for, the proposed Bayesian meta-level inference framework. Nonetheless, this approach can be applied in numerous other areas and scenarios.
1.1. Predictive touch
Predictive touch is an emerging HCI technology for intelligent displays and touchless interactions that can predict the interface component the user intends to select (e.g., a selectable graphical user interface [GUI] item displayed on a touch screen), notably early in the pointing-selection task (Ahmad et al., Reference Ahmad, Murphy, Godsill, Langdon and Hardy2017). This is based on the available freehand pointing movements in 3D, for example, provided by gesture trackers, and potentially other available sensory data such as eye-gaze. The pointing-selection task is then simplified and expedited by the predictive touch solution via applying a suitable selection facilitation scheme. This can significantly reduce the effort and distractions associated with using in-vehicle displays while driving (Jæger et al., Reference Bach, Jæger, Skov and Thomassen2008), including under the influence of perturbations, for example, vibrations and accelerations due to the road and driving conditions. Such perturbations can have a detrimental impact on the usability of displays in moving platforms, such as in-vehicle touch screens (Goode et al., Reference Goode, Lenné and Salmon2012; Ahmad et al. Reference Ahmad, Langdon, Godsill, Hardy, Skrypchuk and Donkor2015), which often act as the gateway to control in-vehicle infotainment systems. For instance, pointing time can be reduced by over 30% and effort/workload halved with predictive touch; see Ahmad et al. (Reference Ahmad, Murphy, Godsill, Langdon and Hardy2017). It is noted that gesture trackers are increasingly becoming commonplace in automotive, gaming, and infotainment applications in general, and more recently in smartphones (see Quinn et al., Reference Quinn, Lee, Barnhart and Zhai2019), due to recent advancements in sensing and computer-vision systems.
Thus, a predictive touch system typically assumes the presence of a gesture tracker that it can utilize (possibly integrated into the display, e.g., a computer-vision solution with several built-in cameras on a touch screen).
Figure 1 depicts the system block diagram, which comprises the following four main modules:
• Pointing gesture tracker: provides, in real time, the pointing hand/finger(s) location in 3D; for example, $ {\mathbf{y}}_{0:n} $ is the partial (filtered) pointing trajectory pertaining to the time instants $ \left\{{t}_0,{t}_1,\dots, {t}_n\right\} $ available at time $ {t}_n $ .
• Intent predictor: for a set of $ {N}_{\mathcal{D}} $ selectable interface icons, $ \left\{{\mathcal{D}}_i:i=1,2,\dots, {N}_{\mathcal{D}}\right\} $ , this module calculates the likelihood of each $ {\mathcal{D}}_i $ being the intended destination at $ {t}_n $ , from the available $ {\mathbf{y}}_{0:n} $ .
• Selection facilitation: based on the prediction results, the system simplifies and expedites the selection task. Various such facilitation schemes can be applied (e.g., expand or highlight/fade or drag the item closer to the pointing location, etc.) and were the subject of the studies in Ahmad et al. (Reference Ahmad, Hare, Singh, Shabani, Lindsay, Skrypchuk, Langdon and Godsill2019a) for automotive applications. It was reported that autonomous selection of the predicted GUI item by the system on behalf of the user (i.e., immediate mid-air selection) is an effective facilitation scheme, leading to touchless or contact-free interactions.
• Additional data: available additional sensory data, such as inertial (accelerometer/gyroscope) measurements, eye-gaze measurements, and environmental data, can be utilized to improve the prediction results. For instance, vehicle CAN-bus data (e.g., suspension and speed signals) can indicate the level of experienced perturbations due to road-driving conditions.
Predictive touch is therefore a software-based touchless technology where the user does not need to physically touch a display to select an interface component. It can not only improve the usability and performance of interactive displays, but it also provides the means to interact with new display technologies that do not have a physical surface, such as head-up displays, holograms and 3D projections (Bark et al., Reference Bark, Tran, Fujimura and Ng-Thow-Hing2014; Broy et al. Reference Broy, Guo, Schneegass, Pfleging and Alt2015). This novel HCI solution uses intuitive freehand pointing gestures and intrinsically relies on predicting the user intent, rather than using the pointing finger/arm location or orientation as a pointing apparatus as in Roider and Gross (Reference Roider and Gross2018). Thereby, predictive touch is not a mid-air pointing or ray-casting approach (Plaumann et al., Reference Plaumann, Weing, Winkler, Müller and Rukzio2018), and it is fundamentally distinct from gesture-recognition-based interactions that require the user to pre-learn particular “symbolic” gesture shapes to trigger certain interface responses (May et al. Reference May, Gable and Walker2017). It also offers considerable design flexibility in terms of the display placement and GUI design, which are otherwise limited by the reach and motor capabilities of the user. This can promote inclusive design practices by tailoring the display operation to the user requirements via configuring the prediction algorithms and facilitation schemes.
1.2. Related work and contributions
The Bayesian framework for intent prediction presented in this article was introduced in Ahmad et al. (Reference Ahmad, Murphy, Langdon, Godsill, Hardy and Skrypchuk2016b) and Ahmad et al. (Reference Ahmad, Murphy, Langdon and Godsill2018) for predictive touch and other applications; see Ahmad et al. (Reference Ahmad, Langdon and Godsill2019b) for a short overview. It treats the problem within an object tracking formulation, albeit not necessarily seeking state estimation, such that the influence of the intended destination is captured by utilizing a suitable stochastic motion model with a few unknown parameters. These parameters can be estimated from a small number of example motion patterns or trajectories. Linear Time-Invariant Gaussian systems were considered in the aforementioned papers, and more recently nonlinear behavior due to external forces (e.g., jumps and jolts in the pointing movements due to the road/driving conditions) was briefly addressed in Gan et al. (Reference Gan, Liang, Ahmad and Godsill2019). Compared with previous work, here we
1. present an overview and unified treatment of the intent prediction task for linear as well as nonlinear (albeit within a conditionally linear formulation) motion models and systems,
2. propose a new approach to the bridging distributions (BD) class of intent-driven models, which have a moderate computational requirement and a clear stochastic interpretation. In this context, the previously unconsidered bridged (nearly) constant acceleration dynamic model is shown to deliver the highest prediction performance for a predictive touch system, and
3. benchmark various prediction models using a significantly larger data set of pointing gestures recorded in instrumented vehicles under various road-driving conditions.
In the tracking area, incorporating known predictive information to improve the accuracy of state estimation has a long history, for example Castanon et al. (Reference Castanon, Levy and Willsky1985) and Baccarelli and Cusani (Reference Baccarelli and Cusani1998). Additionally, mean-driven models such as those derived from an Ornstein–Uhlenbeck (OU) process, with known means, were used to better estimate the behavior of certain objects, for example vessels in Millefiori et al. (Reference Millefiori, Braca, Bryan and Willett2016) or financial time series data in Christensen et al. (Reference Christensen, Murphy and Godsill2012). Also, the use of stochastic context-free grammar (SCFG) and conditionally Markov process/reciprocal process has been proposed to predict intent as in Fanaswala and Krishnamurthy (Reference Fanaswala and Krishnamurthy2015) and Rezaie and Li (Reference Rezaie and Li2019a,Reference Rezaie and Lib). In this paper, the destination (i.e., intent) is assumed to be unknown and predictors are developed to infer it. The formulation adopted here leads to significantly simpler algorithms with no constraints on the trajectory followed by the object (e.g., freehand pointing finger), unlike those using SCFG, which discretize the state space. The continuous state space models employed within the introduced Bayesian framework, such as OU-type processes and bridging distributions (both are detailed in the next section), enable treating asynchronous sensory measurements. Notably, the bridging distribution can be viewed as a special case of the conditionally Markov models in Rezaie and Li (Reference Rezaie and Li2019a) under certain assumptions.
On the other hand, modeling and inferring complex intentions, such as driver behavior at junctions, pedestrians at crosswalks, and human daily activities, can be tackled with data-driven or classification approaches, possibly combined with an a priori learnt pattern of life. Several such prediction techniques are well established, and they assume the availability of sufficiently complete and diverse training data sets; see, for example, Bando et al. (Reference Bando, Takenaka, Nagasaka and Taniguchi2013), Völz et al. (Reference Völz, Mielenz, Gilitschenski, Siegwart and Nieto2018), and Gaurav and Ziebart (Reference Gaurav and Ziebart2019). However, in this paper, the objective is to develop a simple and computationally efficient destination prediction algorithm where limited training data are available. For example, it can be very challenging and expensive to collect data sets of 3D freehand pointing gestures that sufficiently sample possible paths/trajectories to the display, starting locations of the gesture (e.g., steering wheel, armrest and others), road/driving conditions, context of use, user interface design, screen size/reach, etc. Instead, suitable state space models are employed here, albeit with a few unknown parameters, as is common in object tracking. They enable modeling and robustly inferring the intended endpoint of a tracked object, especially since the possible intentions form a finite set of nominal destinations, for example selectable interface items. Consequently, the introduced Bayesian intent predictors have minimal training requirements.
1.3. Paper layout
The remainder of the paper is organized as follows. The overall inference framework, various approaches to modeling intent, and the system model are described in Section 2. Destination predictors for linear and nonlinear settings are then outlined in Section 3. Results using real pointing data, recorded by in-vehicle predictive touch prototypes under various road conditions, are presented in Section 4, and conclusions are drawn in Section 5.
2. Bayesian Framework: Modeling Intent and Overall System
Here, the destination inference problem is treated within a Bayesian framework. Let $ \unicode{x1D53B}=\left\{{\mathcal{D}}_i:i=1,2,\dots, {N}_{\mathcal{D}}\right\} $ be the set of $ {N}_{\mathcal{D}} $ nominal endpoints (e.g., selectable on-display interface icons) of a tracked object (e.g., a pointing finger-tip). The objective is to sequentially calculate the probability of each destination (i.e., selectable interface components) $ {\mathcal{D}}_i\in \unicode{x1D53B} $ being the intended endpoint at the current/latest time instant $ {t}_n $ , thus $ p\left(\mathcal{D}={\mathcal{D}}_i|{\mathbf{y}}_{0:n}\right), i=1,2,\dots, {N}_{\mathcal{D}} $ , from the available sensory measurements $ {\mathbf{y}}_{0:n}=\left\{{\mathbf{y}}_0,{\mathbf{y}}_1,\dots, {\mathbf{y}}_n\right\} $ . We recall that in a predictive touch system observations $ {\mathbf{y}}_{0:n} $ are provided by the gesture tracker and other sensors at the successive time instants $ \left\{{t}_0,{t}_1,\dots, {t}_n\right\} $ , for instance, $ {\mathbf{y}}_n $ is the 3D Cartesian coordinates of the pointing finger/hand at $ {t}_n $ as in Figure 1. For each $ {\mathcal{D}}_i\in \unicode{x1D53B} $ and per Bayes’ rule, we have
$$ p\left(\mathcal{D}={\mathcal{D}}_i|{\mathbf{y}}_{0:n}\right)\propto p\left({\mathbf{y}}_{0:n}|\mathcal{D}={\mathcal{D}}_i\right)p\left(\mathcal{D}={\mathcal{D}}_i\right), \tag{1} $$
where $ p\left(\mathcal{D}={\mathcal{D}}_i\right) $ is the prior on the $ i $ th possible destination. In predictive touch this prior can be attained from semantic data, frequency of use, interface design, other sensory data, etc. The task of the inference module (i.e., intent predictor in Figure 1) at $ {t}_n $ is hence to estimate the likelihoods $ p\left({\mathbf{y}}_{0:n}|\mathcal{D}={\mathcal{D}}_i\right) $ , $ i=1,2,\dots, {N}_{\mathcal{D}} $ . This makes the Bayesian formulation particularly appealing since additional contextual information can be easily incorporated, whenever available.
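As a minimal sketch of this Bayes' rule step, the snippet below normalizes the product of prior and likelihood over the $ {N}_{\mathcal{D}} $ destinations. The function name `destination_posterior` and the numerical values are illustrative assumptions, the likelihoods are taken as already computed by some predictor, and the priors are assumed to be strictly positive.

```python
import math

def destination_posterior(log_likelihoods, priors):
    """Normalise prior * likelihood over the N_D nominal destinations."""
    # Work in the log domain and subtract the maximum to avoid underflow
    log_post = [ll + math.log(p) for ll, p in zip(log_likelihoods, priors)]
    m = max(log_post)
    weights = [math.exp(v - m) for v in log_post]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical values of log p(y_{0:n} | D_i) for three icons, uniform prior
posterior = destination_posterior([-12.0, -10.0, -15.0], [1 / 3, 1 / 3, 1 / 3])
most_likely = max(range(len(posterior)), key=posterior.__getitem__)
```

Working with log-likelihoods is a design choice made here for numerical stability, since likelihoods of long trajectories can become very small.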
2.1. Destination-driven motion models
A key challenge within the introduced Bayesian approach is employing suitable motion models that represent the influence of intent on the object motion and devising inference algorithms to reveal it. The object motion (e.g., pointing gesture movement) towards an intended item on a display is not deterministic, nor does it necessarily take the shortest path to the endpoint. This is because the movement is driven by a very complex sensorimotor system that is capable of autonomous action based on various modalities (e.g., vision, with the ability to utilize feedback on the action). It is also subject to various constraints (e.g., to optimize the action required to deliver/predict smooth movement trajectories and minimize the variance of the eye or arm’s position, in the presence of biological noise due to mechanical properties of muscles) and is possibly perturbed by external forces, such as those due to road/driving conditions or walking; see Harris and Wolpert (Reference Harris and Wolpert1998). Thereby, models of such motion are intrinsically uncertain and any prediction of the object movements at a future time instant should not be a single point following a particular deterministic path. Instead, it should be expressed as a probability distribution in space.
Stochastic processes can adequately capture the aforementioned motion uncertainties, where state $ {\mathbf{x}}_n $ (e.g., pointing finger true position in 3D) at $ {t}_n $ is related to its position at the previous time step $ {t}_{n-1} $ , according to a given probability distribution defined by the following evolution of the state over time
where $ {\mathbf{f}}_{i,h}(.) $ is the state transition function between $ {t}_{n-1} $ and $ {t}_n $ and $ h={t}_n-{t}_{n-1} $ . Here, this function can be nonlinear and it is assumed to be dependent on the intended endpoint $ {\mathcal{D}}_i $ ; hence the subscript index $ i $ . The term $ {\boldsymbol{\varepsilon}}_{n-1} $ is the process noise, which is often assumed to be independent and identically distributed (i.i.d.) and represents the uncertainty in motion. For example, a zero-mean Gaussian process noise with covariance $ {\mathbf{Q}}_{i,h} $ and a linear time-invariant transition function, for example $ {\mathbf{x}}_{n,i}={\mathbf{F}}_{i,h}{\mathbf{x}}_{n-1,i}+{\boldsymbol{\mu}}_{i,h}+{\boldsymbol{\varepsilon}}_{n-1} $ , lead to a transition density of the state at $ {t}_n $ described by a multivariate Gaussian distribution, $ p\left({\mathbf{x}}_{n,i}|{\mathbf{x}}_{n-1,i}\right)=\mathcal{N}\left({\mathbf{x}}_{n,i}|{\mathbf{F}}_{i,h}{\mathbf{x}}_{n-1,i}+{\boldsymbol{\mu}}_{i,h},{\mathbf{Q}}_{i,h}\right) $ . Its mean depends on the previous state $ {\mathbf{x}}_{n-1,i} $ and the input term $ {\boldsymbol{\mu}}_{i,h} $ , and its covariance is $ {\mathbf{Q}}_{i,h} $ . The latter represents the potential level of uncertainty between successive movements.
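To illustrate why a prediction is a distribution rather than a single point, the following sketch propagates the mean and variance of a Gaussian through a scalar version of the linear transition above. The scalar simplification, the function name `predict_moments` and the parameter values are assumptions made purely for illustration.

```python
def predict_moments(mean, var, F, mu, Q, steps):
    """Propagate N(mean, var) through x_n = F * x_{n-1} + mu + eps, eps ~ N(0, Q)."""
    for _ in range(steps):
        mean = F * mean + mu      # the mean follows the deterministic part
        var = F * F * var + Q     # process noise accumulates at every step
    return mean, var

# Hypothetical values: a known starting point, predicted ten steps ahead
m, v = predict_moments(mean=0.0, var=0.0, F=1.0, mu=0.1, Q=0.01, steps=10)
```

Even with a perfectly known starting point, the predicted variance grows with the prediction horizon, which is the uncertainty that the covariance $ {\mathbf{Q}}_{i,h} $ encodes.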
2.1.1. Linear Gaussian motion models
Approximate motion models that enable inferring intent, rather than modeling the object motion exactly, can suffice for the task of destination prediction. Under this assertion, Gaussian Linear Time Invariant (LTI) models can be particularly favorable since they can be easily formulated and lead to computationally efficient prediction algorithms, compared with nonlinear non-Gaussian models (Godsill, Reference Godsill2007; Haug, Reference Haug2012). Next, two classes of Gaussian LTI intent-driven models, namely mean-reverting and bridging distributions, are introduced.
2.1.1.1. Linear Gaussian mean reverting models.
The OU process, with its mean reverting property, offers an effective way to model destination-driven behavior. By setting the mean term of the underlying model according to the destination information, the target reverts to the premeditated endpoint and finally arrives somewhere nearby. Denoting the continuous-time, destination-dependent target state by the vector $ {\mathbf{x}}_{t,i} $ , the OU-based models can be described in continuous time by the following stochastic differential equation (SDE),
where $ {\beta}_t $ is a multivariate standard Wiener process. For a 3D pointing movement, $ {\mathbf{x}}_{t,i}={\left[{\mathbf{x}}_{t,i,1}^{\prime },{\mathbf{x}}_{t,i,2}^{\prime },{\mathbf{x}}_{t,i,3}^{\prime}\right]}^{\prime } $ , with $ {\mathbf{x}}_{t,i,s}\in {\mathrm{\mathbb{R}}}^2 $ (position and velocity) or $ {\mathrm{\mathbb{R}}}^3 $ (position, velocity and acceleration), $ s\in \left\{1,2,3\right\} $ , and
Different orders of kinematics included in each “substate” $ {\mathbf{x}}_{t,i,s} $ , along with the corresponding parameters, lead to distinct SDEs as per (3), for instance: (a) the mean reverting diffusion (MRD) model, which only includes position in the state (Ahmad et al., Reference Ahmad, Murphy, Langdon, Godsill, Hardy and Skrypchuk2016b), (b) the equilibrium reverting velocity (ERV) model, which represents position and velocity (Ahmad et al., Reference Ahmad, Murphy, Langdon, Godsill, Hardy and Skrypchuk2016b), and (c) the equilibrium reverting acceleration (ERA) model, which represents position, velocity, and acceleration (Gan et al., Reference Gan, Liang, Ahmad and Godsill2019). These three models have similar mean reverting behavior, that is, the state will revert to the mean term $ {\boldsymbol{\mu}}_i $ , for example set as the destination position for MRD and with (nearly) zero velocity and acceleration for ERV and ERA, respectively.
For simplicity, only the setup of the ERA model is discussed here; the other models follow a similar rationale, and the reader is referred to Ahmad et al. (Reference Ahmad, Murphy, Langdon, Godsill, Hardy and Skrypchuk2016b) for further details. For ERA, the submatrices and vectors in Equation (4) for the $ s $ th dimension are
where $ {p}_{i,s} $ is the position of destination $ {\mathcal{D}}_i $ in the $ s $ th dimension, and $ {\dot{x}}_{t,i,s},{\ddot{x}}_{t,i,s} $ denote the first and second time derivatives (velocity and acceleration) of $ {x}_{t,i,s} $ . The above setup assumes independent transitions for each coordinate; specifically, it can be specified by the following SDE,
One can see that the object motion governed by such an SDE will initially gravitate to the destination position (i.e., $ {p}_{i,s} $ prescribed in the mean vector $ {\boldsymbol{\mu}}_{i,s} $ of this OU process) with increasing acceleration due to the positive reversion factor $ \eta $ ; the positive damping factors $ \rho $ and $ \gamma $ then ensure that the target slows down and arrives at the destination in an equilibrium state, with nearly zero velocity and acceleration. This velocity behavior is illustrated by the blue line in Figure 2a, which is the deterministic transition (i.e., with $ \sigma $ set to zero) of the norm velocity of the ERA model. The norm velocities of an ERA model depicted in Figure 2a, that is, sample realizations as well as their mean, are generated with parameters manually tuned to maximize the intent prediction accuracy. They noticeably capture, on average, an overall profile similar to that exhibited by the real pointing gesture data shown in Figure 2b.
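The rise-then-decay speed profile described above can be reproduced with a rough numerical sketch of the noise-free ERA dynamics in one dimension. The sketch assumes the drift takes the form $ d\ddot{x}=\left[\eta \left(p-x\right)-\rho \dot{x}-\gamma \ddot{x}\right] dt $ , which is an assumption consistent with the roles given to $ \eta $ , $ \rho $ and $ \gamma $ in the text. The parameter values and the function name `simulate_era_mean` are hypothetical and are not the tuned values used for Figure 2a.

```python
def simulate_era_mean(p, eta, rho, gamma, x0=0.0, dt=1e-3, duration=10.0):
    """Euler integration of the noise-free (sigma = 0) ERA dynamics in 1D.

    Assumed drift: d(acc) = [eta * (p - pos) - rho * vel - gamma * acc] dt.
    """
    pos, vel, acc = x0, 0.0, 0.0
    speeds = []
    for _ in range(int(round(duration / dt))):
        jerk = eta * (p - pos) - rho * vel - gamma * acc
        # Simultaneous update so that each term uses the previous step's values
        pos, vel, acc = pos + vel * dt, vel + acc * dt, acc + jerk * dt
        speeds.append(abs(vel))
    return pos, vel, acc, speeds

# Illustrative parameters chosen to give stable, non-oscillatory dynamics
pos, vel, acc, speeds = simulate_era_mean(p=1.0, eta=8.0, rho=12.0, gamma=6.0)
peak_index = max(range(len(speeds)), key=speeds.__getitem__)
```

With these values the speed peaks partway through the movement and then decays, and the state settles at the destination with near-zero velocity and acceleration.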
Solving (3) yields the general discrete LTI transition function for all three models (MRD, ERV and ERA) as per,
such that
Here, $ \mathbf{A} $ , $ {\boldsymbol{\mu}}_i $ , and $ \boldsymbol{\sigma} $ are parameters set for the specific model, and $ \mathbf{I} $ is the identity matrix of the corresponding size. The derivation of this solution and the calculation of $ {\mathbf{Q}}_{i,h} $ can be found in Ahmad et al. (Reference Ahmad, Murphy, Langdon, Godsill, Hardy and Skrypchuk2016b) and references therein. Note that $ {\mathbf{x}}_{t,i} $ in such models is constructed to revert to the destination $ {\mathcal{D}}_i $ , and thus the transition function (7) can be equivalently described as the destination-conditioned transition density, that is, $ p\left({\mathbf{x}}_{n+1}|{\mathbf{x}}_n,{\mathcal{D}}_i\right)=p\left({\mathbf{x}}_{n+1,i}|{\mathbf{x}}_{n,i}\right) $ , where $ {\mathbf{x}}_n $ denotes the general state (without destination conditioning) and the conditioning on $ {\mathcal{D}}_i $ is introduced through the destination reverting construction.
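For intuition, the simplest scalar case (an MRD-like model with position only) admits a well-known closed-form discretization. The sketch below assumes the scalar SDE $ dx=\lambda \left(\mu -x\right) dt+\sigma d\beta $ with $ \lambda >0 $ ; the symbol $ \lambda $ , the function name `ou_discretise` and the numerical values are illustrative assumptions rather than the parameterization used in the cited work.

```python
import math

def ou_discretise(lam, mu, sigma, h):
    """Exact discretisation of the scalar OU SDE dx = lam * (mu - x) dt + sigma dB."""
    F = math.exp(-lam * h)                 # state transition factor
    M = (1.0 - F) * mu                     # destination-dependent offset
    Q = sigma**2 * (1.0 - math.exp(-2.0 * lam * h)) / (2.0 * lam)  # noise variance
    return F, M, Q

# Hypothetical destination at mu = 0.5, sampled every 20 ms
F, M, Q = ou_discretise(lam=2.0, mu=0.5, sigma=0.1, h=0.02)
mean = 0.0
for _ in range(500):
    mean = F * mean + M                    # the predicted mean drifts towards mu
```

Because the discretization is exact for any $ h $ , the same function can be called with a different step for each pair of measurements, which is what allows asynchronous observations to be handled.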
2.1.1.2. Bridging distributions.
While the destination information is modeled above by the mean of the OU process, another approach to incorporating such knowledge is provided by the bridging distributions method. This is particularly relevant if we use a known or legacy motion model which does not encapsulate the influence of intent on the object motion, as with numerous models in the tracking literature, for instance the nearly constant velocity (CV) and acceleration (CA) models; see Li and Jilkov (Reference Li and Jilkov2003) for a comprehensive overview. Additionally, in some scenarios, an OU process might not accurately characterize the destination reverting behavior of the tracked object. In such cases, BD permits less constrained underlying motion dynamics while still ensuring the object arrives at or near its endpoint.
Bridging distributions capture the destination influence on the target behavior by constructing a Markov bridge between the intended endpoint and the target current state at $ {t}_n $ . This capitalizes on the premise that the trajectory followed by the object (e.g., pointing finger) must terminate at the endpoint (on-display selectable interface item), at arrival time $ \mathcal{T} $ , despite the random behavior between the current time step $ {t}_n $ and $ \mathcal{T} $ . BD accordingly introduces this knowledge into a motion model via a prior and facilitates destination-aware behavior modeling without requiring the development of specialized stochastic processes that are intrinsically intent-driven. Nonetheless, BD may also be applied to OU-type models, whether or not their means are dictated by a destination; for an endpoint-driven OU process, BD can reduce the sensitivity to parameterization, as discussed in Ahmad et al. (Reference Ahmad, Murphy, Langdon and Godsill2018).
Assuming that the target will reach the destination at time $ {t}_N=\mathcal{T} $ , a terminal state is defined as $ {\mathbf{x}}_N $ . A bridged state transition distribution in a Markovian system, which conditions on the destination and the arrival time, can be expressed as the conditional distribution $ p\left({\mathbf{x}}_n|{\mathbf{x}}_{n-1},{\mathcal{D}}_i,\mathcal{T}\right) $ . There exist several ways of finding this conditional density, and they may differ depending on the assumption(s) made. For example, Ahmad et al. (Reference Ahmad, Murphy, Langdon and Godsill2018) assumes the terminal state $ {\mathbf{x}}_N $ has exactly the same position as the destination $ {\mathcal{D}}_i $ , and the destination-related information is introduced via a Gaussian prior at $ {t}_0 $ , $ p\left({\mathbf{x}}_N|{\mathcal{D}}_i,\mathcal{T}\right)=\mathcal{N}\left({\mathbf{x}}_N|{\mathbf{a}}_i,{\varSigma}_i\right) $ with $ {\mathbf{a}}_i $ being the mean, $ {\varSigma}_i $ the covariance matrix and $ i=1,2,\dots, {N}_{\mathcal{D}} $ . This covariance can model the size and orientation of the endpoint, and hence with BD destinations can be regions rather than single spatial points as with OU-type models. Based on this assumption, the sought transition density $ p\left({\mathbf{x}}_n|{\mathbf{x}}_{n-1},{\mathcal{D}}_i,\mathcal{T}\right) $ is a Markov transition density for the current state ( $ {\mathbf{x}}_n $ ), conditioning on its terminal state ( $ {\mathbf{x}}_N $ ), i.e.,
Given that the terminal state $ {\mathbf{x}}_N $ is fixed, one can construct a joint state vector $ {\mathbf{z}}_n={\left[{\mathbf{x}}_n,{\mathbf{x}}_N\right]}^{\prime } $ and obtain the transition density for $ {\mathbf{z}}_n $ accordingly. The joint state transition will ultimately lead $ {\mathbf{x}}_n $ to its terminal state $ {\mathbf{x}}_N $ , which follows the prior $ p\left({\mathbf{x}}_N|{\mathcal{D}}_i,\mathcal{T}\right) $ . When observations are available, such a construction of $ {\mathbf{z}}_n $ permits joint estimation of the destination and the kinematic state.
An alternative formulation of BD can be found in Liang et al. (Reference Liang, Ahmad, Gan, Langdon, Hardy and Godsill2019), in which the destination information is interpreted as a “pseudo-observation” instead of as a state prior. Specifically, a linear and Gaussian pseudo-observation model,
was considered with $ \tilde{\mathbf{G}} $ being the mapping matrix. It was shown in Liang et al. (Reference Liang, Ahmad, Gan, Langdon, Hardy and Godsill2019), Algorithm 2, that this interpretation leads to the following destination-conditioned state transition density,
where the Markovian assumption is preserved between the terminal state and the initial state.
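To give a feel for how endpoint information reshapes a transition density, the sketch below works through a scalar toy case: Brownian motion with diffusion constant `sigma`, whose value at the arrival time `T` is distributed as a Gaussian with mean `a` and variance `Sigma`. It assumes the bridged density is proportional to the product of the unconstrained transition and a Gaussian term linking the next state to the endpoint; this toy model and the function name `bridged_transition` are illustrative assumptions and not the models used in the paper.

```python
def bridged_transition(x_prev, t_prev, t_next, T, a, Sigma, sigma):
    """Mean and variance of p(x_next | x_prev, endpoint ~ N(a, Sigma) at time T)
    for scalar Brownian motion (requires t_prev < t_next <= T)."""
    h = t_next - t_prev
    prior_var = sigma**2 * h                      # unconstrained transition variance
    pseudo_var = sigma**2 * (T - t_next) + Sigma  # uncertainty linking x_next to endpoint
    total = prior_var + pseudo_var
    # Product of two Gaussians in x_next: a precision-weighted combination
    mean = (x_prev * pseudo_var + a * prior_var) / total
    var = prior_var * pseudo_var / total
    return mean, var

# Hypothetical step from t = 0 to t = 0.25 towards an exactly known endpoint at T = 1
mean, var = bridged_transition(x_prev=0.0, t_prev=0.0, t_next=0.25,
                               T=1.0, a=1.0, Sigma=0.0, sigma=1.0)
```

With `Sigma = 0` this reduces to the classical Brownian bridge: the mean moves a fraction $ h/\left(\mathcal{T}-{t}_{n-1}\right) $ of the way to the endpoint and the variance shrinks as the arrival time approaches. A larger `Sigma` weakens the pull, which is one way of picturing a destination as a region rather than a point.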
Motivated by the pseudo-observation–based formulation of BD, in this paper we introduce a new intent prediction algorithm which utilizes (11) as its main ingredient. Similar to Ahmad et al. (Reference Ahmad, Murphy, Langdon and Godsill2018) and Liang et al. (Reference Liang, Ahmad, Gan, Langdon, Hardy and Godsill2019), we will focus on linear Gaussian models because they lead to analytically tractable results. First, consider the following LTI SDE, where $ {\mathbf{x}}_t={\left[{x}_t,{\dot{x}}_t,{\ddot{x}}_t,{y}_t,{\dot{y}}_t,{\ddot{y}}_t,{z}_t,{\dot{z}}_t,{\ddot{z}}_t\right]}^{\prime } $ has the same physical meaning as in Section 2.1.1 (i.e., position, velocity and acceleration in 3D Cartesian coordinates),
with $ {\mathbf{A}}_s=\left[\mathrm{0,1,0};\mathrm{0,0,1};\mathrm{0,0,0}\right] $ (see again Section 2.1.1 for further details related to the noise components). It can be shown that the transition density resulting from this SDE is of the form:
with $ {\mathbf{F}}_h $ being the state transition matrix, $ {\mathbf{Q}}_h $ the process noise covariance and $ h={t}_n-{t}_{n-1} $ . In comparison to (7), this transition density has no dependency on a destination. When the process noise level is relatively low, (13) corresponds to the nearly CA model (also known as the Wiener-process acceleration model). Substituting (13) into (11) yields
where
Here (14) serves as the transition density for the pseudo-observation process, that is, it satisfies $ p\left({\mathbf{x}}_n|{\mathcal{D}}_i,\mathcal{T}\right) $ , based on which the state will evolve under the guidance of destination information. Figure 3 gives an example of the marginal distributions obtained according to the above pseudo-observation based process, where the influence of the endpoint on the state distribution over time is evident. It can also be shown that the limiting distribution, $ {\lim}_{t_N=\mathcal{T}\to \infty }p\left({\mathbf{x}}_N|{\mathcal{D}}_i,\mathcal{T}\right) $ , of a state process having (14) as its transition density equates to $ \mathcal{N}\left({\mathbf{a}}_i,{\varSigma}_i\right) $ when $ \tilde{\mathbf{G}}=\mathbf{I} $ . Moreover, setting $ \tilde{\mathbf{G}}=\mathbf{I} $ and $ {\varSigma}_i=\mathbf{0} $ produces the same state transition density as (9), namely a canonical Gaussian bridge (Gasbarra et al., Reference Gasbarra, Sottinen and Valkeila2007) terminated at a certain state, since the endpoint $ {\mathbf{x}}_N $ is then known with certainty. It should be stressed that the form of the mapping matrix $ \tilde{\mathbf{G}} $ depends on what destination-related information is available, and thus it is not necessarily equal to an identity matrix; any such matrix is accommodated by (14).
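For reference, the per-axis transition matrix and process noise covariance of the nearly CA (Wiener-process acceleration) model have a standard closed form, which the sketch below builds for a single coordinate. It assumes the acceleration is driven by white noise of intensity `sigma**2`; the function names `ca_transition` and `propagate` and the numerical values are illustrative.

```python
def ca_transition(h, sigma):
    """Per-axis F_h and Q_h of the nearly constant acceleration model,
    for the state [position, velocity, acceleration]."""
    F = [[1.0, h, h**2 / 2.0],
         [0.0, 1.0, h],
         [0.0, 0.0, 1.0]]
    Q = [[h**5 / 20.0, h**4 / 8.0, h**3 / 6.0],
         [h**4 / 8.0, h**3 / 3.0, h**2 / 2.0],
         [h**3 / 6.0, h**2 / 2.0, h]]
    Q = [[sigma**2 * q for q in row] for row in Q]
    return F, Q

def propagate(F, state):
    """Noise-free one-step prediction F @ state using plain lists."""
    return [sum(f * x for f, x in zip(row, state)) for row in F]

F, Q = ca_transition(h=0.5, sigma=0.2)
predicted = propagate(F, [0.0, 1.0, 2.0])  # start at rest position, v = 1, a = 2
```

For the full 3D state, the same blocks would be repeated along the diagonal, one per coordinate, under the independence assumption between axes.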
The state transition distributions in Equations (9) and (11) build the destination knowledge into the state dynamics and thus form the basis of BD-based destination-driven (or destination-constrained) motion models. For all nominal destinations $ {\mathcal{D}}_i\in \unicode{x1D53B} $ , $ {N}_{\mathcal{D}} $ such bridges are constructed, one per endpoint. In scenarios where we want the terminal state $ {\mathbf{x}}_N $ at $ {t}_N $ as well as $ {\mathbf{x}}_n $ at the current time step $ {t}_n $ to be jointly estimated, the transition model prescribed by (9) may be utilized. However, if the main objective is to predict the intended destination, as in this paper, with available information on the nominal endpoints (e.g., a certain region/area represented by an ellipsoidal shape), (11) can be used to construct a computationally efficient predictor, since the hidden state dimension in this case is smaller than that of the joint estimation scheme (which includes $ {\mathbf{x}}_N $ ). In Section 3.1.2, we present a new intent predictor based on the destination-constrained prior as with (11). In comparison to Ahmad et al. (Reference Ahmad, Murphy, Langdon, Godsill, Hardy and Skrypchuk2016b), the new predictor requires fewer computations as it does not estimate the terminal state at $ {t}_N $ . It is constructed using a pseudo-observation, and therefore the underlying state process is still a Markov process. It also differs from the pseudo-observation based intent predictors presented in Liang et al. (Reference Liang, Ahmad, Gan, Langdon, Hardy and Godsill2019) in that it utilizes a destination-constrained state transition density throughout the filtering procedure (although this implies a slightly higher computational burden). Finally, a pseudo-measurement technique for jointly estimating the object state and its destination is presented in Zhou et al. (Reference Zhou, Li and Kirubarajan2020) based on a linear equality constraint.
It dictates that the object follows some straight line to its intended endpoint. Although this simplifies the inference procedure as the condition of arrival time is avoided, it does not capture realistic motion behavior of several objects of interest (e.g., constraint-free pointing motion in 3D). On the contrary, the presented stochastic modeling is general and does not impose such restrictive constraints on the target trajectory.
2.1.2. Nonlinear motion models: conditionally linear Gaussian settings
The computationally efficient Gaussian model assumes that the change in the object motion (i.e., pointing movements) in any time interval always follows a Gaussian distribution. However, for some irregular movements which cause rapid spatial changes (e.g., jolts in the pointing motion due to perturbations or any external nonintent-driven force), such an assumption is unsuitable and can lead to large inference errors. In order to model such erratic perturbations-induced maneuvers, we introduce a pure jump process to the original (destination-aware) Gaussian processes. Such formulations are known as jump diffusion models or Markov/semi-Markov jump models.
The adopted jump diffusion models retain the Brownian motion as one of the driving noise terms, and thus they can be considered conditionally linear Gaussian systems. In particular, when conditioned on the non-Gaussian pure jump process, the dynamics can be written in a standard Gaussian form, which ensures computational efficiency.
Such approaches have been extensively adopted in financial modeling to describe discrete movements (Kou, Reference Kou2002), and in the object tracking field to capture sudden maneuvers undertaken by the target or induced by external forces (Godsill, Reference Godsill2007). Owing to their clear physical interpretation and computational tractability, such jump diffusion dynamical models have also been employed in Ahmad et al. (Reference Ahmad, Murphy, Langdon and Godsill2014) and Gan et al. (Reference Gan, Liang, Ahmad and Godsill2019) within a predictive touch system under high levels of perturbations due to road-driving conditions. The approach presented in Ahmad et al. (Reference Ahmad, Murphy, Langdon and Godsill2014) embedded a self-decay jump process within a Gaussian process to pre-process the highly perturbed pointing data, with the aim of obtaining a smoothed trajectory for the later intent inference task, whereas Gan et al. (Reference Gan, Liang, Ahmad and Godsill2019) introduce a jump diffusion model for a unified scheme for destination and state estimation. In this paper, we mainly discuss the latter, more recent work given its improved performance.
Since the target motion (e.g., pointing gesture movements) impacted by severe external perturbations or fast maneuvering is still destination reverting, we consider the jump diffusion model based on the following linear mean reverting SDE (3)
where most parameters have the same definition as described in the previous sections. If we assume that the jumps only occur at key driving elements of the state (e.g., position for MRD, or acceleration in ERA), the parameter $ \mathbf{B}=\operatorname{diag}\left\{{\mathbf{B}}_1,{\mathbf{B}}_2,{\mathbf{B}}_3\right\} $ (for 3D movements) such that $ {\mathbf{B}}_s={\left[0,0,1\right]}^{\prime } $ $ \left(s=1,2,3\right) $ for ERA and $ {\left[0,1\right]}^{\prime } $ for ERV. The multivariate jump process $ {\mathbf{J}}_t $ here is a compound Poisson process with Gaussian distributed jump sizes. Specifically, we have $ {\mathbf{J}}_t={\sum}_{\tau_k<t}{\mathbf{S}}_k $ , with the jump size $ {\mathbf{S}}_k\in {\mathrm{\mathbb{R}}}^3 $ and $ {\mathbf{S}}_k\sim \mathcal{N}\left({\mathbf{S}}_k|{\boldsymbol{\mu}}_J,{\varSigma}_J\right) $ . Note that if isotropically distributed jumps are considered (i.e., the jumps in each direction of the space are identically distributed), the parameters can be simplified to $ {\boldsymbol{\mu}}_J=\mathbf{0} $ and $ {\varSigma}_J={\sigma}_J^2\mathbf{I} $ , where $ {\sigma}_J $ is the standard deviation of the jump size in any dimension. The jump times $ {\tau}_k $ , which follow a Poisson process, have the property that $ {\tau}_k-{\tau}_{k-1}\sim {\exp}_{\lambda_J}\left(\cdot \right) $ , where $ {\lambda}_J^{-1} $ is the mean value of the jump interarrival time.
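As an illustration of the jump process just described, the following minimal sketch draws jump times with exponential interarrival times and isotropic zero-mean Gaussian jump sizes, and then evaluates $ {\mathbf{J}}_t $ as the sum of the jumps with $ {\tau}_k<t $ . The function names, the rate of 5 jumps per second, and $ {\sigma}_J=0.5 $ are illustrative assumptions and are not taken from the paper.

```python
import random


def sample_jump_times(t_start, t_end, rate, rng):
    """Draw the jump times of a Poisson process falling in (t_start, t_end].

    Interarrival times are exponential with mean 1 / rate; by memorylessness
    the first interarrival can be drawn afresh from t_start.
    """
    times = []
    t = t_start + rng.expovariate(rate)
    while t <= t_end:
        times.append(t)
        t += rng.expovariate(rate)
    return times


def sample_isotropic_jumps(num_jumps, sigma_j, rng, dim=3):
    """Jump sizes S_k ~ N(0, sigma_j^2 I), one dim-vector per jump."""
    return [[rng.gauss(0.0, sigma_j) for _ in range(dim)]
            for _ in range(num_jumps)]


def jump_process_value(jump_times, jump_sizes, t, dim=3):
    """Evaluate J_t as the sum of all jump sizes with tau_k < t."""
    total = [0.0] * dim
    for tau, size in zip(jump_times, jump_sizes):
        if tau < t:
            for d in range(dim):
                total[d] += size[d]
    return total


# Illustrative values only (assumed, not from the paper).
rng = random.Random(0)
taus = sample_jump_times(0.0, 2.0, rate=5.0, rng=rng)
sizes = sample_isotropic_jumps(len(taus), sigma_j=0.5, rng=rng)
J_end = jump_process_value(taus, sizes, t=2.0)
```

Passing a seeded `random.Random` instance keeps the draw reproducible, which is convenient when the same jump-time samples are to be reused across several filters.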
Solving SDE (15) yields the transition density as follows,
with
where $ \mathbf{F} $ , $ \mathbf{M} $ and $ \mathbf{Q} $ have been defined in (8), and jump time sequence $ {\tau}_{n:n+1} $ consists of all jump times that occurred in the interval $ \left({t}_n,{t}_{n+1}\right] $ , that is $ {\tau}_{n:n+1}={\cup}_{t_n<{\tau}_k\le {t}_{n+1}}{\tau}_k $ .
2.1.3. Observation model
The available sensory measurement $ {\mathbf{y}}_n $ (e.g., gesture-tracker output) is a noisy observation of the true hidden state $ {\mathbf{x}}_n $ (e.g., pointing finger actual location). In a state space form, it is described at time $ {t}_n $ by
where $ {\mathbf{h}}_n(.) $ is the mapping from the hidden state to the observed measurement(s) and $ {\mathbf{w}}_n $ is the measurement noise. Here, for simplicity, a linear and Gaussian measurement model can be assumed such that $ {\mathbf{y}}_n={\mathbf{Hx}}_{n,i}+{\mathbf{w}}_n $ , with zero-mean i.i.d. Gaussian noise $ {\mathbf{w}}_n\sim \mathcal{N}\left(\mathbf{0},{\mathbf{V}}_n\right) $ . For instance, if the gesture tracker provides locations of the pointing finger in 3D and the latent state $ {\mathbf{x}}_{n,i}\in {\mathrm{\mathbb{R}}}^3 $ consists of the object location, the measurement matrix in (19) is a $ 3\times 3 $ identity matrix, $ \mathbf{H}={\mathbf{I}}_3 $ . The noise covariance matrix $ {\mathbf{V}}_n $ is specified by the tracker accuracy, that is, how accurately it determines the pointing finger position.
The overall system is described by the motion and observation models in (2) and (19), respectively. Next, we introduce various destination inference algorithms to estimate the sought probabilities $ p\left(\mathcal{D}={\mathcal{D}}_i|{\mathbf{y}}_{1:n}\right),{\mathcal{D}}_i\in \unicode{x1D53B} $ . As shown below, the complexity of the intent inference routine depends on the employed motion model. For instance, a Gaussian LTI set-up leads to a simple and computationally efficient Kalman-filter-based predictor for the destination inference task.
3. Destination Prediction
Recall from (1) that the key to sequentially infer the probability of the destination $ {\mathcal{D}}_i $ being the intended one is to estimate the likelihood $ p\left({\mathbf{y}}_{0:n}|\mathcal{D}={\mathcal{D}}_i\right) $ . Furthermore, this likelihood can be recursively expanded according to prediction error decomposition (PED; Harvey, Reference Harvey1990) given by
where we have abbreviated the condition $ \mathcal{D}={\mathcal{D}}_i $ as $ {\mathcal{D}}_i $ henceforth to simplify notation. This sequential likelihood estimation serves as the basis of online Bayesian intent predictor as it only requires the evaluation of predictive likelihood $ p\left({\mathbf{y}}_n|{\mathbf{y}}_{0:n-1},{\mathcal{D}}_i\right) $ at each time instant. In this section, we discuss the strategy to compute this predictive likelihood for the various models introduced in Section 2.
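The sequential structure described above can be sketched as follows: at each time step the per-destination likelihood is multiplied by the new predictive likelihood, and Bayes' rule over the finite destination set then gives the posterior. Working in the log domain is a numerical-stability choice made here, not something prescribed by the paper, and the predictive likelihood values are made-up placeholders for the output of a filter.

```python
import math


def update_log_likelihoods(log_liks, log_pred_liks):
    """One PED step per destination:
    log p(y_{0:n}|D_i) = log p(y_{0:n-1}|D_i) + log p(y_n|y_{0:n-1}, D_i).
    """
    return [ll + lp for ll, lp in zip(log_liks, log_pred_liks)]


def destination_posterior(log_liks, log_priors):
    """Posterior over the destinations via Bayes' rule (log-sum-exp form)."""
    log_joint = [ll + lp for ll, lp in zip(log_liks, log_priors)]
    m = max(log_joint)
    unnorm = [math.exp(v - m) for v in log_joint]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Three nominal destinations with a uniform prior.
num_dest = 3
log_priors = [math.log(1.0 / num_dest)] * num_dest
log_liks = [0.0] * num_dest

# Placeholder predictive likelihoods for two time steps (assumed values).
pred_liks_per_step = [[0.2, 0.5, 0.3], [0.1, 0.6, 0.3]]
for pred in pred_liks_per_step:
    log_liks = update_log_likelihoods(log_liks, [math.log(p) for p in pred])
    posterior = destination_posterior(log_liks, log_priors)
```

Only the running log-likelihood per destination needs to be stored between time steps, which is what makes the predictor suitable for online use.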
3.1. LTI Gaussian systems
The destination reverting models in Section 2.1.1 are devised in a Gaussian LTI form, which leads to linear Gaussian transition densities. Meanwhile, a linear Gaussian observation model (e.g., for an off-the-shelf gesture tracker) is assumed for (19). The standard Kalman filter is then sufficient to carry out the recursive filtering for intent inference, namely to produce the (optimal in the mean least squares error sense) PED (Haug, Reference Haug2012), rather than the conventional state estimation task as shown next.
3.1.1. OU-based intent predictors
Since the destination-conditioned transition function for the OU-based model, for example in (7), and the adopted observation function are both linear Gaussian, the estimated target state can be explicitly described by a normal distribution. Specifically,
The predictive likelihood can be computed as follows,
and this leads to a Gaussian density description $ p\left({\mathbf{y}}_n|{\mathbf{y}}_{1:n-1},{\mathcal{D}}_i\right)=\mathcal{N}\left({\mathbf{y}}_n|{\boldsymbol{\mu}}_{y_n},{\mathbf{C}}_{y_n}\right) $ , where
To compute $ {\boldsymbol{\mu}}_{n\mid n-1} $ and $ {\mathbf{C}}_{n\mid n-1} $ at each time step, the standard Kalman filter is required to estimate the state recursively, summarized as follows,
The corresponding matrix description is
The above equations specify the computation of the predictive likelihood for a single time step; the likelihood of each destination being the intended one can then be evaluated with (1) and (20).
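A single filtering step of this kind can be sketched with NumPy as below. The sketch assumes a transition of the form $ \mathcal{N}\left({\mathbf{x}}_n|{\mathbf{Fx}}_{n-1}+\mathbf{m},\mathbf{Q}\right) $ , where the offset `m` stands in for the destination-dependent term, and the linear Gaussian observation model with matrices $ \mathbf{H} $ and $ \mathbf{V} $ . The function name and all numerical values in the example are illustrative assumptions.

```python
import numpy as np


def kalman_step(mu, C, y, F, m, Q, H, V):
    """One predict/update cycle; also returns log p(y_n | y_{0:n-1}, D_i)."""
    # Prediction with the destination-conditioned transition N(F x + m, Q).
    mu_pred = F @ mu + m
    C_pred = F @ C @ F.T + Q

    # Predictive likelihood: Gaussian in y with mean H mu_pred and
    # covariance S = H C_pred H' + V.
    resid = y - H @ mu_pred
    S = H @ C_pred @ H.T + V
    _, logdet = np.linalg.slogdet(S)
    quad = resid @ np.linalg.solve(S, resid)
    log_lik = -0.5 * (quad + logdet + y.size * np.log(2.0 * np.pi))

    # Update with gain K = C_pred H' S^{-1} (S is symmetric).
    K = np.linalg.solve(S, H @ C_pred).T
    mu_new = mu_pred + K @ resid
    C_new = C_pred - K @ S @ K.T
    return mu_new, C_new, log_lik


# Hypothetical 1D example: position-only state reverting towards a = 1.0.
a, decay = 1.0, 0.9
F = np.array([[decay]])
m = np.array([(1.0 - decay) * a])
Q = np.array([[0.01]])
H = np.eye(1)
V = np.array([[0.04]])
mu, C = np.zeros(1), np.eye(1)
for y in (0.12, 0.21, 0.33):
    mu, C, log_lik = kalman_step(mu, C, np.array([y]), F, m, Q, H, V)
```

Running one such filter per nominal destination, each with its own `m` (and, if needed, `F` and `Q`), yields the per-destination predictive likelihoods that enter the likelihood recursion.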
3.1.2. BD-based intent predictor using pseudo-observation formulation
In principle, BD-based intent predictors, including those in Ahmad et al. (Reference Ahmad, Murphy, Langdon and Godsill2018) and Liang et al. (Reference Liang, Ahmad, Gan, Langdon, Hardy and Godsill2019) and the new approach introduced here, all utilize (1) and (20) for inferring the target destination from the available noisy sensory observations. However, for the BD approach proposed here, we have $ p\left({\mathbf{y}}_{0:n}|{\mathcal{D}}_i,\mathcal{T}\right)=p\left({\mathbf{y}}_{0:n}|{\varTheta}_i,\mathcal{T}\right) $ with $ {\varTheta}_i $ containing destination-specific parameters (here, $ {\mathbf{a}}_i $ and $ {\varSigma}_i $ ) and $ \mathcal{T} $ being the arrival time at the destination. As the likelihood term is further conditioned on an unknown arrival time $ \mathcal{T} $ , Equation (20) needs to be revised as follows:
based on which $ p\left({\mathbf{y}}_{0:n}|{\mathcal{D}}_i\right) $ can be obtained via
where $ p\left(\mathcal{T}|{\mathcal{D}}_i\right) $ is the prior distribution on the unknown arrival time. In general, the above integration is not analytically tractable and numerical approximation can be implemented. This is especially viable since the arrival time is a one-dimensional quantity (and thereby the integral). In this paper, we will adopt the same quadrature approximation scheme as in Ahmad et al. (Reference Ahmad, Murphy, Langdon and Godsill2018) for obtaining (34).
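To make the one-dimensional marginalization concrete, the sketch below approximates the integral of $ p\left({\mathbf{y}}_{0:n}|{\mathcal{D}}_i,\mathcal{T}\right)p\left(\mathcal{T}|{\mathcal{D}}_i\right) $ over $ \mathcal{T} $ on a grid of quadrature points. The trapezoidal rule is an assumption made for illustration; the specific scheme of Ahmad et al. (Reference Ahmad, Murphy, Langdon and Godsill2018) is not reproduced here. The uniform prior on 0.1 s to 1.9 s and the 30 grid points mirror the experimental settings reported later in the paper, while the constant likelihood is a placeholder.

```python
import math


def uniform_pdf(lo, hi):
    """Density of Unif(lo, hi) as a callable."""
    return lambda t: 1.0 / (hi - lo) if lo <= t <= hi else 0.0


def make_grid(lo, hi, num_points):
    """Evenly spaced points that hit both endpoints exactly."""
    return [lo * (1.0 - q / (num_points - 1)) + hi * (q / (num_points - 1))
            for q in range(num_points)]


def log_marginal_over_arrival_time(log_liks_q, t_grid, prior_pdf):
    """Trapezoidal approximation of log of the integral of p(y|D,T) p(T|D) dT.

    log_liks_q[q] holds log p(y_{0:n} | D_i, T_q) for quadrature point t_grid[q].
    """
    m = max(log_liks_q)  # factor out the largest term for numerical stability
    vals = [math.exp(l - m) * prior_pdf(t) for l, t in zip(log_liks_q, t_grid)]
    integral = 0.0
    for k in range(len(t_grid) - 1):
        integral += 0.5 * (vals[k] + vals[k + 1]) * (t_grid[k + 1] - t_grid[k])
    return m + math.log(integral)


t_grid = make_grid(0.1, 1.9, 30)
prior = uniform_pdf(0.1, 1.9)
# Placeholder: the same conditional likelihood at every quadrature point.
log_liks_q = [math.log(0.3)] * len(t_grid)
log_marginal = log_marginal_over_arrival_time(log_liks_q, t_grid, prior)
```

With a constant conditional likelihood the marginal must equal that constant, which gives a quick sanity check of the quadrature weights.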
Henceforth, the aim is to compute the arrival-time-conditioned PED and likelihood (i.e., the unknown arrival time is treated as if it is available). We illustrate how to develop an intent predictor based on the destination-constrained state process defined in Section 2.1.1. Given observations up to $ {t}_n $ , the $ \mathcal{T} $ -conditioned likelihood term of interest can be expressed by
where the first component in the integral is the observation density, the second component is the destination-constrained state transition density as defined in (11) and the last component is a filtering distribution obtained at time $ {t}_{n-1} $ . Next, we outline how to calculate $ p\left({\mathbf{y}}_n|{\mathbf{y}}_{0:n-1},{\mathcal{D}}_i,\mathcal{T}\right) $ at each time step for a linear and Gaussian dynamic system. For simplicity and without loss of generality, we use the same state model as with (13) with destination information incorporated via (10). This implies the availability of the destination-conditioned state transition density in Equation (14). With a linear Gaussian observation model, we have
where $ \mathbf{H} $ is the observation matrix and $ {\mathbf{V}}_n $ is the measurement noise covariance matrix. As a result, the filtering distribution $ p\left({\mathbf{x}}_{n-1}|{\mathbf{y}}_{0:n-1},{\mathcal{D}}_i,\mathcal{T}\right) $ at the previous time step $ {t}_{n-1} $ can be obtained using a standard Kalman filter in which (14) is used as the state transition density. Assuming at $ {t}_n $ we have obtained the filtering distribution given by the Kalman filter associated with $ {\mathcal{D}}_i $ from the last time step $ {t}_{n-1} $ as
with $ {\boldsymbol{\mu}}_{n-1\mid n-1}^i $ and $ {\mathbf{C}}_{n-1\mid n-1}^i $ being the mean and covariance respectively, and substituting (36), (14) and (37) into (35), the sought likelihood can be shown to be
The above calculation can be further simplified by noticing that
are actually the mean and covariance of the intermediate distribution $ p\left({\mathbf{x}}_n|{\mathbf{y}}_{0:n-1},{\mathcal{D}}_i,\mathcal{T}\right)=\mathcal{N}\left({\mathbf{x}}_{n}|{\boldsymbol{\mu}}_{n\mid n-1}^i,{\mathbf{C}}_{n\mid n-1}^i\right) $ obtained at the Kalman prediction step. As a result, there is no need to calculate these two quantities twice.
Combining (33), (38), and (34), $ p\left({\mathbf{y}}_{0:n}|{\mathcal{D}}_i\right) $ can be evaluated sequentially when new measurements become available. To complete the intent prediction algorithm, the above calculation needs to be performed for each destination $ {\mathcal{D}}_i\in \unicode{x1D53B} $ . Furthermore, when a quadrature approximation scheme is used, (38) needs to be evaluated at each quadrature point of $ \unicode{x1D54B}=\left\{{\mathcal{T}}_q,q=1,2,\dots, {N}_{\mathcal{T}}\right\} $ . A detailed implementation is summarized in Algorithm I. Guidance on the choice of the number of quadrature points for BD methods can be found in Ahmad et al. (Reference Ahmad, Murphy, Langdon and Godsill2018).
Algorithm I BD-based Intent Predictor
Input: Observations: $ \left\{{\mathbf{y}}_{0:N}\right\} $ , Pseudo-observations: $ {\left\{{\mathbf{a}}_i,{\varSigma}_i\right\}}_{1\le i\le {N}_{\mathcal{D}}} $ , $ \unicode{x1D54B}={\left\{{\mathcal{T}}_q\right\}}_{1\le q\le {N}_{\mathcal{T}}} $ ;
Initialization: $ {N}_{\mathcal{D}}\times {N}_{\mathcal{T}} $ Kalman filters, each initialized with mean $ {\boldsymbol{\mu}}_{-1\mid -1}^{i,q} $ and covariance $ {\mathbf{C}}_{-1\mid -1}^{i,q} $ .
for $ n=0:N $ do $ \vartriangleright $ For each time instant
for $ {\mathcal{D}}_i\in \unicode{x1D53B} $ do $ \vartriangleright $ For each destination
for $ {\mathcal{T}}_q\in \unicode{x1D54B} $ do $ \vartriangleright $ For each quadrature point
Construct the intent-driven transition density $ p\left({\mathbf{x}}_n|{\mathbf{x}}_{n-1},{\mathcal{D}}_i,{\mathcal{T}}_q\right) $ via (14);
Standard Kalman prediction to obtain $ {\boldsymbol{\mu}}_{n\mid n-1}^{i,q} $ and $ {\mathbf{C}}_{n\mid n-1}^{i,q} $ via (39);
Standard Kalman update to obtain $ {\boldsymbol{\mu}}_{n\mid n}^{i,q} $ and $ {\mathbf{C}}_{n\mid n}^{i,q} $ ;
Compute: $ {l}_n^{i,q}=p\left({\mathbf{y}}_n|{\mathbf{y}}_{0:n-1},{\mathcal{D}}_i,{\mathcal{T}}_q\right) $ via (38);
Update $ {\mathcal{T}}_q $ -conditioned likelihood via (33): $ p\left({\mathbf{y}}_{0:n}|{\mathcal{D}}_i,{\mathcal{T}}_q\right)={L}_n^{i,q}={L}_{n-1}^{i,q}\times {l}_n^{i,q} $
end for
Approximate $ p\left({\mathbf{y}}_{0:n}|{\mathcal{D}}_i\right) $ numerically using $ \left\{{L}_n^{i,q},q=1,2,\dots, {N}_{\mathcal{T}}\right\} $ ;
end for
Obtain destination posterior at $ {t}_n $ : $ p\left({\mathcal{D}}_i|{\mathbf{y}}_{0:n}\right)\approx \frac{p\left({\mathbf{y}}_{0:n}|{\mathcal{D}}_i\right)\times p\left({\mathcal{D}}_i\right)}{\sum_{{\mathcal{D}}_j\in \unicode{x1D53B}}p\left({\mathbf{y}}_{0:n}|{\mathcal{D}}_j\right)\times p\left({\mathcal{D}}_j\right)} $ ;
end for
3.2. Intent predictors for jump diffusion models
The jump diffusion model introduced in Section 2.1.2 is constructed as a conditionally Gaussian form (16), that is, the transition density from time $ t $ to $ t+h $ is a Gaussian density if the nonlinear component jump time sequence $ {\tau}_{t:t+h} $ is given as a condition. Thus an efficient strategy would be estimating $ {\tau}_{t:t+h} $ in a Monte Carlo sense, then for each sample of $ {\tau}_{t:t+h} $ , $ p\left({\mathbf{x}}_{t+h,i}|{\mathbf{x}}_{t,i},{\tau}_{t:t+h}\right) $ is retained as Gaussian form so that the standard Kalman filter can be employed to carry out the estimation. Such strategy, known as Rao-Blackwellized variable rate particle filter (Godsill, Reference Godsill2007; Christensen et al., Reference Christensen, Murphy and Godsill2012), aims to strengthen the estimation accuracy by employing analytical computations as much as possible.
When $ {N}_{\mathcal{D}} $ possible destinations are considered, the same number of particle filters are required, each with $ {N}_{\mathcal{P}} $ particles for a particular destination $ {\mathcal{D}}_i $ . Here, we allow the $ {N}_{\mathcal{D}} $ different particle filters to share the same sample set of jump times. This not only reduces the inference computational complexity, but can also circumvent spurious large differences between the likelihoods of the various destinations, induced by individual sample outlier(s). Nonetheless, this particular consideration is not expected to noticeably impact the intent prediction performance since the aim in this paper is not to accurately estimate the object state or the individual destination likelihood $ p\left({\mathbf{y}}_{0:n}|{\mathcal{D}}_i\right) $ . Instead, the focus is on comparing the likelihoods for all nominal destinations, calculated under the same conditions, in order to determine the intended endpoint from the observed motion. At time $ {t}_n $ , each variable rate particle filter stores the samples $ {\tau}_{0:n}^{(p)} $ ( $ p=1,2,\dots, {N}_{\mathcal{P}} $ ), the normalized weight $ {\omega}_n^{\left(p,i\right)} $ , the mean $ {\boldsymbol{\mu}}_{n\mid n}^{\left(p,i\right)} $ and covariance $ {\mathbf{C}}_{n\mid n}^{\left(p,i\right)} $ for Gaussian density $ p\left({\mathbf{x}}_n|{\mathbf{y}}_{0:n},{\tau}_{0:n}^{(p)},{\mathcal{D}}_i\right) $ . Subsequently, the empirical estimations for jump time and state can be described as follows,
Accordingly, the predictive likelihood can be approximated as
where the updated weight $ {\tilde{\omega}}_{n+1}^{\left(p,i\right)} $ , in the bootstrap particle filter setting, is defined as
and the new jump time samples $ {\tau}_{n:n+1}^{(p)} $ , in the corresponding (bootstrap) setup, are propagated according to the Poisson transition described in Section 2.1.2,
It can be shown that the updated weight $ {\tilde{\omega}}_{n+1}^{\left(p,i\right)} $ is also required to compute the normalized weight $ {\omega}_{n+1}^{\left(p,i\right)} $ :
Similar to (23), the $ p\left({\mathbf{y}}_{n+1}|{\mathbf{y}}_{0:n},{\tau}_{0:n+1}^{(p)},{\mathcal{D}}_i\right) $ in (43) can be computed in a closed form with the stored mean $ {\boldsymbol{\mu}}_{n\mid n}^{\left(p,i\right)} $ and covariance $ {\mathbf{C}}_{n\mid n}^{\left(p,i\right)} $ , that is
where
In order to update the stored density mean $ {\boldsymbol{\mu}}_{n+1\mid n+1}^{\left(p,i\right)} $ and covariance $ {\mathbf{C}}_{n+1\mid n+1}^{\left(p,i\right)} $ , the following standard Kalman filter update steps are required:
This procedure completes the variable rate particle filtering for a single time step and the overall intent prediction algorithm is summarized as Algorithm II.
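The weight bookkeeping for one destination and one time step can be sketched as follows. The sketch assumes the standard bootstrap form, in which each unnormalized weight is the previous normalized weight multiplied by the particle's conditional predictive likelihood, the predictive likelihood is approximated by the sum of the unnormalized weights, and the normalized weights are obtained by dividing by that sum. The function name and numerical values are illustrative and do not come from the paper.

```python
def rbpf_weight_step(prev_weights, cond_pred_liks):
    """Bootstrap weight update for one destination.

    prev_weights[p]   : normalized weight of particle p at time t_n
    cond_pred_liks[p] : p(y_{n+1} | y_{0:n}, tau^{(p)}_{0:n+1}, D_i), computed
                        in closed form by the particle's Kalman filter
    Returns the approximate predictive likelihood p(y_{n+1} | y_{0:n}, D_i)
    and the normalized weights at time t_{n+1}.
    """
    unnorm = [w * l for w, l in zip(prev_weights, cond_pred_liks)]
    pred_lik = sum(unnorm)
    weights = [u / pred_lik for u in unnorm]
    return pred_lik, weights


# Four particles with equal weights and assumed conditional likelihoods.
prev_weights = [0.25, 0.25, 0.25, 0.25]
cond_pred_liks = [0.1, 0.2, 0.3, 0.4]
pred_lik, weights = rbpf_weight_step(prev_weights, cond_pred_liks)
```

Because the jump-time samples are shared across destinations, only `cond_pred_liks` differs between the per-destination filters at a given time step.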
Algorithm II Intent Inference with the jump model
Initialization: Create $ {N}_{\mathcal{D}} $ variable rate particle filters, each with $ {N}_{\mathcal{P}} $ particles;
for each observation $ n=1:N $ captured at $ {t}_n $ do
for particles $ p=1:{N}_{\mathcal{P}} $ do
Sample the jump time sequence $ {\tau}_{n:n+1}^{(p)} $ from the prior (44);
end for
for destinations $ i=1:{N}_{\mathcal{D}} $ do
if Resample then
Resample particles and set weights $ {\omega}_{n-1}^{\left(p,i\right)}=1/{N}_{\mathcal{P}} $ ;
end if
for particles $ p=1:{N}_{\mathcal{P}} $ do
Predict the mean $ {\boldsymbol{\mu}}_{n+1\mid n}^{\left(p,i\right)} $ and covariance $ {\mathbf{C}}_{n+1\mid n}^{\left(p,i\right)} $ via (47);
Calculate the updated weight $ {\tilde{\omega}}_{n+1}^{\left(p,i\right)} $ according to (43) and (46);
Update the mean $ {\boldsymbol{\mu}}_{n+1\mid n+1}^{\left(p,i\right)} $ and covariance $ {\mathbf{C}}_{n+1\mid n+1}^{\left(p,i\right)} $ via (48);
end for
Produce the predictive likelihood $ p\left({\mathbf{y}}_{n+1}|{\mathbf{y}}_{0:n},{\mathcal{D}}_i\right) $ from (42);
Calculate the normalized weight $ {\omega}_{n+1}^{\left(p,i\right)} $ according to (45);
Calculate likelihood $ p\left({\mathbf{y}}_{0:n+1}|{\mathcal{D}}_i\right) $ in (20);
end for
Determine endpoint probability: $ p\left({\mathcal{D}}_i|{\mathbf{y}}_{0:n}\right) $ in (1);
end for
4. Results
See Figure 1. This system used the off-the-shelf sensor, Leap Motion, which can reliably track hand and finger positions in 3D during the pointing-selection tasks, at a rate exceeding $ 30 $ Hz. The utilized dataset contains 95 trajectories pertaining to four participants undertaking pointing-selection tasks under various road and driving conditions. Here, we divide these data into two sets:
1. Dataset A with all 95 pointing tracks; this allows us to perform a comprehensive comparison between different algorithms for various levels of present perturbations (e.g., static, motorway driving and off-road driving).
2. Dataset B with 10 trajectories in which the user input was subjected to a severe level of noise due to driving on a badly maintained road or off-road driving. This dataset is a subset of Dataset A and was collected in a Land Rover. It is particularly relevant for examining the outcome of the algorithms that incorporate a jump process, that is, those that employ jump diffusion models.
During the above interaction tasks, an experimental user interface with multiple selectable circular icons was displayed on a touchscreen mounted to the car dashboard. The number of selectable icons is $ \mid \unicode{x1D53B}\mid =21 $ for Dataset A, and $ \mid \unicode{x1D53B}\mid =37 $ for Dataset B. Two typical pointing trajectories of each dataset are presented in Figure 4. Similar to the common ISO 9241 pointing task, often referred to as the Fitts' law task, one randomly chosen GUI item is highlighted at a time and the user is expected to select it. An identical (uniform) prior is placed on all of the interface items, that is, $ p\left(\mathcal{D}={\mathcal{D}}_i\right)=1/{N}_{\mathcal{D}} $ for all $ {\mathcal{D}}_i\in \unicode{x1D53B} $ , in order for the results to be comparable to those in previous work.
Below, we use the aggregate inference success and the timely successful prediction over pointing duration to evaluate the predictors' performance; both apply a maximum a posteriori criterion (i.e., pick the most probable icon) as per:
More specifically, the first is defined as the proportion of the total pointing gesture (in time), from its start at $ {t}_0 $ until touching the display surface at time $ \mathcal{T} $ , for which the predictor correctly inferred the true endpoint $ {\mathcal{D}}_{\mathrm{True}}\in \unicode{x1D53B} $ . The second captures the percentage of correct predictions across the tested dataset as a function of the percentage of pointing task duration, thus indicating how early the predictor assigns the highest probability to the correct destination.
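The aggregate inference success for a single trajectory can be sketched as below. The sketch assumes that the maximum a posteriori decision made at each observation time is held until the next observation, with the last time stamp taken as the touch time; this hold-until-next-sample convention, the function names, and the example posteriors are illustrative assumptions.

```python
def map_decision(posterior):
    """Index of the most probable destination (maximum a posteriori)."""
    return max(range(len(posterior)), key=lambda i: posterior[i])


def aggregate_success(times, posteriors, true_index):
    """Proportion of the gesture duration with a correct MAP decision.

    times[n]      : observation time t_n, with times[-1] the touch time
    posteriors[n] : destination posterior computed from y_{0:n}
    The decision at t_n is assumed to hold over [t_n, t_{n+1}).
    """
    correct_time = 0.0
    for n in range(len(times) - 1):
        if map_decision(posteriors[n]) == true_index:
            correct_time += times[n + 1] - times[n]
    return correct_time / (times[-1] - times[0])


# Two destinations; the true endpoint is index 1 (made-up posteriors).
times = [0.0, 0.5, 1.0, 2.0]
posteriors = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8], [0.1, 0.9]]
success = aggregate_success(times, posteriors, true_index=1)
```

Averaging this quantity over all trajectories in a dataset gives one summary number per predictor, while recording `map_decision` against the normalized time axis gives the timely successful prediction curve.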
4.1. Prediction performance with linear Gaussian intent-driven models
For the 95 pointing tracks covering different levels of perturbations (i.e., Dataset A), the computationally efficient LTI Gaussian models are sufficient to predict the intended icon with a high accuracy. In this section, we evaluate all LTI Gaussian models introduced in Section 2.1.1 for this dataset. The parameters for all tested predictors are listed in Table 1. They are chosen in a manual way and from examining a few possible values (i.e., no training or fine tuning across all test trajectories was undertaken).
Abbreviations: BD–CA, bridging distributions-constant acceleration; BD–CV, bridging distributions-constant velocity; ERA, equilibrium reverting acceleration; ERV, equilibrium reverting velocity.
a For all BD models, $ p\left(\mathcal{T}|{\mathcal{D}}_i\right)=\mathrm{Unif}\left(0.1\;\mathrm{s},1.9\;\mathrm{s}\right) $ , the number of quadrature points $ {N}_{\mathcal{T}}=30 $ , $ \tilde{\mathbf{G}}=\mathbf{I} $ and $ {\sigma}^{{\mathcal{D}}_i} $ forms the corresponding $ {\varSigma}_i $ .
It is noted that this model parameterization is aimed at demonstrating the low training requirement of the adopted state-space-modeling-based inference approach, since the models are physically meaningful. Take the linear Gaussian mean reverting model as an example. Although a higher noise parameter $ \sigma $ would lead to higher uncertainty on the final endpoint, it permits more flexibility in the target dynamics, manifested in elaborate maneuvers (e.g., swings) of the target (i.e., pointing finger) en route to its destination, instead of simply following a straight line. A higher reversion parameter $ \eta $ would cause a stronger force towards the endpoint, while a higher damping factor $ \rho $ (and/or $ \gamma $ ) ensures that the finger speed upon touch is reasonable. A set of fine-tuned parameters can trade off generalizability of the model to new data for a high (validation) prediction accuracy. Alternatively, the parameters of OU models may be set based on maximization of the likelihood $ {\prod}_{k=1}^Kp\left({\mathbf{y}}_{0:n}^{\left[k\right]}|\mathcal{D}={\mathcal{D}}_i,\Omega \right) $ for a sample of $ K $ typical full pointing finger trajectories; $ \Omega $ is the set of the parameters for an intent-driven dynamic model. As the driver/passenger uses the touch system, the system can refine the applied model parameters from the larger available dataset(s). On the other hand, automatic parameter tuning for BD models is more complicated due to the conditioning on the unknown arrival time. Nonetheless, from our extensive experiments and Table 1 we can confirm that these empirically selected parameters of the BD methods work sufficiently well.
The timely successful prediction over pointing duration is shown in Figure 5. As expected, all methods generally exhibit an upward trend, that is, their performance improves as the pointing finger/hand approaches the intended endpoint. Specifically, the ERA model can perform poorly at the beginning of the pointing motion (e.g., in the first 30%); however, it delivers comparable results thereafter. Combined with the overall success rate shown in Table 1, it can be seen that all examined models achieve comparable prediction successes. Hence, the predictive touch system could infer the intended on-display item remarkably early in the pointing-selection tasks. Nonetheless, it can be noticed from Table 1 and Figure 5 that the BD models achieve better results compared with the Gaussian mean reverting models. Furthermore, the performance of models whose acceleration is driven by a Wiener process (BD–CA) is also superior to that of models constructed merely on target position and velocity (BD–CV). This may be due to the fact that the present accelerations can reflect the movement trend in more detail. Additionally, the advantage of BD methods may be gained from more accurate end state construction such that a successful prediction can always be achieved at the end of the pointing period. It is worth mentioning that, in our case, the intent predictors implemented according to Algorithm 1 of Liang et al. (Reference Liang, Ahmad, Gan, Langdon, Hardy and Godsill2019) have the lowest complexity compared with other BD counterparts, while the OU-based predictors have the least computational cost among all evaluated methods. Note that for better visualization we have chosen to only display the success rate against gesture time for the BD method proposed in this paper because the lines from previous BD formulations are visually very similar to that introduced here.
4.2. Highly perturbed scenarios and particle filtering
The intent inference performance for highly perturbed trajectories in Dataset B has been tested with jump models and Gaussian mean reverting models in Gan et al. (Reference Gan, Liang, Ahmad and Godsill2019) and Ahmad et al. (Reference Ahmad, Langdon, Godsill, Donkor, Wilde and Skrypchuk2016a). Results from the BD models introduced in this paper are also included for comparison. The aggregate inference success for all algorithms and the timely successful prediction from four selected algorithms (i.e., omitting non-BD models for clarity of presentation) are depicted in Figures 6 and 7, respectively. The applied jump models below are described in Algorithm II and each uses 2000 particles, although it has been observed that a comparable performance can be achieved with a smaller number (e.g., 500) of particles; their parameters are listed in Table 2 (the jumps are assumed to be isotropic) and those for all of the LTI Gaussian models remain the same as in Table 1.
Abbreviations: ERA, equilibrium reverting acceleration; ERV, equilibrium reverting velocity.
From Figure 7, one can see that the BD–CA model always achieves the highest successful prediction after the first 20% of the pointing duration, and the BD models can always achieve accurate prediction at the end stage of pointing due to their Markov bridge nature. Similar to the LTI ERA model, the jump-ERA model ascends from a relatively low successful prediction to a comparable success rate in the second half of the pointing duration. This insensitivity may be caused by the intent, which is constructed on the acceleration, taking longer to be reflected in the observations. The average success rate in Figure 6 indicates that BD–CA outperforms the other models for this dataset, while the jump-ERV model achieves the second best success rate. This may lead to the conclusion that BD–CA is the best of the considered models in characterizing the intention of the hand pointing. However, it is worth noting that the exploration of the parameters of the jump models is more restrictive due to their larger number of parameters and time-consuming evaluation process. Thus, it is possible that better results can be achieved with other parameters for the jump models, especially for the jump-ERA model. Additionally, the jumps/jolts present in those 10 tracks might not be of a severity (magnitude and/or transience) that a BD–CA model cannot successfully smooth out or follow. Under such high levels of perturbations, the numerical marginalization of the arrival time with BD can be challenging as the pointing-task duration can be subject to large delays, with the risk of it differing substantially from the prior of $ \mathcal{T} $ . Nevertheless, the use of particle filtering with a jump process offers additional advantages, not necessarily relevant to the predictive touch use case, such as detecting the location-time of the perturbation-induced fast maneuvers (jumps) and potentially better destination-aware tracking results, see Gan et al. (Reference Gan, Liang, Ahmad and Godsill2019).
5. Conclusion
In this paper, we presented an overview of the existing stochastic dynamic modeling methods for destination inference, with the in-vehicle predictive touch system as the case study. It covers linear Gaussian and nonlinear setups, both proposed within a Bayesian framework. The adopted continuous time intent-driven state space models naturally facilitate treating asynchronous data, including from multiple sensors. In addition, a new bridging distribution approach was proposed here, which has a moderate computational requirement and a clear stochastic interpretation compared with previous formulations. Results from real data of a predictive touch system demonstrated the efficacy of the various considered prediction algorithms, namely their ability to infer the user intent remarkably early in the pointing-selection task. Thereby, this can facilitate effective touchless interactions via the intuitive free hand pointing gestures. It is emphasized that the presented prediction techniques are also applicable to other fields, for example surveillance, smart navigation, robotics, etc. Nevertheless, there are several extensions to this work, for example bridging distributions for nonlinear and/or non-Gaussian systems (e.g., a stable Lévy system in Gan and Godsill, Reference Gan and Godsill2020), considering intrinsically nonlinear intent-driven motion models for highly maneuverable objects and various measurement models (one such example can be found in Liang et al., Reference Liang, Ahmad and Godsill2020). This paper serves as an impetus to further research on meta-level tracking models and inference algorithms.
Funding Statement
This research was supported by grants from Jaguar Land Rover under the Centre for Advanced Photonics and Electronics CAPE agreement.
Competing Interests
The authors declare that no competing interests exist.
Data Availability Statement
The data used in this work is proprietary and confidential; it cannot be made publicly available. Readers are nonetheless encouraged to contact the authors, as data and code could be shared subject to the recipient abiding by certain terms and conditions.
Ethical Standards
The conducted user studies for predictive touch met all ethical guidelines of the University of Cambridge and Jaguar Land Rover, including adherence to the legal requirements of the study country.
Authorship Contributions
Conceptualization: all; Data curation: B. A.; Formal analysis: R. G., J. L., and B. A.; Funding acquisition: S. G. and B. A.; Investigation: R. G. and J. L.; Methodology: all; Software: R. G. and J. L.; Supervision: B. A. and S. G.; Validation: R. G. and J. L.; Writing-original draft: R. G., J. L., and B. A.; Writing-review and editing: all. All authors approved the final submitted draft.
Notation
- $ \mathbb{D} $: discrete set of possible destinations, $ \mathbb{D}=\left\{{\mathcal{D}}_i:i=1,2,\dots, {N}_{\mathcal{D}}\right\} $
- $ {N}_{\mathcal{D}} $: number of nominal endpoints
- $ {\mathcal{D}}_i $: the $ i $th endpoint
- $ \mathcal{D} $: considered intended destination
- $ \hat{\mathcal{D}} $: maximum a posteriori estimate for the intended destination
- $ \mathcal{T} $: destination arrival time, $ \mathcal{T}={t}_N $
- $ {\tilde{\mathbf{y}}}_N^i $: pseudo-observation vector for destination $ {\mathcal{D}}_i $
- $ {\varSigma}_i $: covariance of the Gaussian pseudo-observation model for destination $ {\mathcal{D}}_i $
- $ {\mathbf{x}}_n $: target dynamic state at time $ {t}_n $
- $ {\mathbf{x}}_{n,i} $: dynamic state at time $ {t}_n $ for an object travelling to $ {\mathcal{D}}_i $
- $ {\mathbf{y}}_n $: observation vector captured at time $ {t}_n $
- $ {\boldsymbol{\beta}}_t $: multivariate standard Wiener process
- $ {\mathbf{J}}_t $: compound Poisson process with Gaussian distributed jump size $ {\mathbf{S}}_k $, that is, $ {\mathbf{J}}_t={\sum}_{\tau_k<t}{\mathbf{S}}_k $
- $ {\tau}_k $: arrival time of the $ k $th jump
- $ \mathcal{N}\left(\mathbf{x}|\mathbf{m},\mathbf{C}\right) $: multivariate normal distribution for random variable $ \mathbf{x} $ with mean $ \mathbf{m} $ and covariance $ \mathbf{C} $
- $ {N}_{\mathcal{P}} $: number of particles used in the particle filtering
- $ {\omega}_n^{\left(p,i\right)} $: normalized weight at time $ {t}_n $ for the $ p $th particle for destination $ {\mathcal{D}}_i $
- $ {\tilde{\omega}}_n^{\left(p,i\right)} $: updated weight for the $ p $th particle for endpoint $ {\mathcal{D}}_i $
- $ \mathbf{I} $: identity matrix of suitable size
- $ {.}^{\prime } $: transpose operation
- $ p\left(\cdot \right) $: probability density function
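As an illustration of the jump-process notation above, the following minimal sketch simulates a scalar compound Poisson process and evaluates $ {J}_t={\sum}_{\tau_k<t}{S}_k $ . The function names, the scalar (rather than vector) jump size, the zero-mean Gaussian jumps, and the numerical values of the jump rate and standard deviation are assumptions chosen for illustration only; they are not parameters reported in this paper.

```python
import random


def simulate_compound_poisson(rate, jump_std, horizon, rng):
    """Draw jump times tau_k in (0, horizon) and Gaussian jump sizes S_k.

    Inter-arrival times of a Poisson process with intensity `rate` are
    exponentially distributed; jump sizes are assumed zero-mean Gaussian.
    """
    times, sizes = [], []
    t = rng.expovariate(rate)
    while t < horizon:
        times.append(t)
        sizes.append(rng.gauss(0.0, jump_std))
        t += rng.expovariate(rate)
    return times, sizes


def jump_process_value(t, times, sizes):
    """Evaluate J_t as the sum of S_k over all jumps with tau_k < t."""
    return sum(s for tau, s in zip(times, sizes) if tau < t)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
times, sizes = simulate_compound_poisson(rate=2.0, jump_std=0.5,
                                         horizon=3.0, rng=rng)
for t in (0.0, 1.0, 2.0, 3.0):
    print(t, jump_process_value(t, times, sizes))
```

The process is piecewise constant between jumps, so $ {J}_t $ only changes at the arrival times $ {\tau}_k $ .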