{"title": "Dual Estimation and the Unscented Transformation", "book": "Advances in Neural Information Processing Systems", "page_first": 666, "page_last": 672, "abstract": null, "full_text": "Dual Estimation and the Unscented \n\nTransformation \n\nEricA. Wan \n\nericwan@ece.ogi.edu \n\nRudolph van der Merwe \nrudmerwe@ece.ogi.edu \n\nAlex T. Nelson \n\natneison@ece.ogi.edu \n\nOregon Graduate Institute of Science & Technology \nDepartment of Electrical and Computer Engineering \n20000 N.W. Walker Rd., Beaverton, Oregon 97006 \n\nAbstract \n\nDual estimation refers to the problem of simultaneously estimating the \nstate of a dynamic system and the model which gives rise to the dynam(cid:173)\nics. Algorithms include expectation-maximization (EM), dual Kalman \nfiltering, and joint Kalman methods. These methods have recently been \nexplored in the context of nonlinear modeling, where a neural network \nis used as the functional form of the unknown model. Typically, an ex(cid:173)\ntended Kalman filter (EKF) or smoother is used for the part of the al(cid:173)\ngorithm that estimates the clean state given the current estimated model. \nAn EKF may also be used to estimate the weights of the network. This \npaper points out the flaws in using the EKF, and proposes an improve(cid:173)\nment based on a new approach called the unscented transformation (UT) \n[3]. A substantial performance gain is achieved with the same order of \ncomputational complexity as that of the standard EKF. The approach is \nillustrated on several dual estimation methods. \n\n1 \n\nIntroduction \n\nWe consider the problem of learning both the hidden states Xk and parameters w of a \ndiscrete-time nonlinear dynamic system, \n\nF(Xk , Vk, w) \nH(xk, nk, w), \n\n(1) \n(2) \n\nwhere Yk is the only observed signal. The process noise Vk drives the dynamic system, and \nthe observation noise is given by nk. Note that we are not assuming additivity of the noise \nsources. 
\n\nA number of approaches have been proposed for this problem. The dual EKF algorithm uses two separate EKFs: one for signal estimation, and one for model estimation. The states are estimated given the current weights and the weights are estimated given the current states. In the joint EKF, the state and model parameters are concatenated within a combined state vector, and a single EKF is used to estimate both quantities simultaneously. The EM algorithm uses an extended Kalman smoother for the E-step, in which forward and backward passes are made through the data to estimate the signal. The model is updated during a separate M-step. \n\nFor a more thorough treatment and a theoretical basis on how these algorithms relate, see Nelson [6]. Rather than provide a comprehensive comparison between the different algorithms, the goal of this paper is to point out the assumptions and flaws in the EKF (Section 2), and offer an improvement based on the unscented transformation/filter (Section 3). The unscented filter has recently been proposed as a substitute for the EKF in nonlinear control problems (known dynamic model) [3]. This paper presents new research on the use of the UF within the dual estimation framework for both state and weight estimation. In the case of weight estimation, the UF represents a new efficient \"second-order\" method for training neural networks in general. \n\n2 Flaws in the EKF \n\nAssume for now that we know the model (weight parameters) for the dynamic system in Equations 1 and 2. Given the noisy observation y_k, a recursive estimation for x_k can be expressed in the form, \n\nx̂_k = (optimal prediction of x_k) + G_k [y_k - (optimal prediction of y_k)]. (3) \n\nThis recursion provides the optimal MMSE estimate for x_k assuming the prior estimate x̂_k^- and current observation y_k are Gaussian. We need not assume linearity of the model. 
The optimal terms in this recursion are given by \n\nx̂_k^- = E[F(x̂_{k-1}, v_{k-1})],  ŷ_k^- = E[H(x̂_k^-, n_k)], (4) \n\nwhere the optimal prediction x̂_k^- is the expectation of a nonlinear function of the random variables x̂_{k-1} and v_{k-1} (similar interpretation for the optimal prediction ŷ_k^-). The optimal gain term G_k is expressed as a function of posterior covariance matrices (with ỹ_k = y_k - ŷ_k^-). Note these terms also require taking expectations of a nonlinear function of the prior state estimates. \n\nThe Kalman filter calculates these quantities exactly in the linear case. For nonlinear models, however, the extended KF approximates these as: \n\nx̂_k^- = F(x̂_{k-1}, v̄),  ŷ_k^- = H(x̂_k^-, n̄), (5) \n\nwhere predictions are approximated as simply the function of the prior mean value for estimates (no expectation taken). The covariances are determined by linearizing the dynamic equations (x_{k+1} ≈ A x_k + B v_k, y_k ≈ C x_k + D n_k), and then determining the posterior covariance matrices analytically for the linear system. As such, the EKF can be viewed as providing \"first-order\" approximations to the optimal terms (in the sense that expressions are approximated using a first-order Taylor series expansion of the nonlinear terms around the mean values). While \"second-order\" versions of the EKF exist, their increased implementation and computational complexity tend to prohibit their use. \n\n3 The Unscented Transformation/Filter \n\nThe unscented transformation (UT) is a method for calculating the statistics of a random variable which undergoes a nonlinear transformation [3]. Consider propagating a random variable α (dimension L) through a nonlinear function, β = g(α). Assume α has mean ᾱ and covariance P_α. To calculate the statistics of β, we form a matrix X of 2L+1 sigma vectors X_i, where the first vector (X_0) corresponds to ᾱ, and the rest are computed from the mean plus (+) and minus (-) each column of the matrix square-root of P_α. These sigma vectors are propagated through the nonlinear function, and the mean and covariance for β are approximated using a weighted sample mean and covariance, \n\nβ̄ ≈ (1/(L+κ)) { κ g(X_0) + (1/2) Σ_{i=1}^{2L} g(X_i) }, (6) \n\nP_β ≈ (1/(L+κ)) { κ [g(X_0) - β̄][g(X_0) - β̄]^T + (1/2) Σ_{i=1}^{2L} [g(X_i) - β̄][g(X_i) - β̄]^T }, (7) \n\nwhere κ is a scaling factor. Note that this method differs substantially from general \"sampling\" methods (e.g., Monte-Carlo methods and particle filters [1]) which require orders of magnitude more sample points in an attempt to propagate an accurate (possibly non-Gaussian) distribution of the state. The UT approximations are accurate to the third order for Gaussian inputs for all nonlinearities. For non-Gaussian inputs, approximations are accurate to at least the second order, with the accuracy determined by the choice of κ [3]. A simple example is shown in Figure 1 for a 2-dimensional system: the left plots show the true mean and covariance propagation using Monte-Carlo sampling; the center plots show the performance of the UT (note only 5 sigma points are required); the right plots show the results using a linearization approach as would be done in the EKF. The superior performance of the UT is clear. \n\nFigure 1: Example of the UT for mean and covariance propagation. a) actual (sampling), b) UT, c) first-order linear (EKF). \n\nThe unscented filter (UF) [3] is a straightforward extension of the UT to the recursive estimation in Equation 3, where we set α = x̂_k, and denote the corresponding sigma matrix as X(k|k). The UF equations are given below. It is interesting to note that no explicit calculation of Jacobians or Hessians is necessary to implement this algorithm. 
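The transformation in Equations 6 and 7 can be sketched in a few lines. This is a minimal implementation, assuming the common convention that the sigma points are offset by the columns of the square-root of (L+κ)P_α; the function and variable names are our own.

```python
import numpy as np

def unscented_transform(g, mean, cov, kappa=1.0):
    """Propagate a Gaussian (mean, cov) through g using 2L+1 sigma points.

    kappa plays the role of the scaling factor in Equations 6-7; here it is
    folded into the sigma-point spread via sqrt((L+kappa)*cov)."""
    L = mean.size
    S = np.linalg.cholesky((L + kappa) * cov)   # matrix square-root
    sigmas = [mean] \
        + [mean + S[:, i] for i in range(L)] \
        + [mean - S[:, i] for i in range(L)]
    W = np.full(2 * L + 1, 1.0 / (2.0 * (L + kappa)))
    W[0] = kappa / (L + kappa)                  # weights sum to 1
    Y = np.array([g(s) for s in sigmas])        # propagated sigma points
    y_mean = W @ Y                              # weighted sample mean (Eq. 6)
    d = Y - y_mean
    y_cov = (W[:, None] * d).T @ d              # weighted sample covariance (Eq. 7)
    return y_mean, y_cov
```

For a linear g the UT is exact: propagating a Gaussian through g(x) = Ax reproduces A ᾱ and A P_α A^T, which is a convenient sanity check.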
\n\nThe total number of computations is only order L^2 as compared to L^3 for the EKF.¹ \n\nUF Equations \n\nW_0 = κ/(L+κ),  W_1 = ... = W_2L = 1/(2(L+κ)) \nX(k|k-1) = F[X(k-1|k-1), P_v^{1/2}] \nx̂_k^- = Σ_{i=0}^{2L} W_i X_i(k|k-1) \nP_k^- = Σ_{i=0}^{2L} W_i [X_i(k|k-1) - x̂_k^-][X_i(k|k-1) - x̂_k^-]^T \nY(k|k-1) = H[X(k|k-1), P_n^{1/2}] \nŷ_k^- = Σ_{i=0}^{2L} W_i Y_i(k|k-1) \nP_ỹỹ = Σ_{i=0}^{2L} W_i [Y_i(k|k-1) - ŷ_k^-][Y_i(k|k-1) - ŷ_k^-]^T \nP_xy = Σ_{i=0}^{2L} W_i [X_i(k|k-1) - x̂_k^-][Y_i(k|k-1) - ŷ_k^-]^T \nx̂_k = x̂_k^- + P_xy P_ỹỹ^{-1} (y_k - ŷ_k^-) \nP_k = P_k^- - P_xy P_ỹỹ^{-1} P_xy^T \n\n¹Note that a matrix square-root using the Cholesky factorization is of order L^3/6. However, the covariance matrices are expressed recursively, and thus the square-root can be computed in only order L^2 by performing a recursive update to the Cholesky factorization. \n\n4 Application to Dual Estimation \n\nThis section shows the use of the UF within several dual estimation approaches. As an application domain for comparison, we consider modeling a noisy time-series as a nonlinear autoregression: \n\nx_k = f(x_{k-1}, ..., x_{k-M}, w) + v_k, \ny_k = x_k + n_k,  k ∈ {1 ... N}. (8) \n\nThe underlying clean signal x_k is a nonlinear function of its past M values, driven by white Gaussian process noise v_k with variance σ_v^2. The observed data point y_k includes the additive noise n_k, which is assumed to be Gaussian with variance σ_n^2. The corresponding state-space representation for the signal x_k is given by: \n\nx_k = F(x_{k-1}, w) + B v_k,  with x_k = [x_k, x_{k-1}, ..., x_{k-M+1}]^T and B = [1, 0, ..., 0]^T, (9) \ny_k = [1, 0, ..., 0] x_k + n_k. (10) \n\nIn this context, the dual estimation problem consists of simultaneously estimating the clean signal x_k and the model parameters w from the noisy data y_k. \n\n4.1 Dual EKF / Dual UF \n\nOne dual estimation approach is the dual extended Kalman filter developed in [8, 6]. The dual EKF requires separate state-space representations for the signal and the weights. 
A state-space representation for the weights is generated by considering them to be a stationary process with an identity state transition matrix, driven by process noise u_k: \n\nw_k = w_{k-1} + u_k, (11) \ny_k = f(x_{k-1}, w_k) + v_k + n_k. (12) \n\nThe noisy measurement y_k has been rewritten as an observation on w. This allows the use of an EKF for weight estimation (representing a \"second-order\" optimization procedure) [7]. Two EKFs can now be run simultaneously for signal and weight estimation. At every time-step, the current estimate of the weights is used in the signal-filter, and the current estimate of the signal-state is used in the weight-filter. \n\nThe dual UF/EKF algorithm is formed by simply replacing the EKF for state-estimation with the UF while still using an EKF for weight-estimation. In the dual UF algorithm both state- and weight-estimation are done with the UF. Note that the state-transition is linear in the weight filter, so the nonlinearity is restricted to the measurement equation. Here, the UF gives a more exact measurement-update phase of estimation. The use of the UF for weight estimation in general is discussed in further detail in Section 5. \n\n4.2 Joint EKF / Joint UF \n\nAn alternative approach to dual estimation is provided by the joint extended Kalman filter [4, 5]. In this framework the signal-state and weight vector are concatenated into a single, joint state vector: z_k = [x_k^T w_k^T]^T. The estimation of z_k can be done recursively by writing the state-space equations for the joint state as: \n\nz_k = [F(x_{k-1}, w_{k-1}) + B v_k ; w_{k-1} + u_k],  y_k = [1, 0, ..., 0] z_k + n_k, (13) \n\nand running an EKF on the joint state-space to produce simultaneous estimates of the states x_k and w. As discussed in [6], the joint EKF provides approximate MAP estimates by maximizing the joint density of the signal and weights given the noisy data. 
Again, our approach in this paper is to use the UF instead of the EKF to provide more accurate estimation of the state, resulting in the joint UF algorithm. \n\n4.3 EM - Unscented Smoothing \n\nA somewhat different iterative approach to dual estimation is given by the expectation-maximization (EM) algorithm applied to nonlinear dynamic systems [2]. In each iteration, the conditional expectation of the signal is computed, given the data and the current estimate of the model (E-step). Then the model is found that maximizes a function of this conditional mean (M-step). For linear models, the M-step can be solved in closed form. The E-step is computed with a Kalman smoother, which combines the forward-time estimated mean and covariance (x̂_k^f, P_k^f) of the signal given past data, with the backward-time predicted mean and covariance (x̂_k^b, P_k^b) given the future data, producing the following smoothed statistics given all the data: \n\n(P_k^s)^{-1} = (P_k^f)^{-1} + (P_k^b)^{-1}, (14) \nx̂_k^s = P_k^s [(P_k^f)^{-1} x̂_k^f + (P_k^b)^{-1} x̂_k^b]. (15) \n\nWhen a MLP neural network model is used, the M-step can no longer be computed in closed form, and a gradient-based approach is used instead. The resulting algorithm is usually referred to as generalized EM (GEM).² The E-step is typically approximated by an extended Kalman smoother, wherein a linearization of the model is used for backward propagation of the state estimates. \n\nWe propose improving the E-step of the EM algorithm for nonlinear models by using a UF instead of an EKF to compute both the forward and backward passes in the Kalman smoother. Rather than linearize the model for the backward pass, as in [2], a neural network is trained on the backward dynamics (as well as the forward dynamics). This allows for a more exact backward estimation phase using the UF, and enables the development of an unscented smoother (US). \n\n²An exact M-step is possible using RBF networks [2]. 
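The forward-backward combination used in the E-step can be sketched as follows, assuming the standard information-form two-filter smoother; the function and variable names are our own.

```python
import numpy as np

def smooth_combine(x_f, P_f, x_b, P_b):
    """Combine forward-time (x_f, P_f) and backward-time (x_b, P_b)
    estimates into smoothed statistics, in information form."""
    P_f_inv = np.linalg.inv(P_f)
    P_b_inv = np.linalg.inv(P_b)
    P_s = np.linalg.inv(P_f_inv + P_b_inv)           # smoothed covariance
    x_s = P_s @ (P_f_inv @ x_f + P_b_inv @ x_b)      # smoothed mean
    return x_s, P_s
```

With equal forward and backward covariances, the smoothed mean is simply the average of the two estimates and the covariance is halved, reflecting the extra information gained from conditioning on all the data.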
\n\n4.4 Experiments \n\nWe present results on two simple time-series to provide a clear illustration of the use of the UF over the EKF. The first series is the Mackey-Glass chaotic series with additive WGN (SNR ≈ 3dB). The second time series (also chaotic) comes from an autoregressive neural network with random weights driven by Gaussian process noise and also corrupted by additive WGN (SNR ≈ 3dB). A standard 5-3-1 MLP with tanh hidden activation functions and a linear output layer was used in all the filters. The process and measurement noise variances were assumed to be known. \n\nResults on training and testing data, as well as training curves for the different dual estimation methods, are shown below. The quoted numbers are normalized (by the clean signal variance) mean-square estimation and prediction errors. The superior performance of the UT based algorithms (especially the dual UF) is clear. Note also the more stable learning curves using the UF approaches. These improvements have been found to be consistent and statistically significant on a number of additional experiments. \n\nMackey-Glass: \nAlgorithm | Train Est. | Train Pred. | Test Est. | Test Pred. \nDual EKF | 0.20 | 0.54 | 0.21 | 0.50 \nDual UF/EKF | 0.19 | 0.53 | 0.19 | 0.50 \nDual UF | 0.15 | 0.48 | 0.14 | 0.45 \nJoint EKF | 0.22 | 0.56 | 0.22 | 0.53 \nJoint UF | 0.19 | 0.53 | 0.18 | 0.50 \n\nChaotic AR-NN: \nAlgorithm | Train Est. | Train Pred. | Test Est. | Test Pred. \nDual EKF | 0.32 | 0.69 | 0.36 | 0.62 \nDual UF/EKF | 0.26 | 0.69 | 0.28 | 0.58 \nDual UF | 0.23 | 0.63 | 0.27 | 0.55 \nJoint EKF | 0.29 | 0.72 | 0.34 | 0.58 \nJoint UF | 0.25 | 0.67 | 0.30 | 0.55 \n\n[Training curves: normalized MSE vs. iteration for each algorithm on the Mackey-Glass and Chaotic AR-NN series.] \n\nThe final table below compares smoother performance used for the E-step in the EM algorithm. In this case, the network models are trained on the clean time-series, and then tested on the noisy data using either the standard Kalman smoother with linearized backward model (EKS1), a Kalman smoother with a second nonlinear backward model (EKS2), or the unscented smoother (US). The forward (F), backward (B), and smoothed (S) estimation errors are reported. Again the performance benefits of the unscented approach are clear. \n\nMackey-Glass (Norm. MSE): \nAlgorithm | F | B | S \nEKS1 | 0.20 | 0.70 | 0.27 \nEKS2 | 0.20 | 0.31 | 0.19 \nUS | 0.10 | 0.24 | 0.08 \n\nChaotic AR-NN (Norm. MSE): \nAlgorithm | F | B | S \nEKS1 | 0.35 | 0.32 | 0.28 \nEKS2 | 0.35 | 0.22 | 0.23 \nUS | 0.23 | 0.21 | 0.16 \n\n5 UF Neural Network Training \n\nAs part of the dual UF algorithm, we introduced the use of the UF for weight estimation. The approach can also be seen as a new method for the general problem of training neural networks (i.e., for regression or classification problems where the input x is observed and no state-estimation is required). The advantage of the UF over the EKF in this case is not as obvious, as the state-transition function is linear (see Equation 11). However, as pointed out earlier, the observation is nonlinear. Effectively, the EKF builds up an approximation to the expected Hessian by taking outer products of the gradient. The UF, however, may provide a more accurate estimate through direct approximation of the expectation of the Hessian. We have performed a number of preliminary experiments on standard benchmark data. 
The figure below shows the mean and std. of learning curves (computed over 100 experiments with different initial weights) for the MacKay Robot Arm mapping dataset. Note the faster convergence, lower variance, and lower final MSE performance of the UF weight training. While these results are encouraging, further study is still necessary to fully contrast differences between UF and EKF weight training. \n\n[Learning curves: mean and std. of MSE over 100 runs for UF and EKF weight training.] \n\n6 Conclusions \n\nThe EKF has been widely accepted as a standard tool in the machine learning community. In this paper we have presented an alternative to the EKF using the unscented filter. The UF consistently achieves a better level of accuracy than the EKF at a comparable level of complexity. We demonstrated this performance gain on a number of dual estimation methods as well as standard regression modeling. \n\nAcknowledgements \n\nThis work was sponsored in part by the NSF under grant IRI-9712346. \n\nReferences \n\n[1] J. F. G. de Freitas, M. Niranjan, A. H. Gee, and A. Doucet. Sequential Monte Carlo methods for optimisation of neural network models. Technical Report TR-328, Cambridge University Engineering Department, Cambridge, England, November 1998. \n\n[2] Z. Ghahramani and S. T. Roweis. Learning nonlinear dynamical systems using an EM algorithm. In M. J. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11: Proceedings of the 1998 Conference. MIT Press, 1999. \n\n[3] S. J. Julier and J. K. Uhlmann. A new extension of the Kalman filter to nonlinear systems. In Proc. of AeroSense: The 11th International Symposium on Aerospace/Defence Sensing, Simulation and Controls, Orlando, Florida, 1997. \n\n[4] R. E. Kopp and R. J. Orford. Linear regression applied to system identification for adaptive control systems. AIAA J., 1:2300-06, October 1963. \n\n[5] M. B. Matthews and G. S. Moschytz. Neural-network nonlinear adaptive filtering using the extended Kalman filter algorithm. In INNC, pages 115-8, 1990. \n\n[6] A. T. Nelson. Nonlinear Estimation and Modeling of Noisy Time-Series by Dual Kalman Filtering Methods. PhD thesis, Oregon Graduate Institute, 1999. In preparation. \n\n[7] S. Singhal and L. Wu. Training multilayer perceptrons with the extended Kalman filter. In Advances in Neural Information Processing Systems 1, pages 133-140, San Mateo, CA, 1989. Morgan Kaufmann. \n\n[8] E. A. Wan and A. T. Nelson. Dual Kalman filtering methods for nonlinear prediction, estimation, and smoothing. In Advances in Neural Information Processing Systems 9, 1997.", "award": [], "sourceid": 1681, "authors": [{"given_name": "Eric", "family_name": "Wan", "institution": null}, {"given_name": "Rudolph", "family_name": "van der Merwe", "institution": null}, {"given_name": "Alex", "family_name": "Nelson", "institution": null}]}