Rosetta Stone for Stochastic Dynamic Programming

In grad school, I worked on what operations researchers call approximate dynamic programming. This field is virtually identical to what computer scientists call reinforcement learning. It’s also been called neuro-dynamic programming. All of these are extensions of stochastic dynamic programming, which is usually introduced through Markov decision processes.

Because these overlapping fields have been developed in different disciplines, mathematical notation for them can vary dramatically. After much confusion and frustration, I created this Rosetta Stone to help me keep everything straight. I hope others will find it useful as well. A PDF and the original LaTeX code are available on my GitHub.

|                    | Bertsekas                | Sutton and Barto            | Puterman                    | Powell                                      |
|--------------------|--------------------------|-----------------------------|-----------------------------|---------------------------------------------|
| Stages             | $k$                      | $t$                         | $t$                         | $t$                                         |
| First Stage        | $N$                      | $1$                         | $1$                         | $1$                                         |
| Final Stage        | $0$                      | $T$                         | $N$                         | $T$                                         |
| State Space        |                          |                             | $S$                         | $\mathcal{S}$                               |
| State              | $i$, $i_{k}$             | $s$                         | $s$                         | $s$, $S_{t}$                                |
| Action Space       | $U(i)$                   |                             | $A=\cup_{s\in S}A_{s}$      | $\mathcal{A}$                               |
| Action             |                          | $a$                         | $a$                         | $a$                                         |
| Policy             | $\mu_{k}(i)$, $\pi$      | $\pi(s,a)$, $\pi$           | $\pi$, $d_{t}^{MD}(s)$      | $\pi$                                       |
| Transitions        | $p_{ij}(\mu_{k}(i))$     | $\mathcal{P}_{ss'}^{a}$     | $p_{t}(\cdot\,\vert\,s,a)$  | $\mathbb{P}(s'\,\vert\,S_{t},a_{t})$        |
| Cost               | $g(i,u,j)$               | $\mathcal{R}_{ss'}^{a}$     | $r_{t}(s,a)$                | $C_{t}(S_{t},a_{t})$                        |
| Terminal Cost      | $G(i_{N})$               | $r_{T}$                     | $r_{N}(s)$                  | $V_{T}(S_{T})$                              |
| Discount           | $\alpha$                 | $\gamma$                    | $\lambda$                   | $\gamma$                                    |
| Q-Value (Policy)   | $J_{k}^{\pi}(i)$         | $\mathcal{Q}^{\pi}(s,a)$    |                             |                                             |
| Q-Value (Optimal)  |                          |                             |                             | $\mathcal{Q}(S^{n},a)$                      |
| Value (Policy)     | $J_{k}^{\pi}(i)$         | $V^{\pi}(s)$                | $u_{t}^{\pi}$               | $V_{t}^{\pi}(S_{t})$                        |
| Value (Optimal)    | $J_{k}^{*}(i)$           | $V^{*}(s)$                  | $u_{t}^{*}$                 | $V_{t}(S_{t})$                              |
| Bellman Operator   | $T$                      |                             | $\mathscr{L}$, $L$          | $\mathcal{M}$                               |

Optimal Value Function

  • Bertsekas

$$J_{k}^{*}(i)=\min_{u\in U(i)}\sum_{j=1}^{n}p_{ij}(u)\left(g(i,u,j)+\alpha J^{*}_{k-1}(j)\right)$$

  • Sutton and Barto

$$V^{*}(s)=\max_{a}\sum_{s'}\mathcal{P}^{a}_{ss'}\left[\mathcal{R}^{a}_{ss'}+\gamma V^{*}(s')\right]$$

  • Puterman

$$u_{t}^{*}(s_{t})=\max_{a\in A_{s_{t}}}\left\{r_{t}(s_{t},a)+\sum_{j\in S}p_{t}(j\,|\,s_{t},a)u_{t+1}^{*}(j)\right\}$$

  • Powell

$$V_{t}(S_{t})=\max_{a_{t}}\left\{C_{t}(S_{t},a_{t})+\gamma\sum_{s'\in \mathcal{S}}\mathbb{P}(s'\,\vert\,S_{t},a_{t})V_{t+1}(s')\right\}$$
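
All four equations describe the same backward recursion, just dressed in different symbols. To make that concrete, here is a minimal sketch of finite-horizon backward induction on a toy two-state, two-action MDP. The example MDP, horizon, and variable names are my own illustration (they appear in none of the four books); it follows Powell's max/contribution convention with a zero terminal value.

```python
import numpy as np

# Toy problem: 2 states, 2 actions, horizon T (decisions at t = 0, ..., T-1).
T = 3                      # final stage
gamma = 0.95               # discount factor (alpha in Bertsekas, lambda in Puterman)

# P[a][s][s'] = probability of moving from s to s' under action a
P = np.array([
    [[0.9, 0.1],           # action 0
     [0.2, 0.8]],
    [[0.5, 0.5],           # action 1
     [0.1, 0.9]],
])
# C[a][s] = one-stage contribution (reward) for taking action a in state s
C = np.array([
    [1.0, 0.0],
    [0.5, 2.0],
])

n_states = P.shape[1]
V = np.zeros((T + 1, n_states))        # V[T] is the terminal value, here all zeros

# Backward induction:
#   V_t(s) = max_a { C(s, a) + gamma * sum_{s'} P(s' | s, a) * V_{t+1}(s') }
for t in reversed(range(T)):
    Q = C + gamma * P @ V[t + 1]       # Q[a][s]: value of taking a in s at stage t
    V[t] = Q.max(axis=0)               # optimal value function at stage t
    policy_t = Q.argmax(axis=0)        # optimal decision rule at stage t
    print(f"t={t}  V={V[t]}  policy={policy_t}")
```

Swapping `max` for `min` (and contributions for costs) recovers Bertsekas's version; re-indexing stages as stages-to-go recovers his $k$ counting down to $0$.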

References

  • D.P. Bertsekas. Dynamic Programming and Optimal Control, volume 2. Athena Scientific Optimization and Computation Series. Athena Scientific, 2007. ISBN 9781886529304. URL.
  • W.B. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality. Wiley Series in Probability and Statistics. John Wiley & Sons, 2011. ISBN 9781118029152. URL.
  • M.L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics. Wiley-Interscience, 1994. ISBN 9780471727828. URL.
  • R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning. MIT Press, 1998. ISBN 9780262193986. URL.