Part 2 of MPC Series

MPC: Model Predictive Control

In this part we are going to talk about MPC in the setting where the dynamics of the robot are linear and the task we want the robot to do is of a specific quadratic form (I will also have some code for you to try this out with). If you are just joining in and have no idea what I am talking about, I recommend going through part 1 of this series.

Linear dynamical systems

Given the state $s_t$ at discrete time $t$ and the action $a_t$, a robot with linear dynamics is defined by

\[s_{t+1} = f(s_t, a_t) = As_t + Ba_t\]

where $A \in \mathbb{R}^{n\times n}$ and $B \in \mathbb{R}^{n \times m}$ are matrices that tell us how the free dynamics evolve and how the control enters the robotic system. Unfortunately, very few robots can be defined this way, and for those that can, the additional requirement is that the system be controllable (a property you can compute, but I won’t get into here).
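To make this concrete, here is a minimal NumPy sketch (my own illustrative example, not from the original series) of a 1-D double integrator, a point mass with position and velocity, discretized with time step $dt$, which fits this linear form exactly:

```python
import numpy as np

# Hypothetical example: a 1-D double integrator (position, velocity)
# discretized with time step dt.
dt = 0.1
A = np.array([[1.0, dt],
              [0.0, 1.0]])  # free dynamics: position integrates velocity
B = np.array([[0.0],
              [dt]])        # the action enters as an acceleration

s = np.array([1.0, 0.0])    # start at position 1 with zero velocity
a = np.array([-0.5])        # constant braking action
for t in range(5):
    s = A @ s + B @ a       # s_{t+1} = A s_t + B a_t
    print(t + 1, s)
```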

Quadratic objectives

As mentioned in the previous post, we need a way to measure how well the robot is doing a task. We are going to go over the case where the goal for the robot is to drive its states to zero as a function of time. If you are familiar with optimization, you will know that one of the things you want is for the objective to be a convex quadratic form. We can write this kind of objective (a regulation objective, i.e., tracking a zero target) as $\ell(s,a)=s^\top Q s + a^\top R a$, where $Q$ is a positive semidefinite and $R$ a positive definite weight matrix; these let us put more emphasis on particular state and action dimensions. The objective function is then given to us by

\[\mathcal{J}(a_{0:T-1}, s_0) = \sum_{t=0}^{T-1} \left( s_t^\top Q s_t + a_t^\top R a_t \right) + s_T^\top Q s_T\]

where we also penalize the final state at time $T$. We want to find the set of actions $a_{0:T-1}$ that optimizes this objective (since our objective is a sum of convex quadratic terms, we will be minimizing it).
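To see what this objective looks like in code, you can roll the dynamics forward and accumulate the stage costs plus the terminal cost. A minimal sketch, with `rollout_cost` being my own hypothetical helper name:

```python
import numpy as np

def rollout_cost(A, B, Q, R, s0, actions):
    """Roll the linear dynamics forward and accumulate the cost J."""
    s, J = s0, 0.0
    for a in actions:
        J += s @ Q @ s + a @ R @ a  # stage cost at time t
        s = A @ s + B @ a           # linear dynamics step
    J += s @ Q @ s                  # terminal cost at time T
    return J
```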

What you will find in most control textbooks and papers is the following problem formulation

\[a^\star_{0:T-1} = \arg \min_{a_{0:T-1}} \mathcal{J}(a_{0:T-1}, s_0) \\ \text{subject to } s_{t+1} = f(s_t, a_t)\]

where $a^\star_{0:T-1}$ is the optimal sequence of inputs to the robot.
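As an aside, one way to get an answer to this problem directly is to hand it to a general-purpose convex solver. Here is a minimal sketch using the open-source CVXPY modeling library (my choice for illustration; the matrices below are the same hypothetical double integrator as above):

```python
import cvxpy as cp
import numpy as np

# Hypothetical double-integrator dynamics and weights (illustrative values)
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q = np.eye(2)
R = 0.1 * np.eye(1)
T = 20
s0 = np.array([1.0, 0.0])

s = cp.Variable((2, T + 1))   # states s_0 .. s_T
a = cp.Variable((1, T))       # actions a_0 .. a_{T-1}

cost = cp.quad_form(s[:, T], Q)      # terminal cost
constraints = [s[:, 0] == s0]        # initial condition
for t in range(T):
    cost += cp.quad_form(s[:, t], Q) + cp.quad_form(a[:, t], R)
    constraints.append(s[:, t + 1] == A @ s[:, t] + B @ a[:, t])

prob = cp.Problem(cp.Minimize(cost), constraints)
prob.solve()
print(prob.value, a.value[:, 0])     # optimal cost and first action
```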

Dynamic Programming Solution

If you want to, you can stick that optimization problem into some off-the-shelf solver and get an answer, and that would be fine. Chances are, though, you will have no idea how the solver came up with the answer, and it might take a while to produce one. What I will present instead is a recursive solution derived from dynamic programming, one that appears in a lot of optimization problems with temporal structure.

Let’s first define what is known as the value function

\[V_t(s_t) = \min_{a_{t:T-1}} \sum_{\tau = t}^{T-1} \left( s_\tau^\top Q s_\tau + a_\tau^\top R a_\tau \right) + s_T^\top Q s_T \\ \text{subject to } s_{\tau+1} = A s_\tau + B a_\tau\]

The value function $V_t(s_t)$ gives us the minimum cost-to-go starting from state $s_t$; in particular, $V_0(s_0)$ is the minimum total cost starting from state $s_0$.

Now this is where things get a bit weird and tricky, but try to follow along and you will see that things make sense in the end. Since the objective is convex and the dynamics are linear, it turns out that $V_t(s_t) = s_t^\top P_t s_t$, where $P_t = P^\top_t$ is a positive semidefinite matrix known as the cost-to-go matrix (take my word for it). The way this was originally found was through guess and check, so I am saving you from a lot of headaches. What is super cool and interesting is that we can find the optimal action $a_t$ through the value function.

Value Iteration

Let’s assume we know $V_{t+1}(s_{t+1})$. We can write the value function as

\[V_t(s_t) = \min_{a_t} \left( s_t^\top Q s_t + a_t^\top R a_t + V_{t+1}(A s_t + B a_t) \right)\]

where we use the dynamics to fill in $s_{t+1}$ as a function of the current state and action (I encourage you to prove this for yourself). This is the Bellman equation, the discrete-time counterpart of the Hamilton-Jacobi-Bellman (HJB) equation, and it is very useful in a lot of different applications including reinforcement learning, path planning, and of course, optimal control. If you can solve it, you get the optimal action at the current state by trading off the immediate cost against the cost-to-go from where you land (the foundation of dynamic programming). Remember that $V_t(s_t) = s_t^\top P_t s_t$; let’s assume we knew what $P_{t+1}$ was and plug this quadratic form into the Bellman equation.

\[V_t(s_t) = \min_{a_t} \left(s_t^\top Q s_t + a_t^\top R a_t + (A s_t + B a_t)^\top P_{t+1}(A s_t + B a_t) \right).\]

We can find the minimum of the value function at time $t$ by taking the derivative of the expression above with respect to $a_t$ and setting it to zero. This gives us the LQR solution:

\[2 R a_t + 2 B^\top P_{t+1} (A s_t + B a_t) = 0 \\ \text{ solving for $a_t$ gives: } a^\star_t = -(R + B^\top P_{t+1} B)^{-1} B^\top P_{t+1} A s_t\]

which is known as the LQR solution of the form $a^\star_t = K_t s_t$, where $K_t = -(R + B^\top P_{t+1} B)^{-1} B^\top P_{t+1} A$. What is $P_{t+1}$ though? Well, what is the value function at time $T$? Since the time horizon runs out, what you are left with is $V_T(s_T) = s_T^\top Q s_T = s_T^\top P_T s_T$, and this is where the iteration in value iteration comes from. Since we know the terminal value, namely $P_T = Q$, and the LQR gain $K_t$ is a function of the next time step’s cost-to-go matrix $P_{t+1}$, we can work our way backwards and calculate each optimal control!
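A quick way to convince yourself that this closed-form action really is the minimizer is to compare it against a generic numerical optimizer on a random one-step problem. A minimal SciPy sketch, where `P_next` stands in for an arbitrary $P_{t+1}$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, m = 2, 1
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, m))
Q, R = np.eye(n), np.eye(m)
C = rng.standard_normal((n, n))
P_next = C.T @ C                  # a random PSD stand-in for P_{t+1}
s = rng.standard_normal(n)

def bellman_rhs(a):
    """Right-hand side of the Bellman equation for a fixed state s."""
    s_next = A @ s + B @ a
    return s @ Q @ s + a @ R @ a + s_next @ P_next @ s_next

# Closed-form minimizer vs. a generic numerical optimizer
a_closed = -np.linalg.solve(R + B.T @ P_next @ B, B.T @ P_next @ A) @ s
a_numeric = minimize(bellman_rhs, np.zeros(m)).x
print(a_closed, a_numeric)        # the two should agree
```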

Plugging $a^\star_t = K_t s_t$ back into $V_t(s_t)$ as the solution for $a_t$, we get the following:

\[\begin{align} V_t(s_t) &= s_t^\top P_t s_t \\ \text{ where } P_t &= Q + K_t^\top R K_t + (A + B K_t)^\top P_{t+1} (A + B K_t) \\ \text{ and } K_t &= -(R + B^\top P_{t+1} B)^{-1} B^\top P_{t+1}A \end{align}\]

where the update equation for $P_t$, running backwards in time, is known as a discrete-time Riccati equation (its steady-state version is the well-known algebraic Riccati equation).

Starting from $t=T$ and working backwards, we get the recursive solution for LQR:

\[\begin{align} & P_T = Q \\ & \text{for } t = T-1, \ldots, 0: \\ & \quad K_t = -(R + B^\top P_{t+1} B)^{-1} B^\top P_{t+1} A \\ & \quad P_t = Q + K_t^\top R K_t + (A + B K_t)^\top P_{t+1} (A + B K_t) \\ & \quad a^\star_t = K_t s_t \end{align}\]

This is straightforward to implement in code, so I recommend trying it out for yourself; a minimal sketch follows below. In the next post, I will present the more general problem with nonconvex objectives and nonlinear dynamics using the JAX automatic differentiation library.
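Here is one way such an implementation might look, a minimal NumPy sketch reusing the hypothetical double-integrator setup from earlier (the function names `lqr_gains` and `lqr_rollout` are my own):

```python
import numpy as np

def lqr_gains(A, B, Q, R, T):
    """Backward Riccati recursion: returns the gains [K_0, ..., K_{T-1}]."""
    P = Q                       # terminal condition P_T = Q
    gains = []
    for _ in range(T):          # t = T-1, ..., 0
        K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + K.T @ R @ K + (A + B @ K).T @ P @ (A + B @ K)
        gains.append(K)
    return gains[::-1]          # reorder so gains[t] is K_t

def lqr_rollout(A, B, gains, s0):
    """Apply a_t = K_t s_t forward in time from s0."""
    s, states, actions = s0, [s0], []
    for K in gains:
        a = K @ s
        s = A @ s + B @ a
        actions.append(a)
        states.append(s)
    return np.array(states), np.array(actions)

# Same hypothetical double-integrator setup as earlier
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q, R = np.eye(2), 0.1 * np.eye(1)

gains = lqr_gains(A, B, Q, R, T=50)
states, actions = lqr_rollout(A, B, gains, np.array([1.0, 0.0]))
print(states[-1])  # the state should be driven toward zero
```

Running this, the final state printed should be close to zero, which is exactly the regulation behavior the quadratic objective asks for.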