In this first post, I will write about the basics of Markov Decision Process (MDP) and Reinforcement Learning (RL).

A Markov Decision Process is a mathematical framework for modeling decision-making. We define an MDP as a tuple of:

- a set of states S
- a set of actions A
- transition probabilities P(s'|s, a)
- a reward function R(s, a, s')
- a discount factor γ

The basic problem in an MDP is to find a policy for the decision maker, which is defined as π(s) = P(a | s). That is, the policy is a function of the state s. Our goal is to find the optimal policy.
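A stochastic policy of this form can be sketched as a table of action probabilities per state. The probabilities below are made up for illustration; only the state and action names come from the example later in the post:

```python
import random

# A stochastic policy pi(s) = P(a | s), stored as a table of action
# probabilities per state. The probability values are illustrative only.
policy = {
    "S0": {"a0": 0.7, "a1": 0.3},
    "S1": {"a0": 0.5, "a1": 0.5},
    "S2": {"a0": 0.1, "a1": 0.9},
}

def sample_action(policy, state):
    """Sample an action from pi(. | state)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

action = sample_action(policy, "S0")  # "a0" with probability 0.7
```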

There are two different types of MDP: deterministic and stochastic. In a deterministic MDP each action leads to a single next state, while in a stochastic MDP the next state is drawn from the transition probabilities P(s'|s, a).

In the image below you can see an MDP model.

The set of states is S = {S0, S1, S2}

The set of actions is A = {a0, a1}

In the image above you can see all the probabilities P(s’|s, a) for all s’, s in S and all a in A. This is a stochastic MDP.

You can also see all the possible rewards. There are only two nonzero rewards: +5 for the transition from state S1 to S0 under action a0, and -1 for the transition from state S2 to S0 under action a1. All other rewards are zero. As you can see, the reward depends on both states, s and s’, and on the action a.
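The states, actions, and rewards described above can be sketched as plain data structures. The two nonzero rewards match the text; the transition probabilities below are placeholders, since the actual values appear only in the figure:

```python
# Sketch of the MDP from the figure as plain data structures.
states = ["S0", "S1", "S2"]
actions = ["a0", "a1"]

# R[(s, a, s_next)] -> reward; unlisted transitions give reward 0.
R = {
    ("S1", "a0", "S0"): +5,
    ("S2", "a1", "S0"): -1,
}

# P[(s, a)] -> {s_next: probability}. Placeholder values for illustration;
# the real probabilities would be read off the figure.
P = {
    ("S0", "a0"): {"S0": 0.5, "S2": 0.5},
    # remaining (s, a) pairs would be filled in from the figure
}

def reward(s, a, s_next):
    """Look up the reward for a transition, defaulting to 0."""
    return R.get((s, a, s_next), 0)
```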

In the problems I am going to discuss on this blog, both the transition probabilities and the rewards are unknown. These kinds of problems can be solved using reinforcement learning (RL). RL is the area of machine learning concerned with how an agent should take actions in an environment in order to maximize its cumulative reward.

An RL agent interacts with its environment. At timestep t, the agent receives an observation o_t. Then, it takes an action a_t and receives a reward r_{t+1} and an observation o_{t+1}. Notice that the observation o_t and the state s_t may be different. I will discuss this issue in the Markov Property post.

For an MDP we can define the state-value function, for a policy π, V_π(s), as: \begin{equation} V_\pi(s) = E[R_t|s_t = s] \end{equation} where R_t is the discounted reward sum \begin{equation} R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots = \sum_{i=0}^{\infty}\gamma^i r_{t+i+1} \end{equation} where γ is the discount factor, taking values between 0 and 1.
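The discounted reward sum can be computed directly from a sequence of observed rewards; a minimal sketch (the reward values and γ in the example are made up):

```python
def discounted_return(rewards, gamma):
    """R_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ..."""
    total = 0.0
    for i, r in enumerate(rewards):
        total += (gamma ** i) * r
    return total

# Example: rewards observed after time t, with gamma = 0.9:
# 1 + 0.9*0 + 0.81*5 = 5.05
discounted_return([1, 0, 5], 0.9)
```

Notice how γ < 1 makes the sum converge and weights near-term rewards more heavily than distant ones.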

We can also define the state-action value function, for policy π, Q_π(s, a), as: \begin{equation} Q_\pi(s, a) = E[R_{t}|s_t = s, a_t = a] \end{equation}

The Q value of action a in state s is the expected discounted reward sum if we take action a in state s and then continue following policy π.

Our goal is to find the optimal policy. We will refer to only two major families of methods for finding it: Temporal Difference (TD) learning and Policy Gradient algorithms. The optimal policy, π*, can be defined using the state-action value function as:

\begin{equation} Q_{\pi^*}(s, a) \geq Q_\pi(s, a), \quad \forall (s, a) \in S \times A, \; \forall \pi \end{equation}
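Given such a Q function, the optimal policy simply acts greedily, picking argmax_a Q(s, a) in each state; a small sketch with made-up Q values:

```python
# A state-action value table; the values are illustrative only.
Q = {
    ("S0", "a0"): 1.2, ("S0", "a1"): 0.4,
    ("S1", "a0"): 4.8, ("S1", "a1"): 2.1,
    ("S2", "a0"): 0.0, ("S2", "a1"): -0.5,
}
actions = ["a0", "a1"]

def greedy_action(Q, s, actions):
    """Pick the action maximizing Q(s, a)."""
    return max(actions, key=lambda a: Q[(s, a)])

greedy_action(Q, "S1", actions)  # "a0", since 4.8 > 2.1
```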

TD methods try to find the state-value or the state-action value function. The simplest TD algorithm is TD(0), which updates the state-value function using the rule: \begin{equation} V(s_t) \gets V(s_{t}) + \alpha(r_{t+1} + \gamma V(s_{t+1}) - V(s_{t})) \end{equation} where α is the learning rate.
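The update rule translates directly into code; a minimal tabular sketch, using the +5 reward from the figure as an example transition (the α and γ values are made up):

```python
def td0_update(V, s, r_next, s_next, alpha, gamma):
    """One TD(0) step: V(s) <- V(s) + alpha*(r + gamma*V(s') - V(s))."""
    td_error = r_next + gamma * V[s_next] - V[s]
    V[s] = V[s] + alpha * td_error
    return V

V = {"S0": 0.0, "S1": 0.0, "S2": 0.0}
# Observed transition S1 -> S0 with reward +5, as in the figure:
td0_update(V, "S1", 5, "S0", alpha=0.1, gamma=0.9)
# V["S1"] is now 0.1 * (5 + 0.9*0 - 0) = 0.5
```

Repeating this update over many observed transitions moves V toward the true state-value function of the policy generating the data.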

If the state space is large, we cannot store a separate value for every possible state in memory, so we approximate the state-value function instead. This can be done using either a linear function or a non-linear function, such as a deep neural network, with parameters θ. Then we define the TD error as:

\begin{equation} L_{TD} = \frac{1}{2}(r_{t+1} + \gamma \hat{V}(s_{t+1}, \theta) - \hat{V}(s_{t}, \theta))^2 \end{equation}

Then we can use an optimizer to find the parameters θ. In the two equations above we can replace the state-value function V with the state-action value function Q. Methods of this kind are Q-learning and Sarsa.
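One way to sketch this is semi-gradient TD(0) with a linear approximator. The one-hot feature map below is an assumption for illustration (with one-hot features the update reduces to the tabular case); note that the bootstrapped target is treated as a constant when differentiating, hence "semi-gradient":

```python
# Semi-gradient TD(0) with a linear value function V_hat(s) = theta . phi(s).
state_index = {"S0": 0, "S1": 1, "S2": 2}

def phi(s):
    """Made-up one-hot feature vector, for illustration only."""
    x = [0.0, 0.0, 0.0]
    x[state_index[s]] = 1.0
    return x

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def semi_gradient_td0_step(theta, s, r_next, s_next, alpha, gamma):
    # TD error: r + gamma * V_hat(s') - V_hat(s); the bootstrapped
    # target is treated as a constant when taking the gradient.
    td_error = r_next + gamma * dot(theta, phi(s_next)) - dot(theta, phi(s))
    grad = phi(s)  # gradient of V_hat(s) with respect to theta
    return [t + alpha * td_error * g for t, g in zip(theta, grad)]

theta = [0.0, 0.0, 0.0]
theta = semi_gradient_td0_step(theta, "S1", 5, "S0", alpha=0.1, gamma=0.9)
# theta[1] == 0.5, matching the tabular TD(0) update for these features
```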

On the other hand, Policy Gradient methods try to directly approximate the policy π(s), using the policy gradient theorem. Methods of this kind are REINFORCE and actor-critic. Actually, actor-critic is both a policy gradient and a temporal difference method.
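As a sketch of the policy-gradient idea, here is a single REINFORCE-style update for a softmax policy over two actions; the preferences, return, and learning rate are made up for illustration:

```python
import math

def softmax_probs(prefs):
    """pi(a) proportional to exp(pref_a)."""
    exps = [math.exp(p) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_update(prefs, action, ret, alpha):
    """One REINFORCE step: prefs += alpha * G * grad log pi(action).
    For a softmax policy, d log pi(a) / d pref_i = 1{i==a} - pi(i)."""
    probs = softmax_probs(prefs)
    return [pref + alpha * ret * ((1.0 if i == action else 0.0) - probs[i])
            for i, pref in enumerate(prefs)]

prefs = [0.0, 0.0]  # initial preferences for actions a0, a1
prefs = reinforce_update(prefs, action=0, ret=5.0, alpha=0.1)
# action a0 received a positive return, so its probability increases
```

The key point is that the update adjusts the policy parameters directly, in the direction that makes well-rewarded actions more likely, without ever estimating a value function (that is what the critic adds in actor-critic).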

In later posts I will write about TD and PG methods, in order to discuss the deep reinforcement learning algorithms of DeepMind.