First of all, let me give a basic intuition of $DQN$ (Deep Q-Network). It is composed of $D$ (deep learning) and $QN$ (Q-Network): the Q-Learning algorithm is the core, and the neural network is just a tool for maintaining the matrix used in Q-Learning, needed because of memory limits (a matrix of size $state \times action$ is too big most of the time).
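To make that concrete, here is a minimal sketch of a network that maps a state vector to one Q-value per action, which is exactly the role of one row of the Q matrix. It assumes PyTorch; the framework, layer sizes, and dimensions are illustrative and not taken from the Code section below.

```python
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2          # hypothetical environment sizes

# The network plays the role of the (too large) Q table:
# given a state, it outputs one Q-value per action.
q_network = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.ReLU(),
    nn.Linear(64, n_actions),
)

state = torch.randn(1, state_dim)    # a dummy state
q_values = q_network(state)          # shape (1, n_actions)
action = q_values.argmax(dim=1)      # greedy action
```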

Algorithm Description

Q-Learning

Q-Learning itself is explained in detail in this blog post: A Painless Q-Learning Tutorial

DQN

Similar to dynamic programming, Q-Learning maintains the Q array with the following state transition (update) equation:
$$
Q(s, a) = R(s, a) + \gamma \cdot \max_{\tilde{a}}\{Q(\tilde{s}, \tilde{a})\}
$$
Here $s, a$ denote the current state and action, $R$ is the reward matrix, $\gamma$ is the discount factor (a coefficient in $[0, 1]$ that weights future rewards), and $\tilde{s}, \tilde{a}$ denote the state and action at the next time step.
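As a quick illustration, a tabular version of this update might look like the sketch below; the sizes are made up and the reward matrix is a placeholder, in the spirit of the tutorial linked above.

```python
import numpy as np

n_states, n_actions = 6, 6            # hypothetical sizes
R = np.zeros((n_states, n_actions))   # placeholder reward matrix
gamma = 0.8                           # discount factor

Q = np.zeros((n_states, n_actions))   # the Q table that DQN later replaces with a network

def q_update(s, a, s_next):
    # Q(s, a) = R(s, a) + gamma * max_a' Q(s', a')
    Q[s, a] = R[s, a] + gamma * Q[s_next].max()
```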

Then comes the problem: how do we define the loss function? Easy. Recall that we are now using a neural network to maintain the Q-Learning table, which means we want the network's outputs to approximate the Q values. The problem can therefore be treated as supervised learning, with the labels being Q values computed by a previous version of the network.
$$
L_i(\theta_i) = \left( r + \gamma \cdot \max_{\tilde{a}}\{Q(\tilde{s}, \tilde{a} \mid \theta_{i-1})\} - Q(s, a \mid \theta_i) \right)^2
$$
which is an MSE loss. Here $\theta_i$ are the network parameters at iteration $i$, so $Q$ is a parametric approximation; the target term uses the older parameters $\theta_{i-1}$, meaning the 'label' Q values come from a previous, frozen copy of the network.
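A hedged PyTorch sketch of this loss follows; the function and variable names are mine, not from the Code section, and `target_net` stands for the frozen copy holding $\theta_{i-1}$ (the target Q-Network introduced below).

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    # Q(s, a | theta_i): the trained network's value for the action actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # r + gamma * max_a' Q(s', a' | theta_{i-1}), computed without gradients.
    # The (1 - dones) factor zeroes the bootstrap term at episode ends,
    # a standard detail not written out in the formula above.
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * q_next

    return F.mse_loss(q_sa, target)
```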

The training process is divided into three parts: an observation period, an exploration period, and a training period.
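One common way to schedule these periods is by step count, with a linearly annealed $\epsilon$ for $\epsilon$-greedy action selection; the thresholds and values in this sketch are placeholders, not the ones used in the Code section.

```python
OBSERVE_STEPS = 10_000     # observation: act randomly, only fill the replay memory
EXPLORE_STEPS = 100_000    # exploration: anneal epsilon while training
EPS_START, EPS_END = 1.0, 0.1

def epsilon_at(step):
    if step < OBSERVE_STEPS:                        # observation period
        return 1.0
    if step < OBSERVE_STEPS + EXPLORE_STEPS:        # exploration period
        frac = (step - OBSERVE_STEPS) / EXPLORE_STEPS
        return EPS_START + frac * (EPS_END - EPS_START)
    return EPS_END                                  # training period

def should_train(step):
    return step >= OBSERVE_STEPS                    # no parameter updates while only observing
```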

In practice, we define a replay memory to store past transitions and randomly sample a batch of them to train on during the exploration and training periods, which makes training more efficient and breaks the correlation between consecutive samples. Meanwhile, we build two identical networks, the Q-Network and the target Q-Network. The Q-Network is the one we train, while the target Q-Network is used to predict the 'label' Q values. Every fixed number of episodes, we copy the parameters of the Q-Network into the target Q-Network.
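Below is a minimal sketch of the replay memory and the periodic parameter copy, again with illustrative names, sizes, and intervals, assuming the PyTorch networks sketched earlier.

```python
import random
from collections import deque

import torch

replay_memory = deque(maxlen=50_000)      # old transitions are dropped automatically

def store(s, a, r, s_next, done):
    replay_memory.append((s, a, r, s_next, done))

def sample_batch(batch_size=32):
    # randomly pick a batch of stored transitions
    batch = random.sample(replay_memory, batch_size)
    s, a, r, s_next, done = zip(*batch)
    return (torch.stack(s), torch.tensor(a), torch.tensor(r, dtype=torch.float32),
            torch.stack(s_next), torch.tensor(done, dtype=torch.float32))

SYNC_EVERY = 1_000                        # interval (in episodes or steps) between copies

def maybe_sync(step, q_net, target_net):
    # copy the Q-Network parameters into the target Q-Network
    if step % SYNC_EVERY == 0:
        target_net.load_state_dict(q_net.state_dict())
```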

Code