Paddle Learning Notebook
Normalized Training Model
Stage 1: Load Data
Divide the stage into four sub-stages:
- load data file
- separate data into groups according to the number of features
- divide the dataset into training & test dataset
- normalize data to the range of [0, 1]
1.1.1 load data file
```python
import paddle
import numpy as np
```
`np.fromfile` is used to construct an array from a text or binary file; its `sep` parameter indicates the separator. Here we use spaces as the separator.
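For example, a minimal sketch of this loading step (the file name `housing.data` and the 506-sample, 14-column shape follow the classic UCI housing dataset and are assumptions here):

```python
import numpy as np

# read every whitespace-separated number in the file into a flat 1-D array
data = np.fromfile('housing.data', sep=' ')
print(data.shape)  # e.g. (7084,) for 506 samples x 14 columns
```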
1.1.2 separate data into groups according to the number of features
```python
# separate data into groups according to the number of features
data = data.reshape([data.shape[0] // feature_num, feature_num])
```

where `feature_num` is the number of columns per sample.
1.1.3 divide the dataset into training & test dataset
```python
# divide the dataset
offset = int(data.shape[0] * 0.8)
```
Here we regard the first $80\%$ of the data as the training data.
1.1.4 normalization
```python
# normalize data to the range of [0, 1]
data = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))
```
`axis = 0` means the operation is applied down the rows, producing one result per column (here, per feature). In contrast, `axis = 1` applies the operation across the columns, producing one result per row.
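A quick check of these semantics (the array values are illustrative):

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
print(a.max(axis=0))  # per-column max, taken down the rows -> [3 4]
print(a.max(axis=1))  # per-row max, taken across the columns -> [2 4]
```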
1.1.5 end of the stage
```python
training_data = data[:offset]
test_data = data[offset:]
```
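Putting the four sub-stages together, here is a minimal sketch of the whole `load_data` function (the file name, the feature count of $14$, and normalizing with training-set statistics are assumptions of this sketch):

```python
import numpy as np

def load_data(path='housing.data', feature_num=14, ratio=0.8):
    # 1.1.1 load the whitespace-separated data file into a flat array
    data = np.fromfile(path, sep=' ')
    # 1.1.2 separate data into groups according to the number of features
    data = data.reshape([data.shape[0] // feature_num, feature_num])
    # 1.1.3 regard the first 80% of the samples as the training data
    offset = int(data.shape[0] * ratio)
    # 1.1.4 normalize every column to [0, 1] using training-set statistics
    maximums = data[:offset].max(axis=0)
    minimums = data[:offset].min(axis=0)
    data = (data - minimums) / (maximums - minimums)
    # 1.1.5 end of the stage
    return data[:offset], data[offset:]
```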
ex 1.1.1 extract data
```python
training_data, test_data = load_data()
```
Basic Intuition About CNN
Layers
- Convolutional Layer
- Pooling Layer: merges neighboring activations, which reduces the size of the data. Used to suppress noise so that the significant information can be extracted.
- The layers above are used for feature extraction.
- Flattening Layer: transforms the multi-dimensional output of the convolutional layers into a 1-dimensional vector that can be accepted by the fully-connected layer.
- Fully-connected Layer: the main neural network section, $e.g.$ using a softmax or sigmoid function to do classification.
Number Classifier (Classic CNN, Paddle)
Input image size: $28 \times 28$
```python
import paddle
```
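A minimal sketch of such a classifier (the LeNet-style layer sizes and the name `NumberClassifier` are choices made for this sketch, not from the original notebook):

```python
import paddle
import paddle.nn as nn

class NumberClassifier(nn.Layer):
    """A LeNet-style CNN for 28x28 grayscale digit images."""
    def __init__(self, num_classes=10):
        super().__init__()
        # feature extraction: convolution + pooling layers
        self.features = nn.Sequential(
            nn.Conv2D(1, 6, kernel_size=5, padding=2),  # -> 6 x 28 x 28
            nn.ReLU(),
            nn.MaxPool2D(kernel_size=2, stride=2),      # -> 6 x 14 x 14
            nn.Conv2D(6, 16, kernel_size=5),            # -> 16 x 10 x 10
            nn.ReLU(),
            nn.MaxPool2D(kernel_size=2, stride=2),      # -> 16 x 5 x 5
        )
        self.flatten = nn.Flatten()                     # -> vector of 400
        # fully-connected layer for classification
        self.fc = nn.Linear(16 * 5 * 5, num_classes)

    def forward(self, x):
        return self.fc(self.flatten(self.features(x)))
```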
Single image judgement
```python
# load model
```
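A minimal sketch of judging a single image (the parameter file name and the `NumberClassifier` class from the sketch above are assumptions):

```python
import numpy as np
import paddle

# load model parameters saved earlier with paddle.save(...)
model = NumberClassifier()
model.set_state_dict(paddle.load('number_classifier.pdparams'))
model.eval()

# img: a 28x28 grayscale image normalized to [0, 1] (placeholder values here)
img = np.random.rand(28, 28).astype('float32')
x = paddle.to_tensor(img.reshape([1, 1, 28, 28]))  # NCHW batch of one
logits = model(x)
print('predicted digit:', paddle.argmax(logits, axis=1).item())
```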
Yolov3
Definition
Anchor Box
Use a basic anchor box to generate a series of anchor boxes that keep the same area as the basic one.
The parameters center, scale, and ratio are all given in advance in the input.
center describes the position of the center pixel of the basic anchor box.
scale describes the size of the basic anchor box.
ratio describes the aspect ratio.
Suppose the width and height of the basic anchor box and of a generated box are $w, h$ and $W, H$ respectively, with ratio equal to $k$. Then $wh = WH$ and $\frac{H}{W} = k$. Therefore, $W = \sqrt{\frac{wh}{k}}, H = \sqrt{whk}$.
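A small sketch of this generation rule (the function name and the box format $(x_1, y_1, x_2, y_2)$ are illustrative choices):

```python
import numpy as np

def generate_anchors(center, scale, ratios):
    cx, cy = center
    w = h = scale                  # basic anchor box: a scale x scale square
    boxes = []
    for k in ratios:               # k = H / W, with W * H == w * h
        W = np.sqrt(w * h / k)
        H = np.sqrt(w * h * k)
        boxes.append([cx - W / 2, cy - H / 2, cx + W / 2, cy + H / 2])
    return np.array(boxes)

print(generate_anchors(center=(50, 50), scale=32, ratios=[0.5, 1.0, 2.0]))
```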
IoU (Intersection over Union)
$$
IoU = \frac{A \cap B}{A \cup B}
$$
Used to describe the degree of overlap between two boxes.
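A minimal sketch of computing IoU for two axis-aligned boxes (the corner format $(x_1, y_1, x_2, y_2)$ is an assumption):

```python
def iou(a, b):
    # corners of the intersection rectangle
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 25 / 175 ≈ 0.143
```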
Process
The basic intuition about how Yolov3 processes the data: we first divide the picture into $n \times n$ grids (typically $13 \times 13$, $26 \times 26$, and $52 \times 52$), then, for each ground-truth box, assign the anchor box that has the highest $IoU$ value with respect to it.
Speaking in detail, for each anchor box we need to predict a set of values $(P_{object}, t_x, t_y, t_w, t_h, P_{class_1}, P_{class_2}, \dots)$, where $P_{object}$ and $P_{class_n}$ are the probabilities that the anchor box contains an object and that the object belongs to class $n$, respectively. The offsets $t_x, t_y, t_w, t_h$ exist because we need to slightly adjust the centers and sizes of the anchor boxes, for they are originally set to fixed values.
Then how to predict $(t_x, t_y, t_w, t_h)$? We do regression.
For $t_x, t_y$,
$$
\begin{aligned}
b_x &= c_x + \sigma(t_x) \\
b_y &= c_y + \sigma(t_y)
\end{aligned}
$$
where $b_x, b_y$ indicate the center coordinates of the ground-truth box, and $c_x, c_y$ indicate which grid cell we are in, $e.g.$ if the center of the ground-truth box lies in the $5$-th cell of a certain row, then $c_x$ should be $4$ (counting from zero). The sigmoid function has a range of $(0, 1)$, which means $c_x + \sigma(t_x)$ invariably lies between $c_x$ and $c_x + 1$, which meets the requirement.
For $t_w, t_h$,
$$
\begin{aligned}
b_w &= c_w e^{t_w} \\
b_h &= c_h e^{t_h}
\end{aligned}
$$
where $c_w, c_h$ are the width and height of the matched anchor box. This form is used because it is easier to do regression on a ratio than on the true values of $w, h$, and the exponential $e^{t_w}$ keeps the ratio positive.
Then it is clear that we only need to fill in
$$
\begin{aligned}
d^*_x &= \sigma(t_x) = b_x - c_x \\
d^*_y &= \sigma(t_y) = b_y - c_y \\
t^*_w &= \ln{\frac{b_w}{c_w}} \\
t^*_h &= \ln{\frac{b_h}{c_h}}
\end{aligned}
$$
into the target matrices as the regression labels for training.
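A small sketch of filling in these target values for one ground-truth box (the function name and the $(b_x, b_y, b_w, b_h)$ input format, with centers measured in grid-cell units, are assumptions):

```python
import numpy as np

def encode_targets(gt_box, anchor_wh):
    bx, by, bw, bh = gt_box               # ground-truth center and size
    cw, ch = anchor_wh                    # matched anchor width and height
    cx, cy = np.floor(bx), np.floor(by)   # grid cell containing the center
    dx, dy = bx - cx, by - cy             # = sigmoid(t_x), sigmoid(t_y)
    tw, th = np.log(bw / cw), np.log(bh / ch)
    return dx, dy, tw, th

print(encode_targets(gt_box=(4.3, 7.6, 2.0, 3.0), anchor_wh=(1.5, 2.5)))
```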
Yolov3 uses the $Darknet53$ network to train. The size of the output layer (for the $13 \times 13$ scale) should be $13 \times 13 \times batchsize \times (1 + 1 + 1 + 1 + 1 + number~of~classes)$ per anchor box, where $(1 + 1 + 1 + 1 + 1)$ stores $(P_{object}, x, y, w, h)$.
It uses binary cross-entropy as the loss function.
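For reference, for a predicted probability $p$ and a label $y \in \{0, 1\}$, binary cross-entropy is
$$
BCE(p, y) = -\left[y \ln{p} + (1 - y) \ln{(1 - p)}\right]
$$
applied independently to the objectness probability and each class probability.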
NLP (Natural Language Processing)
Word Embedding
The conversion from words to high-dimensional vectors. For instance, the word `king` can be transformed to the vector $[0.3, -0.35, \dots, 0.7]$ (the values in the vector are small real numbers, typically in the range of $[-1, 1]$).
Actually, these transformed vectors can represent information about the words to some extent. What we need to do next is to calculate distances between vectors to find connections between them, $e.g.$ the words `mother` and `son` are supposed to be near each other in the high-dimensional coordinate system in general.
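A minimal sketch of one common distance measure, cosine similarity (the vector values are made up for illustration):

```python
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

mother = np.array([0.61, -0.20, 0.55])  # illustrative embedding
son    = np.array([0.58, -0.15, 0.49])  # illustrative embedding
print(cosine_similarity(mother, son))   # close to 1 for related words
```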
One-Hot Encoding
Suppose we have a dictionary with $5000$ words; then what is generated by one-hot encoding is a $1 \times 5000$ vector with a single one and $4999$ zeros. For example, if the word `king` is the $\text{3rd}$ word in the dictionary, then its one-hot encoding should be $[0, 0, 1, 0, \dots, 0]$.
Conversion from words to vectors
one-hot encoding $\times$ embedding lookup $=$ output vector
where the embedding lookup is an $n \times m$ tensor (suppose we have $n$ words in the dictionary and each word is converted to an $m$-dimensional vector). For example, here is an embedding lookup:
| word | dim 1 | dim 2 | dim 3 |
| --- | --- | --- | --- |
| king | 0.3 | 0.5 | -0.1 |
| queen | 0.2 | 0.45 | -0.05 |
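A quick sketch of the multiplication, reusing the two rows above: the one-hot vector simply selects its word's row from the lookup table.

```python
import numpy as np

lookup = np.array([[0.3, 0.5,  -0.1],    # king
                   [0.2, 0.45, -0.05]])  # queen
one_hot_king = np.array([1, 0])          # king is the 1st word here
print(one_hot_king @ lookup)             # -> [ 0.3  0.5  -0.1]
```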
CBOW & Skip-gram
CBOW and Skip-gram are two typical models based on the word2vec algorithm.
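A minimal sketch contrasting the training pairs the two models are built on (the toy corpus and window size of $1$ are assumptions): CBOW predicts a word from its context, while Skip-gram predicts the context from a word.

```python
corpus = ['the', 'king', 'rules', 'the', 'land']
window = 1

cbow_pairs, skipgram_pairs = [], []
for i, target in enumerate(corpus):
    context = corpus[max(0, i - window):i] + corpus[i + 1:i + 1 + window]
    cbow_pairs.append((context, target))                  # context -> word
    skipgram_pairs.extend((target, c) for c in context)   # word -> context

print(cbow_pairs[1])       # (['the', 'rules'], 'king')
print(skipgram_pairs[:2])  # [('the', 'king'), ('king', 'the')]
```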
4.4.1 CBOW
The figure above shows a basic network for CBOW, which is similar to the one used in Skip-gram. We will illustrate it in detail in the Skip-gram section.
4.4.2 Skip-gram