Normalized Training Model

Stage 1: Load Data

Divide this stage into four sub-stages:

  • load data file
  • separate data into groups according to the number of features
  • divide the dataset into training & test dataset
  • normalize data to the range of [0, 1]

1.1.1 load data file

import paddle
import numpy as np
import json

def load_data():
    # load data file (path elided in the original)
    datafile = '.../.../.../'
    data = np.fromfile(datafile, sep=' ')

np.fromfile constructs an array from a text or binary file; the sep parameter specifies the separator. Here we use a space as the separator.

1.1.2 separate data into groups according to the number of features

    # separate data into groups according to the number of features
    features = []  # fill in the dataset's column names here; must not be empty
    feature_num = len(features)
    data = data.reshape(data.shape[0] // feature_num, feature_num)

1.1.3 divide the dataset into training & test dataset

    # divide the dataset
    ratio = 0.8
    offset = int(data.shape[0] * ratio)
    training_data = data[:offset]

Here we take the first $80\%$ of the samples as the training data.

1.1.4 normalization

    # normalize each feature using its mean and range
    # note: (x - avg) / (max - min) maps values into roughly [-0.5, 0.5];
    # use (x - min) / (max - min) instead for a strict [0, 1] range
    maximum, minimum, avg = \
        training_data.max(axis=0), \
        training_data.min(axis=0), \
        training_data.sum(axis=0) / training_data.shape[0]
    for i in range(feature_num):
        data[:, i] = (data[:, i] - avg[i]) / (maximum[i] - minimum[i])

axis = 0 applies the operation down the rows, yielding one result per column (i.e., one per feature); axis = 1 applies it across the columns, yielding one result per row.
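
For example (a minimal sketch with a made-up array):

a = np.array([[1, 2, 3],
              [4, 5, 6]])
print(a.max(axis=0))  # [4 5 6] -> one maximum per column (per feature)
print(a.max(axis=1))  # [3 6]   -> one maximum per row (per sample)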

1.1.5 end of the stage

    training_data = data[:offset]
    test_data = data[offset:]
    return training_data, test_data

ex 1.1.1 extract data

training_data, test_data = load_data()
x = training_data[:, :-1]
y = training_data[:, -1:]

Basic Intuition About CNN

Layers

  • Convolutional Layer
  • Pooling Layer: downsamples the feature map (i.e., reduces the size of the data). Used for reducing noise so that the significant information can be extracted.
  • The layers above perform feature extraction.
  • Flattening Layer: transforms the multi-dimensional output of the convolutional layers into a 1-dimensional vector that the fully-connected layer can accept.
  • Fully-connected Layer: the main neural network section, e.g., using a softmax or sigmoid function for classification.

Number Classifier (Classic CNN, Paddle)

Input image size: $28 \times 28$
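
As a quick sanity check on the fully-connected layer size used in the code below ($7 \times 7 \times 20 = 980$), the spatial size after each layer follows out = (in + 2 * pad - kernel) // stride + 1:

# conv1 (kernel 5, pad 2, stride 1): (28 + 4 - 5) // 1 + 1 = 28
# pool1 (kernel 2, stride 2):        (28 - 2) // 2 + 1 = 14
# conv2 (kernel 5, pad 2, stride 1): (14 + 4 - 5) // 1 + 1 = 14
# pool2 (kernel 2, stride 2):        (14 - 2) // 2 + 1 = 7
# flatten: 7 * 7 spatial positions * 20 channels = 980 features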

import paddle
from paddle.nn import Linear, Conv2D, MaxPool2D
import paddle.nn.functional as F
import os
import numpy as np
import matplotlib.pyplot as plt
import gzip
import json
import random

# !important
# if we want to extract values from the data loader,
# we need to guarantee that all image data in it are in numpy format;
# the line below takes care of this
paddle.vision.set_image_backend('cv2')

# ''' --------------------load data version 1------------------------
# load the MNIST dataset
train_dataset = paddle.vision.datasets.MNIST(mode='train')
# train_dataset[n][0] holds the n-th image; train_dataset[n][1] holds its label
# note that we still need to normalize them
# ------------------------------------------------------------ '''

''' -------------load data version 2 (self-wrapped load_data function)----------------
def load_data(mode='train'):
    datafile = './Practice/NumClassifier_Paddle/mnist.json.gz'
    data = json.load(gzip.open(datafile))

    train_set, val_set, eval_set = data
    if mode == 'train':
        images, labels = train_set[0], train_set[1]
    elif mode == 'valid':
        images, labels = val_set[0], val_set[1]
    elif mode == 'eval':
        images, labels = eval_set[0], eval_set[1]
    else:  # throw an error message
        raise Exception("invalid mode")

    assert len(images) == len(labels)

    # shuffle the training data
    index_list = list(range(len(images)))
    random.shuffle(index_list)

    BATCHSIZE = 100
    # a python generator (see the python docs for the exact definition)
    def data_generator():
        # use two separate lists here: `img_list = label_list = []` would
        # bind both names to the same list object
        img_list, label_list = [], []
        for i in index_list:
            img_list.append(np.array(images[i]).astype('float32'))
            label_list.append(np.array(labels[i]).astype('float32'))
            if len(img_list) == BATCHSIZE:
                yield np.array(img_list), np.array(label_list)
                img_list, label_list = [], []
        # if some elements remain that have not been yielded yet
        if len(img_list) > 0:
            yield np.array(img_list), np.array(label_list)

    return data_generator
---------------------------------------------------------------------------------- '''

# normalization
def norm_img(images):
    assert len(images.shape) == 3
    batch_size, img_h, img_w = images.shape[0], images.shape[1], images.shape[2]
    images = images / 255
    # reshape to 4-D: [batch_size, channels, height, width]
    images = paddle.reshape(images, [batch_size, 1, img_h, img_w])

    # values in images are paddle tensors of dtype 'paddle.float32';
    # the caller still casts them with .astype('float32') before training

    return images

class MNIST(paddle.nn.Layer):
    def __init__(self):
        super(MNIST, self).__init__()

        # CNN
        self.conv1 = Conv2D(in_channels=1, out_channels=20, kernel_size=5, stride=1, padding=2)
        self.max_pool1 = MaxPool2D(kernel_size=2, stride=2)
        self.conv2 = Conv2D(in_channels=20, out_channels=20, kernel_size=5, stride=1, padding=2)
        self.max_pool2 = MaxPool2D(kernel_size=2, stride=2)
        # fc is short for fully-connected layer
        # 7 * 7 * 20 = 980
        self.fc = Linear(in_features=980, out_features=10)

    def forward(self, input):
        x = self.conv1(input)
        # remember to apply the activation
        x = F.relu(x)
        x = self.max_pool1(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = self.max_pool2(x)
        x = paddle.reshape(x, [x.shape[0], -1])
        x = self.fc(x)
        # note: F.cross_entropy applies softmax internally by default,
        # so returning the raw logits here would also work
        x = F.softmax(x)
        return x

model = MNIST()

def train(model):
    # set training mode
    model.train()
    # load training data with a batch size of 16
    train_loader = paddle.io.DataLoader(train_dataset, batch_size=16, shuffle=True)
    # train_loader = load_data('train')
    # define the optimizer ('parameters' tells it which elements to train,
    # i.e. the list of parameters to update during back-propagation)
    # opt = paddle.optimizer.SGD (learning_rate = 0.001, parameters = model.parameters ())
    opt = paddle.optimizer.Adam(learning_rate=0.001, parameters=model.parameters())
    EPOCH_NUM = 2
    # what we extract from train_loader are batches of images and labels,
    # so 'images' below contains 16 images and has shape [16, 28, 28],
    # i.e. [batch_size, height, width]
    for epoch in range(EPOCH_NUM):
        for batch_id, (images, labels) in enumerate(train_loader()):
            # ''' ---------used for paddle.io.DataLoader------------
            images = norm_img(images).astype('float32')
            labels = labels.astype('int64')  # required by cross-entropy
            # ----------------------------------------------- '''

            # forward propagation
            predicts = model(images)

            # compute the loss
            loss = F.cross_entropy(predicts, labels)
            avg_loss = paddle.mean(loss)

            if batch_id % 1000 == 0:
                print("epoch_id: {}, batch_id: {}, loss is: {}".format(epoch, batch_id, avg_loss.numpy()))

            # back propagation: compute gradients for each layer's parameters
            avg_loss.backward()
            # update the parameters
            opt.step()
            # clear the old gradients
            opt.clear_grad()

train(model)
paddle.save(model.state_dict(), './Practice/NumClassifier_Paddle/mnist.pt')

test_dataset = paddle.vision.datasets.MNIST(mode='test')
test_loader = paddle.io.DataLoader(test_dataset, batch_size=16, shuffle=False)

def evaluation(model, loader):
    # set evaluation mode
    model.eval()

    # list() creates an empty list
    acc_set = list()
    for batch_id, (images, labels) in enumerate(loader):
        images = norm_img(images).astype('float32')
        labels = labels.astype('int64')

        predict = model(images)
        acc = paddle.metric.accuracy(input=predict, label=labels)
        acc_set.extend(acc.numpy())

    acc_val_mean = np.array(acc_set).mean()
    return acc_val_mean

# pass the loader (the original passed test_dataset but iterated the loader)
acc = evaluation(model, test_loader)
print(acc)

Single-Image Prediction

from PIL import Image

# load the trained model
model = MNIST()
param_dict = paddle.load('./Practice/NumClassifier_Paddle/mnist.pt')
model.load_dict(param_dict)
model.eval()

def load_image(path):
    # load the image and convert it to grayscale
    img = Image.open(path).convert('L')
    # Image.ANTIALIAS is a resampling filter that preserves quality;
    # it was removed in Pillow 10, where Image.LANCZOS is the replacement
    img = img.resize((28, 28), Image.ANTIALIAS)
    # reshape the image to a 4-D tensor of shape [1, 1, 28, 28]
    img = np.array(img).reshape(1, 1, 28, 28).astype(np.float32)
    # invert black & white to match the MNIST format
    img = 1.0 - img / 255

    return img

# load the image
tensor_img = load_image('./Practice/NumClassifier_Paddle/zero.jpg')
# note that we need to convert the normalized image to a tensor
result = model(paddle.to_tensor(tensor_img))
# sort the probabilities and pick the index of the maximum
result = np.argsort(result.numpy())
print(result[0][-1])

YOLOv3

Definition

Anchor Box

  • Use a basic anchor box to generate a series of anchor boxes that all keep the same area as the basic one.

  • The parameters center, scale, and ratio are all given in the input.

  • center describes the position of the center pixel of the basic anchor box.

  • scale describes the size of the basic anchor box.

  • ratio describes the aspect ratio.

  • Suppose the width and height of the basic anchor box and of the finally generated box are $w, h$ and $W, H$ respectively, with ratio equal to $k$. Then $wh = WH$ and $\frac{H}{W} = k$. Therefore, $W = \sqrt{\frac{wh}{k}}, H = \sqrt{whk}$ (see the sketch below).
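
A minimal sketch of this generation rule (the function name and example values are hypothetical):

import numpy as np

def generate_anchor(center, scale, ratio):
    # center = (cx, cy); scale = area of the basic box (w * h); ratio = k = H / W
    W = np.sqrt(scale / ratio)
    H = np.sqrt(scale * ratio)
    cx, cy = center
    # return the box as (x_min, y_min, x_max, y_max)
    return (cx - W / 2, cy - H / 2, cx + W / 2, cy + H / 2)

# e.g., a 32 x 32 basic box (area 1024) reshaped to aspect ratios 0.5, 1, 2
boxes = [generate_anchor((16, 16), 1024, k) for k in (0.5, 1.0, 2.0)]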

IoU (Intersection over Union)

$$
IoU = \frac{|A \cap B|}{|A \cup B|}
$$

It describes the degree of overlap between two boxes.
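
A straightforward implementation sketch for axis-aligned boxes, assuming the (x_min, y_min, x_max, y_max) format:

def iou(box_a, box_b):
    # intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    # union = area of A + area of B - intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)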

Process

The basic intuition about how YOLO processes the data is as follows: first divide the picture into $n \times n$ grids (typically $13 \times 13$, $26 \times 26$, and $52 \times 52$), then assign to each ground truth box the anchor box that has the highest $IoU$ value with respect to it.

In detail, for each anchor box we predict a set of values $(P_{object}, t_x, t_y, t_w, t_h, P_{class_1}, P_{class_2}, \dots)$, where $P_{object}$ and $P_{class_n}$ are the probabilities that the anchor box contains an object and that the object belongs to class $n$, respectively. The reason $t_x, t_y, t_w, t_h$ exist is that the anchor boxes are originally placed with fixed centers and sizes, so we need to adjust them slightly.

Then how do we predict $(t_x, t_y, t_w, t_h)$? We do regression.

For $t_x, t_y$,
$$
\begin{aligned}
b_x &= c_x + \sigma(t_x) \\
b_y &= c_y + \sigma(t_y)
\end{aligned}
$$
where $b_x, b_y$ denote the center coordinates of the ground truth box, and $c_x, c_y$ indicate which grid cell we are in; e.g., if the center of the ground truth box lies in the $5$-th grid cell of a certain row, then $c_x$ should be $4$ (counting from zero). The sigmoid function has range $(0, 1)$, so $c_x + \sigma(t_x)$ invariably lies between $c_x$ and $c_x + 1$, which meets the requirement.

For $t_w, t_h$,
$$
\begin{aligned}
b_w &= c_w e^{t_w} \\
b_h &= c_h e^{t_h}
\end{aligned}
$$
where $c_w, c_h$ are the width and height of the anchor box. This form is used because it is easier to regress a ratio than the true values of $w, h$, and the exponential $e^{t_w}$ keeps the ratio positive.

Then it is clear that we only need to fill in
$$
\begin{aligned}
d^*_x &= \sigma(t_x) = b_x - c_x \\
d^*_y &= \sigma(t_y) = b_y - c_y \\
t^*_w &= \ln{\frac{b_w}{c_w}} \\
t^*_h &= \ln{\frac{b_h}{c_h}}
\end{aligned}
$$
into the matrices as the targets to train against.
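
A small sketch of both directions under the formulas above (variable names are mine; cw and ch denote the anchor box's width and height):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode(tx, ty, tw, th, cx, cy, cw, ch):
    # network outputs -> predicted box (center x/y, width, height)
    bx = cx + sigmoid(tx)
    by = cy + sigmoid(ty)
    bw = cw * np.exp(tw)
    bh = ch * np.exp(th)
    return bx, by, bw, bh

def encode(bx, by, bw, bh, cx, cy, cw, ch):
    # ground truth box -> regression targets
    dx_star = bx - cx          # equals sigma(t_x*)
    dy_star = by - cy          # equals sigma(t_y*)
    tw_star = np.log(bw / cw)
    th_star = np.log(bh / ch)
    return dx_star, dy_star, tw_star, th_star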

YOLOv3 uses the network $Darknet53$ to train. The size of the output layer (for the $13 \times 13$ scale) should be $13 \times 13 \times \text{batchsize} \times (1 + 1 + 1 + 1 + 1 + \text{number of classes})$, where the $(1 + 1 + 1 + 1 + 1)$ entries hold $(P_{object}, x, y, w, h)$.

It uses binary cross-entropy as the loss function.

NLP (Natural Language Processing)

Word Embedding

The conversion from words to high-dimensional vectors. For instance, the word king can be transformed into the vector $[0.3, -0.35, \dots, 0.7]$ (the entries are real numbers, typically small in magnitude).

Actually, these transformed vectors can represent information about the words to some extent. What we do next is calculate distances between vectors to find connections between words; e.g., the words mother and son are generally expected to be near each other in the high-dimensional coordinate system.
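
A common similarity measure here is cosine similarity; a minimal sketch (the example vectors are made up):

import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|); values near 1 mean the words are similar
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

mother = np.array([0.61, -0.20, 0.44])
son = np.array([0.55, -0.15, 0.50])
print(cosine_similarity(mother, son))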

One-Hot Encoding

Suppose we have a dictionary with $5000$ words; then what one-hot encoding generates is a $1 \times 5000$ vector with a single one and $4999$ zeros. For example, if the word king is the $\text{3rd}$ word in the dictionary, then its one-hot encoding is $[0, 0, 1, 0, \dots, 0]$.

Conversion from words to vectors

one-hot encoding $\times$ embedding lookup $=$ output vector

where embedding lookup denotes an $n \times m$ matrix (suppose we have $n$ words in the dictionary and each word is converted to an $m$-dimensional vector). For example, here is an embedding lookup table:

| word  | dim 1 | dim 2 | dim 3 |
| ----- | ----- | ----- | ----- |
| king  | 0.3   | 0.5   | -0.1  |
| queen | 0.2   | 0.45  | -0.05 |
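
To see the multiplication at work, a short sketch (the lookup values are taken from the table above):

import numpy as np

# embedding lookup: n = 2 words, m = 3 dimensions
lookup = np.array([[0.3, 0.5, -0.1],     # king
                   [0.2, 0.45, -0.05]])  # queen

one_hot = np.array([1, 0])  # one-hot encoding of 'king'
print(one_hot @ lookup)     # [ 0.3  0.5 -0.1] -> the row for 'king'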

CBOW & Skip-gram

CBOW and Skip-gram are two typical models based on the word2vec algorithm.

4.4.1 CBOW

[Figure: a basic CBOW network]

The figure above shows a basic network for CBOW, which is similar to the one used in Skip-gram; we will illustrate it in detail in the Skip-gram section.

4.4.2 Skip-gram