Convolutional Neural Network

A brief introduction to convolutional neural networks, related architectures, and computer vision practice


Computer Vision Problems

1. Image Classification

e.g. given a 64*64*3 image, say whether it's an image of a cat

2. Object Detection

e.g. given an image, detect the objects (cars, pedestrians and motorcycles) in this image

3. Neural Style Transfer


Deep Learning on Large Images

1. Standard Neural Network

For a standard neural network (with all fully-connected layers), a 1000*1000*3 input image means 3 million input features; with 1000 units in the first hidden layer, W[1] has shape (1000, 3 million), i.e. 3 billion parameters

2. Convolutional Neural Network

Only the parameters in each filter (kernel) need to be trained, and the number of parameters is not affected by the size of the input image
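The contrast above can be sketched in a few lines of Python (the 1000-unit hidden layer matches the example; the 3*3 conv layer with 64 filters below is an illustrative assumption):

```python
# Fully-connected first layer vs. a conv layer, in raw parameter counts.
fc_inputs = 1000 * 1000 * 3          # 3 million input features
fc_hidden = 1000                     # 1000 units in the first hidden layer
fc_params = fc_inputs * fc_hidden    # weights alone: 3 billion

f, channels, num_filters = 3, 3, 64  # illustrative 3x3 conv layer
conv_params = (f * f * channels + 1) * num_filters  # +1 is each filter's bias

print(fc_params)    # 3000000000
print(conv_params)  # 1792
```

Note that `conv_params` stays 1792 no matter how large the input image grows, which is the point of the section above.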


Convolutional Neural Network

1. Some building blocks

Edge Detection

  • Standard NN (given all pixels of the input image): early layers may detect edges, then some later layers may detect parts of the object, then even later layers detect complete objects
  • Convolutional NN (given the original input image):
    • detect vertical edges
    • detect horizontal edges
Edges Detector
  1. construct filters (3*3, 5*5 or 1*1 matrices; the filter size is usually odd)
  2. operation: convolution, denoted by ‘*’ in deep learning
  3. one convolution operation: lay the filter over a patch of the input image, take the element-wise product, and sum all the resulting numbers; that sum is one number of the output
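The steps above can be sketched in NumPy; this is a minimal stride-1, no-padding version for a square grayscale image, and `conv2d_valid` is a hypothetical name, not a library function:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the filter over the image; at each position take the
    element-wise product with the covered patch and sum the result."""
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + f, j:j + f]     # patch covered by the filter
            out[i, j] = np.sum(patch * kernel)  # element-wise product, then sum
    return out

# A 6x6 image convolved with a 3x3 filter gives a 4x4 output.
out = conv2d_valid(np.arange(36.0).reshape(6, 6), np.ones((3, 3)))
print(out.shape)  # (4, 4)
```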
# Convolution Operation:
# python
conv_forward
# tensorflow
tf.nn.conv2d
# Keras
Conv2D
Vertical Edge Detection
# filter
|1 0 -1|
|1 0 -1|
|1 0 -1|
Horizontal Edge Detection
# filter
| 1 1 1|
| 0 0 0|
|-1 -1 -1|
Why can these filters detect vertical/horizontal edges?

The vertical edge filter has positive values in its left column and negative values in its right column, so it goes from bright to dark, left to right. (In grayscale images, the larger the number, the brighter the pixel.)
When the filter slides over a region where bright pixels sit to the left of dark pixels, i.e. a vertical edge, the element-wise products add up to a large value, so vertical edges show up strongly in the output.
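A small NumPy demonstration of this effect, using an illustrative 6*6 image that is bright (10) on the left half and dark (0) on the right half:

```python
import numpy as np

# 6x6 grayscale image: bright left half, dark right half -> vertical edge.
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)

kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)

out = np.zeros((4, 4))
for i in range(4):
    for j in range(4):
        out[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)

print(out)
# Each output row is [0, 30, 30, 0]: the uniform left/right regions give 0,
# while the columns straddling the edge give a strong response of 30.
```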

Some other filters
# Sobel filter
|1 0 -1|
|2 0 -2|
|1 0 -1|

# Scharr filter
| 3 0 -3|
|10 0 -10|
| 3 0 -3|

Padding

Using convolutions has two downsides (Why we need padding):

  1. every time a convolution is applied, the width and height of the image shrink
  2. pixels on the corners and edges contribute to the output much less than central pixels

What’s padding:
Before using convolutions, pad the image with an additional border (usually zeros). (e.g. original input size: 6*6 -> 8*8 after padding with p = 1)

Valid Padding

means ‘no padding’

Same Padding

means ‘pad the input image so that the size of the output equals the size of the original input’
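With stride 1 the output size is n + 2p - f + 1, so setting it equal to n gives p = (f - 1) / 2; this is an integer only for odd filter sizes, which is one reason filters are usually odd. A one-line sketch (hypothetical helper name `same_padding`):

```python
def same_padding(f):
    """Padding p that keeps output size == input size at stride 1.
    Solving n + 2p - f + 1 == n gives p = (f - 1) / 2."""
    assert f % 2 == 1, "same padding needs an odd filter size"
    return (f - 1) // 2

print(same_padding(3))  # 1
print(same_padding(5))  # 2
```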

Stride

Default case: the filter moves 1 step over the input image after every calculation (both vertically and horizontally)
With stride: the filter moves s steps over the input image after every calculation (both vertically and horizontally)

Calculate the size of output

filter size: f * f
padding: p
stride: s
input image size: n * n
output image size: floor((n - f + 2p) / s) + 1
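The formula can be wrapped in a small helper (`conv_output_size` is a hypothetical name) and checked against the earlier examples:

```python
import math

def conv_output_size(n, f, p=0, s=1):
    """Output width/height of a convolution: floor((n + 2p - f) / s) + 1."""
    return math.floor((n + 2 * p - f) / s) + 1

print(conv_output_size(6, 3))            # valid padding: 6x6 * 3x3 -> 4x4
print(conv_output_size(6, 3, p=1))       # same padding: stays 6x6
print(conv_output_size(7, 3, p=0, s=2))  # stride 2: 7x7 -> 3x3
```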

Convolutions over Volume

Single filter

Convolutions on RGB images:
input image shape: n_h * n_w * 3(number of channels)
filter shape: f * f * 3(number of channels)
convolution operation: the element-wise product goes from a matrix to a cube; each channel of the filter convolves with the corresponding channel of the input, and each output number is the sum over all channels
output shape: n_h’ * n_w’ (a single filter produces a single channel)

Multiple filters

Convolutions on RGB images:
input image shape: n_h * n_w * 3(number of channels)
filter shape: f * f * 3(number of channels), with n_c’ such filters
convolution operation: each filter convolves with the input image and generates one channel of the output, so n_c’ filters produce an output with n_c’ channels
output shape: n_h’ * n_w’ * n_c’
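A shape-level NumPy sketch of multi-filter convolution, with illustrative sizes (6*6*3 input, two 3*3*3 filters):

```python
import numpy as np

n_h = n_w = 6   # input height/width
n_c = 3         # input channels (RGB)
f = 3           # filter size
n_c_out = 2     # number of filters

image = np.random.rand(n_h, n_w, n_c)
filters = np.random.rand(n_c_out, f, f, n_c)  # one f x f x n_c cube per filter

out = np.zeros((n_h - f + 1, n_w - f + 1, n_c_out))
for k in range(n_c_out):                   # each filter -> one output channel
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + f, j:j + f, :]  # cube covering all channels
            out[i, j, k] = np.sum(patch * filters[k])

print(out.shape)  # (4, 4, 2): n_h' x n_w' x n_c'
```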

Convolutional Layers

After convolving the input image with a set of filters, we add a ‘bias’ to each channel (one per filter) and apply an activation function (like ReLU) to each of them.
Then we stack them together to get the output of a complete convolutional layer.

A complete convolutional layer

  1. input image(on behalf of ‘a[i-1]’)
  2. a set of filters, each filter has the same volume with input (on behalf of ‘w[i]’ )
  3. bias(constant number, on behalf of ‘b[i]’)
  4. activation function (like ReLU, on behalf of ‘g(z[i])’, where z[i] = w[i] * a[i-1] + b[i])
  5. output image(‘a[i]’, which is g(z[i]))
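Steps 1-5 can be sketched in NumPy (valid convolution, stride 1; `conv_layer_forward` is a hypothetical name, not the course's `conv_forward`):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def conv_layer_forward(a_prev, W, b):
    """One conv layer: z[i] = conv(a_prev, W) + b, a[i] = g(z[i]).
    a_prev: (n, n, n_c) input; W: (m, f, f, n_c) filters; b: (m,) biases."""
    m, f = W.shape[0], W.shape[1]
    n_out = a_prev.shape[0] - f + 1            # valid convolution, stride 1
    z = np.zeros((n_out, n_out, m))
    for k in range(m):
        for i in range(n_out):
            for j in range(n_out):
                z[i, j, k] = np.sum(a_prev[i:i + f, j:j + f, :] * W[k]) + b[k]
    return relu(z)                             # stack of m activated channels

a = conv_layer_forward(np.ones((5, 5, 3)), np.ones((2, 3, 3, 3)),
                       np.array([1.0, -100.0]))
print(a.shape)  # (3, 3, 2)
```

With all-ones input and filters, each pre-activation is 27 + b[k]; the second filter's large negative bias drives its whole channel to 0 through ReLU.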

Number of parameters in a convolution layer

  1. params in each filter: f * f * channels
  2. bias for each filter: 1
  3. total for m filters in a layer: (f * f * channels + 1) * m

Feature Maps

Feature maps are the outputs of each filter. They describe some characteristic of the input image.

In each conv layer, the convolutional neural network extracts some characteristics and creates feature maps.

  • Shallow layers’ feature maps focus on details (like edges).
  • Deep layers’ feature maps focus on overall structure (like faces or whole objects).

Feature maps in use:

  • Object Detection: YOLO, Fast R-CNN (detect bounding boxes on feature maps)
  • Semantic Segmentation: UNet, DeepLab (assign a category to every pixel through feature maps)
  • Neural Style Transfer: compute a ‘Gram matrix’ from intermediate layers’ feature maps

Pooling Layers

Pooling layers don’t have parameters to learn; instead, they use ‘hyperparameters’ such as the filter size f and the stride s.

Max Pooling

take the largest value in each filter region of the input image

Average Pooling

calculate the average of the values in each filter region of the input image
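Both pooling types can be sketched with one NumPy helper (hypothetical name `pool2d`, defaulting to the common hyperparameters f = 2, s = 2):

```python
import numpy as np

def pool2d(image, f=2, s=2, mode="max"):
    """Max or average pooling with filter size f and stride s.
    Note there are no parameters to learn, only hyperparameters."""
    n_out = (image.shape[0] - f) // s + 1
    out = np.zeros((n_out, n_out))
    for i in range(n_out):
        for j in range(n_out):
            patch = image[i * s:i * s + f, j * s:j * s + f]
            out[i, j] = patch.max() if mode == "max" else patch.mean()
    return out

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 2],
              [7, 8, 9, 4],
              [3, 2, 1, 0]], dtype=float)
print(pool2d(x, mode="max"))  # [[6. 5.] [8. 9.]]
print(pool2d(x, mode="avg"))  # [[3.5 2.5] [5.  3.5]]
```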

Fully-Connected Layers

Like normal layers in standard neural networks, fully-connected layers are usually used after several conv layers in conv networks.

Convolutional Neural Network

Here is a typical CNN structure:

input image
|
| conv layer: filters, padding, stride
v
intermediate output image (typically with width and height reduced and channels expanded)
|
| conv layer: filters, padding, stride
v
intermediate output image (typically with width and height reduced and channels expanded)
|
| fully-connected layer: some neurons
v
intermediate output vector
|
| fully-connected layer: output layer with softmax activation
v
classification result: e.g. say whether it's a cat or not
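The "width and height reduced, channels expanded" pattern can be traced with the output-size formula; the 64*64*3 input and the filter counts below are illustrative assumptions, not values from the text:

```python
def conv_out(n, f, p=0, s=1):
    """floor((n + 2p - f) / s) + 1, the conv output-size formula."""
    return (n + 2 * p - f) // s + 1

n, c = 64, 3                      # input image: 64 x 64 x 3
n, c = conv_out(n, f=5), 8        # conv layer 1: eight 5x5 filters -> 60 x 60 x 8
n, c = conv_out(n, f=5, s=2), 16  # conv layer 2: sixteen 5x5 filters, stride 2 -> 28 x 28 x 16
flat = n * n * c                  # flatten before the fully-connected layers
print(n, c, flat)                 # 28 16 12544
```

Width and height shrink at each conv layer while the channel count grows, and the final volume is flattened into a vector for the fully-connected layers.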

Advantages of Conv Nets

Parameter Sharing

A feature detector that is useful in one part of the image is probably useful in another part of the image.

Sparsity of Connections

In each layer, each output value only depends on a small number of inputs.
