Convolutional Neural Network
A brief introduction to convolutional neural networks, related architectures, and computer vision practice
Computer Vision Problems
1. Image Classification
e.g. given a 64x64x3 image, decide whether it is a picture of a cat
2. Object Detection
e.g. given an image, detect the objects (cars, pedestrians, and motorcycles) in it
3. Neural Style Transfer
Deep Learning on Large Images
1. Standard Neural Network
For a standard neural network (with all fully-connected layers), a 1000x1000x3 input image flattens to 3 million input features, so a first hidden layer of 1000 units needs W[1] of shape (1000, 3 million) = 3 billion parameters
2. Convolutional Neural Network
Only the parameters in each filter (kernel) need to be trained, and the number of parameters is not affected by the size of the input image
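The comparison above can be checked with quick arithmetic (the layer sizes are the ones from the example; the 5x5x3 conv filters are an illustrative choice):

```python
# Fully-connected: a 1000x1000x3 image flattened to 3 million inputs,
# feeding a hidden layer of 1000 units.
fc_inputs = 1000 * 1000 * 3          # 3,000,000 input features
fc_params = fc_inputs * 1000         # weights in W[1], ignoring biases

# Convolutional: e.g. 64 filters of size 5x5 over 3 channels, one bias each.
conv_params = (5 * 5 * 3 + 1) * 64

print(fc_params)    # 3000000000 -> 3 billion
print(conv_params)  # 4864
```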
Convolutional Neural Network
1. Some building blocks
Edge Detection
- Standard NN (given all pixels of the input image): early layers may detect edges, later layers may detect parts of objects, and even later layers detect complete objects
- Convolutional NN (given the original input image):
- detect vertical edges
- detect horizontal edges
Edge Detector
- construct filters (3x3, 5x5, or 1x1 matrices; filter sizes are usually odd)
- operation: convolution, represented by ‘*’ in deep learning
- one convolution step: place the filter on a patch of the input image, take the element-wise product, and sum all the resulting numbers; that gives one number of the output
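The convolution step described above can be sketched in plain Python (a minimal sketch; the helper name `convolve2d` is illustrative, and, as in deep learning convention, the kernel is not flipped):

```python
def convolve2d(image, kernel):
    """Valid 2-D convolution: slide the kernel over the image, take the
    element-wise product with the covered patch, and sum it to produce
    one number of the output."""
    n, f = len(image), len(kernel)
    out_size = n - f + 1
    output = [[0] * out_size for _ in range(out_size)]
    for i in range(out_size):
        for j in range(out_size):
            output[i][j] = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(f)
                for dj in range(f)
            )
    return output

# A 6x6 image (bright left half, dark right half) convolved with a
# 3x3 vertical-edge filter gives a 4x4 output.
image = [[1, 1, 1, 0, 0, 0]] * 6
kernel = [[1, 0, -1]] * 3
result = convolve2d(image, kernel)
print(len(result), len(result[0]))  # 4 4
print(result[0])                    # [0, 3, 3, 0] -> edge detected in the middle
```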
Vertical Edge Detection
```
# vertical edge detection filter
 1  0  -1
 1  0  -1
 1  0  -1
```
Horizontal Edge Detection
```
# horizontal edge detection filter
 1   1   1
 0   0   0
-1  -1  -1
```
Why can these filters detect vertical/horizontal edges?
As for the vertical edge filter, its values go from bright (positive) on the left to dark (negative) on the right. (In grayscale images, the larger the number, the brighter the pixel.)
When this filter slides over a region where the image is bright on the left and dark on the right, the element-wise products sum to a large positive number, so the vertical edge shows up as a bright band in the output.
Some other filters
```
# Sobel filter (vertical edges; weights the middle row more heavily)
 1  0  -1
 2  0  -2
 1  0  -1
```
Padding
Using convolutions has two downsides (Why we need padding):
- every time a convolution is applied, the width and height of the image shrink
- pixels at the corners and edges are used much less often in the output
What’s padding:
Before using convolutions, pad the image with an additional border. (e.g. original input image size: 6x6 -> padded to 8x8)
Valid Padding
means ‘no padding’
Same Padding
means ‘pad the input image to make the size of the output == size of the original input’
Stride
Normal case: the filter moves 1 step across the input image after every calculation (both vertically and horizontally)
With stride s: the filter moves s steps across the input image after every calculation (both vertically and horizontally)
Calculate the size of output
filter size: f * f
padding: p
stride: s
input image size: n * n
output image size: floor((n - f + 2p) / s) + 1
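The formula can be checked directly (a minimal helper; the function name is illustrative):

```python
from math import floor

def conv_output_size(n, f, p=0, s=1):
    """Output width/height of a convolution: floor((n - f + 2p) / s) + 1."""
    return floor((n - f + 2 * p) / s) + 1

# 7x7 image, 3x3 filter, no padding, stride 2 -> 3x3 output
print(conv_output_size(7, 3, p=0, s=2))  # 3
# 6x6 image, 3x3 filter, same padding (p=1), stride 1 -> 6x6 output
print(conv_output_size(6, 3, p=1, s=1))  # 6
```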
Convolutions over Volume
Single filter
Convolutions on RGB images:
input image shape: n_h * n_w * 3(number of channels)
filter shape: f * f * 3 (must match the input's channel count)
convolution operation: extends from 2-D element-wise products to 3-D; each channel of the input is convolved with the corresponding channel of the filter, and every output number is the sum across all channels
output shape: n_h’ * n_w’ (a single channel)
Multiple filters
Convolutions on RGB images:
input image shape: n_h * n_w * 3(number of channels)
filter shape: f * f * 3 (number of channels) with n_c’ filters
convolution operation: each filter convolves with the input image and generates one channel of the output, so n_c’ filters produce an output with n_c’ channels
output shape: n_h’ * n_w’ * n_c’
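The shape bookkeeping above can be sketched as a small helper (the function name is illustrative):

```python
def conv_volume_output_shape(n_h, n_w, n_c, f, n_filters, p=0, s=1):
    """Shape after convolving an n_h x n_w x n_c input with n_filters
    filters of shape f x f x n_c (filter channels must match the input)."""
    out_h = (n_h - f + 2 * p) // s + 1
    out_w = (n_w - f + 2 * p) // s + 1
    return (out_h, out_w, n_filters)   # one output channel per filter

# 6x6x3 RGB image, two 3x3x3 filters -> 4x4x2 output
print(conv_volume_output_shape(6, 6, 3, f=3, n_filters=2))  # (4, 4, 2)
```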
Convolutional Layers
After convolving the input image with a set of filters, we add a bias to each channel (one bias per filter) and apply an activation function (like ReLU) to each of them.
Then we stack the channels together to get the output of a complete convolutional layer.
A complete convolutional layer
- input image (plays the role of a[i-1])
- a set of filters, each with the same number of channels as the input (plays the role of W[i])
- bias (one constant per filter, plays the role of b[i])
- activation function (like ReLU), giving g(z[i]), where z[i] = W[i] * a[i-1] + b[i]
- output image: a[i] = g(z[i])
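Putting the pieces together, one forward pass through a conv layer can be sketched in plain Python (a minimal sketch with stride 1 and no padding; the function name and nested-list representation are illustrative):

```python
def conv_layer_forward(a_prev, filters, biases):
    """One convolutional layer: z = conv(a_prev, W) + b, then a = ReLU(z).
    a_prev : H x W x C input volume (nested lists)
    filters: list of f x f x C filters (channels match the input)
    biases : one number per filter
    Returns the H' x W' x n_filters output volume."""
    H, W, C = len(a_prev), len(a_prev[0]), len(a_prev[0][0])
    f = len(filters[0])
    out_h, out_w = H - f + 1, W - f + 1
    output = [[[0.0] * len(filters) for _ in range(out_w)] for _ in range(out_h)]
    for k, (filt, b) in enumerate(zip(filters, biases)):
        for i in range(out_h):
            for j in range(out_w):
                z = b + sum(
                    a_prev[i + di][j + dj][c] * filt[di][dj][c]
                    for di in range(f) for dj in range(f) for c in range(C)
                )
                output[i][j][k] = max(0.0, z)   # ReLU activation
    return output

# 3x3x1 input of ones, one 2x2x1 filter of ones, bias -3:
# each window sums to 4, z = 4 - 3 = 1, ReLU keeps it.
a_prev = [[[1.0] for _ in range(3)] for _ in range(3)]
filt = [[[1.0], [1.0]], [[1.0], [1.0]]]
out = conv_layer_forward(a_prev, [filt], [-3.0])
print(out[0][0][0])  # 1.0
```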
Number of parameters in a convolution layer
- params in each filter: f * f * channels
- bias for each filter: 1
- total for m filters in a layer: (f * f * channels + 1) * m
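As a quick check of the formula (the function name is illustrative):

```python
def conv_layer_params(f, channels, m):
    """(f*f*channels weights + 1 bias) per filter, times m filters."""
    return (f * f * channels + 1) * m

# e.g. ten 3x3 filters over an RGB (3-channel) input:
print(conv_layer_params(3, 3, 10))  # 280
```

Note how the count depends only on the filter shape and the number of filters, never on the input image size, which is the parameter-sharing advantage mentioned earlier.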
Feature Maps
Feature maps are the outputs of each filter; they describe some characteristic of the input image.
In each conv layer, the convolutional neural network extracts certain features and produces feature maps.
- Shallow layers’ feature maps capture details (like edges).
- Deep layers’ feature maps capture overall structure (like faces or whole objects).
Feature maps in use:
- Object Detection: YOLO, Fast R-CNN (predict bounding boxes on feature maps)
- Semantic Segmentation: UNet, DeepLab (classify at the pixel level through feature maps)
- Neural Style Transfer: compute the ‘Gram matrix’ from intermediate layers’ feature maps
Pooling Layers
Pooling layers have no parameters to learn; instead, they use hyperparameters such as filter size f and stride s.
Max Pooling
take the largest value in each filter window of the input image
Average Pooling
take the average of the values in each filter window of the input image
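Max pooling can be sketched in plain Python (a minimal sketch with the common choice f = 2, s = 2; the function name is illustrative):

```python
def max_pool(image, f=2, s=2):
    """Max pooling: take the largest value in each f x f window,
    moving the window s steps at a time. No learned parameters."""
    n = len(image)
    out_size = (n - f) // s + 1
    return [
        [
            max(image[i * s + di][j * s + dj]
                for di in range(f) for dj in range(f))
            for j in range(out_size)
        ]
        for i in range(out_size)
    ]

image = [[1, 3, 2, 1],
         [4, 6, 6, 8],
         [3, 1, 1, 0],
         [1, 2, 2, 4]]
print(max_pool(image))  # [[6, 8], [3, 4]]
```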
Fully-Connected Layers
Like normal layers in standard neural networks, fully-connected layers are usually used after several conv layers in conv networks.
Convolutional Neural Network
Here is a typical CNN structure:

```
input image
  -> [conv layer -> pooling layer] * N    (several conv/pool stages)
  -> flatten
  -> fully-connected layers
  -> softmax output
```
Advantages of Conv Nets
Parameter Sharing
A feature detector that is useful in one part of the image is probably useful in another part of the image.
Sparsity of Connections
In each layer, each output value only depends on a small number of inputs.