Image Classifier

This note covers:

  1. Classic networks like LeNet-5, AlexNet, and VGG;
  2. Architectures like ResNet and the Inception network that improve the performance of CNNs;
  3. MobileNets, which let mobile devices run classifier apps;
  4. Transfer learning and data augmentation to get your system started faster and make your classifier more robust.

Classic Networks

LeNet-5


Used

Classify handwritten digits

Trained

Grayscale images (32 * 32 * 1)

Params

60k

Paper

Gradient-based learning applied to document recognition (part II)

Feature

  1. as you go deeper into the network, n_H and n_W go down while n_C goes up
  2. structure: conv-pool-conv-pool-fc-fc-output
  3. uses average pooling layers, and sigmoid/tanh rather than ReLU activations in the hidden layers
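
A minimal Keras sketch of a LeNet-5-style network (the layer sizes follow the table above, but details such as padding and the softmax output layer are simplifying assumptions):

from tensorflow import keras
from tensorflow.keras import layers

# LeNet-5-style network: conv-pool-conv-pool-fc-fc-output,
# with average pooling and tanh activations in the hidden layers.
lenet5 = keras.Sequential([
    keras.Input(shape=(32, 32, 1)),                       # grayscale 32 * 32 * 1
    layers.Conv2D(6, kernel_size=5, activation="tanh"),   # 28 * 28 * 6
    layers.AveragePooling2D(pool_size=2),                 # 14 * 14 * 6
    layers.Conv2D(16, kernel_size=5, activation="tanh"),  # 10 * 10 * 16
    layers.AveragePooling2D(pool_size=2),                 # 5 * 5 * 16
    layers.Flatten(),
    layers.Dense(120, activation="tanh"),
    layers.Dense(84, activation="tanh"),
    layers.Dense(10, activation="softmax"),               # 10 digit classes
])
lenet5.summary()                                          # roughly 60k parameters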

AlexNet


Used

Image Recognition

Trained

RGB images (227 * 227 * 3)

Params

60m

Paper

ImageNet classification with deep convolutional neural networks

Feature

  1. uses the ReLU activation function
  2. trained on multiple GPUs (the GPUs communicate with each other)
  3. Local Response Normalization (pick a position in the image and normalize the values at that position across all channels)

VGG - 16


Used

Image recognition

Trained

RGB images (224 * 224 * 3)

Params

138m

Paper

Very deep convolutional networks for large-scale image recognition

Feature

  1. fixed filters: CONV = 3 * 3 filter, s = 1, same convolution; Max-Pool = 2 * 2, s = 2
  2. n_H and n_W are halved (after every pool layer) while n_C doubles (from one conv block to the next)
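
A minimal sketch of the repeating VGG pattern (the helper name vgg_block is illustrative; the block/channel layout below follows the standard 16-layer configuration):

from tensorflow import keras
from tensorflow.keras import layers

def vgg_block(x, n_convs, n_filters):
    # Fixed filters: 3 * 3 conv, stride 1, same padding; then 2 * 2 max-pool, stride 2.
    for _ in range(n_convs):
        x = layers.Conv2D(n_filters, 3, strides=1, padding="same", activation="relu")(x)
    return layers.MaxPooling2D(pool_size=2, strides=2)(x)

inputs = keras.Input(shape=(224, 224, 3))      # RGB 224 * 224 * 3
x = vgg_block(inputs, 2, 64)                   # 112 * 112 * 64
x = vgg_block(x, 2, 128)                       # 56 * 56 * 128
x = vgg_block(x, 3, 256)                       # 28 * 28 * 256
x = vgg_block(x, 3, 512)                       # 14 * 14 * 512
x = vgg_block(x, 3, 512)                       # 7 * 7 * 512
x = layers.Flatten()(x)
x = layers.Dense(4096, activation="relu")(x)
x = layers.Dense(4096, activation="relu")(x)
outputs = layers.Dense(1000, activation="softmax")(x)
vgg16 = keras.Model(inputs, outputs)           # roughly 138m parameters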

Residual Network


For very deep networks, there are usually problems such as vanishing and exploding gradients.

With ResNets, we can train very deep networks.

Residual Block

  1. A residual block contains some extra layers plus a skip connection (in the example below, 2 layers and 1 skip connection).
  2. A residual network is a neural network that contains multiple residual blocks.
a[l] ---> Linear ---> ReLU ---> a[l+1] ---> Linear ---(+)---> ReLU ---> a[l+2]   (main path)
 |                                                     ^
 |                                                     |  pass a[l] here, before the ReLU
 ----------------- skip connection / shortcut ---------

Before: a[l+2] = g(z[l+2])
After Adding Residual Block: a[l+2] = g(z[l+2] + a[l])
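
A minimal Keras sketch of a residual block with an identity skip connection (the convolutional form with 'same' padding; the filter count and input shape are illustrative assumptions):

from tensorflow import keras
from tensorflow.keras import layers

def residual_block(x, n_filters):
    shortcut = x                                        # a[l]
    # Main path: two 'same' convolutions so z[l+2] keeps the shape of a[l].
    x = layers.Conv2D(n_filters, 3, padding="same")(x)
    x = layers.Activation("relu")(x)                    # a[l+1]
    x = layers.Conv2D(n_filters, 3, padding="same")(x)  # z[l+2]
    x = layers.Add()([x, shortcut])                     # z[l+2] + a[l]
    return layers.Activation("relu")(x)                 # a[l+2] = g(z[l+2] + a[l])

inputs = keras.Input(shape=(56, 56, 64))
outputs = residual_block(inputs, 64)                    # n_filters must match the input channels here
block = keras.Model(inputs, outputs)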

Paper

Deep residual learning for image recognition

Features

  1. compared to plain networks, ResNets keep the training error reasonable even when we have many layers.
  2. the identity function is easy for a residual block to learn (that's why adding more layers to the network doesn't hurt performance).
  3. ResNets usually use 'same' convolutions so that a[l] and z[l+2] have the same dimensions (if not, multiply a[l] by an extra matrix Ws).
  4. we can turn a plain network into a residual network by adding skip connections to form residual blocks.

1 * 1 Convolutions (Network in Network)


A 1 * 1 convolution is like having a fully-connected network applied separately to each spatial position across the input channels.

Paper

Network in network

Features

  1. use 1 * 1 convolutions to shrink the number of channels (the depth of the volume)
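
A minimal sketch of using a 1 * 1 convolution to shrink the channel dimension (the 28 * 28 * 192 input and 32 output filters are illustrative assumptions):

from tensorflow import keras
from tensorflow.keras import layers

# Shrink a 28 * 28 * 192 volume to 28 * 28 * 32 with a 1 * 1 convolution:
# each output position is a fully-connected combination of the 192 input channels.
inputs = keras.Input(shape=(28, 28, 192))
outputs = layers.Conv2D(32, kernel_size=1, activation="relu")(inputs)
model = keras.Model(inputs, outputs)
print(model.output_shape)   # (None, 28, 28, 32)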

Inception Network (‘Google Net’: Inception V1)


Generally speaking: instead of having to pick a single filter size or pooling operation, we apply them all and concatenate the outputs, letting the network learn whichever parameters it wants to use.

Inception Network: a neural network that stacks many inception modules together.

Paper

Going deeper with convolutions

Inception Module (inception blocks)

               |----------------------------------> 1 * 1 CONV ------|
               |                                                     |
Previous       |------> 1 * 1 CONV --------------> 3 * 3 CONV ------|    Channel
Activation --->|                                                     |--> Concat
               |------> 1 * 1 CONV --------------> 5 * 5 CONV ------|
               |                                                     |
               |------> MaxPool (same padding) ---> 1 * 1 CONV ------|

Feature

  1. Computational cost: use a 1 * 1 convolution to shrink the number of channels first, then do the regular convolutions.
  2. Has several side branches that make predictions like the output layer (ending with a softmax); this has a regularizing effect and reduces overfitting.
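
A minimal sketch of one inception module (the filter counts are illustrative assumptions, not the exact GoogLeNet configuration):

from tensorflow import keras
from tensorflow.keras import layers

def inception_module(x):
    # Four parallel branches; 1 * 1 convolutions keep the computational cost down.
    b1 = layers.Conv2D(64, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(96, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(128, 3, padding="same", activation="relu")(b2)
    b3 = layers.Conv2D(16, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(32, 5, padding="same", activation="relu")(b3)
    b4 = layers.MaxPooling2D(pool_size=3, strides=1, padding="same")(x)
    b4 = layers.Conv2D(32, 1, padding="same", activation="relu")(b4)
    return layers.Concatenate()([b1, b2, b3, b4])   # concatenate along the channel axis

inputs = keras.Input(shape=(28, 28, 192))
outputs = inception_module(inputs)                  # 28 * 28 * (64 + 128 + 32 + 32)
model = keras.Model(inputs, outputs)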

MobileNet V1


Depthwise Separable Convolution (Building Block of MobileNets)

Depthwise Convolution

n_c filters; each filter (size f * f * 1) convolves with one channel of the input, so the output has the same number of channels as the input

Pointwise Convolution (Projection)

n_c' filters; each filter (size 1 * 1 * n_c) convolves with the whole input volume, so the output has n_c' channels
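
A minimal Keras sketch of one depthwise separable convolution block (the filter size, filter count, and input shape are illustrative assumptions):

from tensorflow import keras
from tensorflow.keras import layers

def depthwise_separable_block(x, n_out_filters):
    # Depthwise convolution: one f * f * 1 filter per input channel (channel count unchanged).
    x = layers.DepthwiseConv2D(kernel_size=3, padding="same", activation="relu")(x)
    # Pointwise convolution (projection): n_c' filters of size 1 * 1 * n_c.
    return layers.Conv2D(n_out_filters, kernel_size=1, activation="relu")(x)

inputs = keras.Input(shape=(112, 112, 32))
outputs = depthwise_separable_block(inputs, 64)     # 112 * 112 * 64
model = keras.Model(inputs, outputs)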

Paper

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

Feature

  1. Low computational cost at deployment: a depthwise separable convolution usually costs about $\frac{1}{n_c'} + \frac{1}{f^2}$ times as much as a normal convolution, which is roughly 10 times cheaper (see the worked example after this list).
  2. Useful for mobile and embedded vision applications
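
A small worked example under assumed sizes (f = 3, n_c = n_c' = 512, a 14 * 14 output): a normal convolution costs about $3 \cdot 3 \cdot 512 \cdot 14 \cdot 14 \cdot 512 \approx 462$ million multiplications, while the depthwise step ($3 \cdot 3 \cdot 14 \cdot 14 \cdot 512 \approx 0.9$ million) plus the pointwise step ($512 \cdot 14 \cdot 14 \cdot 512 \approx 51.4$ million) cost about 52 million in total, a ratio of roughly $\frac{1}{512} + \frac{1}{9} \approx 0.11$.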

MobileNet V2


Paper

MobileNetV2: Inverted Residuals and Linear Bottlenecks

Bottleneck Block

          ------------------------------ Residual Connection ------------------------------
          |                                                                                |
          |       (channel expands)                       (channel shrinks)                v
---> n * n * 3 ----------------> n * n * 18 ----------------> n * n * 18 ----------------> n * n * 3 --->
                    ^                              ^                            ^
                    |                              |                            |
                1 * 1 * 3                      Depthwise                    1 * 1 * 18
                Expansion                 (same convolution)           Pointwise/Projection
               (18 filters)                  (18 filters)                  (3 filters)

Why use the bottleneck block?

  1. The 'Expansion' step lets the network learn a richer function by increasing the size of the representation (from n * n * 3 to n * n * 18).
  2. Memory is limited on mobile devices, so the bottleneck block uses the 'Pointwise/Projection' operation to shrink the representation before passing it to the next block (fewer values need to be kept in memory between blocks).
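
A minimal sketch of this bottleneck (inverted residual) block; the expansion factor of 6 and the linear, non-activated projection follow the V2 design, while the input shape and kernel size are illustrative assumptions:

from tensorflow import keras
from tensorflow.keras import layers

def bottleneck_block(x, expansion=6):
    n_c = x.shape[-1]                                   # input channels (3 in the figure above)
    shortcut = x
    # Expansion: 1 * 1 convolution grows the representation (n_c -> expansion * n_c).
    x = layers.Conv2D(expansion * n_c, 1, activation="relu")(x)
    # Depthwise 'same' convolution on the expanded representation.
    x = layers.DepthwiseConv2D(3, padding="same", activation="relu")(x)
    # Pointwise/Projection: 1 * 1 convolution shrinks back to n_c channels (no activation).
    x = layers.Conv2D(n_c, 1)(x)
    # Residual connection from the block input to the block output.
    return layers.Add()([x, shortcut])

inputs = keras.Input(shape=(56, 56, 3))
outputs = bottleneck_block(inputs)
model = keras.Model(inputs, outputs)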

EfficientNet


How can we automatically scale the size of the neural network up or down for different devices?

EfficientNet finds a good trade-off between image resolution, network depth, and layer width.

Paper

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks


Transfer Learning


How can I build my own classifier if I don't have much data? Use transfer learning!

In computer vision, transfer learning is something you should almost always do (unless you have an exceptionally large dataset).

Freeze Layers

When you have Little Data

  1. Download an open-source implementation of a neural network (with its weights), replace the softmax/output layer with your own layers, and freeze all the other layers.
  2. Train only your own 'softmax' and 'output' layers.

When you have More Data

Freeze fewer layers and train the later layers (either keep the downloaded weights as initialization and run gradient descent on them, or blow those layers away and create your own layers).

When you have A Lot of Data

Change the 'softmax' and 'output' layers, then train the whole network (using the downloaded weights as initialization).
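
A minimal Keras sketch of the little-data case above (the choice of MobileNetV2 as the pretrained base, the input size, and the 5 output classes are assumptions):

from tensorflow import keras
from tensorflow.keras import layers

# Download a pretrained network without its original softmax/output layer.
base = keras.applications.MobileNetV2(input_shape=(224, 224, 3),
                                      include_top=False, weights="imagenet")
base.trainable = False                               # freeze all the downloaded layers

# Add and train only your own classification head.
inputs = keras.Input(shape=(224, 224, 3))
x = base(inputs, training=False)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(5, activation="softmax")(x)   # e.g. 5 of your own classes
model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")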

Neat Trick to Speed up Training

Since the frozen layers never change during training, pre-compute the activations of the last frozen layer for all training examples and save the results to disk.

Then you only need to train a shallow softmax classifier on top of these saved features.

Why it's faster: you don't recompute the frozen layers every time you make a training pass over the data.
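
A minimal sketch of this pre-compute trick, continuing the transfer-learning example above ('base' is the frozen pretrained network from that sketch; x_train and y_train are placeholders for your own images and labels):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Run the frozen base once and cache the activations of its last layer.
features = base.predict(x_train)              # activations of the last frozen layer
np.save("frozen_features.npy", features)      # compute once, save to disk, reuse every epoch

# Train only a shallow classifier on the cached features.
head = keras.Sequential([
    keras.Input(shape=features.shape[1:]),
    layers.GlobalAveragePooling2D(),
    layers.Dense(5, activation="softmax"),
])
head.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
head.fit(features, y_train, epochs=10)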


Data Augmentation


Common augmentation methods (distortions)

  1. Mirroring
  2. Random Cropping
  3. Color shifting: add different distortions to the R, G, and B channels (e.g. the PCA color augmentation algorithm, based on Principal Component Analysis)
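
A minimal sketch of mirroring, random cropping, and simple color shifting with tf.image (the crop size and jitter ranges are illustrative assumptions; PCA color augmentation is not shown):

import tensorflow as tf

def augment(image):
    # Mirroring: random horizontal flip.
    image = tf.image.random_flip_left_right(image)
    # Random cropping: take a 224 * 224 patch from the (larger) input image.
    image = tf.image.random_crop(image, size=[224, 224, 3])
    # Simple color shifting: jitter brightness and saturation per image.
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_saturation(image, lower=0.8, upper=1.2)
    return image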

Less commonly used methods

  1. Rotation
  2. Shearing
  3. Local warping

Hyper params in Data Augmentation

A good starting point is to reuse the hyperparameter values that others have published with their trained networks.

Implementing Distortions

implementing distortions during training:

                           distortions
               ----- data1 ------------> new data1 -----
               |                                        |
hard disk --- load                                      |------> Training (CPU/GPU)
               |           distortions                  |
               ----- data2 ------------> new data2 -----

CPU threads (loading the data and implementing the distortions) can run in
parallel with the training process.
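
A minimal sketch of this parallel pattern with tf.data (file_paths and load_image are placeholders for your own data loading; augment is the distortion function sketched in the section above; the batch size is an assumption):

import tensorflow as tf

dataset = (tf.data.Dataset.from_tensor_slices(file_paths)
           .map(load_image, num_parallel_calls=tf.data.AUTOTUNE)   # CPU threads load the data
           .map(augment, num_parallel_calls=tf.data.AUTOTUNE)      # ... and apply the distortions
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))   # prepare the next batches while the CPU/GPU trains

# model.fit(dataset, epochs=10)           # training runs in parallel with the input pipeline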


Tips for Benchmarks and Competitions


  1. Ensembling (maybe 1% or 2% better, but needs a lot of memory): train several networks independently and average their outputs, similar to tree ensembles (see the sketch after this list).
  2. Multi-crop at test time (a little better, doesn't need much memory): run the classifier on multiple crops of each test image and average the results (e.g. 10-crop).
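
A minimal sketch of ensembling by averaging softmax outputs (the model file names and x_test are placeholders for your own trained models and test images):

import numpy as np
from tensorflow import keras

models = [keras.models.load_model(p)
          for p in ["model_a.keras", "model_b.keras", "model_c.keras"]]

# Ensembling: average the predicted class probabilities of all models.
probs = np.mean([m.predict(x_test) for m in models], axis=0)
predictions = probs.argmax(axis=1)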

Tips for Building a Computer Vision Practical System


  1. Use architectures of networks published in the literature.
  2. Use open-source implementations if possible.
  3. Use pretrained models and fine-tune them on your dataset.
  4. Build a system from scratch only if you have a huge dataset or need to invent something new.

Reference

  1. https://www.coursera.org/learn/convolutional-neural-networks/home/week/2
