CNN
Delete after finishing the YOLO algorithm.
- convolution: denoted by an asterisk (*)
- filter (kernel)
- image * filter = new image
- python: conv_forward, tensorflow: tf.nn.conv2d, keras: Conv2D
- edge detection (vertical and horizontal)
- a filter can distinguish whether an edge goes from 'light to dark' or from 'dark to light' (the sign of the output flips)
- different numbers used in filters (sobel, Scharr)
- learn numbers in filter by backprop
- filter size f is usually odd (3 x 3, 5 x 5, 7 x 7)
- padding: without it, pixels on the corners and edges are used much less than those in the middle; two problems: 1. shrinking output 2. throwing away information from the edges of the image
- padding choices: valid(no padding) and same(pad, so output size is same as input size)
- stride convolutions: steps the filter moves
- new output dimension: floor[(n + 2p - f) / s] + 1
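The output-dimension formula above can be checked with a tiny helper (a sketch; the function name is my own):

```python
# Sketch: conv output size = floor((n + 2p - f) / s) + 1
# (n: input size, f: filter size, p: padding, s: stride).
def conv_output_size(n, f, p=0, s=1):
    return (n + 2 * p - f) // s + 1

# 'valid' padding (p=0): 6x6 input, 3x3 filter -> 4x4 output
assert conv_output_size(6, 3) == 4
# 'same' padding for f=3, s=1 means p=1: output size equals input size
assert conv_output_size(6, 3, p=1) == 6
# stride 2: 7x7 input, 3x3 filter -> 3x3 output
assert conv_output_size(7, 3, s=2) == 3
```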
- convolutions over volumes (3 channels): input image: n x n x n_channels, filter: f x f x 3, output: a single channel (no longer 3)
- an f x f x 3 filter can detect edges in a specific color channel (red/green/blue)
- using multiple filters at the same time: output dimension: n_new x n_new x n_filters
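A naive sketch of convolution over volumes with multiple filters (loop-based for clarity, not speed; the function name and shapes are my own convention):

```python
import numpy as np

# Each f x f x n_c filter spans all input channels, so one filter produces
# a single 2-D slice; stacking n_f filters gives n_new x n_new x n_f.
def conv_volume(image, filters):
    """image: (n, n, n_c); filters: (n_f, f, f, n_c) -> (n_new, n_new, n_f)."""
    n, _, n_c = image.shape
    n_f, f, _, _ = filters.shape
    n_new = n - f + 1  # no padding, stride 1
    out = np.zeros((n_new, n_new, n_f))
    for k in range(n_f):
        for i in range(n_new):
            for j in range(n_new):
                # elementwise multiply across all channels, then sum
                out[i, j, k] = np.sum(image[i:i+f, j:j+f, :] * filters[k])
    return out

img = np.random.rand(6, 6, 3)
filt = np.random.rand(2, 3, 3, 3)  # two 3x3x3 filters
assert conv_volume(img, filt).shape == (4, 4, 2)
```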
- types of layers in cnn: Conv layer, Pooling Layer, Fully-Connected Layer
- Pooling (no params to learn): Max Pooling (usually no padding) keeps the biggest number in each window; Average Pooling keeps the average
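Max pooling can be sketched in a few lines (no learned parameters, just a sliding max; the function name is my own):

```python
import numpy as np

# Sketch of max pooling: slide an f x f window with stride s and keep
# the largest value in each window.
def max_pool(x, f=2, s=2):
    n = (x.shape[0] - f) // s + 1
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = x[i*s:i*s+f, j*s:j*s+f].max()
    return out

x = np.array([[1., 3., 2., 1.],
              [2., 9., 1., 1.],
              [1., 3., 2., 3.],
              [5., 6., 1., 2.]])
# 4x4 input, f=2, s=2 -> 2x2 output of per-window maxima
assert np.array_equal(max_pool(x), np.array([[9., 2.], [6., 3.]]))
```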
- LeNet-5 (conv-pool, conv-pool, fc, fc), AlexNet, VGG-16
- Residual NN(skip connection/short cut: residual block)
- why do Residual NNs work? the identity function is easy for a residual block to learn
- turn a plain NN to a residual NN: add residual blocks(skip connections)
- 1 x 1 convolution (one-by-one convolution / network in network)
- inception network/inception layer: use them all! Problem: computational cost (fix: use one-by-one convolutions to shrink the number of channels of the input)
- MobileNet v1 (depthwise separable convolution: depthwise + pointwise)
- Depthwise Convolution (number of filters = number of channels; each f x f filter is applied to a single channel)
- Pointwise Convolution / Projection (filter size: 1 x 1 x n_channels)
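The point of the depthwise separable design shows up when you count multiplications; a minimal sketch (helper names and the example numbers are mine, chosen to match the shapes above):

```python
# Multiply counts for an n_out x n_out output, f x f filters,
# n_c input channels, n_f output channels.
def standard_conv_cost(n_out, f, n_c, n_f):
    # every output value needs f*f*n_c multiplies, for each of n_f filters
    return n_out * n_out * f * f * n_c * n_f

def separable_conv_cost(n_out, f, n_c, n_f):
    depthwise = n_out * n_out * f * f * n_c   # one f x f filter per channel
    pointwise = n_out * n_out * n_c * n_f     # 1 x 1 x n_c projection filters
    return depthwise + pointwise

# Example: 4x4 output, 3x3 filters, 3 input channels -> 5 output channels
std = standard_conv_cost(4, 3, 3, 5)   # 2160 multiplies
sep = separable_conv_cost(4, 3, 3, 5)  # 432 + 240 = 672 multiplies
assert std == 2160 and sep == 672
```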
- MobileNet v2 (2 main changes: 1.add Residual Connection 2.add expansion layer -> bottleneck block)
- EfficientNet (with limited computational resources: how to trade off between the resolution of the input image, the depth of the network, and the width of the layers)
- Open-Source implementation
- Transfer Learning (always try it when doing computer vision): freeze layers from others' pre-trained networks; the larger the dataset you have for your task, the fewer layers you freeze and the more layers you train
- data augmentation (commonly used: Mirroring, Random Cropping, Color Shifting; less used: Rotation, Shearing, Local Warping)
- PCA Color Augmentation
- implementing distortions during training (use multiple threads to load images and apply distortions, then pass the results to other threads that run training)
- Tips for doing well on benchmarks/competitions: 1. Ensembling: train several nets independently and average their outputs 2. Multi-crop at test time: run the classifier on multiple versions of the test image and average the results
- object detection: object classification + object localization(landmark detection) = object detection
- algorithm for object detection: sliding windows. Train a CNN on closely cropped images of entire cars, then slide a window over the original input, passing each cropped window to that CNN to check whether it contains a car. After covering the entire input image, enlarge the window and repeat.
- problems with sliding windows: 1. huge computational cost; fix: implement the algorithm 'convolutionally' 2. the positions of the bounding boxes aren't very accurate; fix: YOLO algorithm (divide the input image into a grid and label the training data with an 8-dimensional vector per cell: [pc, bx, by, bh, bw, c1, c2, c3])
- YOLO(you only look once) algorithm
- how to tell whether your object detection algorithm is working well: intersection over union (IoU) = size of the intersection / size of the union; a detection counts as correct if IoU >= 0.5 (threshold). IoU is a measure of the overlap between two bounding boxes
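IoU for axis-aligned boxes is a short computation; a sketch, assuming boxes are given as (x1, y1, x2, y2) corners:

```python
# Sketch of intersection over union for axis-aligned boxes (x1, y1, x2, y2).
def iou(box_a, box_b):
    # corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    # clamp to 0 so non-overlapping boxes give zero intersection
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

assert iou((0, 0, 2, 2), (0, 0, 2, 2)) == 1.0        # identical boxes
assert abs(iou((0, 0, 2, 2), (1, 0, 3, 2)) - 1/3) < 1e-9  # partial overlap
assert iou((0, 0, 1, 1), (2, 2, 3, 3)) == 0.0        # disjoint boxes
```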
- Object Detection Problems: 1. make sure the object detection algorithm detects each object only once: non-max suppression: 1. discard all boxes with pc <= 0.6 (threshold: means there probably isn't an object) 2. while any boxes remain: pick the box with the largest pc and output it as a prediction, then discard any remaining box with IoU >= 0.5 with the box you just output
- Object Detection Problems: 2. each grid cell can detect only one object (the 'overlapping objects' problem): anchor boxes: pre-define two (or more) different shapes (called 'anchor boxes') and reshape the label y (e.g. from 8 dimensions to 16 for two anchor boxes)
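The non-max suppression steps above can be sketched directly, using the pc and IoU thresholds from the notes (box format (pc, x1, y1, x2, y2) and the small IoU helper are my own conventions):

```python
# Small IoU helper for boxes given as (x1, y1, x2, y2).
def iou(a, b):
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union

# Sketch of non-max suppression; boxes are (pc, x1, y1, x2, y2).
def non_max_suppression(boxes, pc_thresh=0.6, iou_thresh=0.5):
    # 1. discard boxes whose pc is at or below the threshold
    boxes = [b for b in boxes if b[0] > pc_thresh]
    kept = []
    # 2. repeatedly output the highest-pc box, drop boxes overlapping it
    while boxes:
        best = max(boxes, key=lambda b: b[0])
        kept.append(best)
        boxes = [b for b in boxes
                 if b is not best and iou(best[1:], b[1:]) < iou_thresh]
    return kept

cars = [(0.9, 0, 0, 4, 4),      # strong detection -> kept
        (0.7, 1, 0, 5, 4),      # overlaps the first (IoU 0.6) -> suppressed
        (0.8, 10, 10, 14, 14),  # separate object -> kept
        (0.4, 2, 2, 6, 6)]      # below pc threshold -> discarded
assert [b[0] for b in non_max_suppression(cars)] == [0.9, 0.8]
```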
- YOLO algorithm, putting it all together: for the training set, y contains two (or more) anchor boxes (with two anchor boxes: [pc bx by bh bw c1 c2 c3 pc bx by bh bw c1 c2 c3]); when making predictions, pc == 1 means the anchor box contains an object; then keep the highest-pc predictions and, for each class, run non-max suppression to generate the final predictions
- Region Proposals: R-CNN (regions with convolutional neural networks): run your convolutional classifier only on proposed regions: 1. run a segmentation algorithm to find blobs (color regions) that may be objects 2. then run the convolutional classifier on those blobs. (R-CNN -> Fast R-CNN -> Faster R-CNN)
- Semantic Segmentation: label every single pixel: semantic segmentation with U-Net (compared with a CNN for object recognition, in U-Net the width and height first get smaller, then need to get bigger to blow the representation back up to a full-size image: using transpose convolutions)
- transpose convolution: the filter is placed on the output rather than the input; padding and stride are applied to the output
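A minimal 2-D transpose convolution sketch (single channel; the function name is mine): each input value scales the whole filter, the scaled filters are added onto the output at stride-s offsets, and padding trims the output borders. Under this construction the output size is s*(n-1) + f - 2p.

```python
import numpy as np

# Sketch of a 2-D transpose convolution for a single channel.
def transpose_conv(x, w, s=2, p=1):
    n, f = x.shape[0], w.shape[0]
    full = s * (n - 1) + f          # output size before trimming padding
    out = np.zeros((full, full))
    for i in range(n):
        for j in range(n):
            # each input pixel scatters a scaled copy of the filter
            out[i*s:i*s+f, j*s:j*s+f] += x[i, j] * w
    # padding is applied to the output: trim p from every border
    return out[p:full-p, p:full-p]

x = np.array([[1., 2.], [3., 4.]])
w = np.ones((3, 3))
# 2x2 input, 3x3 filter, stride 2, padding 1 -> 3x3 output
y = transpose_conv(x, w)
assert y.shape == (3, 3)
assert y[1, 1] == 10.0  # center cell receives contributions from all 4 inputs
```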
- U-Net: a contracting path (convolutions + pooling) followed by an expanding path (transpose convolutions), with skip connections copying activations from each contracting layer to the matching expanding layer; the final layer is a 1 x 1 convolution producing an h x w x n_classes output