Final Capstone Project of Master Years
Fine-grained visual categorization (FGVC) is a topic at the forefront of computer vision research. It focuses on distinguishing categories, such as species, that closely resemble one another in appearance. Training and testing such models requires considerable time and effort. Transfer learning is a good way to tackle this problem when the dataset used by the pre-trained models and the target dataset are similar. It works especially well for image-based models because the features captured in lower-level network layers, such as edges and patterns, are general. Higher-level convolutional layers extract more complex features specific to the categories on which the model was originally trained.
However, when the categories of the original dataset and the target dataset are inconsistent, fine-tuning the weights of those higher-level convolutional layers is a better strategy. Moreover, because the pre-trained models differ in architecture, their convolutional layers, especially the higher-level ones, can extract different features. Model ensembling takes advantage of the varied features generated by multiple pre-trained models, which greatly improves model performance. However, several problems need to be addressed when applying a transfer learning approach to an FGVC dataset such as iNaturalist.
First, unlike the ImageNet dataset, the iNaturalist dataset has imbalanced classes. This can easily lead to overfitting, which has to be dealt with. Second, the similarity between species makes it harder for the model to learn important details from a limited number of pixels. In this project, four neural network architectures trained on ImageNet are used: VGG16, InceptionV3, ResNet152, and InceptionResNetV2. They serve as the basis for models tested on classifying two small subsets: one contains fine-grained categories with a balanced distribution, and the other contains coarse-grained categories with an imbalanced distribution. Since fine-tuning a pre-trained model on such a small dataset tends to result in overfitting, context augmentation preprocessing based on Mask R-CNN and generative inpainting is applied as a novel method to prevent the model from relying on background information. The ensemble of fine-tuned InceptionResNetV2, InceptionV3, and ResNet152 is the best model at classifying the subset of similar-looking categories. Most of the pre-trained models are capable of handling the imbalanced subset well on their own, with ResNet152 performing best.
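
The idea of combining several fine-tuned models can be illustrated with a minimal sketch of soft voting, in which the class probabilities predicted by each model are averaged and the highest-scoring class is chosen. This is only one common way to build an ensemble, and the text does not state whether the project combines predictions or intermediate features, so the averaging rule here is an assumption. The function names (`average_probabilities`, `predict_class`) and the probability values are hypothetical and chosen purely for illustration.

```python
from typing import Dict, List, Sequence


def average_probabilities(per_model_probs: Dict[str, Sequence[float]]) -> List[float]:
    """Soft voting: average the class-probability vectors of all models."""
    vectors = list(per_model_probs.values())
    if not vectors:
        raise ValueError("need predictions from at least one model")
    n_classes = len(vectors[0])
    if any(len(v) != n_classes for v in vectors):
        raise ValueError("all models must predict the same number of classes")
    # Average each class probability over the models.
    return [sum(v[c] for v in vectors) / len(vectors) for c in range(n_classes)]


def predict_class(per_model_probs: Dict[str, Sequence[float]]) -> int:
    """Return the index of the class with the highest averaged probability."""
    avg = average_probabilities(per_model_probs)
    return max(range(len(avg)), key=avg.__getitem__)


# Made-up softmax outputs for a single image over three hypothetical classes.
# These numbers are illustrative only and are not results from the project.
example = {
    "InceptionResNetV2": [0.50, 0.40, 0.10],
    "InceptionV3": [0.30, 0.60, 0.10],
    "ResNet152": [0.20, 0.70, 0.10],
}
ensemble_probs = average_probabilities(example)
ensemble_label = predict_class(example)
```

In this made-up case one model on its own would favor the first class, while the averaged prediction follows the two models that agree on the second. Averaging probabilities rather than hard votes lets a confident model outweigh an uncertain one, which is one reason soft voting is often tried first.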