Building AI for Emotion Detection
Introduction
Machines have been able to recognize and differentiate between faces for years now, but the human face serves purposes beyond identity. Mouths and eyebrows are vital to human communication, allowing us to convey tone and emotion without words or gestures. If machines could interpret facial expressions as human emotions, it would open a whole new world of sentiment analysis. While the potential value is clear, this task is difficult even for humans, who often misinterpret non-verbal signals because two emotions look alike or because they make incorrect assumptions about tone. In this project we seek to discriminate between eight key emotions: Happy, Sad, Contempt, Disgust, Fear, Anger, Surprise, and Neutral. Some of these emotions represent similar feelings, such as Contempt and Disgust, while others may evoke similar reactions, such as Fear and Surprise. All of these factors make the problem difficult to solve; however, our experimentation leads us to believe that high accuracy on this classification problem is attainable.
Background
Our dataset of choice for this problem is AffectNet. This dataset is attractive for image recognition due to its large scale: it contains around 1 million images, roughly 290k of which have been hand labeled with their depicted emotion. In addition, AffectNet provides valence and arousal values, which describe emotions in a continuous manner. These values are not commonly found in other facial emotion recognition datasets, and they allow us to build models that predict an image's location in this continuous space rather than classify the image within the discrete space described by our eight designated emotions: Neutral, Happy, Sad, Surprise, Anger, Disgust, Fear, and Contempt.
Integral to our exploration of this data was our choice of CNN model for transfer learning. We specifically looked at AlexNet, ResNet, and VGG as potential underlying CNNs for our model. Each of these models was trained on the ImageNet database, meaning that each network was already well equipped for object recognition; we only had to retrain them to specialize in emotion detection. Using these underlying CNNs, however, required us to ensure that our input resembled the initial training corpus for these models. The shape of each data point was the same, but we needed to normalize each image for the CNNs to provide accurate results. The model we chose was VGG16. VGG16 (also called OxfordNet) achieved top results in the 2014 ImageNet Large Scale Visual Recognition Challenge and is still considered an excellent vision model, achieving 92.7% top-5 test accuracy on ImageNet. The default input size for the VGG16 model is 224 x 224 pixels with 3 channels for RGB images. It uses 3x3 convolution filters with stride 1 and 2x2 max-pooling layers with stride 2. The combination of these layers makes VGG-16 well suited for feature extraction from an image dataset, and therefore for our problem.
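As a concrete illustration of the preprocessing described above, here is a minimal sketch, assuming a PyTorch/torchvision setup (the post does not name a framework, so treat that and the exact pipeline as assumptions). It resizes each image to VGG16's 224 x 224 input and normalizes it with the standard ImageNet channel statistics that the pretrained weights expect.

```python
# Sketch: preprocessing AffectNet images so they match VGG16's ImageNet training corpus.
# Assumes PyTorch + torchvision; the values below are the standard ImageNet statistics.
from torchvision import transforms

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),                       # VGG16's default input size
    transforms.ToTensor(),                               # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),   # match the pretrained weights
])
```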
We tested VGG-16 on the AffectNet dataset by unfreezing and training only the weights of the fully connected layers. This approach did not give us a significant result, as we only reached a validation accuracy of 34%; this is to be expected, however, because of the difference between the two datasets. Our next attempt retrained the last convolution block in addition to the fully connected layers, and the accuracy increased to 39%. As we increased the number of trainable layers, our training time also increased at a high rate. With our compute constraints, the maximum number of trainable layers we could unfreeze was half of the VGG-16 layers. Furthermore, it did not make sense to retrain the early layers of VGG-16, since they extract basic image features that are common to all image classification datasets. After training with 17 trainable layers of VGG-16 we reached a decent validation accuracy of 48%. We will later discuss how we built upon this base model.
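A minimal sketch of this partial unfreezing, again assuming PyTorch/torchvision; the layer indices follow torchvision's standard VGG16 layout and the 8-class head is our own replacement, so treat the details as illustrative rather than the exact configuration we trained.

```python
# Sketch: VGG-16 transfer learning with only the last convolution block
# and the fully connected classifier left trainable.
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 8  # Neutral, Happy, Sad, Surprise, Anger, Disgust, Fear, Contempt

model = models.vgg16(weights="IMAGENET1K_V1")   # torchvision >= 0.13 weights API

# Freeze every convolutional layer to start with.
for param in model.features.parameters():
    param.requires_grad = False

# Unfreeze the last convolution block (indices 24-30 in torchvision's VGG16).
for param in model.features[24:].parameters():
    param.requires_grad = True

# Replace the 1000-way ImageNet head with an 8-way emotion head;
# the classifier parameters are trainable by default.
model.classifier[6] = nn.Linear(4096, NUM_CLASSES)
```

Only the parameters with `requires_grad=True` need to be handed to the optimizer, which is what keeps training time manageable under tight compute constraints.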
Imbalanced Distribution
For our primary model we drew inspiration from a paper titled AffectNet: A Database for Facial Expression, Valence, and Arousal Computing in the Wild [1]. In that paper, the authors took a pretrained Convolutional Neural Network and transfer-learned emotion detection from its extracted features. For our implementation we used VGG-16 as the base Convolutional Neural Network with the Adam optimizer and cross-entropy loss. Under this structure we saw an overall validation accuracy of about 48%. Our model had high accuracy for the Happy and Neutral emotions but very poor accuracy on the Disgust and Contempt emotions. The same behavior is shown in the paper discussed above. The primary reason for this imbalanced accuracy is the imbalanced nature of the AffectNet dataset. Below you can see how the occurrences of each class correlate with the model's accuracy on each class.
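As a sketch of the training setup just described (PyTorch again; the learning rate and the way the loop is organized are illustrative assumptions, not our exact hyperparameters):

```python
# Sketch: training setup for the base model — VGG-16 backbone, Adam, cross-entropy.
import torch
import torch.nn as nn
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"

model = models.vgg16(weights="IMAGENET1K_V1")
model.classifier[6] = nn.Linear(4096, 8)        # 8 emotion classes
model = model.to(device)

criterion = nn.CrossEntropyLoss()
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)   # illustrative learning rate

def train_one_epoch(loader):
    model.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```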
To address this imbalance we compared three standard balancing strategies: down-sampling the majority classes, up-sampling the minority classes, and weighting the loss function. Our interpretation of the results is that down-sampling forces you to sacrifice data, introducing more variance into the model, while up-sampling forces you to duplicate data, introducing more bias into the model. Neither approach is ideal because of the high cost of its required tradeoff. Weighted loss, however, does not affect the bias-variance tradeoff as heavily, and as such it performed better than the other two. Looking at these results, we felt we could improve upon the up-sampling process so as to minimize the bias introduced into the model.
From this base model we felt that there existed smarter methods for balancing the data than those already in use; specifically, we felt there was a better way to up-sample the data. The goal of up-sampling is to introduce synthetic data into the training set in the hope that this synthetic data will lead to better results for the minority classes. Basic up-sampling, as used in the paper, simply copies and repeats samples from the minority classes to generate synthetic data. While this is a reasonable approach, we felt it could be done more robustly by taking advantage of the image domain. Because of how the model interprets images, we felt it would be possible to generate synthetic data by transforming a given image such that the neural network would no longer recognize the original image and the transformed image as the same data point. Our first thought was to look for a generator model. We even found a paper titled Deep Neural Network Augmentation: Generating Faces for Affect Analysis [2], which discussed the exact capability we sought: taking an image of a neutral face and producing a manipulated version of that face mapped to a chosen expression. While this paper showed that such face generation is possible, we found no public code or model for the purpose. This led us to the idea of image transformation: perturbing an image such that the network would not recognize it, allowing us to use multiple copies of the same image without an extremely high risk of memorizing the training set. The transforms we performed consisted of a random rotation between -45 and 45 degrees as well as a horizontal flip.
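In torchvision terms, the augmentation described above could look roughly like this (a sketch; the flip probability and the composition with the earlier normalization pipeline are assumptions):

```python
# Sketch: transform-based up-sampling of minority-class images.
# Rotation range and horizontal flip match the transforms described above.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=45),        # random rotation in [-45, 45] degrees
    transforms.RandomHorizontalFlip(p=0.5),       # random horizontal flip
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
```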
We were inspired to use these specific transformations because we felt they would challenge the convolutional layers of the network, forcing it to abstract features from the images rather than memorize the corpus. We also felt that these transformations would produce orientations that could reasonably exist in the dataset. Due to the large imbalance of the dataset (roughly 130K Happy images vs. 4K each for Contempt and Disgust), we were worried that fully balancing the dataset with these transformed images would still introduce undue bias. As such, we introduced a new hyperparameter for our up-sampling method that limits how much each class can be up-sampled. By setting this parameter such that each class would be up-sampled to at most 20K samples, we immediately saw large improvements in the balance of per-class accuracies, as shown in the table below.
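One way to implement this cap is sketched below, under our own assumptions about the data structures: `labels` is assumed to be a list of integer class labels for the training set, and the duplication could equally be done with a weighted sampler.

```python
# Sketch: capped up-sampling. Each minority class is duplicated (with the random
# rotation/flip transforms applied on the fly) until it reaches the cap; classes
# already at or above the cap are left untouched.
import random

UPSAMPLE_CAP = 20_000   # the new hyperparameter described above

def upsample_indices(labels, cap=UPSAMPLE_CAP, seed=0):
    """Return dataset indices with each minority class repeated up to `cap` samples."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    indices = []
    for idxs in by_class.values():
        if len(idxs) < cap:
            # Minority class: duplicate random samples until the cap is reached;
            # the random transforms make each duplicate look different to the network.
            indices.extend(idxs + rng.choices(idxs, k=cap - len(idxs)))
        else:
            # Majority class: keep the original samples, no down-sampling.
            indices.extend(idxs)
    rng.shuffle(indices)
    return indices
```

The resulting indices can then be wrapped in a `torch.utils.data.Subset` of the training dataset that applies the augmenting transform, so repeated indices yield differently perturbed images each epoch.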
While we saw great improvement from this novel up-sampling practice, there were other processes and parameters we wanted to test that were not examined in the original paper. One of these was the choice of optimizer. We decided to compare Adam and Stochastic Gradient Descent (SGD). While we believed Adam would perform better due to its adaptive learning rate, we wanted to make an empirical comparison. As the table shows, Adam outperformed SGD in almost every class; however, with further tuning of its parameters there is a chance that SGD could prove more effective.
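For reference, the two configurations we compared could be constructed as follows (the learning rates and momentum value here are illustrative assumptions rather than our exact settings):

```python
# Sketch: the two optimizer configurations compared above.
import torch

def make_optimizer(model, name="adam"):
    trainable = [p for p in model.parameters() if p.requires_grad]
    if name == "adam":
        return torch.optim.Adam(trainable, lr=1e-4)            # adaptive per-parameter step sizes
    return torch.optim.SGD(trainable, lr=1e-3, momentum=0.9)   # single global learning rate
```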
We also tested various class weights for our cross-entropy loss function. These weights ranged from treating all classes equally to heavily penalizing minority misclassifications and heavily discounting majority misclassifications, shifting accuracy toward the minority classes. Overall, we found that weighted loss created a high risk of overfitting the training data and over-classifying the minority classes, especially when used in conjunction with SGD. We also sought to combine our weighted loss function with momentum and/or weight decay to see how this would affect the overfitting. Unfortunately, we did not have enough time or compute to get reliable feedback on these hyperparameters and the ranges in which they were effective. We did, however, create a model that used SGD with a heavy bias toward the minority classes, and it achieved our best accuracy of any model.
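A sketch of what class-weighted cross-entropy combined with SGD, momentum, and weight decay might look like; the inverse-frequency weighting scheme and the hyperparameter values are hypothetical stand-ins, not the exact weights we tested.

```python
# Sketch: class-weighted cross-entropy with SGD, momentum, and weight decay.
import torch
import torch.nn as nn
from collections import Counter

def weighted_loss_and_sgd(model, labels, num_classes=8):
    """Hypothetical helper: inverse-frequency class weights + SGD with momentum/decay."""
    counts = Counter(labels)
    freqs = torch.tensor([counts[c] for c in range(num_classes)], dtype=torch.float)
    weights = freqs.sum() / (num_classes * freqs)          # rarer class -> larger weight

    criterion = nn.CrossEntropyLoss(weight=weights)
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(trainable, lr=1e-3,
                                momentum=0.9, weight_decay=1e-4)  # illustrative values
    return criterion, optimizer
```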
Finally, we explored mapping each image onto two continuous axes: valence and arousal. The motivation for this comes from the fact that emotions are not unrelated, and some emotions are remarkably similar. By mapping images into a continuous emotion space, we can use an RMSE loss function, which allows us to describe the accuracy of a prediction more precisely.
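A minimal sketch of such a regression head on the same VGG-16 backbone, with RMSE computed as the square root of MSE; the assumption here is that valence and arousal targets lie in [-1, 1], matching AffectNet's annotation range.

```python
# Sketch: valence/arousal regression head on a VGG-16 backbone with an RMSE loss.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.vgg16(weights="IMAGENET1K_V1")
backbone.classifier[6] = nn.Linear(4096, 2)      # two outputs: valence, arousal

def rmse_loss(pred, target):
    """pred and target are (batch, 2) tensors of valence/arousal in [-1, 1]."""
    return torch.sqrt(nn.functional.mse_loss(pred, target))
```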
We believe that this continuous emotion space may also open the door to ensemble classification. Rather than having one model discriminate between all eight emotions, we first find the region of emotion space in which an image lies, and then apply a model trained to discriminate between the emotions in that region. For example, the first model would map a Contempt-labeled image to negative valence and low arousal. In this region we would also find emotions such as Disgust. A second model would then discern between these two emotions and correctly determine that the emotion expressed in the image is indeed Contempt. We feel this approach is more robust because the global model was a rather poor learner, achieving a maximum accuracy of 63%. This boosting-inspired approach would allow each model to behave as an expert within its relatively small radius, and it is further encouraged by the results of the continuous emotion space mapping model.
This regression method proved quite accurate. We see rather small RMSE values for valence and arousal, both of which lie in the range [-1, 1]. This means that the emotion neighborhood approach is a reasonable task for this style of model, as it experimentally placed all images close to their labeled positions in emotion space. In that case we would be able to correctly choose a secondary model to classify the image within its neighborhood, allowing this new approach to take advantage of several local experts rather than a single global expert.
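A sketch of how such a two-stage pipeline might be wired together; the region boundaries, region names, and expert models below are entirely hypothetical placeholders for the idea described above, not something we implemented.

```python
# Sketch: two-stage "emotion neighborhood" classification.
# Stage 1 regresses valence/arousal; stage 2 picks a local expert for that region.
import torch

def classify_with_experts(image, va_model, experts):
    """
    va_model: maps an image batch to (valence, arousal) predictions in [-1, 1].
    experts:  dict mapping a region name to a classifier trained on that region's emotions.
    """
    with torch.no_grad():
        valence, arousal = va_model(image.unsqueeze(0))[0]

        # Hypothetical coarse regions of the valence/arousal plane.
        if valence < 0 and arousal < 0:
            region = "negative_low"      # e.g. Contempt, Disgust, Sad
        elif valence < 0:
            region = "negative_high"     # e.g. Anger, Fear
        elif arousal >= 0:
            region = "positive_high"     # e.g. Happy, Surprise
        else:
            region = "positive_low"      # e.g. Neutral

        logits = experts[region](image.unsqueeze(0))
    return region, logits.argmax(dim=1).item()
```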
Our experiments showed that there are still large improvements to be made on a global classifier. Should further resources be put into optimizing transform-based up-sampling, selection of the CNN, and tuning of hyperparameters, we believe accuracy can be improved above the 63% benchmark set by AlexNet. This belief is based on our improvement from 48% classification accuracy in our base model to 50% in our up-sampling model and 55% in our weighted loss model. However, even with these improvements we would still expect a global classifier to be a relatively poor learner. As such, we hypothesize that an ensemble-based approach could provide greater accuracy. The basic structure of this approach would include a primary model that places images into emotion neighborhoods, from which a secondary model would further discriminate between these more similar emotions to accurately classify the image. Not only does this help avoid the problem of a poor global learner, but we believe it more accurately fits the problem of emotion detection.
References:
[1] A. Mollahosseini, B. Hasani, and M. H. Mahoor, “AffectNet: A database for facial expression, Valence, and arousal computing in the wild,” IEEE Transactions on Affective Computing, vol. 10, no. 1, pp. 18–31, 2019.
[2] D. Kollias, S. Cheng, E. Ververas, I. Kotsia, and S. Zafeiriou, “Deep neural network augmentation: Generating faces for affect analysis,” International Journal of Computer Vision, vol. 128, no. 5, pp. 1455–1484, 2020.
[3] A. A. Heydari, C. A. Thompson, and A. Mehmood, "SoftAdapt: Techniques for adaptive loss weighting of neural networks with multi-part loss functions."
[4] M. A. H. Akhand, S. Roy, N. Siddique, M. A. S. Kamal, and T. Shimamura, "Facial emotion recognition using transfer learning in the deep CNN."