Building AI for Emotion Detection

Introduction

    Machines have been able to recognize and differentiate between faces for some time now, but the human face serves purposes beyond identity. Mouths and eyebrows are vital to human communication, allowing us to convey tone and emotion without words or gestures. If machines could interpret facial expressions as human emotions, it would open a whole new world of sentiment analysis. As valuable as this capability would be, the task is difficult even for humans, who often misinterpret non-verbal signals either because two emotions look similar or because they assume the wrong tone. In this project we seek to discriminate between eight key emotions: Happy, Sad, Contempt, Disgust, Fear, Anger, Surprise, and Neutral. Some of these emotions represent similar feelings, such as Contempt and Disgust, while others evoke similar reactions, such as Fear and Surprise. All these factors make the problem difficult to solve; however, our experimentation leads us to believe that high accuracy on this classification problem is attainable.

Background

    Our dataset of choice for this problem is AffectNet. This dataset is attractive for image recognition due to its large scale: it contains around 1 million images, of which roughly 290k have been hand labeled with their depicted emotion. In addition, the AffectNet dataset contains valence and arousal values, which describe emotions in a continuous manner. These values are not commonly found in other facial emotion recognition datasets and allow us to build models that predict an image's location in this continuous space rather than classify the image within the discrete space described by our eight designated emotions: Neutral, Happy, Sad, Surprise, Anger, Disgust, Fear, and Contempt.


Figure 1. Sample Image from AffectNet Dataset

    Integral to our exploration of this data was our choice of CNN model for transfer learning. We specifically looked at AlexNet, ResNet, and VGG as potential underlying CNNs for our model. Each of these models was trained on the ImageNet database, meaning each was already well equipped for object recognition and only needed to be retrained to specialize in emotion detection. Using these underlying CNNs, however, required us to ensure that our input resembled the corpus on which the models were initially trained. The shape of each data point was already the same, but we needed to normalize each image for the CNNs to provide accurate results. The model we chose was VGG16. VGG16 (also called OxfordNet) achieved top results in the 2014 Large Scale Visual Recognition Challenge and is still considered an excellent vision model, achieving 92.7% top-5 test accuracy on ImageNet. The default input size for the VGG16 model is 224 x 224 pixels with 3 channels for RGB images. Its convolution layers use 3x3 filters with stride 1, and its max-pooling layers use 2x2 filters with stride 2. This combination of layers makes VGG-16 ideal for feature extraction from an image dataset and well suited to our problem.
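    As a concrete illustration, the snippet below is a minimal PyTorch/torchvision sketch of loading a pretrained VGG-16 and normalizing input images with the standard ImageNet channel statistics, which is the kind of preprocessing our pipeline required; the file path is hypothetical and the framework shown here is only one possible choice.

import torch
from torchvision import models, transforms
from PIL import Image

# Standard ImageNet preprocessing: VGG-16 expects 224 x 224 RGB inputs
# normalized with the ImageNet channel means and standard deviations.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Load VGG-16 with weights pretrained on ImageNet.
vgg16 = models.vgg16(pretrained=True)
vgg16.eval()

# Example: run a single face image through the unmodified network.
image = Image.open("face.jpg").convert("RGB")    # hypothetical file path
batch = preprocess(image).unsqueeze(0)           # shape: (1, 3, 224, 224)
with torch.no_grad():
    logits = vgg16(batch)                        # 1000 ImageNet class logits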



Figure 2. VGG16 Architecture

    We first tested VGG-16 on the AffectNet dataset by unfreezing and training only the weights of the fully connected layers. This approach did not give us a significant result, reaching a validation accuracy of just 34%; this is to be expected given how different the two datasets are. The next attempt retrained the last convolution block in addition to the fully connected layers, and the accuracy increased to 39%. As we increased the number of trainable layers, our training time also grew rapidly, and under our compute constraints the most we could unfreeze was half of the VGG-16 layers. Furthermore, it did not make sense to retrain the early layers, since they extract basic image features that are common to all image classification datasets and can be used directly. After training with 17 trainable layers of VGG-16 we reached a reasonable validation accuracy of 48%. We discuss later how we built upon this base model.
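    The sketch below shows, in PyTorch-style code, how this kind of partial unfreezing could be set up: the early layers stay frozen, the last convolution block is made trainable, and the 1000-way ImageNet classifier is replaced with an 8-way emotion head. The specific layer indices follow torchvision's VGG-16 layout and are illustrative rather than our exact configuration.

import torch.nn as nn
from torchvision import models

NUM_CLASSES = 8  # Neutral, Happy, Sad, Surprise, Anger, Disgust, Fear, Contempt

model = models.vgg16(pretrained=True)

# Freeze everything first, then selectively unfreeze the parts to retrain.
for param in model.parameters():
    param.requires_grad = False

# Unfreeze the last convolutional block (indices 24-30 of model.features in
# torchvision's VGG-16: conv5_1 through the final max-pooling layer).
for param in model.features[24:].parameters():
    param.requires_grad = True

# Retrain the fully connected layers as well, replacing the final 1000-way
# ImageNet classifier with an 8-way emotion head.
model.classifier[6] = nn.Linear(in_features=4096, out_features=NUM_CLASSES)
for param in model.classifier.parameters():
    param.requires_grad = True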


Table I. Results of Various VGG Trainable Layers

Imbalanced Distribution

    For our primary model we drew inspiration from a paper titled AffectNet: A Database for Facial Expression, Valence, and Arousal Computing in the Wild [1]. In that paper the authors took a pretrained Convolutional Neural Network and transfer-learned emotion detection from the extracted features. For our implementation we used VGG-16 as our base Convolutional Neural Network together with the Adam optimizer and cross-entropy loss. Under this structure we saw an overall validation accuracy of about 48%. Our model had high accuracy for the Happy and Neutral emotions but very poor accuracy for Disgust and Contempt; the same behavior is shown in the paper discussed above. The primary reason for this imbalance in accuracy is the imbalanced nature of the AffectNet dataset. Below you can see how the occurrences of each class correlate with the model's accuracy on each class.
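    Before turning to those results, here is a minimal sketch of this training setup in PyTorch-style code, assuming a model prepared as above and an existing DataLoader named train_loader; the learning rate is illustrative rather than a tuned value.

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

criterion = nn.CrossEntropyLoss()
# Only the unfrozen parameters are passed to the optimizer.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)

model.train()
for images, labels in train_loader:   # train_loader: assumed AffectNet DataLoader
    images, labels = images.to(device), labels.to(device)
    optimizer.zero_grad()
    logits = model(images)
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()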

Table II. Class Accuracy for Various Balancing Methods [1]

    Our interpretation of these results is that down-sampling sacrifices data, introducing more variance into the model, while up-sampling duplicates data, introducing more bias. Neither approach is ideal because of the cost of its respective tradeoff. Weighted loss, however, does not affect the bias-variance tradeoff as heavily, and as such it performed better than the other two. Looking at these results, though, we felt we could improve the up-sampling process so that it introduced less bias into the model.

Table III. Class Accuracy and Occurrence for Base Model using Adam Optimization and Cross-Entropy Loss

Our Experiment

    From this base model we felt that there existed smarter methods for balancing the data than those already used; specifically, we felt there was a better way to up-sample the data. The goal of up-sampling is to introduce synthetic data into the training set in the hope that it will improve results for the minority classes. Basic up-sampling, as used in the paper, simply copies and repeats samples from the minority classes to generate synthetic data. While this is a reasonable approach, we felt it could be done more robustly by taking advantage of the image domain. Because of how the model interprets images, we felt it would be possible to generate synthetic data by transforming a given image such that the Neural Network would no longer treat the original image and the transformed image as the same data point. Our first thought for accomplishing this was to look for a generator model. We even found a paper titled Deep Neural Network Augmentation: Generating Faces for Affect Analysis [2], which discussed exactly the capability we sought: inputting an image of a neutral face and receiving a manipulated version of that face mapped to a chosen expression. While this paper showed that such face generation is possible, we found no public code or model for the purpose. This led us to the idea of image transformation: perturbing an image such that the Network would not recognize it, allowing us to use multiple copies of the same image without an extremely high risk of memorizing the training set. The transforms we performed consisted of a random rotation between -45 and 45 degrees as well as a horizontal flip.
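    The snippet below is a sketch of these perturbations using torchvision transforms; the exact composition and probabilities in our pipeline may differ slightly, and the normalization step simply mirrors the ImageNet preprocessing described earlier.

from torchvision import transforms

# Perturbations used to create "new" samples from minority-class images:
# a random rotation in [-45, 45] degrees followed by a horizontal flip.
upsample_transform = transforms.Compose([
    transforms.RandomRotation(degrees=45),      # rotate within +/- 45 degrees
    transforms.RandomHorizontalFlip(p=1.0),     # mirror the image
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])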


Figure 3. Sample Image Transformation for Up-Sampling

    We were inspired to use these specific transformations because we felt they would challenge the convolutional layers of the Neural Network, forcing it to abstract features from the images rather than memorize the corpus. We also felt that these transformations would produce orientations that could reasonably exist in the given dataset. Due to the large imbalance of the dataset (130K vs. 4K for Happy vs. Contempt/Disgust), we were worried that fully balancing the dataset with these transformed images would still introduce undue bias. As such we introduced a new hyperparameter for our up-sampling method that limits how much each class can be up-sampled. By setting this parameter so that each class would be up-sampled to at most 20K samples, we immediately saw large improvements in the balance of per-class accuracies, as shown in the table below.
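    The following is a small sketch of how such a cap could be applied when deciding how many transformed copies to generate per class; the function name and the toy label list are illustrative only.

from collections import Counter

MAX_PER_CLASS = 20_000   # up-sampling cap: no class is grown beyond this

def upsample_targets(labels, max_per_class=MAX_PER_CLASS):
    """Return how many synthetic (transformed) samples to add per class."""
    class_counts = Counter(labels)
    # Minority classes are topped up toward the cap; classes already at or
    # above the cap receive no synthetic copies.
    return {cls: max(0, max_per_class - count)
            for cls, count in class_counts.items()}

# Toy example mirroring AffectNet's imbalance:
labels = ["Happy"] * 130_000 + ["Contempt"] * 4_000
print(upsample_targets(labels))   # -> {'Happy': 0, 'Contempt': 16000}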


Table IV. Class Accuracy and Occurrence for Up-Sampled Model using Adam Optimization and Cross-Entropy Loss

    While we saw great improvement from this novel up-sampling practice, there were other processes and parameters that we wanted to test that were not examined in the original paper. One of these was the choice of optimizer. We decided to compare Adam optimization and Stochastic Gradient Descent (SGD). While we believed that Adam would perform better due to its adaptive learning rate, we wanted to make an empirical comparison. As the table shows, Adam outperformed SGD in almost every class; however, with further tuning of its parameters there is a chance that SGD could still prove superior.
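    For reference, the comparison amounts to swapping the optimizer while holding everything else fixed, roughly as sketched here; the learning rates are placeholders rather than our tuned values.

import torch

trainable = [p for p in model.parameters() if p.requires_grad]

# Two optimizers compared under otherwise identical training settings.
adam_optimizer = torch.optim.Adam(trainable, lr=1e-4)
sgd_optimizer = torch.optim.SGD(trainable, lr=1e-3)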


Table V. Comparison of Class Accuracies for SGD and Adam Optimizers

    We also tested various class weights for our cross-entropy loss function. These weights ranged from keeping all classes equal to heavily penalizing minority misclassifications and heavily discounting majority misclassifications, shifting accuracy toward the minority classes. Overall, we found that weighted loss created a high risk of overfitting the training data and over-classifying the minority classes, especially when used in conjunction with SGD. Even so, a model that used SGD with a heavy bias toward the minority classes gave us our best accuracy of any model.
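    A minimal sketch of class-weighted cross-entropy is shown below; the per-class counts are illustrative placeholders (only the rough Happy and Contempt/Disgust figures appear earlier in this write-up), and inverse-frequency weighting is just one possible choice, not necessarily the exact weights we tested.

import torch
import torch.nn as nn

# Illustrative per-class training counts in the order:
# Neutral, Happy, Sad, Surprise, Anger, Disgust, Fear, Contempt.
class_counts = torch.tensor(
    [75_000.0, 130_000.0, 25_000.0, 14_000.0, 25_000.0, 4_000.0, 6_000.0, 4_000.0])

# Inverse-frequency weights, rescaled so the average weight is 1.
# Larger weights penalize mistakes on the minority classes more heavily.
class_weights = class_counts.sum() / (len(class_counts) * class_counts)

weighted_criterion = nn.CrossEntropyLoss(weight=class_weights)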


Table VI. Class Accuracy for Weighted Loss Approach

    Interestingly, we observed that increasing the number of epochs created a high risk of overfitting the training data and over-classifying the minority classes with this otherwise high-performing combination of SGD and weighted loss. As such, we introduced momentum and weight decay into our optimization to see how they would affect the overfitting. Unfortunately, we did not have enough time or compute to get reliable feedback on these hyperparameters and the ranges in which they are effective. We believe the overfitting occurred because the penalty for misclassifying the minority classes was poorly calibrated, which caused poor convergence.
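    For completeness, momentum and weight decay enter the optimizer as sketched below; the values shown are illustrative, not tuned.

import torch

sgd_optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-3,            # illustrative learning rate
    momentum=0.9,       # momentum smooths noisy gradient updates
    weight_decay=1e-4,  # L2 penalty to discourage overfitting the training set
)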

    While we feel that our results demonstrate the promise of each of our hypotheses, we have many thoughts on how to further improve the model given more time and compute resources. First, we believe that further tuning of the hyperparameters listed above (class weights for weighted loss, optimizer momentum, weight decay, and the maximum rebalance count) would allow for better performance. This hyperparameter search would also have to be run for both of our chosen optimizers, because while we feel we have shown that Adam provides superior results, introducing these hyperparameters could change the outcome, as we saw with class weights having a huge effect on SGD. We would also like to test dropout in our model, because we saw quick overfitting of the training data accompanied by degradation of validation accuracy. Finally, we would like to look at other options for the underlying Convolutional Neural Network, as this could greatly affect feature abstraction. We believe that by testing these decision parameters our model could see even greater gains, and with enough resources it could surpass the best accuracy we have seen of 63%.

Novel Approach using Valence and Arousal Data

    Finally, we explored regression of each image onto two continuous axes, valence and arousal. The motivation for this approach comes from the fact that emotions are not unrelated, and some emotions are incredibly similar. By mapping images into a continuous emotion space, we can use an RMSE loss function, which lets us quantify prediction error more precisely.
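    A minimal sketch of this regression variant in PyTorch-style code follows: the classification head is replaced with a two-output head for (valence, arousal), and RMSE is computed as the square root of the mean squared error. Appending a Tanh to bound outputs to [-1, 1] is one possible design choice, not necessarily the one we used.

import torch
import torch.nn as nn
from torchvision import models

# Regression variant: predict (valence, arousal) instead of eight classes.
va_model = models.vgg16(pretrained=True)
va_model.classifier[6] = nn.Sequential(
    nn.Linear(4096, 2),
    nn.Tanh(),            # keeps predictions inside the [-1, 1] range
)

mse = nn.MSELoss()

def rmse_loss(pred, target):
    # RMSE is simply the square root of the mean squared error.
    return torch.sqrt(mse(pred, target))

# pred and target each have shape (batch_size, 2): [valence, arousal].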

Figure 4. Continuous Emotion Space for Emotion Detection 

    We believe that this continuous emotion space may also open the door for ensemble classification. Rather than having one model discriminate between all eight emotions, we would first find the region of emotion space in which the image lies and then use a model trained to discriminate between only the emotions in that region. For example, the first model would map a Contempt-labeled image to negative valence and low arousal. In this region we would also find emotions such as Disgust, so a second model would discern between these two emotions and correctly determine that the expressed emotion is indeed Contempt. We feel this approach is more robust because even the best global classifier we have seen is a rather weak learner, with a maximum accuracy of 63%. This boosting-inspired approach would instead allow each model to behave as an expert within its relatively small neighborhood, an idea further encouraged by the results of the continuous emotion space mapping model.
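    As a purely hypothetical sketch of how such a two-stage dispatch might look, the region rule and expert models below are invented for illustration and do not correspond to trained components of our project.

import torch

def classify_with_experts(image_tensor, va_model, experts):
    """Two-stage sketch: regress (valence, arousal), then hand the image to
    the expert classifier responsible for that region of emotion space."""
    with torch.no_grad():
        valence, arousal = va_model(image_tensor.unsqueeze(0)).squeeze(0).tolist()

        # Hypothetical region rule: negative valence and low arousal is the
        # neighborhood containing Contempt and Disgust.
        if valence < 0 and arousal < 0.25:
            expert = experts["contempt_disgust"]   # hypothetical expert model
        else:
            expert = experts["default"]            # hypothetical fallback expert

        logits = expert(image_tensor.unsqueeze(0))
    return int(logits.argmax(dim=1).item())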


Table VII. Error for Regression on Valence and Arousal

    This regression method proves quite accurate. We see rather small RMSE values for valence and arousal, both of which lie on the interval [-1, 1]. This suggests that the emotion-neighborhood approach is a reasonable task for this style of model, as it experimentally placed images close to their labeled positions in emotion space. In that case we would be able to correctly choose a secondary model to classify the image within its neighborhood, allowing this new approach to take advantage of several local experts rather than a single global expert.


Conclusion

    Our experiments showed that there are still large improvements to be made on a global classifier. Should further resources be put into optimizing transform-based up-sampling, selection of the underlying CNN, and tuning of hyperparameters, we believe accuracy can be pushed above the 63% benchmark set by AlexNet. This belief is based on our improvements from 48% classification accuracy with our base model to 50% with our up-sampling model and 55% with our weighted loss model. Even with these improvements, however, we would still expect a global classifier to be a relatively weak learner. As such, we hypothesize that an ensemble-based approach could provide greater accuracy. The basic structure of this approach would include a primary model that places images into emotion neighborhoods, from which a secondary model would further discriminate between the more similar emotions to accurately classify the image. Not only does this help avoid the problem of a weak global learner, but we believe it also more accurately fits the structure of the emotion detection problem.

Link to Github Repository

References:

[1] A. Mollahosseini, B. Hasani, and M. H. Mahoor, “AffectNet: A database for facial expression, valence, and arousal computing in the wild,” IEEE Transactions on Affective Computing, vol. 10, no. 1, pp. 18–31, 2019.

[2] D. Kollias, S. Cheng, E. Ververas, I. Kotsia, and S. Zafeiriou, “Deep neural network augmentation: Generating faces for affect analysis,” International Journal of Computer Vision, vol. 128, no. 5, pp. 1455–1484, 2020.

[3] A. A. Heydari, C. A. Thompson, and A. Mehmood, “SoftAdapt: Techniques for adaptive loss weighting of neural networks with multi-part loss functions.”

[4] M. A. H. Akhand, S. Roy, N. Siddique, M. A. S. Kamal, and T. Shimamura, “Facial emotion recognition using transfer learning in the deep CNN.”
