Facial Affective Behavior Analysis with Instruction Tuning

Yifan Li1    Anh Dao1    Wentao Bao1    Zhen Tan2    Tianlong Chen3,4,5    Huan Liu2    Yu Kong1   
1Michigan State University          2Arizona State University          3University of North Carolina at Chapel Hill
4Massachusetts Institute of Technology          5Harvard University
Figure: An overview of our proposed method and datasets.

Abstract

Facial affective behavior analysis (FABA) is crucial for understanding human mental states from images. However, traditional approaches primarily deploy models to discriminate among discrete emotion categories and lack the fine granularity and reasoning capability needed for complex facial behaviors. Multi-modal Large Language Models (MLLMs) have proven successful in general visual understanding tasks; however, directly harnessing MLLMs for FABA is challenging due to the scarcity of datasets and benchmarks, the neglect of facial prior knowledge, and low training efficiency. To address these challenges, we introduce (i) an instruction-following dataset for two FABA tasks, i.e., facial emotion and action unit recognition, (ii) a benchmark, FABA-Bench, with a new metric that considers both recognition and generation ability, and (iii) a new MLLM, EmoLA, as a strong baseline for the community. Our initiative on the dataset and benchmarks reveals the nature and rationale of facial affective behaviors, i.e., fine-grained facial movement, interpretability, and reasoning. Moreover, to build an effective and efficient FABA MLLM, we introduce a facial prior expert module with face structure knowledge and a low-rank adaptation module into a pre-trained MLLM. We conduct extensive experiments on FABA-Bench and four commonly used FABA datasets. The results demonstrate that the proposed facial prior expert boosts performance and that EmoLA achieves the best results on our FABA-Bench. On commonly used FABA datasets, EmoLA achieves competitive performance rivaling task-specific state-of-the-art models.

New FABA-Instruct dataset

The FABA-Instruct dataset is designed to advance facial action unit (AU) and emotion recognition. It includes 19,877 images with detailed annotations and instructions. Unlike existing datasets, FABA-Instruct provides both categorical and descriptive annotations, making it a valuable resource for instruction tuning multi-modal large language models.
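To make the instruction-following format concrete, below is a minimal sketch of how one FABA-Instruct sample could be represented and loaded. The field names (image, task, instruction, answer) are illustrative assumptions for this page, not the released schema.

import json

# A hypothetical FABA-Instruct sample (field names are assumptions, not
# the released schema): one image paired with a task tag, an instruction,
# and a free-form descriptive answer.
sample = {
    "image": "faba_instruct/images/000001.jpg",
    "task": "au",  # "au" for action units, "emotion" for emotion description
    "instruction": "Please describe the facial action units visible in this image.",
    "answer": "AU6: Cheek Raiser - the muscles around the eyes contract ...",
}

def load_samples(path):
    """Load a list of instruction-following samples from a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)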

Dataset Statistics

The distribution of FABA-Instruct on AU (a) and emotion (b) tasks, and the word clouds (c).


Some examples from FABA-Instruct

Example 1

Facial Action Units Analysis

The facial action units (AUs) visible in this image suggest a variety of facial muscle movements. Here's a breakdown:

  • AU4: Brow Lowerer – The person's brows are drawn together, causing vertical lines to appear between the eyebrows, which is indicative of a frown or a scowl.
  • AU6: Cheek Raiser – The muscles around the eyes appear to be contracting, creating a slight squint that can accompany a smile or a similar expression.
  • AU7: Lid Tightener – Tightening around the eyes can be seen, which often accompanies AU6.
  • AU12: Lip Corner Puller – The corners of the mouth are drawn back toward the ears, which is the primary indicator of a smile.
  • AU23: Lip Tightener – The lips appear tight and stretched across the face.
  • AU25: Lips Part – The lips are parted, which may be due to the smile or to speaking.
  • AU26: Jaw Drop – The jaw is lowered, contributing to the open-mouth appearance.

The overall configuration of these facial action units seems to create an intense or exaggerated expression, which could be associated with strong emotions such as excitement, aggression, or intense joy.
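The AU codes above follow the standard Facial Action Coding System (FACS) labels. As a small illustration (not part of the dataset tooling), a lookup table can turn a set of predicted AU codes back into the readable labels used in these descriptions:

# FACS names for the action units discussed above.
AU_NAMES = {
    4: "Brow Lowerer",
    6: "Cheek Raiser",
    7: "Lid Tightener",
    12: "Lip Corner Puller",
    23: "Lip Tightener",
    25: "Lips Part",
    26: "Jaw Drop",
}

def describe_aus(active_aus):
    """Render a set of active AU codes as human-readable FACS labels."""
    return [f"AU{code}: {AU_NAMES.get(code, 'Unknown')}" for code in sorted(active_aus)]

print(describe_aus({4, 6, 12}))
# ['AU4: Brow Lowerer', 'AU6: Cheek Raiser', 'AU12: Lip Corner Puller']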

Example 2

Facial Action Units Analysis

In this image, we can analyze the facial expression by looking at the action units (AUs), which are part of the Facial Action Coding System (FACS):

  • AU6: Cheek Raiser – There is evidence of the orbicularis oculi muscle contracting, which produces crow's feet and causes the cheeks to move upwards. This is often associated with expressions of happiness or genuine smiles.
  • AU12: Lip Corner Puller – The lip corners are pulled upwards, which is a characteristic of smiling.
  • AU25: Lips Part – The lips are slightly parted, which can happen naturally during a relaxed, happy facial expression.

Based on these AUs, the expression can be interpreted as a happy or content one, commonly associated with a genuine smile (also known as a Duchenne smile). This type of smile typically involves both the muscles around the eyes and the mouth, which are activated in this facial expression.
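The AU6 + AU12 pattern behind a Duchenne smile translates directly into a simple rule. The check below illustrates that reasoning; it is an illustration only, not a component of EmoLA:

def is_duchenne_smile(active_aus):
    """A genuine (Duchenne) smile engages both the eyes and the mouth:
    AU6 (Cheek Raiser) together with AU12 (Lip Corner Puller)."""
    return 6 in active_aus and 12 in active_aus

print(is_duchenne_smile({6, 12, 25}))  # True: eyes and mouth both engaged
print(is_duchenne_smile({12, 25}))     # False: mouth-only ("social") smile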

Example 3

Exaggerated Emotion

The person in the image appears to be showing an exaggerated expression of discomfort or unease. Their eyes are wide open, which can suggest surprise, shock, or fear, while their mouth is turned downward in a pronounced frown, which often indicates sadness or displeasure. The overall expression could be interpreted as a humorous display of distaste or dissatisfaction.

Example 4

Degree of Emotion

The person in the image appears to have a neutral expression with a slight hint of a smile. There is relaxed composure around the eyes and mouth, suggesting a calm or content state. The eyes are engaged and there is a mild pleasantness to the overall demeanor.

Example 5

Combination of Emotions

The emotion depicted on the face in the image appears to be a complex one, often characterized as a mixture of sadness and discomfort, potentially with an element of a forced smile. The downturned eyebrows, elevated cheeks, and the slight pulling up of the lip corners suggest a pained expression that could be associated with crying or distress, yet there is an attempt to smile. This could indicate a social smile or an attempt to mask one's true feelings.


EmoLA Model

EmoLA is a multi-modal large language model tailored for FABA through instruction tuning. It augments a pre-trained MLLM with a facial prior expert module that injects face structure knowledge and with a low-rank adaptation (LoRA) module for efficient training, improving both the recognition of facial action units and emotions and the interpretability of the generated descriptions.
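The description above only specifies the ingredients (a facial prior expert plus low-rank adaptation on a pre-trained MLLM), so the sketch below is a rough reading of that design rather than the actual implementation; all module names and dimensions are assumptions.

import torch
import torch.nn as nn

class EmoLASketch(nn.Module):
    """Sketch: project visual features and facial prior (face-structure)
    features into the LLM embedding space, then prepend them to the text
    prompt of a frozen, LoRA-adapted language model. Names and sizes here
    are assumptions, not the paper's implementation."""

    def __init__(self, visual_dim=1024, prior_dim=256, d_model=4096):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, d_model)  # image encoder tokens -> LLM space
        self.prior_proj = nn.Linear(prior_dim, d_model)    # facial prior tokens -> LLM space

    def forward(self, visual_feats, prior_feats, text_embeds, llm):
        # Concatenate visual, facial prior, and text tokens along the
        # sequence dimension; the LoRA-adapted LLM attends over all of them.
        tokens = torch.cat(
            [self.visual_proj(visual_feats),
             self.prior_proj(prior_feats),
             text_embeds],
            dim=1,
        )
        return llm(inputs_embeds=tokens)

Under this reading, only the two projection layers and the LoRA adapters would be trained while the backbone stays frozen, which is what makes the tuning efficient.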

New Benchmark – FABA-Bench

FABA-Bench

FABA-Bench is a new benchmark for evaluating models on Facial Affective Behavior Analysis (FABA). It assesses both the recognition and generation capabilities of models on instruction-following data. Its key metric, the REGE score (Recognition and Generation Evaluation), combines recognition accuracy with the quality of the generated text. FABA-Bench thus provides a comprehensive evaluation framework that highlights the strengths and weaknesses of different models in understanding and describing complex facial affective behaviors.
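As a concrete reading of the metric, the sketch below assumes REGE is the sum of a recognition score (set-level F1 over predicted AU or emotion labels) and a generation score (here a crude unigram-overlap stand-in for BLEU/ROUGE-style measures); the paper's exact scoring functions may differ.

def recognition_f1(pred_labels, true_labels):
    """F1 between predicted and ground-truth label sets (recognition score)."""
    pred, true = set(pred_labels), set(true_labels)
    tp = len(pred & true)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(true)
    return 2 * precision * recall / (precision + recall)

def generation_overlap(pred_text, ref_text):
    """Crude unigram-overlap proxy for generation quality."""
    pred, ref = set(pred_text.lower().split()), set(ref_text.lower().split())
    if not pred or not ref:
        return 0.0
    return len(pred & ref) / max(len(pred), len(ref))

def rege_score(pred_labels, true_labels, pred_text, ref_text):
    """Assumed form of REGE: recognition score plus generation score."""
    return recognition_f1(pred_labels, true_labels) + generation_overlap(pred_text, ref_text)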

Results on Traditional FABA Benchmarks


BibTeX

@inproceedings{li2024facial,
        title={Facial Affective Behavior Analysis with Instruction Tuning},
        author={Li, Yifan and Dao, Anh and Bao, Wentao and Tan, Zhen and Chen, Tianlong and Liu, Huan and Kong, Yu},
        booktitle={European Conference on Computer Vision (ECCV)},
        year={2024}
}

NSF Acknowledgement

This material is based upon work supported by the National Science Foundation (NSF) under Grant Nos. 1949694 and 2040209. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.