Research Agenda

1. Multimodal Emotion Recognition
2. Affective Multimedia Analysis
3. Human-Centered Behavior Analysis

Research Background



Can machines sense and identify human emotion? The main research focus of INSPIRE is the automatic analysis of human behavior during real-world human-human and human-machine interactions. In particular, we provide an interdisciplinary research platform that develops systems and devices for the automatic sensing, quantification, and interpretation of affective and social signals during interactive communication. Human-human and human-machine interactions often evoke and involve affective and social cues such as emotion, social attitude, engagement, conflict, and persuasion. These signals can be inferred from both verbal and nonverbal human behaviors, including words, head and body movements, and facial and vocal expressions. They profoundly influence the overall outcome of an interaction, and understanding them will enable us to build human-centered interactive technology tailored to an individual user’s needs, preferences, and capabilities.

Computational human behavior research is at a tipping point. A variety of applications would benefit from the outcomes of our research, ranging from personalized assistive systems to surveillance monitoring systems. This line of research builds upon multimodal signal processing and machine learning techniques, which provide the technical foundation for extracting meaningful information from audio and video recordings. However, the complexities inherent in human behavior require us to innovate on and adapt traditional techniques to behavioral and social contexts.

Multimodal Emotion Recognition

A critical step in training and validating emotion recognition systems is discovering useful features or representations from raw audio-visual data. Audio-visual recordings of emotion expression are inherently multimodal. To exploit this multimodality, we propose deep learning methodologies that capture complex non-linear interactions between audio and visual emotion expressions. This approach overcomes the limitations of traditional methods, which capture only linear relationships between modalities or require labeled data to extract multimodal features. The proposed method improves emotion classification rates, particularly for ambiguous emotional content (defined as content with no rater consensus) [ICASSP 2013deep].
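
As a rough illustration of this direction, the sketch below trains a small bimodal autoencoder whose shared layer fuses audio and visual features without requiring labels; the feature dimensions, layer sizes, and the use of PyTorch are assumptions for illustration and do not reproduce the published architecture.

    import torch
    import torch.nn as nn

    class BimodalAutoencoder(nn.Module):
        """Fuses audio and visual features into a shared representation,
        trained without labels by reconstructing both modalities.
        All dimensions are illustrative."""

        def __init__(self, audio_dim=39, video_dim=60, shared_dim=32):
            super().__init__()
            self.audio_enc = nn.Sequential(nn.Linear(audio_dim, 64), nn.ReLU())
            self.video_enc = nn.Sequential(nn.Linear(video_dim, 64), nn.ReLU())
            # The shared layer captures non-linear audio-visual interactions.
            self.shared = nn.Sequential(nn.Linear(128, shared_dim), nn.ReLU())
            self.audio_dec = nn.Linear(shared_dim, audio_dim)
            self.video_dec = nn.Linear(shared_dim, video_dim)

        def forward(self, audio, video):
            h = self.shared(torch.cat([self.audio_enc(audio),
                                       self.video_enc(video)], dim=-1))
            return self.audio_dec(h), self.video_dec(h), h

    model = BimodalAutoencoder()
    audio = torch.randn(16, 39)   # e.g., frame-level acoustic features (MFCCs)
    video = torch.randn(16, 60)   # e.g., facial landmark features
    rec_a, rec_v, shared = model(audio, video)
    loss = (nn.functional.mse_loss(rec_a, audio)
            + nn.functional.mse_loss(rec_v, video))
    loss.backward()  # `shared` later serves as the fused audio-visual feature

Because the fused representation is learned purely from reconstruction, it can be fed to any downstream emotion classifier even when only part of the data carries emotion labels.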

Affective Multimedia Analysis

An important characteristic of emotion is that it changes continuously over time. Previous emotion recognition systems demonstrated that modeling these dynamics is critical for increasing accuracy and tracking ability. We therefore investigate methods to temporally segment and analyze continuous variations in audio-visual emotion expressions. The most basic assumption behind traditional emotion recognition systems is that emotion variations can be quantified and captured within fixed-length windows. An example question we tackle is: how can we identify appropriate, variable-length segments that accurately quantify emotion variations? We found that structural patterns of emotion change exist within an utterance and are typical of each emotion class (anger, happiness, neutrality, and sadness) [ICASSP 2013emotion]. These structural patterns are effective in discriminating between emotion classes.
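
One simple way to obtain such variable-length segments, sketched below, is a greedy change-point heuristic over frame-level features: a new segment starts whenever an incoming frame drifts far from the running mean of the current segment. The function segment_by_change, its threshold, and the feature dimensions are hypothetical illustrations, not the segmentation procedure of [ICASSP 2013emotion].

    import numpy as np

    def segment_by_change(features, threshold=1.0, min_len=10):
        """Greedy variable-length segmentation of a frame-level sequence.
        A new segment starts when an incoming frame lies farther than
        `threshold` (Euclidean distance) from the current segment's mean.
        Illustrative heuristic only."""
        boundaries = [0]
        seg_sum = np.zeros(features.shape[1])
        seg_count = 0
        for t, frame in enumerate(features):
            if seg_count >= min_len:
                mean = seg_sum / seg_count
                if np.linalg.norm(frame - mean) > threshold:
                    boundaries.append(t)
                    seg_sum = np.zeros_like(seg_sum)
                    seg_count = 0
            seg_sum += frame
            seg_count += 1
        boundaries.append(len(features))
        return list(zip(boundaries[:-1], boundaries[1:]))

    # Example: 300 frames of 12-dimensional acoustic features for one utterance.
    utterance = np.random.randn(300, 12)
    segments = segment_by_change(utterance)

Each (start, end) segment can then be summarized (e.g., by its mean and slope), and the resulting segment-level sequences can be compared across emotion classes to expose the structural patterns described above.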

Focusing on dynamic audio-visual behaviors, we explore how emotion variations modulate audio and facial movement while a person is speaking, a challenging setting for emotion recognition (e.g., a system must differentiate a person smiling from a person saying ‘cheese’). We found that variable-length time units that capture the natural dynamics of facial movements are critical for emotion classification [ACM MM 2014; best student paper] [ACM TOMM 2015]. We developed a new variable-length segmentation method that exploits the dynamics of individual face regions, showing significant improvement in system accuracy. In our recent work [FG 2015], we further introduced an efficient inference method that jointly segments and classifies temporal data. The novelty of this method is that it models transition patterns between event segments of interest, such as a gesture change in which the arms move upward from a resting position to touch the nose. The method showed significant performance gains compared to traditional segmentation methods.
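
To make the joint segmentation-and-classification idea concrete, the sketch below decodes a frame-level score sequence with a semi-Markov dynamic program that scores both segment content and transitions between adjacent segment labels. The function joint_segment_classify and its inputs are illustrative assumptions and do not reproduce the inference method of [FG 2015].

    import numpy as np

    def joint_segment_classify(frame_scores, trans, max_seg_len=40):
        """Jointly segment and label a sequence with a semi-Markov Viterbi pass.
        frame_scores: (T, K) per-frame log-scores for K segment labels.
        trans:        (K, K) log-scores for transitions between segment labels.
        Returns a list of (start, end, label) segments with maximal total score."""
        T, K = frame_scores.shape
        cum = np.vstack([np.zeros(K), np.cumsum(frame_scores, axis=0)])
        best = np.full((T + 1, K), -np.inf)
        best[0, :] = 0.0
        back = {}
        for t in range(1, T + 1):
            for s in range(max(0, t - max_seg_len), t):
                seg = cum[t] - cum[s]              # segment score for [s, t), per label
                for k in range(K):
                    if s == 0:
                        prev, prev_k = 0.0, 0      # no preceding segment
                    else:
                        prev_k = int((best[s] + trans[:, k]).argmax())
                        prev = best[s, prev_k] + trans[prev_k, k]
                    if prev + seg[k] > best[t, k]:
                        best[t, k] = prev + seg[k]
                        back[(t, k)] = (s, prev_k)
        # Trace back the highest-scoring segmentation.
        t, k = T, int(best[T].argmax())
        segments = []
        while t > 0:
            s, prev_k = back[(t, k)]
            segments.append((s, t, k))
            t, k = s, prev_k
        return segments[::-1]

    # Example: 120 frames, 3 segment labels (e.g., rest, move-up, touch-nose).
    scores = np.log(np.random.dirichlet(np.ones(3), size=120))
    trans = np.log(np.full((3, 3), 1.0 / 3.0))
    print(joint_segment_classify(scores, trans))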

Human-Centered Behavior Analysis

Humans often perceive and evaluate the same emotion expression in different ways. Audio-visual emotion expressions on which evaluators disagree can lower the accuracy of automatic emotion recognition. We propose a method that assigns different weights to training instances based on their level of rater agreement [ACII 2015]. The results demonstrated that information about the level of human agreement significantly improves system accuracy, particularly for the recognition of neutral emotion.
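
A minimal sketch of agreement-weighted training is shown below, assuming segment-level features, majority-vote labels, and a per-instance agreement score; it uses scikit-learn's sample_weight mechanism as a stand-in for the weighting scheme of [ACII 2015], and all data here are synthetic placeholders.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Synthetic placeholders for segment-level audio-visual features (X),
    # majority-vote emotion labels (y), and per-instance rater agreement.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 20))
    y = rng.integers(0, 4, size=200)             # anger, happiness, neutral, sad
    agreement = rng.uniform(0.4, 1.0, size=200)  # e.g., 3 of 5 raters agree -> 0.6

    # Instances with higher rater agreement contribute more to the training loss,
    # so ambiguous examples pull the decision boundaries less strongly.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y, sample_weight=agreement)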