
Temporal-Multimodal-Multitask-Attention

Code for "Temporal Multimodal Multitask Attention for affective state estimation in a stressful environment". This repository contains the official implementation of the model proposed in the paper; the architecture and training pipeline were developed specifically for this work.

Acknowledgements

We thank the authors of the winning solution to the MuSe-Stress sub-challenge for providing the comparison code used in this work, as described in:

@inproceedings{10.1145/3551876.3554811,
author = {He, Yu and Sun, Licai and Lian, Zheng and Liu, Bin and Tao, Jianhua and Wang, Meng and Cheng, Yuan},
title = {Multimodal Temporal Attention in Sentiment Analysis},
year = {2022},
isbn = {9781450394840},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3551876.3554811},
doi = {10.1145/3551876.3554811},
abstract = {In this paper, we present the solution to the MuSe-Stress sub-challenge in the MuSe 2022 Multimodal Sentiment Analysis Challenge. The task of MuSe-Stress is to predict a time-continuous value (i.e., physiological arousal and valence) based on multimodal data of audio, visual, text, and physiological signals. In this competition, we find that multimodal fusion has good performance for physiological arousal on the validation set, but poor prediction performance on the test set. We believe that problem may be due to the over-fitting caused by the model's over-reliance on some specific modal features. To deal with the above problem, we propose Multimodal Temporal Attention (MMTA), which considers the temporal effects of all modalities on each unimodal branch, realizing the interaction between unimodal branches and adaptive inter-modal balance. The concordance correlation coefficient (CCC) of physiological arousal and valence are 0.6818 with MMTA and 0.6841 with early fusion, respectively, both ranking Top 1, outperforming the baseline system by a large margin (i.e., 0.4761 and 0.4931) on the test set.},
booktitle = {Proceedings of the 3rd International on Multimodal Sentiment Analysis Workshop and Challenge},
pages = {61–66},
numpages = {6},
keywords = {self-attention, temporal attention, multimodal fusion, multimodal sentiment analysis},
location = {Lisboa, Portugal},
series = {MuSe' 22}
}
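
The MMTA approach cited above lets each unimodal branch attend over the temporal features of all modalities. Below is a minimal PyTorch sketch of this cross-modal temporal attention idea; the module and argument names are hypothetical, and it is not the repository's or the cited paper's actual implementation.

import torch
import torch.nn as nn

class CrossModalTemporalAttention(nn.Module):
    # Each unimodal branch queries the time-aligned features of all
    # modalities, so the branch output reflects cross-modal context.
    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, branch, modalities):
        # branch: (B, T, D) features of one modality
        # modalities: list of (B, T, D) tensors, one per modality
        context = torch.cat(modalities, dim=1)           # (B, M*T, D)
        fused, _ = self.attn(branch, context, context)   # branch acts as query
        return branch + fused                            # residual keeps the unimodal signal

Concatenating along the time axis is one simple way to expose every modality's temporal context to each branch; the adaptive inter-modal balancing described in the cited paper is more involved.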

We also gratefully acknowledge the organizers of the MuSe 2022 Challenge for making the dataset available, as detailed in:

@inproceedings{10.1145/3551876.3554817,
author = {Christ, Lukas and Amiriparian, Shahin and Baird, Alice and Tzirakis, Panagiotis and Kathan, Alexander and M\"{u}ller, Niklas and Stappen, Lukas and Me\ss{}ner, Eva-Maria and K\"{o}nig, Andreas and Cowen, Alan and Cambria, Erik and Schuller, Bj\"{o}rn W.},
title = {The MuSe 2022 Multimodal Sentiment Analysis Challenge: Humor, Emotional Reactions, and Stress},
year = {2022},
isbn = {9781450394840},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3551876.3554817},
doi = {10.1145/3551876.3554817},
abstract = {The Multimodal Sentiment Analysis Challenge (MuSe) 2022 is dedicated to multimodal sentiment and emotion recognition. For this year's challenge, we feature three datasets: (i) the Passau Spontaneous Football Coach Humor (Passau-SFCH) dataset that contains audio-visual recordings of German football coaches, labelled for the presence of humour; (ii) the Hume-Reaction dataset in which reactions of individuals to emotional stimuli have been annotated with respect to seven emotional expression intensities, and (iii) the Ulm-Trier Social Stress Test (Ulm-TSST) dataset comprising of audio-visual data labelled with continuous emotion values (arousal and valence) of people in stressful dispositions. Using the introduced datasets, MuSe 2022 addresses three contemporary affective computing problems: in the Humor Detection Sub-Challenge (MuSe-Humor), spontaneous humour has to be recognised; in the Emotional Reactions Sub-Challenge (MuSe-Reaction), seven fine-grained 'in-the-wild' emotions have to be predicted; and in the Emotional Stress Sub-Challenge (MuSe-Stress), a continuous prediction of stressed emotion values is featured. The challenge is designed to attract different research communities, encouraging a fusion of their disciplines. Mainly, MuSe 2022 targets the communities of audio-visual emotion recognition, health informatics, and symbolic sentiment analysis. This baseline paper describes the datasets as well as the feature sets extracted from them. A recurrent neural network with LSTM cells is used to set competitive baseline results on the test partitions for each sub-challenge. We report an Area under the Curve (AUC) of .8480 for MuSe-Humor; .2801 mean (from 7-classes) Pearson's Correlations Coefficient for MuSe-Reaction, as well as .4931 Concordance Correlation Coefficient (CCC) and .4761 for valence and arousal in MuSe-Stress, respectively.},
booktitle = {Proceedings of the 3rd International on Multimodal Sentiment Analysis Workshop and Challenge},
pages = {5–14},
numpages = {10},
keywords = {affective computing, benchmark, challenge, emotion recognition, humor detection, multimodal fusion, multimodal sentiment analysis},
location = {Lisboa, Portugal},
series = {MuSe' 22}
}
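
Both cited works evaluate predictions with the Concordance Correlation Coefficient (CCC). For readers unfamiliar with the metric, here is a minimal NumPy implementation of the standard definition, included for reference only; it is not code from this repository.

import numpy as np

def ccc(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Concordance Correlation Coefficient (Lin, 1989): penalizes both
    # low correlation and systematic shifts in mean or scale.
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mu_t) * (y_pred - mu_p))
    return 2.0 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)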
