Components

The mexca package contains five components that can be used to build the MEXCA pipeline.

FaceExtractor

This component takes a video file as input and applies four steps:

  1. Detection: Faces displayed in the video frames are detected using a pretrained MTCNN model from facenet-pytorch [1].

  2. Encoding: Faces are extracted from the frames and encoded into an embedding space using InceptionResnetV1 from facenet-pytorch.

  3. Identification: IDs are assigned to faces by clustering the embeddings using spectral clustering (with k-means label assignment).

  4. Extraction: Facial features (landmarks, action units) are extracted from the faces using Py-Feat [2]. Available models are PFLD, MobileFaceNet, and MobileNet for landmark extraction, and svm and xgb for action unit extraction.
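The identification step (step 3) can be sketched with scikit-learn's spectral clustering on toy embeddings. The random 512-dimensional vectors below are stand-ins for InceptionResnetV1 face embeddings, and the nearest-neighbors affinity and cluster count are illustrative choices, not mexca's actual configuration:

```python
# Sketch of step 3: cluster face embeddings so each face gets a cluster ID.
# The embeddings are synthetic stand-ins for 512-d InceptionResnetV1 output.
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(seed=0)

# Two synthetic "identities": embeddings drawn around two distinct centers.
center_a = rng.normal(0.0, 1.0, size=512)
center_b = rng.normal(5.0, 1.0, size=512)
embeddings = np.vstack(
    [center_a + rng.normal(0, 0.1, size=512) for _ in range(10)]
    + [center_b + rng.normal(0, 0.1, size=512) for _ in range(10)]
)

# n_clusters would normally be the expected number of distinct faces.
clusterer = SpectralClustering(
    n_clusters=2,
    affinity="nearest_neighbors",
    n_neighbors=5,
    assign_labels="kmeans",
    random_state=0,
)
labels = clusterer.fit_predict(embeddings)  # one ID per detected face
```

In a real run, the number of clusters is set to the expected number of faces in the video, and every detected face inherits the ID of its cluster.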

Note

The two available AU extraction models produce different output: svm returns binary action unit activations, whereas xgb returns continuous activations (from a tree ensemble).
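If downstream code expects binary activations, the continuous xgb output can be thresholded. The 0.5 cutoff below is an assumption for illustration, not a documented mexca default:

```python
# Hedged sketch: binarize continuous xgb action unit activations so they
# match the binary svm output format. The 0.5 threshold is an assumption.
import numpy as np

xgb_activations = np.array([0.12, 0.87, 0.45, 0.66])  # continuous, one per AU
binary_activations = (xgb_activations >= 0.5).astype(int)
```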

SpeakerIdentifier

This component takes an audio file as input and applies three steps using the speaker diarization pipeline from pyannote.audio [3]:

  1. Segmentation: Speech segments are detected using pyannote/segmentation (this step includes voice activity detection).

  2. Encoding: Speaker embeddings are computed for each speech segment using ECAPA-TDNN from speechbrain [4].

  3. Identification: IDs are assigned to speech segments based on clustering with a Gaussian hidden Markov model.
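The intuition behind steps 2 and 3 is that embeddings of segments spoken by the same person lie close together, so clustering assigns them a shared ID. The toy vectors below stand in for ECAPA-TDNN embeddings; cosine similarity is a common comparison metric for speaker embeddings, though the actual pipeline clusters with a Gaussian hidden Markov model:

```python
# Conceptual sketch: same-speaker segments yield more similar embeddings
# than different-speaker segments. Vectors are toy stand-ins for ECAPA-TDNN.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

segment_1 = np.array([0.9, 0.1, 0.2])    # speaker A, segment 1
segment_2 = np.array([0.8, 0.15, 0.25])  # speaker A, segment 2
segment_3 = np.array([0.1, 0.9, 0.3])    # speaker B

same_speaker = cosine_similarity(segment_1, segment_2)
diff_speaker = cosine_similarity(segment_1, segment_3)
```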

VoiceExtractor

This component takes the audio file as input and extracts voice features using praat-parselmouth [5]. Currently, only the fundamental frequency (F0) can be extracted.
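To illustrate what the fundamental frequency is, the sketch below estimates F0 of a synthetic tone via autocorrelation. This is not the Praat algorithm that praat-parselmouth wraps (which is considerably more robust on real speech); it only shows the underlying concept:

```python
# Conceptual F0 estimation via autocorrelation on a synthetic signal.
# NOT the Praat algorithm used by praat-parselmouth; illustration only.
import numpy as np

sample_rate = 16000
duration = 0.5
f0_true = 200.0  # Hz

t = np.arange(int(sample_rate * duration)) / sample_rate
signal = np.sin(2 * np.pi * f0_true * t)

# Search lags covering a plausible speech F0 range (75-500 Hz).
min_lag = int(sample_rate / 500)
max_lag = int(sample_rate / 75)
autocorr = [np.dot(signal[:-lag], signal[lag:]) for lag in range(min_lag, max_lag + 1)]
best_lag = min_lag + int(np.argmax(autocorr))  # lag of strongest self-similarity
f0_estimate = sample_rate / best_lag
```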

AudioTranscriber

This component takes the audio file and the speech segment information as input. It transcribes the speech segments to text using a pretrained Whisper model and aligns the resulting transcriptions with the speaker segments. The transcriptions are split into sentences using a regular expression.
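Regex-based sentence splitting can be sketched as below; the exact pattern used by the AudioTranscriber may differ. This one splits after sentence-final punctuation followed by whitespace:

```python
# Hedged sketch of regex sentence splitting; mexca's actual pattern may differ.
import re

transcript = "Hello there. How are you today? I am fine!"
# Split after ., !, or ? when followed by whitespace (lookbehind keeps the
# punctuation attached to its sentence).
sentences = re.split(r"(?<=[.!?])\s+", transcript.strip())
```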

SentimentExtractor

This component takes the transcribed text sentences as input and predicts sentiment scores (positive, negative, neutral) for each sentence using a pretrained multilingual RoBERTa model [6].
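Conceptually, the classifier head outputs one logit per class, and a softmax converts these into scores. The logit values below are made up for illustration; the class ordering is an assumption, not the model's documented output layout:

```python
# Conceptual sketch: softmax turns per-class logits (positive, negative,
# neutral) into sentiment scores that sum to 1. Logit values are made up.
import numpy as np

def softmax(logits):
    exp = np.exp(logits - np.max(logits))  # subtract max for numerical stability
    return exp / exp.sum()

logits = np.array([2.0, -1.0, 0.5])  # assumed order: positive, negative, neutral
scores = softmax(logits)
```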

References