Components

The mexca package contains five components that can be used to build the MEXCA pipeline.

FaceExtractor

This component takes a video file as input and applies four steps (a brief sketch of the first three steps follows the list):

  1. Detection: Faces displayed in the video frames are detected using a pretrained MTCNN model from facenet-pytorch [1].

  2. Encoding: Faces are extracted from the frames and encoded into an embedding space using InceptionResnetV1 from facenet-pytorch.

  3. Identification: IDs are assigned to faces by clustering the embeddings using spectral clustering (k-means).

  4. Extraction: Facial landmarks are extracted using the pretrained MTCNN from facenet-pytorch. Facial action unit activations are extracted using a pretrained Multi-dimensional Edge Feature-based AU Relation Graph model, which is adapted from the OpenGraphAU code base [2]. Currently, only the ResNet-50 backbone is available.
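
The detection, encoding, and identification steps can be illustrated with facenet-pytorch and scikit-learn directly. This is a minimal sketch, not the mexca implementation: frame sampling, batching, landmark and action unit extraction are omitted, and the input file name and number of clusters are illustrative assumptions.

```python
import cv2
import numpy as np
import torch
from facenet_pytorch import MTCNN, InceptionResnetV1
from sklearn.cluster import SpectralClustering

device = "cuda" if torch.cuda.is_available() else "cpu"
mtcnn = MTCNN(keep_all=True, device=device)                           # step 1: detection
resnet = InceptionResnetV1(pretrained="vggface2").eval().to(device)  # step 2: encoding

embeddings = []
capture = cv2.VideoCapture("video.mp4")  # hypothetical input file
while True:
    ok, frame = capture.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    faces = mtcnn(rgb)  # cropped face tensors for this frame, or None
    if faces is not None:
        with torch.no_grad():
            embeddings.append(resnet(faces.to(device)).cpu().numpy())
capture.release()

# Step 3: identification by clustering the face embeddings
embeddings = np.concatenate(embeddings)
face_ids = SpectralClustering(n_clusters=2).fit_predict(embeddings)
```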

SpeakerIdentifier

This component takes an audio file as input and applies three steps using the speaker diarization pipeline from pyannote.audio [3] (a brief sketch follows the list):

  1. Segmentation: Speech segments are detected using pyannote/segmentation (this step includes voice activity detection).

  2. Encoding: Speaker embeddings are computed for each speech segment using ECAPA-TDNN from speechbrain [4].

  3. Identification: IDs are assigned to speech segments by clustering the speaker embeddings with a Gaussian hidden Markov model.
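
The pretrained pyannote.audio pipeline bundles these three steps. The sketch below shows how the underlying diarization can be run on its own; the checkpoint name, access token, and file name are illustrative assumptions rather than the exact configuration used by mexca.

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token="HF_ACCESS_TOKEN",  # hypothetical Hugging Face token
)

diarization = pipeline("audio.wav")  # hypothetical input file

# Each track is a speech segment with an anonymous speaker ID attached
for segment, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{segment.start:.2f}s - {segment.end:.2f}s: {speaker}")
```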

VoiceExtractor

This component takes the audio file as input and extracts voice features using librosa [5]. For the default set of extracted voice features, see the output section.
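
As a rough illustration of frame-wise feature extraction with librosa, the sketch below computes the fundamental frequency (via pYIN) and RMS energy. These are common examples only; the actual default feature set is described in the output section, and the file name is an assumption.

```python
import librosa

signal, sr = librosa.load("audio.wav", sr=None)  # hypothetical input file

# Fundamental frequency (pitch) per frame; NaN for unvoiced frames
f0, voiced_flag, voiced_prob = librosa.pyin(
    signal, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

# Root-mean-square energy per frame as a simple loudness proxy
rms = librosa.feature.rms(y=signal)[0]
```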

AudioTranscriber

This component takes the audio file and the detected speech segments as input. It transcribes the speech segments to text using a pretrained Whisper model [6]. The resulting transcriptions are aligned with the speaker segments and split into sentences using a regular expression.
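
A minimal sketch of this idea with the openai-whisper package is shown below: transcribe a file and split the text into sentences with a regular expression. mexca additionally restricts transcription to the detected speech segments and aligns sentences with speakers; the model size, file name, and regex here are illustrative assumptions.

```python
import re
import whisper

model = whisper.load_model("small")
result = model.transcribe("audio.wav")  # hypothetical input file

# Naive sentence split on ., !, or ? followed by whitespace
sentences = re.split(r"(?<=[.!?])\s+", result["text"].strip())
for sentence in sentences:
    print(sentence)
```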

SentimentExtractor

This component takes the transcribed text sentences as input and predicts sentiment scores (positive, negative, neutral) for each sentence using a pretrained multilingual RoBERTa model [7].
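
Sentence-level sentiment scoring of this kind can be sketched with the transformers pipeline and a publicly available multilingual RoBERTa sentiment model. The checkpoint shown is one such example and is an assumption; it may differ from the model mexca uses.

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="cardiffnlp/twitter-xlm-roberta-base-sentiment",
    top_k=None,  # return scores for positive, negative, and neutral
)

scores = classifier("I really enjoyed this conversation.")
print(scores)
```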

References