Components

The mexca package contains five components that can be used to build the MEXCA pipeline.

FaceExtractor

This component takes a video file as input and applies four steps (a brief sketch of the first three steps follows the list):

  1. Detection: Faces displayed in the video frames are detected using a pretrained MTCNN model from facenet-pytorch [1].

  2. Encoding: Faces are extracted from the frames and encoded into an embedding space using InceptionResnetV1 from facenet-pytorch.

  3. Identification: IDs are assigned to faces by clustering the embeddings using spectral clustering (k-means).

  4. Extraction: Facial landmarks are extracted using the pretrained MTCNN from facenet-pytorch. Facial action unit activations are extracted using a pretrained Multi-dimensional Edge Feature-based AU Relation Graph model, which is adapted from the OpenGraphAU code base [2]. Currently, only the ResNet-50 backbone is available.
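
The detection, encoding, and identification steps can be illustrated with facenet-pytorch and scikit-learn directly. This is a minimal sketch, not the mexca implementation: frame sampling, batching, landmark and action unit extraction are omitted, and the input file name and number of clusters are illustrative assumptions.

```python
import cv2
import numpy as np
import torch
from facenet_pytorch import MTCNN, InceptionResnetV1
from sklearn.cluster import SpectralClustering

device = "cuda" if torch.cuda.is_available() else "cpu"
mtcnn = MTCNN(keep_all=True, device=device)                           # step 1: detection
resnet = InceptionResnetV1(pretrained="vggface2").eval().to(device)  # step 2: encoding

embeddings = []
capture = cv2.VideoCapture("video.mp4")  # hypothetical input file
while True:
    ok, frame = capture.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    faces = mtcnn(rgb)  # cropped face tensors for this frame, or None
    if faces is not None:
        with torch.no_grad():
            embeddings.append(resnet(faces.to(device)).cpu().numpy())
capture.release()

# Step 3: identification by clustering the face embeddings
embeddings = np.concatenate(embeddings)
face_ids = SpectralClustering(n_clusters=2).fit_predict(embeddings)
```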

SpeakerIdentifier

This component takes an audio file as input and applies three steps using the speaker diarization pipeline from pyannote.audio [3] (a brief sketch follows the list):

  1. Segmentation: Speech segments are detected using pyannote/segmentation (this step includes voice activity detection).

  2. Encoding: Speaker embeddings are computed for each speech segment using ECAPA-TDNN from speechbrain [4].

  3. Identification: IDs are assigned to speech segments by clustering the speaker embeddings with a Gaussian hidden Markov model.
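
The pretrained pyannote.audio pipeline bundles these three steps. The sketch below shows how the underlying diarization can be run on its own; the checkpoint name, access token, and file name are illustrative assumptions rather than the exact configuration used by mexca.

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token="HF_ACCESS_TOKEN",  # hypothetical Hugging Face token
)

diarization = pipeline("audio.wav")  # hypothetical input file

# Each track is a speech segment with an anonymous speaker ID attached
for segment, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{segment.start:.2f}s - {segment.end:.2f}s: {speaker}")
```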

VoiceExtractor

This component takes the audio file as input and extracts voice features using librosa [5]. For the default set of extracted voice features, see the output section.
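
As a rough illustration of frame-wise feature extraction with librosa, the sketch below computes the fundamental frequency (via pYIN) and RMS energy. These are common examples only; the actual default feature set is described in the output section, and the file name is an assumption.

```python
import librosa

signal, sr = librosa.load("audio.wav", sr=None)  # hypothetical input file

# Fundamental frequency (pitch) per frame; NaN for unvoiced frames
f0, voiced_flag, voiced_prob = librosa.pyin(
    signal, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

# Root-mean-square energy per frame as a simple loudness proxy
rms = librosa.feature.rms(y=signal)[0]
```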

AudioTranscriber

This component takes the audio file and the detected speech segments as input. It transcribes the speech segments to text using a pretrained Whisper model [6]. The resulting transcriptions are aligned with the speaker segments and split into sentences using a regular expression.
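
A minimal sketch of this idea with the openai-whisper package is shown below: transcribe a file and split the text into sentences with a regular expression. mexca additionally restricts transcription to the detected speech segments and aligns sentences with speakers; the model size, file name, and regex here are illustrative assumptions.

```python
import re
import whisper

model = whisper.load_model("small")
result = model.transcribe("audio.wav")  # hypothetical input file

# Naive sentence split on ., !, or ? followed by whitespace
sentences = re.split(r"(?<=[.!?])\s+", result["text"].strip())
for sentence in sentences:
    print(sentence)
```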

SentimentExtractor

This component takes the transcribed text sentences as input and predicts sentiment scores (positive, negative, neutral) for each sentence using a pretrained multilingual RoBERTa model [7].
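
Sentence-level sentiment scoring of this kind can be sketched with the transformers pipeline and a publicly available multilingual RoBERTa sentiment model. The checkpoint shown is one such example and is an assumption; it may differ from the model mexca uses.

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="cardiffnlp/twitter-xlm-roberta-base-sentiment",
    top_k=None,  # return scores for positive, negative, and neutral
)

scores = classifier("I really enjoyed this conversation.")
print(scores)
```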

References