Components

The mexca package contains five components that can be used to build the MEXCA pipeline.
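
The components are designed to be combined into a single pipeline. A minimal sketch of how this might look is shown below; the import paths, constructor arguments, and the apply() call are assumptions for illustration and may differ from the released API.

    # Sketch only: module paths and parameters are assumptions, not the confirmed mexca API.
    from mexca.pipeline import Pipeline
    from mexca.video import FaceExtractor
    from mexca.audio import SpeakerIdentifier, VoiceExtractor
    from mexca.text import AudioTranscriber, SentimentExtractor

    pipeline = Pipeline(
        face_extractor=FaceExtractor(num_faces=2),              # expected number of faces (assumed parameter)
        speaker_identifier=SpeakerIdentifier(num_speakers=2),   # expected number of speakers (assumed parameter)
        voice_extractor=VoiceExtractor(),
        audio_transcriber=AudioTranscriber(),
        sentiment_extractor=SentimentExtractor(),
    )
    result = pipeline.apply(filepath="debate.mp4")  # hypothetical input video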

FaceExtractor

This component takes a video file as input and applies four steps:

  1. Detection: Faces displayed in the video frames are detected using a pretrained MTCNN model from facenet-pytorch [7].

  2. Encoding: Faces are extracted from the frames and encoded into an embedding space using InceptionResnetV1 from facenet-pytorch.

  3. Identification: IDs are assigned to the faces by clustering the embeddings using spectral clustering, which by default assigns cluster labels via k-means on the spectral embedding (see the sketch below the note).

  4. Extraction: Facial features (landmarks, action units) are extracted from the faces using pyfeat [3]. Available models are PFLD, MobileFaceNet, and MobileNet for landmark extraction, and svm and xgb for action unit extraction.

Note

The two available AU extraction models give different output: svm returns binary unit activations, whereas xgb returns continuous activations (from a tree ensemble).
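
The detection, encoding, and identification steps can be illustrated with the underlying libraries. The following is a minimal sketch of the general approach (pretrained MTCNN, InceptionResnetV1 embeddings, spectral clustering), not mexca's exact implementation; the frame file names and the number of clusters are placeholders.

    import torch
    from PIL import Image
    from facenet_pytorch import MTCNN, InceptionResnetV1
    from sklearn.cluster import SpectralClustering

    # Step 1: Detection -- pretrained MTCNN returns cropped face tensors per frame.
    mtcnn = MTCNN(keep_all=True)
    # Step 2: Encoding -- pretrained InceptionResnetV1 maps each face to a 512-dim embedding.
    resnet = InceptionResnetV1(pretrained="vggface2").eval()

    embeddings = []
    for path in ["frame_001.png", "frame_002.png"]:  # hypothetical video frames
        faces = mtcnn(Image.open(path).convert("RGB"))  # tensor (n_faces, 3, 160, 160) or None
        if faces is not None:
            with torch.no_grad():
                embeddings.append(resnet(faces))
    embeddings = torch.cat(embeddings).numpy()

    # Step 3: Identification -- cluster the embeddings into face IDs
    # (spectral clustering assigns labels via k-means on the spectral embedding by default).
    face_ids = SpectralClustering(n_clusters=2, assign_labels="kmeans").fit_predict(embeddings)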

SpeakerIdentifier

This component takes an audio file as input and applies three steps using the speaker diarization pipeline from pyannote.audio [2]:

  1. Segmentation: Speech segments are detected using pyannote/segmentation (this step includes voice activity detection).

  2. Encoding: Speaker embeddings are computed for each speech segment using ECAPA-TDNN from speechbrain [6].

  3. Identification: IDs are assigned to speech segments based on clustering with a Gaussian hidden Markov model.
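
These three steps are bundled in pyannote.audio's pretrained diarization pipeline, which can be sketched as follows; the checkpoint name, the Hugging Face access token, and the number of speakers are assumptions that depend on the installed version.

    from pyannote.audio import Pipeline

    # Load the pretrained speaker diarization pipeline (segmentation + embedding + clustering).
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization",
        use_auth_token="HF_TOKEN",  # placeholder access token
    )
    diarization = pipeline("audio.wav", num_speakers=2)

    # Each turn is a speech segment with the speaker ID assigned by the clustering step.
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"{turn.start:.2f}s - {turn.end:.2f}s: {speaker}")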

VoiceExtractor

This component takes the audio file as input and extracts voice features using praat-parselmouth [4]. Currently, only the fundamental frequency (F0) can be extracted.
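
A minimal sketch of F0 extraction with praat-parselmouth is shown below; the file name and time step are placeholders, and the defaults mexca uses may differ.

    import parselmouth

    snd = parselmouth.Sound("audio.wav")      # load the audio file
    pitch = snd.to_pitch(time_step=0.02)      # Praat pitch tracking (assumed time step)
    f0 = pitch.selected_array["frequency"]    # F0 in Hz; 0 where no pitch was detected
    times = pitch.xs()                        # matching time stamps in seconds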

AudioTranscriber

This component takes the audio file and the speech segment information as input. It transcribes the speech segments to text using a pretrained Whisper model [5]. The resulting transcriptions are aligned with the speaker segments and split into sentences using a regular expression.
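
A minimal sketch of the transcription and sentence-splitting idea is shown below, using the openai-whisper package; the model size and the splitting pattern are assumptions rather than mexca's exact settings, and the alignment with speaker segments is omitted.

    import re
    import whisper

    model = whisper.load_model("small")       # assumed model size
    result = model.transcribe("audio.wav")    # returns the transcript plus timestamped segments

    # Split the transcript into sentences at sentence-final punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", result["text"].strip())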

SentimentExtractor

This component takes the transcribed sentences as input and predicts sentiment scores (positive, negative, neutral) for each sentence using a pretrained multilingual RoBERTa model [1].
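
A minimal sketch with the Hugging Face transformers pipeline is shown below; the checkpoint name is an assumption and may differ from the model mexca ships with.

    from transformers import pipeline

    classifier = pipeline(
        "text-classification",
        model="cardiffnlp/twitter-xlm-roberta-base-sentiment",  # assumed multilingual RoBERTa checkpoint
        top_k=None,  # return scores for all labels instead of only the top one
    )
    scores = classifier("I really enjoyed this conversation.")
    # -> list of {'label': 'positive'/'neutral'/'negative', 'score': ...} dicts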

References

[1] Barbieri, F., Camacho-Collados, J., Neves, L., & Espinosa-Anke, L. (2020). TweetEval: Unified benchmark and comparative evaluation for tweet classification. arXiv. https://doi.org/10.48550/arXiv.2010.12421

[2] Bredin, H., & Laurent, A. (2021). End-to-end speaker segmentation for overlap-aware resegmentation. arXiv. https://doi.org/10.48550/arXiv.2104.04045

[3] Cheong, J. H., Xie, T., Byrne, S., & Chang, L. J. (2021). Py-Feat: Python facial expression analysis toolbox. arXiv. https://doi.org/10.48550/arXiv.2104.03509

[4] Jadoul, Y., Thompson, B., & de Boer, B. (2018). Introducing Parselmouth: A Python interface to Praat. Journal of Phonetics, 71, 1-15. https://doi.org/10.1016/j.wocn.2018.07.001

[5] Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust speech recognition via large-scale weak supervision. https://cdn.openai.com/papers/whisper.pdf

[6] Ravanelli, M., Parcollet, T., Plantinga, P., Rouhe, A., Cornell, S., Lugosch, L., … Bengio, Y. (2021). SpeechBrain: A general-purpose speech toolkit. arXiv. https://doi.org/10.48550/arXiv.2106.04624

[7] Schroff, F., Kalenichenko, D., & Philbin, J. (2015). FaceNet: A unified embedding for face recognition and clustering. arXiv. https://doi.org/10.48550/arXiv.1503.03832