mexca.data
Objects for storing multimodal data.
Module Contents
Classes
Video annotation class for storing facial features.
Configure the calculation of signal properties used for voice feature extraction.
Class for storing voice features.
Class for storing speech segment data.
Class for storing speaker and speech segment annotations.
Class for storing transcription data.
Class for storing audio transcriptions.
Class for storing sentiment data.
Class for storing sentiment scores of transcribed sentences.
Class for storing multimodal features.
- class mexca.data.VideoAnnotation
Video annotation class for storing facial features.
- Parameters:
frame (list, optional) – Index of each frame.
time (list, optional) – Timestamp of each frame in seconds.
face_box (list, optional) – Bounding box of a detected face. Is numpy.nan if no face was detected.
face_prob (list, optional) – Probability of a detected face. Is numpy.nan if no face was detected.
face_landmarks (list, optional) – Facial landmarks of a detected face. Is numpy.nan if no face was detected.
face_aus (list, optional) – Facial action unit activations of a detected face. Is numpy.nan if no face was detected.
face_label (list, optional) – Label of a detected face. Is numpy.nan if no face was detected.
face_confidence (list, optional) – Confidence of the face_label assignment. Is numpy.nan if no face was detected or only one face label was assigned.
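The column-like layout described above can be pictured with a minimal sketch. It uses a plain dictionary of lists and `math.nan` (the same float value as `numpy.nan`) instead of the actual `VideoAnnotation` class, and all values are made up for illustration.

```python
import math

# Hypothetical values: three frames, with no face detected in frame 1.
annotation = {
    "frame": [0, 1, 2],
    "time": [0.0, 0.04, 0.08],
    "face_prob": [0.99, math.nan, 0.97],
    "face_label": ["0", math.nan, "0"],
}


def is_missing(value):
    """Return True if a value is the NaN placeholder for 'no face detected'."""
    return isinstance(value, float) and math.isnan(value)


# Keep only the frames in which a face was detected.
frames_with_face = [
    frame
    for frame, prob in zip(annotation["frame"], annotation["face_prob"])
    if not is_missing(prob)
]
print(frames_with_face)
```

Because NaN never compares equal to itself, a check such as `math.isnan` (or `numpy.isnan`) is needed rather than `==` when filtering out frames without a detected face.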
- class mexca.data.VoiceFeaturesConfig
Configure the calculation of signal properties used for voice feature extraction.
Create a pseudo-immutable object with attributes that are recognized by the VoiceExtractor class and forwarded as arguments to signal property objects defined in mexca.audio.features. Details can be found in the feature class documentation.
- Parameters:
frame_len (int) – Number of samples per frame.
hop_len (int) – Number of samples between frame starting points.
center (bool, default=True) – Whether the signal has been centered and padded before framing.
pad_mode (str, default='constant') – How the signal has been padded before framing. See numpy.pad(). Uses the default value 0 for ‘constant’ padding.
spec_window (str or float or tuple, default="hann") – The window that is applied before the STFT to obtain spectra.
pitch_lower_freq (float, default=75.0) – Lower limit used for pitch estimation (in Hz).
pitch_upper_freq (float, default=600.0) – Upper limit used for pitch estimation (in Hz).
pitch_method (str, default="pyin") – Method used for estimating voice pitch.
pitch_n_harmonics (int, default=100) – Number of estimated pitch harmonics.
pitch_pulse_lower_period (float, optional, default=0.0001) – Lower limit for periods between glottal pulses for jitter and shimmer extraction.
pitch_pulse_upper_period (float, optional, default=0.02) – Upper limit for periods between glottal pulses for jitter and shimmer extraction.
pitch_pulse_max_period_ratio (float, optional, default=1.3) – Maximum ratio between consecutive glottal periods for jitter and shimmer extraction.
pitch_pulse_max_amp_factor (float, default=1.6) – Maximum ratio between consecutive amplitudes used for shimmer extraction.
jitter_rel (bool, default=True) – Divide jitter by the average pitch period.
shimmer_rel (bool, default=True) – Divide shimmer by the average pulse amplitude.
hnr_lower_freq (float, default=75.0) – Lower fundamental frequency limit for choosing pitch candidates when computing the harmonics-to-noise ratio (HNR).
hnr_rel_silence_threshold (float, default=0.1) – Relative threshold for treating signal frames as silent when computing the HNR.
formants_max (int, default=5) – The maximum number of formants that are extracted.
formants_lower_freq (float, default=50.0) – Lower limit for formant frequencies (in Hz).
formants_upper_freq (float, default=5450.0) – Upper limit for formant frequencies (in Hz).
formants_signal_preemphasis_from (float, default=50.0) – Starting value for the applied preemphasis function (in Hz).
formants_window (str or float or tuple, default="praat_gaussian") – Window function that is applied before formant estimation.
formants_amp_lower (float, optional, default=0.8) – Lower boundary for formant peak amplitude search interval.
formants_amp_upper (float, optional, default=1.2) – Upper boundary for formant peak amplitude search interval.
formants_amp_rel_f0 (bool, optional, default=True) – Whether the formant amplitude is divided by the fundamental frequency amplitude.
alpha_ratio_lower_band (tuple, default=(50.0, 1000.0)) – Boundaries of the alpha ratio lower frequency band (start, end) in Hz.
alpha_ratio_upper_band (tuple, default=(1000.0, 5000.0)) – Boundaries of the alpha ratio upper frequency band (start, end) in Hz.
hammar_index_pivot_point_freq (float, default=2000.0) – Point separating the Hammarberg index lower and upper frequency regions in Hz.
hammar_index_upper_freq (float, default=5000.0) – Upper limit for the Hammarberg index upper frequency region in Hz.
spectral_slopes_bands (tuple, default=((0.0, 500.0), (500.0, 1500.0))) – Frequency bands in Hz for which spectral slopes are estimated.
mel_spec_n_mels (int, default=26) – Number of Mel filters.
mel_spec_lower_freq (float, default=20.0) – Lower frequency boundary for Mel spectrogram transformation in Hz.
mel_spec_upper_freq (float, default=8000.0) – Upper frequency boundary for Mel spectrogram transformation in Hz.
mfcc_n (int, default=4) – Number of Mel frequency cepstral coefficients (MFCCs) that are estimated per frame.
mfcc_lifter (float, default=22.0) – Cepstral liftering coefficient for MFCC estimation. Must be >= 0. If zero, no liftering is applied.
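To make the framing parameters more concrete, the following sketch shows how `frame_len`, `hop_len`, and `center` determine the number of frames. It is not the mexca implementation: `FramingConfig` and `n_frames` are hypothetical names, the frozen dataclass is just one way to obtain a read-only configuration object, and the frame count assumes librosa-style framing in which centering pads `frame_len // 2` samples on each side.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class FramingConfig:
    """Hypothetical stand-in holding a few of the documented settings."""

    frame_len: int
    hop_len: int
    center: bool = True
    pad_mode: str = "constant"


def n_frames(n_samples: int, config: FramingConfig) -> int:
    """Count frames, assuming librosa-style framing (an assumption)."""
    if config.center:
        # Centering pads frame_len // 2 samples on both sides of the signal.
        n_samples += 2 * (config.frame_len // 2)
    if n_samples < config.frame_len:
        return 0
    return 1 + (n_samples - config.frame_len) // config.hop_len


config = FramingConfig(frame_len=1024, hop_len=256)
print(n_frames(16000, config))

try:
    config.hop_len = 512  # Attributes of a frozen dataclass cannot be reassigned.
except FrozenInstanceError:
    print("config is read-only")
```

Under these assumptions, a smaller `hop_len` yields more, and more strongly overlapping, frames, while `center=False` drops the frames that would otherwise extend past the start and end of the signal.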
- class mexca.data.VoiceFeatures
Class for storing voice features.
Features are stored as lists (like columns of a data frame). Optional features are initialized as empty lists.
- Parameters:
- class mexca.data.SpeakerAnnotation(intervals: List[intervaltree.Interval] = None)
Bases: intervaltree.IntervalTree
Class for storing speaker and speech segment annotations.
Stores speech segments as intervaltree.Interval objects in an intervaltree.IntervalTree. Speaker labels are stored in SegmentData objects in the data attribute of each interval.
- classmethod from_pyannote(annotation: Any)
Create a SpeakerAnnotation object from a pyannote.core.Annotation object.
- Parameters:
annotation (pyannote.core.Annotation) – Annotation object containing speech segments and speaker labels.
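The interval-based storage can be illustrated with a plain-Python stand-in. The sketch below does not use `intervaltree` or the real `SpeakerAnnotation` class: `Segment` and `speakers_at` are hypothetical names that only mimic the (begin, end, data) layout of an interval, with a bare speaker label in place of a SegmentData object. Like `intervaltree`, it treats intervals as half-open, so the end point is excluded.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    """Hypothetical stand-in for an interval: (begin, end, data)."""

    begin: float
    end: float
    data: str  # Speaker label; the real class stores a SegmentData object here.


segments: List[Segment] = [
    Segment(0.0, 2.5, "speaker_0"),
    Segment(2.0, 4.0, "speaker_1"),  # Overlaps with the first segment.
    Segment(5.0, 7.5, "speaker_0"),
]


def speakers_at(time: float, segments: List[Segment]) -> List[str]:
    """Return the labels of all segments covering a time point."""
    return sorted(seg.data for seg in segments if seg.begin <= time < seg.end)


print(speakers_at(2.2, segments))  # Two speakers overlap at this time point.
print(speakers_at(4.5, segments))  # No speech segment covers this time point.
```

The linear scan is for illustration only; an interval tree answers the same kind of point and overlap queries without checking every segment.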
- class mexca.data.AudioTranscription(filename: str, subtitles: Optional[intervaltree.IntervalTree] = None)
Class for storing audio transcriptions.
- Parameters:
filename (str) – Name of the transcribed audio file.
subtitles (intervaltree.IntervalTree, optional, default=None) – Interval tree containing the transcribed speech segments split into sentences as intervals. The transcribed sentences are stored in the data attribute of each interval.
- class mexca.data.SentimentAnnotation(intervals: List[intervaltree.Interval] = None)
Bases: intervaltree.IntervalTree
Class for storing sentiment scores of transcribed sentences.
Stores sentiment scores as intervals in an interval tree. The scores are stored in the data attribute of each interval.
- class mexca.data.Multimodal(filename: str, duration: Optional[float] = None, fps: Optional[int] = None, fps_adjusted: Optional[int] = None, video_annotation: Optional[VideoAnnotation] = None, audio_annotation: Optional[SpeakerAnnotation] = None, voice_features: Optional[VoiceFeatures] = None, transcription: Optional[AudioTranscription] = None, sentiment: Optional[SentimentAnnotation] = None, features: Optional[pandas.DataFrame] = None)
Class for storing multimodal features.
See the Output section for details.
- Parameters:
filename (str) – Name of the file from which features were extracted.
duration (float, optional, default=None) – Video duration in seconds.
fps (float) – Frames per second.
fps_adjusted (float) – Frames per second, adjusted for skipped frames. Mostly needed for internal computations.
video_annotation (VideoAnnotation) – Object containing facial features.
audio_annotation (SpeakerAnnotation) – Object containing speech segments and speakers.
voice_features (VoiceFeatures) – Object containing voice features.
transcription (AudioTranscription) – Object containing transcribed speech segments split into sentences.
sentiment (SentimentAnnotation) – Object containing sentiment scores for transcribed sentences.
features (pandas.DataFrame) – Merged features.
- merge_features() → pandas.DataFrame
Merge multimodal features from pipeline components into a common data frame.
Transforms and merges the available output stored in the Multimodal object based on the ‘frame’ variable. Stores the merged features as a pandas.DataFrame in the features attribute.
- Returns:
Merged multimodal features.
- Return type:
pandas.DataFrame
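The idea of merging component outputs on a shared ‘frame’ key can be sketched in plain Python. This is not the library's implementation: the actual method returns a pandas.DataFrame, whereas the sketch uses dictionaries, the column `pitch_f0_hz` and all values are hypothetical, and filling gaps with `None` (an outer join) is an assumption about how frames missing from one component are handled.

```python
from collections import defaultdict

# Hypothetical per-frame outputs of two pipeline components.
video_rows = [
    {"frame": 0, "face_prob": 0.99},
    {"frame": 1, "face_prob": 0.97},
    {"frame": 2, "face_prob": 0.95},
]
voice_rows = [
    {"frame": 0, "pitch_f0_hz": 118.2},
    {"frame": 2, "pitch_f0_hz": 121.7},
]

# Collect all values that belong to the same frame index.
merged = defaultdict(dict)
for rows in (video_rows, voice_rows):
    for row in rows:
        merged[row["frame"]].update(row)

# Build one row per frame; features missing for a frame become None.
columns = ["frame", "face_prob", "pitch_f0_hz"]
table = [
    {col: merged[frame].get(col) for col in columns}
    for frame in sorted(merged)
]
for row in table:
    print(row)
```

With pandas, the same outer join on a shared key could be expressed with `DataFrame.merge(..., on="frame", how="outer")`, which fills the gaps with NaN instead of `None`.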