`mexca.audio.features`

Compute audio signal properties to extract voice features.

This module contains classes and methods to compute and store properties of audio signals that can be used to extract voice features.

There are two main types of classes: Signal (inherits from BaseSignal) and Frames (inherits from BaseFrames). Signals contain data about an entire signal (e.g., the audio signal itself) whereas Frames contain transformed and aggregated data about overlapping slices of the signal.

Module Contents

Classes

`BaseSignal`	Store a signal.
`AudioSignal`	Load and store an audio signal.
`BaseFrames`	Create and store signal frames.
`PitchFrames`	Estimate and store pitch frames.
`SpecFrames`	Create and store spectrogram frames.
`FormantFrames`	Estimate and store formant frames.
`PitchHarmonicsFrames`	Estimate and store voice pitch harmonics.
`FormantAmplitudeFrames`	Estimate and store formant amplitudes.
`PitchPulseFrames`	Extract and store glottal pulse frames.
`PitchPeriodFrames`	Create and store signal frames.
`JitterFrames`	Extract and store voice jitter frames.
`ShimmerFrames`	Extract and store voice shimmer frames.
`HnrFrames`	Estimate and store harmonics-to-noise ratios (HNRs).

class mexca.audio.features.BaseSignal(sig: numpy.ndarray, sr: int)[source]

Store a signal.

Parameters:

sig (numpy.ndarray) – Signal.
sr (int) – Sampling rate.

property idx: numpy.ndarray[source]: Sample indices (read-only).

property ts: numpy.ndarray[source]: Sample timestamps (read-only).
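
As a minimal sketch of how the two properties relate, the snippet below assumes that timestamps are the sample indices divided by the sampling rate. The helper `sample_timestamps` is hypothetical and not part of mexca; it only illustrates the assumed relationship.

```python
from typing import List


def sample_timestamps(n_samples: int, sr: int) -> List[float]:
    """Return the assumed timestamp (in seconds) of each sample index."""
    idx = range(n_samples)  # sample indices 0, 1, ..., n_samples - 1
    return [i / sr for i in idx]  # assumption: ts = idx / sr


ts = sample_timestamps(4, sr=4)
```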

class mexca.audio.features.AudioSignal(sig: numpy.ndarray, sr: int, mono: bool = True, filename: Optional[str] = None)[source]

Bases: BaseSignal

Load and store an audio signal.

Parameters:

sig (numpy.ndarray) – Audio signal.
sr (int) – Sampling rate.
mono (bool, default=True) – Whether the signal has been converted to mono or not.
filename (str, optional) – Name of the audio file associated with the signal.

classmethod from_file(filename: str, sr: Optional[float] = None, mono: bool = True)[source]

Load a signal from an audio file.

Parameters:

filename (str) – Name of the audio file. File types must be supported by soundfile or audioread. See librosa.load().
sr (float, optional, default=None) – Sampling rate. If None, the native sampling rate of the file is used; otherwise, the signal is resampled to this rate.
mono (bool, default=True) – Whether to convert the signal to mono.

class mexca.audio.features.BaseFrames(frames: numpy.ndarray, sr: int, frame_len: int, hop_len: int, center: bool = True, pad_mode: str = 'constant')[source]

Create and store signal frames.

A frame is an (overlapping, padded) slice of a signal for which higher-order features can be computed.

Parameters:

frames (numpy.ndarray) – Signal frames. The first dimension should be the number of frames.
sr (int) – Sampling rate.
frame_len (int) – Number of samples per frame.
hop_len (int) – Number of samples between frame starting points.
center (bool, default=True) – Whether the signal has been centered and padded before framing.
pad_mode (str, default='constant') – How the signal has been padded before framing. See numpy.pad(). Uses the default value 0 for ‘constant’ padding.

See also

librosa.util.frame

property idx: numpy.ndarray[source]: Frame indices (read-only).

property ts: numpy.ndarray[source]: Frame timestamps (read-only).

classmethod from_signal(sig_obj: BaseSignal, frame_len: int, hop_len: Optional[int] = None, center: bool = True, pad_mode: str = 'constant')[source]

Create frames from a signal.

Parameters:

sig_obj (BaseSignal) – Signal object.
frame_len (int) – Number of samples per frame.
hop_len (int, optional, default=None) – Number of samples between frame starting points. If None, uses frame_len // 4.
center (bool, default=True) – Whether to center the frames and apply padding.
pad_mode (str, default='constant') – How the signal is padded before framing. See numpy.pad(). Uses the default value 0 for ‘constant’ padding. Ignored if center=False.
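
The framing parameters can be illustrated with a small pure-Python sketch. It assumes that centering pads the signal with `frame_len // 2` zeros on both sides (the 'constant' mode) and that frames start every `hop_len` samples. The function `frame_signal` is hypothetical; it is not the library's implementation.

```python
from typing import List, Optional


def frame_signal(
    sig: List[float],
    frame_len: int,
    hop_len: Optional[int] = None,
    center: bool = True,
) -> List[List[float]]:
    """Slice a signal into overlapping frames (illustrative sketch)."""
    if hop_len is None:
        hop_len = frame_len // 4  # default documented above
    if center:
        pad = frame_len // 2  # assumption: symmetric zero padding
        sig = [0.0] * pad + list(sig) + [0.0] * pad
    if len(sig) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(sig) - frame_len) // hop_len
    return [sig[i * hop_len : i * hop_len + frame_len] for i in range(n_frames)]


frames = frame_signal([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], frame_len=4)
```

With `frame_len=4`, the default hop length is 1, so the padded signal of 12 samples yields 9 frames; the first and last frames contain padding zeros.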

class mexca.audio.features.PitchFrames(frames: numpy.ndarray, flag: numpy.ndarray, prob: numpy.ndarray, sr: int, lower: float, upper: float, frame_len: int, hop_len: int, method: str, center: bool = True, pad_mode: str = 'constant')[source]

Bases: BaseFrames

Estimate and store pitch frames.

Estimate and store the voice pitch measured as the fundamental frequency F0 in Hz.

Parameters:

frames (numpy.ndarray) – Voice pitch frames in Hz with shape (num_frames,).
flag (numpy.ndarray) – Boolean flags indicating which frames are voiced with shape (num_frames,).
prob (numpy.ndarray) – Probabilities for frames being voiced with shape (num_frames,).
lower (float) – Lower limit used for pitch estimation (in Hz).
upper (float) – Upper limit used for pitch estimation (in Hz).
method (str) – Method used for estimating voice pitch.

See also

librosa.pyin, librosa.yin

classmethod from_signal(sig_obj: BaseSignal, frame_len: int, hop_len: Optional[int] = None, center: bool = True, pad_mode: str = 'constant', lower: float = 75.0, upper: float = 600.0, method: str = 'pyin')[source]

Estimate the voice pitch frames from a signal.

Currently, voice pitch can only be extracted with the pYIN method.

Parameters:

sig_obj (BaseSignal) – Signal object.
frame_len (int) – Number of samples per frame.
hop_len (int, optional, default=None) – Number of samples between frame starting points. If None, uses frame_len // 4.
center (bool, default=True) – Whether to center the frames and apply padding.
pad_mode (str, default='constant') – How the signal is padded before framing. See numpy.pad(). Uses the default value 0 for ‘constant’ padding. Ignored if center=False.
lower (float, default=75.0) – Lower limit for pitch estimation (in Hz).
upper (float, default=600.0) – Upper limit for pitch estimation (in Hz).
method (str, default='pyin') – Method for estimating voice pitch. Only ‘pyin’ is currently available.

Raises:

NotImplementedError – If a method other than ‘pyin’ is given.

class mexca.audio.features.SpecFrames(frames: numpy.ndarray, sr: int, window: str, frame_len: int, hop_len: int, center: bool = True, pad_mode: str = 'constant')[source]

Bases: BaseFrames

Create and store spectrogram frames.

Computes a spectrogram of a signal using the short-time Fourier transform (STFT).

Parameters:

frames (np.ndarray) – Spectrogram frames.
window (str) – The window that was applied before the STFT.

Notes

Frames contain complex arrays x where np.abs(x) is the magnitude and np.angle(x) is the phase of the signal for different frequency bins.
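
To make the magnitude and phase interpretation concrete, here is a minimal sketch that computes a plain discrete Fourier transform of one frame without any window. The helper `dft` is hypothetical and much slower than an FFT-based STFT; it is only meant to show what the complex values represent.

```python
import cmath
import math
from typing import List


def dft(frame: List[float]) -> List[complex]:
    """Naive DFT of a real frame, returning the non-negative frequency bins."""
    n_samples = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * n / n_samples) for n, x in enumerate(frame))
        for k in range(n_samples // 2 + 1)
    ]


# A cosine with exactly two cycles in an 8-sample frame.
frame = [math.cos(2 * math.pi * 2 * n / 8) for n in range(8)]
spec = dft(frame)
magnitude = [abs(x) for x in spec]  # analogous to np.abs(x)
phase = [cmath.phase(x) for x in spec]  # analogous to np.angle(x)
```

The energy concentrates in frequency bin 2, where the magnitude is about 4 (half the frame length) and the phase is about 0.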

See also

librosa.stft

classmethod from_signal(sig_obj: BaseSignal, frame_len: int, hop_len: Optional[int] = None, center: bool = True, pad_mode: str = 'constant', window: Union[str, float, Tuple] = 'hamming')[source]

Transform a signal into spectrogram frames.

Parameters:

sig_obj (BaseSignal) – Signal object.
frame_len (int) – Number of samples per frame.
hop_len (int, optional, default=None) – Number of samples between frame starting points. If None, uses frame_len // 4.
center (bool, default=True) – Whether to center the frames and apply padding.
pad_mode (str, default='constant') – How the signal is padded before framing. See numpy.pad(). Uses the default value 0 for ‘constant’ padding. Ignored if center=False.
window (str, float, or tuple, default='hamming') – The window that is applied before the STFT.

class mexca.audio.features.FormantFrames(frames: List, sr: int, frame_len: int, hop_len: int, center: bool = True, pad_mode: str = 'constant', max_formants: int = 5, lower: float = 50.0, upper: float = 5450.0, preemphasis_from: Optional[float] = 50.0, window: Optional[Union[str, float, Tuple]] = 'praat_gaussian')[source]

Bases: BaseFrames

Estimate and store formant frames.

Parameters:

frames (list) – Formant frames. Each frame contains a list of tuples for each formant, where the first item is the central frequency and the second the bandwidth.
max_formants (int, default=5) – The maximum number of formants that were extracted.
lower (float, default=50.0) – Lower limit for formant frequencies (in Hz).
upper (float, default=5450.0) – Upper limit for formant frequencies (in Hz).
preemphasis_from (float, default=50.0) – Starting value for the applied preemphasis function.
window (str) – Window function that was applied before formant estimation.

Notes

Estimate formants of the signal in each frame:

Apply a preemphasis function with the coefficient math.exp(-2 * math.pi * preemphasis_from * (1 / sr)) to the signal.
Apply a window function to the signal. By default, the same Gaussian window as in Praat is used: (np.exp(-48.0 * (n - (N + 1) / 2) ** 2 / (N + 1) ** 2) - np.exp(-12.0)) / (1.0 - np.exp(-12.0)), where N is the length of the window and n the index of each sample.
Calculate linear predictive coefficients using librosa.lpc() with order 2 * max_formants.
Find the roots of the coefficients.
Compute the formant central frequencies as np.abs(np.arctan2(np.imag(roots), np.real(roots))) * sr / (2 * math.pi).
Compute the formant bandwidth as np.sqrt(np.abs(np.real(roots) ** 2) + np.abs(np.imag(roots) ** 2)) * sr / (2 * math.pi).
Filter out formants outside the lower and upper limits.
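
Some of the steps above can be sketched with the standard library alone. The helpers below are hypothetical and only illustrate the preemphasis coefficient, the Gaussian window, and the conversion of one LPC root to a central frequency. The sketch assumes that the sample index n runs from 1 to N; the LPC step itself is omitted.

```python
import cmath
import math
from typing import List


def preemphasis_coef(preemphasis_from: float, sr: int) -> float:
    """Coefficient of the preemphasis function."""
    return math.exp(-2 * math.pi * preemphasis_from * (1 / sr))


def praat_gaussian_window(n_window: int) -> List[float]:
    """Gaussian window as described above (assumption: n = 1, ..., N)."""
    edge = math.exp(-12.0)
    return [
        (math.exp(-48.0 * (n - (n_window + 1) / 2) ** 2 / (n_window + 1) ** 2) - edge)
        / (1.0 - edge)
        for n in range(1, n_window + 1)
    ]


def root_to_frequency(root: complex, sr: int) -> float:
    """Central frequency (in Hz) that corresponds to one LPC root."""
    return abs(math.atan2(root.imag, root.real)) * sr / (2 * math.pi)


coef = preemphasis_coef(50.0, sr=16000)
window = praat_gaussian_window(5)
# A root on the unit circle at an angle that corresponds to 1000 Hz.
freq = root_to_frequency(cmath.exp(2j * math.pi * 1000 / 16000), sr=16000)
```

The window peaks at 1 in the center and is symmetric, and the example root maps back to approximately 1000 Hz.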

property idx: numpy.ndarray[source]: Frame indices (read-only).

classmethod from_frames(sig_frames_obj: BaseFrames, max_formants: int = 5, lower: float = 50.0, upper: float = 5450.0, preemphasis_from: Optional[float] = 50.0, window: Optional[Union[str, float, Tuple]] = 'praat_gaussian')[source]

Extract formants from signal frames.

Parameters:

sig_frames_obj (BaseFrames) – Signal frames object.
max_formants (int, default=5) – The maximum number of formants to extract.
lower (float, default=50.0) – Lower limit for formant frequencies (in Hz).
upper (float, default=5450.0) – Upper limit for formant frequencies (in Hz).
preemphasis_from (float, default=50.0) – Starting value for the preemphasis function.
window (str, float, or tuple, default='praat_gaussian') – Window function that is applied before formant estimation.

class mexca.audio.features.PitchHarmonicsFrames(frames: numpy.ndarray, sr: int, frame_len: int, hop_len: int, center: bool = True, pad_mode: str = 'constant', n_harmonics: int = 100)[source]

Bases: BaseFrames

Estimate and store voice pitch harmonics.

Compute the energy of the signal at harmonics (nF0 for any integer n) of the fundamental frequency.

Parameters:

frames (numpy.ndarray) – Harmonics frames with the shape (num_frames, n_harmonics)
n_harmonics (int, default=100) – Number of estimated harmonics.

See also

librosa.f0_harmonics

classmethod from_spec_and_pitch_frames(spec_frames_obj: SpecFrames, pitch_frames_obj: PitchFrames, n_harmonics: int = 100)[source]

Estimate voice pitch harmonics from spectrogram frames and voice pitch frames.

Parameters:

spec_frames_obj (SpecFrames) – Spectrogram frames object.
pitch_frames_obj (PitchFrames) – Pitch frames object.
n_harmonics (int, default=100) – Number of harmonics to estimate.

class mexca.audio.features.FormantAmplitudeFrames(frames: numpy.ndarray, sr: int, frame_len: int, hop_len: int, center: bool, pad_mode: str, lower: float, upper: float, rel_f0: bool)[source]

Bases: BaseFrames

Estimate and store formant amplitudes.

Parameters:

frames (np.ndarray) – Formant amplitude frames of shape (num_frames, max_formants) in dB.
lower (float) – Lower boundary for peak amplitude search interval.
upper (float) – Upper boundary for peak amplitude search interval.
rel_f0 (bool) – Whether the amplitude is relative to the fundamental frequency amplitude.

Notes

Estimate the formant amplitude as the maximum amplitude of harmonics of the fundamental frequency within an interval [lower*f, upper*f] where f is the central frequency of the formant in each frame. If rel_f0=True, divide the amplitude by the amplitude of the fundamental frequency.
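
A minimal sketch of this search for a single frame and a single formant follows. The function `formant_amplitude` is hypothetical. It assumes linear (not dB) harmonic amplitudes, that the k-th entry of `harmonic_amps` belongs to the frequency k * F0, and that the first entry is the amplitude of the fundamental frequency.

```python
from typing import List, Optional


def formant_amplitude(
    harmonic_amps: List[float],
    f0: float,
    formant_freq: float,
    lower: float = 0.8,
    upper: float = 1.2,
    rel_f0: bool = True,
) -> Optional[float]:
    """Peak harmonic amplitude within [lower * f, upper * f] (sketch)."""
    candidates = [
        amp
        for k, amp in enumerate(harmonic_amps, start=1)  # harmonic k is at k * f0
        if lower * formant_freq <= k * f0 <= upper * formant_freq
    ]
    if not candidates:
        return None  # no harmonic falls inside the search interval
    peak = max(candidates)
    # Assumption: harmonic_amps[0] is the amplitude at the fundamental frequency.
    return peak / harmonic_amps[0] if rel_f0 else peak


amps = [1.0, 0.5, 0.25, 0.8, 0.4, 0.2]  # harmonics at 100, 200, ..., 600 Hz
amp = formant_amplitude(amps, f0=100.0, formant_freq=450.0)
```

For a formant at 450 Hz, the interval [360, 540] Hz contains the harmonics at 400 Hz and 500 Hz; the larger of the two amplitudes is 0.8.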

property idx: numpy.ndarray[source]: Frame indices (read-only).

classmethod from_formant_harmonics_and_pitch_frames(formant_frames_obj: FormantFrames, harmonics_frames_obj: PitchHarmonicsFrames, pitch_frames_obj: PitchFrames, lower: float = 0.8, upper: float = 1.2, rel_f0: bool = True)[source]

Estimate formant amplitudes from formant, pitch harmonics, and pitch frames.

Parameters:

formant_frames_obj (FormantFrames) – Formant frames object.
harmonics_frames_obj (PitchHarmonicsFrames) – Pitch harmonics frames object.
pitch_frames_obj (PitchFrames) – Pitch frames object.
lower (float, optional, default=0.8) – Lower boundary for peak amplitude search interval.
upper (float, optional, default=1.2) – Upper boundary for peak amplitude search interval.
rel_f0 (bool, optional, default=True) – Whether the amplitude is divided by the fundamental frequency amplitude.

class mexca.audio.features.PitchPulseFrames(frames: List[Tuple], sr: int, frame_len: int, hop_len: int, center: bool = True, pad_mode: str = 'constant')[source]

Bases: BaseFrames

Extract and store glottal pulse frames.

Glottal pulses are peaks in the signal corresponding to the fundamental frequency F0.

Parameters:

frames (list) – Pulse frames. Each frame contains a list of pulses or an empty list if no pulses are detected. Pulses are stored as tuples (pulse timestamp, T0, amplitude).

Notes

Extract glottal pulses with these steps:

Interpolate the fundamental frequency at the timestamps of the framed (padded) signal.
Start at the mid point m of each frame and create an interval [start, stop], where start=m-T0/2 and stop=m+T0/2 and T0 is the fundamental period (1/F0).
Detect pulses in the interval by:
1. Find the maximum amplitude in an interval within the frame.
2. Compute the fundamental period T0_new at the timestamp of the maximum m_new.
Shift the interval recursively to the right or left until the edges of the frame are reached:
1. When shifting to the left, set start_new=m_new-1.25*T0_new and stop_new=m_new-0.8*T0_new.
2. When shifting to the right, set start_new=m_new+0.8*T0_new and stop_new=m_new+1.25*T0_new.
Filter out duplicate pulses.
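
The search can be sketched for a single frame under simplifying assumptions: the fundamental frequency is constant within the frame (so no interpolation is needed), search intervals are clipped to the frame edges, and the recursion is written as two loops. The function `find_pulses` is hypothetical and not the library's implementation.

```python
import math
from typing import List, Optional, Tuple


def find_pulses(frame: List[float], sr: int, f0: float) -> List[Tuple[float, float, float]]:
    """Return pulses as (timestamp, T0, amplitude) tuples for one frame (sketch)."""
    t0 = 1.0 / f0  # fundamental period in seconds (assumed constant)
    n = len(frame)

    def argmax_between(start: float, stop: float) -> Optional[int]:
        # Index of the maximum amplitude in [start, stop], clipped to the frame.
        lo = max(0, math.ceil(start * sr))
        hi = min(n - 1, math.floor(stop * sr))
        if lo > hi:
            return None
        return max(range(lo, hi + 1), key=lambda i: frame[i])

    mid = (n - 1) / 2 / sr  # midpoint of the frame in seconds
    first = argmax_between(mid - t0 / 2, mid + t0 / 2)
    if first is None:
        return []
    found = {first}  # a set filters out duplicate pulses
    for direction in (-1, 1):  # shift left first, then right
        m = first / sr
        while True:
            if direction < 0:
                start, stop = m - 1.25 * t0, m - 0.8 * t0
            else:
                start, stop = m + 0.8 * t0, m + 1.25 * t0
            i = argmax_between(start, stop)
            if i is None or i in found:
                break  # edge of the frame reached
            found.add(i)
            m = i / sr
    return [(i / sr, t0, frame[i]) for i in sorted(found)]


# Synthetic frame: one unit pulse every 10 samples (100 Hz at sr = 1000).
sr = 1000
frame = [1.0 if i % 10 == 5 else 0.0 for i in range(50)]
pulses = find_pulses(frame, sr=sr, f0=100.0)
```

In this synthetic frame, the search starts at the pulse nearest the midpoint (sample 25) and then recovers the remaining pulses on both sides.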

property idx: numpy.ndarray[source]: Frame indices (read-only).

classmethod from_signal_and_pitch_frames(sig_obj: BaseSignal, pitch_frames_obj: PitchFrames)[source]

Extract glottal pulse frames from a signal and voice pitch frames.

Parameters:

sig_obj (BaseSignal) – Signal object.
pitch_frames_obj (PitchFrames) – Voice pitch frames object.

class mexca.audio.features.PitchPeriodFrames(frames: numpy.ndarray, sr: int, frame_len: int, hop_len: int, center: bool, pad_mode: str, lower: float, upper: float)[source]

Bases: BaseFrames

Create and store signal frames.

A frame is an (overlapping, padded) slice of a signal for which higher-order features can be computed.

Parameters:

frames (numpy.ndarray) – Signal frames. The first dimension should be the number of frames.
sr (int) – Sampling rate.
frame_len (int) – Number of samples per frame.
hop_len (int) – Number of samples between frame starting points.
center (bool, default=True) – Whether the signal has been centered and padded before framing.
pad_mode (str, default='constant') – How the signal has been padded before framing. See numpy.pad(). Uses the default value 0 for ‘constant’ padding.

See also

librosa.util.frame

class mexca.audio.features.JitterFrames(frames: numpy.ndarray, sr: int, frame_len: int, hop_len: int, center: bool, pad_mode: str, rel: bool, lower: float, upper: float, max_period_ratio: float)[source]

Bases: PitchPeriodFrames

Extract and store voice jitter frames.

Parameters:

frames (numpy.ndarray) – Voice jitter frames of shape (num_frames,).
rel (bool) – Whether the voice jitter is relative to the average period length.
lower (float) – Lower limit for periods between glottal pulses.
upper (float) – Upper limit for periods between glottal pulses.
max_period_ratio (float) – Maximum ratio between consecutive periods used for jitter extraction.

Notes

Compute jitter as the average absolute difference between consecutive fundamental periods with a ratio below max_period_ratio for each frame. If rel=True, jitter is divided by the average fundamental period of each frame. Fundamental periods are calculated as the first-order temporal difference between consecutive glottal pulses.

classmethod from_pitch_pulse_frames(pitch_pulse_frames_obj: PitchPulseFrames, rel: bool = True, lower: float = 0.0001, upper: float = 0.02, max_period_ratio: float = 1.3)[source]

Extract voice jitter frames from glottal pulse frames.

Parameters:

pitch_pulse_frames_obj (PitchPulseFrames) – Glottal pulse frames object.
rel (bool, optional, default=True) – Divide jitter by the average pitch period.
lower (float, optional, default=0.0001) – Lower limit for periods between glottal pulses.
upper (float, optional, default=0.02) – Upper limit for periods between glottal pulses.
max_period_ratio (float, optional, default=1.3) – Maximum ratio between consecutive periods for jitter extraction.

class mexca.audio.features.ShimmerFrames(frames: List[Tuple], sr: int, frame_len: int, hop_len: int, center: bool, pad_mode: str, rel: bool, lower: float, upper: float, max_period_ratio: float, max_amp_factor: float)[source]

Bases: PitchPeriodFrames

Extract and store voice shimmer frames.

Parameters:

frames (numpy.ndarray) – Voice shimmer frames of shape (num_frames,).
rel (bool) – Whether the voice shimmer is relative to the average amplitude.
lower (float) – Lower limit for periods between glottal pulses.
upper (float) – Upper limit for periods between glottal pulses.
max_period_ratio (float) – Maximum ratio between consecutive periods used for shimmer extraction.
max_amp_factor (float) – Maximum ratio between consecutive amplitudes used for shimmer extraction.

Notes

Compute shimmer as the average absolute difference between consecutive pitch amplitudes with a fundamental period ratio below max_period_ratio and amplitude ratio below max_amp_factor for each frame. If rel=True, shimmer is divided by the average amplitude of each frame. Fundamental periods are calculated as the first-order temporal difference between consecutive glottal pulses. Amplitudes are signal amplitudes at the glottal pulses.

classmethod from_pitch_pulse_frames(pitch_pulse_frames_obj: PitchPulseFrames, rel: bool = True, lower: float = 0.0001, upper: float = 0.02, max_period_ratio: float = 1.3, max_amp_factor: float = 1.6)[source]

Extract voice shimmer frames from glottal pulse frames.

Parameters:

pitch_pulse_frames_obj (PitchPulseFrames) – Glottal pulse frames object.
rel (bool, optional, default=True) – Divide shimmer by the average amplitude.
lower (float, optional, default=0.0001) – Lower limit for periods between glottal pulses.
upper (float, optional, default=0.02) – Upper limit for periods between glottal pulses.
max_period_ratio (float, optional, default=1.3) – Maximum ratio between consecutive periods for shimmer extraction.
max_amp_factor (float, optional, default=1.6) – Maximum ratio between consecutive amplitudes used for shimmer extraction.

class mexca.audio.features.HnrFrames(frames: numpy.ndarray, sr: int, frame_len: int, hop_len: int, center: bool, pad_mode: str, lower: float, rel_silence_threshold)[source]

Bases: BaseFrames

Estimate and store harmonics-to-noise ratios (HNRs).

Parameters:

frames (numpy.ndarray) – HNR frames in dB with shape (num_frames,).
lower (float) – Lower fundamental frequency limit for choosing pitch candidates.
rel_silence_threshold (float) – Relative threshold for treating signal frames as silent.

Notes

Estimate the HNR for each signal frame with np.max(np.abs(frames), axis=1) > rel_silence_threshold*np.max(np.abs(frames)) by:

Compute the autocorrelation function (ACF) using the short-time Fourier transform (STFT).
Find the lags of peaks in the ACF excluding the zero-th lag.
Filter out peaks that correspond to pitch candidates below lower and above the Nyquist frequency.
Compute the harmonic component R0 as the highest of the remaining peaks divided by the ACF at lag zero.
Compute the HNR as R0/(1-R0) and convert to dB.
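
These steps can be sketched for a single, non-silent frame. The function `hnr_db` is hypothetical. For brevity it computes the ACF directly in the time domain instead of via the Fourier transform, and it assumes the dB conversion is 10 * log10 of the ratio; the silence check is left out.

```python
import math
from typing import List, Optional


def hnr_db(frame: List[float], sr: int, lower: float = 75.0) -> Optional[float]:
    """Harmonics-to-noise ratio in dB for one frame (illustrative sketch)."""
    n = len(frame)
    # Direct (biased) autocorrelation; an FFT-based version gives the same values.
    acf = [sum(frame[i] * frame[i + lag] for i in range(n - lag)) for lag in range(n)]
    if n < 4 or acf[0] <= 0:
        return None
    min_lag = 2  # pitch candidates above the Nyquist frequency are excluded
    max_lag = min(n - 2, int(sr / lower))  # pitch candidates below `lower` are excluded
    peaks = [
        acf[lag]
        for lag in range(min_lag, max_lag + 1)
        if acf[lag - 1] < acf[lag] >= acf[lag + 1]  # local maximum
    ]
    if not peaks:
        return None
    r0 = max(peaks) / acf[0]  # harmonic component
    if not 0.0 < r0 < 1.0:
        return None
    return 10 * math.log10(r0 / (1 - r0))  # assumption: power ratio in dB


# 100 Hz sine sampled at 2000 Hz (period of 20 samples), 200 samples long.
frame = [math.sin(2 * math.pi * i / 20) for i in range(200)]
hnr = hnr_db(frame, sr=2000)
```

For this sine, the highest ACF peak lies at a lag of 20 samples with R0 close to 0.9, which gives an HNR of about 9.5 dB; the value stays below 1 only because the biased ACF shrinks with increasing lag.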

classmethod from_frames(sig_frames_obj: BaseFrames, lower: float = 75.0, rel_silence_threshold: float = 0.1)[source]

Estimate the HNR from signal frames.

Parameters:

sig_frames_obj (BaseFrames) – Signal frames object.
lower (float, default=75.0) – Lower fundamental frequency limit for choosing pitch candidates.
rel_silence_threshold (float, default=0.1) – Relative threshold for treating signal frames as silent.
