mexca.video.extraction

Facial feature extraction from videos.

Module Contents

Classes

VideoDataset

Custom torch dataset for a video file.

FaceExtractor

Combine steps to extract features from faces in a video file.

Functions

cli()

Command line interface for extracting facial features.

exception mexca.video.extraction.NotEnoughFacesError(msg: str)[source]

Fewer faces were detected than num_faces.

Clustering cannot be performed if the number of samples is smaller than the number of clusters.

Parameters:

msg (str) – Error message.

class mexca.video.extraction.VideoDataset(video_file: str, skip_frames: int = 1, start: float = 0, end: float | None = None)[source]

Custom torch dataset for a video file.

When initialized, reads only the frame timestamps of the video, not the frames themselves. The video is decoded frame by frame when items are accessed.

Parameters:
  • video_file (str) – Path to the video file.

  • skip_frames (int, default=1) – Only load every nth frame.

  • start (float, default=0) – Start of the subclip of the video to be loaded (in seconds).

  • end (float, optional, default=None) – End of the subclip of the video to be loaded (in seconds).

file_name

Name of the video file.

Type:

str

video_pts

Timestamps of video frames.

Type:

torch.Tensor

video_frames_idx

Indices of video frames.

Type:

torch.Tensor

video_fps

Frames per second.

Type:

int

video_frames

Indices of loaded frames.

Type:

numpy.ndarray

property duration: float[source]

Duration of the video (read-only).

__len__() int[source]

Number of video frames.

__getitem__(idx: int) Dict[str, torch.Tensor][source]

Get an item from the dataset.

Loads the video frame into memory.

Parameters:

idx (int) – Index of the item in the dataset.

Returns:

Dictionary with ‘Image’ containing the video frame (T, H, W, C) and ‘Frame’ containing the frame index.

Return type:

dict
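
A minimal usage sketch; video.mp4 is a hypothetical file name and skip_frames=5 an illustrative choice:

>>> from mexca.video.extraction import VideoDataset
>>> dataset = VideoDataset("video.mp4", skip_frames=5)  # load every 5th frame
>>> n_frames = len(dataset)  # number of frames that will be loaded
>>> item = dataset[0]  # decodes the first selected frame into memory
>>> item["Image"].shape  # frame tensor with dimensions (T, H, W, C)
>>> item["Frame"]  # index of the frame in the original video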

class mexca.video.extraction.FaceExtractor(num_faces: int | None, min_face_size: int = 20, thresholds: Tuple[float] = (0.6, 0.7, 0.7), factor: float = 0.709, post_process: bool = True, select_largest: bool = True, selection_method: str | None = 'num_faces', keep_all: bool = True, device: torch.device = torch.device(type='cpu'), clusterer: sklearn.base.ClusterMixin | None = None, embeddings_model: str = 'vggface2', post_min_face_size: Tuple[float, float] = (45.0, 45.0), au_model: str | None = None)[source]

Combine steps to extract features from faces in a video file.

Parameters:
  • num_faces (int, optional) – Number of faces to identify. Must not be None if clusterer=None.

  • min_face_size (int, default=20) – Minimum face size (in pixels) for face detection in MTCNN.

  • thresholds (tuple, default=(0.6, 0.7, 0.7)) – Thresholds for face detection in MTCNN.

  • factor (float, default=0.709) – Factor used to create a scaling pyramid of face sizes in MTCNN.

  • post_process (bool, default=True) – Whether detected faces are post-processed before computing embeddings. Post-processing standardizes the detected faces.

  • select_largest (bool, default=True) – Whether to return the largest face or the one with the highest probability if multiple faces are detected.

  • selection_method ({None, 'num_faces', 'probability', 'largest', 'largest_over_threshold', 'center_weighted_size'}, optional, default='num_faces') – The heuristic used for selecting detected faces. If not None, overrides select_largest. The default 'num_faces' selects at most num_faces faces per frame.

  • keep_all (bool, default=True) – Whether all faces should be returned in the order of select_largest.

  • device (torch.device, optional, default=torch.device("cpu")) – The device on which face detection and embedding computations are performed.

  • clusterer (sklearn.base.ClusterMixin, optional, default=None) – Class instance from sklearn.cluster used for clustering face embeddings. If None (default), creates a sklearn.cluster.SpectralClustering instance with n_clusters=num_faces. For large datasets, sklearn.cluster.KMeans is recommended to avoid memory issues.

  • embeddings_model ({'vggface2', 'casia-webface'}, default='vggface2') – Pretrained Inception Resnet V1 model for computing face embeddings.

  • post_min_face_size (tuple, default=(45.0, 45.0)) – Minimum width and height (in pixels) for filtering out faces after detection. This can be useful for excluding small faces before clustering their embeddings and can improve clustering performance.

  • au_model (str, optional, default=None) – Pretrained MEFARG model on Hugging Face Hub for extracting facial action unit activations. If None, uses the default model mexca/mefarg-open-graph-au-resnet50-stage-2.
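
A minimal construction sketch; num_faces=2 and n_clusters=2 are hypothetical choices. The KMeans variant follows the recommendation above for large datasets:

>>> from mexca.video.extraction import FaceExtractor
>>> extractor = FaceExtractor(num_faces=2)  # default spectral clustering with 2 clusters
>>> from sklearn.cluster import KMeans
>>> extractor_km = FaceExtractor(num_faces=None, clusterer=KMeans(n_clusters=2))  # recommended for large datasets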

property detector: facenet_pytorch.MTCNN[source]

The MTCNN model for face detection and extraction. See facenet-pytorch for details.

property encoder: facenet_pytorch.InceptionResnetV1[source]

The Inception Resnet V1 model for computing face embeddings. See facenet-pytorch for details.

property clusterer: sklearn.base.ClusterMixin[source]

The clusterer instance from sklearn.cluster.

property extractor: mexca.video.mefarg.MEFARG[source]

The MEFARG model for extracting action unit activations. See ME-GraphAU model and paper for details.

__call__(**callargs) mexca.data.VideoAnnotation[source]

Alias for apply.

detect(frame: numpy.ndarray | torch.Tensor) Tuple[List[torch.Tensor], List[numpy.ndarray] | numpy.ndarray, List[numpy.ndarray] | numpy.ndarray, List[numpy.ndarray] | numpy.ndarray][source]

Detect faces in a video frame.

Parameters:

frame (numpy.ndarray or torch.Tensor) – Batch of B frames containing RGB values with dimensions (B, H, W, 3).

Returns:

  • faces (list) – Batch of B tensors containing the N cropped face images from each batched frame with dimensions (N, 3, 160, 160). Is None if a frame contains no faces.

  • boxes (numpy.ndarray or list) – Batch of B bounding boxes of the N detected faces as (x1, y1, x2, y2) coordinates with dimensions (B, N, 4). Returns a list if different numbers of faces are detected across batched frames. Is None if a frame contains no faces.

  • probs (numpy.ndarray or list) – Probabilities of the detected faces (B, N). Returns a list if different numbers of faces are detected across batched frames. Is None if a frame contains no faces.

  • landmarks (numpy.ndarray or list) – Batch of B facial landmarks for the N detected faces as (x, y) coordinates with dimensions (B, N, 5, 2). Returns a list if different numbers of faces are detected across batched frames. Is None if a frame contains no faces.
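
A sketch of a single call, assuming extractor is a FaceExtractor instance (see the construction sketch above); the blank frame is synthetic and serves only to show the expected input shape:

>>> import numpy as np
>>> frames = np.zeros((1, 720, 1280, 3), dtype=np.uint8)  # batch of one blank RGB frame (B, H, W, 3)
>>> faces, boxes, probs, landmarks = extractor.detect(frames)
>>> faces[0] is None  # True for a frame without faces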

encode(faces: torch.Tensor) numpy.ndarray[source]

Compute embeddings for face images.

Parameters:

faces (torch.Tensor) – N cropped face images from a video frame with dimensions (N, 3, H, W). H and W must be at least 80 for the encoding to work.

Returns:

Embeddings of the N face images with dimensions (N, 512).

Return type:

numpy.ndarray
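
Continuing the sketch, assuming detect() returned N cropped faces for the first frame:

>>> embeddings = extractor.encode(faces[0])  # cropped faces with dimensions (N, 3, 160, 160)
>>> embeddings.shape  # (N, 512)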

identify(embeddings: numpy.ndarray) numpy.ndarray[source]

Cluster faces based on their embeddings.

Parameters:

embeddings (numpy.ndarray) – Embeddings of the N face images with dimensions (N, E) where E is the length of the embedding vector.

Returns:

Cluster indices for the N face embeddings.

Return type:

numpy.ndarray
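
Continuing the sketch with the embeddings from encode(); note that clustering requires at least as many samples as clusters (see NotEnoughFacesError above):

>>> labels = extractor.identify(embeddings)  # one cluster index per face embedding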

extract(frame: numpy.ndarray | torch.Tensor) List[numpy.ndarray] | numpy.ndarray[source]

Detect facial action unit activations.

Parameters:

frame (numpy.ndarray or torch.Tensor) – Batch of B frames containing RGB values with dimensions (B, H, W, 3).

Returns:

aus – Batch of B action unit activations for N detected faces with dimensions (N, 41). Is None if a frame contains no faces.

Return type:

numpy.ndarray or list
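
A sketch reusing the frames batch from the detect() example:

>>> aus = extractor.extract(frames)  # action unit activations for the detected faces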

compute_avg_embeddings(embeddings: numpy.ndarray, labels: numpy.ndarray) dict[source]

Compute the average embedding vector for each face detected in the video.

Parameters:
  • embeddings (numpy.ndarray) – Embeddings of the N face images with dimensions (N, E) where E is the length of the embedding vector.

  • labels (numpy.ndarray) – Cluster labels for the N face embeddings.

Returns:

average embeddings – Dictionary mapping each face label to the average embedding vector for that label.

Return type:

dict
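
Continuing the sketch with the embeddings and labels from encode() and identify():

>>> avg_embeddings = extractor.compute_avg_embeddings(embeddings, labels)
>>> sorted(avg_embeddings)  # one key per face label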

compute_confidence(embeddings: numpy.ndarray, labels: numpy.ndarray) numpy.ndarray[source]

Compute face label classification confidence.

Parameters:
  • embeddings (numpy.ndarray) – Embeddings of the N face images with dimensions (N, E) where E is the length of the embedding vector.

  • labels (numpy.ndarray) – Cluster labels for the N face embeddings.

Returns:

confidence – Confidence scores between 0 and 1. Returns numpy.nan if no label was assigned to a face.

Return type:

numpy.ndarray
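
Continuing the sketch:

>>> confidence = extractor.compute_confidence(embeddings, labels)  # scores in [0, 1]; numpy.nan if unlabeled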

apply(filepath: str, batch_size: int = 1, skip_frames: int = 1, process_subclip: Tuple[float | None] = (0, None), cluster_embeddings: bool = True, return_embeddings: bool = False, show_progress: bool = True) mexca.data.VideoAnnotation[source]

Apply multiple steps to extract features from faces in a video file.

This method calls the other methods in sequence for each frame of a video file to detect and cluster faces. It also extracts facial landmarks and action units.

Parameters:
  • filepath (str) – Path to the video file.

  • batch_size (int, default=1) – Size of the batch of video frames that are loaded and processed at the same time.

  • skip_frames (int, default=1) – Only process every nth frame, starting at 0.

  • process_subclip (tuple, default=(0, None)) – Process only a part of the video clip. Must be the start and end of the subclip in seconds.

  • cluster_embeddings (bool, default=True) – Whether to cluster the face embeddings using the clusterer instance (by default, spectral clustering).

  • return_embeddings (bool, default=False) – Whether to return embedding vectors for each detected face.

  • show_progress (bool, default=True) – Whether to display a progress bar.

Returns:

A data class object with extracted facial features.

Return type:

VideoAnnotation
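
A minimal end-to-end sketch; video.mp4 is a hypothetical file name, and the batch size and frame skipping are illustrative choices:

>>> extractor = FaceExtractor(num_faces=2)
>>> annotation = extractor.apply("video.mp4", batch_size=8, skip_frames=5)
>>> # annotation is a mexca.data.VideoAnnotation containing the extracted facial features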

mexca.video.extraction.cli()[source]

Command line interface for extracting facial features. See extract-faces -h for details.