mexca.video

Facial feature extraction from videos.

Module Contents

Classes

VideoDataset

Custom torch dataset for a video file.

FaceExtractor

Combine steps to extract features from faces in a video file.

Functions

cli()

Command line interface for extracting facial features.

Attributes

EMPTY_VALUE

Value that is returned if no faces are detected in a video frame.

mexca.video.EMPTY_VALUE

Value that is returned if no faces are detected in a video frame.

exception mexca.video.NotEnoughFacesError(msg: str)

Bases: Exception

Fewer faces were detected than num_faces.

Clustering cannot be performed if there are fewer samples than clusters.

Parameters:

msg (str) – Error message.
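
The condition behind this exception can be sketched as follows. This is a minimal illustration, not mexca's implementation: the local NotEnoughFacesError class only stands in for mexca.video.NotEnoughFacesError, and check_enough_faces is a hypothetical helper.

```python
class NotEnoughFacesError(Exception):
    """Local stand-in for mexca.video.NotEnoughFacesError (illustration only)."""

    def __init__(self, msg: str):
        super().__init__(msg)
        self.msg = msg


def check_enough_faces(num_detected: int, num_faces: int) -> None:
    """Hypothetical helper: raise if there are fewer samples than clusters."""
    if num_detected < num_faces:
        raise NotEnoughFacesError(
            f"Detected {num_detected} faces but num_faces={num_faces}; "
            "clustering needs at least as many samples as clusters."
        )


try:
    check_enough_faces(num_detected=1, num_faces=2)
except NotEnoughFacesError as err:
    message = err.msg
```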

class mexca.video.VideoDataset(video_file: str, skip_frames: int = 1, start: float = 0, end: Optional[float] = None)

Bases: torch.utils.data.Dataset

Custom torch dataset for a video file.

On initialization, only the frame timestamps of the video are read, not the frames themselves. The video is decoded frame by frame.

Parameters:
  • video_file (str) – Path to the video file.

  • skip_frames (int, default=1) – Only load every nth frame.

  • start (float, default=0) – Start of the subclip of the video to be loaded (in seconds).

  • end (float, optional, default=None) – End of the subclip of the video to be loaded (in seconds).
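
How skip_frames, start, and end might interact is sketched below in plain Python. This is an assumption about the selection logic, not the class's actual code: select_frame_indices is a hypothetical helper, and both the inclusive end boundary and applying the skip after cutting the subclip are guesses made for illustration.

```python
from typing import List, Optional


def select_frame_indices(
    timestamps: List[float],
    skip_frames: int = 1,
    start: float = 0.0,
    end: Optional[float] = None,
) -> List[int]:
    """Hypothetical helper: indices of every nth frame inside [start, end]."""
    if skip_frames < 1:
        raise ValueError("skip_frames must be a positive integer")
    in_subclip = [
        idx
        for idx, pts in enumerate(timestamps)
        if pts >= start and (end is None or pts <= end)
    ]
    # Keep every nth frame of the subclip (assumed order of operations).
    return in_subclip[::skip_frames]


# Ten frames at 5 frames per second: timestamps 0.0, 0.2, ..., 1.8 seconds.
timestamps = [i / 5 for i in range(10)]
selected = select_frame_indices(timestamps, skip_frames=2, start=0.4, end=1.4)
```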

file_name

Name of the video file.

Type:

str

video_pts

Timestamps of video frames.

Type:

torch.Tensor

video_frames_idx

Indices of video frames.

Type:

torch.Tensor

video_fps

Frames per second.

Type:

int

video_frames

Indices of loaded frames.

Type:

numpy.ndarray

property duration: float

Duration of the video (read-only).

__len__() → int

Number of video frames.

__getitem__(idx: int) → Dict[str, torch.Tensor]

Get an item from the data set.

Loads the video frame into memory.

Parameters:

idx (int) – Index of the item in the dataset.

Returns:

Dictionary with ‘Image’ containing the video frame (T, H, W, C) and ‘Frame’ containing the frame index.

Return type:

dict

class mexca.video.FaceExtractor(num_faces: Optional[int], min_face_size: int = 20, thresholds: Tuple[float] = (0.6, 0.7, 0.7), factor: float = 0.709, post_process: bool = True, select_largest: bool = True, selection_method: Optional[str] = None, keep_all: bool = True, device: Optional[torch.device] = None, max_cluster_frames: Optional[int] = None, embeddings_model: str = 'vggface2', au_model: str = 'xgb', landmark_model: str = 'mobilefacenet')

Combine steps to extract features from faces in a video file.

Parameters:
  • num_faces (int, optional) – Number of faces to identify.

  • min_face_size (int, default=20) – Minimum size required for detected faces (in pixels).

  • thresholds (tuple, default=(0.6, 0.7, 0.7)) – Face detection thresholds.

  • factor (float, default=0.709) – Factor used to create a scaling pyramid of face sizes.

  • post_process (bool, default=True) – Whether detected faces are post processed before computing embeddings.

  • select_largest (bool, default=True) – Whether to return the largest face or the one with the highest probability if multiple faces are detected.

  • selection_method ({None, 'probability', 'largest', 'largest_over_threshold', 'center_weighted_size'}, optional, default=None) – The heuristic used for selecting detected faces. If not None, overrides select_largest.

  • keep_all (bool, default=True) – Whether all faces should be returned in the order of select_largest.

  • device (torch.device, optional, default=None) – The device on which face detection and embedding computations are performed.

  • max_cluster_frames (int, optional, default=None) – Maximum number of frames that are used for spectral clustering. If the number of frames exceeds the maximum, hierarchical clustering is applied first to reduce the frames to this number. This can reduce the computational costs for long videos.

  • embeddings_model ({'vggface2', 'casia-webface'}, default='vggface2') – Pretrained Inception Resnet V1 model for computing face embeddings.

  • au_model ({'xgb', 'svm'}, default='xgb') – Pretrained model for predicting facial action unit activations.

  • landmark_model ({'mobilefacenet', 'mobilenet', 'pfld'}, default='mobilefacenet') – Pretrained model for detecting facial landmarks.

Notes

For details on the available pretrained models for facial action unit and landmark detection, see the documentation of py-feat. The pretrained action unit models return different outputs: ‘xgb’ returns continuous values (0-1), whereas ‘svm’ returns binary (0, 1) values.
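
Because the two action unit models produce outputs on different scales, downstream code may want to bring them onto a common scale before comparing them. A minimal sketch, assuming a cutoff of 0.5 (an arbitrary choice for illustration, not something mexca or py-feat prescribes) and a hypothetical helper binarize_activations:

```python
from typing import List


def binarize_activations(activations: List[float], cutoff: float = 0.5) -> List[int]:
    """Hypothetical helper: map continuous activations in [0, 1] to 0 or 1."""
    return [1 if value >= cutoff else 0 for value in activations]


xgb_like = [0.12, 0.81, 0.50, 0.33]  # made-up continuous values for illustration
binary = binarize_activations(xgb_like)
```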

property detector: facenet_pytorch.MTCNN

The MTCNN model for face detection and extraction. See facenet-pytorch for details.

property encoder: facenet_pytorch.InceptionResnetV1

The Inception ResnetV1 model for computing face embeddings. See facenet-pytorch for details.

property clusterer: spectralcluster.SpectralClusterer

The spectral clustering model for identifying faces based on embeddings. See spectralcluster for details.

property extractor: feat.detector.Detector

The model for extracting facial landmarks and action units. See py-feat for details.

__call__(**callargs) → mexca.data.VideoAnnotation

Alias for apply.

detect(frame: Union[numpy.ndarray, torch.Tensor]) → Tuple[List[torch.Tensor], Union[List[numpy.ndarray], numpy.ndarray], Union[List[numpy.ndarray], numpy.ndarray]]

Detect faces in a video frame.

Parameters:

frame (numpy.ndarray or torch.Tensor) – Batch of B frames containing RGB values with dimensions (B, H, W, 3).

Returns:

  • faces (list) – Batch of B tensors containing the N cropped face images from each batched frame with dimensions (N, 3, 160, 160). Is None if a frame contains no faces.

  • boxes (numpy.ndarray or list) – Batch of B bounding boxes of the N detected faces as (x1, y1, x2, y2) coordinates with dimensions (B, N, 4). Returns a list if different numbers of faces are detected across batched frames. Is None if a frame contains no faces.

  • probs (numpy.ndarray or list) – Probabilities of the detected faces (B, N). Returns a list if different numbers of faces are detected across batched frames. Is None if a frame contains no faces.
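
Since the entries for frames without faces are None, code that consumes these outputs has to guard against them. The sketch below uses plain Python lists as stand-ins for the tensors and arrays; count_faces_per_frame is a hypothetical helper, not part of mexca.

```python
from typing import List, Optional, Sequence


def count_faces_per_frame(boxes: Sequence[Optional[Sequence]]) -> List[int]:
    """Hypothetical helper: number of detected faces for each batched frame."""
    return [0 if frame_boxes is None else len(frame_boxes) for frame_boxes in boxes]


# Stand-in for a batch of three frames: two faces, no faces, one face.
batch_boxes = [
    [[10, 20, 110, 140], [200, 40, 290, 150]],
    None,
    [[55, 60, 150, 170]],
]
counts = count_faces_per_frame(batch_boxes)
```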

encode(faces: torch.Tensor) → numpy.ndarray

Compute embeddings for face images.

Parameters:

faces (torch.Tensor) – Cropped N face images from a video frame with dimensions (N, 3, H, W). H and W must at least be 80 for the encoding to work.

Returns:

Embeddings of the N face images with dimensions (N, 512).

Return type:

numpy.ndarray
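
The size constraint on faces can be checked before calling encode. A minimal sketch that works on a shape tuple, so that it does not require torch; check_face_batch_shape and MIN_SIDE are hypothetical names introduced for illustration:

```python
from typing import Tuple

MIN_SIDE = 80  # minimum H and W stated in the parameter description


def check_face_batch_shape(shape: Tuple[int, ...]) -> bool:
    """Hypothetical helper: True if shape is (N, 3, H, W) with H, W >= 80."""
    if len(shape) != 4 or shape[1] != 3:
        return False
    return shape[2] >= MIN_SIDE and shape[3] >= MIN_SIDE


ok = check_face_batch_shape((4, 3, 160, 160))
too_small = check_face_batch_shape((4, 3, 64, 64))
```

With a real tensor, the tuple could be obtained from tuple(faces.shape).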

identify(embeddings: numpy.ndarray) → numpy.ndarray

Cluster faces based on their embeddings.

Parameters:

embeddings (numpy.ndarray) – Embeddings of the N face images with dimensions (N, E) where E is the length of the embedding vector.

Returns:

Cluster indices for the N face embeddings.

Return type:

numpy.ndarray

extract(frame: Union[numpy.ndarray, torch.Tensor], boxes: Union[List[numpy.ndarray], numpy.ndarray]) → Tuple[List[List[numpy.ndarray]], List[numpy.ndarray]]

Detect facial action units and landmarks.

Parameters:
  • frame (numpy.ndarray or torch.Tensor) – Batch of B frames containing RGB values with dimensions (B, H, W, 3).

  • boxes (numpy.ndarray or list) – Batch of B bounding boxes of the N detected faces as (x1, y1, x2, y2) coordinates with dimensions (B, N, 4) or list of B elements with (N, 4).

Returns:

  • landmarks (list) – Batch of B facial landmarks for N detected faces as (x, y) coordinates with dimensions (68, 2). Is None if a frame contains no faces.

  • aus (list) – Batch of B action unit activations for N detected faces with dimensions (N, 20). Is None if a frame contains no faces.

compute_confidence(embeddings: numpy.ndarray, labels: numpy.ndarray) → numpy.ndarray

Compute face label classification confidence.

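Parameters:
  • embeddings (numpy.ndarray) – Embeddings of the face images.

  • labels (numpy.ndarray) – Labels assigned to the face embeddings.

Returns: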

confidence – Confidence scores between 0 and 1. Returns numpy.nan if no label was assigned to a face.

Return type:

numpy.ndarray

apply(filepath: str, batch_size: int = 1, skip_frames: int = 1, process_subclip: Tuple[Optional[float]] = (0, None), show_progress: bool = True) → mexca.data.VideoAnnotation

Apply multiple steps to extract features from faces in a video file.

This method sequentially calls the other methods for each frame of a video file to detect and cluster faces. It also extracts facial landmarks and action units.

Parameters:
  • filepath (str) – Path to the video file.

  • batch_size (int, default=1) – Size of the batch of video frames that are loaded and processed at the same time.

  • skip_frames (int, default=1) – Only process every nth frame, starting at 0.

  • process_subclip (tuple, default=(0, None)) – Process only a part of the video clip. Must be the start and end of the subclip in seconds.

  • show_progress (bool, default=True) – Enables the display of a progress bar.

Returns:

A data class object with extracted facial features.

Return type:

VideoAnnotation
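
A minimal usage sketch, assuming mexca and its dependencies are installed. The wrapper annotate_video is a hypothetical helper and the argument values are arbitrary; only the FaceExtractor constructor and the apply keyword names come from the signatures documented above.

```python
from typing import Optional, Tuple


def annotate_video(
    filepath: str,
    num_faces: Optional[int] = 2,
    subclip: Tuple[Optional[float], Optional[float]] = (0, None),
):
    """Hypothetical wrapper: run the face extraction pipeline on a video file."""
    # Imported lazily so that this sketch can be loaded without mexca installed.
    from mexca.video import FaceExtractor

    extractor = FaceExtractor(num_faces=num_faces)
    # Keyword names follow the documented signature of apply.
    return extractor.apply(
        filepath,
        batch_size=5,
        skip_frames=5,
        process_subclip=subclip,
        show_progress=False,
    )
```

Calling annotate_video("path/to/video.mp4", subclip=(0, 10)) with a real file path would process only the first ten seconds and, in this sketch, only every fifth frame.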

mexca.video.cli()

Command line interface for extracting facial features. See extract-faces -h for details.