Human action recognition has emerged as an active area of research within the deep learning community. The primary objective involves identifying and categorizing human actions in videos by utilizing multiple input streams, such as video and audio data.
One particular application of this technology lies in the pornography domain, which poses unique technical challenges for human action recognition: lighting variations, occlusions, and substantial differences in camera angles and filming techniques all complicate the task.
Even when two actions are identical, the diverse camera perspectives can lead to confusion in model predictions. To address these challenges in the pornography domain, we have employed deep learning techniques that learn from various input streams, including RGB, Skeleton (Pose), and Audio data. The most effective models in terms of performance and runtime include transformer-based architectures for the RGB stream, PoseC3D for the skeleton stream, and ResNet101 for the audio stream.
The outputs of these models are combined using late fusion, wherein each model's significance in the final scoring scheme differs. An alternative strategy might involve training a model with two input streams simultaneously, such as RGB+skeleton or RGB+audio, and subsequently merging their results. However, this approach is unsuitable due to the data's inherent properties.
Audio input streams are only useful for specific actions, while other actions lack distinct audio characteristics. Similarly, the skeleton-based model is only applicable when pose estimation surpasses a certain confidence threshold, which is challenging to attain for some actions.
By employing the late fusion technique, detailed in subsequent sections, we attain a top-2 accuracy of 90% across 20 distinct categories. These categories encompass a diverse range of sexual actions and positions.
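The sketch below illustrates the kind of weighted late fusion we mean: each available stream contributes its class scores with a stream-specific weight, and streams that are skipped for a clip are simply left out of the average. The weights and scores shown are illustrative, not the production values.

```python
import numpy as np

# Minimal late-fusion sketch. Each stream yields softmax scores over the same
# label set; streams that were skipped for a clip are passed as None and the
# remaining weights are renormalized.
def late_fuse(stream_scores, stream_weights):
    """stream_scores: dict of name -> class-probability array or None."""
    fused, total = None, 0.0
    for name, scores in stream_scores.items():
        if scores is None:                       # stream not run / not reliable here
            continue
        w = stream_weights[name]
        fused = w * scores if fused is None else fused + w * scores
        total += w
    return fused / total

# Example: RGB and skeleton contribute, audio was skipped for this clip.
scores = {
    "rgb": np.array([0.10, 0.70, 0.20]),
    "skeleton": np.array([0.20, 0.60, 0.20]),
    "audio": None,
}
weights = {"rgb": 0.6, "skeleton": 0.3, "audio": 0.1}
print(late_fuse(scores, weights))                # fused probabilities over 3 toy classes
```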
The primary and most reliable input stream for the model is the RGB frames. The two most powerful architectures in this context are 3D Convolutional Neural Networks (3D CNNs) and attention-based models. The attention-based models, particularly those utilizing transformer architectures, are currently considered state-of-the-art in the field. Consequently, we employ a transformer-based architecture to achieve optimal performance. The model also offers fast inference, requiring approximately 0.53 seconds to process a 7-second video clip.
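As a rough illustration of the RGB path, the snippet below uniformly samples frames from a decoded clip and times a single forward pass. `VideoTransformer` is only a placeholder module, not the actual backbone we deploy, and the frame counts are illustrative.

```python
import time
import torch
import torch.nn as nn

# Illustrative RGB inference path: uniformly sample frames from a decoded clip
# and time one forward pass. VideoTransformer is a stand-in, not the real model.
class VideoTransformer(nn.Module):
    def __init__(self, num_classes=20):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.head = nn.Linear(3, num_classes)

    def forward(self, x):                          # x: (B, C, T, H, W)
        return self.head(self.pool(x).flatten(1))

def sample_indices(num_frames, num_samples=16):
    # Uniform temporal sampling over the clip.
    return torch.linspace(0, num_frames - 1, num_samples).long()

frames = torch.rand(210, 3, 224, 224)              # decoded frames of a ~7 s clip
clip = frames[sample_indices(len(frames))].permute(1, 0, 2, 3).unsqueeze(0)

model = VideoTransformer().eval()
start = time.time()
with torch.no_grad():
    top2 = model(clip).topk(2).indices
print(f"top-2 class ids: {top2.tolist()}, took {time.time() - start:.2f}s")
```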
Initially, the human skeleton is extracted utilizing a human detection and 2D pose estimation model. The extracted skeleton information is subsequently fed into PoseC3D, a 3D Convolutional Neural Network (3D CNN) specifically designed for skeleton-based human action recognition. This model is also considered state-of-the-art in the field. In addition to its performance, the PoseC3D model exhibits efficient inference capabilities, requiring approximately 3 seconds to process 7-second video clips.
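PoseC3D consumes stacked joint heatmaps rather than raw coordinates, so the extracted keypoints are first rendered into a pseudo-heatmap volume. The sketch below shows this conversion; the heatmap resolution and Gaussian sigma are illustrative values, not our actual configuration.

```python
import numpy as np

# Sketch of the skeleton pre-processing: each 2D keypoint becomes a Gaussian
# heatmap, weighted by its confidence, and the heatmaps are stacked over time
# into a (T, K, H, W) volume. Resolution and sigma are illustrative.
def keypoints_to_heatmaps(keypoints, scores, hw=(64, 64), sigma=1.0):
    """keypoints: (T, K, 2) array of (x, y); scores: (T, K) confidences."""
    T, K, _ = keypoints.shape
    H, W = hw
    ys, xs = np.mgrid[0:H, 0:W]
    heatmaps = np.zeros((T, K, H, W), dtype=np.float32)
    for t in range(T):
        for k in range(K):
            x, y = keypoints[t, k]
            g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
            heatmaps[t, k] = g * scores[t, k]      # weight by keypoint confidence
    return heatmaps

# Example: 16 frames, 17 COCO keypoints.
kpts = np.random.rand(16, 17, 2) * 64
conf = np.random.rand(16, 17)
print(keypoints_to_heatmaps(kpts, conf).shape)     # (16, 17, 64, 64)
```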
Owing to the challenging camera perspectives encountered in many actions (e.g., it is usually not possible to extract poses reliable enough to help a model identify a fingering action), skeleton-based human action recognition is employed selectively, for a subset of actions that includes sex positions.
For the audio input stream, a ResNet-based architecture derived from the Audiovisual SlowFast model is employed. This approach is applied to a smaller set of actions compared to the skeleton-based method, primarily due to the limited information available from an audio perspective for reliably identifying actions within this specific domain.
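A simplified stand-in for that audio path is sketched below: the clip's audio is converted to a log-mel spectrogram and classified with a torchvision ResNet101. This is not the AVSlowFast-derived model itself, just a minimal illustration of the spectrogram-plus-ResNet idea; all parameter values are placeholders and a recent torchvision/torchaudio is assumed.

```python
import torch
import torch.nn as nn
import torchaudio
import torchvision

# Simplified audio path: log-mel spectrogram treated as a single-channel image
# and classified with a ResNet101. All parameters are placeholders.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_fft=1024, n_mels=128)

net = torchvision.models.resnet101(weights=None)
net.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)  # 1-channel input
net.fc = nn.Linear(net.fc.in_features, 20)                                     # 20 categories

waveform = torch.rand(1, 16000 * 7)                     # 7 s of mono audio at 16 kHz
spec = torch.log(mel(waveform) + 1e-6).unsqueeze(0)     # (1, 1, n_mels, time)
with torch.no_grad():
    audio_scores = net.eval()(spec).softmax(dim=-1)
print(audio_scores.shape)                                # torch.Size([1, 20])
```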
The assembled dataset is extensive and heterogeneous, incorporating a wide range of recording types, including point-of-view (POV), professional, amateur, with or without a dedicated camera operator, and varying background environments, individuals, and camera perspectives. The dataset comprises approximately 100 hours of training data spanning 20 distinct categories. However, some category imbalances were observed in the dataset. Efforts to address these imbalances are being considered for future iterations of the dataset.
The illustration above provides an overview of the AI pipeline utilized in our system.
Initially, a lightweight NSFW detection model is employed to identify non-NSFW segments of the video, enabling us to bypass the rest of the pipeline for those sections. This approach not only accelerates the overall video inference time but also minimizes false positives. Running the action recognition models on irrelevant footage, such as a house or car, is unnecessary as they are not designed to recognize such content.
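Conceptually, the gate works like the sketch below: the video is chunked into fixed-length segments, each segment gets an NSFW score, and only segments above a threshold continue through the pipeline. The `nsfw_score` callable and the 0.5 threshold are hypothetical stand-ins for the actual lightweight detector and its tuning.

```python
# Conceptual NSFW gate: score fixed-length segments and keep only the ones
# worth analysing further. `nsfw_score` is a hypothetical callable standing in
# for the actual lightweight detector; 0.5 is an illustrative threshold.
def select_nsfw_segments(segments, nsfw_score, threshold=0.5):
    """segments: list of (start_s, end_s) tuples."""
    return [seg for seg in segments if nsfw_score(seg) >= threshold]

segments = [(0, 7), (7, 14), (14, 21)]
toy_scores = {(0, 7): 0.05, (7, 14): 0.92, (14, 21): 0.88}
print(select_nsfw_segments(segments, toy_scores.get))    # [(7, 14), (14, 21)]
```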
Following this preliminary step, we deploy a rapid RGB-based action recognition model. Depending on the top two results from this model, we determine whether to execute the RGB-based position recognition model, the audio-based action recognition model, or the skeleton-based action recognition model. If one of the top two predictions from the RGB-action recognition model corresponds to the position category, we proceed with the RGB-position recognition model to accurately identify the specific position.
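The routing decision can be summarised as in the sketch below; the label groups shown are illustrative examples, not the full label-to-group mapping we use.

```python
# Routing sketch: decide which specialised models to run from the fast RGB
# model's top-2 labels. The label groups below are illustrative examples, not
# the full mapping.
POSITION_LABELS = {"cowgirl", "doggy", "missionary"}
AUDIO_LABELS = {"kissing", "moaning"}

def plan_next_models(top2_labels):
    plan = []
    if any(label in POSITION_LABELS for label in top2_labels):
        plan += ["rgb_position", "skeleton_position"]     # fused later
    if any(label in AUDIO_LABELS for label in top2_labels):
        plan.append("audio_action")
    return plan

print(plan_next_models(["doggy", "kissing"]))
# ['rgb_position', 'skeleton_position', 'audio_action']
```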
Subsequently, we utilize bounding box and 2D pose models to extract the human skeleton, which is then input into the skeleton-based position recognition model. The results from the RGB-position recognition model and the skeleton-position recognition model are integrated through late fusion.
If the audio group is detected within the top two labels, the audio-based action recognition model is executed. Its results are combined with those of the RGB-action recognition model through late fusion.
Lastly, we parse the outcomes of the action and position models, generating one or two final predictions. Examples of such predictions include single actions (e.g., Missi***ry), position and action combinations (e.g., Cowgirl & Kissing or Doggy & An*l), or dual actions (e.g., Cunn***ngus & Fing***ng).
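A minimal sketch of this parsing step is shown below, assuming each fused model returns a (label, confidence) pair; the confidence threshold and label names are illustrative.

```python
# Sketch of the final parsing step: keep the fused position and/or action
# result if it clears a confidence threshold, yielding one or two labels.
# The threshold and label names are illustrative.
def final_predictions(position, action, min_conf=0.4):
    """position / action: (label, confidence) tuples, or None if not run."""
    outputs = []
    for result in (position, action):
        if result and result[1] >= min_conf:
            outputs.append(result[0])
    return outputs or ["unknown"]

print(final_predictions(("cowgirl", 0.81), ("kissing", 0.66)))   # ['cowgirl', 'kissing']
print(final_predictions(None, ("fingering", 0.72)))              # ['fingering']
```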
For more information, you can read our P-HAR API Docs.