Violence Detection in Videos: Introduction

by @kinetograph

Too Long; Didn't Read

In this paper, researchers propose a system for automatic detection of violence in videos, utilizing audio and visual cues for classification.


(1) Praveen Tirupattur, University of Central Florida.

1. Introduction

The amount of multimedia content uploaded to social networking websites, and the ease with which children can access it, poses a problem for parents who wish to protect their children from exposure to violent and adult content on the web. The number of video uploads to websites like YouTube and Facebook is on the rise. The number of video posts on Facebook increased by 75% in the last year (Blog-FB [3]), and more than 120,000 videos are uploaded to YouTube every day (Wesch [56], Gill et al. [26]). It is estimated that 20% of the videos uploaded to these websites contain violent or adult content (Sparks [54]). This makes it easy for children to access, or be accidentally exposed to, this unsafe content. The effects of watching violent content on children are well studied in psychology (Tompkins [55], Sparks [54], Bushman and Huesmann [6], and Huesmann and Taylor [32]), and the results of these studies suggest that watching violent content has a substantial effect on children's emotions. The major effects are an increased likelihood of aggressive or fearful behavior and reduced sensitivity to the pain and suffering of others. Huesmann and Eron [31] conducted a study of elementary school children who watched many hours of violence on television. By following these children into adulthood, they found that those who watched a lot of television violence at 8 years old were more likely to be arrested and prosecuted for criminal acts as adults. Similar studies by Flood [25] and Mitchell et al. [40] suggest that exposure to adult content also has detrimental effects on children. These findings motivated research on the automatic detection of violent and adult content in videos.

Adult content detection (Chan et al. [8], Schulze et al. [52], Pogrebnyak et al. [47]) is well studied and much progress has been made. Violence detection, on the other hand, has been less studied and has gained interest only in the recent past. A few approaches to violence detection have been proposed, each of which tried to detect violence using different visual and auditory features. For example, Nam et al. [41] combined multiple audio-visual features to identify violent scenes. In their work, flames and blood were detected using predefined color tables, and various representative audio effects (gunshots, explosions, etc.) were also exploited. Datta et al. [14] proposed an approach based on accelerated motion vectors to detect human violence such as fist fighting and kicking. Cheng et al. [11] presented a hierarchical approach to locating gunplay and car racing scenes through the detection of typical audio events (e.g. gunshots, explosions, and car braking).
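To make the idea of a predefined color table concrete, the following is a minimal sketch of how the fraction of "blood-like" pixels in a frame could be measured. It is not the method of Nam et al. [41]: the RGB ranges in BLOOD_LIKE_RANGES, the function names, and the frame representation (a list of rows of RGB tuples) are all assumptions made for illustration.

```python
# Hypothetical color table: RGB ranges loosely associated with blood-like reds.
# These bounds are illustrative assumptions, not values taken from any paper.
BLOOD_LIKE_RANGES = [
    ((120, 255), (0, 60), (0, 60)),  # (red range, green range, blue range)
]


def matches_color_table(pixel, table=BLOOD_LIKE_RANGES):
    """Return True if an (r, g, b) pixel falls inside any range in the table."""
    r, g, b = pixel
    return any(
        r_lo <= r <= r_hi and g_lo <= g <= g_hi and b_lo <= b <= b_hi
        for (r_lo, r_hi), (g_lo, g_hi), (b_lo, b_hi) in table
    )


def color_table_ratio(frame, table=BLOOD_LIKE_RANGES):
    """Fraction of pixels in a frame (list of rows of RGB tuples) matching the table."""
    pixels = [pixel for row in frame for pixel in row]
    if not pixels:
        return 0.0
    matching = sum(1 for pixel in pixels if matches_color_table(pixel, table))
    return matching / len(pixels)


# Tiny 2x2 example frame: two reddish pixels, one blue, one white.
example_frame = [
    [(200, 10, 10), (0, 0, 255)],
    [(130, 50, 50), (255, 255, 255)],
]
example_ratio = color_table_ratio(example_frame)
```

A ratio like this would typically serve as one visual feature among several rather than as a detector on its own, since many non-violent scenes also contain saturated reds.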

More approaches proposed for violence detection are discussed in Chapter 2. All of these approaches focused mainly on detecting violence in Hollywood movies, not in videos from video-sharing and social media websites such as YouTube or Facebook. Detecting violence in Hollywood movies is relatively easy because these movies follow certain moviemaking rules. For example, to present exciting action scenes, a fast-paced atmosphere is created through high-speed visual movement and fast-paced sound. Videos from video-sharing websites like YouTube and Facebook, however, do not follow these moviemaking rules and often have poor audio and video quality. These characteristics of user-generated videos make it very hard to detect violence in them.

Before the approach to detecting violence is discussed, it is important to define the term “Violence”. Previous approaches to violence detection have not followed a common definition of violence, and they have used different features and different datasets. This makes comparing the approaches very difficult. To overcome this problem and to foster research in this area, a dataset named Violent Scene Detection (VSD) was introduced by Demarty et al. [15] in 2011; the most recent version of this dataset is VSD2014. According to this latest dataset, “Violence” in a video is “any scene one would not let an 8 year old child watch because they contain physical violence” (Schedl et al. [51]). This definition is believed to be based on the research findings from psychology mentioned above. From this definition, it can be observed that violence is not a physical entity but a concept that is very generic, abstract, and subjective. Hence, violence detection is not a trivial task.

The aim of this work is to build a system that automatically detects violence not only in Hollywood movies but also in videos from video-sharing websites like YouTube and Facebook. This work also attempts to detect the category of violence in a video, which earlier approaches did not address. The categories of violence targeted in this work are the presence of blood, the presence of cold arms, explosions, fights, screams, the presence of fire, firearms, and gunshots. These are a subset of the concepts defined and used in VSD2014 for annotating video segments. The categories “gory scenes” and “car chase” from VSD2014 were not selected because few video segments in VSD2014 are annotated with these concepts. Another excluded category is “Subjective Violence”, whose scenes contain no visible violence and are therefore very hard to detect. In this work, both audio and visual features are used for violence detection, as combining audio and visual information provides more reliable classification results.
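One common way to combine audio and visual information is late fusion, where each modality produces a score per category and the scores are merged afterwards. The sketch below illustrates that idea for the eight categories listed above. It is an assumption for illustration only: this section does not specify how the two modalities are combined, and the function names, the weight, and the threshold are hypothetical.

```python
# Categories of violence named in the text above.
CATEGORIES = [
    "blood", "cold arms", "explosions", "fights",
    "screams", "fire", "firearms", "gunshots",
]


def fuse_scores(audio_scores, visual_scores, audio_weight=0.5):
    """Weighted average of per-category scores in [0, 1] from two modalities.

    If a category is scored by only one modality, that score is used as-is.
    The weighting scheme is an illustrative assumption, not the paper's method.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must be between 0 and 1")
    fused = {}
    for category in CATEGORIES:
        audio = audio_scores.get(category)
        visual = visual_scores.get(category)
        if audio is None and visual is None:
            continue  # no evidence for this category from either modality
        if audio is None:
            fused[category] = visual
        elif visual is None:
            fused[category] = audio
        else:
            fused[category] = audio_weight * audio + (1.0 - audio_weight) * visual
    return fused


def detected_categories(fused_scores, threshold=0.5):
    """Return the sorted list of categories whose fused score meets the threshold."""
    return sorted(c for c, score in fused_scores.items() if score >= threshold)


# Hypothetical classifier outputs for one video segment.
audio_example = {"gunshots": 0.9, "screams": 0.4}
visual_example = {"blood": 0.7, "gunshots": 0.5}
fused_example = fuse_scores(audio_example, visual_example)
detected_example = detected_categories(fused_example)
```

Fusing at the score level keeps the audio and visual classifiers independent, so a category that only one modality can observe (screams in audio, blood in video) still contributes to the final decision.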

A system that can automatically detect violence in multimedia content has many advantages. It can be used to rate movies according to the amount of violence they contain. Social networking sites can use it to detect and block the upload of violent videos to their platforms. It can also be used for scene characterization and genre classification, which helps in searching and browsing movies. Recognizing violence in video streams from real-time camera systems would be very helpful for video surveillance in places such as airports, hospitals, shopping malls, public places, prisons, psychiatric wards, and school playgrounds. However, real-time detection of violence is much more difficult, and this work makes no attempt to address it.

An overview of related work, a detailed description of the proposed approach, and the evaluation are presented in the following chapters. Chapter 2 explains some of the previous work in the area of violence detection in detail. Chapter 3 presents the approach used for training and testing the feature classifiers, including the details of feature extraction and classifier training. Chapter 4 describes the datasets used, the experimental setup, and the results obtained from the experiments. Finally, Chapter 5 provides conclusions, followed by possible future work.

This paper is available on arXiv under a CC 4.0 license.