Violence Detection in Videos: Conclusions and Future Work by @kinetograph


Too Long; Didn't Read

In this paper, researchers propose a system for automatic detection of violence in videos, utilizing audio and visual cues for classification.


(1) Praveen Tirupattur, University of Central Florida.

5. Conclusions and Future Work

In this chapter, the conclusions and the directions in which the existing work can be extended are discussed in Sections 5.1 and 5.2, respectively.

5.1. Conclusions

In this work, an attempt has been made to develop a system to detect violent content in videos using both visual and audio features. Even though the approach used in this work is motivated by earlier works in this area, it has the following unique aspects: (i) the detection of different classes of violence, (ii) the use of the SentiBank feature to describe the visual content of a video, (iii) the blood detector and the blood model developed using images from the web, and (iv) the use of information from the video codec to generate motion features. A brief overview of the process used to develop this system follows.

As violence is not a physical entity, detecting it in a video is not a trivial task. Violence is a visual concept, and detecting it requires multiple features. In this work, MFCC features were used to describe the audio content, and Blood, Motion, and SentiBank features were used to describe the visual content. SVM classifiers were trained for each of the selected features, and the individual classifier scores were combined by a weighted sum to obtain the final classification score for each violence class. The weights for each class were found using a grid-search approach, with the minimum EER as the optimization criterion. Different datasets are used in this work, but the most important one is the VSD dataset, which is used for training the classifiers, calculating the classifier weights, and testing the system.
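The weighted-sum fusion and grid search described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the grid step size, the threshold-sweep approximation of the EER, and the constraint that weights sum to one are all assumptions.

```python
import itertools

import numpy as np


def eer(scores, labels):
    """Approximate the equal error rate by sweeping decision thresholds."""
    best = 1.0
    for t in np.unique(scores):
        preds = scores >= t
        far = float(np.mean(preds[labels == 0]))   # false accepts
        frr = float(np.mean(~preds[labels == 1]))  # false rejects
        best = min(best, max(far, frr))            # EER ~ where FAR == FRR
    return best


def fuse(scores_per_feature, weights):
    """Weighted sum of the per-feature classifier scores."""
    return sum(w * s for w, s in zip(weights, scores_per_feature))


def grid_search_weights(scores_per_feature, labels, step=0.25):
    """Grid-search fusion weights (summing to 1) that minimize the EER."""
    grid = np.arange(0.0, 1.0 + step, step)
    best_w, best_eer = None, 1.0
    for w in itertools.product(grid, repeat=len(scores_per_feature)):
        if abs(sum(w) - 1.0) > 1e-9:
            continue  # only consider weights on the simplex
        e = eer(fuse(scores_per_feature, w), labels)
        if e < best_eer:
            best_w, best_eer = w, e
    return best_w, best_eer
```

In this sketch the same search would be run once per violence class, since the paper learns a separate set of weights for each class.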

The performance of the system is evaluated on two different classification tasks: Multi-Class classification and Binary classification. In the Multi-Class classification task, the system has to detect the class of violence present in a video segment. This is a much more difficult task than merely detecting the presence of violence, and the system presented here is one of the first to tackle it. In the Binary classification task, the system has to detect only the presence of violence, without identifying its class. In this task, if the final classification score from the Multi-Class classification task for any of the violence classes is greater than 0.5, the video segment is categorized as “Violence”; otherwise, it is categorized as “No Violence”. The results on the Multi-Class classification task are far from perfect and leave room for improvement, whereas the results on the Binary classification task are better than the existing benchmark results from MediaEval-2014. Overall, these results are encouraging. In Section 5.2, a detailed discussion of the possible directions in which the current work can be extended is presented.
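The binary decision rule described above amounts to thresholding the maximum per-class fused score at 0.5. A minimal sketch; the class names in the usage example are hypothetical placeholders, not the paper's actual class labels:

```python
def binary_from_multiclass(class_scores, threshold=0.5):
    """Derive the binary label from per-class fused scores.

    A segment is 'Violence' if any class score exceeds the threshold,
    mirroring the decision rule described in the text.
    """
    return "Violence" if max(class_scores.values()) > threshold else "No Violence"


# Usage with hypothetical class names and scores:
scores = {"class_a": 0.7, "class_b": 0.2, "class_c": 0.4}
label = binary_from_multiclass(scores)  # "Violence", since 0.7 > 0.5
```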

5.2. Future Work

There are many possible directions in which the current work can be extended. One direction would be to improve the performance of the existing system. For that, the performance of the individual classifiers has to be improved. Motion and Blood are the two features whose classifier performance needs considerable improvement. As explained in Section 4.4, the approach used to extract motion features has to be changed to improve the performance of the motion classifier. For Blood, the problem lies with the dataset used for training the classifier, not with the feature extractor; an appropriate dataset with a sufficient number of frames containing blood should be used for training. Making these improvements should be the first step towards building a better system.

Another direction for future work would be to adapt this system and develop tools for different applications. For example, (i) a tool could be developed to extract the video segments containing violence from a given input video, which could be helpful in video tagging; (ii) a similar tool could be developed for parental control, where the system rates a movie depending on the amount of violent content in it.

Yet another possible direction is to improve the speed of the system so that it can be used for real-time detection of violence in the video feeds of security cameras. The improvements needed to develop such a system will not be trivial.
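The segment-extraction tool suggested in (i) could, for instance, merge per-segment binary decisions into time intervals. A hedged sketch, assuming fixed-length segments and boolean per-segment labels (both assumptions; the paper does not specify this post-processing):

```python
def violent_intervals(segment_labels, segment_length=1.0):
    """Merge consecutive violent segments into (start, end) time intervals.

    segment_labels: one boolean per fixed-length segment, in temporal order.
    segment_length: duration of each segment in seconds (assumed fixed).
    """
    intervals, start = [], None
    for i, violent in enumerate(segment_labels):
        if violent and start is None:
            start = i * segment_length          # interval opens here
        elif not violent and start is not None:
            intervals.append((start, i * segment_length))  # interval closes
            start = None
    if start is not None:                        # video ends mid-interval
        intervals.append((start, len(segment_labels) * segment_length))
    return intervals
```

Such intervals could feed both applications above: tagging the violent portions of a video, or summing their durations to rate a movie's overall violent content.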

This paper is available on arxiv under CC 4.0 license.