Authors:
(1) Pinelopi Papalampidi, Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh;
(2) Frank Keller, Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh;
(3) Mirella Lapata, Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh.
Previous approaches to movie understanding have mainly focused on isolated video clips and on tasks such as aligning movie scenes with book chapters [49], question answering [50], video captioning for movie shots [44], and text-to-video retrieval [5]. Recent work [40–42] attempts to identify high-level narrative structure and to summarize entire TV episodes and movies, focusing exclusively on the textual modality (i.e., screenplays).
Existing approaches to trailer generation exploit superficial audiovisual features, such as background music or visual changes between sequential shots [24, 46]. Other work creates “attractive” trailers with a graph-based model for shot selection [57] or uses a human in the loop in conjunction with a model trained on horror movies via audiovisual sentiment analysis [47]. The Trailer Moment Detection Dataset [53] consists of full-length movies paired with official trailers and annotations for key moments, but it is not publicly available and does not include screenplays.
Knowledge distillation [3, 23] was originally proposed for transferring information from a larger teacher model to a smaller student model. Generalized distillation [30] provides a framework for using privileged information, i.e., information that is available only at training time. Most related to our work is the use of different modalities or views of the same content [33, 34], e.g., transcribed narrations used to learn visual representations in instructional videos. We leverage screenplays as a source of privileged information and distill knowledge about events, characters, and scenes in a film, which we subsequently exploit to identify trailer-worthy shots in video.
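To make the teacher-student idea concrete, the following is a minimal sketch of a generic distillation objective, not the loss used in this paper. It assumes a classification setting in which the student is trained on a weighted mix of the ground-truth label and the teacher's temperature-softened predictions. In a privileged-information setup, the teacher would be the model with access to the extra training-time signal (here, the screenplay). The function names and the `temperature` and `imitation` parameters are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, imitation=0.5):
    """Weighted sum of a hard-label term and a teacher-imitation term.

    `imitation` (hypothetical name) balances the two terms: 0 ignores
    the teacher, 1 ignores the ground-truth label.
    """
    one_hot = [1.0 if i == hard_label else 0.0
               for i in range(len(student_logits))]
    # Term 1: the student fits the ground-truth label.
    hard_term = cross_entropy(one_hot, softmax(student_logits))
    # Term 2: the student matches the teacher's softened distribution.
    soft_term = cross_entropy(softmax(teacher_logits, temperature),
                              softmax(student_logits, temperature))
    return (1.0 - imitation) * hard_term + imitation * soft_term


teacher = [2.0, 0.5, -1.0]
agreeing_student = [2.0, 0.5, -1.0]
disagreeing_student = [-1.0, 0.5, 2.0]
loss_agree = distillation_loss(agreeing_student, teacher, hard_label=0)
loss_disagree = distillation_loss(disagreeing_student, teacher, hard_label=0)
```

In this sketch, a student whose predictions agree with both the teacher and the label receives a lower loss than one that disagrees, which is the behavior the imitation term is meant to encourage.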
This paper is available on arXiv under a CC BY-SA 4.0 DEED license.