Authors:
(1) Pinelopi Papalampidi, Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh;
(2) Frank Keller, Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh;
(3) Mirella Lapata, Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh.
Evaluation Metrics Previous work [41] evaluates the performance of TP identification models in terms of three metrics:

- Total Agreement (TA): the percentage of TP scenes that are correctly identified;
- Partial Agreement (PA): the percentage of TP events for which at least one gold-standard scene is identified;
- Distance (D): the minimum distance, in number of scenes, between the predicted and gold-standard sets of scenes for a given TP, normalized by the screenplay length.

We report results with the partial agreement metric. We can no longer use total agreement, since we evaluate against silver-standard (rather than gold) labels for shots (rather than scenes) and as a result consider all shots within a scene equally important. We do not use the distance metric either, since it yields very similar outcomes and does not help discriminate among model variants.
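To make the partial agreement computation concrete, here is a minimal sketch in Python. It assumes predictions and gold annotations arrive as one set of shot (or scene) indices per TP event; the function name and data layout are our own illustrative assumptions, not the paper's evaluation code.

```python
def partial_agreement(predicted, gold):
    # predicted, gold: lists with one entry per TP event, where each
    # entry is a set of shot (or scene) indices. A TP event counts as
    # a hit if at least one gold index appears among the predictions.
    assert len(predicted) == len(gold)
    hits = sum(1 for p, g in zip(predicted, gold) if p & g)
    return hits / len(gold)

# Example with 3 TP events: the first two overlap with their gold sets.
pred = [{4, 5}, {10}, {20}]
gold = [{5, 6}, {10, 11}, {30}]
print(partial_agreement(pred, gold))  # 0.666...
```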
Hyperparameters Following previous work [42], we project all types of features (i.e., textual, visual, and audio) to the same lower dimension of 128. We find that larger dimensions increase the number of parameters considerably and yield inferior results, possibly due to the small dataset size.
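As an illustration, one linear projection per modality might look as follows; only the shared output dimension (128) is taken from the paper, while the per-modality input sizes are placeholder assumptions.

```python
import torch
import torch.nn as nn

D_MODEL = 128  # shared lower dimension reported in the paper

# Hypothetical per-modality input sizes; only the output size (128)
# comes from the paper.
text_proj = nn.Linear(768, D_MODEL)
video_proj = nn.Linear(2048, D_MODEL)
audio_proj = nn.Linear(128, D_MODEL)

text = text_proj(torch.randn(16, 768))     # e.g., 16 scenes/shots
video = video_proj(torch.randn(16, 2048))
audio = audio_proj(torch.randn(16, 128))
assert text.shape == video.shape == audio.shape == (16, D_MODEL)
```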
We contextualize scenes (with respect to the screenplay) and shots (with respect to the video) using transformer encoders. We experimented with 2, 3, 4, 5, and 6 layers in the encoder and obtained the best results with 3 layers. For the feed-forward (FF) dimension, we experimented with both a standard size of 2,048 and a smaller size of 1,024 and found that the former works better. We use another transformer encoder to compute the representation of a scene from a sequence of input sentence representations. This encoder has 4 layers and an FF dimension of 1,024. Both encoders employ 8 attention heads and a dropout of 0.3.
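For concreteness, the two encoders could be instantiated with PyTorch's stock transformer modules as sketched below. This mirrors the reported hyperparameters (3 layers with FF dimension 2,048 for contextualization; 4 layers with FF dimension 1,024 for scene composition; 8 heads and 0.3 dropout throughout) but is only a sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Scene/shot contextualization: 3 layers, FF dimension 2,048.
context_layer = nn.TransformerEncoderLayer(
    d_model=128, nhead=8, dim_feedforward=2048, dropout=0.3)
context_encoder = nn.TransformerEncoder(context_layer, num_layers=3)

# Scene representation from its sentences: 4 layers, FF dimension 1,024.
sentence_layer = nn.TransformerEncoderLayer(
    d_model=128, nhead=8, dim_feedforward=1024, dropout=0.3)
scene_encoder = nn.TransformerEncoder(sentence_layer, num_layers=4)

# Default input layout is (sequence length, batch, d_model).
scenes = torch.randn(50, 1, 128)      # e.g., 50 scenes in a screenplay
print(context_encoder(scenes).shape)  # torch.Size([50, 1, 128])
```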
During graph sparsification (i.e., selection of top-k neighbors), we consider different neighborhood options for the scene- and shot-based networks, since they differ in granularity and size. Following [42], we consider 1–6 neighbors for the scene network and increase the neighborhood size to 6–12 for the shot network.
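A generic top-k sparsification step could be sketched as follows. However the pairwise edge scores are produced in the model, the sparsification keeps each node's k highest-scoring neighbors; the function and the random score matrices here are illustrative stand-ins, not the paper's graph construction.

```python
import torch

def sparsify_topk(scores, k):
    # scores: (N, N) pairwise edge scores between scenes or shots.
    # Keeps each node's top-k neighbors and zeroes out the rest;
    # the resulting adjacency is directed (each row selects neighbors).
    scores = scores.clone()
    scores.fill_diagonal_(float("-inf"))   # exclude self-loops
    topk = scores.topk(k, dim=-1).indices  # (N, k) neighbor indices
    mask = torch.zeros_like(scores).scatter_(-1, topk, 1.0)
    return torch.where(mask.bool(), scores, torch.zeros_like(scores))

# Scene network: k in 1-6; shot network: k in 6-12 (per the paper).
scene_adj = sparsify_topk(torch.randn(50, 50), k=3)
shot_adj = sparsify_topk(torch.randn(400, 400), k=8)
```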
This paper is available on arXiv under the CC BY-SA 4.0 DEED license.