Date: 2025-02-16

6 CONCLUSION AND FUTURE WORK

In this paper, we propose Ducho, a framework for extracting high-level features for multimodal-aware recommendation. Our main purpose is to provide a unified and shared tool that supports practitioners and researchers in processing and extracting the multimodal features used as side information in recommender systems. Concretely, Ducho involves three main modules: Dataset, Extractor, and Runner. The multimodal extraction pipeline can be highly customized through a Configuration component that allows the setup of the modalities involved (i.e., audio, visual, textual), the sources of multimodal information (i.e., items and/or user-item interactions), and the pre-trained models along with their main extraction parameters. To show how Ducho works in different scenarios and settings, we propose three demos accounting for the extraction of (i) visual/textual item features, (ii) audio/textual item features, and (iii) textual item/interaction features. They can be run locally, on Docker (as we also dockerize Ducho), and on Google Colab. As future directions, we plan to: (i) adopt all available backends (i.e., TensorFlow, PyTorch, and Transformers) to extract features for all modalities; (ii) implement a general extraction model interface that lets users follow the same naming/indexing scheme for all pre-trained models and their extraction layers; (iii) integrate the extraction of low-level multimodal features.
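To make the role of the Configuration component more concrete, the following is a minimal sketch of how the three configurable dimensions (modality, source, and pre-trained model with its extraction layer) could be represented and validated before a run. It is an illustration only and does not reproduce Ducho's actual configuration schema or API: the names `ModelSpec`, `ExtractionJob`, `plan`, and `DEMO_3`, as well as the model identifier and layer name, are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical names throughout: this is NOT Ducho's real configuration schema.
MODALITIES = {"audio", "visual", "textual"}   # modalities named in the paper
SOURCES = {"items", "interactions"}           # sources named in the paper


@dataclass(frozen=True)
class ModelSpec:
    """A pre-trained model plus its main extraction parameters (assumed fields)."""
    name: str          # placeholder identifier of a pre-trained model
    backend: str       # e.g. "tensorflow", "torch", or "transformers"
    output_layer: str  # layer whose output is kept as the feature vector


@dataclass
class ExtractionJob:
    """One (modality, source) pair and the models to run on it."""
    modality: str
    source: str
    models: List[ModelSpec] = field(default_factory=list)

    def validate(self) -> None:
        if self.modality not in MODALITIES:
            raise ValueError(f"unknown modality: {self.modality!r}")
        if self.source not in SOURCES:
            raise ValueError(f"unknown source: {self.source!r}")
        if not self.models:
            raise ValueError("at least one pre-trained model is required")


def plan(jobs: List[ExtractionJob]) -> List[Tuple[str, str, str, str]]:
    """Validate every job and flatten it into the steps a runner would execute."""
    steps = []
    for job in jobs:
        job.validate()
        for model in job.models:
            steps.append((job.modality, job.source, model.name, model.output_layer))
    return steps


# Shaped after demo (iii): textual features for both items and interactions.
placeholder = ModelSpec(
    name="placeholder-text-model", backend="transformers", output_layer="last"
)
DEMO_3 = [
    ExtractionJob("textual", "items", [placeholder]),
    ExtractionJob("textual", "interactions", [placeholder]),
]

if __name__ == "__main__":
    for step in plan(DEMO_3):
        print(step)
```

Keeping the model name and extraction layer in one record per model is also one possible way to approach the uniform naming/indexing scheme mentioned in future direction (ii), although the actual interface may differ.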

ACKNOWLEDGMENTS

This work was partially supported by the following projects: Secure Safe Apulia, MISE CUP: I14E20000020001 CTEMT - Casa delle Tecnologie Emergenti Comune di Matera, CT_FINCONS_III, OVS Fashion Retail Reloaded, LUTECH DIGITALE 4.0, KOINÈ.

This paper is available on arXiv under the CC BY 4.0 DEED license.



