
A Comprehensive Approach to Multimodal Bundle Construction with CLHE Methodology


Too Long; Didn't Read

Learn about CLHE's groundbreaking methodology for bundle construction, covering problem formulation, hierarchical encoding, and contrastive learning. Discover how CLHE addresses challenges like sparsity and the cold-start problem, utilizing multimodal features and item-level user feedback. Dive into the details of the hierarchical encoder and contrastive learning techniques that make CLHE a pioneering solution in automatic bundle construction.

Authors:

(1) Yunshan Ma, National University of Singapore;

(2) Xiaohao Liu, University of Chinese Academy of Sciences;

(3) Yinwei Wei, Monash University;

(4) Zhulin Tao, Communication University of China (corresponding author);

(5) Xiang Wang, University of Science and Technology of China and affiliated with Institute of Artificial Intelligence, Institute of Dataspace, Hefei Comprehensive National Science Center;

(6) Tat-Seng Chua, National University of Singapore.

Table of Links

Abstract & Introduction

Methodology

Experiments

Related Work

Conclusion and Future Work, Acknowledgment and References

2 METHODOLOGY

We first formally define the problem of bundle construction by considering all three types of data. Then we describe the details of our proposed method CLHE (as shown in Figure 2).

2.1 Problem Formulation

2.2 Hierarchical Encoder

We utilize a hierarchical encoder to obtain multimodal bundle representations. First, we extract multimodal features using multimodal foundation models, while concurrently pre-training a CF-based model to capture item-level user feedback. A self-attention encoder then integrates these multimodal features into a fused item representation, and a second self-attention encoder aggregates the item representations into a comprehensive bundle representation.
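The two-level encoding described above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the feature dimension, the number of modalities, and the use of unlearned (projection-free) attention with mean pooling are all assumptions for demonstration; the actual encoder uses trained self-attention layers.

```python
import numpy as np

def self_attention(x):
    """Plain scaled dot-product self-attention over the rows of x,
    with no learned projections (illustration only)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                   # (n, n) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ x                              # (n, d) attended rows

def encode_bundle(bundle_item_features):
    """Hierarchical encoding: fuse each item's modality features into an
    item representation, then aggregate items into a bundle vector."""
    items = []
    for modal_feats in bundle_item_features:        # modal_feats: (n_modal, d)
        fused = self_attention(modal_feats).mean(axis=0)  # item representation
        items.append(fused)
    items = np.stack(items)                         # (n_items, d)
    return self_attention(items).mean(axis=0)       # bundle representation (d,)

# toy bundle: 3 items, each with 3 modality features (e.g. text/image/CF), d=8
rng = np.random.default_rng(0)
bundle = [rng.normal(size=(3, 8)) for _ in range(3)]
z = encode_bundle(bundle)
```

The key structural point is the nesting: attention runs first within each item (across modalities), then across items within the bundle.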


Figure 2: The overall framework of our proposed method CLHE, which consists of two main components: a hierarchical encoder (i.e., multimodal feature extraction plus item- and bundle-representation learning) and contrastive learning.


2.2.1 Item Representation Learning. We first detail the feature extraction process and then present the self-attention encoder.



Item-level User Feedback Feature Extraction. We employ a well-performing CF-based model, i.e., LightGCN [23], to obtain item representations from user feedback. Specifically, we construct a bipartite graph from the user-item interaction matrix, then train a LightGCN[1] model over this graph, denoted as:
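LightGCN's propagation is linear neighborhood averaging over the user-item bipartite graph with symmetric square-root degree normalization, no feature transformation, and a final average over layer outputs. A minimal, untrained numpy sketch of one forward pass (function name, embedding size, and layer count are illustrative assumptions):

```python
import numpy as np

def lightgcn_embeddings(R, dim=8, num_layers=2, seed=0):
    """One LightGCN-style forward pass over the bipartite graph built
    from the user-item interaction matrix R (n_users x n_items)."""
    n_users, n_items = R.shape
    rng = np.random.default_rng(seed)
    E = rng.normal(scale=0.1, size=(n_users + n_items, dim))  # layer-0 embeddings

    # bipartite adjacency A = [[0, R], [R^T, 0]]
    A = np.zeros((n_users + n_items, n_users + n_items))
    A[:n_users, n_users:] = R
    A[n_users:, :n_users] = R.T

    # symmetric normalization: D^{-1/2} A D^{-1/2}
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    nz = deg > 0
    d_inv_sqrt[nz] = deg[nz] ** -0.5
    A_norm = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

    layers = [E]
    for _ in range(num_layers):
        E = A_norm @ E                  # pure propagation: no weights, no nonlinearity
        layers.append(E)
    final = np.mean(layers, axis=0)     # combine layers by averaging
    return final[:n_users], final[n_users:]

# toy interaction matrix: 2 users, 3 items
R = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
user_emb, item_emb = lightgcn_embeddings(R)
```

In the actual method the embeddings are trained on user feedback; here they are random, since only the propagation structure is being shown.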



2.3 Contrastive Learning

Even though the hierarchical encoder can capture the correlations among multiple features and multiple items, it still suffers from noise, sparsity, and even cold-start problems at both the item and bundle levels. First, at the item level, items that have less user feedback or appear in fewer bundles during training are prone to deteriorated representations; this is the so-called sparsity issue. Even worse, some cold-start items may never have interacted with any user or been included in any bundle, so the cold-start problem severely degrades representation quality. Second, at the bundle level, the partial bundle's representation is susceptible to noise and sparsity. Unlike a complete bundle, which is sufficient to depict all of a bundle's functionalities and properties, a partial bundle encompasses only some of the items. Consequently, the bundle representation may be biased by the arbitrary choice of seed items.


To tackle these problems, we harness contrastive learning at both the item and bundle levels to mine self-supervision signals. Contrastive learning has recently achieved great success in various domains, including computer vision [10], NLP [18], and recommender systems [49]. The main idea is to corrupt the original data to generate augmented views of the same data point, then use an InfoNCE loss to pull together the representations of the augmented views of the same data point while pushing apart the representations of different data points. As a result, the representations become more robust to noise and sparsity.
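The corrupt-then-contrast recipe can be sketched concretely. In this minimal numpy version, the augmentation is small additive feature noise, and the batch size, noise scale, and temperature are illustrative assumptions rather than values from the paper:

```python
import numpy as np

def info_nce(F, F_aug, tau=0.2):
    """Batch InfoNCE: F[k] and F_aug[k] are two views of the same data
    point (the positive pair); all other rows act as negatives."""
    def normalize(X):
        return X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = normalize(F) @ normalize(F_aug).T / tau     # (n, n) cosine sims / tau
    logits = sim - sim.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                # positives on the diagonal

# corrupt the original representations to build an augmented view
rng = np.random.default_rng(0)
F = rng.normal(size=(16, 8))                 # batch of 16 representations
F_aug = F + 0.05 * rng.normal(size=F.shape)  # feature-noise augmentation
loss = info_nce(F, F_aug)
```

Minimizing this loss pulls each row of `F` toward its own augmented view and away from the other rows in the batch.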


2.3.1 Item-level Contrastive Learning. For each item 𝑖, we take its representation f𝑖 from Equation 5 and apply data augmentation to generate an augmented view f′𝑖. The item-level data augmentation methods we use include: 1) No Augmentation (NA) [37]: use the original representation as the augmented feature without any corruption; 2) Feature Noise (FN) [53]: add a small-scale random noise vector to the item's features; 3) Feature Dropout (FD) [49]: randomly drop out some values of the feature vectors; and 4) Modality Dropout (MD): drop the whole feature of a randomly selected modality of a randomly selected item. Then, we use InfoNCE [37] to compute the item-level contrastive loss, denoted as:
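A standard batch-wise InfoNCE objective consistent with this description (cosine similarity scaled by a temperature 𝜏, with each item's augmented view as its positive and the other items in the batch B as negatives) can be written as a sketch; the exact notation is an assumption:

```latex
\mathcal{L}_{CL}^{item}
= \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}}
  -\log
  \frac{\exp\!\big(\cos(\mathbf{f}_i, \mathbf{f}'_i)/\tau\big)}
       {\sum_{j \in \mathcal{B}} \exp\!\big(\cos(\mathbf{f}_i, \mathbf{f}'_j)/\tau\big)}
```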



where cos(·) is the cosine similarity, and 𝜏 is the temperature.


2.4 Prediction and Optimization




[1] Other CF-based models can also be used.


This paper is available on arXiv under a CC 4.0 license.