
Leveraging Multimodal Features and Item-level User Feedback for Bundle Construction


Too Long; Didn't Read

CLHE, a Contrastive Learning-enhanced Hierarchical Encoder, is revolutionizing bundle construction by combining multimodal features, item-level user feedback, and existing bundles. Addressing challenges like sparsity and the cold-start problem, CLHE outperforms competitors in automatic bundle construction. Learn about the key contributions and technical innovations that make CLHE a groundbreaking method in the e-commerce landscape.

Authors:

(1) Yunshan Ma, National University of Singapore;

(2) Xiaohao Liu, University of Chinese Academy of Sciences;

(3) Yinwei Wei, Monash University;

(4) Zhulin Tao, Communication University of China (corresponding author);

(5) Xiang Wang, University of Science and Technology of China, also affiliated with the Institute of Artificial Intelligence and the Institute of Dataspace, Hefei Comprehensive National Science Center;

(6) Tat-Seng Chua, National University of Singapore.

Table of Links

Abstract & Introduction

Methodology

Experiments

Related Work

Conclusion and Future Work, Acknowledgment and References

ABSTRACT

Automatic bundle construction is a crucial prerequisite step in various bundle-aware online services. Previous approaches are mostly designed to model the bundling strategy of existing bundles. However, it is hard to acquire a large-scale, well-curated bundle dataset, especially for platforms that have not offered bundle services before. Even for platforms with mature bundle services, there are still many items that are included in few or even zero bundles, which gives rise to sparsity and cold-start challenges for bundle construction models. To tackle these issues, we leverage multimodal features, item-level user feedback signals, and bundle composition information to achieve a comprehensive formulation of bundle construction. Nevertheless, this formulation poses two new technical challenges: 1) how to learn effective representations by optimally unifying multiple features, and 2) how to address the modality-missing, noise, and sparsity problems induced by incomplete query bundles. In this work, to address these technical challenges, we propose a Contrastive Learning-enhanced Hierarchical Encoder method (CLHE). Specifically, we use self-attention modules to combine the multimodal and multi-item features, and then leverage both item- and bundle-level contrastive learning to enhance representation learning, thereby countering the modality-missing, noise, and sparsity problems. Extensive experiments on four datasets in two application domains demonstrate that our method outperforms a range of SOTA methods. The code and dataset are available at https://github.com/Xiaohao-Liu/CLHE.


KEYWORDS

Bundle Construction, Multimodal Modeling, Contrastive Learning


Figure 1: The motivations of leveraging multimodal features and item-level user feedback for bundle construction.

1 INTRODUCTION

Product bundling has long been a popular and effective marketing strategy, dating back to ancient commerce and persisting through today's rapidly growing e-commerce and online services. By combining a set of individual items into a bundle, both sellers (or service providers) and consumers benefit in multiple ways, from the reduced cost of packaging, shipment, and installation to the promotion of old or new items by pairing them with popular or essential items at a discount. To implement product bundling, the first and foremost step is constructing bundles from individual items, a.k.a. bundle construction, which has traditionally been carried out by human experts. However, the explosive growth of item sets poses significant challenges to such high-cost manual approaches. Hence, automatic approaches to bundle construction are imperative and have garnered increasing attention in recent years.


By analyzing prior studies, we find that they mostly build bundles based on the co-occurrence of items in existing training bundles. However, two key problems have not been well studied: 1) previous approaches heavily rely on large-scale, high-quality bundle datasets for training, and 2) they cannot properly handle sparsity and cold-start issues. First, most previous bundle construction methods require high-quality supervision signals from a large set of well-curated bundles. This poses a dilemma: for platforms that have not offered a bundle service before, or have only deployed one recently, it is difficult to collect sufficient bundle data for training. Second, even for platforms with mature bundle services, the situation is far from ideal due to various cold-start problems. On the one hand, quite a number of items are involved in only a few bundles; consequently, it is challenging to obtain informative representations for these sparse items when constructing new bundles. Worse still, many new items that have never been part of any bundle are continuously pushed online, and swiftly bundling these cold-start items with existing warm items is crucial for platforms to promote new products and sustain growth.


To address these challenges, instead of seeking a silver bullet, we pursue practical solutions that make full use of two abundant and easy-to-access resources: multimodal features and item-level user feedback. The motivation behind this solution is that these data align well with diverse bundling strategies. First, multimodal features, such as text, image, and audio, contain rich semantic information that helps identify similar or compatible items to form bundles, as shown in Figure 1. More importantly, most items, even sparse and newly introduced ones, usually have one or more such features. A plethora of previous efforts, such as personalized recommendation [47], have demonstrated the efficacy of multimodal features in handling sparse and cold-start items. Second, item-level user feedback provides precious crowd-sourced knowledge that is crucial to bundle construction: intuitively, items that users frequently co-interact with are strong candidates for bundling. More importantly, a large amount of such user feedback is available even to platforms that do not offer bundle services. Compared with previous works [20], we pioneer the integration of multimodal features and item-level user feedback for bundle construction.
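To make the co-interaction intuition concrete, here is a minimal sketch with hypothetical toy data (this is not the paper's code): it scores item pairs by the number of users who interacted with both, exactly the kind of signal available even to platforms without a bundle service.

```python
# Minimal sketch of the co-interaction intuition (toy data, not the paper's
# code): items that many users interact with together are strong candidates
# for being bundled.
import numpy as np
from scipy.sparse import csr_matrix

# Toy user-item feedback matrix: rows = users, columns = items, 1 = interaction.
interactions = csr_matrix(np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
]))

# co_counts[i, j] = number of users who interacted with both item i and item j.
co_counts = (interactions.T @ interactions).toarray()
np.fill_diagonal(co_counts, 0)  # ignore self-pairs

# Given a seed item, rank the other items by how often they co-occur with it.
seed_item = 1
ranked = np.argsort(-co_counts[seed_item])
print(f"bundle candidates for item {seed_item}: {ranked[:3].tolist()}")
```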


Given these motivations, we aim to leverage both multimodal features and item-level user feedback, along with existing bundles, to develop a comprehensive model for bundle construction. However, it is non-trivial to design a model that captures all three types of information and achieves optimal bundle construction performance. First, learning effective representations within each modality while capturing the cooperative association among the three modalities is a key challenge. Second, some items may lack user feedback or belong to few bundles, so this modality-missing issue can degrade the model's capability. Moreover, during inference, bundle construction usually starts from several seed items that form a partial bundle. The incompleteness of the partial bundle imposes noise and sparsity challenges on bundle representation learning, which impedes bundle construction performance.


In this work, to address the aforementioned challenges, we propose a Contrastive Learning-enhanced Hierarchical Encoder (CLHE) for bundle construction. To obtain item representations, we use recently proposed large-scale multimodal foundation models (i.e., BLIP [27] and CLAP [50]) to extract items' multimodal features. Concurrently, we pretrain a collaborative filtering (CF)-based model (i.e., LightGCN [23]) to obtain item representations that preserve the user feedback information. We then employ a hierarchical encoder to learn the bundle representation, where self-attention modules fuse the multimodal information and the multi-item representations. To tackle the modality-missing problem and the sparsity/noise issues induced by incomplete partial bundles, we employ two levels of contrastive learning [37, 49], i.e., item-level and bundle-level, to fully exploit self-supervision signals. We conduct experiments on four datasets from two domains, and the results demonstrate that our method outperforms multiple leading methods. Various ablation and model studies further justify the effectiveness of the key modules and reveal several crucial properties of our proposed model. A simplified sketch of the hierarchical encoder and its contrastive objective follows the contribution list below. We summarize the key contributions of this work as follows:


• We introduce a pioneering approach to bundle construction that holistically combines multimodal features, item-level user feedback, and existing bundles, addressing prevailing challenges such as data insufficiency and the cold-start problem.


• We highlight multiple technical challenges of this new formulation and propose a novel method, CLHE, to tackle them.


• Our method outperforms various leading methods on four datasets from two application domains under different settings, and diverse further studies demonstrate the merits of our method.
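As promised above, the following is a rough, simplified PyTorch rendering of the two-level encoder and the bundle-level contrastive objective described in the introduction. It is our own illustrative sketch, not the authors' released implementation (see the GitHub link in the abstract); the dimensions, the use of nn.TransformerEncoderLayer for self-attention, the mean-pooling, and the random item-dropout augmentation are all assumptions.

```python
# Simplified sketch of a two-level (item -> bundle) self-attention encoder
# with a bundle-level InfoNCE loss; dimensions and layer choices are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalEncoder(nn.Module):
    def __init__(self, dim=64, n_heads=4):
        super().__init__()
        # Level 1: fuse each item's per-modality features (e.g., text, image,
        # and a CF embedding) with self-attention.
        self.item_attn = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        # Level 2: fuse the item representations within a (partial) bundle.
        self.bundle_attn = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)

    def encode_items(self, feats):
        # feats: (batch, n_items, n_modalities, dim)
        b, n, m, d = feats.shape
        fused = self.item_attn(feats.reshape(b * n, m, d))
        return fused.mean(dim=1).reshape(b, n, d)  # one vector per item

    def forward(self, feats):
        items = self.encode_items(feats)             # (batch, n_items, dim)
        return self.bundle_attn(items).mean(dim=1)   # (batch, dim) bundle vector

def info_nce(view_a, view_b, temperature=0.2):
    # Matching rows of the two views are positives; every other row in the
    # batch serves as a negative.
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.T / temperature
    return F.cross_entropy(logits, torch.arange(a.size(0)))

# Usage: contrast two views of the same bundles, the second with random
# item dropout to mimic incomplete partial bundles.
encoder = HierarchicalEncoder()
feats = torch.randn(8, 5, 3, 64)  # 8 bundles, 5 items each, 3 modalities
dropout_mask = (torch.rand(8, 5, 1, 1) > 0.3).float()
loss = info_nce(encoder(feats), encoder(feats * dropout_mask))
loss.backward()
```

In the paper's full method, an analogous contrastive loss is also applied at the item level; this sketch shows only the bundle-level term for brevity.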


This paper is available on arxiv under CC 4.0 license.