Multiple Instance Learning: Review of Instance and Embedding Level Approaches

Written by instancing | Published 2025/11/13
Tech Story Tags: deep-learning | multiple-instance-learning | embedding-level-mil | instance-aggregation | attention-mechanism | neural-pooling | modality-extension | instance-level-approach

TL;DR: This article reviews Multiple Instance Learning (MIL), a form of machine learning in which predictions are made over bags of instances, contrasting instance-level and embedding-level approaches. It situates MIVPG, which uses attention-based VPG and a bag-level embedding to generate bag-level predictions, and the authors also present a case study on the use of multiple images in MIL.

Abstract and 1 Introduction

  2. Related Work

    2.1. Multimodal Learning

    2.2. Multiple Instance Learning

  3. Methodology

    3.1. Preliminaries and Notations

    3.2. Relations between Attention-based VPG and MIL

    3.3. MIVPG for Multiple Visual Inputs

    3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios

  4. Experiments and 4.1. General Setup

    4.2. Scenario 1: Samples with Single Image

    4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding

    4.4. Scenario 3: Samples with Multiple Images, with Each Image Having Multiple Patches to be Considered and 4.5. Case Study

  5. Conclusion and References

Supplementary Material

A. Detailed Architecture of QFormer

B. Proof of Proposition

C. More Experiments

2.2. Multiple Instance Learning

Traditionally, Multiple Instance Learning [6, 28] can be broadly categorized into two main types: (1) the instance-level approach [5, 10, 14, 17], in which bag-level predictions are derived directly from the set of instance-level predictions; and (2) the embedding-level approach [16, 20, 26, 34], in which bag-level predictions are generated from a bag-level embedding that represents multiple instances. The former often relies on hand-crafted pooling operators such as mean pooling or max pooling, but in practical applications these operators tend to yield limited results. Hence, the majority of current research is grounded in the latter approach.
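To make the two categories concrete, here is a minimal NumPy sketch of both on a toy bag. The linear sigmoid scorer and the random parameters are illustrative placeholders, not any model from the paper; the point is only where the pooling happens — over instance predictions (instance-level) versus over instance features (embedding-level).

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy bag of 5 instances, each a 4-dimensional feature vector.
bag = rng.normal(size=(5, 4))
w = rng.normal(size=4)  # hypothetical classifier weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# (1) Instance-level approach: score each instance, then pool the PREDICTIONS.
instance_preds = sigmoid(bag @ w)          # shape (5,), one score per instance
bag_pred_mean = instance_preds.mean()      # mean pooling over predictions
bag_pred_max = instance_preds.max()        # max pooling over predictions

# (2) Embedding-level approach: pool the FEATURES into one bag embedding,
#     then classify that embedding once.
bag_embedding = bag.max(axis=0)            # max pooling over features, shape (4,)
bag_pred_embed = sigmoid(bag_embedding @ w)
```

Both variants produce a single bag-level score; the embedding-level route is the one most current methods build on, replacing the fixed max/mean with a learned pooling operator.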

Aggregating instance features into bag-level features typically leads to better outcomes but requires more complex pooling operations. Recent research has applied neural networks to the pooling process in MIL. For instance, MI-Net [40] utilizes a fully connected layer in MIL, and AB-MIL [16] employs attention during pooling, allowing different instances to be weighted more effectively. Another category of methods [34] considers the relationships between instances using the self-attention mechanism. Moreover, DS-MIL [20] employs attention to model not only instance-to-instance but also instance-to-bag relationships, while DTFD-MIL [46] incorporates the Grad-CAM [33] mechanism into MIL. While these approaches concentrate on a single modality, the extension of MIL to multimodal applications remains scarcely explored [39].
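The attention-based pooling idea mentioned above can be sketched in a few lines of NumPy, in the spirit of AB-MIL [16]: each instance receives a softmax-normalized weight computed from its embedding, and the bag embedding is the weighted sum. The parameters V and w would be learned in practice; here they are random, and the dimensions are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def attention_pool(H, V, w):
    """Attention-style MIL pooling (sketch in the spirit of AB-MIL).

    H: (n_instances, d) instance embeddings.
    V: (d_attn, d) and w: (d_attn,) play the role of learned parameters.
    Returns the pooled bag embedding and the per-instance attention weights.
    """
    logits = np.tanh(H @ V.T) @ w          # one unnormalized score per instance
    a = np.exp(logits - logits.max())      # numerically stable softmax
    a = a / a.sum()                        # weights are positive and sum to 1
    z = a @ H                              # weighted sum -> bag embedding (d,)
    return z, a

H = rng.normal(size=(6, 8))   # a bag of 6 instances with 8-dim embeddings
V = rng.normal(size=(16, 8))
w = rng.normal(size=16)
z, a = attention_pool(H, V, w)
```

Because the weights are produced by a small network rather than fixed as in mean/max pooling, the model can learn to emphasize the instances most relevant to the bag label, which is the property the self-attention and instance-to-bag variants (DS-MIL and related methods) build on further.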

Authors:

(1) Wenliang Zhong, The University of Texas at Arlington ([email protected]);

(2) Wenyi Wu, Amazon ([email protected]);

(3) Qi Li, Amazon ([email protected]);

(4) Rob Barton, Amazon ([email protected]);

(5) Boxin Du, Amazon ([email protected]);

(6) Shioulin Sam, Amazon ([email protected]);

(7) Karim Bouyarmane, Amazon ([email protected]);

(8) Ismail Tutar, Amazon ([email protected]);

(9) Junzhou Huang, The University of Texas at Arlington ([email protected]).


This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.

Published by HackerNoon on 2025/11/13