MIVPG and Instance Correlation: Enhanced Multi-Instance Learning

Written by instancing | Published 2025/11/15
Tech Story Tags: deep-learning | mivpg | instance-correlation | correlated-self-attention | multiple-instance-learning | q-former-extension | visual-representations | low-rank-projection

TL;DR: MIVPG uses a Correlated Self-Attention (CSA) module to unveil instance correlation, fulfilling all MIL properties while outperforming Q-Former. CSA improves aggregation and reduces time complexity.

Abstract and 1 Introduction

  2. Related Work

    2.1. Multimodal Learning

    2.2. Multiple Instance Learning

  3. Methodology

    3.1. Preliminaries and Notations

    3.2. Relations between Attention-based VPG and MIL

    3.3. MIVPG for Multiple Visual Inputs

    3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios

  4. Experiments and 4.1. General Setup

    4.2. Scenario 1: Samples with Single Image

    4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding

    4.4. Scenario 3: Samples with Multiple Images, with Each Image Having Multiple Patches to be Considered and 4.5. Case Study

  5. Conclusion and References

Supplementary Material

A. Detailed Architecture of QFormer

B. Proof of Proposition

C. More Experiments

3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios

Subsequently, the aggregated low-rank matrix can be reintegrated with the original embeddings, as shown in Equation 9. This low-rank projection reduces the time complexity of the instance-level attention from O(M²) to O(MM′).
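As a concrete illustration, below is a minimal PyTorch sketch of a low-rank attention block in this spirit: M instance embeddings are first aggregated into M′ learned latent slots, and the aggregated matrix is then reintegrated with the original embeddings through a residual connection, so each attention step costs O(MM′) rather than O(M²). The class name LowRankCSA, the learned latent parameterization, and the two-step multi-head attention are assumptions for illustration, not the paper's exact Equation 9.

```python
import torch
import torch.nn as nn

class LowRankCSA(nn.Module):
    """Sketch of a correlated self-attention block with low-rank aggregation.

    Instead of full O(M^2) self-attention over M instance embeddings, the
    instances are first aggregated into M' << M learned latent slots, and the
    aggregated matrix is then reintegrated with the original embeddings,
    giving O(M * M') complexity per attention step.
    """
    def __init__(self, dim: int, m_prime: int = 16, num_heads: int = 4):
        super().__init__()
        # M' learned low-rank latents (an assumption; the paper learns a projection)
        self.latents = nn.Parameter(torch.randn(m_prime, dim) * 0.02)
        self.aggregate = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.reintegrate = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, M, dim) instance embeddings
        b = x.size(0)
        latents = self.latents.unsqueeze(0).expand(b, -1, -1)   # (b, M', dim)
        # Step 1: aggregate M instances into M' low-rank slots, cost O(M * M')
        agg, _ = self.aggregate(latents, x, x)                  # (b, M', dim)
        # Step 2: reintegrate the aggregated matrix with the original
        # embeddings via a residual connection, again O(M * M')
        out, _ = self.reintegrate(x, agg, agg)                  # (b, M, dim)
        return self.norm(x + out)
```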

Proposition 2. MIVPG, when equipped with the CSA (Correlated Self-Attention) module, continues to fulfill the essential properties of MIL.

We prove Proposition 2 in Supplementary Material B.
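As an informal sanity check in the spirit of Proposition 2, the sketch above is permutation-equivariant over instances, so a bag-level summary obtained with any permutation-invariant pooling is unchanged when the instance order is shuffled. Here mean pooling stands in for the Q-Former-style learned-query aggregation:

```python
torch.manual_seed(0)
csa = LowRankCSA(dim=64, m_prime=8)
x = torch.randn(2, 10, 64)                 # a bag of M = 10 instance embeddings
perm = torch.randperm(10)

pooled = csa(x).mean(dim=1)                # bag summary via mean pooling
pooled_perm = csa(x[:, perm]).mean(dim=1)  # same bag, instances shuffled

print(torch.allclose(pooled, pooled_perm, atol=1e-5))  # expected: True
```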

In summary, as depicted in Figure 2a, we establish that QFormer falls under the MIL category and is a specialized instance of our proposed MIVPG. The latter extends to visual inputs with multiple dimensions, accounting for instance correlation.

Authors:

(1) Wenliang Zhong, The University of Texas at Arlington ([email protected]);

(2) Wenyi Wu, Amazon ([email protected]);

(3) Qi Li, Amazon ([email protected]);

(4) Rob Barton, Amazon ([email protected]);

(5) Boxin Du, Amazon ([email protected]);

(6) Shioulin Sam, Amazon ([email protected]);

(7) Karim Bouyarmane, Amazon ([email protected]);

(8) Ismail Tutar, Amazon ([email protected]);

(9) Junzhou Huang, The University of Texas at Arlington ([email protected]).


This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.

