Data Scarcity and MLLMs: Using MIL to Uncover Latent Patterns in Single-Image Tasks

Written by instancing | Published 2025/11/18
Tech Story Tags: deep-learning | data-scarcity | multiple-instance-learning | single-image-scenario | mscoco | ppeg-enhancement | mllm-performance | image-patches

TL;DR: Evaluates MIVPG performance on single-image datasets. Enhancements from PPEG and MIL are critical for discerning patterns in small datasets, mitigating the impact of data scarcity on MLLM performance.

Abstract and 1 Introduction

  2. Related Work

    2.1. Multimodal Learning

    2.2. Multiple Instance Learning

  3. Methodology

    3.1. Preliminaries and Notations

    3.2. Relations between Attention-based VPG and MIL

    3.3. MIVPG for Multiple Visual Inputs

    3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios

  4. Experiments and 4.1. General Setup

    4.2. Scenario 1: Samples with Single Image

    4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding

    4.4. Scenario 3: Samples with Multiple Images, with Each Image Having Multiple Patches to be Considered and 4.5. Case Study

  5. Conclusion and References

Supplementary Material

A. Detailed Architecture of QFormer

B. Proof of Proposition

C. More Experiments

4.2. Scenario 1: Samples with Single Image

We start by assessing our method on common single-image datasets to validate the effectiveness of incorporating Multiple Instance Learning, implemented by adding a Pyramid Positional Encoding Generator (PPEG) to each layer containing MIVPG. Following the fine-tuning baseline in BLIP2, we choose MSCOCO [23] as the evaluation dataset and employ the Karpathy validation and test split. The original training set contains approximately 560K image-text pairs. Because most existing MIL methods are tailored to small datasets, we evaluate performance across training subsets of various sizes obtained through random sampling. In this dataset, we treat patches as individual instances, and each sample comprises only one image, i.e., N = 1.
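For concreteness, here is a minimal PyTorch sketch of a PPEG-style module. The exact MIVPG configuration is not spelled out in this excerpt, so the design below follows the original pyramid layout from TransMIL (depthwise convolutions at 7×7, 5×5, and 3×3 over the 2D patch grid); the square-grid assumption is an illustrative simplification.

```python
import torch
import torch.nn as nn

class PPEG(nn.Module):
    """Pyramid Position Encoding Generator, sketched after TransMIL.

    Assumption: kernel sizes (7/5/3) and depthwise convs mirror the
    TransMIL design; the paper's MIVPG-specific wiring may differ.
    """
    def __init__(self, dim: int):
        super().__init__()
        # Depthwise convolutions at three scales act as a conditional
        # positional encoding over the spatial grid of patch instances.
        self.proj7 = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)
        self.proj5 = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.proj3 = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, L, D) patch-instance embeddings; for simplicity L must
        # be a perfect square here (pad to the next square in practice).
        B, L, D = x.shape
        side = int(L ** 0.5)
        assert side * side == L, "pad the sequence to a square grid first"
        grid = x.transpose(1, 2).reshape(B, D, side, side)
        grid = grid + self.proj7(grid) + self.proj5(grid) + self.proj3(grid)
        return grid.flatten(2).transpose(1, 2)  # back to (B, L, D)
```

A 16×16 grid of patch tokens from a ViT-style encoder (as used by BLIP2's image backbone at 224px) maps directly onto this reshape, so the module drops in between the patch embeddings and the aggregation step.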

Results on the MSCOCO dataset are shown in Figure 4. They reveal that the gains from PPEG are more noticeable on smaller datasets; as the dataset size increases, the performance difference becomes less significant. This is because, with limited data, models often struggle to discern latent, implicit patterns, so more sophisticated modules are needed to uncover deeper relationships within the data. Conversely, existing MLLMs are typically pretrained on extensive datasets, which tends to mitigate the impact of data scarcity. In practice, this shows that one can draw on MIL techniques to improve MLLM performance when data for the downstream task is insufficient.
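To illustrate that takeaway, the sketch below shows one hypothetical way to place the PPEG block from the earlier sketch in front of an attention-based aggregator, treating a single image's patches as MIL instances. The layer and argument names (MILPromptLayer, query_tokens) are ours for illustration, not the authors' actual API.

```python
import torch
import torch.nn as nn
# PPEG is the sketch defined above.

class MILPromptLayer(nn.Module):
    """Hypothetical QFormer-style layer: PPEG then cross-attention."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.ppeg = PPEG(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads,
                                                batch_first=True)

    def forward(self, query_tokens, patch_embeds):
        # One image per sample (N = 1), many patch instances: PPEG
        # injects spatial structure before the attention-based
        # aggregation into a fixed set of query tokens.
        instances = self.ppeg(patch_embeds)
        out, _ = self.cross_attn(query_tokens, instances, instances)
        return out

layer = MILPromptLayer(dim=768)
patches = torch.randn(2, 256, 768)  # 16x16 patch grid per image
queries = torch.randn(2, 32, 768)   # 32 query tokens, as in BLIP2
print(layer(queries, patches).shape)  # torch.Size([2, 32, 768])
```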

Authors:

(1) Wenliang Zhong, The University of Texas at Arlington ([email protected]);

(2) Wenyi Wu, Amazon ([email protected]);

(3) Qi Li, Amazon ([email protected]);

(4) Rob Barton, Amazon ([email protected]);

(5) Boxin Du, Amazon ([email protected]);

(6) Shioulin Sam, Amazon ([email protected]);

(7) Karim Bouyarmane, Amazon ([email protected]);

(8) Ismail Tutar, Amazon ([email protected]);

(9) Junzhou Huang, The University of Texas at Arlington ([email protected]).


This paper is available on arXiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.

