Cross-Model Validation: MIVPG's Efficacy on Encoder-Decoder vs. Decoder-Only LLMs

Written by instancing | Published 2025/11/19
Tech Story Tags: llms | multimodal-validation | mivpg-ablation | frozen-visual-encoder | decoder-only-llms | encoder-decoder-llms | computational-efficiency | csa-effectiveness

TL;DR: MIVPG's CSA module remains effective when switching from the encoder-decoder FLAN-T5-XL to the decoder-only OPT-2.7b LLM architecture.

Abstract and 1 Introduction

  2. Related Work

    2.1. Multimodal Learning

    2.2. Multiple Instance Learning

  3. Methodology

    3.1. Preliminaries and Notations

    3.2. Relations between Attention-based VPG and MIL

    3.3. MIVPG for Multiple Visual Inputs

    3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios

  4. Experiments and 4.1. General Setup

    4.2. Scenario 1: Samples with Single Image

    4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding

    4.4. Scenario 3: Samples with Multiple Images, with Each Image Having Multiple Patches to be Considered and 4.5. Case Study

  5. Conclusion and References

Supplementary Material

A. Detailed Architecture of QFormer

B. Proof of Proposition

C. More Experiments

C. More Experiments

We implemented the proposed method on NVIDIA A100 GPUs with BFloat16 precision. Apart from the number of training epochs mentioned in the main paper, all other hyperparameters were kept the same as in BLIP2[22]. For PatchGastricADC22[36] and ABO[7], we trained the model for 40 epochs.
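For concreteness, here is a minimal sketch of that training setup, assuming a Hugging Face Transformers BLIP-2 checkpoint. The MIVPG-specific modules are omitted, and the batch size and learning rate are illustrative assumptions rather than the authors' exact configuration.

```python
# Illustrative sketch of the training setup described above (not the authors' code).
import torch
from transformers import Blip2ForConditionalGeneration, TrainingArguments

model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xl",   # FLAN-T5-XL backbone used in the main paper
    torch_dtype=torch.bfloat16,      # BFloat16, as run on NVIDIA A100 GPUs
)

training_args = TrainingArguments(
    output_dir="./mivpg_runs",
    num_train_epochs=40,             # 40 epochs for PatchGastricADC22 / ABO
    bf16=True,                       # mixed-precision training in BFloat16
    per_device_train_batch_size=16,  # hypothetical; the paper keeps BLIP-2 defaults
    learning_rate=1e-5,              # hypothetical; the paper keeps BLIP-2 defaults
)
```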

C.1. Frozen Visual Models

In the original BLIP2[22], images are upscaled to 364 × 364 and, consequently, the ViT is unfrozen during fine-tuning. When training on the entire COCO training set, this approach yields slightly better performance, albeit at a higher computational cost.

In this section, we evaluate fine-tuning while keeping the ViT frozen and the image sizes unchanged. Experiment results are shown in Figure 8. We observed that with limited data, such as 50K samples, models exhibit comparable performance whether or not the visual encoder (ViT) is frozen. However, as the number of training epochs increases, the gap between the two settings varies: in some cases, unfreezing the ViT improves performance, while in others the opposite holds. Given that many real-world applications lack access to massive training data, freezing the ViT can be a more efficient approach while still maintaining similar performance levels.
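As a rough illustration of the frozen setting, the snippet below freezes the ViT parameters in a BLIP-2-style model so that only the Q-Former (and any MIVPG additions) receive gradient updates. It assumes the Hugging Face Blip2ForConditionalGeneration layout and is not the authors' code.

```python
# Minimal sketch of keeping the visual encoder (ViT) frozen during fine-tuning.
import torch
from transformers import Blip2ForConditionalGeneration

model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xl", torch_dtype=torch.bfloat16
)

# Freeze every ViT parameter so only the Q-Former (and any MIVPG modules) train.
for param in model.vision_model.parameters():
    param.requires_grad = False
model.vision_model.eval()  # keep the frozen encoder in inference mode (no dropout updates)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters with a frozen ViT: {trainable / 1e6:.1f}M")
```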

C.2. Case Study

In the main paper, we employ FLAN-T5-XL as the language model. Existing large language models can be broadly categorized into two types: encoder-decoder models and decoder-only models; FLAN-T5-XL falls into the former category. Decoder-only models are generally more computationally efficient, while encoder-decoder models can handle more sophisticated tasks. In this section, we assess the performance of MIVPG with a model from the decoder-only category. Specifically, we use BLIP2[22] with OPT-2.7b[47] as the base LLM and validate performance on the PatchGastricADC22 dataset. In these experiments, we replace only the LLM and keep all other hyperparameters unchanged.
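In practice, the swap amounts to loading the OPT-2.7b variant of BLIP-2 in place of the FLAN-T5-XL variant, as in this hedged sketch using the public Hugging Face checkpoints (the MIVPG modules and training loop are omitted).

```python
# Sketch of the LLM swap: same BLIP-2 pipeline, decoder-only OPT-2.7b backbone
# instead of the encoder-decoder FLAN-T5-XL. Not the authors' code.
import torch
from transformers import Blip2ForConditionalGeneration, Blip2Processor

LLM_VARIANTS = {
    "encoder-decoder": "Salesforce/blip2-flan-t5-xl",  # used in the main paper
    "decoder-only": "Salesforce/blip2-opt-2.7b",       # evaluated in this case study
}

backbone = LLM_VARIANTS["decoder-only"]
processor = Blip2Processor.from_pretrained(backbone)
model = Blip2ForConditionalGeneration.from_pretrained(backbone, torch_dtype=torch.bfloat16)
# Everything else (epochs, learning rate, image size, frozen modules) stays identical
# between the two runs; only the language-model backbone changes.
```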

The experiment results on PatchGastricADC22 with OPT-2.7b as the language model are presented in Table 4. Overall, the model continues to outperform the baselines shown in Table 1, underscoring the advantages of integrating MLLMs into the WSI captioning task. Notably, the model with CSA performs better than the one without it, reaffirming the effectiveness of CSA. It is also worth noting that OPT-2.7b does not outperform FLAN-T5-XL. This may be attributed, in part, to insufficient training data: since OPT-2.7b is relatively less sophisticated, more training data may be required to train a more powerful model.

C.3. More Visualization

This section provides additional visualization results on the ABO dataset, including both patch-level attention weights and image-level attention weights. In the patch-level attention weights, it is evident that the model excels in detecting the shapes of objects, as a significant portion of the patch-level weights is assigned to edges and contours. The image-level attention weights display maps for all twelve heads. Each row in a map represents a query, while each column represents an image. It’s important to note that different heads and queries exhibit varying attention patterns towards the images, demonstrating the diversity in how the model processes and attends to the input images.
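For readers who want to reproduce this kind of figure, here is a hypothetical plotting sketch. It assumes image-level cross-attention weights of shape (heads, queries, images) and uses random values as a stand-in for the real weights extracted from the model.

```python
# Hypothetical sketch: one heatmap per attention head, rows = learned queries,
# columns = input images, mirroring the image-level attention maps described above.
import numpy as np
import matplotlib.pyplot as plt

num_heads, num_queries, num_images = 12, 32, 8              # illustrative sizes
attn = np.random.rand(num_heads, num_queries, num_images)   # stand-in for real weights
attn = attn / attn.sum(axis=-1, keepdims=True)              # normalize over images per query

fig, axes = plt.subplots(3, 4, figsize=(12, 8))             # 12 heads in a 3x4 grid
for head, ax in enumerate(axes.flat):
    ax.imshow(attn[head], aspect="auto", cmap="viridis")
    ax.set_title(f"head {head}")
    ax.set_xlabel("image")
    ax.set_ylabel("query")
fig.tight_layout()
plt.show()
```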

Authors:

(1) Wenliang Zhong, The University of Texas at Arlington;

(2) Wenyi Wu, Amazon;

(3) Qi Li, Amazon;

(4) Rob Barton, Amazon;

(5) Boxin Du, Amazon;

(6) Shioulin Sam, Amazon;

(7) Karim Bouyarmane, Amazon;

(8) Ismail Tutar, Amazon;

(9) Junzhou Huang, The University of Texas at Arlington.


This paper is available on arXiv under the CC BY 4.0 Deed (Attribution 4.0 International) license.

