
Transductive Learning for Textual Few-Shot Classification: An Enhanced Experimental Setting


Too Long; Didn't Read

Few-shot classification involves training a model to perform a new classification task with only a handful of labeled examples.

This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Pierre Colombo, Equall, Paris, France & MICS, CentraleSupelec, Universite Paris-Saclay, France;

(2) Victor Pellegrain, IRT SystemX Saclay, France & MICS, CentraleSupelec, Universite Paris-Saclay, France;

(3) Malik Boudiaf, ÉTS Montreal, LIVIA, ILLS, Canada;

(4) Victor Storchan, Mozilla.ai, Paris, France;

(5) Myriam Tami, MICS, CentraleSupelec, Universite Paris-Saclay, France;

(6) Ismail Ben Ayed, ÉTS Montreal, LIVIA, ILLS, Canada;

(7) Celine Hudelot, MICS, CentraleSupelec, Universite Paris-Saclay, France;

(8) Pablo Piantanida, ILLS, MILA, CNRS, CentraleSupélec, Canada.

4 An Enhanced Experimental Setting

4.1 Datasets

Benchmarking the performance of FSL methods on diverse sets of datasets is critical for robustly evaluating their generalization capabilities as well as their potential for real-world applications. Previous work on FSL (Karimi Mahabadi et al., 2022; Perez et al., 2021) mainly focuses on datasets with a reduced number of classes (i.e., K < 5). Motivated by practical considerations, we build a new benchmark composed of datasets with a larger number of classes. Specifically, we choose Go Emotion (Demszky et al., 2020), Tweet Eval (Barbieri et al., 2020), Clinc (Larson et al., 2019), Banking (Casanueva et al., 2020) and the Multilingual Amazon Reviews Corpus (Keung et al., 2020). These datasets cover a wide range of text classification scenarios and span varying levels of difficulty[4]. A summary of the datasets used can be found in Tab. 1.
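The corpora above are distributed through the Datasets library referenced in footnote 4 (Lhoest et al., 2021). Below is a minimal loading sketch; the Hub identifiers and configuration names are assumptions on our part and are not specified in the paper.

```python
# Minimal sketch: loading the benchmark corpora through the Datasets library
# (Lhoest et al., 2021). Hub identifiers and configuration names are
# assumptions, not taken from the paper.
from datasets import load_dataset

go_emotions = load_dataset("go_emotions", "simplified")    # 27 emotions + neutral
tweet_eval  = load_dataset("tweet_eval", "emotion")        # one of the TweetEval tasks
clinc       = load_dataset("clinc_oos", "plus")            # 150 intents + out-of-scope
banking     = load_dataset("banking77")                    # 77 banking intents
amazon      = load_dataset("amazon_reviews_multi", "en")   # 1-5 star reviews, English split

print(banking["train"].features["label"].num_classes)      # -> 77
```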

4.2 Model Choice

The selection of an appropriate backbone model is a critical factor in achieving high performance in few-shot NLP tasks. To ensure the validity and robustness of our findings, we include a diverse range of transformer-based backbone models in our study:


  1. Three different sizes of RoBERTa-based models (Liu et al., 2019b). Similar to BERT, RoBERTa is pretrained using the cloze task (Taylor, 1953). We consider two sizes of the RoBERTa model, namely RoBERTa (B) with 124M parameters and RoBERTa (L) with 355M parameters, as well as DistilRoBERTa, a lighter version of RoBERTa trained through a distillation process (Hinton et al., 2015), with a total of 82M parameters.


  2. Three sentence-transformers encoders (Reimers and Gurevych, 2019). Following (Muennighoff et al., 2022), we consider MPNET-base (Song et al., 2020), MiniLM (Wang et al., 2020), and Albert Small V2 (Lan et al., 2019).


  3. Multilingual models. To address realistic multilingual scenarios, we rely on three sizes of XLM-RoBERTa (Conneau et al., 2020, 2019): base (B), large (L) and XL (XL).


  4. text-davinci model: to mimic the typical setting of API-based models, we also conduct experiments on text-davinci, which is only accessible through OpenAI's API.
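To illustrate how heterogeneous backbones can feed the same few-shot pipeline, the sketch below embeds a few sentences with a sentence-transformers encoder and with a RoBERTa checkpoint via mean pooling. The checkpoint names and the pooling choice are our assumptions, not details taken from the paper.

```python
# Minimal sketch: producing sentence embeddings with two of the backbone
# families above. Checkpoint names are assumptions, not taken from the paper.
import torch
from transformers import AutoModel, AutoTokenizer
from sentence_transformers import SentenceTransformer

texts = ["activate my card please", "what is my account balance?"]

# (a) sentence-transformers encoder (Reimers and Gurevych, 2019)
st_encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
st_embeddings = st_encoder.encode(texts, convert_to_tensor=True)   # shape [2, 768]

# (b) RoBERTa backbone with mean pooling over token embeddings
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
roberta = AutoModel.from_pretrained("roberta-base")
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = roberta(**batch).last_hidden_state                    # [2, seq_len, 768]
mask = batch["attention_mask"].unsqueeze(-1)                        # ignore padding tokens
roberta_embeddings = (hidden * mask).sum(1) / mask.sum(1)           # [2, 768]
```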


4.3 Evaluation Framework

Prior research in textual FSL typically involves sampling a small number of tasks, usually fewer than 10, from each dataset. In contrast, we utilize an episodic learning framework that generates a large number of N-shot K-way tasks. This framework has gained popularity through inductive meta-learning approaches, such as those proposed by (Finn et al., 2017b; Snell et al., 2017; Vinyals et al., 2016; Sung et al., 2018a; Mishra et al., 2017; Rusu et al., 2019; Oreshkin et al., 2018), as it mimics the few-shot environment during evaluation and improves model robustness and generalization. In this context, episodic training implies that a different model is initialized for each generated few-shot task, and all tasks are processed independently in parallel. This approach allows the computation of more reliable performance statistics, since the generalization capabilities of each method are evaluated on a more diverse set of tasks. To account for the model's generalization ability, we average the results for each dataset over 1000 episodes, with the considered classes varying in every episode. For each experiment, we report the F1-score.
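To make the episodic protocol concrete, here is a minimal sketch of how few-shot episodes could be sampled from a labeled pool and how a per-episode score could be averaged over many tasks. The function names, parameter defaults, and the evaluate_episode placeholder are hypothetical illustrations, not the authors' implementation.

```python
# Minimal sketch of the episodic evaluation protocol: sample many few-shot
# tasks, evaluate each one independently, and average the F1-score.
# `evaluate_episode` is a hypothetical placeholder for any few-shot method.
import random
from collections import defaultdict
from statistics import mean

def sample_episode(texts, labels, n_classes, n_shots, n_query):
    """Sample one episode: n_classes classes, n_shots support and n_query
    query examples per class, drawn without overlap."""
    by_class = defaultdict(list)
    for x, y in zip(texts, labels):
        by_class[y].append(x)
    classes = random.sample(sorted(by_class), n_classes)   # classes change every episode
    support, query = [], []
    for c in classes:
        picked = random.sample(by_class[c], n_shots + n_query)
        support += [(x, c) for x in picked[:n_shots]]
        query += [(x, c) for x in picked[n_shots:]]
    return support, query

def run_benchmark(texts, labels, evaluate_episode, n_episodes=1000,
                  n_classes=5, n_shots=5, n_query=15):
    """Average the per-episode F1-score over many independently sampled tasks."""
    scores = []
    for _ in range(n_episodes):
        support, query = sample_episode(texts, labels, n_classes, n_shots, n_query)
        scores.append(evaluate_episode(support, query))     # returns an F1-score
    return mean(scores)
```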




[4] Datasets are available through the Datasets library (Lhoest et al., 2021).