Authors:
(1) Troisemaine Colin, Department of Computer Science, IMT Atlantique, Brest, France, and Orange Labs, Lannion, France;
(2) Reiffers-Masson Alexandre, Department of Computer Science, IMT Atlantique, Brest, France;
(3) Gosselin Stephane, Orange Labs, Lannion, France;
(4) Lemaire Vincent, Orange Labs, Lannion, France;
(5) Vaton Sandrine, Department of Computer Science, IMT Atlantique, Brest, France.
The setup of NCD [2], which involves both labeled and unlabeled data, can make it difficult to distinguish from the many other domains that revolve around similar concepts. In this section, we review some of the most closely related domains and try to highlight their key differences in order to provide the reader with a clear and comprehensive understanding of the NCD domain.
Semi-supervised Learning is another domain at the frontier between supervised and unsupervised learning. Specifically, a labeled set is given alongside an unlabeled set whose instances are assumed to come from the same classes. Semi-supervised Learning can be particularly useful when labeled data is scarce or annotation is expensive. As unlabeled data is generally available in large quantities, the goal is to exploit it to obtain the best possible generalization performance given limited labeled data.
The main difference with NCD is that all the classes are known in advance. Some works have shown that the presence of novel classes in the unlabeled set negatively impacts the performance of Semi-Supervised Learning models [14, 15]. As these works do not attempt to discover the novel classes, they are not applicable to NCD.
Transfer Learning aims to solve a problem faster or with better performance by leveraging knowledge from a different problem. It is commonly applied in computer vision by pre-training models on ImageNet [1]. Transfer Learning can be either cross-domain, where a model trained on a given dataset is fine-tuned to perform the same task on a different (but related) dataset, or cross-task, where a model that can distinguish some classes is re-trained on other classes of the same domain.
NCD can be viewed as a cross-task Transfer Learning problem where the knowledge from a classification task on a source dataset is transferred to a clustering task on a target dataset. But unlike NCD, Transfer Learning typically requires the target spaces of both sets to be known in advance. Initially, NCD was characterized as a Transfer Learning problem (e.g. in DTC [16] and MCL [17]) and the training was done in two stages: first on the labeled set and then on the unlabeled set. This methodology seemed natural since, as with Transfer Learning, both sets are not available at the same time.
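To make this two-stage formulation concrete, the sketch below illustrates a generic transfer-style baseline: an encoder and classifier are first trained on the labeled known classes, then the frozen encoder projects the unlabeled set, which is clustered with k-means. The dimensions, class counts and hyper-parameters are illustrative assumptions, not the exact procedure of DTC or MCL.

```python
# Hypothetical two-stage "transfer" baseline for NCD: supervised pre-training
# on the known classes, then clustering of the unlabeled set in the learned
# latent space. All shapes and hyper-parameters are illustrative only.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

encoder = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 32))
classifier = nn.Linear(32, 5)  # 5 known classes (assumed)
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(classifier.parameters()), lr=1e-3
)

# Stage 1: supervised training on the labeled set (x_lab, y_lab).
x_lab, y_lab = torch.randn(500, 20), torch.randint(0, 5, (500,))
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(classifier(encoder(x_lab)), y_lab)
    loss.backward()
    opt.step()

# Stage 2: cluster the unlabeled set in the frozen latent space.
x_unlab = torch.randn(300, 20)
with torch.no_grad():
    z = encoder(x_unlab).numpy()
pred_novel = KMeans(n_clusters=3, n_init=10).fit_predict(z)  # 3 novel classes (assumed)
```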
Generalized Category Discovery (GCD) was first introduced by [11] and has also attracted attention from the community [12, 18, 19]. It can be seen as a less constrained alternative to NCD, since it does not rely on the assumption that samples belong exclusively to the novel classes during inference. However, this is a more difficult problem, as the models must not only cluster the novel classes, but also accurately differentiate between known and novel classes while correctly classifying samples from the known classes.
Some notable works in this area include ORCA [20] and OpenCon [21]. ORCA trains a discriminative representation by balancing a supervised loss on the known classes with an unsupervised pairwise loss on the unlabeled data. OpenCon proposes a contrastive learning framework that employs Out-Of-Distribution strategies to separate known from novel classes; its clustering strategy is based on moving prototypes that enable the definition of positive and negative pairs of instances.
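The sketch below gives a rough idea of the kind of objective ORCA balances: a supervised cross-entropy term on the labeled samples combined with a pairwise term on the unlabeled samples, whose targets are derived from feature similarity. The `margin` threshold and the agreement-based pairwise prediction are illustrative assumptions, not ORCA's exact loss.

```python
# Rough sketch of a combined supervised + pairwise objective in the spirit of
# ORCA. This is illustrative only, not the loss used in the original paper.
import torch
import torch.nn.functional as F

def combined_loss(logits_lab, y_lab, feats_unlab, logits_unlab, margin=0.95):
    # Supervised cross-entropy on the known classes.
    sup = F.cross_entropy(logits_lab, y_lab)

    # Pairwise pseudo-labels: pairs whose normalized features are very similar
    # are treated as positives (same class), the rest as negatives.
    z = F.normalize(feats_unlab, dim=1)
    target = (z @ z.t() > margin).float()

    # The pairwise prediction is the agreement between class posteriors.
    p = F.softmax(logits_unlab, dim=1)
    agreement = (p @ p.t()).clamp(1e-7, 1 - 1e-7)
    pairwise = F.binary_cross_entropy(agreement, target)
    return sup + pairwise
```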
Novel Class Discovery has a rich body of work in the domain of computer vision. Early works approached this problem in a two-stage manner. Some define a latent space using only the known classes and project the unlabeled data into it (DTC [16] and MM [22]). Others train a pairwise labeling model on the known classes and use it to label and then cluster the novel classes (CCN [3] and MCL [17]). Both of these approaches, however, suffered from overfitting on the known data when the high-level features were not fully shared between the known and novel classes.
Today, to alleviate this overfitting, the majority of approaches are one-stage and try to transfer knowledge from labeled to unlabeled data by learning a shared representation. In this category, AutoNovel [4] is one of the most influential works. After pre-training the latent representation with self-supervised learning [23], two classification networks are trained jointly. The first simply learns to distinguish the known classes from the ground-truth labels, while the other learns to separate the unlabeled data using pseudo-labels defined at each epoch based on pairwise similarity. NCL [6] adopts the same architecture as AutoNovel and extends the loss with a contrastive learning term that encourages the separation of the novel classes. OpenMix [5] utilizes the MixUp strategy to generate more robust pseudo-labels.
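The shared-encoder, dual-head design described above can be sketched as follows. The dimensions are illustrative (AutoNovel uses a convolutional backbone), and the pairwise pseudo-labels are assumed to be computed beforehand, e.g. from ranking statistics, for every pair in the unlabeled batch.

```python
# Minimal sketch of a shared-encoder / dual-head architecture in the spirit of
# AutoNovel. Illustrative dimensions and loss weighting, not the original model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualHeadNet(nn.Module):
    def __init__(self, in_dim=512, n_known=5, n_novel=3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU())
        self.head_known = nn.Linear(128, n_known)   # supervised head
        self.head_novel = nn.Linear(128, n_novel)   # clustering head

    def forward(self, x):
        z = self.encoder(x)
        return self.head_known(z), self.head_novel(z), z

def joint_loss(model, x_lab, y_lab, x_unlab, pair_targets):
    logits_known, _, _ = model(x_lab)
    _, logits_novel, _ = model(x_unlab)
    # Supervised term on the known classes.
    l_sup = F.cross_entropy(logits_known, y_lab)
    # Pairwise term on the unlabeled data, against pseudo-labels computed
    # beforehand (e.g. from ranking statistics) for every pair in the batch.
    p = F.softmax(logits_novel, dim=1)
    agreement = (p @ p.t()).clamp(1e-7, 1 - 1e-7)
    l_pair = F.binary_cross_entropy(agreement, pair_targets)
    return l_sup + l_pair
```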
As noted before, although these methods have achieved some success, they are not applicable to tabular data. To date, and to the best of our knowledge, only TabularNCD [9] tackles this problem. Also inspired by AutoNovel, it pre-trains a dense-layer autoencoder with self-supervised learning and adopts the same loss terms and dual-classifier architecture. Pseudo-labels are defined between pairs of unlabeled instances by checking whether they are among the most similar pairs.
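As an illustration of this kind of pairwise pseudo-labeling, the sketch below marks a pair of unlabeled instances as positive when its similarity in the latent space ranks among the top fraction of all pairs in the batch; the `top_fraction` threshold is a hypothetical hyper-parameter, not the exact criterion used in TabularNCD.

```python
# Illustrative pairwise pseudo-labeling for tabular instances: a pair is
# pseudo-labeled as positive ("same novel class") when its cosine similarity
# is among the top fraction of all pairwise similarities in the batch.
import torch
import torch.nn.functional as F

def pairwise_pseudo_labels(latent, top_fraction=0.05):
    z = F.normalize(latent, dim=1)
    sim = z @ z.t()                                  # cosine similarities
    n = sim.size(0)
    mask = ~torch.eye(n, dtype=torch.bool)           # ignore self-pairs
    threshold = torch.quantile(sim[mask], 1.0 - top_fraction)
    return (sim >= threshold).float()                # 1 = positive pair
```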
For a more complete overview of the state-of-the-art of NCD, we refer the reader to the survey [2].
This paper is available on arXiv under CC 4.0 license.