A Practical Approach to Novel Class Discovery in Tabular Data: Appendix by@dataology



Too Long; Didn't Read

This article showcases advancements in Novel Class Discovery (NCD) algorithms like PBN and innovative hyperparameter tuning methods, enabling the successful resolution of NCD problems even in realistic scenarios without prior knowledge of novel classes.

Authors:

(1) Troisemaine Colin, Department of Computer Science, IMT Atlantique, Brest, France, and Orange Labs, Lannion, France;

(2) Reiffers-Masson Alexandre, Department of Computer Science, IMT Atlantique, Brest, France;

(3) Gosselin Stephane, Orange Labs, Lannion, France;

(4) Lemaire Vincent, Orange Labs, Lannion, France;

(5) Vaton Sandrine, Department of Computer Science, IMT Atlantique, Brest, France.

Abstract and Intro

Related work

Approaches

Hyperparameter optimization

Estimating the number of novel classes

Full training procedure

Experiments

Conclusion

Declarations

References

Appendix A: Additional result metrics

Appendix B: Hyperparameters

Appendix C: Cluster Validity Indices numerical results

Appendix D: NCD k-means centroids convergence study

Appendix A Additional result metrics




Appendix B Hyperparameters

Table B3 shows the hyperparameters found by the full tuning procedure described in Section 6.


Appendix C Cluster Validity Indices numerical results

An estimate of the number of clusters in the 7 datasets considered in this paper can be found in Table C4. Among the 6 CVIs reported here, the Silhouette coefficient performed the best. Furthermore, compared to the original feature space, its average estimation error decreased significantly in the latent space, validating our approach. For some datasets, the Davies-Bouldin index continued to decrease and the Dunn index continued to increase as the number of clusters grew, resulting in very large overestimations. Note that the estimates of the number of novel classes in Table C4 are not needed in the experiments of Section 7.2.2, since Algorithm 1 directly incorporates such estimates into the training procedure. This table only helped us identify the most appropriate CVI for our problem. The only exception is the TabularNCD method, which requires an a priori estimate of the number of novel classes in the original feature space.
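To make the selection procedure concrete, the sketch below estimates the number of clusters by maximizing the Silhouette coefficient (the best-performing CVI above) over a range of candidate values of k. This is a minimal illustration using scikit-learn, not the paper's exact implementation; the function name, the search range, and the choice of plain k-means for clustering are our own assumptions, and the input `z` would be the latent representations produced by PBN.

```python
# Minimal sketch (not the paper's code): estimate the number of clusters
# by picking the k with the highest Silhouette coefficient.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def estimate_n_clusters(z, k_min=2, k_max=15, seed=0):
    """Return the k in [k_min, k_max] that maximizes the silhouette score.

    z: (n_samples, n_features) array, e.g. latent representations.
    """
    best_k, best_score = k_min, -1.0  # silhouette is in [-1, 1]
    for k in range(k_min, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(z)
        score = silhouette_score(z, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```

The same loop can be reused with other CVIs (e.g. Davies-Bouldin via `sklearn.metrics.davies_bouldin_score`, minimizing instead of maximizing), which is how the indices in Table C4 can be compared on equal footing.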



Table C4: An estimation of the number of novel classes with some CVIs in the latent space of PBN.


Appendix D NCD k-means centroids convergence study

In this appendix, we aim to determine how to achieve the best performance with NCD k-means. Specifically, after the centroid initialization described in Section 3.2, we investigate: (1) whether it is more effective to update the centroids of both known and novel classes, or only the centroids of novel classes; (2) whether the centroids need to be updated using data from both known and novel classes, or only using data from novel classes. The results are presented in Table D5 and show that for 5 out of 7 datasets, the best results are obtained when only the centroids of the novel classes are updated on the unlabeled data. Updating the centroids of the known classes always leads to worse performance, as the class labels are not used in this process. Thus, the centroids of the known classes run the risk of capturing data from the novel classes (and vice versa).
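The best-performing configuration above (novel-class centroids updated on unlabeled data only, known-class centroids frozen) can be sketched as a small modification of Lloyd's algorithm. This is an illustrative reimplementation under our own assumptions, not the paper's code: known centroids are taken as labeled-class means, and novel centroids are initialized from random unlabeled points rather than the Section 3.2 initialization.

```python
# Hypothetical sketch of the best variant in Table D5: known-class
# centroids are fixed (labeled-class means); only novel-class centroids
# are updated, and only on the unlabeled data.
import numpy as np

def ncd_kmeans(x_lab, y_lab, x_unlab, n_novel, n_iter=50, seed=0):
    rng = np.random.RandomState(seed)
    # Frozen known-class centroids: per-class means of the labeled data.
    known = np.array([x_lab[y_lab == c].mean(axis=0) for c in np.unique(y_lab)])
    # Novel centroids initialized from random unlabeled points
    # (the paper's Section 3.2 initialization differs).
    novel = x_unlab[rng.choice(len(x_unlab), n_novel, replace=False)].copy()
    for _ in range(n_iter):
        centroids = np.vstack([known, novel])
        # Assign each unlabeled point to its nearest centroid (known or novel).
        dist = ((x_unlab[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = dist.argmin(axis=1)
        # Update only the novel centroids, using only unlabeled members.
        for j in range(n_novel):
            members = x_unlab[assign == len(known) + j]
            if len(members):
                novel[j] = members.mean(axis=0)
    return known, novel, assign
```

Keeping the known centroids frozen mirrors the finding above: since the unsupervised updates ignore class labels, letting known centroids drift risks absorbing novel-class points (and vice versa).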



Table D5: ACC of NCD k-means averaged over 10 runs.





This paper is available on arxiv under CC 4.0 license.