Authors:
(2) Troisemaine Colin, Department of Computer Science, IMT Atlantique, Brest, France, and Orange Labs, Lannion, France;
(2) Reiffers-Masson Alexandre, Department of Computer Science, IMT Atlantique, Brest, France;
(3) Gosselin Stephane, Orange Labs, Lannion, France;
(4) Lemaire Vincent, Orange Labs, Lannion, France;
(5) Vaton Sandrine, Department of Computer Science, IMT Atlantique, Brest, France.
Estimating the number of novel classes
Appendix A: Additional result metrics
Appendix C: Cluster Validity Indices numerical results
Appendix D: NCD k-means centroids convergence study
Appendix B: Hyperparameters

Table B3 shows the hyperparameters found by the full procedure described in Section 6.
Appendix C: Cluster Validity Indices numerical results

Estimates of the number of clusters in the 7 datasets considered in this paper can be found in Table C4. Among the 6 CVIs reported here, the Silhouette coefficient performed the best. Furthermore, compared to the original feature space, its average estimation error decreased significantly in the latent space, validating our approach. For some datasets, the Davies-Bouldin index kept decreasing and the Dunn index kept increasing as the number of clusters grew, resulting in very large overestimations. Note that the estimates of the number of novel classes in Table C4 are not needed in the experiments of Section 7.2.2, since Algorithm 1 directly incorporates this estimation step into the training procedure; this table only served to identify the most appropriate CVI for our problem. The only exception is the TabularNCD method, which requires an a priori estimate of the number of novel classes in the original feature space.
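To make this selection procedure concrete, the sketch below estimates the number of novel classes by clustering a latent projection of the unlabeled data for several candidate values of k and keeping the one that maximizes the Silhouette coefficient. It is a minimal illustration rather than the paper's exact implementation: the variable Z_unlab, the candidate range and the k-means settings are assumptions.

```python
# Minimal sketch (assumed names, not the paper's code): estimate the number of
# novel classes by clustering the latent projection of the unlabeled data for
# several candidate k and keeping the k with the highest Silhouette coefficient.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def estimate_n_novel_classes(Z_unlab, k_range=range(2, 21), seed=0):
    """Z_unlab: (n_samples, latent_dim) array of encoded unlabeled instances."""
    best_k, best_score = None, -np.inf
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Z_unlab)
        score = silhouette_score(Z_unlab, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Example with synthetic data standing in for a latent projection:
# Z_unlab = np.random.randn(500, 16).astype(np.float32)
# print(estimate_n_novel_classes(Z_unlab))
```

The same loop can be run with any other CVI (e.g. Davies-Bouldin or Dunn) by swapping the scoring function, which is how the indices compared in Table C4 can be evaluated.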
Appendix D: NCD k-means centroids convergence study

In this appendix, we aim to determine how to achieve the best performance with NCD k-means. Specifically, after the centroid initialization described in Section 3.2, we investigate (1) whether it is more effective to update the centroids of both the known and novel classes, or only the centroids of the novel classes, and (2) whether the centroids should be updated using data from both the known and novel classes, or only data from the novel classes. The results are reported in Table D5 and show that for 5 of the 7 datasets, the best performance is obtained when only the centroids of the novel classes are updated, using only the unlabeled data. Updating the centroids of the known classes consistently degrades performance, since the class labels are not used in this process: the centroids of the known classes risk capturing data from the novel classes (and vice versa).
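For illustration, the sketch below implements the configuration that performed best in this study: the known-class centroids are frozen at the labeled class means, and only the novel-class centroids are refined on the unlabeled data. This is a simplified sketch under stated assumptions, not the exact procedure of Section 3.2: the names (ncd_kmeans, X_lab, y_lab, X_unlab, n_novel) are illustrative, and k-means++ stands in for the paper's novel-centroid initialization.

```python
# Simplified sketch of the best-performing NCD k-means configuration (Table D5):
# known-class centroids are frozen at the labeled class means, and only the
# novel-class centroids are updated, using only the unlabeled data. Names are
# illustrative assumptions; k-means++ approximates the initialization of Section 3.2.
import numpy as np
from sklearn.cluster import kmeans_plusplus

def ncd_kmeans(X_lab, y_lab, X_unlab, n_novel, n_iter=100, seed=0):
    # Frozen centroids of the known classes: per-class means of the labeled data.
    known = np.stack([X_lab[y_lab == c].mean(axis=0) for c in np.unique(y_lab)])
    # Initial centroids of the novel classes, chosen on the unlabeled data.
    novel, _ = kmeans_plusplus(X_unlab, n_clusters=n_novel, random_state=seed)

    for _ in range(n_iter):
        centroids = np.vstack([known, novel])
        # Assign each unlabeled point to its nearest centroid (known or novel).
        dists = np.linalg.norm(X_unlab[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update only the novel centroids; the known centroids stay fixed.
        new_novel = novel.copy()
        for j in range(n_novel):
            members = X_unlab[assign == len(known) + j]
            if len(members) > 0:
                new_novel[j] = members.mean(axis=0)
        if np.allclose(new_novel, novel):
            break
        novel = new_novel

    return known, novel, assign
```

Keeping the known centroids fixed is what prevents them from drifting onto unlabeled points of the novel classes, which is the failure mode described above for the fully updated variant.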
This paper is available on arXiv under a CC 4.0 license.