Generating Private, High-Utility Tabular Data with Masked Language Models

Written by languagemodels | Published 2025/04/08
Tech Story Tags: masked-language-modeling-(mlm) | synthetic-data-generation | conditional-density-estimation | tabular-data | machine-learning-utility-(mlu) | non-parametric-estimation | histogram-based-methods | data-imputation

TL;DR: This paper introduces MaCoDE, a method that reframes masked language modeling as conditional density estimation for generating synthetic tabular data. It achieves high machine learning utility, handles missing data, allows privacy control, and outperforms state-of-the-art methods on multiple real-world datasets.

Table of Links

  1. Abstract & Introduction

  2. Proposal

    1. Classification Target
    2. Masked Conditional Density Estimation (MaCoDE)
  3. Theoretical Results

    1. With Missing Data
  4. Experiments

  5. Results

    1. Related Works
    2. Conclusions and Limitations
    3. References
  6. A1 Proof of Theorem 1

    1. A2 Proof of Proposition 1
    2. A3 Dataset Descriptions
  7. A4 Missing Mechanism

    1. A5 Experimental Settings for Reproduction
  8. A6 Additional Experiments

  9. A7 Detailed Experimental Results

2.3 Theoretical Results

Remark 2. Since we constrain the values of continuous columns to the interval [0, 1], our proposed approach and Theorem 1 resemble copula density estimation methods [22, 27, 9, 2]. However, given the heterogeneous nature of tabular data [3, 43], handling categorical variables is essential, and categorical marginal distributions do not guarantee the uniqueness of the copula [44]. Therefore, instead of the copula density estimation method, we adopt a multi-class classification approach to density estimation.
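To make the idea concrete, here is a minimal sketch (not the authors' implementation) of the two ingredients the remark refers to: mapping a continuous column into [0, 1] via its empirical CDF, and discretizing that column into bins so that density estimation reduces to multi-class classification over bin labels. The function names and the bin count are illustrative choices, not from the paper.

```python
import numpy as np

def to_unit_interval(col: np.ndarray) -> np.ndarray:
    """Map a continuous column to (0, 1) via its empirical CDF (ranks)."""
    ranks = np.argsort(np.argsort(col))           # 0 .. n-1
    return (ranks + 1) / (len(col) + 1)           # strictly inside (0, 1)

def discretize(u: np.ndarray, n_bins: int = 20) -> np.ndarray:
    """Turn the (0, 1)-valued column into class labels. Predicting a
    probability for each bin then amounts to a histogram-style estimate
    of the conditional density, framed as multi-class classification."""
    return np.minimum((u * n_bins).astype(int), n_bins - 1)

rng = np.random.default_rng(0)
x = rng.normal(size=1000)       # a toy continuous column
u = to_unit_interval(x)
labels = discretize(u, n_bins=20)
```

Because each column is handled through its own marginal ranks and bin labels, categorical columns need no special transformation: their categories are already class labels, which is why the classification view sidesteps the copula-uniqueness issue mentioned above.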

2.4 With Missing Data

In (6), the missingness pattern is described by min(m, r), meaning that the missing data model is determined by the joint distribution p(m, r|x). Since the masking vector defined in Definition 2 satisfies the conditions of Proposition 1, the missingness pattern of our proposed model also conforms to the MAR mechanism whenever the given data is MAR. This implies that our inference strategies based on the observed dataset are justified [34]. Nevertheless, we empirically demonstrate in Section 3 that our proposed model remains applicable under other missing data scenarios as well.
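As a small illustration of what the MAR assumption means here (a toy sketch, not the paper's experimental setup), the snippet below generates a missingness mask for one column whose probability of being missing depends only on another, always-observed column. Variable names and the logistic missingness model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
x1 = rng.normal(size=n)            # always observed
x2 = rng.normal(size=n)            # subject to missingness

# MAR: the probability that x2 is missing depends only on the observed
# variable x1, never on the (possibly unobserved) value of x2 itself.
p_miss = 1.0 / (1.0 + np.exp(-x1))  # logistic in x1
mask = rng.random(n) < p_miss       # True = missing
x2_obs = np.where(mask, np.nan, x2)
```

Under this mechanism, inference from the observed entries alone can be justified; if p_miss instead depended on x2 (an MNAR mechanism), that justification would no longer hold, which is why the robustness to other scenarios is checked empirically rather than claimed theoretically.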

Authors:

(1) Seunghwan An, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);

(2) Gyeongdong Woo, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);

(3) Jaesung Lim, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);

(4) ChangHyun Kim, Department of Statistical Data Science, University of Seoul, S. Korea ([email protected]);

(5) Sungchul Hong, Department of Statistics, University of Seoul, S. Korea ([email protected]);

(6) Jong-June Jeon (corresponding author), Department of Statistics, University of Seoul, S. Korea ([email protected]).


This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.

