From Raw to Refined: Understanding Preprocessing, Cleaning, and Labeling in Data Preparation

Authors: (1) TIMNIT GEBRU, Black in AI; (2) JAMIE MORGENSTERN, University of Washington; (3) BRIANA VECCHIONE, Cornell University; (4) JENNIFER WORTMAN VAUGHAN, Microsoft Research; (5) HANNA WALLACH, Microsoft Research; (6) HAL DAUMÉ III, Microsoft Research; University of Maryland; (7) KATE CRAWFORD, Microsoft Research. Table of Links 1 Introduction 1.1 Objectives 2 Development Process 3 Questions and Workflow 3.1 Motivation 3.2 Composition 3.3 Collection Process 3.4 Preprocessing/cleaning/labeling 3.5 Uses 3.6 Distribution 3.7 Maintenance 4 Impact and Challenges Acknowledgments and References Appendix 3.4 Preprocessing/cleaning/labeling Dataset creators should read through these questions prior to any preprocessing, cleaning, or labeling and then provide answers once these tasks are complete. The questions in this section are intended to provide dataset consumers with the information they need to determine whether the “raw” data has been processed in ways that are compatible with their chosen tasks. For example, text that has been converted into a “bag-of-words” is not suitable for tasks involving word order. • Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? If so, please provide a description. If not, you may skip the remaining questions in this section. • Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the “raw” data. • Is the software that was used to preprocess/clean/label the data available? If so, please provide a link or other access point. • Any other comments? This paper is available on arxiv under CC 4.0 license. Authors: (1) TIMNIT GEBRU, Black in AI; (2) JAMIE MORGENSTERN, University of Washington; (3) BRIANA VECCHIONE, Cornell University; (4) JENNIFER WORTMAN VAUGHAN, Microsoft Research; (5) HANNA WALLACH, Microsoft Research; (6) HAL DAUMÉ III, Microsoft Research; University of Maryland; (7) KATE CRAWFORD, Microsoft Research. Authors: Authors: (1) TIMNIT GEBRU, Black in AI; (2) JAMIE MORGENSTERN, University of Washington; (3) BRIANA VECCHIONE, Cornell University; (4) JENNIFER WORTMAN VAUGHAN, Microsoft Research; (5) HANNA WALLACH, Microsoft Research; (6) HAL DAUMÉ III, Microsoft Research; University of Maryland; (7) KATE CRAWFORD, Microsoft Research. Table of Links 1 Introduction 1 Introduction 1.1 Objectives 1.1 Objectives 2 Development Process 2 Development Process 3 Questions and Workflow 3 Questions and Workflow 3.1 Motivation 3.1 Motivation 3.2 Composition 3.2 Composition 3.3 Collection Process 3.3 Collection Process 3.4 Preprocessing/cleaning/labeling 3.4 Preprocessing/cleaning/labeling 3.5 Uses 3.5 Uses 3.6 Distribution 3.6 Distribution 3.7 Maintenance 3.7 Maintenance 4 Impact and Challenges 4 Impact and Challenges Acknowledgments and References Acknowledgments and References Appendix Appendix 3.4 Preprocessing/cleaning/labeling Dataset creators should read through these questions prior to any preprocessing, cleaning, or labeling and then provide answers once these tasks are complete. The questions in this section are intended to provide dataset consumers with the information they need to determine whether the “raw” data has been processed in ways that are compatible with their chosen tasks. For example, text that has been converted into a “bag-of-words” is not suitable for tasks involving word order. • Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? If so, please provide a description. If not, you may skip the remaining questions in this section. • Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? • Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the “raw” data. • Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? • Is the software that was used to preprocess/clean/label the data available? If so, please provide a link or other access point. • Is the software that was used to preprocess/clean/label the data available? • Any other comments? • Any other comments? This paper is available on arxiv under CC 4.0 license. This paper is available on arxiv under CC 4.0 license. available on arxiv