Ensuring Dataset Health: Strategies for Effective Maintenance and Support

Authors:

(1) TIMNIT GEBRU, Black in AI;
(2) JAMIE MORGENSTERN, University of Washington;
(3) BRIANA VECCHIONE, Cornell University;
(4) JENNIFER WORTMAN VAUGHAN, Microsoft Research;
(5) HANNA WALLACH, Microsoft Research;
(6) HAL DAUMÉ III, Microsoft Research; University of Maryland;
(7) KATE CRAWFORD, Microsoft Research.

Table of Links

1 Introduction
1.1 Objectives
2 Development Process
3 Questions and Workflow
3.1 Motivation
3.2 Composition
3.3 Collection Process
3.4 Preprocessing/cleaning/labeling
3.5 Uses
3.6 Distribution
3.7 Maintenance
4 Impact and Challenges
Acknowledgments and References
Appendix

3.7 Maintenance

As with the questions in the previous section, dataset creators should provide answers to these questions prior to distributing the dataset. The questions in this section are intended to encourage dataset creators to plan for dataset maintenance and communicate this plan to dataset consumers.

• Who will be supporting/hosting/maintaining the dataset?
• How can the owner/curator/manager of the dataset be contacted (e.g., email address)?
• Is there an erratum? If so, please provide a link or other access point.
• Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)? If so, please describe how often, by whom, and how updates will be communicated to dataset consumers (e.g., mailing list, GitHub)?
• If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were the individuals in question told that their data would be retained for a fixed period of time and then deleted)? If so, please describe these limits and explain how they will be enforced.
• Will older versions of the dataset continue to be supported/hosted/maintained? If so, please describe how. If not, please describe how its obsolescence will be communicated to dataset consumers.
• If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? If so, please provide a description. Will these contributions be validated/verified? If so, please describe how. If not, why not? Is there a process for communicating/distributing these contributions to dataset consumers? If so, please provide a description.
• Any other comments?

This paper is available on arXiv under CC 4.0 license.
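
One way a dataset creator might track answers to the maintenance questions before distribution is as a small structured record. The following is a minimal sketch in Python, not part of the datasheet framework itself: the class name `MaintenancePlan`, its field names, the helper `unanswered`, and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields
from typing import List, Optional, Tuple


@dataclass
class MaintenancePlan:
    """Hypothetical record of answers to the maintenance questions."""
    maintainer: Optional[str] = None            # who supports/hosts/maintains
    contact: Optional[str] = None               # how to reach the owner/curator
    erratum: Optional[str] = None               # link to an erratum, or "None"
    update_policy: Optional[str] = None         # how often, by whom, how announced
    retention_limits: Optional[str] = None      # limits for data about people
    old_version_policy: Optional[str] = None    # support or obsolescence notice
    contribution_process: Optional[str] = None  # mechanism, validation, distribution
    other_comments: Optional[str] = None        # free-form, may stay empty


def unanswered(
    plan: MaintenancePlan,
    optional: Tuple[str, ...] = ("other_comments",),
) -> List[str]:
    """Return the names of required fields that are still empty."""
    missing = []
    for f in fields(plan):
        if f.name in optional:
            continue
        value = getattr(plan, f.name)
        if value is None or not value.strip():
            missing.append(f.name)
    return missing


# Illustrative values only; none of these come from a real dataset.
plan = MaintenancePlan(
    maintainer="Example Lab data team",
    contact="datasets@example.org",
    update_policy="Quarterly label fixes, announced on a mailing list",
)
print(unanswered(plan))
```

In this sketch a negative answer (for example, "No erratum exists") is recorded as text rather than left empty, so a question that was considered and answered "no" can be told apart from one that was never addressed.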