Authors:
(1) TIMNIT GEBRU, Black in AI;
(2) JAMIE MORGENSTERN, University of Washington;
(3) BRIANA VECCHIONE, Cornell University;
(4) JENNIFER WORTMAN VAUGHAN, Microsoft Research;
(5) HANNA WALLACH, Microsoft Research;
(6) HAL DAUMÉ III, Microsoft Research; University of Maryland;
(7) KATE CRAWFORD, Microsoft Research.

Table of Links

1 Introduction
1.1 Objectives
2 Development Process
3 Questions and Workflow
3.1 Motivation
3.2 Composition
3.3 Collection Process
3.4 Preprocessing/cleaning/labeling
3.5 Uses
3.6 Distribution
3.7 Maintenance
4 Impact and Challenges
Acknowledgments and References
Appendix

4 Impact and Challenges

Since circulating an initial draft of this paper in March 2018, datasheets for datasets have already gained traction in a number of settings. Academic researchers have adopted our proposal and released datasets with accompanying datasheets [e.g., 7, 10, 23, 26]. Microsoft, Google, and IBM have begun to pilot datasheets for datasets internally within product teams. Researchers at Google published follow-up work on model cards that document machine learning models [20] and released a data card (a lightweight version of a datasheet) along with the Open Images dataset [17]. Researchers at IBM proposed factsheets [14] that document various characteristics of AI services, including whether the datasets used to develop the services are accompanied with datasheets. The Data Nutrition Project incorporated some of the questions provided in the previous section into the latest release of their Dataset Nutrition Label [9]. Finally, the Partnership on AI, a multi-stakeholder organization focused on studying and formulating best practices for developing and deploying AI technologies, is working on industry-wide documentation guidance that builds on datasheets for datasets, model cards, and factsheets.[3]

These initial successes have also revealed implementation challenges that may need to be addressed to support wider adoption. Chief among them is the need for dataset creators to modify the questions and workflow provided in the previous section based on their existing organizational infrastructure and workflows. We also note that the questions and workflow may pose problems for dynamic datasets.
If a dataset changes only infrequently, we recommend accompanying updated versions with updated datasheets.

Datasheets for datasets do not provide a complete solution to mitigating unwanted societal biases or potential risks or harms. Dataset creators cannot anticipate every possible use of a dataset, and identifying unwanted societal biases often requires additional labels indicating demographic information about individuals, which may not be available to dataset creators for reasons including those individuals’ data protection and privacy [15].

When creating datasets that relate to people, and hence their accompanying datasheets, it may be necessary for dataset creators to work with experts in other domains such as anthropology, sociology, and science and technology studies. There are complex and contextual social, historical, and geographical factors that influence how best to collect data from individuals in a manner that is respectful.

Finally, creating datasheets for datasets will necessarily impose overhead on dataset creators. Although datasheets may reduce the amount of time that dataset creators spend answering one-off questions about datasets, the process of creating a datasheet will always take time, and organizational infrastructure and workflows, not to mention incentives, will need to be modified to accommodate this investment.

Despite these implementation challenges, there are many benefits to creating datasheets for datasets. In addition to facilitating better communication between dataset creators and dataset consumers, datasheets provide an opportunity for dataset creators to distinguish themselves as prioritizing transparency and accountability. Ultimately, we believe that the benefits to the machine learning community outweigh the costs.

This paper is available on arXiv under CC 4.0 license.
[3] https://www.partnershiponai.org/about-ml/


Datasheets for Datasets: Impact and Adoption Across Academic and Industry Sectors
