Authors:
(1) TIMNIT GEBRU, Black in AI;
(2) JAMIE MORGENSTERN, University of Washington;
(3) BRIANA VECCHIONE, Cornell University;
(4) JENNIFER WORTMAN VAUGHAN, Microsoft Research;
(5) HANNA WALLACH, Microsoft Research;
(6) HAL DAUMÉ III, Microsoft Research; University of Maryland;
(7) KATE CRAWFORD, Microsoft Research.

Table of Links

1 Introduction
1.1 Objectives
2 Development Process
3 Questions and Workflow
3.1 Motivation
3.2 Composition
3.3 Collection Process
3.4 Preprocessing/cleaning/labeling
3.5 Uses
3.6 Distribution
3.7 Maintenance
4 Impact and Challenges
Acknowledgments and References
Appendix

3.3 Collection Process

As with the questions in the previous section, dataset creators should read through these questions prior to any data collection to flag potential issues and then provide answers once collection is complete. In addition to the goals outlined in the previous section, the questions in this section are designed to elicit information that may help researchers and practitioners to create alternative datasets with similar characteristics. Again, questions that apply only to datasets that relate to people are grouped together at the end of the section.

• How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part-of-speech tags, model-based guesses for age or language)? If the data was reported by subjects or indirectly inferred/derived from other data, was the data validated/verified? If so, please describe how.

• What mechanisms or procedures were used to collect the data (e.g., hardware apparatuses or sensors, manual human curation, software programs, software APIs)? How were these mechanisms or procedures validated?

• If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)?

• Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)?

• Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created.

• Were any ethical review processes conducted (e.g., by an institutional review board)? If so, please provide a description of these review processes, including the outcomes, as well as a link or other access point to any supporting documentation.

If the dataset does not relate to people, you may skip the remaining questions in this section.

• Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (e.g., websites)?

• Were the individuals in question notified about the data collection? If so, please describe (or show with screenshots or other information) how notice was provided, and provide a link or other access point to, or otherwise reproduce, the exact language of the notification itself.

• Did the individuals in question consent to the collection and use of their data? If so, please describe (or show with screenshots or other information) how consent was requested and provided, and provide a link or other access point to, or otherwise reproduce, the exact language to which the individuals consented.

• If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses? If so, please provide a description, as well as a link or other access point to the mechanism (if appropriate).

• Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted? If so, please provide a description of this analysis, including the outcomes, as well as a link or other access point to any supporting documentation.

• Any other comments?

This paper is available on arXiv under CC 4.0 license.
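
As an illustration of the sampling-strategy question in this section, the following is a minimal sketch of one way a dataset creator might draw a probabilistic sample and capture the details needed to answer that question later. It is not part of the paper: the `SamplingRecord` class, its field names, and the `bernoulli_sample` helper are hypothetical, and independent (Bernoulli) sampling with a fixed seed is only one of many possible strategies.

```python
import random
from dataclasses import dataclass, asdict


@dataclass
class SamplingRecord:
    """Hypothetical record of how a sample was drawn, kept for the datasheet."""
    strategy: str                 # e.g., "deterministic" or "probabilistic"
    population_size: int          # size of the larger set
    sample_size: int              # number of instances actually kept
    inclusion_probability: float  # per-instance probability of being kept
    seed: int                     # seed that makes the draw reproducible


def bernoulli_sample(items, p, seed):
    """Keep each item independently with probability p, using a fixed seed."""
    rng = random.Random(seed)
    kept = [item for item in items if rng.random() < p]
    record = SamplingRecord(
        strategy="probabilistic",
        population_size=len(items),
        sample_size=len(kept),
        inclusion_probability=p,
        seed=seed,
    )
    return kept, record


# Illustrative population: integer IDs standing in for instances.
population = list(range(1000))
sample, record = bernoulli_sample(population, p=0.1, seed=0)
print(asdict(record))
```

Storing the seed and the inclusion probability alongside the sample means the answer to the sampling question can be copied from the record rather than reconstructed from memory after collection is complete.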


Data Collection for ML Models: Strategies and Protocols for Ensuring Dataset Integrity
