Data and its (dis)contents: A survey of dataset development and use in machine learning research

Amandalynne Paullada; Inioluwa Deborah Raji; Emily M Bender; Emily Denton; Alex Hanna

doi:10.1016/j.patter.2021.100336

Patterns (N Y). 2021 Nov 12;2(11):100336. doi: 10.1016/j.patter.2021.100336.

Authors

Amandalynne Paullada¹, Inioluwa Deborah Raji², Emily M Bender¹, Emily Denton³, Alex Hanna³,⁴

Affiliations

¹ Department of Linguistics, University of Washington, Seattle, WA, USA.
² Mozilla Foundation, Mountain View, CA, USA.
³ Google Research, New York, NY, USA.
⁴ Google Research, San Francisco, CA, USA.

Abstract

In this work, we survey a breadth of literature that has revealed the limitations of predominant practices for dataset collection and use in the field of machine learning. We cover studies that critically review the design and development of datasets with a focus on negative societal impacts and poor outcomes for system performance. We also cover approaches to filtering and augmenting data and modeling techniques aimed at mitigating the impact of bias in datasets. Finally, we discuss works that have studied data practices, cultures, and disciplinary norms and discuss implications for the legal, ethical, and functional challenges the field continues to face. Based on these findings, we advocate for the use of both qualitative and quantitative approaches to more carefully document and analyze datasets during the creation and usage phases.

Keywords: datasets; machine learning.

Publication types

Review

Grants and funding

T15 LM007442/LM/NLM NIH HHS/United States