This paper is available on arXiv under a CC 4.0 license.
Authors:
(1) Muhammed Yusuf Kocyigit, Boston University;
(2) Anietie Andy, University of Pennsylvania;
(3) Derry Wijaya, Boston University.
Language and context are not simple, especially given that we work with a summarization of n-grams rather than full sentences; our results and analysis must be evaluated with this in mind. Moreover, the identification of bias can itself be inadvertently affected by the researchers' own biases, despite our best efforts to be rigorous in our methodology. A further risk lies in over-simplifying or misrepresenting complex socio-cultural dynamics. We have worked to minimize these risks through methodological transparency, iterative analyses, and peer review.
The study inherently involves discussing racially biased language, which could cause distress to some readers, particularly those from communities affected by such biases. Our presentation aims to be sensitive, focusing on systemic patterns rather than individuals or specific works.
Lastly, there is a potential risk of misuse of our findings. The purpose of our study is to raise awareness and prompt action toward eliminating bias, but we are aware that the same information could be used to support biased arguments or to perpetuate harmful stereotypes. We also note that the content of books is not a representation of reality but of the authors' perceptions, which can themselves be a source of bias.
On the positive side, our research contributes to a better understanding of systemic racial biases in historical and cultural contexts. This deeper awareness can help us understand what we perceive as normal, and what the next generation will perceive as normal, putting an external reality check on the culture we inherit without knowing it.
As part of our ethical commitment, we will make our data and code public to invite critique, replication, and improvement. Our research is not an end but a beginning, providing a stepping stone for further inquiries into racial bias.