A Novel Method for Analysing Racial Bias: Collection of Person Level References: Data
by @escholar

In this study, researchers propose a novel method to analyze representations of African Americans and White Americans in books from 1850 to 2000.

This paper is available on arXiv under a CC 4.0 license.


(1) Muhammed Yusuf Kocyigit, Boston University;

(2) Anietie Andy, University of Pennsylvania;

(3) Derry Wijaya, Boston University.


We collect names of significant figures from Wikidata and use them to extract the relevant context n-grams from the Google Books dataset. We then use the 732 semantic axes presented by An, Kwak, and Ahn (2018) to identify the axes along which the two racial groups (African Americans and White Americans) diverge maximally over time. Finally, we use the time-adjusted Hurtlex (Bassignana, Basile, and Patti 2018) toxic word list to measure the toxic context per racial group over time.

Wikidata and Extracted Names

Manually collecting names at a large enough scale would require a considerable amount of time. Instead, we searched Wikidata for people who have lived in, been citizens of, or were born in the United States. The details can be found in Figure 8 in the Appendix. This query returns around 22K people, whose ethnicity distribution is shown in Figure 4. We do not apply any explicit filters to this set, so that we keep the broadest possible set of people for this work. When we extract the n-grams surrounding mentions of these names in the Google Books dataset, only around 3K of the names return at least one n-gram. We search for complete name matches (first name + last name) and remove the returned n-grams that are dated before the birth year of the matching person[3].
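The name matching and birth-year filter described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the record layout (tokens, year, count), the `people` mapping from full name to birth year, and the function names are all hypothetical. The 10-year offset follows footnote [3], and matching is assumed to be exact and case-sensitive.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: (tokens of a 5-gram, publication year, match count).
Ngram = Tuple[Tuple[str, ...], int, int]

# Footnote [3]: a person is assumed to be at least 10 years old before being written about.
MIN_AGE = 10


def contains_full_name(tokens: Tuple[str, ...], name: Tuple[str, ...]) -> bool:
    """Return True if the name tokens appear contiguously, in order, in the n-gram."""
    n = len(name)
    return any(tokens[i:i + n] == name for i in range(len(tokens) - n + 1))


def filter_ngrams(ngrams: List[Ngram], people: Dict[str, int]) -> Dict[str, List[Ngram]]:
    """Keep n-grams that contain a full name and are dated at or after birth year + MIN_AGE."""
    matches: Dict[str, List[Ngram]] = {name: [] for name in people}
    for tokens, year, count in ngrams:
        for name, birth_year in people.items():
            if year < birth_year + MIN_AGE:
                continue  # too early to plausibly refer to this person
            if contains_full_name(tokens, tuple(name.split())):
                matches[name].append((tokens, year, count))
    return matches
```

Names that end up with an empty list correspond to the people who return no n-gram at all, which is how the set shrinks from the full query result to the smaller matched set.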

Figure 3: The context length, if too different, can cause bias in the learned embeddings of the names because of how the Google Books dataset is structured. If a name consists of two names the analysis would include context words that are two words

Figure 4: The ethnicity distribution of the names that are extracted from Wikipedia and the distribution of unique people that return any n-gram matches after the n-gram extraction.

Google Ngram Filtering

We work with the 20200217 version of the Google Books Ngram data (Goldberg and Orwant 2013), specifically the 5-gram American English subset. In total, we analyze around 140 million 5-grams across 15 decades. The smallest sample comes from the earliest decade, 1850-1860, with around 100K 5-grams, and the largest comes from the latest decade, 1990-2000, with approximately 50M 5-grams. We upscale each n-gram by its count given in the Ngram data and sample the n-grams for each group to a fixed sample size to learn its embedding, as described in the Word Representation Learning section.
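One way to picture the upscale-then-sample step is count-weighted sampling. The sketch below is an assumption-laden illustration rather than the authors' procedure: the function name `sample_upscaled` and the `(text, count)` record layout are hypothetical, and it samples with replacement, which the text does not specify. Drawing with weights proportional to the counts is equivalent to repeating each n-gram `count` times and drawing uniformly with replacement, without materializing the repeated list.

```python
import random
from typing import List, Sequence, Tuple


def sample_upscaled(
    ngrams: Sequence[Tuple[str, int]], sample_size: int, seed: int = 0
) -> List[str]:
    """Draw `sample_size` n-grams with probability proportional to their counts.

    Sampling is done with replacement (an assumption, not stated in the text).
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    texts = [text for text, _ in ngrams]
    counts = [count for _, count in ngrams]
    return rng.choices(texts, weights=counts, k=sample_size)
```

Using the same `sample_size` for both groups in a decade keeps the amount of training text equal, so that differences in the learned embeddings are less likely to reflect corpus size alone.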

Figure 3 presents the average number of context words extracted per person per decade and the average context length per group. We observe that the number of context words for White Americans is significantly larger than for African Americans in every decade. Thus, while mentions of African American figures in books have increased over time, they remain fewer than mentions of White Americans. We also observe no significant difference in the mean context length (i.e., window size) of the two groups that could introduce an unwanted difference in the semantic meaning captured by their learned embeddings.
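The context-length check can be illustrated with a small sketch. In a 5-gram, a two-token name leaves three context words, while a three-token name leaves only two, which is why the mean context length per group is worth comparing. The record layout, the function names, and the choice to weight by n-gram count are assumptions made for this illustration; the records are assumed to have already been matched to a name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple

# Hypothetical record layout: (n-gram tokens, name tokens, year, count).
Record = Tuple[Sequence[str], Sequence[str], int, int]


def decade_of(year: int) -> int:
    """Map a year to the start of its decade, e.g. 1855 -> 1850."""
    return (year // 10) * 10


def context_words(tokens: Sequence[str], name: Sequence[str]) -> List[str]:
    """Return the tokens left after removing the first contiguous name occurrence."""
    n = len(name)
    for i in range(len(tokens) - n + 1):
        if list(tokens[i:i + n]) == list(name):
            return list(tokens[:i]) + list(tokens[i + n:])
    return []  # name not found; callers are expected to pass matched n-grams


def mean_context_length(records: Iterable[Record]) -> Dict[int, float]:
    """Count-weighted mean number of context words per decade."""
    totals: Dict[int, int] = defaultdict(int)
    weights: Dict[int, int] = defaultdict(int)
    for tokens, name, year, count in records:
        if count <= 0:
            continue  # skip degenerate records to avoid dividing by zero
        decade = decade_of(year)
        totals[decade] += len(context_words(tokens, name)) * count
        weights[decade] += count
    return {decade: totals[decade] / weights[decade] for decade in totals}
```

Running `mean_context_length` separately on the records of each group gives two per-decade series that can be compared directly.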


To analyze the toxicity towards each group, we use the toxic word list provided by Bassignana, Basile, and Patti (2018). We use v1.2 English at the "conservative" level as defined by Hurtlex, which contains 3360 toxic words. We treat this list as a dictionary and count how often words in the context of African Americans or White Americans appear in it. However, long-term studies must account for semantic shift, as words can become more or less toxic over time. We tackle this problem with a hybrid method that combines the toxic word dictionary with the semantic axes provided by An, Kwak, and Ahn (2018). We provide further detail on how the time-adjusted toxic words are calculated in the Toxicity subsection under Analysis and Results.
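The dictionary-based count can be sketched as below. This is only the plain lookup step: the time adjustment is not covered here, and the function name `toxic_rate`, the lowercasing, and the normalization by the total number of context tokens are assumptions made for illustration. The lexicon is a placeholder set of lowercase words rather than the actual Hurtlex list.

```python
from collections import Counter
from typing import Iterable, Set


def toxic_rate(context_tokens: Iterable[str], lexicon: Set[str]) -> float:
    """Fraction of context tokens found in the lexicon (case-insensitive).

    `lexicon` is assumed to contain lowercase words only.
    """
    counts = Counter(token.lower() for token in context_tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no context tokens, so no toxic context either
    toxic = sum(c for word, c in counts.items() if word in lexicon)
    return toxic / total
```

Computing this rate per group and per decade gives two series that can be compared over time, for example `toxic_rate(tokens_for_group, lexicon)` on the pooled context tokens of one group in one decade.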

[3] We assume that a person must be at least 10 years old before they can be written about in books.