This paper is available on arXiv under a CC 4.0 license.
Authors:
(1) Muhammed Yusuf Kocyigit, Boston University;
(2) Anietie Andy, University of Pennsylvania;
(3) Derry Wijaya, Boston University.
The representational power of word embeddings, and the extent to which they can model our beliefs, is an active point of discussion. This line of research is critical because it clarifies the conditions under which word representations generate good abstractions for analysis. The critiques can be summarized in two main lines: structural problems and statistical problems.
Structural problems (Arseniev-Koehler 2021) relate to the Distributional Hypothesis (Harris 1954) on which current word embeddings are based, and which structuralism criticizes in multiple respects. Issues arise primarily when the phrase "a word is known by the company it keeps" is interpreted as "a word is the company it keeps." The representations of words or groups that we present do not imply that the underlying content describes those groups exactly as represented; rather, words are more closely linked to certain concepts when they lie closer to them in the semantic space. The second important issue we must be aware of is that concepts are not always dichotomous or binary; they can be graded or related in multiple ways. We, however, in line with An, Kwak, and Ahn (2018), assume a dichotomous nature for a set of adjectives.
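The dichotomous-adjective assumption above can be made concrete with a SemAxis-style projection in the spirit of An, Kwak, and Ahn (2018): an axis is built from two opposing adjective sets and a word is scored by its cosine similarity to that axis. The sketch below uses toy 3-dimensional vectors purely for illustration; the vectors and adjective labels are hypothetical, not values from the paper.

```python
import numpy as np

def semaxis_score(word_vec, pos_vecs, neg_vecs):
    """Score a word along an axis defined by two opposing adjective
    sets: the axis is the difference of the set centroids, and the
    score is the cosine similarity of the word to that axis."""
    axis = np.mean(pos_vecs, axis=0) - np.mean(neg_vecs, axis=0)
    return float(np.dot(word_vec, axis) /
                 (np.linalg.norm(word_vec) * np.linalg.norm(axis)))

# Toy 3-d vectors purely for illustration (hypothetical adjectives).
pos = np.array([[1.0, 0.2, 0.0], [0.9, 0.1, 0.1]])    # e.g. "strong", "powerful"
neg = np.array([[-1.0, 0.1, 0.0], [-0.8, 0.2, 0.1]])  # e.g. "weak", "powerless"
word = np.array([0.7, 0.3, 0.0])
print(round(semaxis_score(word, pos, neg), 3))  # 0.919
```

A score near +1 places the word at the positive pole of the axis, near -1 at the negative pole; graded concepts would fall in between, which is exactly the nuance the dichotomous assumption sets aside.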
The statistical problems mainly stem from the frequency of different words/concepts. Valentini, Slezak, and Altszyler (2022) show that word representations tend to yield higher similarity scores between high-frequency words than between high- and low-frequency words. Joseph and Morgan (2020) show that word representations produce similarity measures that better match human beliefs for higher-frequency words. Loon et al. (2022) argue that word embedding methods like word2vec cluster high-frequency words together and low-frequency words together. They argue that people have an intuitive positivity bias, so positive words are more frequent; if a racial/social group (Black names, in their case) is represented less frequently in the corpus than the control group (white names), its names are simply more likely to be related to negative words, and this relation would not necessarily indicate bias in the corpus.
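The clustering artifact described by Loon et al. (2022) can be illustrated with a small diagnostic: compare the mean cosine similarity among high-frequency words with the mean similarity between high- and low-frequency words. The sketch below uses synthetic vectors constructed to mimic the reported effect (high-frequency words clustered around a shared direction); it is an illustration of the diagnostic, not a reproduction of any cited result.

```python
import numpy as np

def mean_cosine(pairs, vecs):
    """Mean cosine similarity over a list of (word, word) pairs."""
    sims = [np.dot(vecs[a], vecs[b]) /
            (np.linalg.norm(vecs[a]) * np.linalg.norm(vecs[b]))
            for a, b in pairs]
    return float(np.mean(sims))

# Synthetic vectors mimicking the artifact: "high-frequency" words
# cluster around a shared hub direction, "low-frequency" words do not.
rng = np.random.default_rng(0)
hub = rng.normal(size=50)
vecs = {f"hi{i}": hub + 0.1 * rng.normal(size=50) for i in range(5)}
vecs.update({f"lo{i}": rng.normal(size=50) for i in range(5)})

hi_hi = [(f"hi{i}", f"hi{j}") for i in range(5) for j in range(i + 1, 5)]
hi_lo = [(f"hi{i}", f"lo{j}") for i in range(5) for j in range(5)]
print(mean_cosine(hi_hi, vecs) > mean_cosine(hi_lo, vecs))  # True for this synthetic setup
```

If this gap appears in real embeddings, a high similarity between a group's names and negative words may reflect frequency rather than corpus bias, which is the confound the next paragraph addresses.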
Structural problems concern how we analyze the results and draw inferences from them. In this paper, however, we target statistical problems on two levels. First, we examine the distribution of context words in the primary data on which the word vectors were trained and observe that the context-word distributions for these groups are not significantly different. Second, to control for the frequency of names, we sample equal numbers of (word, context) pairs for both African Americans and White Americans and train the embeddings for these groups from scratch, rather than using a pre-trained embedding that would depend on the frequency of names in the original corpus.
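The frequency-balancing step above can be sketched as downsampling each group's (word, context) pairs to the size of the smallest group before any training takes place. This is a minimal sketch of the balancing idea only, with hypothetical group names and pair counts; it does not reproduce the paper's actual pipeline or training code.

```python
import random

def balance_pairs(pairs_by_group, seed=0):
    """Downsample (word, context) pairs so every group contributes
    the same number of training pairs before embeddings are trained
    from scratch (the balancing step only, not the full pipeline)."""
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    n = min(len(p) for p in pairs_by_group.values())
    return {g: rng.sample(p, n) for g, p in pairs_by_group.items()}

# Hypothetical groups with unequal numbers of extracted pairs.
pairs = {
    "group_a": [("name_a", f"ctx{i}") for i in range(1000)],
    "group_b": [("name_b", f"ctx{i}") for i in range(400)],
}
balanced = balance_pairs(pairs)
print({g: len(p) for g, p in balanced.items()})  # {'group_a': 400, 'group_b': 400}
```

Training on the balanced pairs removes the raw name-frequency difference between groups, so any remaining association differences cannot be attributed to one group's names simply appearing more often.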
One specific limitation that we had to work around in this project was the issue of false positive matches. Even though we search for exact name+surname combinations in books, the resulting n-grams are not necessarily always about the person we intend. While a person with the same name+surname combination likely belongs to the same racial group in our analysis, this may not hold for other use cases, making it a relevant limitation to consider.
Another relevant limitation is data size. We replicated our results with manually curated name subsets; however, a smaller subset of n-grams was not a viable option. We also observed that decades before the 1850s return too few samples, so the results for those decades are not meaningful. How many n-grams would be enough for a new application remains an open question that we were not able to answer; addressing it would also extend the scope of this paper beyond a reasonable limit.