This paper is available on arXiv under a CC 4.0 license.
Authors:
(1) Muhammed Yusuf Kocyigit, Boston University;
(2) Anietie Andy, University of Pennsylvania;
(3) Derry Wijaya, Boston University.
Isolating the context is the first step of our method. As mentioned, we collect every person who lived in, was a citizen of, or was born in the United States and use their complete names to identify the n-grams that contain those names. In this stage we scan the 140M 5-grams for each of the 20K names we have collected. The experiments are run on CPUs in parallel and the results are aggregated at a later stage. This is the most computationally expensive step; the training and analysis steps are less compute intensive, since they operate on a smaller subspace and only learn two vectors.
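To make the context-isolation step concrete, the sketch below shows one way to scan sharded 5-gram files in parallel on CPU workers and keep only the n-grams that contain a collected full name. The file layout, paths, and helper names are illustrative assumptions, not the authors' exact pipeline.

```python
# Hypothetical sketch of the context-isolation step: the 5-grams are assumed
# to be stored as sharded plain-text files (one n-gram per line) and the
# collected names in a simple text file. All paths are placeholders.
from multiprocessing import Pool
from pathlib import Path

def load_names(path="names.txt"):
    """Load the ~20K collected full names, lowercased for matching."""
    with open(path) as f:
        return [line.strip().lower() for line in f if line.strip()]

def scan_shard(args):
    """Return every 5-gram in one shard that contains a collected name."""
    shard_path, names = args
    hits = []
    with open(shard_path) as f:
        for line in f:
            ngram = line.strip().lower()
            if any(name in ngram for name in names):
                hits.append(ngram)
    return hits

if __name__ == "__main__":
    names = load_names()
    shards = sorted(Path("5grams/").glob("*.txt"))
    # Each CPU worker scans one shard; shard results are aggregated afterwards.
    with Pool() as pool:
        per_shard = pool.map(scan_shard, [(s, names) for s in shards])
    matched = [ngram for hits in per_shard for ngram in hits]
```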
The n-grams in the Google Books dataset are created by collecting repeated n-grams from many books, so for a sentence to make it into the dataset it has to appear in more than one book. The dataset also contains n-grams in which words are replaced with their part-of-speech tags. We ignore all part-of-speech tags in the filtered n-grams, and we also remove numbers and stop words since they do not contribute meaningfully to the final embedding. After extracting the context surrounding these names, we aim to analyze the representation of individuals within that context.
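A possible filtering step is sketched below: it drops standalone part-of-speech placeholders, strips POS suffixes from tagged tokens, and removes numbers and stop words. The exact tag handling and the NLTK stop-word list are assumptions for illustration.

```python
# Hedged sketch of the n-gram cleaning step. In the Google Books corpus,
# tagged tokens look like "run_VERB" and placeholder tags like "_NOUN_"
# stand alone; how the paper handles each case is assumed here.
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))  # assumed stop-word list

def clean_ngram(ngram: str) -> list[str]:
    """Keep only content words: no POS tags, numbers, or stop words."""
    kept = []
    for token in ngram.split():
        # Standalone POS placeholders such as "_NOUN_" carry no lexical content.
        if token.startswith("_") and token.endswith("_"):
            continue
        # Tagged tokens such as "run_VERB": keep the word, drop the tag.
        word = token.split("_")[0].lower()
        if not word or word.isdigit() or word in STOPWORDS:
            continue
        kept.append(word)
    return kept

# Example: clean_ngram("the President_NOUN spoke in 1995") -> ["president", "spoke"]
```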
The use of public figures to represent collective bias can be a challenge. Since public figures generally belong to a very small socioeconomic minority within any large social group, this raises the question of how reliable representations obtained from public figures are. However, the main dataset used in this work is the Google Books dataset, and books are statistically more likely to talk about public figures. Since we are trying to analyze representation in books, using public figures can be postulated to be the best option.
In previous studies, researchers trained word representations for each referring word, used these for their analysis, and then aggregated results at the final stage. In this paper, we adopt a different approach by consolidating each group into a single entity and conducting all analyses on that entity. To achieve this, we replace all names in the text with specialized group tokens (e.g., GRP A and GRP B) and learn representations for these group tokens. We train these group embeddings using a contrastive loss, which samples positive words from the context of that group and negative words from the context of the other group.
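The consolidation step can be sketched as a simple substitution pass over the filtered n-grams. The token strings ("GRP_A", "GRP_B") and the longest-name-first replacement order are illustrative choices, not necessarily the authors' implementation.

```python
# Minimal sketch of consolidating individuals into group tokens, so that a
# single embedding is learned per group rather than per person.
import re

def replace_names(text: str, group_a: list[str], group_b: list[str]) -> str:
    """Replace every occurrence of a group member's full name with that
    group's token. Longer names are replaced first to avoid partial matches."""
    for names, token in ((group_a, "GRP_A"), (group_b, "GRP_B")):
        for name in sorted(names, key=len, reverse=True):
            text = re.sub(re.escape(name), token, text, flags=re.IGNORECASE)
    return text

# Example (names are hypothetical):
# replace_names("barack obama met some senator", ["Barack Obama"], ["Some Senator"])
# -> "GRP_A met GRP_B"
```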
As discussed in previous literature (Akyürek et al. 2022b,a), bias can arise not only in the presence of negative context but also from the lack of positive context. To capture this dynamic, we learn group representations with a contrastive loss.
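A sketch of such a contrastive objective is shown below in PyTorch: each group vector is pulled toward words sampled from its own context and pushed away from words sampled from the other group's context. The use of a fixed pretrained word-embedding table for the context words, the sigmoid-based loss, and the optimizer settings are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch: learn two group vectors (GRP_A, GRP_B) with a contrastive
# loss over positive words (own context) and negative words (other group's
# context). Context-word vectors are assumed to come from a fixed embedding
# table and are passed in as tensors of shape (num_words, emb_dim).
import torch
import torch.nn.functional as F

emb_dim = 300
group_vecs = torch.nn.Parameter(torch.randn(2, emb_dim) * 0.01)  # rows: GRP_A, GRP_B
optimizer = torch.optim.Adam([group_vecs], lr=1e-3)

def contrastive_step(group_id: int,
                     pos_word_vecs: torch.Tensor,
                     neg_word_vecs: torch.Tensor) -> float:
    """One update: raise similarity to the group's own context words and
    lower similarity to words drawn from the other group's context."""
    g = group_vecs[group_id]                       # (emb_dim,)
    pos_scores = pos_word_vecs @ g                 # (num_pos,)
    neg_scores = neg_word_vecs @ g                 # (num_neg,)
    loss = -(F.logsigmoid(pos_scores).mean()
             + F.logsigmoid(-neg_scores).mean())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

This mirrors a skip-gram-with-negative-sampling style objective, except that negatives are drawn from the other group's context rather than from a unigram noise distribution, which is what lets a missing positive context (and not only an explicitly negative one) shift the learned group representation.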
We use the learned representations for the group tokens in three main types of analysis.