A Large-Scale Analysis of Inclusiveness-Related User Feedback: Discussion

Written by feedbackloop | Published 2024/01/10
Tech Story Tags: inclusive-software | user-feedback | inclusiveness | inclusive-design | user-experience | deep-learning | universal-access | software-accessibility

TL;DR: Dive into a comprehensive study on inclusiveness in popular software apps, covering algorithmic bias, cultural impact, and practical implications for developers. Uncover the challenges users face, gain insights for designing more inclusive software, and understand the role of AI in shaping user experience.

Authors:

(1) Nowshin Nawar Arony;

(2) Ze Shi Li;

(3) Bowen Xu;

(4) Daniela Damian.

Table of Links

Abstract & Introduction

Motivation

Related Work

Methodology

A Taxonomy of Inclusiveness

Inclusiveness Concerns in Different Types of Apps

Inclusiveness Across Different Sources of User Feedback

Automated Identification of Inclusiveness User Feedback

Discussion

Conclusion & References

9 DISCUSSION

In this study, we propose a taxonomy of inclusiveness-related user feedback based on an in-depth analysis of user feedback on fifty popular software apps, using methods from socio-technical grounded theory [9]. Our approach involved labelling over 23,000 user feedback posts across three popular sources of user feedback: Reddit, Google Play Store, and Twitter. The analysis identified 1,211 user feedback posts related to inclusiveness. Having developed a labelled set, we also conducted an empirical investigation of a set of popular pre-trained large language models to classify user feedback from the three sources as inclusiveness-related or not. We achieved F1-scores of 0.838, 0.849, and 0.938 for Reddit, Google Play Store, and Twitter, respectively. These promising scores indicate that the models are effective at identifying inclusiveness-related user feedback.
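The evaluation metric used above can be illustrated with a minimal sketch: for the binary inclusiveness task, F1 is the harmonic mean of precision and recall over the positive class. The gold labels and model predictions below are purely hypothetical placeholders, not data from the study.

```python
def f1_binary(gold, pred):
    """F1 for the positive (inclusiveness) class: harmonic mean of precision and recall."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical gold labels and model predictions for a small batch of posts
gold = [1, 0, 1, 1, 0, 0, 1, 0]
pred = [1, 0, 1, 0, 0, 1, 1, 0]
print(f1_binary(gold, pred))  # → 0.75
```

In practice a per-source score (one each for Reddit, Google Play Store, and Twitter) is computed the same way, over each source's held-out labelled posts.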

9.1 Comparison with Related Work

Previous work on human aspects provided a preliminary introduction to inclusiveness based on an analysis of GitHub and Google Play Store data [7] from both the developer and user perspectives. Our study, in contrast, focuses specifically on inclusiveness from the end-user perspective. Furthermore, the prior work examined open-source software, which may not reflect the entirety of end users or software apps: only a fraction of users use open-source software, and even fewer provide user feedback. The prior research reported only 31 posts from Google Play Store related to inclusiveness, whereas our work identified a far larger number (1,211), indicating a greater representation of inclusiveness. Additionally, the vast majority of apps, especially the popular ones, are not open source, spanning social media, entertainment, and business. Thus, an understanding of inclusiveness issues from a larger user base was required. In our study, we focused on popular for-profit apps used by millions of users across the world to obtain a more diverse representation of users.

In the work by Khalajzadeh et al. [7], the authors included five sub-categories under inclusiveness: compatibility, location, language, accessibility, and others. In our study, we identified the presence of these sub-categories; however, we place them under different higher-level categories to give the taxonomy a clearer structure.

The compatibility category, for instance, aligns closely with our technology category. Their category primarily concerns compatibility across different devices and platforms, considering socio-economic factors, whereas we focus on concerns related to users encountering restrictions set by developers. We include socio-economic concerns under demography in our taxonomy, as the feedback is more in line with dimensions of demographics. Regarding the location and language sub-categories, we chose to reclassify them within the demography category, as the two were not as dominant in our dataset as in the previous work; amalgamating them into the demography category achieved more clarity and better structure. Furthermore, we encountered only a small amount of user feedback explicitly emphasizing accessibility, but observed a broader distribution of posts focusing on visual and audio usability. Accessibility posts were therefore placed under the two types of usability, ensuring a more structured distribution of the user feedback related to accessibility concerns.

9.2 Algorithmic Bias in software apps: A Barrier to Inclusiveness

Our manual analysis allowed us to uncover various inclusiveness issues that users face while using software. One of the key problems that emerged throughout the analysis was algorithmic bias. With the rapid evolution of AI, companies are increasingly inclined to integrate algorithms into their decision-making processes. As a result, many features are automated and often exhibit biases towards certain users [40]. Algorithmic bias predominantly originates from underrepresented data and biased methods [41]. These biases create both a perceived and a real non-inclusive experience for users.

In Section 5, when discussing the fairness category, we described how automated decision-making leads to the exclusion of users from software or from specific features within it. For example, Facebook relies heavily on algorithms for its content review process: AI decides whether content is allowed on the platform based on the Community Standards [42]. Our analysis revealed many users complaining about being banned from the platform for policy violations without warning or prior notification. In these cases, users have no idea what action even triggered the ban or account restriction. Although the company asserts that users can appeal if they believe their content aligns with the community standards, our observations in the fairness category reveal a contrasting reality.

(Reddit) - “I wanted to get started with Facebook ads and was really motivated about it. Only to find out that somehow my account on Meta Business was disabled, I tried to appeal but it somehow got rejected. I ended up purchasing another domain in order to create another account but during the creation of the facebook account (even though all the information was different compared to my main account), it instantly got disabled and told me to appeal. I feel like the situation is going to repeat itself. I’ve tried hard to find some support, but after hours of searching I didn’t find anything.” (Facebook)

Our fairness category has numerous similar examples where users experience a lack of support and must rely on automated algorithms to make further decisions. As in this example, many software organizations incorporate AI to make decisions, which creates frustration and a feeling of exclusion among users. We find similar patterns manifesting in other inclusiveness categories as well, such as other human values, where users report frustration with apps enforcing their beliefs on users. In an ideal scenario, recommender systems would learn from user preferences and tailor recommendations accordingly. However, we observe the opposite: users receive engagements and recommendations that deviate from what they actually anticipate or prefer.

In recent years, numerous instances of biased AI systems have come to light. Famous incidents include the COMPAS recidivism algorithm [43], which had a significantly greater likelihood of incorrectly judging black defendants than white defendants, whereby black defendants were more likely to be flagged as high risk. Recently, Meta, formerly known as Facebook, agreed to a settlement after it was revealed that it had implemented features in its advertising to exclude specific groups of people [44].

Stemming from these examples, reducing bias in machine learning systems has become a substantial subject area [45]. Several studies explore reducing bias in AI systems, particularly those that conduct automated decision-making [46], [47]. These studies investigate how to minimize algorithmic bias at the data collection, model training, and testing levels. However, as our study indicates, organizations should avoid relying entirely on AI for decision-making, whether for adhering to community standards or for generating recommendations.
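One simple audit step that this line of work suggests is checking whether an automated decision (e.g., allowing content or an account) is granted at similar rates across user groups. The sketch below computes a demographic parity gap; the decisions and group labels are hypothetical illustrations, not data from the study or any real platform.

```python
def demographic_parity_gap(decisions, groups):
    """Absolute difference between the highest and lowest positive-decision
    rates across groups; 0.0 means all groups are treated at the same rate."""
    rates = {}
    for grp in set(groups):
        members = [d for d, g in zip(decisions, groups) if g == grp]
        rates[grp] = sum(members) / len(members)
    vals = sorted(rates.values())
    return vals[-1] - vals[0]

# Hypothetical moderation decisions (1 = content allowed) for users in two groups
decisions = [1, 1, 0, 1, 0, 0, 1, 0]
groups    = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(demographic_parity_gap(decisions, groups))  # → 0.5
```

A large gap does not by itself prove bias, but it is a cheap signal that a decision pipeline deserves the kind of manual scrutiny our fairness category motivates.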

9.3 Culture: A Factor impacting Inclusiveness

An interesting observation emerging from our analysis is the potential underlying influence of culture on inclusiveness-related user feedback. Culture is defined as the “collective programming of the mind which leads to a common way of doing things by a group of people in a larger society” [48]. The taxonomy proposed in our study illustrates how a lack of inclusiveness can affect a user’s experience with software. The apps we analyzed have millions of users from different cultural backgrounds, yet app developers may not be aware of all the different cultural aspects, expectations, norms, and experiences. This can lead to software that fails to meet users’ expectations, resulting in a lack of inclusiveness.

To discuss the potential influence of culture, we draw on Hofstede’s Cultural Dimensions Theory [48] and consider, for example, the concepts of individualism and collectivism. Technology typically exhibits the characteristics of the culture in which it is developed [49]. However, depending on their cultural background (i.e., individualistic or collectivistic), users may expect different functionality. When software fails to meet different cultural expectations, users feel frustrated and consider leaving the app. The frustrations may be attributed to complicated or biased features, lack of available technology, and even technological literacy.

In taxonomy categories like privacy, demography, and other human values, representative quotes indicate a potential impact of ethnic culture and user preferences. This is a fruitful direction for future research, as the cultural impact on system usage and end-user preferences is largely unexplored. A study on culture and user feedback reported that aspects such as the length of a review, its sentiment, ratings, and the amount of useful feedback provided in app reviews can indicate a user’s cultural background in terms of Uncertainty Avoidance and individualism/collectivism [50]. However, these aspects focus on the characteristics of the feedback and lack an analysis of the actual content, which could help in understanding the challenges users encounter due to cultural differences.

In our study, for instance, we found a user wishing the app would grant access to his wife, which exemplifies a collectivist point of view. (Play Store) - “I like the app but you need to change your policy I would like to add access for my wife.” (Robinhood) Similarly, in the benevolence subcategory under other human values, we found users expressing a desire to include family members in their software. The inability to do so results in a sense of exclusion.

A study conducted in rural India reported that often only one male member of the family owns a mobile phone, and other members can only use it when he is around [51]. Another telling example comes from an ethnographic study of the introduction of ATMs in India, which highlighted that people often shared their bank cards with friends and family [52]. When first introduced to the ATM, they approached the learning process as a group and showed less concern about sharing sensitive information. This suggests a collectivist cultural perspective in which concern for privacy is not as prominent as in an individualistic culture. Thus, depending on their cultural background, users may prefer software features that accommodate their situation. Failing to consider these cultural nuances can generate a sense of exclusion among users and lead them to stop using the app entirely. Therefore, for software to be more inclusive, users’ cultural context must be understood.

9.4 Implications for Practitioners

For practitioners, our empirical study shows the importance of considering user feedback on inclusiveness and provides a practical approach to identifying inclusiveness-related user feedback for their software. We detail a number of implications for practitioners here:

1. The taxonomy of inclusiveness can be used to categorize user feedback so that issues are easy to identify and resolve. Developers are predominantly male, technically skilled, and affluent, and therefore differ significantly from the diverse end users they serve. Awareness of inclusiveness issues will allow them to learn about and consider diverse user needs and develop more inclusive software.

2. The increasing use of AI-enabled systems is resulting in a lack of inclusiveness that requires more attention. Even though automated AI-enabled systems are useful, they lack a sense of inclusiveness, and the issue is becoming a serious concern as generative AI techniques are now being deployed aggressively. Our study highlights these inclusiveness issues, which practitioners should recognize and consider during development.

3. Companies can prioritize specific inclusiveness concerns based on our findings for each type of app. Many software companies are resource-constrained and cannot address every single user need. Our study identifies categories from five types of software: business, entertainment, financial, shopping, and social media, and indicates which categories are more prevalent in each type so that companies can prioritize accordingly.

4. The automated approach proposed in this study offers a potential solution, in the form of automated flagging (e.g., a plugin) on source platforms, to the limitations of manually identifying inclusiveness-related user feedback. Online platforms like Reddit, Google Play Store, and Twitter generate a large volume of user reviews, making it challenging for companies to identify issues manually. A plugin tool that automatically flags inclusiveness issues could enable companies to easily detect inclusiveness-related feedback from the respective pages on Reddit, Twitter, and Google Play Store.
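As an illustration of the flagging idea in point 4, the sketch below shows the shape such a plugin's core step might take. The keyword list is a hypothetical stand-in for the trained classifier described in the study, not the authors' actual approach; a deployed tool would instead call the fine-tuned language model on each post.

```python
# Hypothetical cue list standing in for the fine-tuned classifier
INCLUSIVENESS_CUES = {"banned", "excluded", "accessibility", "screen reader", "language"}

def flag_inclusiveness(posts):
    """Return the posts whose text mentions any inclusiveness cue.
    In a real plugin, this check would be replaced by a model prediction."""
    flagged = []
    for post in posts:
        text = post.lower()
        if any(cue in text for cue in INCLUSIVENESS_CUES):
            flagged.append(post)
    return flagged

posts = [
    "Love the new dark mode!",
    "My account was banned with no explanation or appeal.",
    "The app does not work with my screen reader at all.",
]
print(flag_inclusiveness(posts))  # flags the second and third posts
```

The same loop could run against a stream of posts scraped from Reddit, Twitter, or the Play Store, surfacing flagged items in a triage dashboard rather than requiring manual review of every post.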

9.5 Implications for Researchers

Our findings carry several implications for future work:

1. More research should be conducted with practitioners to understand how they address inclusiveness-related user feedback, particularly how organizations manage inclusiveness requirements.

2. Researchers should devote more attention to studying additional sources of user feedback. These additional sources may help refine the categories and subcategories in our taxonomy.

3. Our study presents a large set of manually labelled inclusiveness-related user feedback. Future researchers can leverage this data to improve the automated classification approach and to automate identification of the categories of inclusiveness.

4. We found that inclusiveness concerns are often the result of human value violations, and a number of issues relate to Schwartz’s theory [39]. Future research can further explore whether other categories from the theory are prevalent.

5. In addition, we suggest that culture may influence end users’ perception of inclusiveness. We therefore believe that studying the cultural context from the end-user perspective is valuable, as it may help make software more inclusive.

9.6 Threats to Validity

We describe several threats and mitigation strategies in our study using the total quality framework of Roller [53].

Credibility indicates “the completeness and accuracy associated with data gathering” [53]. This study may be subject to sampling bias, as we collected user feedback on 50 apps from the selected feedback sources. However, we selected a diverse group of apps, and the feedback sources are common platforms that users often use to discuss concerns. Our study also used standard web-scraping libraries to collect the data. Additionally, we limited bias by creating randomly sampled batches of user feedback for manual annotation, and we did not give more weight to any particular app or source of feedback.

Analyzability refers to “completeness and accuracy related to the processing and verification of data” [53]. Two co-authors analyzed the data following a socio-technical grounded theory approach [9], using open coding, constant comparison, and memoing to analyze the feedback for inclusiveness. Furthermore, the co-authors were in constant dialogue during the coding process to ensure consistency and reduce bias. Since this study leverages user feedback from three popular sources (i.e., Reddit, Twitter, and Google Play) and different apps, we were able to triangulate our analysis across the different sources.

Transparency is the “completeness of the final documents and the degree to which the research can be fully evaluated and its transferability” [53]. We provide extensive and rich descriptions of our methodology, as well as detailed quotes to support our taxonomy. The entirety of our data is provided in our replication package, including our manually labelled dataset [30].

Usefulness specifies the “ability to do something of value with the research outcomes” [53]. Our study aims to shed more light on the role of inclusiveness in user feedback. More importantly, it aims to advance the state of knowledge of inclusiveness by providing a taxonomy of the different types of inclusiveness-related discussions. In particular, our study encompasses a significant amount of user feedback and offers empirical insights for organizations. We acknowledge that our results may not hold for every software app, but we believe organizations can benefit from the inclusiveness categories as they consider the concerns of diverse end users.

This paper is available on arxiv under CC 4.0 license.
