Authors:
(1) Samia Belhadj*, Lunit Inc., Seoul, Republic of Korea ([email protected]);
(2) Sanguk Park [0009-0005-0538-5522]*, Lunit Inc., Seoul, Republic of Korea ([email protected]);
(3) Ambika Seth, Lunit Inc., Seoul, Republic of Korea ([email protected]);
(4) Hesham Dar [0009-0003-6458-2097], Lunit Inc., Seoul, Republic of Korea ([email protected]);
(5) Thijs Kooi [0009-0003-6458-2097], Lunit Inc., Seoul, Republic of Korea ([email protected]).
Abstract. Fairness in medical AI is increasingly recognized as a crucial aspect of healthcare delivery. While most prior work on fairness emphasizes the importance of equal performance, we argue that decreases in fairness can be either harmful or non-harmful, depending on the type of change and how sensitive attributes are used. To this end, we introduce the notion of positive-sum fairness, which states that an increase in performance that results in a larger group disparity is acceptable as long as it does not come at the cost of individual subgroup performance. This allows sensitive attributes correlated with the disease to be used to increase performance without compromising on fairness.
We illustrate this idea by comparing four CNN models that make different use of the race attribute in the training phase. The results show that removing all demographic encodings from the images helps close the gap in performance between the different subgroups, whereas leveraging the race attribute as a model's input increases the overall performance while widening the disparities between subgroups. These larger gaps are then put into perspective against the collective benefit through our notion of positive-sum fairness, which distinguishes harmful from non-harmful disparities.
Medical imaging plays a critical role in diagnosis, treatment planning, and monitoring patient progress. However, the reliability of medical imaging algorithms is not uniformly distributed across different demographic groups, raising concerns about fairness and potential biases in the results. Fairness in medical imaging most often refers to the equitable treatment of patients from diverse demographic backgrounds, regardless of their gender, race, ethnicity, or other characteristics that are sensitive to discrimination [19,38].
This equitable treatment is often interpreted as similar performance across different demographic subgroups. In domains like credit card scoring or AI-powered recruiting, ignoring all sensitive attributes and prioritizing similar performance across demographic subgroups is an acceptable approach. In the medical field, however, demographic attributes are important clinical factors: radiologists and clinicians often take them into consideration because they can have a strong impact on diagnoses and can point toward specific tests or treatments based on the patient's demographic profile. The prevalence of diseases can be correlated with demographic attributes; for example, studies have shown that breast cancer has a higher incidence among Ashkenazi Jewish women [37,30]. Moreover, due to historical and social disparities as well as physiological differences across demographic subgroups, the difficulty of medical tasks is not uniformly distributed. For this reason, even collecting more data, or more diverse data, does not necessarily produce equal performance across demographic subgroups, as the best achievable result is not the same for each of them [27]. In a domain where each improvement can save lives, it is hard to disregard the benefit of the population as a whole for the sake of decreasing disparities between subgroups.
Petersen et al. [26] examined various types of demographic invariance in medical imaging AI, highlighting why they can be undesirable and stressing the need for better fairness assessments and mitigation techniques in this field. Several fairness measures degrade overall performance by penalizing an AI system on the groups it performs better on in order to achieve parity with the groups it performs worse on, a phenomenon referred to as "levelling down" [24]. There are training methods that aim to maximize the benefit of each subgroup (Berk Ustun [34], for instance, suggested debiasing methods following the ethical principles of beneficence ("do the best") and non-maleficence ("do not harm") [35] with regard to fairness), as well as methods that improve fairness by understanding and mitigating the demographic encodings present in images [39,3]. However, we could not find any fairness evaluation framework or definition that allows different models to be compared through the prism of harmful and non-harmful disparities.
We therefore introduce the notion of positive-sum fairness: given an initial model and the trade-off between fairness and performance faced when trying to improve it, inequitable performance can be acceptable as long as it does not come at the expense of other subgroups and allows a higher overall performance to be achieved. Specifically, we argue that differences in performance can be either harmful or non-harmful. We consider a disparity harmful if it comes at the cost of the overall performance, or if improving the overall performance is achieved by decreasing the performance on any protected subgroup. A difference in performance across protected subgroups is considered non-harmful if, by improving an AI system's performance, we exacerbate the disparities between subgroups without negatively impacting any specific subgroup. This main idea is summarized in Figure 1.
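To make the criterion concrete, the following is a minimal sketch in Python (not part of the paper); the subgroup labels, metric values, and function name are illustrative assumptions, with per-subgroup AUROC standing in for whatever performance metric is used.

```python
# Minimal sketch of the positive-sum fairness criterion described above.
# Subgroup names and metric values are illustrative, not the paper's results.

def is_positive_sum_fair(baseline: dict, candidate: dict,
                         baseline_overall: float, candidate_overall: float) -> bool:
    """True if the candidate model improves overall performance
    without decreasing performance on any protected subgroup."""
    improves_overall = candidate_overall > baseline_overall
    no_subgroup_harmed = all(candidate[g] >= baseline[g] for g in baseline)
    return improves_overall and no_subgroup_harmed


# Example: the subgroup gap widens (0.90 - 0.82 > 0.85 - 0.80), but every
# subgroup improves, so the larger disparity is considered non-harmful.
baseline = {"asian": 0.85, "black": 0.80, "white": 0.84}
candidate = {"asian": 0.90, "black": 0.82, "white": 0.88}
print(is_positive_sum_fair(baseline, candidate,
                           baseline_overall=0.83, candidate_overall=0.87))  # True
```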
We compare the positive-sum fairness framework with a more traditional group fairness definition, namely the largest disparity in performance across subgroups. We show that some models, while increasing this disparity, actually improve the performance of each subgroup individually, and that other models which decrease the disparity ("improving fairness" from a classic point of view) harm some subgroups to achieve it.
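For contrast, a sketch of the traditional group-fairness quantity used in this comparison is given below; again the numbers are illustrative assumptions, not the paper's experimental results.

```python
# Traditional group-fairness quantity: largest disparity across subgroups.

def max_disparity(per_group: dict) -> float:
    """Gap between the best- and worst-performing subgroups."""
    return max(per_group.values()) - min(per_group.values())


baseline = {"asian": 0.85, "black": 0.80, "white": 0.84}
candidate = {"asian": 0.90, "black": 0.82, "white": 0.88}

# The candidate widens the disparity (0.08 vs 0.05) yet improves every
# subgroup: classic group fairness penalizes it, positive-sum fairness does not.
print(round(max_disparity(baseline), 2), round(max_disparity(candidate), 2))  # 0.05 0.08
```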
This paper is available on arxiv under CC BY-NC-SA 4.0 license.
* These authors contributed equally to this work.