
New Findings Show How Positive-Sum Fairness Changes the Performance of Medical AI Models

by Demographic, December 31st, 2024

Too Long; Didn't Read

Researchers find that while M2 improves AI performance, M4 enhances fairness by removing race data, and M3 shows inconsistent results due to demographic encoding.
  1. Abstract and Introduction

  2. Related work

  3. Methods

    3.1 Positive-sum fairness

    3.2 Application

  4. Experiments

    4.1 Initial results

    4.2 Positive-sum fairness

  5. Conclusion and References

4.1 Initial results

Under traditional group fairness, the results of the four models shown in Figure 3a suggest the following conclusions:


M2 improves the overall performance. Our results show that M2 outperforms M1 in terms of AUROC. This is in line with our expectation, as we are providing an additional relevant medical feature for the model to learn from. This better performance comes with a larger gap in AUROC between the most advantaged and most discriminated races, that is, less fairness from a traditional point of view. However, this larger disparity is not necessarily harmful under positive-sum fairness, as we discuss in the next section.


M4 improves the fairness. M4 improves fairness for lung lesions and consolidations, while performing similarly for pneumonia and pleural effusion. The improved fairness is likely due to the gradient reversal layer, which removes race information from the image and prevents the model from exploiting any demographic shortcut.


No clear pattern for M3. The results for M3 are less consistent. Its performance is lower than the baseline's except for pneumonia, and its fairness measure is sometimes higher and sometimes lower than the baseline's. If the baseline model exploited demographic encodings present in the images to generate shortcuts, training M3 to maximize race prediction might have intensified the impact of these shortcuts.
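
The fairness measure referred to in the M2 comparison, the gap in AUROC between the most advantaged and the most discriminated group, can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' evaluation code: the function names (`auroc`, `auroc_by_group`, `auroc_gap`), the group labels "A" and "B", and the toy labels and scores are all assumptions made up for the example.

```python
from collections import defaultdict


def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random
    negative; tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auroc_by_group(labels, scores, groups):
    """Compute AUROC separately for each demographic group."""
    buckets = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, groups):
        buckets[g][0].append(y)
        buckets[g][1].append(s)
    return {g: auroc(ys, ss) for g, (ys, ss) in buckets.items()}


def auroc_gap(per_group):
    """Gap between the best-served and worst-served group."""
    return max(per_group.values()) - min(per_group.values())


# Toy data (hypothetical): four examples per group.
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.1, 0.8, 0.3, 0.6, 0.7, 0.8, 0.2]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

per_group = auroc_by_group(labels, scores, groups)
gap = auroc_gap(per_group)
print(per_group, gap)
```

A larger `gap` means more disparity in the traditional sense. The gap alone does not say whether any group's AUROC went down relative to a baseline, which is why the per-group values are kept alongside it.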

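
The gradient reversal layer mentioned for M4 can also be illustrated without a deep-learning framework. The sketch below is not the authors' implementation: the class name `GradientReversal` and the scaling factor `lam` are assumptions for illustration, and a real model would implement this as a custom autograd operation in its framework. The layer passes features through unchanged in the forward pass and flips the sign of the gradient in the backward pass, so the race-prediction head still learns to predict race while the feature extractor beneath it is pushed in the opposite direction.

```python
class GradientReversal:
    """Identity in the forward pass, sign-flipped gradient in the backward pass.

    `lam` (hypothetical name) scales how strongly the reversed gradient
    pushes the upstream feature extractor.
    """

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, features):
        # Features reach the race-prediction head unchanged.
        return list(features)

    def backward(self, grad_output):
        # The gradient flowing back to the feature extractor is negated,
        # so minimizing the race loss downstream maximizes it upstream.
        return [-self.lam * g for g in grad_output]


layer = GradientReversal(lam=0.5)
features = [1.0, -2.0]
passed = layer.forward(features)          # same values as the input
grad_from_race_head = [0.2, -0.4]         # made-up gradient values
grad_to_extractor = layer.backward(grad_from_race_head)
print(passed, grad_to_extractor)
```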

Authors:

(1) Samia Belhadj∗, Lunit Inc., Seoul, Republic of Korea ([email protected]);

(2) Sanguk Park [0009-0005-0538-5522]∗, Lunit Inc., Seoul, Republic of Korea ([email protected]);

(3) Ambika Seth, Lunit Inc., Seoul, Republic of Korea ([email protected]);

(4) Hesham Dar [0009-0003-6458-2097], Lunit Inc., Seoul, Republic of Korea ([email protected]);

(5) Thijs Kooi [0009-0003-6458-2097], Lunit Inc., Seoul, Republic of Korea ([email protected]).


This paper is available on arXiv under a CC BY-NC-SA 4.0 license.