Effective Bias Detection and Mitigation: Key Findings from BiasPainter’s Evaluation

Too Long; Didn't Read

BiasPainter is a new framework for evaluating social biases in image generation models by comparing edited seed images with generated outputs. It outperforms existing methods by integrating both images and text prompts, showing high accuracy in detecting and mitigating bias. The framework is validated through experiments on commercial and research models, with all resources available for further research.

Authors:

(1) Wenxuan Wang, The Chinese University of Hong Kong, Hong Kong, China;

(2) Haonan Bai, The Chinese University of Hong Kong, Hong Kong, China;

(3) Jen-tse Huang, The Chinese University of Hong Kong, Hong Kong, China;

(4) Yuxuan Wan, The Chinese University of Hong Kong, Hong Kong, China;

(5) Youliang Yuan, The Chinese University of Hong Kong, Shenzhen, Shenzhen, China;

(6) Haoyi Qiu, University of California, Los Angeles, Los Angeles, USA;

(7) Nanyun Peng, University of California, Los Angeles, Los Angeles, USA;

(8) Michael Lyu, The Chinese University of Hong Kong, Hong Kong, China.

Abstract

1 Introduction

2 Background

3 Approach and Implementation

3.1 Seed Image Collection and 3.2 Neutral Prompt List Collection

3.3 Image Generation and 3.4 Properties Assessment

3.5 Bias Evaluation

4 Evaluation

4.1 Experimental Setup

4.2 RQ1: Effectiveness of BiasPainter

4.3 RQ2: Validity of Identified Biases

4.4 RQ3: Bias Mitigation

5 Threats to Validity

6 Related Work

7 Conclusion, Data Availability, and References

7 CONCLUSION

In this paper, we design and implement BiasPainter, a metamorphic testing framework for measuring social biases in image generation models. Unlike existing frameworks, which use only sentence descriptions as input and evaluate properties of the generated images, BiasPainter adopts an image-editing approach: it provides both a seed image and a sentence description, lets the image generation model edit the seed image, and then compares the generated image against the seed image to measure bias. We conduct experiments on five widely deployed commercial systems and well-known research models to verify the effectiveness of BiasPainter, and we demonstrate that BiasPainter can trigger a large number of biased behaviors with high accuracy. In addition, we show that BiasPainter can help mitigate bias in image generation models.
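Concretely, this image-editing workflow amounts to a short metamorphic test loop. The Python sketch below is illustrative only: edit_image, assess_face, and the age tolerance are hypothetical stand-ins for the model under test and the property-assessment step, not the paper's released implementation.

# A minimal sketch of the metamorphic test loop described above, assuming
# hypothetical helpers: edit_image() wraps the image generation model under
# test, and assess_face() estimates face properties (e.g., via an
# off-the-shelf face-analysis model). Neither name comes from the paper's
# released code, and the age tolerance below is an illustrative assumption.

from dataclasses import dataclass


@dataclass
class FaceProperties:
    gender: str  # e.g., "male" or "female"
    race: str    # e.g., "White", "Black", "Asian"
    age: int     # estimated age in years


def edit_image(seed_image_path: str, prompt: str) -> str:
    """Ask the model under test to edit the seed image according to the
    prompt; return the path of the generated image. (Hypothetical wrapper.)"""
    raise NotImplementedError


def assess_face(image_path: str) -> FaceProperties:
    """Estimate the gender, race, and age of the face in an image.
    (Hypothetical property-assessment step.)"""
    raise NotImplementedError


def run_biaspainter(seed_images: list[str], neutral_prompts: list[str]) -> list:
    """Metamorphic relation: an edit driven by a gender-, race-, and
    age-neutral prompt (e.g., "a photo of a lawyer") should not change the
    person's gender, race, or age. Any change is a suspected biased behavior."""
    suspected = []
    for seed in seed_images:
        before = assess_face(seed)
        for prompt in neutral_prompts:
            generated = edit_image(seed, prompt)
            after = assess_face(generated)
            if (after.gender != before.gender
                    or after.race != before.race
                    or abs(after.age - before.age) > 10):  # assumed tolerance
                suspected.append((seed, prompt, before, after))
    return suspected

The metamorphic relation is the key design choice here: because the prompts are deliberately neutral with respect to gender, race, and age, any change in those properties between the seed image and the edited image can be attributed to the model rather than to the input.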

DATA AVAILABILITY

All the code, data, and results have been uploaded[17] and will be released for reproduction and future research.

REFERENCES

[1] Jaimeen Ahn, Hwaran Lee, Jinhwa Kim, and Alice Oh. 2022. Why Knowledge Distillation Amplifies Gender Bias and How to Mitigate from the Perspective of DistilBERT. Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP) (2022). https://api.semanticscholar.org/CorpusID:250390701


[2] Hritik Bansal, Da Yin, Masoud Monajatipoor, and Kai-Wei Chang. 2022. How well can Text-to-Image Generative Models understand Ethical Natural Language Interventions?. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 1358–1370. https://doi.org/10.18653/v1/2022.emnlp-main.88


[3] Federico Bianchi, Pratyusha Kalluri, Esin Durmus, Faisal Ladhak, Myra Cheng, Debora Nozza, Tatsunori Hashimoto, Dan Jurafsky, James Y. Zou, and Aylin Caliskan. 2022. Easily Accessible Text-to-Image Generation Amplifies Demographic Stereotypes at Large Scale. Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency (2022). https://api.semanticscholar.org/CorpusID:253383708


[4] Shikha Bordia and Samuel R. Bowman. 2019. Identifying and Reducing Gender Bias in Word-Level Language Models. In North American Chapter of the Association for Computational Linguistics.


[5] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. 2023. InstructPix2Pix: Learning to Follow Image Editing Instructions. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023).


[6] Qiong Cao, Li Shen, Weidi Xie, Omkar M. Parkhi, and Andrew Zisserman. 2018. VGGFace2: A dataset for recognising faces across pose and age. In International Conference on Automatic Face and Gesture Recognition.


[7] Nicholas Carlini, Pratyush Mishra, Tavish Vaidya, Yuankai Zhang, Michael E. Sherr, Clay Shields, David A. Wagner, and Wenchao Zhou. 2016. Hidden Voice Commands. In USENIX Security Symposium.


[8] Joymallya Chakraborty, Suvodeep Majumder, and Tim Menzies. 2021. Bias in machine learning software: why? how? what to do? Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (2021).


[9] Tsong Yueh Chen, S. C. Cheung, and Siu-Ming Yiu. 2020. Metamorphic Testing: A New Approach for Generating Next Test Cases. ArXiv abs/2002.12543 (2020).


[10] Tsong Yueh Chen, Joshua W. K. Ho, Huai Liu, and Xiaoyuan Xie. 2008. An innovative approach for testing bioinformatics programs using metamorphic testing. BMC Bioinformatics 10 (2008), 24.


[11] Zhenpeng Chen, J Zhang, Max Hort, Federica Sarro, and Mark Harman. 2022. Fairness Testing: A Comprehensive Survey and Analysis of Trends. ArXiv abs/2207.10223 (2022). https://api.semanticscholar.org/CorpusID:250920488


[12] Jaemin Cho, Abhaysinh Zala, and Mohit Bansal. 2022. DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers. ArXiv abs/2202.04053 (2022). https://api.semanticscholar.org/CorpusID:246652218


[13] Aiyub Dawood. 2023. Number of Midjourney Users and Statistics. https://www.mlyearning.org/midjourney-users-statistics/. Accessed: 2023-08-01.


[14] Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified Language Model Pre-training for Natural Language Understanding and Generation. CoRR abs/1905.03197 (2019). arXiv:1905.03197 http://arxiv.org/abs/1905.03197


[15] Leah V Durant. 2004. Gender bias and the legal profession: A discussion of why there are still so few women on the bench. U. Md. LJ Race, Religion, Gender & Class 4 (2004), 181.


[16] Anurag Dwarakanath, Manish Ahuja, Samarth Sikand, Raghotham M. Rao, R. P. Jagadeesh Chandra Bose, Neville Dubash, and Sanjay Podder. 2018. Identifying implementation bugs in machine learning based image classifiers using metamorphic testing. Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis (2018).


[17] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. 2014. Generative adversarial networks. In NIPS.


[18] Shashij Gupta. 2020. Machine Translation Testing via Pathological Invariance. 2020 IEEE/ACM 42nd International Conference on Software Engineering: Companion Proceedings (ICSE-Companion) (2020), 107–109.


[19] Pinjia He, Clara Meister, and Zhendong Su. 2021. Testing Machine Translation via Referential Transparency. 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE) (2021), 410–422.


[20] Max Hort, Zhenpeng Chen, J Zhang, Federica Sarro, and Mark Harman. 2022. Bias Mitigation for Machine Learning Classifiers: A Comprehensive Survey. ArXiv abs/2207.07068 (2022). https://api.semanticscholar.org/CorpusID:250526377


[21] Hugo Touvron et al. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. ArXiv abs/2307.09288 (2023). https://api.semanticscholar.org/CorpusID:259950998


[22] Nargiz Humbatova, Gunel Jahangirova, and Paolo Tonella. 2021. DeepCrime: mutation testing of deep learning systems based on real faults. Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis (2021).


[23] Os Keyes, Chandler May, and Annabelle Carrell. 2021. You Keep Using That Word: Ways of Thinking about Gender in Computing Research. Proc. ACM Hum.-Comput. Interact. 5, CSCW1, Article 39 (apr 2021), 23 pages. https://doi.org/10.1145/3449113


[24] Sam Levin. 2018. Tesla fatal crash: ’autopilot’ mode sped up car before driver killed, report finds [Online]. https://www.theguardian.com/technology/2018/jun/07/tesla-fatal-crash-silicon-valley-autopilot-mode-report. Accessed: 2018-06.


[25] Xinyue Li, Zhenpeng Chen, Jie Zhang, Federica Sarro, Y. Zhang, and Xuanzhe Liu. 2023. Dark-Skin Individuals Are at More Risk on the Street: Unmasking Fairness Issues of Autonomous Driving Systems. ArXiv abs/2308.02935 (2023). https://api.semanticscholar.org/CorpusID:260682572


[26] Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out. Association for Computational Linguistics, Barcelona, Spain, 74–81. https://aclanthology.org/W04-1013


[27] Yuanfu Luo, Malika Meghjani, Qi Heng Ho, David Hsu, and Daniela Rus. 2021. Interactive Planning for Autonomous Urban Driving in Adversarial Scenarios. 2021 IEEE International Conference on Robotics and Automation (ICRA) (2021), 5261–5267.


[28] L. Ma, Felix Juefei-Xu, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Chunyang Chen, Ting Su, Li Li, Yang Liu, Jianjun Zhao, and Yadong Wang. 2018. DeepGauge: Multi-Granularity Testing Criteria for Deep Learning Systems. 2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE) (2018), 120–131. https://api.semanticscholar.org/CorpusID:36353796


[29] Ninareh Mehrabi, Fred Morstatter, Nripsuta Ani Saxena, Kristina Lerman, and A. G. Galstyan. 2019. A Survey on Bias and Fairness in Machine Learning. ACM Computing Surveys (CSUR) 54 (2019), 1–35. https://api.semanticscholar.org/CorpusID:201666566


[30] Midjourney, Inc. 2023. Midjourney. https://www.midjourney.com/. Accessed: 2023-08-01.


[31] Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Sekhar Jana. 2017. DeepXplore: Automated Whitebox Testing of Deep Learning Systems. Proceedings of the 26th Symposium on Operating Systems Principles (2017).


[32] Hung Viet Pham, Mijung Kim, Lin Tan, Yaoliang Yu, and Nachiappan Nagappan. 2021. DEVIATE: A Deep Learning Variance Testing Framework. 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE) (2021), 1286–1290.


[33] Evaggelia Pitoura, Kostas Stefanidis, and Georgia Koutrika. 2021. Fairness in rankings and recommendations: an overview. The VLDB Journal 31 (2021), 431–458. https://api.semanticscholar.org/CorpusID:233219774


[34] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. 2023. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. arXiv:2307.01952 [cs.CV]


[35] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In International Conference on Machine Learning. https://api.semanticscholar.org/CorpusID:231591445


[36] Vincenzo Riccio, Gunel Jahangirova, Andrea Stocco, Nargiz Humbatova, Michael Weiss, and Paolo Tonella. 2020. Testing machine learning based systems: a systematic mapping. Empir. Softw. Eng. 25 (2020), 5193–5254.


[37] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2021. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752 [cs.CV]


[38] Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2021. High-Resolution Image Synthesis with Latent Diffusion Models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021), 10674–10685. https://api.semanticscholar.org/CorpusID:245335280


[39] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-Resolution Image Synthesis With Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10684–10695.


[40] Qingchao Shen, Junjie Chen, J Zhang, Haoyu Wang, Shuang Liu, and Menghan Tian. 2022. Natural Test Generation for Precise Testing of Question Answering Software. Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering (2022).


[41] Charles Hamilton Smith and Samuel Kneeland. [n. d.]. The natural history of the human species. https://api.semanticscholar.org/CorpusID:162691300


[42] Tony Sun, Andrew Gaut, Shirlyn Tang, Yuxin Huang, Mai Elsherief, Jieyu Zhao, Diba Mirza, Elizabeth M. Belding-Royer, Kai-Wei Chang, and William Yang Wang. 2019. Mitigating Gender Bias in Natural Language Processing: Literature Review. In Annual Meeting of the Association for Computational Linguistics. https://api.semanticscholar.org/CorpusID:195316733


[43] Yuchi Tian, Kexin Pei, Suman Sekhar Jana, and Baishakhi Ray. 2017. DeepTest: Automated Testing of Deep-Neural-Network-Driven Autonomous Cars. 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE) (2017), 303–314. https://api.semanticscholar.org/CorpusID:4055261


[44] Jen-tse Huang, Jianping Zhang, Wenxuan Wang, Pinjia He, Yuxin Su, and Michael R. Lyu. 2022. AEON: a method for automatic evaluation of NLP test cases. Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis (2022).


[45] James Tu, Huichen Li, Xinchen Yan, Mengye Ren, Yun Chen, Ming Liang, Eilyan Bitar, Ersin Yumer, and Raquel Urtasun. 2021. Exploring Adversarial Robustness of Multi-Sensor Perception Systems in Self Driving. ArXiv abs/2101.06784 (2021).


[46] Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. 2017. Neural Discrete Representation Learning. NIPS (2017).


[47] Yuxuan Wan, Wenxuan Wang, Pinjia He, Jiazhen Gu, Haonan Bai, and Michael R. Lyu. 2023. BiasAsker: Measuring the Bias in Conversational AI System. FSE (2023).


[48] Jingyi Wang, Jialuo Chen, Youcheng Sun, Xingjun Ma, Dongxia Wang, Jun Sun, and Peng Cheng. 2021. RobOT: Robustness-Oriented Testing for Deep Learning Systems. 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE) (2021), 300–311.


[49] Jialu Wang, Xinyue Liu, Zonglin Di, Y. Liu, and Xin Eric Wang. 2023. T2IAT: Measuring Valence and Stereotypical Biases in Text-to-Image Generation. ACL (2023).


[50] Wenxuan Wang, Jingyuan Huang, Chang Chen, Jiazhen Gu, Jianping Zhang, Weibin Wu, Pinjia He, and Michael R. Lyu. 2023. Validating Multimedia Content Moderation Software via Semantic Fusion. Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (2023). https://api.semanticscholar.org/CorpusID:258840941


[51] Wenxuan Wang, Jingyuan Huang, Jen-tse Huang, Chang Chen, Jiazhen Gu, Pinjia He, and Michael R. Lyu. 2023. An Image is Worth a Thousand Toxic Words: A Metamorphic Testing Framework for Content Moderation Software. 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE) (2023), 1339–1351. https://api.semanticscholar.org/CorpusID:261048642


[52] Wenxuan Wang, Wenxiang Jiao, Jingyuan Huang, Ruyi Dai, Jen-tse Huang, Zhaopeng Tu, and Michael R. Lyu. 2023. Not All Countries Celebrate Thanksgiving: On the Cultural Dominance in Large Language Models. ArXiv abs/2310.12481 (2023). https://api.semanticscholar.org/CorpusID:264305810


[53] Wenxuan Wang, Jen-tse Huang, Weibin Wu, Jianping Zhang, Yizhan Huang, Shuqing Li, Pinjia He, and Michael R. Lyu. 2023. MTTM: Metamorphic Testing for Textual Content Moderation Software. ArXiv abs/2302.05706 (2023).


[54] Wenxuan Wang, Zhaopeng Tu, Chang Chen, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, and Michael R. Lyu. 2023. All Languages Matter: On the Multilingual Safety of Large Language Models. ArXiv abs/2310.00905 (2023). https://api.semanticscholar.org/CorpusID:263605466


[55] Craig S. Webster, S Taylor, Courtney Anne De Thomas, and Jennifer M Weller. 2022. Social bias, discrimination and inequity in healthcare: mechanisms, implications and recommendations. BJA education (2022).


[56] Xiaoyuan Xie, Joshua W. K. Ho, Christian Murphy, Gail E. Kaiser, Baowen Xu, and Tsong Yueh Chen. 2011. Testing and validating machine learning classifiers by metamorphic testing. The Journal of systems and software (2011).


[57] J Zhang and Mark Harman. 2021. "Ignorance and Prejudice" in Software Fairness. 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE) (2021), 1436–1447.


[58] J Zhang, Mark Harman, Lei Ma, and Yang Liu. 2019. Machine Learning Testing: Survey, Landscapes and Horizons. IEEE Transactions on Software Engineering 48 (2019), 1–36. https://api.semanticscholar.org/CorpusID:195657970


[59] J Zhang, Mark Harman, Lei Ma, and Yang Liu. 2022. Machine Learning Testing: Survey, Landscapes and Horizons. IEEE Transactions on Software Engineering 48 (2022), 1–36.


[60] Mengshi Zhang, Yuqun Zhang, Lingming Zhang, Cong Liu, and Sarfraz Khurshid. 2018. DeepRoad: GAN-Based Metamorphic Testing and Input Validation Framework for Autonomous Driving Systems. 2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE) (2018).


[61] Yanzhe Zhang, Lu Jiang, Greg Turk, and Diyi Yang. 2023. Auditing Gender Presentation Differences in Text-to-Image Models. arXiv:2302.03675 [cs.CV]


[62] Jianyi Zhou, Feng Li, Jinhao Dong, Hongyu Zhang, and Dan Hao. 2020. Cost-Effective Testing of a Deep Learning Model through Input Reduction. 2020 IEEE 31st International Symposium on Software Reliability Engineering (ISSRE) (2020), 289–300. https://api.semanticscholar.org/CorpusID:212843936


[63] Chris Ziegler. 2016. A Google self-driving car caused a crash for the first time [Online]. https://www.theverge.com/2016/2/29/11134344/google-self-driving-car-crash-report. Accessed: 2016-09.


This paper is available on arXiv under the CC0 1.0 DEED license.


[17] https://drive.google.com/drive/folders/1VDe5EKszv9TEvJRygK7tIyQDLaeE4Rsn?usp=drive_link