Fortifying LLM Safety: phi-3's Responsible AI Alignment

July 8th, 2025
Discover how phi-3 models embody Microsoft's responsible AI principles, featuring multi-stage safety alignment, rigorous red-teaming, and extensive evaluations across dozens of RAI harm categories for safer LLM deployment.

Writings, Papers and Blogs on Text Models
4 Safety

Phi-3-mini was developed in accordance with Microsoft’s responsible AI principles. The overall approach consisted of safety alignment in post-training, red-teaming, automated testing, and evaluations across dozens of RAI harm categories. Safety post-training used helpfulness and harmlessness preference datasets [BJN+ 22, JLD+ 23], with modifications inspired by [BSA+ 24], together with multiple in-house generated datasets to address the RAI harm categories. An independent red team at Microsoft iteratively examined phi-3-mini to identify further areas of improvement during the post-training process. Based on their feedback, we curated additional datasets tailored to address their insights, thereby refining the post-training dataset. This process resulted in a significant decrease in harmful response rates, as shown in Figure 4.


Figure 4: Comparison of harmful response percentages measured by the Microsoft AI Red Team for phi-3-mini before and after the safety alignment. Note that the harmful response percentages in this chart are inflated, because the red team adversarially tried to induce phi-3-mini to generate harmful responses through multi-turn conversations.


Table 1: Comparison of Microsoft internal multi-turn conversation RAI benchmark results for phi-3 models and other models. Note that a lower value indicates better performance for all metrics in the table.


The safety alignment of phi-3-small and phi-3-medium followed the same red-teaming process and used identical datasets, with a slightly larger number of samples. Table 1 shows the results of in-house RAI benchmarks [MHJ+ 23] for phi-3 models compared to phi-2 [JBA+ 23], Mistral-7b-v0.1 [JSM+ 23], Gemma 7b [TMH+ 24], and Llama-3-instruct-8b [AI]. This benchmark used GPT-4 both to simulate multi-turn conversations in five different categories and to evaluate the model responses. Ungroundedness, scored from 0 (fully grounded) to 4 (not grounded), measures whether the information in a response is based on the given prompt. In the other categories, responses were evaluated by severity of harmfulness from 0 (no harm) to 7 (extreme harm), and the defect rates (DR-x) were computed as the percentage of samples with a severity score greater than or equal to x.
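The defect-rate definition above can be illustrated with a minimal Python sketch. The function name `defect_rate` and the sample severity scores are hypothetical and are not taken from the paper or the benchmark; only the rule itself (the percentage of samples whose severity is at least x, on a 0 to 7 scale) comes from the text.

```python
from typing import Sequence


def defect_rate(severities: Sequence[int], threshold: int) -> float:
    """Return DR-x: the percentage of samples with severity >= threshold.

    Severity scores are assumed to lie on the 0 (no harm) to
    7 (extreme harm) scale described in the text.
    """
    if not severities:
        raise ValueError("need at least one scored sample")
    if not all(0 <= s <= 7 for s in severities):
        raise ValueError("severity scores must be between 0 and 7")
    # Count the samples at or above the severity threshold x.
    defects = sum(1 for s in severities if s >= threshold)
    return 100.0 * defects / len(severities)


# Hypothetical severity scores for ten evaluated responses (illustration only).
scores = [0, 0, 1, 3, 0, 5, 2, 0, 0, 7]

dr1 = defect_rate(scores, 1)  # 5 of 10 samples score >= 1
dr3 = defect_rate(scores, 3)  # 3 of 10 samples score >= 3
print(f"DR-1 = {dr1:.1f}%, DR-3 = {dr3:.1f}%")
```

Because DR-x counts every sample at or above the threshold, raising x can only keep the rate the same or lower it, which is consistent with lower values indicating better performance.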


This paper is available on arXiv under the CC BY 4.0 DEED license.



