This paper is available on arXiv under a CC 4.0 license.
Authors:
(1) Shanshan Han & Qifan Zhang, UCI;
(2) Wenxuan Wu, Texas A&M University;
(3) Baturalp Buyukates, Yuhang Yao & Weizhao Jin, USC;
(4) Salman Avestimehr, USC & FedML.
This section presents a comprehensive evaluation of our approach. We focus on the following aspects: (i) effectiveness of the cross-round check; (ii) effectiveness of the cross-client detection; (iii) performance of our approach compared with other defenses; (iv) robustness of our approach against a dynamic subset of malicious clients; (v) defense coverage of our approach on different models; (vi) performance of our ZKP-verified anomaly detection protocol.
Experimental setting. A summary of the datasets and models used in our evaluations can be found in Table 1. By default, we employ CNN and the non-i.i.d. FEMNIST dataset (partition parameter α = 0.5), as the non-i.i.d. setting closely captures real-world scenarios. We utilize FedAVG in our experiments. By default, we use 10 clients for FL training, and all clients participate in training in each round. We employ two attack mechanisms: Byzantine attacks of random mode (Chen et al., 2017; Fang et al., 2020) and the model replacement backdoor attack (Bagdasaryan et al., 2020b). We utilize three baseline defense mechanisms: m-Krum (Blanchard et al., 2017), Foolsgold (Fung et al., 2020), and RFA (Pillutla et al., 2022). For m-Krum, we set m to 5, meaning that 5 out of the 10 submitted local models participate in aggregation in each FL training round. By default, results are evaluated by the accuracy of the global model. Evaluations for anomaly detection are conducted on a server with 8 NVIDIA A100-SXM4-80GB GPUs, and evaluations for ZKP are conducted on Amazon AWS with an m5a.4xlarge instance with 16 CPU cores and 32 GB of memory. Our code is implemented using the FedMLSecurity benchmark (Han et al., 2023).
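For reference, FedAVG aggregates client models by a sample-count-weighted average. The following is a minimal sketch, assuming flattened NumPy parameter vectors and illustrative variable names; it is not the FedMLSecurity implementation:

```python
import numpy as np

def fedavg(local_models, num_samples):
    """Weighted average of client model parameters (FedAVG sketch).

    local_models: list of 1-D parameter vectors, one per client
    num_samples:  list of local dataset sizes, one per client
    """
    weights = np.asarray(num_samples, dtype=float)
    weights /= weights.sum()            # normalize by total sample count
    stacked = np.stack(local_models)    # shape: (n_clients, n_params)
    return np.average(stacked, axis=0, weights=weights)
```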
Exp 1: Selection of the importance layer. We utilize the norm of gradients to evaluate the “sensitivity” of each layer. A layer with a high norm indicates high sensitivity and can represent the whole model. We evaluate the sensitivity of the layers of CNN, RNN, and ResNet-56. The results for RNN are shown in Figure 4, and the results for ResNet-56 and CNN are deferred to Figure 14 and Figure 13 in Section B.2, respectively. The results show that we can utilize the second-to-last layer of a model as the importance layer, as it carries adequate information about the whole model.
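As an illustration of this sensitivity measure, per-layer gradient norms can be computed as in the PyTorch-style sketch below, where `model` and `loss` are placeholders rather than our exact training setup:

```python
import torch

def layer_gradient_norms(model, loss):
    """Return the L2 norm of the gradient for each named parameter tensor."""
    model.zero_grad()
    loss.backward()
    norms = {}
    for name, param in model.named_parameters():
        if param.grad is not None:
            norms[name] = param.grad.norm(p=2).item()
    # The layer with the highest norm is the most "sensitive" and is a
    # candidate importance layer.
    return norms
```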
In this subsection, we assess the effectiveness of the cross-round check in detecting attacks. We employ the random-Byzantine and the model replacement backdoor attacks, and set the attack probability to 40% for each FL iteration, where 1 out of the 10 clients is malicious when an attack happens. Ideally, the cross-round check should accurately confirm the absence or presence of an attack. The effectiveness of our approach is evaluated by the cross-round check accuracy, which measures the proportion of iterations in which the algorithm correctly detects cases with or without an attack, relative to the total number of iterations. A 100% cross-round check accuracy means that all attacks are detected and none of the benign cases are identified as “attacks”.
Exp 2: Impact of the similarity threshold. We evaluate the impact of the cosine similarity threshold γ in the cross-round check by setting γ to 0.5, 0.6, 0.7, 0.8, and 0.9. According to Algorithm 2, a cosine score lower than γ indicates low similarity between a client's model in the current round and its model in the last round, which may be an indicator of an attack. As shown in Figure 5, the cross-round accuracy is close to 100% in the case of random-Byzantine attacks. Further, when the cosine similarity threshold γ is set to 0.5, the performance is good in all cases, with at least 93% cross-round detection accuracy.
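A simplified sketch of this check follows, under these assumptions: each client's importance layer is flattened into a vector, and an iteration is flagged when any client's cosine similarity to its previous-round submission falls below γ. Function names are illustrative; see Algorithm 2 for the full procedure.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two flattened parameter vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cross_round_check(curr_layers, prev_layers, gamma=0.5):
    """Flag a potential attack if any client's importance layer has
    cosine similarity below gamma w.r.t. its previous-round submission."""
    suspicious = [i for i, (c, p) in enumerate(zip(curr_layers, prev_layers))
                  if cosine(c, p) < gamma]
    return len(suspicious) > 0, suspicious
```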
This subsection evaluates whether the cross-client detection can successfully detect malicious client submissions in cases with attacks. We evaluate the quality of anomaly detection using a modified Positive Predictive Value (PPV) (Fletcher, 2019), i.e., precision: the proportion of positive results in statistics and diagnostic tests that are true positives. Denoting the numbers of true positive and false positive results by N_TP and N_FP, respectively, PPV = N_TP / (N_TP + N_FP). In the context of anomaly detection in FL, client submissions that are detected as “malicious” and are actually malicious are True Positives, counted by N_TP, while client submissions that are detected as “malicious” even though they are benign are False Positives, counted by N_FP. Since we would like the PPV to reveal the relation between N_TP and the total number of malicious submissions across all FL iterations, denoted by N_total, whether they are detected or not, we introduce a modified PPV as PPV = N_TP / (N_TP + N_FP + N_total), where 0 ≤ PPV ≤ 1/2. In the ideal case, all malicious submissions are detected, so N_TP = N_total and the modified PPV equals 1/2. Due to page limitations, we defer the proof of this statement to Appendix A.1.
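The metric itself is straightforward to compute; a minimal sketch:

```python
def modified_ppv(n_tp, n_fp, n_total):
    """Modified PPV in [0, 1/2]; equals 1/2 when all malicious submissions
    are detected (n_tp == n_total) with no false positives (n_fp == 0)."""
    return n_tp / (n_tp + n_fp + n_total)
```

For example, with 400 malicious submissions in total, `modified_ppv(400, 0, 400)` returns exactly 0.5, the ideal case described above.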
Exp 3: Selection of the number of deviations. This experiment utilizes the modified PPV to evaluate the impact of the number of deviations, i.e., the parameter λ in the anomaly bound µ + λσ. To evaluate a challenging case where a large portion of the clients are malicious, we set 4 out of 10 clients to be malicious in each FL training round. Given 100 FL iterations, the total number of malicious submissions N_total is 400. We set the number of deviations λ to 0.5, 1, 1.5, 2, 2.5, and 3. We select the Byzantine attack of random mode and the model replacement attack, and evaluate our approach on three tasks: i) CNN + FEMNIST, ii) ResNet-56 + Cifar100, and iii) RNN + Shakespeare. The results, shown in Figure 6, indicate that λ = 0.5 performs best, with the PPV being at least 0.37. In fact, when λ is 0.5, all malicious submissions are detected for the random Byzantine attack on all three tasks, with the PPV being exactly 1/2. In subsequent experiments, unless specified otherwise, we set λ to 0.5 by default.
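A sketch of cross-client detection under the anomaly bound µ + λσ is shown below. It assumes each client submission is scored by its Euclidean distance to the mean submission; the actual scoring in our algorithm may differ, and the code is illustrative only:

```python
import numpy as np

def cross_client_detect(local_models, lam=0.5):
    """Flag clients whose deviation score exceeds mu + lam * sigma."""
    stacked = np.stack(local_models)                   # (n_clients, n_params)
    center = stacked.mean(axis=0)
    scores = np.linalg.norm(stacked - center, axis=1)  # per-client deviation
    mu, sigma = scores.mean(), scores.std()
    return [i for i, s in enumerate(scores) if s > mu + lam * sigma]
```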
Exp 4: Comparisons between anomaly detection and various defenses against attacks. This experiment evaluates the effect of our approach compared with various defense mechanisms, including Foolsgold, m-Krum (m = 5), and RFA, in the context of ongoing attacks. We include a “benign” baseline scenario with no activated attack or defense, and select the random-Byzantine and model replacement backdoor attacks. The results for the random-Byzantine attack and the model replacement backdoor attack are shown in Figure 7 and Figure 8, respectively. These results demonstrate that our approach effectively mitigates the negative impact of the attacks and significantly outperforms the other defenses, with test accuracy much closer to the benign case.
Exp 5: Varying the number of malicious clients. This experiment evaluates the impact of a varying number of malicious clients on test accuracy. We set the number of malicious clients out of 10 clients in each FL training round to 2 and 4, and include a baseline case where all clients are benign. As shown in Figure 9, the test accuracy remains relatively consistent across different numbers of malicious clients, because in each FL training round our approach filters out the local models that appear malicious, effectively minimizing the impact of malicious client models on the aggregation.
Exp 6: Evaluations on different models. We evaluate defense mechanisms against the random mode of the Byzantine attack with different models and datasets, including: i) ResNet-20 + Cifar10, ii) ResNet-56 + Cifar100, and iii) RNN + Shakespeare. The results are shown in Figures 10, 11, and 12, respectively. They show that while the baseline defense mechanisms can mitigate the impact of attacks in most cases, some defenses fail on some tasks, e.g., m-Krum fails on the RNN task in Figure 12. This is because those methods either select a fixed number of local models or re-weight the local models in aggregation, which can eliminate local models that are important to the aggregation, leading to an unchanged test accuracy in later FL iterations. In contrast, our approach outperforms these baseline defense methods by filtering out only the local models that are detected as “outliers”, achieving test accuracy close to the benign scenarios.
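For context, the m-Krum baseline (Blanchard et al., 2017) scores each update by the sum of squared distances to its closest n − f − 2 neighbors and aggregates only the m lowest-scoring updates; this fixed selection is what can discard important local models. A simplified sketch, assuming f Byzantine clients and flattened NumPy parameter vectors:

```python
import numpy as np

def m_krum(local_models, m=5, f=2):
    """Aggregate the m updates with the lowest Krum scores."""
    stacked = np.stack(local_models)               # (n_clients, n_params)
    n = len(stacked)
    # Pairwise squared Euclidean distances between all updates.
    d = ((stacked[:, None, :] - stacked[None, :, :]) ** 2).sum(-1)
    k = n - f - 2                                  # neighbors per Krum score
    scores = []
    for i in range(n):
        others = np.delete(d[i], i)                # drop self-distance
        scores.append(np.sort(others)[:k].sum())   # sum of k closest
    selected = np.argsort(scores)[:m]              # m lowest-scoring updates
    return stacked[selected].mean(axis=0)
```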
Exp 7: Evaluations of ZKP circuit size, proving time, and verification time. We implement the ZKP system in Circom (Circom Contributors, 2022). The system includes modules for the cross-round check and the cross-client detection. We also implement a prover's module containing JavaScript code that generates the witness for the ZKP and performs fixed-point quantization. In our experiments, we include CNN, RNN, and ResNet-56 as our machine learning models. Specifically, we pull out only the parameters of the second-to-last layer of each model, i.e., the importance layer, to reduce complexity. For instance, the second-to-last layer of the CNN model contains only 7,936 trainable parameters, as opposed to 1,199,882 should we use the entire model. We implement both algorithms across 10 clients, and report our results in Table 2.
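To illustrate the prover-side fixed-point quantization, the sketch below encodes float parameters as field elements for witness generation. The scale factor is hypothetical, and the prime shown is the BN254 scalar field commonly used by Circom/snarkjs setups, not necessarily the exact values in our circuits:

```python
# BN254 scalar field prime used by many Circom/snarkjs configurations.
P = 21888242871839275222246405745257275088548364400416034343698204186575808495617
SCALE = 2 ** 16  # hypothetical fixed-point scale factor

def quantize(x: float) -> int:
    """Fixed-point encode x; negative values wrap around the prime field."""
    return round(x * SCALE) % P

def dequantize(q: int) -> float:
    """Invert the encoding, reading large residues as negative values."""
    if q > P // 2:
        q -= P
    return q / SCALE
```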