This paper is available on arxiv under CC BY 4.0 DEED license.
Authors:
(1) Zhipeng Wang, Department of Computing, Imperial College London;
(2) Nanqing Dong, Department of Computer Science, University of Oxford;
(3) Jiahao Sun, Data Science Institute, Imperial College London;
(4) William Knottenbelt, Department of Computing, Imperial College London.
Theoretical and Empirical Analysis
We commence our evaluation with an in-depth analysis of the total training time per epoch for each network backbone, focusing on the conventional FL approach without the integration of zkFL. This evaluation covers the three components that contribute to the overall training time: the local training time of each individual client, the time required for aggregating local model updates on the central aggregator, and the synchronization time. As shown in Figs. 3(b), 4(b), 5(b), 6(b), 7(b), and 8(b), our findings indicate that the local training time of each client is the primary contributor to the total training time. Moreover, as the number of clients increases, the local training time of each individual client decreases, because the training workload is distributed across more clients.
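To make this breakdown concrete, the following is a minimal sketch (not the paper's implementation) of how a simulated FL epoch can be instrumented; the `local_update`, `aggregate`, and `load_global` methods are hypothetical placeholders for the actual training, aggregation, and model-broadcast routines.

```python
import time

def timed_epoch(clients, aggregator):
    """Decompose one FL epoch into the three timing components discussed above.

    `clients` and `aggregator` are hypothetical objects; they stand in for
    whatever training and aggregation code the FL system actually runs.
    """
    # 1) Local training: each client trains on its own data shard.
    t0 = time.perf_counter()
    updates = [c.local_update() for c in clients]
    local_time = time.perf_counter() - t0

    # 2) Aggregation: the central aggregator combines the local updates.
    t0 = time.perf_counter()
    global_model = aggregator.aggregate(updates)
    agg_time = time.perf_counter() - t0

    # 3) Synchronization: broadcast the new global model back to the clients.
    t0 = time.perf_counter()
    for c in clients:
        c.load_global(global_model)
    sync_time = time.perf_counter() - t0

    return {"local": local_time, "aggregate": agg_time, "sync": sync_time}
```

In a real deployment the clients train in parallel, so the per-client local time is the maximum over clients rather than the sum measured by the sequential loop above; with a fixed dataset partitioned across more clients, each client holds less data, which is why the per-client local training time shrinks as the number of clients grows.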
We next evaluate the encryption time for the clients and the aggregation time for the central aggregator within the zkFL system. In addition to the computational costs of the FL training protocol itself, each client must compute commitments, i.e., an encryption computation for every parameter of its model update. Our results demonstrate that this additional cost depends on the underlying network backbone and increases with the number of parameters in the network. For instance, the encryption time for ResNet18 (i.e., 10–20 mins) is lower than that for ResNet34 (i.e., 25–45 mins), as the latter has more network parameters. Moreover, as illustrated in Figs. 3(d), 4(d), 5(d), 6(d), 7(d), and 8(d), the aggregation time of the central aggregator in zkFL is influenced by the number of network parameters and grows approximately linearly with the number of clients in the FL system.
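To illustrate where the per-parameter encryption cost comes from, here is a toy Pedersen-style commitment sketch. The modulus and generators are illustrative assumptions only; the actual zkFL construction commits over an elliptic-curve group compatible with Halo2 and first quantizes the floating-point parameters to integers.

```python
import secrets

# Toy Pedersen-style commitment over a multiplicative group mod a prime.
# These group parameters are placeholders for illustration, not the ones
# used in the paper.
P = 2**255 - 19          # toy prime modulus (assumption)
G = 5                    # toy generator g (assumption)
H = 7                    # toy second generator h; in practice its discrete
                         # log w.r.t. g must be unknown

def commit(value: int, randomness: int) -> int:
    """Com(v, r) = g^v * h^r mod p."""
    return pow(G, value, P) * pow(H, randomness, P) % P

def commit_update(update_fixed_point):
    """Commit to every (quantized) parameter of a local model update."""
    coms, rands = [], []
    for v in update_fixed_point:
        r = secrets.randbelow(P)
        coms.append(commit(v, r))
        rands.append(r)
    return coms, rands

def aggregate_commitments(per_client_coms):
    """Homomorphic aggregation: the product of per-client commitments is a
    commitment to the sum of their updates."""
    agg = []
    for per_param in zip(*per_client_coms):
        c = 1
        for com in per_param:
            c = c * com % P
        agg.append(c)
    return agg
```

The per-client cost is one commitment per parameter, which explains why the encryption time tracks the backbone's parameter count, and the homomorphic aggregation at the end grows with both the parameter count and the number of clients, consistent with the roughly linear trend observed above.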
As depicted in Fig. 9, the time required for Halo2 ZKP generation and verification varies with the chosen network and increases with the size of the network parameters. Among the six networks evaluated, ResNet50 demands the longest time for proof generation, taking approximately 54.72 ± 0.12 mins. Notably, the proof verification time is approximately half of the generation time. This favorable characteristic makes zkFL more practical, as the aggregator, equipped with abundant computing power, can efficiently generate the proof, while the resource-limited clients can verify it without significant computational overhead. This highlights the feasibility and applicability of zkFL in real-world scenarios, where the FL clients may have constrained resources compared to the central aggregator.
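Mean ± standard deviation figures of this kind can be collected with a generic timing harness of the following shape; `generate_proof` and `verify_proof` in the usage comment are hypothetical stand-ins for the actual Halo2 prover and verifier, which are not shown here.

```python
import statistics
import time

def benchmark(fn, runs=5):
    """Return (mean, std) wall-clock minutes over repeated runs of `fn`.

    `fn` is any zero-argument callable; here it would wrap the proof
    generation or proof verification routine being measured.
    """
    minutes = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        minutes.append((time.perf_counter() - t0) / 60.0)
    return statistics.mean(minutes), statistics.stdev(minutes)

# Usage (with hypothetical callables):
#   gen_mean, gen_std = benchmark(lambda: generate_proof(circuit, witness))
#   ver_mean, ver_std = benchmark(lambda: verify_proof(proof, public_inputs))
```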
Compared to traditional FL, zkFL also increases the communication costs for the clients and the aggregator, since the encrypted local model updates Enc(wi) (1 ≤ i ≤ n) and the ZKP proof π must additionally be transmitted. As shown in Table 1, for each network backbone, we compare the size of the encrypted data with the size of the model updates in plaintext. We show that the size of the encrypted data grows linearly with the number of parameters. Thus, the communication costs are dominated by the encrypted data.
For ResNet50, the network backbone with the largest number of model parameters in our experiment, the additional data transmitted to the centralized aggregator per client is approximately 2× the size of the plaintext updates. However, it is important to underscore that although the relative increase might appear significant, the absolute size of the updates is merely equivalent to a few minutes of compressed HD video. As a result, the data size remains well within the capabilities of FL clients using wireless or mobile connections.
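As a rough back-of-the-envelope check (not the paper's measurement code), the per-round upload can be estimated from the parameter count and the measured overhead ratio; the byte sizes below are assumptions for illustration, and Table 1 reports the actual measured values.

```python
def zkfl_upload_bytes(num_params, overhead_ratio, bytes_per_param=4, proof_bytes=0):
    """Estimate one client's per-round upload size with zkFL.

    `overhead_ratio` is the ratio of additional (encrypted) data to plaintext
    data, e.g. roughly 2 for ResNet50 per the discussion above. The fp32
    assumption of 4 bytes per parameter and the proof size are illustrative.
    """
    plaintext = num_params * bytes_per_param
    return plaintext * (1 + overhead_ratio) + proof_bytes

# ResNet50 has roughly 25.6M parameters, so the plaintext update is about
# 100 MB; with a roughly 2x additional-data overhead, the total upload is on
# the order of 300 MB under these assumptions, i.e., comparable to a few
# minutes of compressed HD video.
print(zkfl_upload_bytes(25_600_000, overhead_ratio=2) / 1e6)
```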
Beyond the additional computation and communication costs introduced by zkFL, we also investigate its potential impact on training performance, namely accuracy and convergence speed.
Theoretically, we demonstrate that, compared to traditional FL, zkFL only affects the data output by the clients and leaves the training process itself unchanged. Our experimental results provide strong evidence supporting this claim. As illustrated in Figs. 3(a), 4(a), 5(a), 6(a), 7(a), and 8(a), the final accuracy and convergence speed do not exhibit any significant difference between FL settings with and without zkFL.
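This can be seen directly from the client-side control flow. In the sketch below (assumed PyTorch-style code, not the authors' implementation), the commitment hook `commit_update` runs only after local optimization has finished, so the sequence of gradient steps, and hence accuracy and convergence, is identical with or without zkFL.

```python
def client_round(model, data_loader, optimizer, criterion, commit_update):
    """One zkFL client round: the training loop is identical to plain FL;
    the commitment step only post-processes the finished update.

    `commit_update` is a hypothetical hook (e.g., the Pedersen-style
    commitment sketched earlier); removing it recovers vanilla FL exactly.
    """
    # --- standard local training (unchanged by zkFL) ---
    model.train()
    for inputs, targets in data_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()

    # --- zkFL-specific step: commit to the trained update before sending ---
    update = {k: v.detach().clone() for k, v in model.state_dict().items()}
    commitment = commit_update(update)
    return update, commitment
```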
Furthermore, we observe that the convergence speed is primarily influenced by the number of clients involved in the process. This reaffirms the practical viability of zkFL as it maintains training performance while enhancing security and privacy in the federated learning framework.