Authors:
(1) Haleh Hayati, Department of Mechanical Engineering, Dynamics and Control Group, Eindhoven University of Technology, The Netherlands;
(2) Nathan van de Wouw, Department of Mechanical Engineering, Dynamics and Control Group, Eindhoven University of Technology, The Netherlands;
(3) Carlos Murguia, Department of Mechanical Engineering, Dynamics and Control Group, Eindhoven University of Technology, The Netherlands, and with the School of Electrical Engineering and Robotics, Queensland University of Technology, Brisbane, Australia.
General Guidelines for Implementation
Cloud computing enables users to process and store data remotely on high-performance computers and servers by sharing data over the Internet. However, transferring data to the cloud raises unavoidable privacy concerns. Here, we present a synthesis framework for designing coding mechanisms that allow sharing and processing data in a privacy-preserving manner without sacrificing data utility and algorithmic performance. We consider the setup in which the user aims to run an algorithm in the cloud using private data. The cloud then returns some data utility to the user (utility refers to the service that the algorithm provides, e.g., classification, prediction, AI models, etc.). To avoid privacy concerns, the proposed scheme provides tools to co-design: 1) coding mechanisms that distort the original data and guarantee a prescribed differential privacy level; 2) an equivalent-but-different algorithm (referred to here as the target algorithm) that runs on distorted data and produces distorted utility; and 3) a decoding function that extracts the true utility from the distorted one with negligible error. Then, instead of sharing the original data and algorithm with the cloud, only the distorted data and the target algorithm are disclosed, thereby avoiding privacy concerns. The proposed scheme is built on the synergy of differential privacy and system immersion tools from control theory. The key underlying idea is to design a higher-dimensional target algorithm that embeds all trajectories of the original algorithm and works on randomly encoded data to produce randomly encoded utility. We show that the proposed scheme can be designed to offer any level of differential privacy without degrading the algorithm’s utility. We present two use cases to illustrate the performance of the developed tools: privacy in optimization/learning algorithms and a nonlinear networked control system.
Scientific and technological advancements in today’s hyperconnected world have resulted in a massive amount of user data being collected and processed by hundreds of companies over public networks and servers. Companies use these data to provide personalized services and targeted advertising. However, this technology has come at the cost of an alarming loss of privacy in society. Depending on their resources, adversaries can infer sensitive (private) information about users from public data available on the Internet and/or unsecured networks/servers. A motivating example of this privacy loss is the potential use of data from smart electrical meters by criminals, advertising agencies, and governments for monitoring the presence and activities of occupants [1], [2]. Other examples are privacy loss caused by information sharing in distributed control systems and cloud computing [3]; the use of travel data for traffic estimation in intelligent transportation systems [4]; and data collection and sharing by the Internet-of-Things (IoT) [5]. These, and many more, privacy challenges have drawn the interest of researchers from different fields (e.g., information theory, computer science, and control engineering) to the broad research subject of privacy and security in data-sharing and cloud computing, see [6]–[12].

In recent years, various privacy-preserving schemes have been proposed to address privacy concerns in data-sharing and cloud computing. Most of them rely on perturbation techniques (injecting randomness to distort data) or cryptographic tools (encrypting data/algorithms before disclosure), see [13]. Perturbation-based techniques use information-theoretic privacy metrics [14]–[17] and/or differential privacy [7], [18]. In these techniques, random vectors drawn from known probability distributions are injected into sensitive data before sharing it with the cloud. Although these methods effectively reduce the amount of private data that can be inferred, the induced distortion is never removed, which inevitably degrades the utility when the distorted data is used to run algorithms in the cloud.

In standard cryptographic methods, data is encrypted before sharing and then decrypted in the cloud before processing, see [19]. This technique is suitable for making eavesdropping attacks over communication networks difficult. However, it is still vulnerable to insider attacks in the cloud (since the cloud holds the decryption key). Alternatively, Homomorphic Encryption (HE) methods do not require data to be decrypted before processing. They allow computations over plain data by performing appropriate computations on the encrypted data [20], [21]. However, standard HE methods (e.g., Paillier and ElGamal encryption [22]) work over finite rings of integers, which implies that the original algorithm must be reformulated to also work on rings of integers [23]. This reformulation, although effective when possible, is hard to accomplish for general algorithms. In the literature, it has only been performed successfully on small classes of algorithms (mainly algorithms with linear transitions and a fairly small number of variables). Even when the reformulation is possible, mapping algorithms working on the reals to operate on finite rings leads to performance degradation and large computational overhead [24].

In summary, although current solutions improve the privacy of data and algorithms, they often do so at the expense of data utility and algorithm performance.
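The utility loss of perturbation-only approaches can be made concrete with a minimal NumPy sketch (our illustration, not taken from the paper; the data values and the simple averaging "algorithm" are made up): because the injected noise is never removed, the utility computed in the cloud from the disclosed data differs from the true one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical private data held by the user (values made up for illustration).
x = rng.normal(loc=5.0, scale=1.0, size=1000)

# Perturbation-based privacy: inject Laplace noise before disclosing the data.
# The noise scale trades privacy for accuracy, and the distortion is never removed.
noise_scale = 2.0
x_disclosed = x + rng.laplace(loc=0.0, scale=noise_scale, size=x.shape)

# "Utility" computed in the cloud from the disclosed (noisy) data, here a simple mean.
true_utility = x.mean()
cloud_utility = x_disclosed.mean()
print(f"utility error caused by the un-removed distortion: {abs(true_utility - cloud_utility):.4f}")
```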
Balancing these tradeoffs is a key challenge when implementing privacy solutions for cloud computing. Hence, novel privacy-preserving schemes must be designed to provide strict privacy guarantees at a fair computational cost and without compromising application performance and utility.

The aim of this work is to devise synthesis tools for designing coding mechanisms that protect private information and allow the implementation of remote dynamic algorithms, i.e., algorithms that perform iterations and have memory. We propose a coding scheme built on the synergy of random coding and system immersion tools [25] from control theory. The key idea is to treat the original algorithm as a dynamical system that we seek to immerse into a higher-dimensional system (the so-called target algorithm), i.e., trajectories of the target algorithm must embed all trajectories of the original one (up to a random (left-)invertible transformation). The target algorithm must be designed such that: 1) trajectories of the original algorithm are immersed/embedded in its trajectories, and 2) it operates on randomly encoded higher-dimensional data to produce randomly encoded utility. We formulate the coding mechanism at the user side as a random change of coordinates that maps original private data to a higher-dimensional space. This coding enforces that the target algorithm produces an encoded higher-dimensional version of the utility of the original algorithm. The encoded utility is decoded at the user side using a left inverse of the encoding map. This is essentially a homomorphic encryption scheme that operates over the reals instead of over finite rings of integers (as in the standard case). Working over the reals provides much more freedom to redesign algorithms to work on encoded/encrypted data, which allows our scheme to tackle a larger class of (nonlinear and high-dimensional) dynamic algorithms. Another difference compared to HE schemes is the privacy guarantee: while standard HE schemes provide hard security guarantees, the scheme we propose gives arbitrarily strong probabilistic guarantees (in terms of differential privacy). We show that our scheme provides the same utility as the original algorithm (i.e., as when no coding is employed to protect against data inference), (practically) reveals no information about private data, can be applied to large-scale algorithms (hundreds of thousands of algorithm variables, as is common in deep learning and large-scale simulations), is computationally efficient, and offers any desired level of differential privacy without degrading the algorithm utility (because the induced distortion is removed at the user side).

Summarizing, the main contribution of the paper is a prescriptive synthesis framework (i.e., one that uses the original undistorted algorithm as the basis) that allows the joint design of coding schemes to randomize data before disclosure and of new (target) algorithms that work on encoded data, does not lead to performance degradation, and provides any desired level of differential privacy.

The concept of an immersion-based coding scheme was initially introduced in our previous works [26], [27], [28]. In those studies, we established the foundational concepts and demonstrated their application specifically to "privacy-preserving federated learning" and "privacy-preserving remote anomaly detection".
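To illustrate the immersion idea outlined above, the following Python/NumPy sketch (our own toy example; the names Pi_x, A_t, etc. are ours, and the coding maps are plain random linear maps rather than the random affine maps with differential-privacy guarantees constructed in the paper) immerses a small linear dynamic algorithm into a higher-dimensional target algorithm. The cloud only receives the target matrices and the encoded data, yet the utility decoded with the left inverse matches the original algorithm exactly.

```python
import numpy as np

rng = np.random.default_rng(1)

# --- Original (private) dynamic algorithm: x_{k+1} = A x_k + B u_k, utility y_k = C x_k ---
n, m, p = 2, 1, 1                      # state, data, and utility dimensions (toy sizes)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[1.0], [0.5]])
C = np.array([[1.0, 1.0]])

# --- User-side coding: random tall (left-invertible) maps to a higher dimension ---
n2, m2, p2 = 4, 3, 3                   # immersed dimensions (larger than n, m, p)
Pi_x = rng.normal(size=(n2, n))        # state immersion map
Pi_u = rng.normal(size=(m2, m))        # data encoding map
Pi_y = rng.normal(size=(p2, p))        # utility encoding map
Pi_x_L = np.linalg.pinv(Pi_x)          # left inverses: Pi_x_L @ Pi_x = I, etc.
Pi_u_L = np.linalg.pinv(Pi_u)
Pi_y_L = np.linalg.pinv(Pi_y)

# --- Target algorithm (disclosed to the cloud): runs on encoded data only ---
A_t = Pi_x @ A @ Pi_x_L
B_t = Pi_x @ B @ Pi_u_L
C_t = Pi_y @ C @ Pi_x_L

# --- Run both algorithms on the same private data stream ---
x = np.zeros(n)                        # original state, kept conceptually at the user
z = Pi_x @ x                           # encoded (immersed) state held by the cloud
for k in range(50):
    u = rng.normal(size=m)             # private data sample
    u_enc = Pi_u @ u                   # only the encoded data leaves the user
    y = C @ x                          # true utility (never computed in the cloud)
    y_dec = Pi_y_L @ (C_t @ z)         # utility decoded from the cloud's encoded output
    assert np.allclose(y, y_dec)
    x = A @ x + B @ u
    z = A_t @ z + B_t @ u_enc

print("decoded utility matches the original algorithm on all iterations")
```

By construction, the immersion condition holds here: with z_0 = Pi_x x_0, the target state satisfies z_k = Pi_x x_k for all k, so decoding with the left inverse recovers the original utility without error. The randomness that yields the differential-privacy guarantees is added on top of this structure in the paper's full construction.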
This paper builds on our earlier research by significantly extending and generalizing the immersion-based coding scheme to support a broad class of dynamic algorithms. These algorithms are modeled as sets of parameter-varying difference/differential equations that process user data to produce utility. Furthermore, we introduce a novel aspect by integrating arbitrarily strong probabilistic guarantees, in terms of differential privacy, into the scheme. In particular, we prove that the proposed scheme can provide any desired level of differential privacy without reducing the accuracy and performance of the original algorithm, which is a key contribution distinguishing our current work from our preliminary studies. Furthermore, we demonstrate the performance of the proposed tools for two use cases: privacy in optimization/learning algorithms and nonlinear networked control systems.

The remainder of this paper is organized as follows. In Section II, we formulate the problem of designing the immersion-based coding for privacy of dynamical algorithms. In Section III, we construct a prescriptive solution to the proposed problem using random affine maps. In Section IV, we prove that the proposed solution can provide any desired level of differential privacy. In Section V, we provide a synthesis procedure summarizing the operating methodology of the proposed mechanism. In Section VI, the performance of the immersion-based coding is shown for two use cases: privacy in optimization/learning algorithms and nonlinear networked control systems. Conclusions are drawn in Section VII.

Notation: The symbol R represents the real numbers and R+ denotes the set of positive real numbers. The symbol N represents the set of natural numbers and N0 denotes the set of natural numbers including zero. For x ∈ R^n, the Euclidean norm is denoted by ∥x∥, with ∥x∥^2 = x⊤x, where ⊤ denotes transposition, and the l1 norm by ∥x∥_1 = Σ_{i=1}^N |x_i| for x = (x_1, . . . , x_N)⊤. The n × n identity matrix is denoted by I_n, or simply I if n is clear from the context. Similarly, n × m matrices composed of only ones and only zeros are denoted by 1_{n×m} and 0_{n×m}, respectively, or simply 1 and 0 when their dimensions are clear. The notation x ∼ N[μ_x, Σ_x] means that x ∈ R^n is a normally distributed random vector with mean E[x] = μ_x ∈ R^n and covariance matrix E[(x − μ_x)(x − μ_x)⊤] = Σ_x ∈ R^{n×n}, where E[a] denotes the expected value of the random vector a. The notation y ∼ Laplace(μ_y, Σ_y) stands for a Laplace random vector y ∈ R^m with mean E[y] = μ_y and scale parameter Σ_y ∈ R^{m×m} satisfying (1/2)E[(y − μ_y)(y − μ_y)⊤] = Σ_y. The left inverse of a matrix X is denoted by X^L, where X^L X = I.
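As a small numerical companion to the notation (our illustration only; the diagonal Laplace case and the pseudoinverse-based left inverse are just one convenient choice), the snippet below samples a Laplace random vector with independent entries and verifies the left-inverse property X^L X = I for a tall, full-column-rank matrix.

```python
import numpy as np

rng = np.random.default_rng(2)

# Laplace random vector with independent entries: entry i has mean mu_y[i] and scale b,
# so its variance is 2*b**2, i.e., (1/2)*variance = b**2, matching Sigma_y = b**2 * I.
mu_y = np.array([1.0, -2.0, 0.5])
b = 0.7
y = rng.laplace(loc=mu_y, scale=b)

# Left inverse of a tall, full-column-rank matrix X: X_L @ X = I.
X = rng.normal(size=(5, 3))            # full column rank with probability one
X_L = np.linalg.pinv(X)                # one valid left inverse: (X^T X)^{-1} X^T
assert np.allclose(X_L @ X, np.eye(3))

print("Laplace sample:", y)
print("left-inverse check passed")
```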
This paper is available on arxiv under CC BY 4.0 DEED license.