We are excited to announce the initial release of the MC² Project, a collection of open-source tools for computing and collaborating on confidential data. Developed at UC Berkeley’s RISELab, MC² (Multi-Party Collaboration and Coopetition) enables rich analytics and machine learning on encrypted data, ensuring that data remains concealed even when it’s being processed. The data in use remains hidden from the server running the job, allowing confidential workloads to be offloaded to untrusted third parties or cloud providers. This not only protects confidential data from intrusions but also enables secure collaboration — multiple data owners can jointly run analytics or ML on their collective data, without explicitly revealing their individual data to anyone else: not even a trusted third party.
As personal data becomes more pervasive, privacy concerns continue to grow. In response, data protection laws around the world are becoming stricter, and organizations face increasingly high noncompliance risks. At the same time, these organizations are realizing the enormous benefits of sharing their data with each other: banks can collaborate to detect financial crime, health institutions can collaborate on medical studies, and so on.
Driven by these developments, Gartner predicts that, by 2025, “50% of large organizations will adopt privacy-enhancing computation for processing data in untrusted environments and multiparty data analytics use cases.”
The goal of the MC² Project is to realize this vision and resolve the tension between expanding cloud adoption, the need for data sharing, and growing concern over data privacy.
Use Cases
MC² has already seen industry adoption and interest in finance and telecommunications: Ant Financial and Scotiabank for anti-money laundering, fraud detection, or credit risk modeling; Ericsson for predicting hardware faults and performance problems across different mobile network operators.
More generally, industries whose data is locked down due to privacy concerns can benefit from MC². Our platform keeps confidential data, such as Social Security numbers (SSNs) or protected health information (PHI), completely hidden during computation by using secure enclaves such as Intel SGX.
What are secure enclaves?
Enclaves provide isolated execution: Secure enclaves are a recent technology that enables the creation of a trusted execution environment (TEE) within an otherwise untrusted machine. Each enclave has access to a restricted portion of the memory; any data or software placed within the enclave is encrypted and isolated from the rest of the system. No other process on the same processor — not even privileged software such as the OS or the hypervisor — can access the encrypted enclave memory. This creates a layer of protection against any intrusion from the operating system itself; when used properly, anyone with root access to a machine running the workload can learn little to no information about what is happening inside the enclave.
Enclaves support remote attestation: Another key feature of secure enclaves is remote attestation, which lets users cryptographically verify that an enclave is running trusted, unmodified code. The MC² Project provides a remote attestation platform that lets users attest any non-local compute service from a trusted local client running on their own machine.
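To make the idea of remote attestation concrete, here is a minimal, simplified sketch in Python. It is not the MC² client or the Open Enclave API: the function names, the `HARDWARE_KEY` constant, and the use of an HMAC as the "hardware signature" are all assumptions made for illustration. Real enclaves sign their reports with an asymmetric key rooted in the hardware vendor, and the verification chain is considerably more involved.

```python
import hashlib
import hmac
import os
from typing import Tuple

# Hypothetical stand-in for the hardware root of trust. Real enclaves sign
# reports with an asymmetric, vendor-rooted key rather than a shared secret.
HARDWARE_KEY = b"demo-hardware-key"


def measure(code: bytes) -> bytes:
    """Return the 'measurement' of enclave code: here, simply its SHA-256 hash."""
    return hashlib.sha256(code).digest()


def enclave_report(code: bytes, nonce: bytes) -> Tuple[bytes, bytes]:
    """Simulate an enclave producing a signed report over (measurement, nonce)."""
    measurement = measure(code)
    signature = hmac.new(HARDWARE_KEY, measurement + nonce, hashlib.sha256).digest()
    return measurement, signature


def client_verify(expected_code: bytes, nonce: bytes,
                  measurement: bytes, signature: bytes) -> bool:
    """Client-side check: the report is authentic AND matches the expected code."""
    expected_sig = hmac.new(HARDWARE_KEY, measurement + nonce, hashlib.sha256).digest()
    authentic = hmac.compare_digest(signature, expected_sig)
    matches = hmac.compare_digest(measurement, measure(expected_code))
    return authentic and matches


trusted_code = b"def run(query): ..."
nonce = os.urandom(16)  # fresh per attestation, so old reports cannot be replayed

good = client_verify(trusted_code, nonce, *enclave_report(trusted_code, nonce))
tampered = client_verify(
    trusted_code, nonce, *enclave_report(trusted_code + b"# backdoor", nonce)
)
print(good, tampered)  # True False
```

The client accepts the enclave only when the signed measurement matches the hash of the code it expects to be running, so any modification to the enclave code causes attestation to fail.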
Enclaves and side-channels: Unfortunately, loading existing software into enclaves could expose the data to certain side-channel attacks, where an attacker can learn additional information about the encrypted data by observing auxiliary information such as data access patterns during the software’s execution. Preventing such leakage is left to the software developer; MC² tackles this problem by fortifying the enclave code and ensuring it is resilient to side-channel leakage via memory access patterns.
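As a rough illustration of how access patterns can leak information, consider the toy Python sketch below. It is not taken from MC²; the function name and data are hypothetical. It runs an ordinary binary search and records which indices are read. Even if the values themselves were encrypted, an observer who sees only the sequence of accessed positions could narrow down where the secret target sits.

```python
from typing import List, Tuple


def traced_binary_search(sorted_values: List[int], target: int) -> Tuple[int, List[int]]:
    """Binary search that also records every index it reads (the access pattern)."""
    trace = []
    lo, hi = 0, len(sorted_values) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        trace.append(mid)  # an observer of memory accesses would see this index
        if sorted_values[mid] == target:
            return mid, trace
        if sorted_values[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, trace


values = [10, 20, 30, 40, 50, 60, 70, 80]
_, trace_a = traced_binary_search(values, 20)
_, trace_b = traced_binary_search(values, 70)
print(trace_a)  # [3, 1]
print(trace_b)  # [3, 5, 6]
```

The two traces differ because the probe sequence depends on the secret target, which is exactly the kind of auxiliary information a side-channel attacker exploits.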
Secure enclaves vs. other approaches
Secure enclaves are not the only privacy-enhancing approach for computing on confidential data. Here, we compare them to other popular alternatives:
In particular, MC² provides a platform that can seamlessly run popular analytics and machine learning frameworks (Apache Spark, XGBoost, etc.) within enclaves securely and efficiently, abstracting away the complexities of writing enclave code from the end-user.
One approach to using enclaves is to simply load the entire application (e.g., Apache Spark) into the enclave. However, doing so adversely affects both the security and efficiency of the enclave application. For instance, if the program is memory-intensive, the performance will be greatly impacted by excessive encryption/decryption and paging. Instead:
MC² partitions the enclave code for security and efficiency: MC² partitions the application so that only the components that need to compute directly on the sensitive data are loaded into the enclave. Other components, such as network communication and task scheduling, are executed outside the enclave. This also benefits security by reducing the trusted computing base, i.e., the amount of code that runs within the enclave and therefore needs to be vetted beforehand.
MC² fortifies enclave execution: MC² fortifies the enclave components using cryptographic techniques to provide stronger security guarantees. This is done in two ways. First, MC² builds in measures to verify the integrity of jobs that have distributed execution. Second, since enclaves are known to be vulnerable to side-channel leakage, MC² makes use of data-oblivious techniques in enclave code to make sure that no side-channel information is leaked via memory access patterns. Data obliviousness ensures that the memory access patterns do not reveal any information about the sensitive data being accessed.
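The following Python sketch illustrates the general idea behind a data-oblivious read. It is a simplified illustration under stated assumptions, not MC²'s implementation: the function names are hypothetical, and Python itself offers no constant-time guarantees, so production enclave code would typically rely on lower-level constant-time primitives. The point is only that the sequence of memory locations touched is the same regardless of the secret index.

```python
from typing import List, Tuple


def direct_read(values: List[int], secret_index: int) -> Tuple[int, List[int]]:
    """Ordinary lookup: touches only the secret index, so the trace reveals it."""
    return values[secret_index], [secret_index]


def oblivious_read(values: List[int], secret_index: int) -> Tuple[int, List[int]]:
    """Touch every element and keep the wanted one via arithmetic selection."""
    result = 0
    trace = []
    for i, value in enumerate(values):
        trace.append(i)  # every slot is accessed, in the same order, every time
        is_match = int(i == secret_index)  # 1 for the wanted slot, 0 otherwise
        result = is_match * value + (1 - is_match) * result
    return result, trace


values = [7, 13, 21, 34]
v1, t1 = oblivious_read(values, 1)
v2, t2 = oblivious_read(values, 3)
print(v1, v2)    # 13 34
print(t1 == t2)  # True: the access pattern is independent of the secret index
```

The trade-off is cost: this naive approach turns a constant-time lookup into a linear scan, which is why practical oblivious algorithms need careful design to stay efficient.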
The MC² Client: The entry point to all compute jobs supported by MC² is the MC² Client. This tool runs in a trusted environment, typically the user’s local machine. Through a command line or Python interface, the client software is responsible for handling remote attestation and submitting jobs to the untrusted compute cluster. The client also contains additional features to generate keys needed for the compute service and to start/stop a cluster of machines on Microsoft Azure. (Visit the documentation for concrete details on how all of this can be achieved, or the quickstart for a hands-on demonstration of the workflow.)
The MC² Compute Services: MC² offers several compute services: these include Spark SQL, distributed XGBoost, and secure aggregation for federated learning. All are intended to run in a primary untrusted environment, such as a cluster of machines hosted on a public cloud, that has support for trusted execution environments (hardware enclaves). Data is encrypted in transit using a client key and only ever decrypted inside hardware enclaves, providing the previously mentioned security guarantees for data in use. For all compute services, MC² leverages the Open Enclave SDK, a project intended to provide a consistent API for a variety of different enclave architectures.
Research Prototypes
MC² also includes the following exploratory research prototypes (not integrated with the MC² Client) enabling secure computation with novel cryptographic techniques. These works were published at USENIX Security, a top security conference.
MC² is a platform for running secure analytics on data that stays encrypted even when in use. By doing so, the project also enables secure collaboration among multiple organizations, where individual data owners can use our platform to jointly analyze their collective data without revealing it to one another.
The MC² Project is actively maintained by Opaque Systems. To learn more about how Opaque can help you take advantage of confidential computing, visit our website at opaque.co.
We would love your contributions! Visit our GitHub page to see all the projects under the MC² umbrella.
Also published at https://towardsdatascience.com/secure-collaborative-analytics-and-ml-using-mc%C2%B2-4be376cfaba0