Artificial Intelligence is taking the world by storm, and Machine Learning is everywhere. But is it actually useful without the enormous amount of data required to train these systems before they can make inferences on new data? Data will remain the trump card of most machine learning algorithms until there is a significant improvement in techniques like one-shot learning.
We humans are considered the most valuable resource available right now, not because we can spend a lot of money, but because we generate a lot of content that can be used to make a lot of money. Yes, the data we produce is one of the most valuable resources out there. That is how social networking providers make so much money: by selling our data, our time and our attention.
Yes, data is the new currency.
With the advent of mobile devices, the amount of data being generated has exploded. There are concerns about the vast amount of personal data sent to servers and used to make money without the actual producer of the data benefiting. The emergence of Machine Learning has made things even more interesting. Most machine learning algorithms available today train models in a highly centralised environment, which means that users' data needs to be sent to a server before it can be used for learning. In addition to raising concerns over the privacy of data, this also creates potential bottlenecks in network traffic (remember, uplink speeds are usually slow).
These bandwidth limitations have also led to the emergence of technologies like Edge Computing, where data processing happens close to the source of the data. This significantly decreases the volume of data that must be moved to a server, the consequent traffic, and the distance the data must travel, thereby reducing transmission costs, shrinking latency, and improving quality of service.
Introducing OpenMined, an open-source community focused on building technology to facilitate the decentralised ownership of data and intelligence. It does so by providing a peer-to-peer network that allows any new company or person to train their AI models on user data without the actual owner of the data losing control over it. OpenMined aims to provide data sets very similar to those owned by big companies and to compensate the actual owners of the data. This may help reduce the monopoly of large companies by lowering the barrier to entry for newer players who want access to large data sets, and it also helps ensure that the financial gains are shared more evenly.
It's not just the idea of decentralised ownership of data and learning that keeps us interested here; the enabling technical stack of OpenMined is even more exciting. It's a combination of Deep Learning, Federated Learning, Blockchain and Homomorphic Encryption, which together facilitate learning from data without the owner of the data losing control over it.
Assume that we have some data that an organisation could use to train its model. (A model is a representation of a system which is derived from a large data set and can be used for inference on a new data set.) In a normal scenario, we would need to upload the data to the cloud, where the model is then trained on it. But that exposes our sensitive data to the organisation that owns the central server, and the organisation can keep track of our data.
What if we could bring the model down to our device, update it locally and send it back to the cloud? This can be a tedious task, especially uploading the model after training, because upload speeds on most networks are significantly slower than download speeds.
Federated Learning is the technology that enables us to train deep neural networks in a distributed manner without sending the actual data to a centralised server.
In federated learning, we train a deep network by bringing the model down to where the data lives. The device improves the model by learning from the data on our phone and then summarises the changes as a small, focused update. Only this update to the model is sent to the cloud, using encrypted communication, where it is immediately averaged with other users' updates to improve the shared model. All the training data remains on your device, and no individual updates are stored in the cloud.
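The round trip described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the model is a two-parameter line y = w*x + b, the function names `local_update` and `federated_average` are hypothetical, and updates are combined with a plain mean rather than whatever weighting a production system would use. It also leaves out the encrypted transport.

```python
from typing import List, Tuple

def local_update(weights: List[float], data: List[Tuple[float, float]],
                 lr: float = 0.1) -> List[float]:
    """Train on the device and return only the change to the weights."""
    w, b = weights
    for x, y in data:
        error = (w * x + b) - y   # compare the prediction with the target
        w -= lr * error * x       # learn: one gradient step per sample
        b -= lr * error
    # The raw data never leaves this function; only the small update does.
    return [w - weights[0], b - weights[1]]

def federated_average(weights: List[float],
                      updates: List[List[float]]) -> List[float]:
    """Server side: average the updates and apply them to the shared model."""
    count = len(updates)
    return [w + sum(u[i] for u in updates) / count
            for i, w in enumerate(weights)]

# Hypothetical private data held on two devices (all points lie on y = 2x).
device_data = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (1.5, 3.0)],
]

shared = [0.0, 0.0]
for _ in range(50):
    updates = [local_update(shared, data) for data in device_data]
    shared = federated_average(shared, updates)
```

The server only ever sees the entries of `updates`, never the contents of `device_data`.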
Federated Learning is being used by Google in Gboard, the Google Keyboard on Android, for auto-correction and next-word prediction. We can find out more about federated learning in this Google Research blog post. Also, the research paper Communication-Efficient Learning of Deep Networks from Decentralized Data explains the practical side of federated learning.
So, federated learning can significantly reduce privacy and security risks by limiting the attack surface to only the device, rather than the device and the cloud. It can be used to train deep neural networks in a distributed manner; in other words, we can perform Deep Learning.
In Deep Learning, one form of data is converted into another using a predict, compare and learn cycle.
For this purpose it uses a collection of functions with associated weights, which are initialised randomly and organised in a hierarchical structure. In fact, a Deep Neural Network is an Artificial Neural Network with multiple hidden layers. An in-depth explanation of how artificial neural networks work can be found here. In the case of federated learning, the gradients are calculated at the source of the data and only this gradient/diff is sent to the server, where it is used to update the weights of the actual model and hence improve the system.
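To make the predict, compare and learn cycle concrete, here is a minimal sketch with a single weight instead of a deep network. The function names, the sample data and the learning rate are all assumptions chosen for illustration, and the weight starts at a fixed value rather than a random one so that the run is reproducible. The gradient is computed where the data lives, and the server only applies it.

```python
def predict(weight: float, x: float) -> float:
    """Predict: turn an input into an output using the current weight."""
    return weight * x

def compare(prediction: float, target: float) -> float:
    """Compare: squared error between the prediction and the truth."""
    return (prediction - target) ** 2

def gradient_on_device(weight: float, samples) -> float:
    """Runs at the data source: mean gradient of the squared error."""
    total = 0.0
    for x, target in samples:
        total += 2 * (predict(weight, x) - target) * x
    return total / len(samples)

def apply_gradient(weight: float, gradient: float, lr: float = 0.1) -> float:
    """Learn: runs on the server, which sees the gradient but not the data."""
    return weight - lr * gradient

samples = [(1.0, 3.0), (2.0, 6.0)]  # hypothetical private data, target = 3 * x
weight = 0.5
for _ in range(20):
    weight = apply_gradient(weight, gradient_on_device(weight, samples))
```

After a few rounds the weight approaches 3, the value that maps each input to its target, even though the samples themselves were never sent anywhere.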
At this point we have two fundamental problems with this approach. The first is that the data source gets access to the model, which they can download and use (or misuse) separately. This can prove costly for the owner of the model, who loses control of a trained model that may have cost a considerable amount of resources. Second, trends in the user data (or even the actual data) can be reverse engineered by the owner of the model.
Let's see how OpenMined solves these problems.
To make sure that the data source does not have access to the model, OpenMined depends on a special kind of encryption. (Encryption is basically the process of converting usable data into an encoded format using a key; a specific key is required to decode the data back into usable form.) But in our particular case, if we encrypt the model and send it to the data source, it becomes difficult to train the encrypted model using the data available at the data source.
Protecting a model (or data) while it is being computed on is a challenging and expensive problem to solve. (Here is an example: Galois Awarded $1 million IARPA Contract To Improve Security Of Data Computation.) A cryptographic system that allows third parties to perform operations on a ciphertext without knowledge of the secret key is called homomorphic encryption.
A homomorphism is a map between two algebraic structures of the same type (that is, of the same name) that preserves the operations of the structures. For example, if the operation on two algebraic structures A and B is addition in both, then the homomorphism condition for a map f from A to B is f(a + b) = f(a) + f(b) for all a and b in A.
Homomorphic encryption is a form of encryption which allows specific types of computations to be carried out on ciphertexts and generates an encrypted result which, when decrypted, matches the result of the same operations performed on the plaintexts.
In the image below we can see that 3 and 5 are converted into Cypher A and Cypher B via a homomorphic encryption function.
Now we can do mathematical operations on these cyphers, which produce a new cypher that, when decrypted, gives the same result as if we had done the calculations on 3 and 5.
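The 3 and 5 example can be tried out with a toy version of the Paillier cryptosystem, a well-known additively homomorphic scheme. The article does not say which scheme the image uses, so Paillier is an assumption here, as are the function names and the tiny primes, which offer no real security. The sketch needs Python 3.8 or later for the modular inverse via `pow`.

```python
import math
import random

def generate_keys(p: int = 293, q: int = 433):
    """Toy key generation; real keys use primes hundreds of digits long."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse of lam modulo n
    return n, (lam, mu)   # public key n, private key (lam, mu)

def encrypt(n: int, message: int) -> int:
    n_squared = n * n
    while True:
        r = random.randrange(1, n)  # fresh randomness for each ciphertext
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, message, n_squared) * pow(r, n, n_squared)) % n_squared

def decrypt(n: int, private_key, ciphertext: int) -> int:
    lam, mu = private_key
    x = pow(ciphertext, lam, n * n)
    return (((x - 1) // n) * mu) % n

def add_encrypted(n: int, cypher_a: int, cypher_b: int) -> int:
    """Multiplying Paillier ciphertexts adds the hidden plaintexts."""
    return (cypher_a * cypher_b) % (n * n)

public_n, private_key = generate_keys()
cypher_a = encrypt(public_n, 3)
cypher_b = encrypt(public_n, 5)
cypher_sum = add_encrypted(public_n, cypher_a, cypher_b)
total = decrypt(public_n, private_key, cypher_sum)
```

Whoever computes `cypher_sum` needs only the public value `public_n`; the sum of 3 and 5 becomes readable only to the holder of `private_key`.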
There are different levels of homomorphic encryption. Partially homomorphic encryption has been in common use for quite some time (since 1978). RSA was the first public-key encryption scheme with a homomorphic property, although practical implementations add padding bits for semantic security, which removes that property. In unpadded (textbook) RSA, the product of two ciphertexts decrypts to the product of the original messages (the same does not hold for addition). So unpadded RSA is a partially homomorphic encryption system.
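The multiplicative property of unpadded RSA is easy to check by hand. The sketch below uses deliberately tiny textbook primes (61 and 53) and the public exponent 17, all chosen for illustration only; it is not how RSA should be used in practice, and it needs Python 3.8 or later for the modular inverse via `pow`.

```python
# Toy textbook RSA parameters: far too small to be secure.
p, q = 61, 53
n = p * q                  # public modulus
phi = (p - 1) * (q - 1)
e = 17                     # public exponent
d = pow(e, -1, phi)        # private exponent (modular inverse of e)

def encrypt(message: int) -> int:
    return pow(message, e, n)

def decrypt(ciphertext: int) -> int:
    return pow(ciphertext, d, n)

cipher_a = encrypt(3)
cipher_b = encrypt(5)

# Multiplying the ciphertexts multiplies the hidden messages.
cipher_product = (cipher_a * cipher_b) % n
product = decrypt(cipher_product)

# Adding the ciphertexts does not add the hidden messages.
cipher_sum = (cipher_a + cipher_b) % n
not_the_sum = decrypt(cipher_sum)
```

Here `product` is 3 times 5, while `not_the_sum` is an unrelated number rather than 8, which is why unpadded RSA counts as only partially homomorphic.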
A recent and well-known application of partially homomorphic encryption is Helios, which encrypts votes and stores them in the cloud.
Somewhat homomorphic encryption schemes carry a certain amount of noise, which grows with the amount of computation performed on the ciphertexts, so only a limited amount of computation can be done before the result can no longer be decrypted correctly. In 2009, Craig Gentry constructed the first Fully Homomorphic Encryption scheme.
At a high-level, the essence of fully homomorphic encryption is simple: given ciphertexts that encrypt π1, . . . , πt , fully homomorphic encryption should allow anyone (not just the key-holder) to output a ciphertext that encrypts f(π1, . . . , πt) for any desired function f, as long as that function can be efficiently computed. No information about π1, . . . , πt or f(π1, . . . , πt), or any intermediate plaintext values, should leak; the inputs, output and intermediate values are always encrypted.
More info: A Fully Homomorphic Encryption Scheme, by Craig Gentry.
This means that Fully Homomorphic Encryption extends somewhat homomorphic encryption to allow a much wider range of functions to be computed on encrypted data, so that the data is never in the clear while the computation is going on. It produces an encrypted result that a user with the right key can decrypt. One of the key design goals of Fully Homomorphic Encryption systems is to keep the performance impact low, since they need to support this wider range of operations on encrypted data.
Hmm… Is it getting boring? Let's watch this video to understand the concept.
Coming back to our actual use case: companies can homomorphically encrypt the model and send it to the data source, where a learning algorithm updates the model and sends the gradients back to the company. This solves the first problem we described: the data source can no longer use the model, because the homomorphic encryption key is not available to them.
(OpenMined has a module called Syft, a Homomorphically Encrypted Deep Learning Library that contains Neural Networks which can be trained in an encrypted state. OpenMined has implementations of several homomorphic encryption schemes, such as the BV Homomorphic Encryption Scheme, the Aono homomorphic encryption scheme by Yoshinori Aono, and Yet Another Somewhat Homomorphic Encryption (YASHE).)
Now the second problem needs to be tackled. It is a problem of trust. We, as the data source, do not trust the company that holds the encryption key and the updated model. They can learn details about the data by comparing the previous model with the gradients. We need a platform or protocol that enables us to trust the company. This is where Blockchain Smart Contracts come in: they mitigate the problem by moving trust from a centralised party to the system itself.
A blockchain is a continuously growing collection of blocks, linked and secured using cryptography, that lives in a distributed environment.
You can read more about blockchain here.
Smart contract is a term used to describe computer program code that is capable of facilitating, executing, and enforcing the negotiation or performance of an agreement using blockchain technology. The entire process is automated and can act as a complement to, or substitute for, legal contracts, with the terms of the smart contract recorded in a computer language as a set of instructions.
In other words, a blockchain is an immutable list of records kept across different nodes, and a smart contract is a piece of code that is executed when a particular condition occurs within the blockchain. Once a record has been written, no one can change the order or the data within the blockchain, and changes are visible to everyone.
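The "linked and secured using cryptography" part can be sketched as a list in which every block stores the hash of the block before it. This is a single-machine toy with hypothetical function names and record contents; it leaves out everything that makes a real blockchain distributed (networking, consensus, smart contract execution) and only shows why tampering with an old record is detectable.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first block

def block_hash(index: int, data: str, previous_hash: str) -> str:
    payload = json.dumps(
        {"index": index, "data": data, "previous_hash": previous_hash},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def add_block(chain: list, data: str) -> None:
    """Append a record that is cryptographically tied to its predecessor."""
    index = len(chain)
    previous_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    chain.append({
        "index": index,
        "data": data,
        "previous_hash": previous_hash,
        "hash": block_hash(index, data, previous_hash),
    })

def is_valid(chain: list) -> bool:
    """Recompute every hash and check every link back to the first block."""
    for i, block in enumerate(chain):
        expected_previous = chain[i - 1]["hash"] if i > 0 else GENESIS_HASH
        if block["previous_hash"] != expected_previous:
            return False
        recomputed = block_hash(block["index"], block["data"], block["previous_hash"])
        if block["hash"] != recomputed:
            return False
    return True

ledger: list = []
add_block(ledger, "campaign opened")
add_block(ledger, "gradient submitted")
add_block(ledger, "bounty paid")
```

Editing the data of any earlier block changes its hash, so `is_valid` fails unless every later block is rewritten as well, which is what the many independent copies of a real chain are meant to prevent.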
In the case of OpenMined, the use of blockchain not only helps make sure that the organisation uses user data correctly, but also guarantees that the data source is compensated for providing valid data that improves the model. A component called Sonar UI allows any organisation to start a new request for training a model. This request is called a campaign request. A campaign request contains the model initialisation data along with some compensation, usually in the form of a cryptocurrency. The compensation provided by the organisation is called a bounty.
The campaign request is directed to Sonar, a federated learning server running on the blockchain. Sonar initialises a model with the given specification using smart contracts.
Wait a second.
The model needs to be homomorphically encrypted, right? How will Sonar do that? It is a server run on a blockchain, so it cannot keep track of the encryption key.
The task of generating the public and private keys is handled by an external third-party PGP server called Capsule, which ensures that the Sonar neural network stays encrypted. Sonar sends the model spec to Capsule along with the organisation's public key. Capsule initialises the model by generating the homomorphic keys and keeps track of the decryption key and the organisation's public key. The generated model is stored on the InterPlanetary File System.
The InterPlanetary File System (IPFS) is a peer-to-peer protocol that allows data to be shared in a decentralised manner using content addressing. It is a distributed file system that seeks to connect all computing devices with the same system of files.
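Content addressing means that a piece of data is looked up by a hash of its contents rather than by its location. The sketch below shows the idea with an in-memory dictionary and SHA-256. The class name `ContentStore` and its methods are hypothetical; real IPFS addresses are content identifiers with their own encoding, and the data is spread over many peers rather than held in one process.

```python
import hashlib

class ContentStore:
    """A single-process stand-in for a content-addressed store."""

    def __init__(self) -> None:
        self._blobs = {}

    def put(self, content: bytes) -> str:
        # The address is derived from the bytes themselves.
        address = hashlib.sha256(content).hexdigest()
        self._blobs[address] = content
        return address

    def get(self, address: str) -> bytes:
        content = self._blobs[address]
        # Anyone can verify that what they received matches what they asked for.
        if hashlib.sha256(content).hexdigest() != address:
            raise ValueError("content does not match its address")
        return content

store = ContentStore()
model_address = store.put(b"encrypted model bytes")
same_address = store.put(b"encrypted model bytes")
other_address = store.put(b"encrypted gradient bytes")
```

Identical content always gets the same address, so a short address is enough to point at a model or a gradient of any size.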
The component that represents a data source which can be used to improve the model is called a Data “Mine” in OpenMined, and the entity that owns it is called a Miner. Each miner hosts an individual data repository, called a mine, containing the actual data that can be used to improve the model. Miners also have the option of increasing the amount of data available by using Adapters to pull in data from other sources, like Twitter.
Each mine constantly checks with Sonar for new neural networks to contribute to. When the mine finds an opportunity, it downloads the model from IPFS, trains it locally, submits the gradients to IPFS and receives an address in return. The miner then sends the IPFS address of the gradients to Sonar. Sonar, with the help of Capsule, sorts the encrypted losses corresponding to the different gradients it has received and decides how each gradient should be weighted. Using this information about the actual contribution of each miner, the rewards for miners are paid out by Sonar's smart contract. So the owner of the actual data benefits from every useful contribution.
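The article does not spell out the exact rule Sonar uses to turn contributions into rewards, so the following is only one plausible sketch: each miner's share of the bounty is proportional to how much their gradient lowered the loss, and gradients that made the model worse earn nothing. The function name, the miner names and the use of plaintext losses (the real system works on encrypted losses) are all assumptions.

```python
from typing import Dict

def split_bounty(bounty: float, baseline_loss: float,
                 loss_after: Dict[str, float]) -> Dict[str, float]:
    """Share a bounty in proportion to each miner's loss improvement."""
    improvements = {
        miner: max(0.0, baseline_loss - loss)  # harmful gradients count as zero
        for miner, loss in loss_after.items()
    }
    total = sum(improvements.values())
    if total == 0:
        # Nobody improved the model, so nothing is paid out.
        return {miner: 0.0 for miner in loss_after}
    return {miner: bounty * gain / total
            for miner, gain in improvements.items()}

# Hypothetical campaign: loss was 1.0 before any gradient was applied.
rewards = split_bounty(
    bounty=100.0,
    baseline_loss=1.0,
    loss_after={"miner_a": 0.5, "miner_b": 0.75, "miner_c": 1.25},
)
```

In this example `miner_a` improved the loss twice as much as `miner_b` and so receives twice the reward, while `miner_c`, whose gradient increased the loss, receives nothing.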
Once training is complete, the encrypted model is sent back to Capsule, where it is homomorphically decrypted using the private key, then encrypted with the public key of the organisation that initiated the campaign and sent to that organisation. So how can we trust Capsule with all the private keys? For this purpose, whoever hosts Capsule is required to put a sum of money into another smart contract, from which deductions can be made in case of a breach of any of the private keys.
You can find repositories and more details of OpenMined here.
We can see that the above tech stack can address problems like data privacy and the concentration of data power while allowing owners to use their data as a natural source of income. Of course, there are other solutions in the same space, like Ocean Protocol, which aims to Unlock Data for AI.
Ocean Protocol is a decentralized data exchange protocol that lets people share and monetize data while guaranteeing control, auditability, transparency and compliance to all actors involved.
At this moment in time, we cannot predict whether or not OpenMined is going to succeed, but it is tackling a real-world problem, and the way OpenMined has chosen to solve it, by combining the latest available tech in an innovative and collaborative way, is truly motivational. Open-source community efforts like this to tackle real-world problems can help us learn more about new technologies, along with the wide range of possibilities that open up when they are used in the right combination.
Thanks for your time.
Edited By : Ashok Jeevan