There is a trend in neural networks that has existed since the beginning of the deep learning revolution which is succinctly captured in one word: scale.
The natural dimensions of scaling these models have been compute and data. Compute scaling has been a massive undertaking at every level of the stack, from hardware accelerators to high-level software tools, for more than a decade. Large datasets have become a key advantage in the new digital economy, and that advantage has been the driver behind models that can ingest and make sense of such datasets.
In most cases the best models are at the cutting edge of our ability to scale on these resources. This becomes a practical issue as predictions sometimes need to be computed at high volume with low latency, but generally the models that learn the best aren't the fastest. The field of study that relates to tackling this issue is called model compression. There are many techniques, some feeling a little hacky like setting small weights to zero and inducing sparse data representations. Others seem more principled like knowledge distillation, where a trained heavy-weight model is used to "teach" a much smaller network, or another technique called conditional computation where a controller decides which parts of a network to activate for a given input sample.
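To make two of these techniques concrete, here is a minimal sketch in plain Python. It assumes weights and logits are flat lists of floats; the function names `magnitude_prune` and `distillation_loss` and the temperature default are my own illustrative choices, not taken from any particular library or paper implementation.

```python
import math

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    k = int(len(weights) * sparsity)  # how many weights to set to zero
    if k == 0:
        return list(weights)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature gives softer targets."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's softened output against the teacher's."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))

pruned = magnitude_prune([0.5, -0.01, 0.2, 0.03], sparsity=0.5)
loss = distillation_loss([1.0, 0.0, -1.0], [2.0, 0.0, -1.0])
```

In practice the distillation term is usually mixed with the ordinary loss on the true labels, and pruning is followed by some fine-tuning; both are left out here to keep the sketch short.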
It's an open question how much compressed models lose in generalizability versus the originally trained network. If you read the papers you'll of course see that in the experiments the compressed models perform just as well as, or surprisingly even better than, the original models on their evaluation sets. But one thing I've learned in deploying models in the real world is that the evaluation set is never the end of the story. The environment can always change under you, sometimes slowly and sometimes rather quickly.
The road to truly intelligent and adaptive systems might rely on what I've been inspired by the old gods of creativity and genius to call model decompression. The principle is straightforward: when you detect your model drifting in its performance, add back some capacity in the compressed model and learn on the fly.
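A rough sketch of that loop, in plain Python, might look like the following. Everything here is an assumption made for illustration: `DriftMonitor`, `decompress`, the window size, the tolerance, and the growth factor are hypothetical, and the sketch presumes you eventually receive labels (or some proxy for correctness) for deployed predictions.

```python
from collections import deque

class DriftMonitor:
    """Flags drift when rolling accuracy falls well below a baseline."""

    def __init__(self, baseline_accuracy, window=100, tolerance=0.05):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        self.recent = deque(maxlen=window)  # most recent correctness flags

    def update(self, correct):
        """Record one labeled prediction; return True if drift is detected."""
        self.recent.append(1.0 if correct else 0.0)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough evidence yet
        rolling = sum(self.recent) / len(self.recent)
        return rolling < self.baseline - self.tolerance

def decompress(hidden_units, growth=1.5, max_units=1024):
    """Hypothetical 'add capacity' step: pick a wider hidden layer size."""
    return min(max_units, max(hidden_units + 1, int(hidden_units * growth)))

monitor = DriftMonitor(baseline_accuracy=0.9, window=10)
width = 64
for correct in [True] * 10 + [False] * 3:
    if monitor.update(correct):
        width = decompress(width)  # then train the new capacity on recent data
        break
```

The interesting part, how to initialize and train the added capacity on-device, is exactly what this sketch leaves out.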
This sounds like a stupid idea from a technical standpoint because the way we deal with model drift today is much more effective. We just retrain or continually train the big network and repeat the compression process when we want to ship out a new model. Understanding why that is suboptimal requires a bit of imagination with respect to edge computing.
The reason model compression gets any sort of real attention is that we want to deploy these amazing technologies without being bottlenecked by networks. Network communication is typically the slowest and worst resource to be limited by, so if you need a heavy cloud server to do all of your ML and send the results back to a mobile phone or remote device, you're going to have a bad time with reliability and latency. Model compression enables us to ship these models and run them on-device with low latency and low power consumption.
I'll walk back a bit on dismissing our current methods, because when the environment changes radically you can't expect a hyperspecialized model to simply adapt itself completely. If that were possible we would never need a heavyweight model to begin with. There is some recent work on using intermediate models, called "teaching assistants", to build a hierarchy of distillation that helps the student network learn better. Taking this to its logical end for solving model drift at the edge, it seems that the intermediate networks can sit on a spectrum of resource requirements between the teacher and the student.
Of course for smaller drift this makes a lot more sense, and perhaps one could even get away with no new resources at all (besides compute power). It is all relative, and something that humans in the form of ML engineers shouldn't be tuning themselves when they're alerted that the models are suddenly regressing. It makes a lot more sense that we introduce new components to this system that help automate this process.
There have been a lot of criticisms thrown at our distinction between a training and inference phase. The thought is that training should always be happening like in brains, not just one-and-done. This is somewhat related to the study of continual learners.
Unfortunately this doesn't easily fit with the hardware story. Training is assumed to be very compute intensive, and the situation is rather hopeless without hardware acceleration. But this isn't always the reality: training from scratch is known to be compute intensive, but we may not need dozens or hundreds of passes over our training data to adapt to new situations. Perhaps it's worth revisiting the idea of training at the edge with current hardware.
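As a toy illustration of how cheap adaptation can be compared to training from scratch, consider a one-variable linear model whose environment shifts slightly. This is only a sketch under strong assumptions: the model, the data, the learning rate, and the names `mse` and `fine_tune` are all made up for the example, and real networks will not behave this cleanly.

```python
def mse(w, b, data):
    """Mean squared error of the model y = w * x + b on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def fine_tune(w, b, data, steps=20, lr=0.05):
    """A handful of full-batch gradient steps starting from existing weights."""
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# "Pretrained" weights fit the old environment y = 2x + 1.
w_old, b_old = 2.0, 1.0
# The environment drifts to y = 2x + 1.5; we only see five new samples.
new_data = [(x, 2.0 * x + 1.5) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]

before = mse(w_old, b_old, new_data)
w_new, b_new = fine_tune(w_old, b_old, new_data)
after = mse(w_new, b_new, new_data)  # much smaller than `before`
```

Twenty cheap steps on five samples recover most of the lost accuracy here because the starting point is already close, which is the same reason fine-tuning is so much cheaper than training from scratch.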
These are hard problems to solve today for the edge because of the hard constraints on resources. They are much less encumbering in the cloud, so it's a wonder we don't have systems that auto-tune themselves and constantly ingest data from a variety of streams. Well, we have them, but at the organizations I know that do this, the ML teams number in the dozens to hundreds of people. One big reason is just that building scalable data processing systems is still non-trivial. But the bigger reason is that our models are still hard to train, for a variety of reasons, and it takes a lot of people trying different ideas to make gains in performance.
This makes sense in the age of ML where features are generated by hand. But deep learning has changed that, or at least it promised to. And we have shown in the past decades, especially our most recent one, that this does work. But most real-world scenarios haven't had the benefit of some very bright scientists and years of research thinking about inductive biases and data preparation strategies. Scaling those scientists' efforts has resulted in things like neural architecture search, hyperparameter optimization, and transfer learning. These have also been proven to work internally at Google, so much so that they are now available as a cloud service. With newer forms of unsupervised pretraining methods in different data domains, the canonical successes of deep learning may finally be realized for many real-world problems across domains and industries.
These are the foundational elements of our next-generation system.
As of yet these elements are not integrated into one system. There are several reasons why we haven't gotten there, which I will outline.
Some of the parts seem incompatible. For instance, after finding an optimal architecture via neural architecture search, how does one then employ transfer learning from pretrained models? And is there software out there where you can store models and query them when you want to build a new one, retrieving the best one to use as a pretrained starting point? Do similar things exist for getting raw data into the optimal form for current networks to effectively learn from? Surely those algorithms exist with hyperparameter equivalents that should be tuned in conjunction with models? I expect in time we will solve some of these things, and it's something that must be driven by an industry research lab with the expertise in both systems engineering and deep learning as a science.
There is also a large cost to running such a system, ignoring the necessary upfront capital to build it. But cost is less interesting if we expand our view on the time dimension. Everyone knows that compute costs trend aggressively downwards. That has been true for half a century now, it continues despite the end of Moore's Law, and there is now an entire global economy pushing that advancement along.
Cost is also less interesting because transfer learning promises to remove the bulk of necessary compute power. One only needs to train a good base model once and fine-tune it as many times as necessary with cost being orders of magnitude smaller. This evokes the application of another advanced technique in the area: federated learning.
It is often the case that companies have proprietary data they don't want to share. And yet they may also be hesitant to exploit that data due to the severe cost of applying deep learning to it. But perhaps they would be more willing if we had our next-generation adaptive deep learning system coupled with the ability to do federated learning, that is, train locally (as in, on-premises inside a company's network) and only report back gradients to the outside world. Concretely, this could enable multiple health records companies or law firms to jointly create the best model for their problems, leaving its true application downstream for fine-tuning on their specific problems, where they don't have to share anything about the fine-tuned model (and the other parties likely don't care much for it anyway, as it's of only tangential value to them). Federated learning could drive the cost of even the initial pretraining down further.
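The "report back only gradients" idea can be sketched in a few lines of plain Python. This is a toy under stated assumptions: the model is a one-variable linear regression, `local_gradient` and `federated_round` are hypothetical names, and the coordinator simply averages gradients weighted by each client's sample count. Real deployments add things like secure aggregation and privacy protections, which are omitted here.

```python
def local_gradient(w, b, data):
    """Runs on-premises: gradient of mean squared error on private data."""
    n = len(data)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
    return grad_w, grad_b, n  # only these numbers leave the company

def federated_round(w, b, client_datasets, lr=0.1):
    """Coordinator: average the reported gradients and take one step."""
    reports = [local_gradient(w, b, data) for data in client_datasets]
    total = sum(n for _, _, n in reports)
    grad_w = sum(g * n for g, _, n in reports) / total
    grad_b = sum(g * n for _, g, n in reports) / total
    return w - lr * grad_w, b - lr * grad_b

# Two companies hold disjoint private samples of the same relation y = 3x - 1.
company_a = [(x, 3.0 * x - 1.0) for x in (0.0, 1.0, 2.0)]
company_b = [(x, 3.0 * x - 1.0) for x in (-2.0, -1.0)]

w, b = 0.0, 0.0
for _ in range(200):
    w, b = federated_round(w, b, [company_a, company_b])
```

Weighting by sample count makes each round equivalent to a gradient step on the pooled data, even though neither company ever sees the other's records.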
Perhaps we need not wait for better compute hardware after all.