Why Deep Learning Is Still Too Difficult

Written by alexdeterminedai | Published 2020/09/07

By Naren Krishna, Determined AI
While deep learning has great potential, building practical applications powered by it remains too expensive and too difficult for many organizations. In this article, we describe some of the challenges to broader adoption of deep learning, explain how those challenges differ from those of traditional machine learning systems, and outline a path forward to making deep learning more widely accessible.

Why is Deep Learning (DL) Popular?

The core concepts underlying even the latest deep learning models can be traced back to the 1950s, when the first artificial neural networks were born. The goal then, as now, was simple: train a computer system to identify and extrapolate patterns in data. Interestingly, in the last few years the number of ML-related papers on arXiv has grown exponentially (nearly 100 new papers per day!) [1].
So, why now, even though major breakthroughs happened decades prior? After all, backpropagation was published in 1986 and the first convolutional neural network was created shortly after, yet there was an “AI winter” from the late ‘80s through the mid-’90s.
We assert that there are three main factors that have contributed to DL’s recent popularity:
  1. The sheer volume and availability of data. As the world has become increasingly connected, data collection has become less burdensome with the advent of new technologies like smart devices and social networking platforms. Alongside more capable large-scale storage systems, public datasets have given researchers abundant examples to train on and the opportunity to advance the state of the art.
  2. Specialized compute resources like GPUs, TPUs, and custom ASICs. Cloud providers like Amazon Web Services and Google Cloud Platform give developers convenient access to specialized managed hardware, enabling more complex models that would otherwise have been infeasible to train.
  3. Step-changes in state-of-the-art accuracy and efficiency for computer vision, language processing, and unsupervised tasks. State-of-the-art results like those achieved by AlexNet in 2012 have driven an increase in deep learning interest and funding, and continue to inspire rapid progress.

What are the challenges to broadening the DL community?

Although deep learning has become increasingly popular in part due to data volume and specialized compute resources, each of these breakthroughs brings new challenges of its own. Much of the progress in DL has happened at a very small number of elite organizations and research institutions with considerable resources. Across production applications, however, a lack of infrastructure and tooling for architecting and debugging deep learning pipelines, data issues, and high training costs continue to inhibit broader adoption.
With the breakneck pace of published papers, it is hard to keep up with state-of-the-art workflows; as newer deep learning models become more accurate and efficient, older ones quickly become outmoded and need to be replaced. For many organizations without robust infrastructure, incorporating the latest and greatest deep learning model into an existing codebase is time-consuming, often requiring manual dependency installation just to get the model running, re-architecting the data pipeline to plug into the new system, and significant debugging effort. Even reproducing the results of a previous training job is non-trivial with most deep learning tools.
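To give a sense of why reproducibility is hard, here is a minimal sketch (assuming PyTorch) of the kind of boilerplate teams typically maintain by hand just to make a single run repeatable; the helper name is hypothetical, and even this does not cover dependency versions, data versions, or hardware differences.

```python
# Minimal sketch, assuming PyTorch: pinning randomness is only one piece of
# reproducibility; library versions, data snapshots, and hardware still have
# to be tracked separately.
import random

import numpy as np
import torch


def seed_everything(seed: int = 42) -> None:
    """Hypothetical helper: seed every source of randomness we know about."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Deterministic cuDNN kernels trade some speed for repeatability.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything(42)
```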
Furthermore, diagnosing and solving problems in deep learning pipelines is a significant challenge. In contrast with traditional software engineering disciplines like web development and database systems, developer experience and tools around debugging deep learning workflows are severely lacking.
Even gaining visibility into how DL models are interacting with data is an open research area. Moreover, the model code released by researchers is not optimized for production use cases – inefficient data loading, lack of fault tolerance, and non-optimal data shuffling/batching are problematic for mission-critical systems. 
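As a concrete example of the data-loading issues mentioned above, here is a minimal sketch assuming PyTorch; research code often ships with a single-process loader and default settings, while production pipelines usually need parallel workers, pinned memory, and explicit shuffling and batching choices.

```python
# Minimal sketch, assuming PyTorch. The dataset here is a random stand-in;
# a real pipeline would stream from disk or object storage.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(
    torch.randn(10_000, 3, 32, 32),      # fake images
    torch.randint(0, 10, (10_000,)),     # fake labels
)

loader = DataLoader(
    dataset,
    batch_size=256,
    shuffle=True,        # reshuffle every epoch
    num_workers=4,       # parallel data-loading processes
    pin_memory=True,     # faster host-to-GPU copies
    drop_last=True,      # uniform batch sizes for stable throughput
)

for images, labels in loader:
    pass  # training step would go here
```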
Because the tools and processes for productionizing models are still immature, building reliable deep learning pipelines demands infrastructure purpose-built for deep learning workloads.
We mentioned earlier that data availability is one of the core drivers of deep learning's success. Even where data collection is not a problem, data organization often is. When dealing with terabytes (and, for certain applications, even petabytes) of raw data, filtering and labeling that data becomes a significant challenge.
This problem is amplified because of systemic biases often encoded in datasets, an issue that has spurred an entire subfield of research. Because deep learning models are only as good as the data they learn from, both the severity and prevalence of data issues in many deep learning applications are causes for concern.
In more traditional non-deep learning workflows, systemic biases still exist, but at a scale more easily managed and countered through manual model introspection and dataset balancing. However, deep learning use cases require orders of magnitude more data for models to generalize, making this problem more difficult to solve. 
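One common, if partial, mitigation is to resample the training data so that rare classes are seen more often; the sketch below assumes PyTorch and uses made-up labels purely for illustration.

```python
# Minimal sketch, assuming PyTorch and made-up labels: oversample rare classes
# so each batch sees a more balanced label distribution.
import torch
from torch.utils.data import WeightedRandomSampler

labels = torch.randint(0, 3, (1_000,))          # hypothetical integer class labels
class_counts = torch.bincount(labels).float()   # how often each class appears
sample_weights = 1.0 / class_counts[labels]     # rarer classes get larger weights

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(labels),
    replacement=True,
)
# Pass `sampler=sampler` (instead of `shuffle=True`) to a DataLoader.
```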
Another prohibitive roadblock to broadening use of deep learning is cost. The log-scale graph below from a talk given by Ilya Sutskever, research director at OpenAI, shows a significant increase in compute power necessary to train state-of-the-art deep learning models [2].
It is worth noting that the state of the art in 2012 (AlexNet) took roughly five to six days to train on two GTX 580 3GB GPUs, which in today's terms would be on the order of $100 in AWS dollars. In less than a decade since AlexNet, the compute required to train a state-of-the-art model has grown by nearly 300,000x, and some recent models like GPT-3 have an estimated training cost of over $12 million!
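The ~$100 figure is easy to sanity-check with back-of-the-envelope arithmetic; the per-GPU-hour rate below is an illustrative assumption, not a quoted AWS price.

```python
# Back-of-the-envelope sketch; the hourly rate is an assumption for
# illustration, not an actual AWS quote.
training_days = 6
num_gpus = 2
assumed_rate_per_gpu_hour = 0.35  # USD per GPU-hour, assumed

gpu_hours = training_days * 24 * num_gpus               # ~288 GPU-hours
estimated_cost = gpu_hours * assumed_rate_per_gpu_hour
print(f"Estimated AlexNet-style training cost: ${estimated_cost:.0f}")
# Roughly $100, in the same ballpark as the figure above.
```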
As these models become more computationally complex, the ability to run state-of-the-art workflows often requires expensive specialized hardware, typically GPUs on-premise or in the cloud. Although on-premise deep learning can be significantly more cost-effective than using the cloud, neither option is cheap.
This is vastly different from traditional machine learning workflows using tools like regressions, decision trees, and clustering algorithms, which can be run efficiently on relatively low-cost CPUs. The graph below shows the growth of NVIDIA's data center revenue over the past two fiscal years, reaching $1.14 billion in the first quarter of fiscal 2021; a significant portion of that growth is attributed to the increasing complexity of deep learning models and workflows [3].
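For contrast, here is a minimal scikit-learn sketch on synthetic data; a classical model like logistic regression trains in seconds on a commodity CPU, with no specialized hardware required.

```python
# Minimal sketch, assuming scikit-learn and synthetic data: classical models
# train on a CPU in seconds.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```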
On the subject of cost, another key blocker to broad adoption of deep learning is the cost of talent. If you don’t have a robust ML platform, your ML engineers aren’t going to be productive; instead, they will be forced to spend much of their time doing DevOps and other low-value work. On the other hand, building a custom ML platform from scratch is expensive, time-consuming, and requires scarce expertise in its own right. As such, either scenario can quickly become cost-prohibitive.

How do we effectively make DL more widely accessible?

Reducing Cost. Let us begin by addressing the rising costs of developing, training, and running deep learning models. Recent state-of-the-art results from faster, lighter-weight models like MobileNetV2 and EfficientNet, alongside innovations in model compression techniques, have driven research into cost-effective deep learning.
In 2018, the MLPerf v0.5 benchmark results included training ResNet-50 on ImageNet in 64.1 minutes using 8 Volta V100 GPUs; in the 2020 MLPerf v0.7 results, Fujitsu completed a similar task in 68.82 minutes using only 2 V100 GPUs, a significant reduction in cloud spend achieved largely through algorithmic advances in distributed training.
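Distributed data parallelism is one of the basic techniques behind these gains: each GPU processes a slice of every batch and gradients are synchronized across workers. The sketch below assumes PyTorch's DistributedDataParallel, uses a toy model in place of ResNet-50, and would be launched with PyTorch's distributed launcher (e.g., torchrun --nproc_per_node=<num_gpus> train.py, where train.py is a hypothetical script name).

```python
# Minimal sketch, assuming PyTorch with NVIDIA GPUs: one process per GPU,
# gradients averaged across processes by DistributedDataParallel.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main() -> None:
    dist.init_process_group(backend="nccl")        # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])     # set by the launcher
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(512, 10).cuda(local_rank)    # toy stand-in model
    model = DDP(model, device_ids=[local_rank])           # sync gradients across ranks
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(100):                                   # stand-in training loop
        inputs = torch.randn(256, 512, device=local_rank)
        targets = torch.randint(0, 10, (256,), device=local_rank)
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```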
Novel techniques in deep learning parallelism, like weight-sharing applied to Neural Architecture Search, have outperformed previous state-of-the-art methods in resource utilization by orders of magnitude. Furthermore, the introduction of preemptible instances on GCP has discounted cloud costs for model development by nearly 70%. These algorithmic and software improvements show promise in making DL more accessible to developers by reducing the price tag associated with DL development.
Open-Source Tooling. From a workflow perspective, the most obvious way to help developers get started with deep learning is to support open-source standardized frameworks like PyTorch and TensorFlow. Not only do these frameworks make results easier to reproduce, but they also lower the barrier to entry for new practitioners through easy-to-use high-level APIs that can still achieve good performance. Increased awareness of and contributions to open-source platforms will also allow engineers to engage with the open-source community and build a low-level understanding of complex workflows and systems.
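To illustrate what such a high-level API looks like in practice, here is a minimal sketch assuming PyTorch; the tiny model and random batch are placeholders, not a recommended architecture.

```python
# Minimal sketch, assuming PyTorch: a few lines of high-level API define a
# small image classifier and run one training step on a random batch.
import torch
from torch import nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

images = torch.randn(32, 3, 32, 32)      # stand-in batch of images
labels = torch.randint(0, 10, (32,))     # stand-in labels
loss = nn.functional.cross_entropy(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"One training step complete, loss = {loss.item():.3f}")
```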
Better Infrastructure. While standardized frameworks and pre-trained models found in tutorials can help individual developers get their feet wet, they do not address the obstacles to making DL work in an enterprise setting. As mentioned earlier, robust ML infrastructure is critical to taking advantage of state-of-the-art improvements in deep learning. Determined AI's open-source platform lets model developers focus on building models while leveraging Determined's DL experimentation infrastructure. The platform provides push-button distributed training and state-of-the-art parallel hyperparameter search capabilities, alongside automated experiment and model tracking for easy reproducibility.
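To make the hyperparameter search idea concrete, here is a deliberately simplified random-search sketch; it is not Determined's API, and a real platform would run these trials in parallel across GPUs, apply adaptive early stopping, and track every result automatically.

```python
# Deliberately simplified sketch of random hyperparameter search; a platform
# would parallelize these trials and stop unpromising ones early.
import random


def train_and_validate(learning_rate: float, batch_size: int) -> float:
    """Hypothetical stand-in: train a model and return validation accuracy."""
    return random.random()  # replace with a real training run


search_space = {
    "learning_rate": lambda: 10 ** random.uniform(-5, -1),
    "batch_size": lambda: random.choice([32, 64, 128, 256]),
}

best = None
for _ in range(20):                                    # 20 sequential trials
    config = {name: sample() for name, sample in search_space.items()}
    accuracy = train_and_validate(**config)
    if best is None or accuracy > best[0]:
        best = (accuracy, config)

print(f"Best validation accuracy {best[0]:.3f} with config {best[1]}")
```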
Want to learn more about how Determined can help make your deep learning organization more productive? Try it out and let us know how it goes!
  • [1] J. Dean, D. Patterson, and C. Young, "A New Golden Age in Computer Architecture: Empowering the Machine-Learning Revolution," IEEE Micro, vol. 38, no. 2, pp. 21-29, Mar./Apr. 2018, doi: 10.1109/MM.2018.112130030.
  • [2] D. Amodei and D. Hernandez, "AI and Compute," OpenAI Blog, 2018.
  • [3] J. Brumley, "NVIDIA's Data Center Business Is Now Almost as Big as Gaming," Nasdaq, 24 May 2020, www.nasdaq.com/articles/nvidias-data-center-business-is-now-almost-as-big-as-gaming-2020-05-24.
