Speed up state-of-the-art ViT models in Hugging Face 🤗 up to 2300% (25x faster) with Databricks, Nvidia, and Spark NLP 🚀

I am one of the contributors to the Spark NLP open-source project, and just recently this library started supporting end-to-end Vision Transformers (ViT) models. I use Spark NLP and other ML/DL open-source libraries for work daily, and I have decided to deploy a ViT pipeline for a state-of-the-art image classification task and provide in-depth comparisons between Hugging Face and Spark NLP.

The purpose of this article is to demonstrate how to scale out Vision Transformer (ViT) models from Hugging Face and deploy them in production-ready environments for accelerated and high-performance inference. By the end, we will scale a ViT model from Hugging Face by 25x (2300%) by using Databricks, Nvidia, and Spark NLP.

In this article I will:

- Give a short introduction to Vision Transformer (ViT)
- Benchmark Hugging Face inside Dell server on CPUs & GPUs
- Benchmark Spark NLP inside Dell server on CPUs & GPUs
- Benchmark Hugging Face inside Databricks Single Node on CPUs & GPUs
- Benchmark Spark NLP inside Databricks Single Node on CPUs & GPUs
- Benchmark Spark NLP inside Databricks scaled to 10x Nodes with CPUs & GPUs
- Sum up everything!

In the spirit of full transparency, all the notebooks with their logs, screenshots, and even the Excel sheet with numbers are provided on GitHub.

Introduction to Vision Transformer (ViT) models

Back in 2017, a group of researchers at Google AI published a paper that introduced a transformer model architecture that changed all Natural Language Processing (NLP) standards. The paper describes a novel mechanism called self-attention as a new and more efficient model for language applications. For instance, two of the most popular families of transformer-based models are GPT and BERT.
[Figure: A bit of Transformer history. Source: https://huggingface.co/course/chapter1/4]

There is a great chapter about "How Transformers Work" which I highly recommend reading if you are interested.

Although these new Transformer-based models seem to be revolutionizing NLP tasks, their usage in Computer Vision (CV) remained pretty much limited. The field of Computer Vision has been dominated by convolutional neural networks (CNNs), and there are popular architectures based on CNNs (like ResNet). This had been the case until another team of researchers, this time at Google Brain, introduced the "Vision Transformer" (ViT) in a paper titled "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (arXiv 2010.11929, first posted in October 2020, with the current revision dated June 2021).

This paper represents a breakthrough when it comes to image recognition by using the same self-attention mechanism used in transformer-based models such as BERT and GPT, as we just discussed. In Transformer-based language models like BERT, the input is a sentence (for instance a list of words). In ViT models, however, we first split an image into a grid of sub-image patches, then embed each patch with a linear projection so that each embedded patch becomes a token. The result is a sequence of patch embeddings which we pass to the model, similar to BERT.

[Figure: An overview of the ViT model structure as introduced in Google Research's original 2021 paper]

Vision Transformer focuses on higher accuracy but with less compute time. Looking at the benchmarks published in the paper, we can see that the training time compared with Noisy Student (a state-of-the-art model published by Google in June 2020) has decreased by 80% even though the accuracy is more or less the same. For more information regarding the ViT performance today you should visit its page on Papers With Code.

[Figure: Comparison with state of the art on popular image classification benchmarks.]
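To make the patch-splitting step described above more concrete, here is a minimal pure-Python sketch. It only illustrates the bookkeeping (how many patches come out of an image and how large each one is). The function name `patchify` and the toy image are made up for this illustration, and the learned linear projection that the real model applies to each patch is not shown.

```python
def patchify(image, patch_size):
    """Split an H x W grid (a list of rows) into flattened, non-overlapping
    patch_size x patch_size tiles, scanned left-to-right, top-to-bottom."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            tile = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(tile)
    return patches


# Toy 224 x 224 "image" whose pixels simply record their own coordinates.
image = [[(row, col) for col in range(224)] for row in range(224)]
patches = patchify(image, 16)

print(len(patches))     # number of patches, i.e. image tokens
print(len(patches[0]))  # pixels per patch
```

With a 224x224 input and 16x16 patches this gives 14 x 14 = 196 patches of 256 pixels each; with three color channels that is 768 values per patch before the linear projection.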
(Source: https://arxiv.org/pdf/2010.11929.pdf)

It is also important to mention that, just as you do in NLP, a model built on the ViT architecture can be pre-trained once and then fine-tuned for your own task (that's pretty cool actually!). If we compare ViT models to CNNs, the paper reports excellent accuracy compared with state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

You can use ViT models for a variety of downstream tasks in Computer Vision like image classification, object detection, and image segmentation. This can also be domain-specific: in Healthcare you can pre-train/fine-tune your ViT models for femur fractures, emphysema, breast cancer, COVID-19, and Alzheimer's disease.¹

I will leave references at the end of this article just in case you want to dig deeper into how ViT models work.

[1]: Deep Dive: Vision Transformers On Hugging Face Optimum Graphcore https://huggingface.co/blog/vision-transformers

Some ViT models in action

Vision Transformer (ViT) model (vit-base-patch16-224) pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224, and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 224x224: https://huggingface.co/google/vit-base-patch16-224

Fine-tuned ViT models used for food classification: https://huggingface.co/nateraw/food and https://huggingface.co/julien-c/hotdog-not-hotdog

There are however limitations & restrictions to any DL/ML model when it comes to prediction. There is no model with 100% accuracy, so keep that in mind when you are using them for something important like Healthcare:

[Figure: Image is taken from: https://www.akc.org/expert-advice/lifestyle/do-you-live-in-dog-state-or-cat-state/ ; ViT model: https://huggingface.co/julien-c/hotdog-not-hotdog]

Can we use these models from Hugging Face, or fine-tune new ViT models, and use them for inference in real production? How can we scale them by using managed services for distributed computations such as AWS EMR, Azure HDInsight, GCP Dataproc, or Databricks?
Hopefully, some of these will be answered by the end of this article.

Let the Benchmarks Begin!

Some details about our benchmarks:

1- Dataset: ImageNet mini: sample (>3K) and full (>34K)

I have downloaded the ImageNet 1000 (mini) dataset from Kaggle: https://www.kaggle.com/datasets/ifigotin/imagenetmini-1000

I have chosen the train directory with over 34K images and called it imagenet-mini, since all I needed was enough images to do benchmarks that take longer. In addition, I have randomly selected roughly 10% of the full dataset and called it imagenet-mini-sample, which has 3544 images, for my smaller benchmarks and also to fine-tune the right parameters like the batch size.

2- Model: The "vit-base-patch16-224" by Google

We will be using this model from Google hosted on Hugging Face: https://huggingface.co/google/vit-base-patch16-224

3- Libraries: Transformers 🤗 & Spark NLP 🚀

Benchmarking Hugging Face on a Bare Metal Server
ViT model on a Dell PowerEdge C4130

What is a bare-metal server? A bare-metal server is just a physical computer that is only being used by one user. There is no hypervisor installed on this machine, there are no virtualizations, and everything is being executed directly on the main OS (Linux, Ubuntu). The detailed specs of the CPUs, GPUs, and memory of this machine are inside the notebooks.

As my initial tests, plus almost every blog post written by the Hugging Face engineering team comparing inference speed among DL engines, have revealed, the best performance for inference in the Hugging Face library (Transformers) is achieved by using PyTorch over TensorFlow. I am not sure whether this is due to TensorFlow being a second-class citizen in Hugging Face (fewer supported features, fewer supported models, fewer examples, outdated tutorials, and yearly surveys for the last 2 years in which users keep asking for more TensorFlow) or whether PyTorch just has a lower latency in inference on both CPU and GPU.
[Figure: TensorFlow remains the most-used deep learning framework]

Regardless of the reason, I have chosen PyTorch in the Hugging Face library to get the best results for our image classification benchmarks. This is a simple code snippet to use a ViT model (PyTorch of course) in Hugging Face:

    from PIL import Image
    import requests
    from transformers import ViTFeatureExtractor, ViTForImageClassification

    url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
    image = Image.open(requests.get(url, stream=True).raw)

    feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224')
    model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')

    inputs = feature_extractor(images=image, return_tensors="pt")
    outputs = model(**inputs)
    logits = outputs.logits

    # model predicts one of the 1000 ImageNet classes
    predicted_class_idx = logits.argmax(-1).item()
    print("Predicted class:", model.config.id2label[predicted_class_idx])

This may look straightforward for predicting a single input image, but it is not suitable for larger amounts of images, especially on a GPU. To avoid predicting images sequentially and to take advantage of accelerated hardware such as a GPU, it is best to feed the model with batches of images, which is possible in Hugging Face via Pipelines. Needless to say, you can implement your own batching technique either by extending Hugging Face's Pipelines or by doing it on your own.
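If you do roll your own batching, the core of it is just grouping inputs into fixed-size chunks before each forward pass. The sketch below uses only the standard library; `batched` and the file names are illustrative, and the commented lines assume the `feature_extractor` and `model` objects from the snippet above, with the feature extractor being given a list of PIL images.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


# Hypothetical file names standing in for the real image paths.
image_paths = [f"img_{i:04d}.jpeg" for i in range(10)]

for batch in batched(image_paths, 4):
    # With the objects from the snippet above, a batch could be run as:
    #   images = [Image.open(path) for path in batch]
    #   inputs = feature_extractor(images=images, return_tensors="pt")
    #   logits = model(**inputs).logits
    print(len(batch), batch[0])
```

The last chunk is simply shorter when the number of images is not a multiple of the batch size, so no image is dropped or padded.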
A simple pipeline for Image Classification will look like this:

    from transformers import ViTFeatureExtractor, ViTForImageClassification
    from transformers import pipeline

    feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224')
    model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')

    pipe = pipeline("image-classification", model=model, feature_extractor=feature_extractor, device=-1)

As per the documentation, I have downloaded/loaded google/vit-base-patch16-224 for the feature extractor and model (PyTorch checkpoints of course) to use them in the pipeline with image classification as the task. There are 3 things in this pipeline that are important to our benchmarks:

> device: If it's -1 (default) it will only use CPUs, while if it's zero or a positive integer it will run the model on the associated CUDA device id. (It's best to hide the GPUs and force PyTorch to use CPU, and not just rely on this number here.)

> batch_size: When the pipeline uses a DataLoader (when passing a dataset, on GPU for a PyTorch model), this is the size of the batch to use. For inference, batching is not always beneficial.

> You have to use either DataLoader or PyTorch Dataset to take full advantage of batching in Hugging Face pipelines on a GPU.

Before we move forward with the benchmarks, you need to know one thing regarding batching in Hugging Face Pipelines for inference: it doesn't always work. As stated in Hugging Face's documentation, setting batch_size may not increase the performance of your pipeline at all. It may even slow down your pipeline: https://huggingface.co/docs/transformers/main_classes/pipelines#pipeline-batching

To be fair, in my benchmarks I used a range of batch sizes starting from 1 to make sure I can find the best result among them.
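A sweep over batch sizes can be wrapped in a small timing harness so the fastest setting is picked from measurements rather than by eye. This is a sketch, not the code used for the numbers in this article: `sweep_batch_sizes` and `fake_run` are hypothetical names, and `fake_run` merely stands in for a real inference loop.

```python
import time


def sweep_batch_sizes(run, batch_sizes):
    """Time run(batch_size) for each candidate and return (timings, best)."""
    timings = {}
    for batch_size in batch_sizes:
        start = time.perf_counter()
        run(batch_size)
        timings[batch_size] = time.perf_counter() - start
    best = min(timings, key=timings.get)
    return timings, best


def fake_run(batch_size):
    # Stand-in workload; replace with the real inference loop, e.g.
    #   for _ in pipe(dataset, batch_size=batch_size): pass
    time.sleep(0.01 if batch_size == 8 else 0.05)


timings, best = sweep_batch_sizes(fake_run, [1, 8, 32])
for batch_size, seconds in timings.items():
    print(f"batch_size={batch_size}: {seconds:.3f}s")
print("fastest batch size:", best)
```

Running the sweep on a small sample first keeps it cheap; the winning batch size can then be reused for the full dataset.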
This is how I benchmarked the Hugging Face pipeline on CPU:

    from tqdm import tqdm
    from transformers import pipeline

    # model and feature_extractor are loaded as shown above;
    # dataset is the image dataset prepared in the notebooks
    pipe = pipeline("image-classification", model=model, feature_extractor=feature_extractor, device=-1)

    for batch_size in [1, 8, 32, 64, 128]:
        print("-" * 30)
        print(f"Streaming batch_size={batch_size}")
        for out in tqdm(pipe(dataset, batch_size=batch_size), total=len(dataset)):
            pass

Let's have a look at the results of our very first benchmark for the Hugging Face image classification pipeline on CPUs over the sample (3K) ImageNet dataset:

[Figure: Hugging Face image-classification pipeline on CPUs: predicting 3544 images]

As can be seen, it took around 3 minutes (188 seconds) to finish processing around 3544 images from the sample dataset. Now that I know which batch size (8) is the best for my pipeline/dataset/hardware, I can use the same pipeline over a larger dataset (34K images) with this batch size:

[Figure: Hugging Face image-classification pipeline on CPUs: predicting 34745 images]

This time it took around 31 minutes (1,879 seconds) to finish predicting classes for 34745 images on CPUs.

To speed up most deep learning models, especially these new transformer-based models, one should use accelerated hardware such as a GPU. Let's have a look at how to benchmark the very same pipeline over the very same datasets, but this time on a GPU device.
As mentioned before, we need to change the device to a CUDA device id like 0 (the first GPU):

    from transformers import ViTFeatureExtractor, ViTForImageClassification
    from transformers import pipeline
    from tqdm import tqdm
    import torch

    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    print(device)

    feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224')
    model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')
    model = model.to(device)

    pipe = pipeline("image-classification", model=model, feature_extractor=feature_extractor, device=0)

    # dataset is the image dataset prepared in the notebooks
    for batch_size in [1, 8, 32, 64, 128, 256, 512, 1024]:
        print("-" * 30)
        print(f"Streaming batch_size={batch_size}")
        for out in tqdm(pipe(dataset, batch_size=batch_size), total=len(dataset)):
            pass

In addition to setting device=0, I also followed the recommended way to run a PyTorch model on a GPU device via .to(device). Since we are using accelerated hardware (GPU), I also increased the maximum batch size for my testing to 1024 to find the best result.

Let's have a look at our Hugging Face image classification pipeline on a GPU device over the sample ImageNet dataset (3K):

[Figure: Hugging Face image-classification pipeline on a GPU: predicting 3544 images]

As can be seen, it took around 50 seconds to finish processing around 3544 images from our imagenet-mini-sample dataset on a GPU device. Batching improved the speed, especially compared to the results coming from the CPUs; however, the improvements stopped around the batch size of 32. Although the results are the same after batch size 32, I have chosen batch size 256 for my larger benchmark to utilize enough GPU memory as well.

[Figure: Hugging Face image-classification pipeline on a GPU: predicting 34745 images]

This time our benchmark took around 8:17 minutes (497 seconds) to finish predicting classes for 34745 images on a GPU device.
If we compare the results from our benchmarks on CPUs and a GPU device we can see that the GPU here is the winner:

[Figure: Hugging Face (PyTorch) is up to 3.9x faster on GPU vs. CPU]

I used Hugging Face Pipelines to load ViT PyTorch checkpoints, load my data into the torch dataset, and use the out-of-the-box batching provided to the model on both CPU and GPU. The GPU is up to ~3.9x faster compared to running the same pipelines on CPUs.

We have improved our ViT pipeline to perform image classification by using a GPU device instead of CPUs, but can we improve our pipeline further on both CPU & GPU in a single machine before scaling it out to multiple machines? Let's have a look at the Spark NLP library.

Spark NLP: State-of-the-Art Natural Language Processing

[Figure: Spark NLP is an open-source state-of-the-art Natural Language Processing library (https://github.com/JohnSnowLabs/spark-nlp)]

Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark. It provides simple, performant & accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment. Spark NLP comes with 7000+ pretrained pipelines and models in more than 200+ languages. It also offers tasks such as Tokenization, Word Segmentation, Part-of-Speech Tagging, Word and Sentence Embeddings, Named Entity Recognition, Dependency Parsing, Spell Checking, Text Classification, Sentiment Analysis, Token Classification, Machine Translation (+180 languages), Summarization & Question Answering, Text Generation, Image Classification (ViT), and many more NLP tasks.

Spark NLP is the only open-source NLP library in production that offers state-of-the-art transformers such as BERT, CamemBERT, ALBERT, ELECTRA, XLNet, DistilBERT, RoBERTa, DeBERTa, XLM-RoBERTa, Longformer, ELMO, Universal Sentence Encoder, Google T5, MarianMT, GPT2, and Vision Transformer (ViT) not only to Python and R, but also to the JVM ecosystem (Java, Scala, and Kotlin) at scale by extending Apache Spark natively.

Benchmarking Spark NLP on a Bare Metal Server
ViT models on a Dell PowerEdge C4130

Spark NLP has the same ViT features for Image Classification as Hugging Face, which were added in the recent 4.1.0 release. The feature is called ViTForImageClassification, it has over 240 pre-trained models ready to go, and simple code to use this feature in Spark NLP looks like this:

    from sparknlp.annotator import *
    from sparknlp.base import *
    from pyspark.ml import Pipeline

    imageAssembler = ImageAssembler() \
        .setInputCol("image") \
        .setOutputCol("image_assembler")

    imageClassifier = ViTForImageClassification \
        .pretrained("image_classifier_vit_base_patch16_224") \
        .setInputCols("image_assembler") \
        .setOutputCol("class") \
        .setBatchSize(8)

    pipeline = Pipeline(stages=[
        imageAssembler,
        imageClassifier
    ])

If we compare Spark NLP and Hugging Face side by side for downloading and loading a pre-trained ViT model for an Image Classification prediction, apart from loading images and using post calculations like argmax outside the Hugging Face library, they are both pretty straightforward. Also, they both can be saved and served later as a pipeline to reduce these lines to only 1 line of code:

[Figure: Loading and using ViT models for Image Classification in Spark NLP (left) and Hugging Face (right)]

Since Apache Spark has a concept called Lazy Evaluation, it doesn't start the execution of the process until an ACTION is called. Actions in Apache Spark can be .count() or .show() or .write() and so many other RDD-based operations which I won't get into now, and you won't need to know them for this article. I usually choose to either count() the target column or write() the results to disk to trigger executing all the rows in the DataFrame.
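Because nothing runs until an action is called, a benchmark has to time the action itself rather than the call that builds the pipeline. Here is a minimal sketch of that idea with a generic timer. `time_action` is a hypothetical helper, the generator is only a stand-in for a lazy computation, and the commented Spark usage assumes Spark 3.x, a DataFrame named `image_df`, and the `pipeline` object defined above.

```python
import time


def time_action(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


# Stand-in for a lazy computation: building the generator does no work,
# only consuming it (the "action") does.
lazy_rows = (n * n for n in range(1_000))
row_count, seconds = time_action(lambda: sum(1 for _ in lazy_rows))
print(row_count, f"{seconds:.4f}s")

# With Spark NLP the same helper could wrap a real action, for example:
#   result_df = pipeline.fit(image_df).transform(image_df)  # only builds the plan
#   _, seconds = time_action(
#       lambda: result_df.write.format("noop").mode("overwrite").save()
#   )
```

Timing only the action keeps model download and session start-up out of the measurement, which makes runs with different batch sizes easier to compare.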
Also, like the Hugging Face benchmarks, I will loop through selected batch sizes to make sure I can have all the possible results without missing the best outcome.

Now we know how to load ViT model(s) in Spark NLP, and we also know how to trigger an action to force computation over all the rows in our DataFrame to benchmark. All that is left to learn about is oneDNN, the oneAPI Deep Neural Network Library. Since the DL engine in Spark NLP is TensorFlow, you can also enable oneDNN to improve the speed on CPUs (like everything else, you need to test this to be sure it improves the speed and not the other way around). I will also be using this flag in addition to normal CPUs without oneDNN enabled.

Now that we know all the ViT models from Hugging Face are also available in Spark NLP and how to use them in a pipeline, we will repeat our previous benchmarks on the bare-metal Dell server to compare CPU vs. GPU. Let's have a look at the results of Spark NLP's image classification pipeline on CPUs over our sample (3K) ImageNet dataset:

[Figure: Spark NLP image-classification pipeline on a CPU without oneDNN: predicting 3544 images]

It took around 2.1 minutes (130 seconds) to finish processing around 3544 images from our sample dataset. Having a smaller dataset to try different batch sizes is helpful to choose the right batch size for your task, your dataset, and your machine. Here it is clear that batch size 16 is the best size for our pipeline to deliver the best result.

I would also like to enable oneDNN to see if in this specific situation it improves my benchmark compared to the CPUs without oneDNN. You can enable oneDNN in Spark NLP by setting the environment variable TF_ENABLE_ONEDNN_OPTS to 1. Let's see what happens if I enable this flag and re-run the previous benchmark on the CPU to find the best batch size:
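When the session is started from Python on a single machine, one way to set this flag is through the process environment before Spark NLP is started, so that the JVM it launches inherits it. This is a sketch under that assumption; on a multi-node cluster the variable also has to reach every executor, which this snippet does not handle, and the helper name is made up for illustration.

```python
import os

# Must run before the Spark session (and with it TensorFlow) is started;
# setting it afterwards has no effect on an already running JVM.
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"


def onednn_requested():
    """Report whether the oneDNN flag is set to 1 for this process and its children."""
    return os.environ.get("TF_ENABLE_ONEDNN_OPTS") == "1"


print("oneDNN requested:", onednn_requested())

# Then start Spark NLP as usual, for example:
#   import sparknlp
#   spark = sparknlp.start()
```

Exporting the variable in the shell before launching Python achieves the same thing and is the form used in the rest of this article.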
[Figure: Spark NLP image-classification pipeline on a CPU with oneDNN: predicting 3544 images]

OK, so clearly enabling oneDNN for TensorFlow in this specific situation improved our results by at least 14%. Since we don't have to do/change anything and all it takes is to say export TF_ENABLE_ONEDNN_OPTS=1, I am going to use that for the benchmark with a larger dataset as well to see the difference. On the sample dataset this amounts to only seconds, but 14% on the larger dataset can shave minutes off our results.

Now that I know that batch size 16 for CPU without oneDNN and batch size 2 for CPU with oneDNN enabled have the best results, I can continue with using the same pipeline over a larger dataset (34K images):

[Figure: Spark NLP image-classification pipeline on CPUs without oneDNN: predicting 34745 images]

This time our benchmark took around 24 minutes (1423 seconds) to finish predicting classes for 34745 images on a CPU device without oneDNN enabled. Now let's see what happens if I enable oneDNN for TensorFlow and use the batch size of 2 (the best results):

[Figure: Spark NLP image-classification pipeline on CPUs with oneDNN: predicting 34745 images]

This time it took around 21 minutes (1278 seconds). As expected from our sample benchmarks, we can see around 11% improvement in the results, which did shave off minutes compared to not having oneDNN enabled.

Let's have a look at how to benchmark the very same pipeline on a GPU device. In Spark NLP, all you need to use a GPU is to start it with gpu=True when you are starting the Spark NLP session:

    import sparknlp

    spark = sparknlp.start(gpu=True)
    # you can set the memory here as well
    spark = sparknlp.start(gpu=True, memory="16g")

That's it! If you have something in your pipeline that can be run on GPU, it will do it automatically without the need to do anything explicitly.
Let's have a look at our Spark NLP image classification pipeline on a GPU device over the sample ImageNet dataset (3K):

[Figure: Spark NLP image-classification pipeline on a GPU: predicting 3544 images]

Out of curiosity to see whether my crusade to find a good batch size on a smaller dataset was correct, I ran the same pipeline with GPU on a larger dataset to see if batch size 32 will have the best result:

[Figure: Spark NLP image-classification pipeline on a GPU: predicting 34745 images]

Thankfully, it is batch size 32 that yields the best time. So it took around 4 and a half minutes (277 seconds).

I will pick the results from CPUs with oneDNN since they were faster, and I will compare them to the GPU results:

[Figure: Spark NLP (TensorFlow) is up to 4.6x faster on GPU vs. CPU (oneDNN)]

This is great! We can see Spark NLP on GPU is up to 4.6x faster than CPUs even with oneDNN enabled.

Let's have a look at how these results compare to the Hugging Face benchmarks:

Spark NLP is 65% faster than Hugging Face on CPUs in predicting image classes for the sample dataset with 3K images and 47% faster on the larger dataset with 34K images. Spark NLP is also 79% faster than Hugging Face on single-GPU inference for the larger dataset with 34K images and up to 35% faster on the smaller dataset.

[Figure: Spark NLP was faster than Hugging Face in a single machine by using either CPU or GPU: image classification by using Vision Transformer (ViT)]

Spark NLP & Hugging Face on Databricks
All your data, analytics, and AI on one platform

What is Databricks?

Databricks is a Cloud-based platform with a set of data engineering & data science tools that are widely used by many companies to process and transform large amounts of data. Users use Databricks for many purposes, from processing and transforming extensive amounts of data to running many ML/DL pipelines to explore the data.
Disclaimer: This was my interpretation of Databricks; it does come with lots of other features and you should check them out: https://www.databricks.com/product/data-lakehouse

[Figure: Databricks supports AWS, Azure, and GCP clouds: https://www.databricks.com/product/data-lakehouse]

Hugging Face in Databricks Single Node with CPUs on AWS

Databricks offers a "Single Node" cluster type when you are creating a cluster, which is suitable for those who want to use Apache Spark with only 1 machine or use non-Spark applications, especially ML and DL-based Python libraries. Hugging Face comes already installed when you choose the Databricks 11.1 ML runtime. Here is what the cluster configurations look like for my Single Node Databricks (only CPUs) before we start our benchmarks:

[Figure: Databricks single-node cluster: CPU runtime]

The summary of this cluster, which uses the m5n.8xlarge instance on AWS, is that it has 1 Driver (only 1 node), 128 GB of memory, 32 Cores of CPU, and it costs 5.71 DBU per hour. You can read about "DBU" on AWS here: https://www.databricks.com/product/aws-pricing

[Figure: Databricks single-cluster: AWS instance profile]

Let's replicate our benchmarks from the previous section (bare-metal Dell server) here on our single-node Databricks (CPUs only). We start with Hugging Face and our sample-sized dataset of ImageNet to find out what batch size is a good one, so we can use it for the larger dataset, since this happened to be a proven practice in the previous benchmarks:

[Figure: Hugging Face image-classification pipeline on Databricks single-node CPUs: predicting 3544 images]

It took around 2 minutes and a half (149 seconds) to finish processing around 3544 images from our sample dataset on a single-node Databricks that only uses CPUs.
The best batch size on this machine using only CPUs is 8 (the 149-second run over the 3544 sample images), so I am going to use that to run the benchmark on the larger dataset:

[Figure: Hugging Face image-classification pipeline on Databricks single-node CPUs: predicting 34745 images]

On the larger dataset with over 34K images, it took around 20 minutes and a half (1233 seconds) to finish predicting classes for those images. For our next benchmark we need to have a single-node Databricks cluster, but this time we need to have a GPU-based runtime and choose a GPU-based AWS instance.

Hugging Face in Databricks Single Node with a GPU on AWS

Let's create a new cluster, and this time we are going to choose a runtime with GPU, which in this case is called 11.1 ML (includes Apache Spark 3.3.0, GPU, Scala 2.12) and comes with all required CUDA and NVIDIA software installed. The next thing we need is to also select an AWS instance that has a GPU, and I have chosen g4dn.8xlarge, which has 1 GPU and a similar number of cores/memory as the other cluster. This GPU instance comes with a Tesla T4 and 16 GB memory (15 GB usable GPU memory).

[Figure: Databricks single-node cluster: GPU runtime]

This is the summary of our single-node cluster. Like the previous one, it is the same in terms of the number of cores and the amount of memory, but it comes with a Tesla T4 GPU:

[Figure: Databricks single-node cluster: AWS instance profile]

Now that we have a single-node cluster with a GPU, we can continue our benchmarks to see how Hugging Face performs on this machine in Databricks. I am going to run the benchmark on the smaller dataset to see which batch size is more suited for our GPU-based machine:

[Figure: Hugging Face image-classification pipeline on Databricks single-node GPU: predicting 3544 images]

It took around a minute (64 seconds) to finish processing around 3544 images from our sample dataset on our single-node Databricks cluster with a GPU device.
Batching improved the speed compared with the batch size 1 result; however, after batch size 8 the results pretty much stayed the same (around 64 seconds for the 3544 sample images). Although the results are the same after batch size 8, I have chosen batch size 256 for my larger benchmark to utilize more GPU memory as well (to be honest, 8 and 256 both performed pretty much the same).

Let's run the benchmark on the larger dataset and see what happens with batch size 256:

[Figure: Hugging Face image-classification pipeline on Databricks single-node GPU: predicting 34745 images]

On the larger dataset, it took almost 11 minutes (659 seconds) to finish predicting classes for over 34K images. If we compare the results from our benchmarks on a single node with CPUs and a single node that comes with 1 GPU, we can see that the GPU node here is the winner:

[Figure: Hugging Face (PyTorch) is up to 2.3x faster on GPU vs. CPU]

The GPU is up to ~2.3x faster compared to running the same pipeline on CPUs in Hugging Face on Databricks Single Node.

Now we are going to run the same benchmarks by using Spark NLP in the same clusters and over the same datasets to compare it with Hugging Face.

Benchmarking Spark NLP on a Single Node Databricks

First, let's install Spark NLP in your Single Node Databricks CPUs. In the Libraries tab inside your cluster you need to follow these steps:

- Install New -> PyPI -> spark-nlp==4.1.0 -> Install
- Install New -> Maven -> Coordinates -> com.johnsnowlabs.nlp:spark-nlp_2.12:4.1.0 -> Install
- Add `TF_ENABLE_ONEDNN_OPTS=1` to `Cluster->Advanced Options->Spark->Environment variables` to enable oneDNN

[Figure: How to install Spark NLP in Databricks on CPUs for Python, Scala, and Java]

Spark NLP in Databricks Single Node with CPUs on AWS

Now that we have Spark NLP installed on our Databricks single-node cluster, we can repeat the benchmarks for the sample and full datasets on both CPU and GPU.
Let's start with the benchmark on CPUs first over the sample dataset:

[Figure: Spark NLP image-classification pipeline on Databricks single-node CPUs (oneDNN): predicting 3544 images]

It took around 2 minutes (111 seconds) to finish processing 3544 images and predicting their classes on the same single-node Databricks cluster with CPUs we used for Hugging Face. We can see that the batch size of 16 has the best result, so I will use this in the next benchmark on the larger dataset:

[Figure: Spark NLP image-classification pipeline on Databricks single-node CPUs (oneDNN): predicting 34742 images]

On the larger dataset with over 34K images, it took around 18 minutes (1072 seconds) to finish predicting classes for those images. Next up, I will repeat the same benchmarks on the cluster with GPU.

Databricks Single Node with a GPU on AWS

First, install Spark NLP in your Single Node Databricks GPU (the only difference is the use of "spark-nlp-gpu" from Maven). To install Spark NLP in your Databricks cluster, follow these steps in the Libraries tab inside the cluster:

- Install New -> PyPI -> spark-nlp==4.1.0 -> Install
- Install New -> Maven -> Coordinates -> com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.1.0 -> Install

[Figure: How to install Spark NLP in Databricks on GPUs for Python, Scala, and Java]

I am going to run the benchmark on the smaller dataset to see which batch size is more suited for our GPU-based machine:

[Figure: Spark NLP image-classification pipeline on Databricks single-node GPU: predicting 3544 images]

It took less than a minute (47 seconds) to finish processing around 3544 images from our sample dataset on our single-node Databricks with a GPU device. We can see that batch size 8 performed the best in this specific use case, so I will run the benchmark on the larger dataset:

[Figure: Spark NLP image-classification pipeline on Databricks single-node GPU: predicting 34742 images]

On the larger dataset, it took almost 7 minutes and a half (435 seconds) to finish predicting classes for over 34K images.
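The timings reported for the larger dataset on the two Databricks single-node clusters can be turned into speedup figures with a little arithmetic. The sketch below hard-codes the seconds quoted in this article (1,233 and 659 for Hugging Face, 1,072 and 435 for Spark NLP); the function names are illustrative.

```python
def speedup(slower_seconds, faster_seconds):
    """How many times faster the second timing is than the first."""
    return slower_seconds / faster_seconds


def percent_faster(slower_seconds, faster_seconds):
    """The same comparison expressed as a percentage gain in throughput."""
    return (speedup(slower_seconds, faster_seconds) - 1) * 100


# Seconds for the 34K-image runs on Databricks single node.
hf_cpu, hf_gpu = 1233, 659
snlp_cpu, snlp_gpu = 1072, 435

print(f"Spark NLP GPU vs. CPU: {speedup(snlp_cpu, snlp_gpu):.2f}x")
print(f"Spark NLP vs. Hugging Face on CPU: {percent_faster(hf_cpu, snlp_cpu):.0f}% faster")
print(f"Spark NLP vs. Hugging Face on GPU: {percent_faster(hf_gpu, snlp_gpu):.0f}% faster")
```

Here "X% faster" means the ratio of the two run times minus one, which appears to be how the percentages in this article line up with the raw seconds.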
If we compare the results from our benchmarks on a single node with CPUs and a single node that comes with 1 GPU (1,072 seconds vs. 435 seconds on the 34K images), we can see that the GPU node here is the winner:

[Figure: Spark NLP is up to 2.5x faster on GPU vs. CPU in Databricks Single Node]

This is great! We can see Spark NLP on GPU is up to 2.5x faster than CPUs even with oneDNN enabled (oneDNN improves results on CPUs by between 10% and 20%).

Let's have a look at how these results compare to the Hugging Face benchmarks in the same Databricks Single Node cluster:

Spark NLP is up to 34% faster than Hugging Face on CPUs in predicting image classes for the sample dataset with 3K images (149 vs. 111 seconds) and up to 15% faster on the larger dataset with 34K images (1,233 vs. 1,072 seconds). Spark NLP is also 51% faster than Hugging Face on a single GPU for the larger dataset with 34K images and up to 36% faster on the smaller dataset with 3K images.

[Figure: Spark NLP is faster on both CPUs and GPUs vs. Hugging Face in Databricks Single Node]

Scaling beyond a single machine

So far we have established that Hugging Face on GPU is faster than Hugging Face on CPUs on a bare-metal server and Databricks Single Node. This is what you expect when you are comparing GPU vs. CPU with these new transformer-based models.

We have also established that Spark NLP outperforms Hugging Face for the very same pipeline (ViT model), on the very same datasets, in both the bare-metal server and the Databricks single-node cluster, and it performs better on both CPU and GPU devices. This, on the other hand, was not something I expected. When I was preparing this article I expected TensorFlow inference in Spark NLP to be slightly slower than inference in Hugging Face by using PyTorch, or at least be neck and neck. I was aiming for this section, scaling the pipeline beyond a single machine. But it seems Spark NLP is faster than Hugging Face even in a single machine, on both CPU and GPU, over both small and large datasets.

Question: What if you want to make your ViT pipeline even faster?
What if you have even larger datasets and you just cannot fit them inside one machine, or it just takes too long to get the results back?

Answer: Scaling out! This means that instead of resizing the same machine, you add more machines to your cluster. You need something to manage all those jobs, tasks, scheduling DAGs, failed tasks, etc., and those have their overheads, but if you need something to be faster or to be possible at all (beyond a single machine) you have to use some sort of distributed system.

Scaling up = making your machine bigger or faster so that it can handle more load.
Scaling out = adding more machines in parallel to spread out a load.

Scaling out Hugging Face:

The performance documentation on Hugging Face's official website suggests that scaling inference is only possible by using multiple GPUs. Given how we just defined scaling out, this is still stuck in a single machine:

https://huggingface.co/docs/transformers/performance

On top of that, the multi-GPU solution for inference in Hugging Face doesn't exist at the moment:

https://huggingface.co/docs/transformers/perf_infer_gpu_many

So it seems there is no native/official way to scale out Hugging Face pipelines. You can implement your own architecture consisting of microservices such as a job queue, messaging protocols, a RESTful API backend, and some other required components to distribute each request over different machines, but this scales the requests made by individual users instead of scaling out the actual system itself.

In addition, the latency of such systems is not comparable with natively distributed systems such as Apache Spark (gRPC might lower this latency, but it is still not competitive). Not to mention the single point of failure issue, managing failed jobs/tasks/inputs, and hundreds of other features you get out-of-the-box from Apache Spark that you now have to implement and maintain by yourself.
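As a rough illustration of the request-level approach described above, the sketch below assigns images to REST endpoints in round-robin order from a single client. The endpoint URLs and file names are hypothetical labels, no network calls are made, and this is not taken from any Hugging Face tooling; it only shows where the work is coordinated.

```python
from itertools import cycle

# Hypothetical endpoint URLs, used only as labels; nothing is sent over the network.
ENDPOINTS = [
    "http://node-1.example/predict",
    "http://node-2.example/predict",
    "http://node-3.example/predict",
]


def assign_round_robin(items, endpoints):
    """Pair each item with an endpoint, cycling through the endpoints in order."""
    return list(zip(items, cycle(endpoints)))


images = [f"image_{i}.jpg" for i in range(7)]
plan = assign_round_robin(images, ENDPOINTS)

for image, endpoint in plan:
    # A real client would issue one HTTP request here, one image at a time.
    print(f"{image} -> {endpoint}")
```

Even with several endpoints behind it, the loop itself still runs on one machine and hands over one image per request, which is the limitation discussed in this section.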
There is a blog post on the Hugging Face website portraying the very same architecture by scaling REST endpoints to serve more users: "Deploying 🤗 ViT on Kubernetes with TF Serving". I believe other companies are using similar approaches to scale out Hugging Face; however, they are all scaling the number of users/requests hitting the inference REST endpoints. In addition, you cannot scale Hugging Face this way on Databricks.

For instance, inference inside FastAPI is 10x slower than local inference: https://towardsdatascience.com/hugging-face-transformer-inference-under-1-millisecond-latency-e1be0057a51c

Once Hugging Face offers a native solution to scale out, I will re-run the benchmarks. Until then, there is no scaling out when you have to loop through the dataset from a single machine to hit REST endpoints in a round-robin fashion. (Think again about the part where we batched rows/sequences/images to feed the GPU all at once, then you'll get it.)

Scaling out Spark NLP:

Spark NLP is an extension of Spark ML, therefore it scales natively and seamlessly over all the platforms supported by Apache Spark, such as (but not limited to) Databricks, AWS EMR, Azure HDInsight, GCP Dataproc, Cloudera, SageMaker, Kubernetes, and many more.

Zero code changes are needed! Spark NLP can scale from a single machine to as many machines as you need without changing anything in the code!

You also don't need to export any models out of Spark NLP to use them in an entirely different library to speed up or scale the inference.

Figure: Spark NLP ecosystem: optimized, tested, and supported integrations

Databricks Multi-Node with CPUs on AWS

Let's create a cluster, and this time we choose Standard inside Cluster mode. This means we can have more than 1 node in our cluster, which in Apache Spark terminology means 1 Driver and N Workers (Executors).

We also need to install Spark NLP in this new cluster via the Libraries tab.
You can follow the steps I mentioned in the previous section for Single Node Databricks with CPUs. As you can see, I have chosen the same CPU-based AWS instance I used to benchmark both Hugging Face and Spark NLP so we can see how it scales out when we add more nodes.

This is what our cluster configuration looks like:

Figure: Databricks multi-node (standard) cluster with only CPUs

I will reuse the same Spark NLP pipeline I used in the previous benchmarks (no need to change any code), and I will only use the larger dataset with 34K images. Let's begin!

Scale Spark NLP on CPUs with 2x nodes

Figure: Databricks with 2x Nodes, CPUs only

Let's just add 1 more node and bring the total number of machines doing the processing to 2. Let's not forget the beauty of Spark NLP: when you go from a single-machine setup (your Colab, Kaggle, Databricks Single Node, or even your local Jupyter notebook) to a multi-node cluster setup (Databricks, EMR, GCP, Azure, Cloudera, YARN, Kubernetes, etc.), zero code change is required! And I mean zero! With that in mind, I will run the same benchmark inside this new cluster on the larger dataset with 34K images:

Figure: Spark NLP image-classification pipeline on 2x nodes with CPUs (oneDNN), predicting 34742 images

It took around 9 minutes (550 seconds) to finish predicting classes for 34K images. Let's compare this result on 2x Nodes with the Spark NLP and Hugging Face results on a Databricks single node (I will keep repeating the Hugging Face results on a Single Node as a reference, since Hugging Face could not be scaled out on multiple machines, especially on Databricks):

Figure: Spark NLP is 124% faster than Hugging Face with 2x Nodes

Previously, Spark NLP beat Hugging Face on a Single Node Databricks cluster using only CPUs by 15%. This time, by having only 2x nodes instead of 1 node, Spark NLP finished processing over 34K images 124% faster than Hugging Face.

Scale Spark NLP on CPUs with 4x nodes

Let's double the size of our cluster like before and go from 2x Nodes to 4x Nodes. This is what the cluster looks like with 4x nodes:

Figure: Databricks with 4x Nodes, CPUs only

I will run the same benchmark on this new cluster on the larger dataset with 34K images:

Figure: Spark NLP image-classification pipeline on 4x nodes with CPUs (oneDNN), predicting 34742 images

It took around 5 minutes (289 seconds) to finish predicting classes for 34K images. Let's compare this result on 4x Nodes with Spark NLP vs. Hugging Face on CPUs on Databricks:

Figure: Spark NLP is 327% faster than Hugging Face with 4x Nodes

As can be seen, Spark NLP is now 327% faster than Hugging Face on CPUs while using only 4x Nodes in Databricks.

Scale Spark NLP on CPUs with 8x nodes

Now let's double the previous cluster by adding 4 more nodes, for a total of 8x Nodes. Resizing the cluster is pretty easy, by the way; you just increase the number of workers in your cluster configuration:

Figure: Resizing Spark Cluster in Databricks

Figure: Databricks with 8x Nodes, CPUs only

Let's run the same benchmark, this time on 8x Nodes:

Figure: Spark NLP image-classification pipeline on 8x nodes with CPUs (oneDNN), predicting 34742 images

It took a little over 2 and a half minutes (161 seconds) to finish predicting classes for 34K images. Let's compare this result on 8x Nodes with Spark NLP vs. Hugging Face on CPUs on Databricks:

Figure: Spark NLP is 666% faster than Hugging Face with 8x Nodes

As can be seen, Spark NLP is now 666% faster than Hugging Face on CPUs while using only 8x Nodes in Databricks.

Let's just ignore the number of 6s here!
(It was 665.8%, if that makes you feel better.)

Scale Spark NLP on CPUs with 10x nodes

To finish scaling out our ViT model predictions on CPUs in Databricks by using Spark NLP, I will resize the cluster one more time and increase it to 10x Nodes:

Figure: Databricks with 10x Nodes, CPUs only

Let's run the same benchmark, this time on 10x Nodes:

Figure: Spark NLP image-classification pipeline on 10x nodes with CPUs (oneDNN), predicting 34742 images

It took less than 2 minutes (112 seconds) to finish predicting classes for 34K images. Let's compare this result on 10x Nodes with all the previous results from Spark NLP vs. Hugging Face on CPUs on Databricks:

Figure: Spark NLP is 1000% faster than Hugging Face with 10x Nodes

And this is how you scale out the Vision Transformer model coming from Hugging Face on 10x Nodes by using Spark NLP in Databricks! Our pipeline is now 1000% faster than Hugging Face on CPUs.

We managed to make our ViT pipeline 1000% faster than Hugging Face, which is stuck in 1 single node, by simply using Spark NLP, but we only used CPUs. Let's see if we can get the same improvements by scaling out our pipeline on a GPU cluster.

Databricks Multi-Node with GPUs on AWS

Having a GPU-based multi-node Databricks cluster is pretty much the same as having a single-node cluster. The only difference is choosing Standard and keeping the same ML/GPU Runtime with the same AWS instance specs we chose in our benchmarks for GPU on a single node.

We also need to install Spark NLP in this new cluster via the Libraries tab. Same as before, you can follow the steps I mentioned in Single Node Databricks with a GPU.

Figure: Databricks multi-node (standard) cluster with GPUs

Scale Spark NLP on GPUs with 2x nodes

Our multi-node Databricks GPU cluster uses the same AWS GPU instance, g4dn.8xlarge, that we used previously to run our benchmarks comparing Spark NLP vs. Hugging Face on a single-node Databricks cluster.
This is a summary of what it looks like this time with 2 nodes:

Figure: Databricks with 2x Nodes, with 1 GPU per node

I am going to run the same pipeline in this GPU cluster with 2x nodes:

Figure: Spark NLP image-classification pipeline on 2x nodes with GPUs, predicting 34742 images

It took just under 4 minutes (231 seconds) to finish predicting classes for 34K images. Let's compare this result on 2x Nodes with Spark NLP vs. Hugging Face on GPUs in Databricks:

Figure: Spark NLP is 185% faster than Hugging Face with 2x Nodes

Spark NLP with 2x Nodes is almost 3x faster (185%) than Hugging Face on 1 single node while using a GPU.

Scale Spark NLP on GPUs with 4x nodes

Let's resize our GPU cluster from 2x Nodes to 4x Nodes. This is a summary of what it looks like this time with 4x Nodes, each using a GPU:

Figure: Databricks with 4x Nodes, with 1 GPU per node

Let's run the same benchmark on 4x Nodes and see what happens:

Figure: Spark NLP image-classification pipeline on 4x nodes with GPUs, predicting 34742 images

This time it took almost 2 minutes (118 seconds) to finish classifying all 34K images in our dataset. Let's visualize this to get a better view of what it means in terms of Hugging Face in a single node vs. Spark NLP in a multi-node cluster:

Figure: Spark NLP is 458% faster than Hugging Face with 4x Nodes

That's a 458% increase in performance compared to Hugging Face. We just made our pipeline 5.6x faster by using Spark NLP with 4x nodes.

Scale Spark NLP on GPUs with 8x nodes

Next, I will resize the cluster to have 8x Nodes in my Databricks with the following summary:

Figure: Databricks with 8x Nodes, with 1 GPU per node

Just as a reminder, each AWS instance (g4dn.8xlarge) has 1 NVIDIA T4 GPU with 16GB of memory (15GB usable). Let's re-run the benchmark and see if we can spot any improvements, as scaling out in any distributed system has its overheads and you cannot just keep on adding machines:

Figure: Spark NLP image-classification pipeline on 8x nodes with GPUs, predicting 34742 images

It took about a minute (61 seconds) to finish classifying 34K images with 8x Nodes in our Databricks cluster. It seems we still managed to improve the performance. Let's put this result next to the previous results from Hugging Face in a single node vs. Spark NLP in a multi-node cluster:

Figure: Spark NLP is 980% faster than Hugging Face with 8x Nodes

Spark NLP with 8x Nodes is almost 11x faster (980%) than Hugging Face on GPUs.

Scale Spark NLP on GPUs with 10x nodes

Similar to our multi-node benchmarks on CPUs, I would like to resize the GPU cluster one more time to have 10x Nodes and match them in terms of the final number of nodes. The final summary of this cluster is as follows:

Figure: Databricks with 10x Nodes, with 1 GPU per node

Let's run our very last benchmark in this specific GPU cluster (with zero code changes):

Figure: Spark NLP image-classification pipeline on 10x nodes with GPUs, predicting 34742 images

It took less than a minute (51 seconds) to finish predicting classes for over 34K images. Let's put them all next to each other and see how we progressed in scaling out our Vision Transformer model coming from Hugging Face in the Spark NLP pipeline in Databricks:

Figure: Spark NLP is 1200% faster than Hugging Face with 10x Nodes

And we are done!

We managed to scale out our Vision Transformer model coming from Hugging Face on 10x Nodes by using Spark NLP in Databricks! Our pipeline is now 13x faster, a roughly 1200% performance improvement compared to Hugging Face on GPU.
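To see how much each added node actually buys, the sketch below takes the Spark NLP timings quoted in this article for the 34K-image dataset and computes the speedup relative to a single node, plus the parallel efficiency (speedup divided by node count). Efficiency is a common way to look at scaling overhead and is my addition here, not a metric from the original benchmarks. The second helper shows the relation between the "P% faster" and "Nx faster" figures used throughout, assuming "1000% faster" means 11 times as fast, which is how the article pairs them.

```python
# Wall-clock seconds to classify the 34K-image dataset with Spark NLP,
# keyed by number of Databricks nodes (numbers quoted in this article).
CPU_SECONDS = {1: 1072, 2: 550, 4: 289, 8: 161, 10: 112}
GPU_SECONDS = {1: 435, 2: 231, 4: 118, 8: 61, 10: 51}


def scaling_table(seconds_by_nodes):
    """Return (nodes, seconds, speedup, efficiency) rows vs. the 1-node run."""
    baseline = seconds_by_nodes[1]
    rows = []
    for nodes in sorted(seconds_by_nodes):
        seconds = seconds_by_nodes[nodes]
        speedup = baseline / seconds   # how many times faster than 1 node
        efficiency = speedup / nodes   # 1.0 would be perfectly linear scaling
        rows.append((nodes, seconds, speedup, efficiency))
    return rows


def percent_faster_to_times(percent):
    """Convert "P% faster" into "N times as fast", e.g. 1000% -> 11x."""
    return 1 + percent / 100


for name, timings in (("CPU", CPU_SECONDS), ("GPU", GPU_SECONDS)):
    for nodes, seconds, speedup, efficiency in scaling_table(timings):
        print(f"{name} {nodes:>2} nodes: {seconds:>4}s  "
              f"speedup {speedup:.2f}x  efficiency {efficiency:.0%}")
```

With these numbers the speedup keeps growing all the way to 10 nodes on both CPUs and GPUs, while efficiency stays below 100%, which is the overhead of a distributed system mentioned earlier.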
Let's sum up all these benchmarks by first comparing the improvements on CPUs and on GPUs, and then how much faster our pipeline can be by going from Hugging Face on CPUs to 10x Nodes on Databricks by using Spark NLP on GPUs.

Bringing it all together:

Databricks: Single Node & Multi Nodes

Figure: Spark NLP 🚀 on 10x Nodes with CPUs is 1000% (11x) faster than Hugging Face 🤗 stuck in a single node with CPUs

Figure: Spark NLP 🚀 on 10x Nodes with GPUs is 1192% (13x) faster than Hugging Face 🤗 stuck in a single node with GPU

What about the price difference between our AWS CPU instance and our AWS GPU instance? (I mean, you get more if you pay more, right?)

Figure: AWS m5d.8xlarge with CPUs vs. AWS g4dn.8xlarge with 1 GPU and similar specs

OK, so the price seems pretty much the same! With that in mind, what improvements do you get if you move from Hugging Face on CPUs stuck in a single machine to Spark NLP on 10x Nodes with 10x GPUs?

Figure: Spark NLP on GPUs is 25x (2366%) faster than Hugging Face on CPUs

Figure: Spark NLP 🚀 on 10x Nodes with GPUs is 2366% (25x) faster than Hugging Face 🤗 in a single node with CPUs

Final words

In the spirit of full transparency, all the notebooks with their logs, screenshots, and even the Excel sheet with numbers are provided here on GitHub.

Scaling Spark NLP requires zero code change. Running the benchmarks from a single-node Databricks cluster to the 10 nodes meant just re-running the same block of code in the same notebook.

Keep in mind these two libraries come with many best practices to optimize their speed and efficiency in different environments for different use cases. For instance, I didn't talk about partitions and their relation to parallelism and distribution in Apache Spark. There are many Spark configs to fine-tune a cluster, especially for balancing the number of tasks between CPUs and GPUs.
Now the question is, would it be possible to speed up either of them within the very same environments we used for our benchmarks? The answer is 100% yes! I tried to keep everything for both libraries at default values and right-out-of-the-box features, in favor of simplicity for the majority of users.

You may want to wrap Hugging Face and other DL-based Python libraries in a Spark UDF to scale them. This works to a degree; I have done this myself and still do (when there is no native solution). I won't get into the details of excessive memory usage, possible serialization issues, higher latency, and other problems that come up when one wraps such transformer-based models in a UDF. I would just say that if you are using Apache Spark, use the library that natively extends Apache Spark with the features you need.

Throughout this article, I went out of my way to mention Hugging Face on PyTorch and Spark NLP on TensorFlow. This is a big difference given the fact that in every single benchmark done by Hugging Face between PyTorch and TensorFlow, PyTorch was and still is the winner for inference. In Hugging Face, PyTorch just has a much lower latency and it seems to be much faster than TensorFlow in Transformers. The fact that Spark NLP uses the very same TensorFlow and comes out ahead in every benchmark compared to PyTorch in Hugging Face is a big deal. Either TensorFlow in Hugging Face is neglected, or PyTorch is just faster in inference compared to TensorFlow. Either way, I can't wait to see what happens when Spark NLP starts supporting TorchScript and ONNX Runtime in addition to TensorFlow.

The ML and ML GPU Databricks runtimes come with Hugging Face installed, which is pretty nice. But that doesn't mean Hugging Face is easy to use in Databricks. The Transformers library by Hugging Face doesn't support DBFS (the native distributed file system of Databricks) or Amazon S3. As you can see in the notebooks, I had to download a compressed version of the datasets and extract them to use them.
That's not really how users of Databricks and other platforms do things in production. We keep our data within distributed file systems, there are security measures in place, and most datasets are too large to be downloaded to a personal computer. I had to download the datasets I already had on DBFS, zip them, upload them to S3, make them public, and re-download them in the notebooks. A pretty tedious process that could have been avoided if Hugging Face supported DBFS/S3.

References

ViT
- https://arxiv.org/pdf/2010.11929.pdf
- https://github.com/google-research/vision_transformer
- Vision Transformers (ViT) in Image Recognition - 2022 Guide
- https://github.com/lucidrains/vit-pytorch
- https://medium.com/mlearning-ai/an-image-is-worth-16x16-words-transformers-for-image-recognition-at-scale-51f3561a9f96
- https://medium.com/nerd-for-tech/an-image-is-worth-16x16-words-transformers-for-image-recognition-at-scale-paper-summary-3a387e71880a
- https://gareemadhingra11.medium.com/summary-of-paper-an-image-is-worth-16x16-words-3f7f3aca941
- https://medium.com/analytics-vidhya/vision-transformers-bye-bye-convolutions-e929d022e4ab
- https://medium.com/syncedreview/google-brain-uncovers-representation-structure-differences-between-cnns-and-vision-transformers-83b6835dbbac

Hugging Face
- https://huggingface.co/docs/transformers/main_classes/pipelines
- https://huggingface.co/blog/fine-tune-vit
- https://huggingface.co/blog/vision-transformers
- https://huggingface.co/blog/tf-serving-vision
- https://huggingface.co/blog/deploy-tfserving-kubernetes
- https://huggingface.co/google/vit-base-patch16-224
- https://huggingface.co/blog/deploy-vertex-ai
- https://huggingface.co/models?other=vit

Databricks
- https://www.databricks.com/spark/getting-started-with-apache-spark
- https://docs.databricks.com/getting-started/index.html
- https://docs.databricks.com/getting-started/quick-start.html
- See the best of DATA+AI SUMMIT 2022
- https://www.databricks.com/blog/2020/05/15/shrink-training-time-and-cost-using-nvidia-gpu-accelerated-xgboost-and-apache-spark-on-databricks.html

Spark NLP
- Spark NLP GitHub
- Spark NLP Workshop (Spark NLP examples)
- Spark NLP Transformers
- Spark NLP Models Hub
- Speed Optimization & Benchmarks in Spark NLP 3: Making the Most of Modern Hardware
- Hardware Acceleration in Spark NLP
- Serving Spark NLP via API: Spring and LightPipelines
- Serving Spark NLP via API (1/3): Microsoft's Synapse ML
- Serving Spark NLP via API (2/3): FastAPI and LightPipelines
- Serving Spark NLP via API (3/3): Databricks Jobs and MLFlow Serve APIs
- Leverage deep learning in Scala with GPU on Spark 3.0
- Getting Started with GPU-Accelerated Apache Spark 3
- Apache Spark Performance Tuning
- Possible extra optimizations on GPUs: RAPIDS Accelerator for Apache Spark Configuration