Rishabh

Student • ML and NLP Research Enthusiast • Journalist • Co-founder of MLBlocks • https://rish-16.git

Training Your Models on Cloud TPUs in 4 Easy Steps on Google Colab

You have a plain old TensorFlow model that’s too computationally expensive to train on your standard-issue work laptop. I get it. I’ve been there too, and if I’m being honest, seeing my laptop crash twice in a row after trying to train a model on it is painful to watch.
In this article, I’ll be breaking down the steps on how to train any model on a TPU in the cloud using Google Colab. After this, you’ll never want to touch your clunky CPU ever again, believe me.
TLDR: This article shows you how easy it is to train any TensorFlow model on a TPU with very few changes to your code.

What’s a TPU?

The Tensor Processing Unit (TPU) is an accelerator — custom-made by Google Brain hardware engineers — that specialises in training deep and computationally expensive ML models.
Let’s put things into perspective just to give you an idea of how awesome and powerful a TPU is. A standard MacBook Pro Intel CPU can perform a few operations per clock cycle. A standard off-the-shelf GPU can perform tens of thousands of operations per cycle. A state-of-the-art TPU can perform hundreds of thousands of operations per cycle (up to 128K operations per cycle).
To understand the scale, imagine using these devices to print a book. A CPU can print character-by-character. A GPU can print a few words at a time. A TPU? Well, it can print a whole page at a time. That’s some amazing speed and power that we now have at our disposal; shoutout to Google for giving broke high-schoolers (like me) access to high-performance hardware.
If you’re interested in the inner workings of a TPU and what makes it so amazing, go check out the Google Cloud blog article where they discuss everything from hardware to software here.
Also, the operations per clock cycle for these devices are rough estimates stated by Google engineers themselves. Please don’t call me out to a duel.
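To make the 128K figure a little more concrete, here is a rough back-of-the-envelope sketch. It assumes the published first-generation TPU specs: a 256 × 256 grid of multiply-accumulate units running at 700 MHz, with each multiply-accumulate counted as two operations. These numbers are assumptions for illustration, not something measured in this article.

```python
# Assumed first-generation TPU figures (not measured in this article)
MATRIX_UNIT_SIDE = 256          # the matrix unit is a 256 x 256 grid of MAC units
OPS_PER_MAC = 2                 # one multiply + one add
CLOCK_HZ = 700_000_000          # 700 MHz


def ops_per_cycle(side=MATRIX_UNIT_SIDE, ops_per_mac=OPS_PER_MAC):
    """Peak operations per clock cycle for a side x side grid of MAC units."""
    return side * side * ops_per_mac


def peak_ops_per_second(clock_hz=CLOCK_HZ):
    """Peak operations per second = operations per cycle x cycles per second."""
    return ops_per_cycle() * clock_hz


print(ops_per_cycle())                 # 131072, i.e. 128 * 1024 = "128K"
print(peak_ops_per_second() / 1e12)    # roughly 92 tera-operations per second
```

These are peak figures; real workloads will not keep every unit busy on every cycle.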

Running an MNIST model on TPUs

For the sake of this tutorial, I’ll be running through a quick and easy MNIST model. Do note that this model can be whatever you want it to be. To better help you visualise what’s going on, I’ve chosen good old MNIST.
Let’s first begin by extracting and preprocessing our dataset. This shouldn’t be much of a problem:
import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# preprocessing the dataset: flatten each 28x28 image into a 784-long vector
x_train = x_train.reshape(x_train.shape[0], 784)
x_test = x_test.reshape(x_test.shape[0], 784)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
# scale pixel values from [0, 255] down to [0, 1]
x_train /= 255
x_test /= 255

# one-hot encode the labels (10 digit classes)
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

# >> (60000, 784) (60000, 10)
# >> (10000, 784) (10000, 10)
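If you’re wondering what the two preprocessing steps actually do, here is a tiny pure-Python sketch of pixel scaling and one-hot encoding. The helper names scale_pixels and one_hot are made up for illustration; the real work above is done by NumPy and to_categorical.

```python
def scale_pixels(pixels):
    """Map raw pixel values in [0, 255] to floats in [0, 1]."""
    return [p / 255 for p in pixels]


def one_hot(label, num_classes=10):
    """Return a list with 1.0 at index `label` and 0.0 everywhere else."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


print(scale_pixels([0, 51, 255]))   # [0.0, 0.2, 1.0]
print(one_hot(3))                   # 1.0 in position 3, zeros elsewhere
```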
I’ve tried to make this article as comprehensive as possible so that you can train your models at unfathomable speeds and feel like you’re on top of the world. There are 4 steps we will be taking to train our model on a TPU:
1. Connect to a TPU
2. Initialise a distributed training strategy
3. Build our model under that strategy
4. Train the model and feel like a Superhero
Let’s delve right into it!
Note: At times, I’ve heard people say that we need to convert our datasets into TFRecords. Bear in mind that it isn’t necessary and that any dataset can be fed into a model for training on a TPU.

Connecting to a TPU

When I was messing around with TPUs on Colab, connecting to one was the most tedious part. It took quite a few hours of searching online and looking through tutorials, but I was finally able to get it done.
First, we need to switch our Colab notebook over to the TPU runtime. To change the runtime, click the Runtime tab in the navigation bar and select the Change runtime type option from the dropdown menu. In the popup window that shows up, select the TPU option from the dropdown selector.
Next, we need to see if any TPUs are available for use:
import os
import pprint # for pretty printing our device stats

if 'COLAB_TPU_ADDR' not in os.environ:
    print('ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!')
else:
    tpu_address = 'grpc://' + os.environ['COLAB_TPU_ADDR']
    print ('TPU address is', tpu_address)

    with tf.Session(tpu_address) as session:
      devices = session.list_devices()

    print('TPU devices:')
    pprint.pprint(devices)
This should give you a list of 8 devices that we can train our models on.
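The environment check above can be pulled into a small helper so the rest of the notebook can fail fast. This is only a sketch: the function name tpu_address_from_env is hypothetical, and it assumes (as the snippet above does) that Colab exposes the TPU’s host and port in COLAB_TPU_ADDR. The example address is invented.

```python
import os


def tpu_address_from_env(environ=None):
    """Return 'grpc://<host:port>' if COLAB_TPU_ADDR is set, else None."""
    environ = os.environ if environ is None else environ
    addr = environ.get('COLAB_TPU_ADDR')
    if not addr:
        return None
    return 'grpc://' + addr


# Example with a fake environment (the address here is invented):
print(tpu_address_from_env({'COLAB_TPU_ADDR': '10.0.0.2:8470'}))  # grpc://10.0.0.2:8470
print(tpu_address_from_env({}))                                   # None
```

Passing the environment in as an argument keeps the helper easy to try out without a TPU runtime attached.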

Initialising a Distributed Training Strategy

Now that we’ve changed the runtime and have acquired a list of available TPUs, we need to invoke the TPU by creating a distributed training strategy: a wrapper around the model that makes it compatible with multi-core training. To do so, we need to create a TPUClusterResolver that takes in the TPU address and provisions the device for us. Note that these calls live under tf.contrib, so the snippet targets TensorFlow 1.x.
resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.contrib.distribute.initialize_tpu_system(resolver)
strategy = tf.contrib.distribute.TPUStrategy(resolver)
The resolver gives us access to the TPU so that we can finally build a distributed training pipeline on it. This is a necessary step because a TPU does not behave like a single processor: it exposes several cores (the 8 devices listed above), so the training work has to be spread across them. With this strategy, jumping on board a TPU is very simple!
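Conceptually, a distributed strategy takes each batch and hands a slice of it to every core. The sketch below is not how TPUStrategy is implemented; it is a plain-Python illustration of that idea, using a hypothetical split_batch helper and assuming the 8 cores mentioned earlier.

```python
def split_batch(batch, num_replicas=8):
    """Split a list of examples into num_replicas equally sized shards."""
    if len(batch) % num_replicas != 0:
        raise ValueError("batch size must be divisible by the number of replicas")
    shard_size = len(batch) // num_replicas
    return [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_replicas)]


shards = split_batch(list(range(16)))
print(len(shards))     # 8 shards, one per core
print(shards[0])       # [0, 1]
```

Each core would then compute on its own shard, which is why batch sizes that divide evenly by the number of cores are convenient.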

Building our model under the distributed training strategy

With a training strategy now at hand, we can proceed to build our model using that strategy as such:
with strategy.scope():
	"""
	Note: This model can be whatever you want it to be.
	Here, I'm building a simple fully-connected network using
	our distributed training strategy.
	This essentially takes our model and makes it
	compatible to train on a TPU.
	"""
	model = tf.keras.models.Sequential()
	model.add(tf.keras.layers.Dense(512, input_shape=(784,), activation='relu'))
	model.add(tf.keras.layers.Dense(256, activation='relu'))
	model.add(tf.keras.layers.Dense(128, activation='relu'))
	model.add(tf.keras.layers.Dense(64, activation='relu'))
	model.add(tf.keras.layers.Dense(10, activation='softmax'))

	# compiling the model using the RMSProp optimizer
	# and Categorical Crossentropy loss (our labels are one-hot encoded)
	model.compile(
		optimizer=tf.train.RMSPropOptimizer(learning_rate=1e-2),
		loss=tf.keras.losses.categorical_crossentropy,
		metrics=['categorical_accuracy']
	)

	model.summary()
This looks like a lot of jargon, but all it does is convert a regular tf.keras model (that is typically run on a CPU) into a TPU-ready model for distributed training.
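As a sanity check on model.summary(), the number of trainable parameters in a stack of Dense layers can be worked out by hand: each layer has inputs × units weights plus one bias per unit. The sketch below does that arithmetic for the layer sizes used above; dense_param_count is a made-up helper, not a Keras function.

```python
def dense_param_count(input_dim, layer_units):
    """Total weights + biases for a chain of fully-connected layers."""
    total = 0
    for units in layer_units:
        total += input_dim * units + units   # weight matrix + bias vector
        input_dim = units                    # this layer's output feeds the next
    return total


print(dense_param_count(784, [512, 256, 128, 64, 10]))  # 575050
```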

Training our model

This is the exciting part of the process. We can finally train our model on a Cloud TPU for free. With so much power in our hands, let’s make good use of it and train on MNIST (I know… very anticlimactic).
history = model.fit(x_train, y_train, epochs=20, steps_per_epoch=50)
model.save_weights('./mnist_model.h5', overwrite=True)
Ensure that the number of instances is perfectly divisible by the steps_per_epoch parameter so that all the instances are used during training. For example, we have 60000 instances in our training set. 60000 is divisible by 50 (1200 instances per step), so all our instances are fed into the model without any leftovers.
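The divisibility check is simple arithmetic, sketched here with a hypothetical samples_per_step helper. It assumes the 60000 training images and steps_per_epoch=50 used in this tutorial.

```python
def samples_per_step(num_samples, steps_per_epoch):
    """Return (samples consumed per step, samples left over each epoch)."""
    return divmod(num_samples, steps_per_epoch)


per_step, leftover = samples_per_step(60000, 50)
print(per_step, leftover)              # 1200 0

# A step count that does not divide evenly leaves some samples unused
print(samples_per_step(60000, 70))     # (857, 10)
```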

A final round-up

With that, training should commence soon. Colab will boot up a TPU and upload the model architecture onto it. You should soon see the classic Keras progress bar in the cell output. Congrats! You’ve just successfully used a TPU.
Looking back, I always thought training on a TPU was a thing only wizards could handle. TPU training was always that one big thing out of my grasp simply because of the hype around it (making it seem very difficult to use). Never in my wildest dreams did I think it’d be as simple as adding 5–6 lines of code to the preexisting model.
After reading this article, I want you to know that anyone can train on accelerated hardware regardless of experience.

Some observations about TPU training

The Neural Machine Translation model I had written (for the TensorFlow Docs) took me less than a minute to train on a TPU. I was astonished because the same model took more than 4 hours to train on a CPU (which probably explains why my laptop crashed and ran out of memory twice).
My model hit a very high accuracy — higher than what I achieved training on a CPU. With performance and speed, the Cloud TPU is second to none when it comes to quality training!
Note: You can find my code here. Have fun!

In a nutshell

Google always comes bearing gifts when it launches new Machine Learning toys that we can tinker and play around with. The TPU processor is certainly a boon to the ML community as it plays a major role in the democratisation of AI — it gives everyone a chance to use accelerated hardware regardless of demographic. With TPUs, research and experimentation are hitting all-time highs and people are now engaging in Machine Learning like never before!
I hope this article has helped you train your models in a much more efficient and elegant manner. If you have any questions about the use of TPUs or want to chat in general about Tech and ML, feel free to drop them down below in the comments or catch me on Twitter or LinkedIn. I usually reply within a day.
Until then, I’ll catch you in the next one!
Original article by Rishabh Anand

A call to action…of sorts

Interested in reading about the latest-and-greatest technologies and getting your hands dirty with the recent happenings in Machine Learning, Data Science, and Technology? Do catch my other articles or give me a follow! Your feedback and constant support mean a lot and encourage me to continue writing high-quality content for your learning!
Give me a follow if you love reading about technology, especially Machine Learning and Data Science! You can find me here.
