One of the challenges of tuning a deep neural network's hyperparameters is the time it takes to train and evaluate each set of parameters. If you're anything like me, you often have four or five networks in mind that you want to try: different depths, different units per layer, etc.
Training and evaluating these networks in series is fine if the dataset or parameter space is small. But what if you're evolving a neural network with a genetic algorithm, like we did in our previous post? Or what if one train/eval step takes a day to complete? We need parallelism!
Here’s a super simple way to achieve distributed hyperparameter tuning using MongoDB as a quasi pub/sub, with a controller defining jobs to process, and N workers to do the training and evaluating.
It looks like this: the controller pushes job documents into MongoDB, and each worker pulls a pending job, trains and evaluates the network it describes, and writes the result back.
All the code is available on GitHub here: Super Simple Distributed Hyperparameter Tuning
The system starts with a controller that creates the jobs. How you choose to create jobs is up to you. In our simple example, we just generate a two-layer MLP with a random number of units per layer. It goes something like this:
import random
import time

# Database, gen_network and add_job are defined in the accompanying repo.

def main():
    """Toy example of adding jobs to be processed."""
    db = Database('blog-test')
    while True:
        print("Creating job.")
        network = gen_network(784, 10)  # mnist settings
        add_job('mnist', network, db)
        sleep_time = random.randint(60, 120)
        print("Waiting %d seconds." % sleep_time)
        time.sleep(sleep_time)
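The gen_network and add_job helpers live in the repo linked above. To give a rough idea of what they do, here's a hypothetical version of each; the dict fields and candidate layer sizes are my own stand-ins, not the repo's exact code.

import random

def gen_network(nb_input, nb_output):
    """Describe a random two-layer MLP as a plain dict."""
    return {
        'input_shape': nb_input,
        'hidden_layers': [random.choice([64, 128, 256, 512, 768]) for _ in range(2)],
        'output_shape': nb_output,
    }

def add_job(dataset, network, db):
    """Queue a job document describing which network to train on which dataset."""
    db.add_job({'dataset': dataset, 'network': network})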
Jobs are stored in MongoDB. We wrap the connection in a Database class with methods for adding jobs, claiming pending jobs, and saving results.
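Here's a minimal sketch of what such a class might look like, assuming pymongo; the method names and job schema are illustrative, so check the repo for the real implementation.

from pymongo import MongoClient

class Database:
    """Thin wrapper around a MongoDB collection of jobs."""

    def __init__(self, db_name, host='localhost'):
        self.client = MongoClient(host, 27017)
        self.jobs = self.client[db_name].jobs

    def add_job(self, job):
        """Insert a new job document in the 'pending' state."""
        job['status'] = 'pending'
        self.jobs.insert_one(job)

    def claim_job(self):
        """Atomically claim one pending job so two workers can't grab the same one."""
        return self.jobs.find_one_and_update(
            {'status': 'pending'},
            {'$set': {'status': 'processing'}},
        )

    def save_result(self, job_id, score):
        """Record the evaluation score and mark the job complete."""
        self.jobs.update_one(
            {'_id': job_id},
            {'$set': {'status': 'complete', 'score': score}},
        )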
The DB needs to be remotely accessible to all the workers if you're running this truly distributed. In its current state, the code only accepts a host argument, since I run this on AWS instances that share a security group. If you want to connect to a remote DB, you'll need to implement username/password (etc.) arguments as well. (Pull requests welcome!)
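One simple way to add that, if you go the pymongo route sketched above, is to accept a full MongoDB connection URI instead of a bare host; the credentials and hostname below are placeholders.

from pymongo import MongoClient

def connect(uri='mongodb://username:password@your-db-host:27017/'):
    """Connect using a MongoDB URI, which can carry credentials."""
    return MongoClient(uri)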
As in life, the workers in this setup do all the hard work. You can run as many workers as you'd like, though generally you'd want to limit it to one worker per instance or GPU. When I use this to train my own models on my data, I use a cluster of eight AWS P2 instances, with one of the instances also running the controller and database.
The workflow is as follows: each worker polls the database for a pending job, claims it so no other worker picks it up, builds, trains, and evaluates the network the job describes, then writes the score back before looking for the next job. A sketch of that loop is below.
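Here's a minimal worker loop, assuming the Database methods sketched earlier and a hypothetical train_and_score() helper that builds, trains, and evaluates the network a job describes.

import time

def worker(db):
    """Pull jobs from the database and process them forever."""
    while True:
        job = db.claim_job()
        if job is None:
            time.sleep(30)  # nothing pending yet; poll again shortly
            continue
        print("Processing job %s" % job['_id'])
        score = train_and_score(job['dataset'], job['network'])  # hypothetical helper
        db.save_result(job['_id'], score)

if __name__ == '__main__':
    worker(Database('blog-test'))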
Super simple, right? 👍
Want more posts like this? Give me a follow below and take a look at my other posts. Thanks for reading!