We’re proud to announce that we’ve finished our first full-length course called Meeshkan: Machine Learning the GitHub API. The course is available on Udemy and you can follow it free of charge.
The basic idea of the tutorial is the following. Let's say you're a venture capitalist, a talent hunter, or someone who offers a service to developers, and you'd like to discover exciting new projects on GitHub so you can invest in them, hire their authors, or strike up a partnership. What types of projects will really take off? Maybe Machine Learning can help us hone our intuition about projects on GitHub.
Meeshkan has a crush on the Octocat.
To cut to the chase: by the end of the tutorial, you'll see that [drumroll] dense, fully connected neural networks learn faster about stars than forks, and models perform surprisingly well when measuring mean-squared-error loss between actual and predicted star counts in addition to categorical cross entropy for star counts above a certain threshold. This holds even after introducing satisficing criteria to account for the skewed distribution of some targets (i.e. a threshold of stars > 80000 ? 1 : 0 will put only a handful of projects in the 1 category while the rest of us mortals hang out with the 0s). The way we figure this out is by feeding Meeshkan lots of different webhooks that serve data collected from the GitHub API, uploading various Keras models to Meeshkan, and passing the results to a small express server in order to visualize how our models are performing.
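If you're curious what that looks like in code, here is a minimal Keras sketch; the feature dimension (32), the layer sizes, and the two-headed design are my illustrative guesses, not the tutorial's exact architecture:

# A minimal sketch, assuming the webhook serves a fixed-length feature
# vector per repo. Sizes and the two-headed design are illustrative.
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense

features = Input(shape=(32,), name="repo_features")
hidden = Dense(64, activation="relu")(features)
hidden = Dense(64, activation="relu")(hidden)

# Regression head: raw star count, scored with mean squared error.
star_count = Dense(1, name="star_count")(hidden)
# Classification head: above/below the star threshold, scored with
# categorical cross entropy over two softmax classes.
above_threshold = Dense(2, activation="softmax", name="above_threshold")(hidden)

model = Model(inputs=features, outputs=[star_count, above_threshold])
model.compile(
    optimizer="adam",
    loss={"star_count": "mse", "above_threshold": "categorical_crossentropy"},
)
model.summary()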
Results displayed in our data visualizer, which you can find at https://github.com/meeshkan/data-visualizer. Our model learns pretty darn fast but then plateaus. Epochs on the X axis, loss on the Y axis. I ran a first batch on a dataset of around 200 000 data points, which was obviously overkill. The lowest line is predictions for a higher star-count threshold (0 for under 1500, 1 for over), whereas the upper two are for 300 stars.
The course takes a day or so to complete and by the end you’ll not only have done some Machine Learning, but you’ll have deployed an entire environment to Amazon Web Services that will automatically ingest publicly-available API data for Machine Learning. If you want to modify the tutorial to do learning about one of the hundreds of open APIs on the internet, go for it! I’m excited to see what you learn :-)
In this article, I’d like to talk about the making of the tutorial and, specifically, reveal a few neat things I learned about using Meeshkan to analyze results of large data-collection projects on public APIs.
308,550. That is the number of EC2 t2.micro server instances that were spawned on Amazon to analyze and inspect the GitHub API. The batch collected data on 995 977 repos and a whopping 14 329 073 commits. The average lifespan of a server was 203.45 seconds, and we bid 0.0040 USD per hour per server. So, when you do the math, 308550 * 203.45 * 0.0040 / 3600 = $69.75 to collect all that data. The main reason I let the job run so long is that parts of the setup (EC2, GitHub, MySQL, the tutorial's code) can sometimes flake out unexpectedly, which means that you won't get the number of commits you need to do machine learning. For example, only 125 082 of the repos in the database have over 50 commits. If I run:
SELECT full_name, stargazers_count, COUNT(*) AS commit_count
FROM repos
JOIN commits ON repos.id = commits.repo_id
GROUP BY repos.id
ORDER BY commit_count ASC, stargazers_count DESC
LIMIT 5;
I get:
+--------------------+------------------+--------------+
| full_name          | stargazers_count | commit_count |
+--------------------+------------------+--------------+
| impress/impress.js |            32868 |            1 |
| Automattic/kue     |             6852 |            1 |
| cdnjs/cdnjs        |             5801 |            1 |
| square/cube        |             3871 |            1 |
| enyojs/enyo        |             1941 |            1 |
+--------------------+------------------+--------------+
5 rows in set (7.51 sec)
In other words, false positives, as none of those repos have only 1 commit. But even when we only study repos with more than 20 commits, 100 000 data points is more than enough to get us started.
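If you want to run that sanity check yourself, here is a hedged sketch in Python: the connection details are made up, but the schema (repos, commits, commits.repo_id) matches the queries in this post.

# Count repos whose collected history has more than 20 commits.
# Connection parameters below are placeholders; adjust for your setup.
import pymysql

QUERY = """
SELECT COUNT(*) FROM (
    SELECT repos.id
    FROM repos
    JOIN commits ON repos.id = commits.repo_id
    GROUP BY repos.id
    HAVING COUNT(*) > 20
) AS well_sampled;
"""

connection = pymysql.connect(host="localhost", user="root",
                             password="", database="github")
try:
    with connection.cursor() as cursor:
        cursor.execute(QUERY)
        print("repos usable for training:", cursor.fetchone()[0])
finally:
    connection.close()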
Our database is ingesting a blazing 40 repos per second and around 1000 commits per second! You'll learn how to deploy this setup in a few clicks in the Udemy tutorial.
Also, there are some fun diamonds in the rough. The command:
SELECT full_name, size, stargazers_count
FROM repos
WHERE size = 2
ORDER BY size ASC, stargazers_count DESC
LIMIT 5;
yields:
+-------------------------------+------+------------------+
| full_name                     | size | stargazers_count |
+-------------------------------+------+------------------+
| atg/chocolat-public           |    2 |              197 |
| edankwan/Jesus.js             |    2 |              114 |
| bancek/django-smtp-ssl        |    2 |               84 |
| boucher/stripe-webhook-mailer |    2 |               81 |
| tlatsas/bash-spinner          |    2 |               50 |
+-------------------------------+------+------------------+
5 rows in set (0.45 sec)
Yes, https://github.com/edankwan/Jesus.js is really a repo, yes, it has almost nothing in it, and yes, it has managed to rack up 114 stars. Actually, now 115 with one from me :-)
If you’re not following the tutorial on Udemy and have at least a Yellow Belt in JavaScript and AWS infrastructure, you can deploy this all yourself with just a few clicks from https://github.com/meeshkan/github-tutorial-stack. Pull requests are welcome!
This is the number of webhooks that I passed to Meeshkan to get the results for this tutorial. Because the webhook generates the data dynamically based on the path, it is really easy to keep the model the same but change the webhook.
A screencast taken as I was putting together the tutorial: launching a (batch) job in Meeshkan takes less than 30 seconds!
For example, the same model can be used to analyze how many distinct authors there are over a 3- or 10-commit window just by changing a parameter in the webhook. The model doesn't change at all. For each webhook, I uploaded between 3 and 5 models, all of which run in parallel, so we see the download icon for the first results in minutes, and the whole thing finishes up within a few hours for the longest jobs, at the Meeshkan network's very reasonable price of 0.08 USD per hour.
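The tutorial's real server is a small express app, but the path-parameterized idea is easy to sketch. Here is a hypothetical Flask version where the commit-window size lives in the URL; the route scheme and the toy data are made up for illustration.

# Hypothetical sketch of a path-parameterized data webhook.
from flask import Flask, jsonify

app = Flask(__name__)

# Toy stand-in for the MySQL-backed commit data the real webhook serves.
COMMIT_AUTHORS = ["ava", "bo", "ava", "cy", "bo", "ava", "dee", "cy"]

@app.route("/github/distinct-authors/<int:window>")
def distinct_authors(window):
    # Count distinct authors over consecutive windows of `window` commits.
    # Changing the number in the URL (3, 10, ...) changes the dataset the
    # model sees, while the model itself stays exactly the same.
    window = max(1, window)
    counts = [
        len(set(COMMIT_AUTHORS[i:i + window]))
        for i in range(0, len(COMMIT_AUTHORS), window)
    ]
    return jsonify({"window": window, "distinct_author_counts": counts})

if __name__ == "__main__":
    app.run(port=3000)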
Four. That's the number of things I can think of off the top of my head to do after you finish following the course.
For example, instead of tutorials.meeshkan.io/github/80_10_10_/train/..., I use tutorials.meeshkan.io/github/80_10_10_/validate/..., and voilà, we are on our validation set (there's a sketch of this swap below). This way, we'll know if we need to regularize our model. Because you are not prematurely regularizing, right? RIGHT??!? No, wait, there's a fifth thing… you should crack open a nice bottle of Penfolds Grange Hermitage 1951, because you're going to be making bank with your awesome AI ninja skills after you take our Udemy course :-)
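To make that train/validate swap concrete, here is a purely illustrative local sketch. In Meeshkan itself you just point the uploaded model at the other URL; the fetch_dataset helper and the {"x": ..., "y": ...} payload shape are my assumptions, and the trailing ... stays whatever the tutorial's videos spell out.

# Same model, two webhook URLs: only one path segment changes.
import numpy as np
import requests
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

def fetch_dataset(url):
    # Hypothetical helper: assumes the webhook returns {"x": [...], "y": [...]}.
    payload = requests.get(url).json()
    return np.array(payload["x"]), np.array(payload["y"])

BASE = "https://tutorials.meeshkan.io/github/80_10_10_"
x_train, y_train = fetch_dataset(BASE + "/train/...")     # training split
x_val, y_val = fetch_dataset(BASE + "/validate/...")      # one segment changed

model = Sequential([Dense(64, activation="relu"), Dense(1)])
model.compile(optimizer="adam", loss="mse")
model.fit(x_train, y_train, epochs=10)
print("validation loss:", model.evaluate(x_val, y_val))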
The Meeshkan Public Beta is itching to do your Machine Learning! We are a small company with a big heart that is taking on the likes of Google and Amazon by offering a low-cost Machine Learning sandbox where anyone can explore new ideas. Here are some great things to do once you sign into Meeshkan:
Use https://tutorials.meeshkan.io/github/... to get started, where the ... part is explained in video 3 of the tutorial, using a model from video 5 of the tutorial.

Thank you very much for checking out our Machine Learning service. We think you'll like it a lot, and we are working every day to make it faster, cheaper and easier to use.
Did I mention you get 100 free hours of Machine Learning and a free fifteen-minute consultation to get your ML job up and running? See you on Meeshkan!