My most recent post about analyzing my university’s gym crowdedness over the last year using machine learning generated a lot of great responses, including:

- How neat! I always wanted an excuse to never go to the RSF [our gym] again.
- Can you do this for our other campus locations as well?
- How do I do something like this myself?

First, I’m sorry you feel that way about the gym and I sincerely hope the oncoming wave of New Years Resolutioners doesn’t completely crush your dreams of working out forever.

Second, yes. I’m in the process of creating more predictive models for the other ten campus locations Packd tracks. Machine learning requires a lot of data, so it may take some time before the models are ready for training.

Lastly, great question! In my last post I hinted about how exactly you can do this yourself: I’ve deployed my Random Forest Regressor on Algorithmia to generate predictions as an API call.

I’ve since realized this requires a lot more explanation, and it’s the subject of this post. My process may not be the best, but hopefully by illustrating it, I can learn from my mistakes and give you a starting point.

The rest of this post assumes you have basic knowledge of machine learning as a concept, you know Python, and you’re familiar with APIs. We’ll use scikit-learn for this tutorial.

Developing a Model Locally

The first step is to create a machine learning model locally (on your computer) to test things out. The data for this is located here for downloading. You’ll need to install:

- NumPy (pip install numpy)
- pandas (pip install pandas)
- SciPy (pip install scipy)
- scikit-learn (pip install scikit-learn)

We’ll need to create a prediction model using the data. The following is taken from my kernel over at Kaggle:

    import numpy as np  # linear algebra
    import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("data.csv")  # or wherever your data is located

    # Extract the training and test data
    data = df.values
    X = data[:, 1:]  # all rows, no "number_people"
    y = data[:, 0]  # all rows, "number_people" only
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

    # Standardize the features (zero mean, unit variance),
    # using statistics computed from the training set only
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

    # Create your prediction model
    model = RandomForestRegressor(n_estimators=140)
    model.fit(X_train, y_train)

Let’s pause here for a second. I’ve breezed over some important details in the code above. First of all, I didn’t have to modify any of my data (do things like convert column values of “yes / no” to 1/0) because I already did that part behind the scenes. Second, choosing the correct model is not a trivial thing, and I seem to have pulled the Random Forest Regressor out of thin air. (I didn’t really; check my first post for details.) Last, why did I set n_estimators to 140? Normally we’d want to do a Grid Search over our hyperparameters to find the optimal n_estimators, but I did the hard work behind the scenes already (it takes a while).

The last line is the most important: it actually performs the learning part of the algorithm. It changes the state of the variable model. From here on out, we can throw new data at the model to make predictions:

    test_output = model.predict(X_test)  # predicted values
    score = model.score(X_test, y_test)  # R^2 score, around 0.90
    print(test_output)

Here are the predicted number of people for the test data:

    array([ 44.92142857, 54.22142857, 45.35714286, ..., 47.95714286,
           63.27142857, 40.22142857])

Great!
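A quick note on that score: for a regressor, model.score returns the coefficient of determination (R^2), not a classification accuracy. As a rough sketch of what that number measures, here is the formula in plain Python. The function name r_squared and the sample numbers are made up for illustration and are not taken from the gym data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Squared error of the predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always guessing the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical head counts, not real gym data.
actual = [40.0, 50.0, 60.0, 70.0]
predicted = [42.0, 49.0, 61.0, 68.0]
score = r_squared(actual, predicted)
print(round(score, 3))
```

A score of 1.0 means the predictions match exactly, and 0.0 means they are no better than always predicting the mean.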
This isn’t really predicting the future yet: the test data was actually a quarter of the original historical data (more on the test/train data split). If we want to really predict the future, we need to do a lot more work.

In order to predict how crowded the gym will be in the future, we need a new array of column values where everything except the target label, number_people, is filled in. The details of how to fill these in will probably take another post, but the general idea is to create several datetimes in the future, and for each one, add each feature value (the timestamp, day of the week, will it be a weekend, will it be a holiday, will it be the start of the semester, etc). For weather feature values, I used the Dark Sky API to make forecasts.

Let’s say you now have an array of times in the future for which you want to predict how crowded the gym will be. Here’s how you do it:

    X_predict = generate_prediction_data()  # your special call here
    X_predict = scaler.transform(X_predict)  # reuse the scaler fitted on the training data
    predictions = model.predict(X_predict)

Congratulations! The variable predictions now holds the predicted number of people for each of your datetimes in the future.

Now, if you wanted, you could generate predictions for the rest of the year, or the century, or whenever, and call that your predicted values. However, this seems like it would take a huge amount of memory and probably wouldn’t be very accurate the further in the future you go. It would be much better to store the model we created and only bring it out when we want to make predictions. Here’s how in sklearn:

    import joblib  # older scikit-learn releases shipped this as sklearn.externals.joblib

    # How to save the model:
    joblib.dump(model, "prediction_model.pkl")

    # You can quit your Python process here, even shut off your computer.

    # Later, if you want to load the saved model back into memory:
    model = joblib.load("prediction_model.pkl")

It’s easy to store already-trained models in scikit-learn.
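The generate_prediction_data() call earlier was left as a placeholder. As a rough sketch of the general idea (a handful of future datetimes, each expanded into feature values), something like the following could work. The column order, the hourly step, and the HOLIDAYS set are assumptions for illustration, not the layout of the real dataset, and the weather columns are left out entirely.

```python
from datetime import date, datetime, timedelta

# Hypothetical holiday calendar; a real one would come from the campus schedule.
HOLIDAYS = {date(2017, 1, 16)}


def feature_row(moment):
    """Turn one datetime into [seconds since midnight, weekday, is_weekend, is_holiday]."""
    seconds = moment.hour * 3600 + moment.minute * 60 + moment.second
    weekday = moment.weekday()  # Monday is 0, Sunday is 6
    is_weekend = 1 if weekday >= 5 else 0
    is_holiday = 1 if moment.date() in HOLIDAYS else 0
    return [seconds, weekday, is_weekend, is_holiday]


def generate_prediction_data(start, hours):
    """Build one feature row per hour, beginning at `start`."""
    return [feature_row(start + timedelta(hours=h)) for h in range(hours)]


rows = generate_prediction_data(datetime(2017, 1, 14, 22, 0), hours=4)
for row in rows:
    print(row)
```

Rows built this way would still need the same scaling as the training data before being passed to model.predict.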
Of course, your users can’t call you on the phone every time they want predictions so you can generate them on your computer. They’ll need to make an API call of some kind to a service you provide that does the predicting.

Storing Your Model in the Cloud

You could store the prediction_model.pkl file on your server and load it whenever someone makes a request for predictions, but it turns out loading this file uses up a lot of memory. In my case, more than I was allotted by my hosting company. It’s better to host your machine learning model on an entirely different server. You have a few options here from big companies, including Google, Microsoft, and Amazon.

However, I decided to go with a smaller, lesser known company to try it out. Algorithmia is a relatively new company based in Seattle that specializes in “algorithms as a service”. Users of the platform can create code snippets and host them on Algorithmia and call their code as an API. In addition, these developers can charge other users for access to their code. It’s an interesting business idea and I’m eager to see how it works out, but for now I’m just using the hosting service idea.

After you make an account, you’ll want to upload your prediction_model.pkl to the data section. In my data section, I have several prediction models, each one ending in the .pkl extension.

Uploading is as easy as drag and drop, and takes a few minutes if you have larger prediction models (> 500 MiB). The next step is to create an algorithm that uses your data to make predictions.

Make Predictions as a Service Call

Under the “More” tab at the top of Algorithmia, you’ll find the Add Algorithm button. Next you need to fill in the details of how your algorithm operates. I enabled “full access to the internet” and “can call other algorithms” just in case. This means my algorithm can call external web APIs, like the Dark Sky weather API, and call other algorithms I create on Algorithmia.
Algorithmia provides a code editor online for you to edit your algorithm. Once you create your algorithm, you’ll be brought to that editor.

Every algorithm consists of an apply method that takes one input, conveniently named input. The input variable is whatever you define it to be, since you’ll be the one calling the algorithm. Usually, input will be a 2-dimensional array of your features, where each row is missing the column you want to predict. For me, that’s number_people. Here’s an example input:

    input = [[1.072308379173353, 0.772801541291309, 0.0, 0.0, 0.24191249808036538, -0.36311268827583276, 0.0, 0.0, 0.0, 1.034839680120056], ...]

If we throw input at the model I created locally, we’ll get back a list of the predicted number_people for each feature row, just like we want. The only thing left is to mirror this behavior as a service call. Let’s finish the code on Algorithmia.

Before continuing, make sure you add the correct dependencies to your code. Since we’re using scikit-learn, that’s what we need to add.

The finished source code is very short, but it will work. When a new correctly formatted input is passed to this algorithm, the code will find the path of the prediction_model.pkl file we uploaded earlier, load it into memory using joblib, and finally make the prediction and return it. Click “Save”, “Compile”, then “Publish” to finish your algorithm.

You’ll end up on your algorithm’s page, which is the finished result. If you scroll down to “Use this Algorithm” and click on the Python tab, you’ll see how easy it is to make a call to your algorithm.

    import Algorithmia

    input = <INPUT>
    client = Algorithmia.client(<Your Algorithmia Key here>)
    algo = client.algo('nsrose/testalgo2/')
    print(algo.pipe(input))

This will work anywhere, from any computer that has Python and Algorithmia installed (pip install algorithmia). All you have to do is make sure you format your input correctly!

Why is This a Good Thing?
- Separation of ML concerns from the rest of your app: By calling and making predictions as an API call, your app can be free of the concerns of machine learning mechanics. All it needs to do is create correctly formatted inputs.
- Ease up on memory consumption: Loading *.pkl files into memory is a heavy task and probably not suitable for small applications on limited servers. Outsourcing this job to Algorithmia is a good alternative.
- Source control your ML code: Algorithmia recently added support for Git repos, so you can edit your code offline and push the changes.

Suggestions for Improvement

- Add debugging tools to the online code editor
- Add integrations (Slack notifications when someone uses your algorithm perhaps)

Thanks to the folks at Algorithmia who helped answer my questions. I think you have a good product and I hope it takes off!
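One last practical note on formatting the input correctly: a small check on the caller’s side can catch malformed rows before they ever reach the algorithm. This is only a minimal sketch. The helper validate_input and the default of 10 columns (the length of the example row shown earlier) are assumptions for illustration and are not part of the Algorithmia client.

```python
from numbers import Real


def validate_input(rows, n_features=10):
    """Raise ValueError unless rows is a non-empty 2-D list of numbers with n_features columns."""
    if not isinstance(rows, list) or not rows:
        raise ValueError("input must be a non-empty list of feature rows")
    for index, row in enumerate(rows):
        if not isinstance(row, list) or len(row) != n_features:
            raise ValueError("row %d must be a list of %d values" % (index, n_features))
        for value in row:
            # bool is a subclass of int, so reject it explicitly
            if isinstance(value, bool) or not isinstance(value, Real):
                raise ValueError("row %d contains a non-numeric value" % index)
    return rows


# Hypothetical, already-scaled feature row with 10 columns.
good = [[1.07, 0.77, 0.0, 0.0, 0.24, -0.36, 0.0, 0.0, 0.0, 1.03]]
validate_input(good)  # passes silently

try:
    validate_input([[1.0, 2.0]])  # too few columns
except ValueError as error:
    print(error)
```

Running a check like this right before the call to the algorithm keeps formatting mistakes on your side of the API, where they are easier to debug.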