Software Developer as Data-Scientist

Written by codonomics | Published 2018/09/04
Tech Story Tags: machine-learning | data-science | statistics | software-development | data-analysis


One advantage a “contemporary” software developer has when jumping onto the data science bandwagon is that a rapid rate of technological change is a given for them, not a source of frustration.

Here is a relatable instance:

I have been dabbling in the data science space on the Python stack for quite a while now. While looking for ways to cut my hyper-parameter tuning time, specifically with scikit-learn’s GridSearchCV, I stumbled upon a YouTube video that showed how the dask-learn library can come to the rescue as an almost drop-in replacement for scikit-learn’s GridSearchCV.

But when I tried to install the library, I realized the hard way that it had changed its name and moved to a different module and space: Dask-SearchCV. It has since moved again and is now included as part of Dask-ML.

Having zeroed in on the right library and double-checked that it was the latest one I needed, I put it to the test. (Putting it to the test? Yeah, as an experienced software developer you learn to trust nothing, not even your own credentials :)

Guess what the test result was? It showed no improvement in the computation time of hyper-parameter tuning. So I checked the video and found it had been published over a year earlier. A lot could have happened in that time, and given the credentials of scikit-learn’s key authors, it is safe to assume that they have tuned the library for better performance over that period as well. If you are keen on the details, scikit-learn can leverage the multiple cores of your system auto-magically when you set the parameter n_jobs=-1 wherever its API supports it.
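To make the n_jobs point concrete, here is a minimal sketch of how one might time GridSearchCV with a single core versus all available cores. The synthetic dataset, the random forest estimator, the tiny parameter grid, and the timed_search helper are all assumptions of mine for illustration; they are not the setup from my original experiment, and the timings will vary from machine to machine.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset (an assumption; swap in your own data).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Deliberately tiny, hypothetical grid so the sketch finishes quickly.
param_grid = {"n_estimators": [10, 20], "max_depth": [2, 4]}


def timed_search(n_jobs):
    """Fit a grid search with the given n_jobs; return it with elapsed seconds."""
    search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid,
        cv=3,
        n_jobs=n_jobs,  # 1 = single process, -1 = use all available cores
    )
    start = time.perf_counter()
    search.fit(X, y)
    return search, time.perf_counter() - start


serial, t_serial = timed_search(1)
parallel, t_parallel = timed_search(-1)

print(f"n_jobs=1:  {t_serial:.2f}s, best params: {serial.best_params_}")
print(f"n_jobs=-1: {t_parallel:.2f}s, best params: {parallel.best_params_}")
```

On a grid this small, the overhead of starting worker processes can outweigh any gain, so treat the numbers as a way to measure your own workload rather than as a benchmark.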

How would you interpret these events? Here is my take: I’m glad, in some sense, that my speed-up experiment failed. It assured me that the scikit-learn team is working hard to keep the library in the great form it is in today, and that I can shift my focus to other ways of reducing computation time, one of which could be to delegate the task to a cluster of nodes/computers.

By the way, if you are not a software developer and you are jumping onto the data science bandwagon in a team that works on a stack like Python, Go, or R rather than a tool stack like SAS or SPSS, I wish you: “Welcome to the chaotic world of modern software development!” It is chaotic and complex, but if you (as a team) can tame the beast, you wield powers that your competitors will envy and fear.

As always, I’m keen to learn from your experience, so do feel free to share your thoughts as comments on this post. At the very least, you can hit the like button and share this post to see how people in your circle respond to it. Remember, the more people engage, the better the ensemble we form together :)


Published by HackerNoon on 2018/09/04