According to a recent StackOverflow analysis, Python is the fastest-growing programming language among those already in wide use. What’s more, its growth rate is accelerating, and has been for the past few years. While the specific conclusions of the StackOverflow post should probably be taken with a grain of salt, there’s no denying that Python usage has exploded over the past five years. That’s great news for people who have long used and enjoyed the language, but it’s still worth asking why. And the primary, if not sole, facilitator of this growth is a feature of the language you’ve probably never even heard of.
With the rise of Big Data, most industries found themselves in a scary situation: they had spent an enormous amount of time and money building out their Big Data pipelines, but they were seeing little return on the investment. In the breathless race to capture ever-increasing volumes of data, most companies had no firm plan for what to do with the data once they had it. At the time, everyone assumed that by storing huge amounts of data, analysis would be simple and valuable business insights would be all but self-evident. It may sound silly today, but most thought the patterns in the data would become obvious once enough of it was captured.
Unfortunately, that’s not what happened.
Instead, the industry collectively realized, almost simultaneously, that the kinds of non-trivial insights they hoped to glean and questions they hoped to answer required rigorous mathematical analysis and validation. SQL queries might uncover the most obvious patterns and trends, but the really juicy stuff required an entirely different skill set, one firmly rooted in statistics and applied mathematics that no one outside of academia seemed to possess. What’s more, a person charged with analyzing these enormous data sets would need not only a very strong math background but also the ability to write software.
It should come as no surprise, then, that the title “Data Scientist” started appearing all over both job sites and resumes, though it would be a few years before anyone attempted to nail down, with any rigor, what exactly a Data Scientist did. At the time, it was closer to shorthand for “a person competent in both statistical analysis and programming.”
Rewind a bit further, before Big Data was a real “thing”, and you would have seen a heated battle between Ruby and Python to become “the language of the web”. Both proved well suited to developing web applications. Ruby’s popularity was intimately tied to the Rails framework; few would dispute that most programmers who self-identified as “Ruby programmers” around this time might as well have just said “Rails programmers”. Python was already reasonably well entrenched in academia and a handful of disparate industries. The closest Python equivalent to Rails was Django. Despite being released slightly ahead of Rails, it seemed to lag behind in popularity by a wide margin.
Many felt that the two languages were similar enough in expressiveness and approachability that one would ultimately “win” the web. But the implications of such a victory differed fundamentally: while Ruby’s popularity was closely intertwined with that of Rails, Django represented a comparatively small slice of an already vibrant Python ecosystem. Ruby, it seemed, needed Rails to “beat” Python to guarantee its continued popularity. And in many ways, Rails did just that.
It just turned out to be the case that the “web wars” mattered far less than anyone anticipated.
To understand why, we’ll need to go all the way back to 2006, when Travis Oliphant was still an assistant professor at BYU and not yet the co-founder of Anaconda (née Continuum Analytics), one of the most successful commercial data science platforms built entirely on Python. A year prior, he had started the NumPy project, based loosely on an earlier scientific computing library, Numeric. He would eventually go on to be a founding contributor to SciPy and even serve as a director of the PSF. But in 2006, he submitted (along with Carl Banks) PEP 3118, a revision to Python’s “buffer protocol”.
The buffer protocol was (and still is) an extremely low-level API for direct manipulation of memory buffers by other libraries. These are buffers created and used by the interpreter to store certain types of data (initially, primarily “array-like” structures where the type and size of the data were known ahead of time) in contiguous memory.
The primary motivation for providing such an API is to eliminate the need to copy data when it is only being read, to clarify ownership semantics of the buffer, and to store the data in contiguous memory (even in the case of multi-dimensional data structures), where read access is extremely fast. The “other libraries” that would make use of the API would almost certainly be written in C and be highly performance-sensitive. The new protocol meant that if I create a NumPy array of ints, other libraries can directly access the underlying memory buffer rather than going through a layer of indirection or, worse, copying that data before it can be used.
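To make the zero-copy idea concrete, here is a minimal sketch from the Python side, assuming only that NumPy is installed; the array size and variable names are purely illustrative:

```python
import numpy as np

# One million 64-bit integers stored in a single contiguous buffer.
a = np.arange(1_000_000, dtype=np.int64)

# memoryview() goes through the buffer protocol (PEP 3118): no bytes are
# copied; we simply get a second handle on the same block of memory.
view = memoryview(a)
print(view.contiguous, view.nbytes)   # True 8000000

# np.frombuffer() does the same from the NumPy side: it interprets an
# existing buffer as an array without copying it.
b = np.frombuffer(a, dtype=np.int64)

a[0] = 42
print(b[0])   # prints 42, because both names refer to the same memory

# By contrast, a.tolist() copies every element into new Python int objects,
# which is exactly the overhead the buffer protocol lets consumers avoid.
duplicated = a.tolist()
```

In practice the consumer on the other end of that buffer is usually C extension code rather than Python, but the effect is the same: one allocation, any number of readers.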
And now, to bring this extended trip down memory lane full circle, a question: what type of programmer would greatly benefit from fast, zero-copy memory access to large amounts of data?
Why, a Data Scientist of course.
So now we see the whole picture: the buffer protocol lets libraries share large blocks of numeric data without copying it, NumPy arrays expose their memory through that protocol so performance-critical C code can operate on the data directly, and Data Scientists, the people who needed exactly that combination of mathematical tooling and programmability, found in Python a language that was already built for the job.
In the second part of this article, I’ll explain what I’ve been up to, Python-wise, for the past few years and how it directly ties into the story above and the rise of the Data Scientist.
Originally published at jeffknupp.com by Jeff Knupp on September 15, 2017.