How Jupyter Notebooks played an important role in the incredible rise in popularity of Data Science and why they are its future.

Nowadays, many individuals and teams are flocking to the tools and techniques that enable them to leverage large amounts of data. What makes Jupyter Notebooks so appealing to data scientists? In this article I will dive into some of the underlying trends that have contributed to the success of Jupyter Notebooks and why I decided to build Orchest to leverage and further contribute to its success.

Underlying technologies of data science

Something that is less talked about is the connection between the many advances of machine learning and data science, and the underlying technologies that have been developed over the past decades. Specifically, I'm talking about programming languages such as Python, operating systems like Linux, compiler infrastructure like LLVM, and version control systems such as Git, just to name a few. It's important to realize that fundamental projects like these have enabled the vast growth and advances in machine learning and data science.

The previously mentioned technologies have, among others, created fertile ground for individuals and companies to start leveraging data science tools and techniques. However, in order to leverage these technologies, data scientists need to find a way to use them without requiring a significant time investment.

Hiding complexity

Technological building blocks are crucial when it comes to dealing with complexity. The modern computing stack has done an outstanding job of layering systems to make sure that whenever you want to perform a task, you are not encumbered with the many lower-level implementation details.

Take for example the seemingly simple task of interacting with files. A simple Python snippet executes many low-level operations under the hood in order to give the engineer a high-level, easy-to-use abstraction to interact with files.
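To make the layering a bit more concrete, here is a minimal sketch (not taken from the original article) that writes the same text twice: once through raw file descriptors from Python's standard `os` module, and once through the built-in `open()`. The helper names `write_low_level`, `write_high_level` and `read_text`, as well as the file names, are made up for illustration. Even the "low-level" version still sits on top of many layers of its own, so treat it as a rough picture of what the abstraction saves you, not as a description of what `open()` does internally.

```python
import os
import tempfile


def write_low_level(path: str, text: str) -> None:
    """Write text via raw OS file descriptors (hypothetical helper)."""
    # Ask the OS to create the file, open it write-only and drop old content.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        data = text.encode("utf-8")  # the OS only deals in bytes
        written = 0
        # os.write may write fewer bytes than requested, so loop until done.
        while written < len(data):
            written += os.write(fd, data[written:])
    finally:
        os.close(fd)  # the descriptor has to be released by hand


def write_high_level(path: str, text: str) -> None:
    """Write text via built-in open(), which handles encoding, buffering and cleanup."""
    with open(path, "w", encoding="utf-8") as file:
        file.write(text)


def read_text(path: str) -> str:
    """Read a file back as text (hypothetical helper used for the comparison)."""
    with open(path, encoding="utf-8") as file:
        return file.read()


# Write the same content both ways into a throwaway directory and compare.
with tempfile.TemporaryDirectory() as workdir:
    low_path = os.path.join(workdir, "hello_low.txt")
    high_path = os.path.join(workdir, "hello_high.txt")
    write_low_level(low_path, "hello")
    write_high_level(high_path, "hello")
    same_content = read_text(low_path) == read_text(high_path)

print(same_content)
```

Both functions produce the same file, but only one of them makes you think about flags, bytes, partial writes and cleanup.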
Having the high-level concept and implementation of files available increases programming productivity by orders of magnitude.

file = open('hello.txt', 'w')

For data science, high-level frameworks such as TensorFlow let you define complex layered neural networks with just a few lines of code:

import tensorflow as tf

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10)
])

Jupyter Notebooks are great for hiding complexity by allowing you to interactively run high-level code in a contextual environment, centered around the specific task you are trying to solve in the notebook.

With ever-increasing levels of abstraction, data scientists become more productive, being able to do more in less time. When the cost of trying something is reduced to almost zero, you automatically become more experimental, leading to better results that are difficult to achieve otherwise.

Experimentation driven development

According to Wikipedia [1], Data Science is defined as:

"an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from many structural and unstructured data."

It is exactly the application of scientific methods that requires you to be able to run many experiments in order to validate your hypotheses. The value of tools therefore lies in making this as frictionless as possible.

When data science is compared to traditional software engineering, this point is often overlooked. The iterative and experimental nature of data science makes it fundamentally different from regular software engineering, for which the development process is more planned out and less explorative.

Interactive computing

It is incredibly powerful to get immediate feedback when programming. Computing the outcome of your actions in real time and presenting it in an easily digestible way enables you to quickly draw conclusions about what works and what doesn't. In a great talk by Bret Victor this principle is demonstrated through a clever example of how interactive feedback can help you find the best solution to a game design problem:

Power of immediate feedback: visualizing the consequences of your changes. Let the character make the jump by finding the correct y-velocity.
[2]

I believe that the benefits of immediate feedback have been instrumental to Jupyter Notebooks' rise in popularity. Notebooks enable you to rapidly try ideas and experiment by providing you with immediate feedback when executing snippets of code. Through their cell-based structure and markdown support, they provide a scratchpad for your ideas, which facilitates exploratory work even further.

The Jupyter open source project has pioneered many of the concepts around interactive programming for data science and has built a great community around its ecosystem. To guarantee that Jupyter Notebooks keep improving, and to ensure that they are indeed the future of Data Science, it's important to collaborate and rally around standardized and open source solutions.

How we're contributing with Orchest

To contribute to the collection of open source tools in the data science ecosystem, my co-founder Yannick Perrenet and I decided to start Orchest. Orchest is an open source tool to supercharge your Jupyter workflow. It allows you to create data science pipelines that consist of individual Jupyter Notebooks as pipeline steps, combining the advantages of interactive notebooks with those of data pipelines.

Through our personal experience as data scientists, we have discovered that significant technical complexity arises when doing large-scale data science projects. Our mission is to make it painless and simple to leverage Jupyter Notebooks in cloud-based environments while collaborating with others. By integrating Jupyter Notebooks in Orchest, we believe we can leverage the strengths of notebooks to make them an even better tool for modern data science.

As of today we are still at the very beginning of this journey, with our GitHub project just starting to take shape. We very much welcome contributions and suggestions from the wider community to further develop the software for a broad and diverse data science audience.

In another article we will dive into what exactly our vision is for Orchest. We will give concrete examples of current pain points for data scientists, how we are solving them with Orchest today and how we are planning to address more challenges in the future. Stay tuned!

[1] https://en.wikipedia.org/wiki/Data_science
[2] Bret Victor - Inventing on Principle, https://vimeo.com/36579366

Why Jupyter Notebooks are the Future of Data Science
