Today, a mass of new software packages and repositories is arriving on the scene, making the data science process more interactive, nuanced, and user-driven than ever before. For evidence of this, just check the Towards Data Science homepage on any given day. In the face of this new wave of choices, it is important to understand the basic structure of development pipelines.
Data scientists have become newly minted developers in their own right. As developers, it is useful to understand the principles that software engineers use to iteratively test, construct, and shape the code they deploy.
In this article, we’ll talk about some often-misunderstood development principles that will guide you toward building more resilient, production-ready development pipelines using CI/CD tools. Then, we’ll make it concrete with a tutorial on how to set up your own pipeline using Buddy.
A Representation of the Modern Data Science Workflow. Created with Notability for iPad.
Understanding the components of development is the first step to understanding which pieces go where and how things fit together. Each of these elements is a crucial building block of the coveted end-to-end pipeline. In this case, “end-to-end” is jargon for “you make a code-level change, and the end user experiences the effect”.
In a nutshell, you as the data scientist would use the development pipeline to push changes from your local machine to a version control tool, and have these changes be reflected in the cloud deployment service for your end users.
Next, we’ll break down an example development pipeline step by step.
However, we’re still missing a crucial piece of the puzzle. Uploading code to GitHub and setting up an AWS deployment is great, but if there are changes or upgrades to that codebase, the AWS deployment will not automatically reflect them. Instead, each new version will have to be deployed manually. Beyond the cost in effort and time, there is also the possibility that your flashy new update breaks the basic functionality of your original dashboard. The risk is compounded when working with a team of data scientists to create a product.
To patch this missing puzzle piece, we introduce the concept of Continuous Integration / Continuous Deployment, abbreviated as CI/CD. This tooling bridges the gap between development and operations through automation: it helps you test your new changes and integrate them into the existing body of work. Buddy Works is an excellent option for this tool when setting up your deployment pipeline.
You might be wondering: how is this going to stop my development pipeline from breaking? Let’s explore the value of using Buddy. This CI/CD tool is really a process that adds testing, automation, and delivery benchmarks to connect your GitHub repository to the cloud configuration.
Buddy functions as a Swiss Army knife when it comes to deployment operations.
Let's examine each element in turn:
Now that we have established the premise of CI/CD and its uses, let’s dive right into a first look at Buddy’s platform and how you can get a basic pipeline off the ground.
First off, head over to Buddy and make an account. It is recommended to use your version control login here, as it will save you the step of connecting your repositories. Either way, connecting your version control software will allow Buddy to connect to any of those repositories.
Buddy conveniently syncs with all of your GitHub repositories, public and private.
We’ll use the demo-uber-nyc-pickups repository for the purposes of this tutorial, which is an interactive dashboard built with Streamlit. After we fork the repository on GitHub, it shows up in our repo list within Buddy. Clicking on the name leads us to the next screen.
Buddy scans your repo metadata to recommend a relevant environment setup.
Here, Buddy has already detected that the repository’s contents contain a Python app and shows us more options for setting up the relevant Python environment. At this step, we also have to select how the pipeline should trigger.
I went with the ‘on push’ trigger on the master branch, so all my latest and greatest changes will be acted on.
After naming the pipeline, we can choose what action will trigger the pipeline. Since we care about deploying new changes to the AWS instance, we can set it to run the pipeline every time a new push is made to the master branch. Alternatively, we can set it to only trigger manually, or even on a timed basis (e.g. every day at 5pm, every Friday, etc).
This is the home for pipeline building. Add new actions to your pipeline either by searching or clicking on icons.
As mentioned, Buddy has detected that our app is written in Python, so we’ll click on that icon first. Here’s where we can configure the environment and choose the relevant Python version (in this case, python3.7). A quick look at the README.md of the project tells us the BASH lines needed to get the app up and running:

pip install --upgrade streamlit
pip install -r requirements.txt
The first line ensures that we are running the latest version of streamlit, and requirements.txt contains the remaining dependencies we need to run our app.

At the bottom, we can also see the Exit Code Handling section, which lets us define behavior in case of errors at any step in the pipeline. We can either soldier on (not recommended for obvious reasons), stop the pipeline where it broke and send a notification that something went wrong, or try running different commands. Identifying where something has broken is perhaps the most frustrating part of fixing a broken process. Proactively setting error-handling behavior and notifications as a priority will keep frustration to a minimum when some element inevitably breaks.
The build commands allow you to write any BASH or SH scripts you need to get the environment set up right.
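Because the exit code of each build command decides whether the pipeline continues, one handy pattern is to add a small smoke-test script of your own as an extra build step (for example, invoked with python smoke_test.py after the pip commands). Below is a minimal sketch; the file name and the checks inside are hypothetical and should be adapted to your app:

```python
# smoke_test.py: hypothetical pre-deployment sanity check.
# A nonzero exit code makes the build step fail, so Buddy's
# Exit Code Handling can stop the pipeline and send a notification.
import sys

def main() -> int:
    try:
        # Confirm the core dependencies from requirements.txt import cleanly.
        import streamlit  # noqa: F401
        import pandas     # noqa: F401
    except ImportError as exc:
        print(f"Missing dependency: {exc}")
        return 1
    print("Environment looks good.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

If this script returns 1, the run stops right there instead of pushing a broken build further down the pipeline.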
Next, we’ll try running the very basic pipeline so far and see whether it works. Click “Run Pipeline” in the top left, and sit back as Buddy pulls the latest commit on the master branch, prepares the environment, and executes the BASH setup. On subsequent runs, the cache only updates what has changed, so the process will run faster over time.
Raw Logs allow you to see exactly what happened during execution, and the timer is a convenient method for estimating runtimes.
Awesome! The build is complete and without errors. If you are following along with the tutorial to this stage and faced errors, check that the Python version is exactly python3.7, because that is required for this particular app’s dependencies.
Testing is a central element in software development, but it is unfortunately not prioritized or taught in most data science curriculums.
“Unit tests give you the confidence that your code does what you think it does”
Adding unit tests can be as simple as adding Python files to the same repository. In order to run these tests, we’ll return to Step 3: Building the environment, and add a new line there to run the tests.
Adding a test command (such as “python -m unittest”) to the build step will run your tests when the environment is built. If all tests pass, the pipeline will continue.
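As a sketch of what such a test file might contain, here is a minimal example; the helper function filter_by_hour and the column names are hypothetical stand-ins for whatever logic your dashboard actually uses:

```python
# test_app.py: illustrative unit tests, picked up by "python -m unittest" discovery.
import unittest
import pandas as pd

def filter_by_hour(df: pd.DataFrame, hour: int) -> pd.DataFrame:
    """Hypothetical helper: keep only pickups that occurred in a given hour."""
    return df[df["hour"] == hour]

class TestFilterByHour(unittest.TestCase):
    def test_keeps_only_requested_hour(self):
        df = pd.DataFrame({"hour": [0, 1, 1, 2], "pickups": [5, 3, 7, 1]})
        result = filter_by_hour(df, 1)
        self.assertEqual(len(result), 2)
        self.assertTrue((result["hour"] == 1).all())

if __name__ == "__main__":
    unittest.main()
```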
When the tests have been implemented, this is where we would expect to see the results. In this case, error handling setup becomes particularly important, as Buddy can share notifications if some tests fail.
If all tests pass, the run will end with “Build finished successfully”.
Adding notifications is critical to knowing where breaks in the pipeline occur, or which tests have failed. From the pipeline overview, click on the “Actions Run On Failure” section, where we can decide what actions will run if there is an error anywhere in the pipeline. For our purposes, it will be sufficient to set this up using environment variables that indicate which execution or test broke the pipeline.
$BUDDY_PIPELINE_NAME gives us the name of the pipeline that broke.

$BUDDY_EXECUTION_ID gives us the unique identifier of the pipeline execution that produced the error.

$BUDDY_FAILED_ACTION_LOGS gives an extensive overview of the logs of what went wrong, which is convenient for diagnosing any issues that pop up. It may even let you solve the issue just by glancing at the email, fixing the code, and making a new commit to patch it, without needing to visit the CI/CD tool at all.

An extensive array of environment variables is available, and more can be defined with ease.
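As a sketch of how these variables can be pulled together, here is a small Python helper that could run as a failure action and print a compact summary for the notification. It assumes the variables above are exposed in the action’s environment; the script itself is illustrative and not part of Buddy:

```python
# report_failure.py: illustrative failure-action script.
# Reads Buddy's environment variables and prints a short summary
# that can be included in a notification email.
import os

pipeline = os.environ.get("BUDDY_PIPELINE_NAME", "<unknown pipeline>")
execution = os.environ.get("BUDDY_EXECUTION_ID", "<unknown execution>")
logs = os.environ.get("BUDDY_FAILED_ACTION_LOGS", "")

print(f"Pipeline failed: {pipeline} (execution {execution})")
print("Tail of the failed action's logs:")
print(logs[-2000:])  # keep the message short by printing only the end of the logs
```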
Below is a sample message sent to my email, informing me that there was something amiss with the build environment, and sharing the relevant logs. With additional environment variables, you can make this output specific enough to home in on the error straight away.
Here, the issue is clearly that my build environment is deprecated, so I need to choose a Python version that is still maintained. In this case, that’s python3.7.
This is the last step we will take in setting up the pipeline. By connecting this pipeline to a free-tier AWS EC2 machine, we will arrive at an end-to-end pipeline, as per the overall goal.
In order to do this, select the SFTP action and make the connection between Buddy and the public IPv4 address of the EC2 machine. Using the Pipeline Filesystem is important here because it ensures the deployment uses the files that were just tested.
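Before wiring the credentials into Buddy, it can be worth sanity-checking the connection details from your own machine. Below is a minimal sketch using the paramiko library (an extra dependency not otherwise used in this tutorial); the hostname, username, and key path are placeholders, and Buddy’s SFTP action handles the real transfer for you:

```python
# check_ec2_sftp.py: optional local sanity check of the EC2 connection details.
# All connection values below are placeholders; substitute your own.
import os
import paramiko

HOST = "ec2-203-0-113-10.compute-1.amazonaws.com"  # public IPv4 / DNS of the EC2 machine
USER = "ubuntu"
KEY_PATH = os.path.expanduser("~/.ssh/my-ec2-key.pem")

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(HOST, port=22, username=USER, key_filename=KEY_PATH)

sftp = client.open_sftp()
print(sftp.listdir("."))  # if this lists the home directory, the credentials work
sftp.close()
client.close()
```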
Here, I’ve entered my Hostname & Port, and Login information, as well as used my Private SSH key to actually give Buddy access to the EC2 machine. There are two caveats to mention here:
Having completed this final step, we can run the pipeline by simply making a change at the code-level, and then committing the change to the master branch.
git commit "app.py" -m "Buddy cicd test"
git push
Hooray! We set up the environment correctly and pushed our changes from the local machine to GitHub. Buddy then built the code, ran the unit tests, and uploaded the result to the EC2 machine, where the changes were reflected in our visualization.
Let’s take a look at the final product:
You can also visit Streamlit’s version of this app here.
This is the front-end visualization, powered by Streamlit. To review, we’ve taken Python code and committed it to a versioning tool (in this case, GitHub). The repo is then linked to a CI/CD tool (Buddy), which syncs, tests, and integrates our commits into the overall build, hosted on an AWS EC2 machine.
In conclusion, every time we make a new commit to GitHub, this will trigger a Buddy pipeline execution that will build the Python environment, run the unit tests, and deploy the tested files to the EC2 machine over SFTP. In the event of any errors or snags along the way, we’ll receive an email highlighting exactly what went wrong. With this level of detail and refinement, Buddy’s CI/CD tooling has elevated our deployment of data science platforms and made it easier than ever to maintain user-driven products.
Happy coding.
Full disclosure: this is a sponsored article by Buddy. I do use Buddy CI/CD in my projects, and have leveraged their technology to develop and deliver end-to-end pipelines to a number of data engineering clients.
Saif Bhatti is a data scientist and co-founder of Go Fish Analytics. Say hello on Twitter.
It’s worth noting that the specifics of the EC2 setup are not in the purview of this article, but some helpful advice and content on setting up Streamlit applications on AWS is available below.
Also published on Medium.