How Data Scientists Start Automating Their Tasks With Python

by Joleen Bothma, July 17th, 2021

Task automation has become an invaluable tool in my arsenal as a data scientist. My only regret is not getting started with automation sooner.


In the past, I felt intimidated by automation because I assumed setting it up would be intensive and tedious. I have come to learn that automation does the exact opposite for me: I am able to get more done and 'work' less because of all the tasks I have automated.


In this blog post, I would like to introduce you to automation with Python and give you the three code snippets I use most. I hope you will walk away with a good place to start in your automation journey, something I wish I had when I was just getting started.

Why should you automate stuff?

Essentially, automation removes tedious, repetitive tasks from your daily work. As data scientists, the most tedious part of our job is preparing and cleaning data, and we should aim to automate these tasks as much as possible. I don't know about you, but the thought of spending even an hour a day just formatting and pre-processing data is the very opposite of my definition of work satisfaction.


The goal of automation is to get our work done faster and without error ('cause, you know, we're still human) so that we can work on stuff that is actually fun and interesting.

How do you decide what stuff to automate?

When deciding what should be automated, you first need to determine what tasks you carry out in your job on a daily, weekly, or monthly basis. Think carefully about all the tasks, big and small, that you do and list them out. Then determine which of them are highly repetitive and circle them (these are tasks that contain a series of identical steps that you need to go through every time you carry out the task).


For example, let’s assume that you have a daily task where you need to convert an Excel file to a csv. This is an extremely repetitive task and although it could be really quick to do, it can quickly become tedious if you have 10, 20 or more files that you need to convert to csv. So we can circle this task on our list as a possible candidate for automation.

The stuff I've automated and lessons learned

To help give you some ideas of what you can automate, I have provided a few code snippets below of the common tasks that I tend to automate for new and existing projects. Each of these tasks can be combined into a single Python script or used on its own; it all depends on what you're trying to achieve.


I have included some example code to get you started if you'd like to perform these same tasks in your own work. My code below is written in Python version 3.9.


I always use Python when writing scripts and automating tasks. Python is one of the most popular programming languages used in data science, and there is no shortage of resources available online if you need help with just about anything. If you don't know how to automate a task, just Google it and I am fairly certain you will find something helpful.

Downloading an attachment from an email (this example is for Gmail only)

In this script, using the ezgmail package you can easily download email attachments and store them in a folder. This is especially useful if you regularly receive reports via email and you need to download them into a specific file path for further analysis.


The search parameters are defined first. The parameters used here are the subject line as well as newer_than, so only recent emails are located.


Libraries: ezgmail

import ezgmail

# Note: ezgmail needs a one-time setup (a credentials.json file from the
# Gmail API) before search() will work -- see the ezgmail docs.

# Define inbox search parameters
threads = ezgmail.search('subject:EmailSubjectLine AND newer_than:1d')

# Download all attachments from the first message of the newest thread
# to a folder in the current working directory
threads[0].messages[0].downloadAllAttachments(downloadFolder='MyPath')

Convert to csv

This script will find all the Excel (.xlsx) files in a folder and store each file path in a list. Then, each file is imported, a filename is created for the resulting csv file and finally the file is exported to csv.


I specify the pipe ("|") delimiter for all my csv files because this is least likely to interfere with any text data and it keeps everything consistent. I also make sure to exclude the index because pandas likes to throw that in there.


Libraries: glob, os, pandas

import glob
import os

import pandas as pd

# Obtain list of Excel files in path
files = glob.glob(os.path.join("MyPath", "*.xlsx"))

# Import each file and export it to csv
for f in files:
    # Import data (take the first sheet)
    df = pd.read_excel(f, sheet_name=0)
    # Create a filename for the new csv file (original name minus extension)
    filename = os.path.splitext(os.path.basename(f))[0]
    # Export to csv (with pipe delimiter and no index column)
    df.to_csv(os.path.join("MyPath", filename + ".csv"), sep="|", index=False)
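To see why the pipe delimiter works well here, a quick round trip (in memory, using StringIO instead of files) shows that text containing commas survives without any extra quoting:

```python
import io

import pandas as pd

# Round-trip a frame containing a comma through a pipe-delimited csv
df = pd.DataFrame({"text": ["hello, world"], "value": [1]})

buf = io.StringIO()
df.to_csv(buf, sep="|", index=False)
raw = buf.getvalue()

# The comma needs no quoting because it is not the delimiter
df2 = pd.read_csv(io.StringIO(raw), sep="|")
```

The data row comes out as `hello, world|1`, and reading it back reproduces the original frame exactly.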

Uploading data to AWS S3

If you store your data in AWS S3, then you can use this script as part of your workflow to upload data to S3. Using the previous two scripts together with this last script, you can create a workflow that downloads an email attachment (Excel file), converts it to csv, and then uploads it to S3.

Pretty neat, right? Just imagine the cumulative time you can save by running this workflow on a daily basis.
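The combined workflow can be sketched as three small functions called in order. The function bodies below are just placeholders to show the shape of the script; the real ones would hold the ezgmail, pandas, and boto3 code from the snippets in this post:

```python
# Track which steps ran (placeholder for real side effects)
completed = []

def download_attachments():
    # Placeholder: real body would use ezgmail.search() + downloadAllAttachments()
    completed.append("download")

def convert_to_csv():
    # Placeholder: real body would loop over .xlsx files with pandas
    completed.append("convert")

def upload_to_s3():
    # Placeholder: real body would loop over .csv files with boto3
    completed.append("upload")

def daily_workflow():
    download_attachments()
    convert_to_csv()
    upload_to_s3()

daily_workflow()
```

Keeping each step in its own small function means any one of them can be reused in a different script later.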


Libraries: boto3, glob, os

import glob
import os

import boto3

# Define S3 resource (placeholder credentials -- see the note below)
s3 = boto3.resource(
    service_name='s3',
    region_name="REGION",
    aws_access_key_id="ACCESSKEY",
    aws_secret_access_key="SECRETKEY"
)

# Obtain list of csv files in path
toupload = glob.glob(os.path.join('MyPath', "*.csv"))

# Upload files to S3 bucket under the given key prefix
for f in toupload:
    key_name = os.path.basename(f)
    s3.Bucket('MyBucket').upload_file(Filename=f, Key='My/Object/Key/' + key_name)


A little note: I know I have the access key and secret key stored as strings in this script, but this is for demonstration purposes only; you should not expose your secrets in this way.


You can instead store them as variables in a separate .py file that you exclude from your version control system. Keep it secret. Keep it safe. (Yes, that was a Gandalf quote, just in case you were wondering.)

Scheduling Tasks on Windows

I have worked exclusively in Windows virtual machines when scheduling my scripts to run automatically, so I am just going to give you a little peek into how I do that on Windows.


I use Anaconda to create virtual environments for my scripts, so I need to write a little batch script that activates the relevant conda environment and then runs the Python script.


The batch file looks something like this:

@echo off
cd "Path\To\My\Script\Directory"
call conda activate my_environment
python "my_python_script.py"
call conda deactivate

Then, inside Windows Task Scheduler, I set up a basic task that runs a program, enter the path to the batch file, set a regular time for it to kick off, and that's it!


Automating tedious tasks is a surprisingly simple process. If you are feeling intimidated by automation, then I encourage you to just try out the scripts in this blog post and see how it feels to have minutes of your life handed back to you ⌛️


As I end off this post, I leave you with two quick tips:

  • Always write your code as functions that each complete as small a task as possible. That way you can easily re-purpose functions for different scripts and different purposes without needing to rewrite any code.
  • Use the built-in Python logging library to create log files so that you can easily identify the cause of errors if they come up and fix them quickly. Check out my blog post on how to get started with the logging module if you'd like to learn more.
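To make the second tip concrete, here is a minimal sketch of the kind of logging setup I mean (the file name and format are just examples). When a scheduled script fails overnight, the log file tells you exactly which step broke:

```python
import logging

# Send log messages to a file with timestamps; force=True resets any
# previously configured handlers (available since Python 3.8)
logging.basicConfig(
    filename="automation.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    force=True,
)

logging.info("Starting csv conversion")
try:
    1 / 0  # stand-in for a step that might fail
except ZeroDivisionError:
    # logging.exception records the message plus the full traceback
    logging.exception("Conversion step failed")
```

After a run, `automation.log` contains both the progress message and the full traceback of the failure.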