Building a Markov Chain Monte Carlo Pipeline Using Luigi
A few months ago I was accepted into a data science bootcamp, Springboard, for their Data Science Career Track. As part of this bootcamp I had to work on capstone projects that would help build my portfolio and show my ability to extract and clean data, build models and draw insights from those models. For my first project I opted to build a Markov Chain Monte Carlo pipeline. The initial objective was a multi-touch attribution model that would help me understand conversion rates between the different states in the signup process, and from that, which channels appeared to deliver the greatest conversion rates for users coming through a given landing page and transitioning through the signup states defined in my dataset.
Unfortunately, as a result of limitations in the data I was using, this proved to be as elusive as an honest politician. Nonetheless, with the help of my Springboard mentor, I built a data pipeline using the luigi package that carries out the tasks required to build this model from scratch and produce the associated output.
For the reader’s convenience I have attached links to my code on GitHub for each task in my pipeline.
What is Luigi?
Well, not this Luigi.
I’m not a very smart person; on a scale of 1 to 10 I would rank myself a 5 on a good day, and on other days I struggle not to walk into glass doors. For someone such as myself to understand most things, they need to be dumbed down and explained in the simplest way possible, with relatable analogies and no jargon.
I will try to the best of my abilities to explain what luigi is, how it works and what it can do in the simplest way possible.
Think of a domino set: you can, for example, build a spiral with different tiles. The distance between each tile must be such that when one tile falls it triggers the next tile to fall, and so on until all the tiles have fallen. Each tile represents a task; you need each task to run to ultimately get your desired output.
At least for the purposes I used the package for, I would liken Luigi to a domino set that allows me to stitch different tasks together: when one task runs to completion, it triggers the next task to take the previous task’s output and run until it, too, completes.
According to the Luigi documentation it ‘helps you build complex pipelines of batch jobs’.
Much like setting up a domino run so that some number of branches split off at a certain point, and once the falling tiles reach that point the branches fall in parallel to each other, Luigi allows you to parallelize the work that needs to be done in a given task.
In a given pipeline you have tasks, which are the domino tiles, and task parameters, which are the inputs a task takes. For example, when you create a function in Python, the function may require one argument or n arguments; I would liken Luigi parameters to those arguments.
To explain how these parameters are assigned, it’s important to note that there are typically three functions in a task/class:
- the run function, which holds the core ‘business logic’ of the task, i.e. what your task is supposed to do,
- the requires function, which declares the dependencies needed to run the current task. Since this function runs first, you can use it to dynamically assign values to the parameters of a previous class, especially if the output of that class is needed by the current task. This is how you start stitching different tasks together,
- the output function, where the output of your task is defined and produced.
Luigi also provides a neat web interface that enables you to view pending, running and completed tasks along with a visualization of the running workflow and any dependencies required for each task in your pipeline. You can access this by running luigid on your terminal and opening the interface through the url: http://localhost:8082/
As explained, the pipeline is multiple tasks stitched together using Luigi. The process starts by taking input in the form of the csv file extracted from my data source, finding the unique signup states and creating a separate csv for each unique state identified; there are four in total.
Each csv has data specifically related to activities that occurred under that state. For the purposes of this assignment I named this task ‘separate_csv’, as it performs its namesake.
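A minimal sketch of what this task’s core logic might look like, assuming the source csv has a column identifying the signup state (the column name and the splitting helper below are my assumptions, not the actual code):

```python
import pandas as pd

def separate_csv(source_df, state_col='state'):
    # split the source data into one DataFrame per unique signup state;
    # each frame holds only the activity rows for that state
    return {state: group.reset_index(drop=True)
            for state, group in source_df.groupby(state_col)}
```

Each of these per-state frames can then be written out with to_csv, giving one file per unique state.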
Since I want to find out whether a given user moved from one state to another, the next task, ‘state_to_state_transitions1’, is dependent on the output of the ‘separate_csv’ task.
Because of this, the parameter value for ‘separate_csv’ is assigned in ‘state_to_state_transitions1’. If you recall, the ‘requires’ function runs first; therefore, when ‘state_to_state_transitions1’ runs, it first executes the requires function, which assigns the original data csv to ‘separate_csv’.
This logic is built into all my tasks.
The purpose of ‘state_to_state_transitions1’ is to create the sequence of marketing sources a given user clicked before completing the signup process. Given the change in the objective of the model, this is more of a nice-to-have table showing the sequence of marketing channels engaged with before signup.
The next task, ‘state_to_state_transitions2’, then takes the 4 unique state files created and, for each unique user, checks whether that user moved from the first state to the second state, and so on through to the final state, returning boolean values depending on whether or not each transition occurred.
The output is three files, one representing each transition from one state to the next.
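The per-user check can be sketched like this. It is a simplification under my own assumptions (the real task works on the per-state csv files, not in-memory sets):

```python
def state_transitions(state_users):
    """Given an ordered list of sets of user ids (one set per signup state),
    return, for each user, one boolean per consecutive transition."""
    all_users = set().union(*state_users)
    return {
        user: [user in state_users[i] and user in state_users[i + 1]
               for i in range(len(state_users) - 1)]
        for user in all_users
    }
```

With four states this yields three booleans per user, matching the three transition files the task produces.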
After this part of the pipeline I needed to get the probability distribution for each transition. For this I used the gaussian_kde module in SciPy, which estimates the probability distribution of a given dataset using a kernel density estimator.
This task produces three pickle files with the probability distribution for each transition.
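For illustration, this is roughly how gaussian_kde is used; the data below is made up, standing in for one transition’s observed values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# stand-in for the observed transition values from one of the csv files
transitions = rng.normal(loc=0.5, scale=0.1, size=1000)

kde = stats.gaussian_kde(transitions)   # fit the density estimator
density_at_half = kde.pdf(0.5)          # evaluate the estimated density at 0.5
samples = kde.resample(100, seed=1)     # draw new samples, shape (1, 100)
```

The fitted kde object is what gets pickled, so downstream tasks can evaluate or resample the distribution without touching the raw data again.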
Since I was dealing with relatively large numbers of rows of user activity, I then had to draw samples from each pickle file and save each sample as a separate csv.
From the samples produced in the ‘get_samples’ task, I created a visualization to check whether the distribution of transitions observed in these sample files matched the distribution in the population data. Since this task’s output is primarily visual, I did not include it in the pipeline.
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

files = ['Sessiontolead+sampleprobabs', 'leadtoopportunity+sampleprobabs', 'opportunitytocomplete+sampleprobabs']
pickles = ['Sessiontoleadprobabs', 'leadtoopportunityprobabs', 'opportunitytocompleteprobabs']

def func(sims, tag):
    # load the sampled transition probabilities
    file_path = 'C:\\Users\\User\\Documents\\GitHub\\Springboard-DSC\\AttributionModel\\Data\\ModelData\\original\\'
    sims = pd.read_csv(file_path + sims + '.csv')
    # load the pickled population distribution (a fitted gaussian_kde)
    path = 'C:\\Users\\User\\Documents\\GitHub\\Springboard-DSC\\AttributionModel\\Data\\ModelData\\pickles\\'
    actuals = pd.read_pickle(path + tag + '.pck')
    # population density plotted as a dashed red line
    y = np.linspace(start=stats.norm.ppf(0.1), stop=stats.norm.ppf(0.99), num=100)
    fig, ax = plt.subplots()
    ax.plot(y, actuals.pdf(y), linestyle='dashed', c='red', lw=2, alpha=0.8)
    ax.set_title(tag + ' Stats v Actual comparison', fontsize=15)
    # sample transitions as a histogram on a secondary axis
    ax_two = ax.twinx()
    ax_two = plt.hist(sims.iloc[:, 1])
    return fig.savefig(path + str(tag) + '.png')

for x, y in zip(files, pickles):
    func(x, y)
The output of this ‘sanity check’ shows the distribution of transitions between states for the population data, represented by the red dotted line, against the transitions for the samples, represented by the bars. These visualizations confirmed that the distribution of transitions in the samples matched the population transitions.
The last task, ‘state_to_state_machine’, then creates simulations based on the probability distribution for each transition. Each simulation can be thought of as a hypothetical user who landed on one of the company’s signup pages and then proceeded to go through the signup process.
We can then simulate users moving from state to state based on different filters, for example whether the user was on a particular device, started signing up on a particular day or time of day, or came through a particular marketing campaign they may have clicked (with the right data). Each simulation creates a new file, and once the value of a preceding transition is False, there will also be a False for the following states.
Finally, the pipeline ends with the ‘parent_wrapper’ task that ties everything together: it assigns parameter values to the ‘state_to_state_machine’ task, joins all the files created through the simulations into a single file and runs the entire pipeline.
For a more detailed breakdown of the model, visit my GitHub repo.