Some of the beauties of Python are its flexibility and simplicity. However, these abilities are a double-edged sword. If you do not put the work in early on, to design reusable, manageable, and testable code, you will run into progress issues as your codebase scales.
When using Python’s Pandas module, it’s easy to move away from an object-oriented style of coding. A common pitfall is to write quick code, that becomes hard to test and messy to scale.
This post shows a design pattern for reusable and low-maintenance code when data-wrangling with Pandas.
🐍Codebase found here.
We’ll be using a free data set from Kaggle containing: “A complete list of unicorn companies in the world.”
Context
“A unicorn company, or unicorn startup, is a private company with a valuation over $1 billion. As of March 2022, there are 1,000 unicorns around the world. Popular former unicorns include Airbnb, Facebook and Google. Variants include a decacorn, valued at over $10 billion, and a hectocorn, valued at over $100 billion.”
📊Kaggle Unicorn Companies dataset found here.
We will build 3 tables of Unicorn statistics using Pandas.
We will be using a combination of Pandas methods that breeze data manipulation and keep our pipeline looking clean.
pandas.DataFrame.explode
or pandas.Dataframe.melt
pandas.DataFrame.groupby
pandas.DataFrame.reset_index
pandas.merge
Firstly, we will calculate the count of Unicorns per country and the average Unicorn valuation (in billions) per country.
country company mean_valuations_billion_usd max_valuations_billion_usd
0 Argentina 1 2.450000 2.45
1 Australia 6 8.433333 40.00
2 Austria 2 3.805000 4.11
3 Bahamas 1 32.000000 32.00
4 Belgium 3 2.983333 5.25
output
The plot below is one example of a plot made with our country_stats
table. We can quickly see the US leading the world in the total number of Unicorn companies.
⏳ Country Level Time Series
For these metrics, we group by country and date_joined columns, to count the number of Unicorns over time and sum the valuations.
💡
Note I previously sorted the dataframe by date_joined.
country founded_year company valuation_billion_usd
0 Argentina 2017-01-01 1 2.45
1 Australia 2004-01-01 1 1.60
2 Australia 2011-01-01 2 2.50
3 Australia 2012-01-01 1 40.00
4 Australia 2015-01-01 2 6.50
output
So far we have only generated time-series metrics at the point in time. However, it’s easier on the eye, to view the cumulative sum over time.
These steps take the generated time_series
table and use an expanding window to calculate a cumulative sum.
country founded_year company valuation_billion_usd company_cumsum valuation_cumsum
0 Argentina 2017-01-01 1 2.45 1.0 2.45
1 Australia 2004-01-01 1 1.60 1.0 1.60
2 Australia 2011-01-01 2 2.50 3.0 4.10
3 Australia 2012-01-01 1 40.00 4.0 44.10
4 Australia 2015-01-01 2 6.50 6.0 50.60
output
The time series plot below is made with our cumulative results, for the number of Unicorns per country. We can see since 2020–2021 the US has reached a trajectory for producing Unicorns that could not be matched by China. Whilst, the India and UK may be just beginning their growth stages.
🧑💼 Investor Stats
Generating investor metrics is more complex. Each company’s investors are stored as a comma-separated string.
For example, “Avant” have select_investors “RRE Ventures, Tiger Global, August Capital”.
We want to reuse the same code format, as with the country-level metrics, to make use of the pandas.DataFrame.groupby
method. This will help us refactor later on.
company select_investor
215 Otto Bock HealthCare EQT Partners
390 Avant RRE Ventures, Tiger Global, August Capital
520 Promasidor Holdings IFC, Ajinomoto
609 Five Star Business Finance Sequoia Capital India, Tiger Global Management...
774 Radius Payment Solutions Inflexion Private Equity
example output
Un-pivoting is key to this design pattern, as we want to make use of the groupby
method on individual investors.
Using pandas.DataFrame.explode
, we generate an additional column for individual investors. Note that we now have multiple rows per company in our table.
💡Here I have use explode
to un-pivot. Another method to check out is melt
.
company select_investors_single
0 Bytedance Sequoia Capital China
1 Bytedance SIG Asia Investments
2 Bytedance Sina Weibo
3 Bytedance Softbank Group
4 SpaceX Founders Fund
output
The next step is to generate simple investor stats, company count, and valuation of the companies, in each investor’s portfolio.
The histogram below shows the distribution of investors by the total number of Unicorns in their portfolio. We see a power-law type distribution where most investors have only 1 unicorn, whilst few have invested in many. This type of distribution can also be found in populational monetary wealth and social networks.
🔧 Pipeline so far
So far we have a metrics pipeline that looks fairly neat, but we are only generating a total of 8 metrics. If we were to extend this to 20–30 metics, our script would start to see a lot of repetition.
The format of our code up to now is a simple python script. Thus our code cannot be isolated and unit-tested.
Our only testing option is to run the entire script and assess the output, in an end-to-end style test. This is not great as it could take a long time to run.
“Open for extension but closed for modification.”
Lets refactor our code to follow the open-close principle as best possible.
pandas.DataFrame.apply
method.pandas.DataFrame.groupby
, with a generate_metrics
function, using Python’s built in getattr
function.generate_metrics
function, with the meta-data required to generate our metrics.
💡Check out SOLID design principles here.
By shifting our metrics to a class, we can isolate the metrics, and build unit tests for each metric. By using the Pandas.DataFrame.apply method, we can add personalized metrics, and leverage other python packages that are not included in Pandas.
Following the open-close principle, if we wanted to add metrics we would create a new class that inherits our class Metrics.
The config file has a list of metrics for each table we want to generate. If we want to add or remove metrics or change naming etc, we simply change the config file. This way we are not editing our codebase itself.
This function takes in our Unicorn data, an instance of our Metrics class, and our metrics config, and returns a metrics dataframe.
Steps:
getattr
to create a pandas.DataFrame.groupby.apply
objectgetattr
to create a Metrics class method object (e.g. Metrics.count
)pandas.DataFrame.groupby.apply
object passing the Metrics method objectFinally, we arrive at our refactored pipeline. We can easily add metrics to existing tables by defining new metrics classes and adding them to our config file.
Benefits
Tradeoffs
🐍 Codebase found here.
📊 Kaggle Unicorn Companies dataset found here.
pandas.DataFrame.explode
pandas.DataFrame.groupby
pandas.DataFrame.reset_index
pandas.merge
getattr
Also Published here