Python design pattern for writing scalable data-wrangling pipelines Some of the beauties of Python are its flexibility and simplicity. However, these abilities are a double-edged sword. If you do not put the work in early on, to design reusable, manageable, and testable code, you will run into progress issues as your codebase scales. When using Python’s Pandas module, it’s easy to move away from an object-oriented style of coding. A common pitfall is to write quick code, that becomes hard to test and messy to scale. This post shows a design pattern for reusable and low-maintenance code when data-wrangling with Pandas. 🪖 Objectives of Post Build a metrics pipeline with Pandas Refactor pipeline to be easily extendable and testable 🐍Codebase found . here 🦄 Dataset We’ll be using a free data set from Kaggle containing: “A complete list of unicorn companies in the world.” Context “A unicorn company, or unicorn startup, is a private company with a valuation over $1 billion. As of March 2022, there are 1,000 unicorns around the world. Popular former unicorns include Airbnb, Facebook and Google. Variants include a decacorn, valued at over $10 billion, and a hectocorn, valued at over $100 billion.” 📊Kaggle Unicorn Companies dataset found . here 🧱 Build Metrics Pipeline We will build 3 tables of Unicorn statistics using Pandas. Country-level metrics Country-level time series metrics Investor metrics 🏆 Winning Combination of Pandas Methods We will be using a combination of Pandas methods that breeze data manipulation and keep our pipeline looking clean. or pandas.DataFrame.explode pandas.Dataframe.melt pandas.DataFrame.groupby pandas.DataFrame.reset_index pandas.merge 🌍 Country Level Stats Firstly, we will calculate the count of Unicorns per country and the average Unicorn valuation (in billions) per country. country company mean_valuations_billion_usd max_valuations_billion_usd 0 Argentina 1 2.450000 2.45 1 Australia 6 8.433333 40.00 2 Austria 2 3.805000 4.11 3 Bahamas 1 32.000000 32.00 4 Belgium 3 2.983333 5.25 output Example Plot using Output Table 📈 country_stats The plot below is one example of a plot made with our table. We can quickly see the US leading the world in the total number of Unicorn companies. country_stats ⏳ Country Level Time Series For these metrics, we group by and columns, to count the number of Unicorns over time and sum the valuations. country date_joined 💡 Note I previously sorted the dataframe by date_joined. country founded_year company valuation_billion_usd 0 Argentina 2017-01-01 1 2.45 1 Australia 2004-01-01 1 1.60 2 Australia 2011-01-01 2 2.50 3 Australia 2012-01-01 1 40.00 4 Australia 2015-01-01 2 6.50 output ➕ Cumulative Time Series So far we have only generated time-series metrics at the point in time. However, it’s easier on the eye, to view the cumulative sum over time. These steps take the generated table and use an expanding window to calculate a cumulative sum. time_series country founded_year company valuation_billion_usd company_cumsum valuation_cumsum 0 Argentina 2017-01-01 1 2.45 1.0 2.45 1 Australia 2004-01-01 1 1.60 1.0 1.60 2 Australia 2011-01-01 2 2.50 3.0 4.10 3 Australia 2012-01-01 1 40.00 4.0 44.10 4 Australia 2015-01-01 2 6.50 6.0 50.60 output 📈 Example Plot Using time_series Output Table The time series plot below is made with our cumulative results, for the number of Unicorns per country. We can see since 2020–2021 the US has reached a trajectory for producing Unicorns that could not be matched by China. Whilst, the India and UK may be just beginning their growth stages. 🧑💼 Investor Stats Generating investor metrics is more complex. Each company’s investors are stored as a comma-separated string. For example, “Avant” have “RRE Ventures, Tiger Global, August Capital”. select_investors We want to reuse the same code format, as with the country-level metrics, to make use of the method. This will help us refactor later on. pandas.DataFrame.groupby company select_investor 215 Otto Bock HealthCare EQT Partners 390 Avant RRE Ventures, Tiger Global, August Capital 520 Promasidor Holdings IFC, Ajinomoto 609 Five Star Business Finance Sequoia Capital India, Tiger Global Management... 774 Radius Payment Solutions Inflexion Private Equity example output 🏗 Un-pivoting Investors Un-pivoting is key to this design pattern, as we want to make use of the method on individual investors. groupby Using , we generate an additional column for individual investors. Note that we now have multiple rows per company in our table. pandas.DataFrame.explode 💡 . Here I have use explode to un-pivot. Another method to check out is melt company select_investors_single 0 Bytedance Sequoia Capital China 1 Bytedance SIG Asia Investments 2 Bytedance Sina Weibo 3 Bytedance Softbank Group 4 SpaceX Founders Fund output The next step is to generate simple investor stats, company count, and valuation of the companies, in each investor’s portfolio. 📈 Example Plot using investors_stats The histogram below shows the distribution of investors by the total number of Unicorns in their portfolio. We see a power-law type distribution where most investors have only 1 unicorn, whilst few have invested in many. This type of distribution can also be found in populational monetary wealth and social networks. 🔧 Pipeline so far So far we have a metrics pipeline that looks fairly neat, but we are only generating a total of 8 metrics. If we were to extend this to 20–30 metics, our script would start to see a lot of repetition. The format of our code up to now is a simple python script. Thus our code cannot be isolated and unit-tested. Our only testing option is to run the entire script and assess the output, in an end-to-end style test. This is not great as it could take a long time to run. 🚧 Pipeline refactor 🚪Open-close Principle “Open for extension but closed for modification.” Lets refactor our code to follow the open-close principle as best possible. Move metrics functions to a Metrics class and enlist the use method. pandas.DataFrame.apply Remove the repeat calls to , with a function, using Python’s built in function. pandas.DataFrame.groupby generate_metrics getattr Create a metrics config file, which is passed to our function, with the meta-data required to generate our metrics. generate_metrics 💡Check out SOLID design principles . here 🧑🎓 Metrics Class By shifting our metrics to a class, we can isolate the metrics, and build unit tests for each metric. By using the Pandas.DataFrame.apply method, we can add personalized metrics, and leverage other python packages that are not included in Pandas. Following the open-close principle, if we wanted to add metrics we would create a new class that inherits our class Metrics. 📂 Config File The config file has a list of metrics for each table we want to generate. If we want to add or remove metrics or change naming etc, we simply change the config file. This way we are not editing our codebase itself. ⚙️ Generate Metrics Functions This function takes in our Unicorn data, an instance of our Metrics class, and our metrics config, and returns a metrics dataframe. Steps: Uses to create a object getattr pandas.DataFrame.groupby.apply Uses to create a Metrics class method object (e.g. ) getattr Metrics.count Calls the object passing the Metrics method object pandas.DataFrame.groupby.apply 🪄 Refactored Pipeline Finally, we arrive at our refactored pipeline. We can easily add metrics to existing tables by defining new metrics classes and adding them to our config file. ✅ 🔀 Benefits and Tradeoffs Benefits Code is modular and more manageable for scaling up our metrics. A config file gives us more flexibility to add and takeout metrics without touching co Easier to test the functionality of our code as it has been isolated. Tradeoffs More code to maintain for a small pipeline. The readability of our code has been reduced. Harder to debug for a small pipeline. 📚 Resources 🐍 Codebase found . here 📊 Kaggle Unicorn Companies dataset found . here Third-party libraries: pandas.DataFrame.explode pandas.DataFrame.groupby pandas.DataFrame.reset_index pandas.merge Python’s built in getattr Also Published here