Unicorns vs Failures: Constructing Comprehensive Datasets for Predictive Modeling

by ExitStrategy, August 7th, 2024

Too Long; Didn't Read

Successful companies were defined as those achieving IPO, acquisition, or unicorn status, with specific valuation and funding thresholds applied. An extensive dataset was compiled, including timelines of key investment rounds. Unsuccessful companies were identified by filtering out those with successful outcomes and applying additional criteria, resulting in a dataset of 32,760 unsuccessful and 1,989 successful companies for model training.

Authors:

(1) Mark Potanin, Corresponding Author ([email protected]);

(2) Andrey Chertok ([email protected]);

(3) Konstantin Zorin ([email protected]);

(4) Cyril Shtabtsovsky ([email protected]).

Abstract and 1. Introduction

2 Related works

3 Dataset Overview, Preprocessing, and Features

3.1 Successful Companies Dataset and 3.2 Unsuccessful Companies Dataset

3.3 Features

4 Model Training, Evaluation, and Portfolio Simulation and 4.1 Backtest

4.2 Backtest settings

4.3 Results

4.4 Capital Growth

5 Other approaches

5.1 Investors ranking model

5.2 Founders ranking model and 5.3 Unicorn recommendation model

6 Conclusion

7 Further Research, References and Appendix

3.1 Successful Companies Dataset

In this research, a company is deemed successful if it achieves one of three outcomes: an Initial Public Offering (IPO), an Acquisition (ACQ), or Unicorn status (UNIC), the last defined as a valuation exceeding $1 billion. To assemble the list of successful companies, we first filtered for IPOs with valuations above $500 million or funds raised over $100 million, yielding 363 companies. For acquisitions, we removed companies whose purchase price was below the maximum amount of funds raised or under $100 million, leaving 833 companies. To select unicorns, we searched for companies with a valuation above $1 billion, using both Crunchbase data and an additional table of verified unicorns, which gave a total of 1,074 unicorns.
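The three success rules above can be sketched as a single labeling function. This is a minimal illustration, not the authors' code: the `Company` fields, the `success_label` name, and the order in which the rules are checked are all assumptions, and "maximum amount of funds raised" is represented here by one hypothetical field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Company:
    # All field names are hypothetical, not the actual Crunchbase schema.
    name: str
    exit_type: Optional[str] = None      # "IPO", "ACQ", or None
    valuation_usd: float = 0.0
    funds_raised_usd: float = 0.0        # used by the IPO rule
    acquisition_price_usd: float = 0.0
    max_funds_raised_usd: float = 0.0    # "maximum amount of funds raised"


def success_label(c: Company) -> Optional[str]:
    """Return "IPO", "ACQ", "UNIC", or None using the thresholds in the text."""
    if c.exit_type == "IPO":
        # IPO counts if valuation > $500M or funds raised > $100M.
        if c.valuation_usd > 500e6 or c.funds_raised_usd > 100e6:
            return "IPO"
    elif c.exit_type == "ACQ":
        # Dropped if the price is below the funds raised or under $100M.
        if c.acquisition_price_usd >= max(c.max_funds_raised_usd, 100e6):
            return "ACQ"
    # Unicorn: valuation strictly above $1B (checked last, by assumption).
    if c.valuation_usd > 1e9:
        return "UNIC"
    return None
```

For example, an acquisition at $150M by a company that had raised $200M returns `None`, because the purchase price is below the funds raised.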


The final dataset contains a timeline of all crucial investment rounds leading to the success event (i.e., achieving unicorn status, IPO, or ACQ), with the index of this event specified in the success_round column. This approach ensures that the dataset accurately represents the history and progress of each successful company, facilitating effective analysis.
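One way to picture the timeline and the `success_round` index is the sketch below. Only the `success_round` name comes from the text; the round dictionary keys, the round-type strings, and the choice to cut the timeline at the success event are assumptions made for illustration.

```python
from typing import Dict, List, Optional

SUCCESS_TYPES = {"ipo", "acquisition"}   # hypothetical round-type names
UNICORN_VALUATION_USD = 1e9


def build_timeline(rounds: List[Dict]) -> Dict:
    """Order rounds by date and record the index of the first success event."""
    # ISO-formatted date strings sort chronologically as plain text.
    ordered = sorted(rounds, key=lambda r: r["date"])
    success_round: Optional[int] = None
    for i, r in enumerate(ordered):
        is_exit = r["type"] in SUCCESS_TYPES
        is_unicorn = r.get("valuation_usd", 0.0) > UNICORN_VALUATION_USD
        if is_exit or is_unicorn:
            success_round = i
            break
    # Keep only the rounds leading up to and including the success event.
    kept = ordered if success_round is None else ordered[: success_round + 1]
    return {"rounds": kept, "success_round": success_round}
```

A company with no qualifying event keeps its full timeline and gets `success_round` set to `None` in this sketch.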

3.2 Unsuccessful Companies Dataset

To supply the model with examples of "unsuccessful" companies, we collected a separate dataset. We excluded companies already present in the successful companies dataset by removing those that had IPO, ACQ, or UNIC flags. We also removed a considerable number of actual unicorns listed on the Crunchbase website [16] to avoid overlap. We excluded companies that have not attracted any rounds since 2016, as well as companies that are subsidiaries or parent companies of other entities. Furthermore, we used the jobs dataset to exclude companies that have hired employees since 2017.


Additionally, we excluded companies with a valuation above $100 million, as they fall into a "gray area" of companies that cannot be clearly categorized as successful or unsuccessful. Applying these filters, we constructed a dataset of 32,760 companies labeled "0" (unsuccessful) and 1,989 companies labeled "1" (successful).
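The exclusion rules in this section can be read as a chain of filters. The sketch below applies them literally as worded in the text; the `Candidate` fields, the `is_unsuccessful` name, and the inclusive handling of the 2016 and 2017 boundary years are assumptions, not details confirmed by the paper.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Candidate:
    # Hypothetical fields standing in for the Crunchbase and jobs data.
    name: str
    flags: Set[str] = field(default_factory=set)   # e.g. {"IPO"}
    last_round_year: Optional[int] = None
    last_hire_year: Optional[int] = None
    is_subsidiary_or_parent: bool = False
    valuation_usd: float = 0.0


def is_unsuccessful(c: Candidate, known_unicorns: Set[str]) -> bool:
    """True if the company survives every exclusion and gets label 0."""
    if c.flags & {"IPO", "ACQ", "UNIC"}:
        return False                     # already in the successful set
    if c.name in known_unicorns:
        return False                     # listed as an actual unicorn
    if c.last_round_year is None or c.last_round_year < 2016:
        return False                     # no rounds since 2016
    if c.is_subsidiary_or_parent:
        return False                     # subsidiary or parent company
    if c.last_hire_year is not None and c.last_hire_year >= 2017:
        return False                     # hired since 2017 (jobs dataset)
    if c.valuation_usd > 100e6:
        return False                     # "gray area" valuation
    return True
```

With the counts reported above, the positive class is 1,989 of 34,749 rows, roughly 5.7%, so the classes are imbalanced at about 16.5 unsuccessful companies per successful one.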


This paper is available on arXiv under a CC 4.0 license.