
How We Used Crunchbase Data to Predict Startup Success

by ExitStrategy, August 7th, 2024

Too Long; Didn't Read

We utilized Crunchbase's daily CSV export from June 2022 to create a labeled dataset for training a deep learning model to classify startup success. The focus was on companies established from 2000 onwards, across various categories. Ambiguous funding rounds were included if they occurred after Series B to ensure comprehensive data for model training.

Authors:

(1) Mark Potanin, Corresponding author ([email protected]);

(2) Andrey Chertok, ([email protected]);

(3) Konstantin Zorin, ([email protected]);

(4) Cyril Shtabtsovsky, ([email protected]).

Abstract and 1. Introduction

2 Related works

3 Dataset Overview, Preprocessing, and Features

3.1 Successful Companies Dataset and 3.2 Unsuccessful Companies Dataset

3.3 Features

4 Model Training, Evaluation, and Portfolio Simulation and 4.1 Backtest

4.2 Backtest settings

4.3 Results

4.4 Capital Growth

5 Other approaches

5.1 Investors ranking model

5.2 Founders ranking model and 5.3 Unicorn recommendation model

6 Conclusion

7 Further Research, References and Appendix

3 Dataset Overview, Preprocessing, and Features

We used the daily Crunchbase database export (Daily CSV Export) as the primary data source; Crunchbase data is also accessible through a well-documented API. The main goal of this research was to collect a labeled dataset for training a deep learning model to classify companies as either successful or unsuccessful.


The analysis was based on the Daily CSV Export from 2022-06-14, and only companies established on or after 2000-01-01 were taken into account. To refine the focus of the research, only companies within specific categories were included, such as Software, Internet Services, Hardware, Information Technology, Media and Entertainment, Commerce and Shopping, Mobile, Data and Analytics, Financial Services, Sales and Marketing, Apps, Advertising, Artificial Intelligence, Professional Services, Privacy and Security, Video, Content and Publishing, Design, Payments, Gaming, Messaging and Telecommunications, Music and Audio, Platforms, Education, and Lending and Investments.
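The selection rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the field names `founded_on` and `category_groups`, the comma-separated category format, the choice to keep a company when any one of its categories matches, and the choice to drop companies with a missing founding date are all assumptions made for the example.

```python
from datetime import date

# Cutoff taken from the text: companies established on or after 2000-01-01.
MIN_FOUNDED = date(2000, 1, 1)

# Categories listed in the text.
TARGET_CATEGORIES = {
    "Software", "Internet Services", "Hardware", "Information Technology",
    "Media and Entertainment", "Commerce and Shopping", "Mobile",
    "Data and Analytics", "Financial Services", "Sales and Marketing",
    "Apps", "Advertising", "Artificial Intelligence", "Professional Services",
    "Privacy and Security", "Video", "Content and Publishing", "Design",
    "Payments", "Gaming", "Messaging and Telecommunications",
    "Music and Audio", "Platforms", "Education", "Lending and Investments",
}


def is_in_scope(company: dict) -> bool:
    """Return True if a company record passes the date and category filters.

    `company` is a hypothetical record with an ISO-formatted `founded_on`
    string and a comma-separated `category_groups` string (assumed names).
    """
    founded = company.get("founded_on")
    if not founded:
        # Assumption: records without a founding date are excluded.
        return False
    if date.fromisoformat(founded) < MIN_FOUNDED:
        return False
    raw_groups = company.get("category_groups") or ""
    groups = {g.strip() for g in raw_groups.split(",")}
    # Assumption: one matching category is enough to include the company.
    return bool(groups & TARGET_CATEGORIES)
```

For example, `is_in_scope({"founded_on": "2005-03-01", "category_groups": "Software,Health Care"})` passes both filters under these assumptions, while a company founded in 1999 does not.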


This research focuses on investment rounds occurring after round B. However, in the Crunchbase data glossary, round types such as series_unknown, private_equity, and undisclosed have unclear characteristics. We therefore included these ambiguous rounds in a company's funding round history only if they occurred after round B.
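One way to read the rule for ambiguous rounds is sketched below. This is an illustration under stated assumptions, not the authors' implementation: each round is represented as a hypothetical `(announced_on, investment_type)` pair, "after round B" is taken to mean strictly later than the company's first `series_b` date, and a company with no `series_b` round has all of its ambiguous rounds dropped.

```python
from datetime import date

# Ambiguous round types named in the text.
AMBIGUOUS_TYPES = {"series_unknown", "private_equity", "undisclosed"}


def clean_round_history(rounds):
    """Filter one company's funding rounds.

    `rounds` is a list of (announced_on: date, investment_type: str) pairs,
    a hypothetical representation chosen for this sketch. Ambiguous rounds
    are kept only when dated strictly after the first series_b round.
    """
    ordered = sorted(rounds, key=lambda r: r[0])
    series_b_dates = [d for d, t in ordered if t == "series_b"]
    first_b = series_b_dates[0] if series_b_dates else None

    kept = []
    for announced_on, investment_type in ordered:
        is_ambiguous = investment_type in AMBIGUOUS_TYPES
        # Assumption: with no series_b on record, ambiguous rounds are dropped.
        if is_ambiguous and (first_b is None or announced_on <= first_b):
            continue
        kept.append((announced_on, investment_type))
    return kept
```

Under this reading, a `series_unknown` round announced before the company's round B is removed from the history, while a `private_equity` round announced after it is retained alongside the clearly labeled rounds.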


This paper is available on arXiv under a CC 4.0 license.