Data Sets Are The New Server Rooms

When I built my first company starting in 1999 it cost $2.5 million in infrastructure just to get started and another $2.5 million in team costs to code, launch, manage, market & sell our software. So it’s unsurprising that typical “A rounds” of venture capital were $5–10 million. We had to buy Oracle database licenses, UNIX servers, a Sun Solaris operating system, web servers, load balancers, EMC storage, disk mirrors for redundancy and had to commit to a year-long hosting agreement at places such as Exodus.

- Mark Suster, in his post “Understanding Changes in the Software & Venture Capital Industries”

Over the last 12 years or so, we’ve seen an evolution from large traditional VC firms investing $5-10M per company in the first round of financing to the emergence of “micro” VC firms investing in rounds of $1M-$3M, dubbed “seed rounds”. This evolution has also spawned even smaller firms investing in rounds of several hundred thousand dollars, at a stage referred to as “pre-seed”.

As Mark Suster wrote in his post linked above, the emergence of open source software and cloud computing completely eviscerated the costs and barriers to starting a company, leading to deflationary economics where one or two people could start their company without the large upfront costs that were historically the hallmark of the VC industry.

These lower barriers to entry have led to a “Cambrian explosion” of startups but haven’t necessarily changed the rules of business. Without a defensible moat, it’s just about impossible to create a large company with sustainable profits.

The reason to take VC money isn’t to start a business but rather to deploy capital today at a loss in order to generate significant profits in the future, which is what gives venture returns their power law shape. This distinction has become blurred, especially because it seems lost on some that in the past, upfront costs sometimes acted as a barrier to entry, since capital wasn’t nearly as abundant as it is today.

However, the same technology platforms (mainly cloud computing) that previously lowered the barrier to starting up may also be providing an opportunity to build a moat today. In this scenario, startups can raise large amounts of money early on, not for servers and databases but rather to collect the data needed to improve their algorithms and create defensibility over the long term.

Consider the following company financings:

In June 2014, Affirm raised a $45M Series A round, its first round of financing according to Crunchbase
After raising a $2.1M seed round in May 2014, x.ai raised a $9.2M Series A round in January 2015 (a total of $11.3M in 8 months’ time)
In April 2015, Clarifai raised $10M in its first round of financing according to Crunchbase
After raising a $1.5M seed round in February 2015, Textio raised an $8M Series A round in December 2015 (a total of $9.5M in 10 months’ time)
Just last week, Hangar Technology announced a $6.5M seed round

What do all of these companies have in common? If your first thought is that they all harness machine learning, artificial intelligence or computer vision to their advantage, you’d be right. However, at its core, the benefit of machine learning is the positive feedback loop it provides, commonly referred to as data network effects.

It is likely that over the long term algorithms will become a commodity. Since the real value then is in the proprietary data set collected, a startup is at a disadvantage on day 1. This is where first mover advantage actually matters.

As a startup collects the data necessary to feed their ML algorithms, the value the product/service provides improves, allowing them to access more customers/users that provide more data and so on and so forth.
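The loop described above can be made concrete with a toy simulation. This is only a minimal sketch under invented assumptions: the function name, the saturating quality curve and every parameter value are hypothetical and are not drawn from any of the companies mentioned in this post.

```python
def simulate_data_flywheel(periods, initial_users=100.0, data_per_user=10.0,
                           quality_half_point=50_000.0, max_growth=0.5):
    """Toy model of a data network effect (all parameters are illustrative).

    Each period: users generate data, accumulated data raises product
    quality, and higher quality raises the user growth rate.
    """
    users = initial_users
    data = 0.0
    history = []
    for period in range(periods):
        # Users contribute new data points to the proprietary data set.
        data += users * data_per_user
        # Assumed saturating curve: quality rises with data but with
        # diminishing returns, and always stays below 1.
        quality = data / (data + quality_half_point)
        # Assumption: the user growth rate is proportional to quality.
        users *= 1.0 + max_growth * quality
        history.append((period, users, data, quality))
    return history


if __name__ == "__main__":
    for period, users, data, quality in simulate_data_flywheel(12):
        print(f"period {period:2d}  users {users:10.1f}  "
              f"data {data:12.0f}  quality {quality:.3f}")
```

In this sketch, quality and user growth reinforce each other every period, which is the compounding effect that makes an early head start on data collection valuable.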

This extra capital can be deployed in numerous ways depending on the data set the startup is looking to collect and the company’s strategy and target market.

I met with a company a few weeks ago that was raising money in order to create a physical studio that would allow them to set up cameras and sensors to collect visual data on the human body. The data they collect will allow them to build a computer vision-enabled body scanner that can understand the different features and nuances of the human body.

For Affirm, as Max Levchin points out on this podcast, the company will lose money in the beginning (the base loss rate for the credit card industry can be as high as 50% of all the capital lent early on). However, as Max states, the only way to build a defensible business in the category is to learn from your own data, and to collect enough of it that your underwriting keeps improving to the point where the business becomes more and more profitable. So, by raising a large amount of money early on, Affirm can acquire customers to ramp up transaction volume and absorb the impact of these early losses in order to collect a data set proprietary to them.
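To see why early losses of this kind translate into a large first round, here is a rough sketch of the arithmetic. It is not Affirm’s model: only the 50% starting loss rate comes from the figure quoted above, while the function names, the exponential learning curve, the loss floor, the yield and the lending volume are all hypothetical values chosen for illustration.

```python
import math


def underwriting_cash_curve(periods, volume_per_period=1_000_000.0,
                            initial_loss_rate=0.50, floor_loss_rate=0.05,
                            gross_yield=0.15, learning_scale=5_000_000.0):
    """Return (loss_rate, cumulative_profit) per period for a toy lender.

    Assumption: the loss rate decays exponentially from its initial value
    toward a floor as cumulative lending volume (the data set) grows.
    """
    cumulative_volume = 0.0
    cumulative_profit = 0.0
    curve = []
    for _ in range(periods):
        decay = math.exp(-cumulative_volume / learning_scale)
        loss_rate = floor_loss_rate + (initial_loss_rate - floor_loss_rate) * decay
        # Profit this period: yield earned minus capital lost to defaults.
        cumulative_profit += volume_per_period * (gross_yield - loss_rate)
        cumulative_volume += volume_per_period
        curve.append((loss_rate, cumulative_profit))
    return curve


def capital_required(curve):
    """Deepest cumulative deficit, i.e. the funding needed to survive it."""
    return max(0.0, -min(profit for _, profit in curve))


if __name__ == "__main__":
    curve = underwriting_cash_curve(40)
    print(f"capital needed to absorb early losses: {capital_required(curve):,.0f}")
    print(f"cumulative profit after 40 periods:    {curve[-1][1]:,.0f}")
```

Under these made-up numbers, the lender runs a cumulative deficit for the first several periods before better underwriting turns each new period profitable, and the depth of that deficit is the capital that has to be raised up front.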

I’m not privy to the strategies all of these companies have relied on, or will rely on, to collect unique data sets, but it wouldn’t be too much of a stretch to make assumptions about some possibilities. For example, x.ai launched a free beta product to users and rolled it out slowly. I’d assume that this was so that their “AI trainers” could review the product’s interactions and label the training data to improve the product for future users. By rolling the product out to more users and continuing to learn from the labelled training data, the algorithms could keep improving until the product reached a point where it could be deployed openly to the public as something users might pay for.

Another point worth making is that collecting real world data is increasingly a hardware problem, which could lead to even larger rounds (examples include autonomous vehicles, robotics, drones, etc.).

If the data sets collected early on at a loss can provide large profits in the future, then it makes sense that many of these companies either raise a large amount of capital at the seed stage or go out to raise a larger round of financing within a year of the first round of funding. I think we’ll see more rounds of this nature over the next few years as we generate more data and the use cases for harnessing that data become more apparent.

So while startups leveraging data to build a superior product may need to raise more money early on to get the company off the ground, it may also lead to superior returns without raising massive amounts of capital across stages named for the back end of the alphabet. If this is the case, then the throwback to the early days of VC would be a welcome one.

Thanks to Mike Dempsey for reading and providing feedback on this post.
