Srinath Perera

@srinathperera

History of Big Data: A Technical Comedy

The Italian Comedy (Antoine Watteau, 1684–1721)

Prologue

This is a big data story told in a different style. Why? When we tell stories, we connect only the things that are regular, familiar, and easy on our minds. If facts do not fit, we overlook them.

What happens when we look at all the possibilities and ask how decisions connect to each other? A great example is the excerpt from Tolstoy's War and Peace about Napoleon. The essay raises many questions, and they are easy to ignore. If you do not ignore them, if you accept the cognitive dissonance and think it through, you will understand Napoleon much better.

What follows is big data seen through the same lens. I am not an opponent of big data; rather, I am a believer. I hope this helps you understand big data more deeply, just as it helped me.

Act 1: Google doesn’t like Databases

All analytics were done with databases. All data were stored in databases. It was a well-known anti-pattern to store data worthy of a database in a flat file.

Google did not like databases. Maybe they tried them, and they did not work. Maybe they did work, but the company that could make them work asked for too much money. Maybe they knew databases would not work. At the end of the day, they used files. Being Google, they built a big (distributed) file system. Of course, they wrote a shiny new one.

Google wanted to query their big files. Many geniuses got together and, for once, came up with the simplest solution. Where the solution came from is not clear.

Maybe they took the map and reduce operations from functional programming, threw away the other operators, and mashed the two together to create a new operator. Maybe they took MPI, threw away almost all of its 100+ operators, and kept two. Maybe one of the geniuses dreamt it up. We do not know.

They created the most straightforward solution to a complicated problem. This is unheard of and not worthy of such geniuses. Then they did another thing no one had done.

They told everyone about it. While making a lot of money, Google told everyone about part of the secret sauce. Why? We do not know. Maybe they were thinking about increasing human knowledge. Maybe they wanted others to think the way they did so that hiring would be easier. Maybe they wanted the world to know that they do serious stuff. Maybe they were so far ahead that they were sure others could not catch up. Maybe they knew MapReduce would not work in the long run and wanted to send their competition on a wild goose chase. In this world of bluff and double-bluff, who knows?

Act 2: New Dreams

Then all hell broke loose. People felt as they must have when Prometheus brought fire from heaven. Well, yes, we do not know how that felt, but I am sure it was something like this. A few people got together and implemented it as open source. Yahoo, Google's competitor, helped fund some of it, and Hadoop was born.

People did not have use cases like Google's. They did not have data like Google's. Most did not have enough data to fill even a MySQL database. Yet everybody loved MapReduce. They dreamed up a lot of data and created Big Data.
They dreamed of how one could collect data about the world, make sense of it, and change the world. Then they counted words with it. Some dug in and found data that was a few gigabytes big, but others could not even do that, so they dreamed of the day they would have a lot of data. Others figured out that scientists had been handling big data for a long time, calling it scientific computing. Everyone marveled at what scientists had been doing all along. It became much easier to get research grants, so the scientists did not mind either. Now we really have big data (which we had all along).
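For anyone who has not counted words this way, here is a minimal single-machine sketch of the idea in Python. It only imitates the shape of MapReduce (a map step that emits pairs, a grouping step, a reduce step that sums); the function names are made up for illustration and are not Hadoop's or Google's API.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, the job a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data is big", "data is data"])
print(counts)
```

A real framework runs the map and reduce steps on many machines and does the grouping over the network; the logic the programmer writes is about this small.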

Act 3: I hate you... sorry, I love you, SQL

Since Google had a beef with databases, someone figured the problem was SQL. They created a new kind of storage and called it NoSQL. Soon they figured they needed a way to query their storage. Whatever they did, the queries looked like SQL. So they changed the name to Not Only SQL.

Mike Stonebraker, ten years before receiving his Turing Award, spoke out. He said, in his humble and spear-like prose, “Guys, all you do is count and group. SQL can do all this and more. Just make SQL work with your glorified big files.” Of course, nobody listened.

Academics and investors went crazy and threw their brains and money into Hadoop.
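To make the "counting and grouping" point concrete, here is a small sketch of a word count written as a single SQL query, using Python's built-in sqlite3 module and an in-memory database. The table name, column name, and sample text are assumptions chosen for illustration.

```python
import sqlite3

# An in-memory database stands in for the "glorified big files".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")

text = "big data is big data is data"
conn.executemany(
    "INSERT INTO words (word) VALUES (?)",
    [(w,) for w in text.split()],
)

# Counting and grouping, in one statement.
rows = conn.execute(
    "SELECT word, COUNT(*) FROM words GROUP BY word ORDER BY word"
).fetchall()
conn.close()

print(rows)
```

The whole job is the one `GROUP BY` line; everything else is loading the data.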

Act 4: Continuing the legacy

Soon came Spark, which beat Hadoop's performance by 10–20X. Bye-bye, Hadoop. Wait, wait: what happens to all the investor money that went into building Hadoop companies? The investors got together and huddled around Spark so tightly that it is now hard to tell where Hadoop ends and Spark starts. They explained that both are MapReduce technologies (although MapReduce is less than 1/10 of what Spark does) and that Spark is the future. Everyone is happy.

Meanwhile, Google dropped MapReduce but did not bother to tell us. To be fair, they talked about all the technologies they built instead, but did not help us put two and two together. It is not as if many were paying attention.

Thanks to Spark, machine learning (ML) took off in a big way. Soon data science was born. Years of old ML research came back; new things were found, improvements were made, and new techniques were discovered or rediscovered. Soon it turned out that Spark does not work that well with deep learning. Google and others had to create new techniques. This did not matter. Most data is small, so we can do data science with R and Python on a single machine and be mysterious about how we would run it at scale. GAFA (the big four tech companies) kept running machine learning at large scale and told us about it. That is enough to keep the mystery and aura going. GAFA hiring everyone who can do machine learning also helped.

Act 5: Show me the money

When all is said and done, the money is in the enterprise. Enterprises already had data warehouses and BI. Big data went there and folded them in. Well, I did not say replaced. Sometimes BI and data warehouses were just folded in and counted as analytics. Sometimes upgrading the current product took you from the old technology to the new. Sometimes the old technology was replaced.
Meanwhile, SQL and NoSQL databases are merging. NoSQL databases are supporting full SQL or coming close. SQL databases are supporting NoSQL features. Someone should have listened to Mike, but he has grown tired of saying “I told you so.” He does not say anything.

Act 6: Big data has it all

Now big data/analytics/AI has it all: huge markets, use cases, customers, investors. All the dreams have come true.

All is not well. It is tough to find people who can build these systems. It is even harder to find architects who can think them through and make them usable. Almost no one thinks about usability yet, and we are just waking up to problems like bias. However, some systems are up, somebody must be making use of them, and somebody is getting some benefit. Who knows?

Big companies that were rebranded as big data companies are not growing. All that promised growth must have gone to blockchain. Open source companies are growing at 50%, but they are too small. At the current rate, they might catch up in about 10–20 years.

Act 7: Onward Ho

Everyone is busy. There is AI, the singularity is coming, and the robots are coming. What will happen when they take over our jobs? Bots are all the rage too. Most bots break within two sentences. Among the rest, most are not usable, and it is much easier to use a UI to get the same thing done. Some turn into an embarrassment within hours. However, these are details, and who has time to read the details? Who cares whether big data works or not?

There might still be time to make it work. Maybe the hard work is already done. Programmers have been trained. The new generation is being taught. More and more people are asking for analytics. We can get the end-to-end stories right, get usability right, and get the tools right. It can work.

Keep up the good work; I will check back when I am bored with blockchain.

I hope this was useful. If you enjoyed this post, you might also like my other posts, “Mastering the 4 Balancing Acts in Microservices Architecture” and “Introduction to Anomaly Detection.”