“Mister Wayne, if you don’t want to tell me exactly what you are doing, when I’m asked, I don’t have to lie. But don’t think of me as an idiot.” — Lucius Fox to Batman in ‘Batman Begins’
Batman is one of the most relatable superheroes for me. For what does he possess that mortals can’t aspire to? It is possible to earn vast sums of money that let you develop or buy the best in technology. It is possible to gain skills in psychology and martial arts, and to sharpen your intellect. Despite his general lack of ‘superpowers’, it somehow feels like he can defeat any hero or villain out there (yes, we can have this debate some other time). And yet Batman’s most prized asset isn’t the Batmobile, the Batcave or even Alfred. I believe it is something more than fancy, over-the-top symbolic gadgetry. It is his access to information, vast quantities of it, that makes Batman so powerful. We will call each bit of this information data.
FYI, a single piece of data is called a ‘datum’, although hardly anyone uses that term anymore. Everything around you can be more or less expressed as a bit of data or a collection of data. Everything creates data or is data itself. We are all information and information processors at the same time. You, reading this, are processing the information I present, which was condensed from reading (books are condensed data sources created by other data processors, i.e. people) and from past life experiences. Sharing this post creates a copy of this data. All of this forms a network of data and data processors; computing devices only happen to be a part of this complex network. It is true that computers have sped up the dissemination and processing of information exponentially, but they are only a part. The universe is nothing more than matter, energy and information. Everything is measurable; almost everything can be datafied.
Datafication is the conversion of information into a useful format such as numbers, visuals, etc. It isn’t that we couldn’t do this before the advent of computers; as I said, computers sped the process up exponentially. Speed is a significant improvement, for the quick shall inherit the earth. What computers also enabled is the storage of vast amounts of information in a tiny amount of space. How tiny? It is theoretically possible to fit 10.5 terabytes of data on a disk a centimetre in diameter and about a millimetre thick, enough to hold 500 copies of the Encyclopedia Britannica and a million photos more than ten times over (link). We can also store information in the structure of DNA. More on this later.
Data is anything that can be recorded in some form, analysed, and ordered or reorganised. Data is most often transient in nature: it can be moved and, most times, recombined to create more data. Vast quantities of data about you and me are collected and stored each day through our interactions on social media. Each person is writing an autobiography with every online interaction. We will have documented the lives of a large part of Generation Z through the baby pictures, posts and videos their parents post, and soon these kids will have an online presence they call their own. We will be able to see the trajectory of their lives, documented by this generation and its social circles, like never before. By telling the world so much about its existence, this generation is making a point: everything matters. And now that we have access to everything, why not use it all?
All
Anyone who has taken an introduction to statistics knows the concepts of a population and a sample. A sample is a subset of a population, and when someone refers to a sample, they usually mean a random sample. Simply put, if a drug company wants to release a new kind of drug, it tests it on a random group of people. If the drug is effective for this random sample, it is assumed that it will be effective for the entire target population. Random sampling reduces a Big Data problem to a small data problem, and it has been the accepted method because it is unviable to test a drug or product on the entire population. A digital medium completely changes this paradigm. We suddenly have access to all the data points. We no longer need a sample; our sample can be the entire population. All trumps some in the world of Big Data.
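To make the contrast concrete, here is a minimal Python sketch. The population size and the 12% response rate are invented purely for illustration; the point is simply that a random sample gives an estimate, while computing over every data point gives the answer itself:

```python
import random

random.seed(42)  # reproducibility

# A toy "population" of one million people; roughly 12% respond to the
# drug. Both numbers are invented purely for illustration.
population = [random.random() < 0.12 for _ in range(1_000_000)]

# The classical approach: let a random sample of 1,000 stand in for everyone.
sample = random.sample(population, 1_000)
print(f"sample estimate: {sum(sample) / len(sample):.3f}")  # close, not exact

# The Big Data approach: compute over the entire population. N = all.
print(f"full population: {sum(population) / len(population):.3f}")
```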
Perfection is a myth, messiness is real
For far too long we have been obsessed with the accuracy of data. This demand for meticulousness was fine when the datasets we dealt with were small. With Big Datasets, however, it is almost impossible and extremely time-consuming to keep the data accurate all the time, for these datasets are generated by humans typing and clicking, or by sensors sensing. Billions of data points are generated every day, and it would be foolish to assume that all of them are accurate to a ‘T’. When dealing with these large datasets, we need to let go of old notions of accuracy and embrace the inherent messiness of the data, and of the world at large. We need to forgive, and be comfortable with, the inaccuracies in the data.
Inaccurate, however, does not mean incorrect. Think of the population of India. It would be almost impossible, or far too expensive in the current scenario, to know the exact population at any given time. Despite this, we are comfortable with the slight inaccuracy of the information available to us, for there is not much difference between saying that the population of India is 1,210,569,573 and saying it is 1.2 billion. We are fine with this estimate because, as the scale increases, knowing the exact number becomes far less important. Big Data is in some sense a representation of the world we live in. This world is intricate, complex, imperfect, imprecise and extremely messy. Accuracy is expensive and time-consuming. Big problems need to be approached with their inherent messiness in mind. We need to embrace the chaos.
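A quick sketch of why scale forgives messiness: assume some true value being measured, a handful of precise readings and a flood of noisy ones (all numbers invented for illustration), and compare the averages:

```python
import random

random.seed(7)
TRUE_VALUE = 30.0  # the quantity being measured; invented for illustration

# Ten careful, calibrated readings: small but clean data.
clean = [TRUE_VALUE + random.gauss(0, 0.1) for _ in range(10)]

# A million cheap, sloppy readings: big but messy data.
messy = [TRUE_VALUE + random.gauss(0, 5.0) for _ in range(1_000_000)]

# At scale, the noise averages out; the messy estimate is as usable
# as the clean one, at a fraction of the cost per reading.
print(f"clean mean: {sum(clean) / len(clean):.3f}")
print(f"messy mean: {sum(messy) / len(messy):.3f}")
```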
Link, don’t copy
Metadata is data about data. When you take a photograph of a landscape, the picture itself is the data; metadata is the additional information about it, such as the date, time, geo-location, the people in it, and so on. A large portion of this data is traditionally stored in a tabular structure, often referred to as a relational database. Think of an Excel table with each column representing a metadata type: a structured system of rows and columns. As the formats in which data is available keep improving, we keep adding new metadata, and each addition adds a column to the database. Old records, however, often do not contain this new metadata, so their cells are left empty, holding what is known as a null value. Each null cell consumes some minuscule space in the database and is largely redundant. Multiply that null cell over billions of data points and suddenly we aren’t being very efficient.

Relational databases underpin a major portion of the digital systems we use, and almost all online monetary transactions. A major drawback of this storage format is the redundancy it brings in. Suppose you construct a building on a plot of land that is mixed-use in nature, i.e. it is primarily residential but has shops and offices on the ground and first floors. Do you create a separate category for such a building, or do you make multiple entries for the same building in the database? Copying the same data multiple times not only wastes computing power but also requires additional storage space. Think billions of data points. Besides, how many categories can you create, and what if you decide to convert a portion of the building into a community space? Do you create yet another category? Imagine a single building having hundreds of use cases in the near future.
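A minimal sketch of the problem, using invented column and building names: a fixed schema forces null cells onto old records, and forces a mixed-use building into either a catch-all category or duplicated rows:

```python
# Fixed schema: every row must carry every column.
columns = ("building_id", "name", "use", "geo_location", "solar_panels")

rows = [
    # An old record predates the newer columns, so it carries nulls (None).
    (1, "Rose Apartments", "residential", None, None),
    # A mixed-use building: either invent a catch-all category...
    (2, "Unity Towers", "mixed", "28.61,77.20", True),
    # ...or duplicate the entire row once per use.
    (3, "Harbour House", "residential", "28.61,77.21", False),
    (3, "Harbour House", "commercial", "28.61,77.21", False),
]

# Multiply these null cells and duplicated rows over billions of
# records and the wasted space and computation add up quickly.
```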
The underlying value of Big Datasets lies in correlation, overlap and the subsequent generation of patterns. Rigid, schema-bound systems such as SQL databases work well, and much more efficiently, when the number of data points is small. However, in the context of Big Data, where millions of data points flow in every second, many of them adding to the underlying metadata, these rigid systems become highly redundant.
Although it is still possible to use these systems for Big Data, there is a simpler method of storing data points based on the idea of correlation rather than categorisation. The tabular databases we saw earlier are known as relational databases; this newer system is simply the opposite, so we call it a non-relational database. It is also sometimes referred to as NoSQL. Each data point acts as an independent unit. A common format is JavaScript Object Notation, or JSON, pronounced like the name ‘Jason’. It is a system based on tagging data points to each other rather than categorising them in a hierarchy. Tagging enables the creation of a node and a link, and each node can have multiple links. The advantage of each data point being an independent unit is that future data points can carry any amount of metadata without affecting the performance of the system. Theoretically, this system is infinitely scalable. A building can now have any number of use cases, and we can incorporate them without affecting any other building’s data, without having to categorise, simply by tagging or linking.
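Here is what such a self-contained data point might look like, sketched as JSON via Python (the field names are illustrative, not taken from any particular database):

```python
import json

# The same building as one independent document: uses are tags, links
# point to related nodes, and new metadata is just another key that
# older documents simply never mention.
building = {
    "name": "Harbour House",
    "uses": ["residential", "commercial", "community space"],
    "links": {"owner": "ACME Housing", "ward": "Ward 12"},
    "solar_panels": False,
}

print(json.dumps(building, indent=2))
```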
Traditional hierarchical systems vs Tagging
Twitter is a great example of links and tags. When you tweet using certain hashtags or tag other users, for example “happy new year @kanikakaul22 #2017 #newyear”, you are creating or adding to multiple nodes or categories such as ‘2017’ and ‘newyear’. The tweet above falls under the category ‘2017’ as well as ‘newyear’; it now has multiple identities depending on where you search for it from. Linking to the user (@kanikakaul22) is akin to a special type of category: a representation, on a public platform, of the relationship between you and the person you sent the tweet to. Again, tagging and linking so many data points can get messy, and this system does not appear as clean or organised as relational, tabular systems. But this seemingly chaotic system is, I believe, a representation of the natural chaos of the world. Things and people don’t fit in boxes. A person can be a programmer and an artist and a chef and a poet, all at the same time.
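A small Python sketch of the idea, using the example tweet from above: every hashtag or mention becomes a node, and each tweet links itself to all of its nodes at once:

```python
from collections import defaultdict

tweets = ["happy new year @kanikakaul22 #2017 #newyear"]

# Build the node -> linked-tweets index: one tweet, many identities.
index = defaultdict(list)
for tweet in tweets:
    for word in tweet.split():
        if word.startswith(("#", "@")):  # hashtags and user mentions
            index[word].append(tweet)

for node, linked in index.items():
    print(node, "->", linked)
```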
P.S. This post is the fifth part in a series. Here are links to the previous posts:
- Beneath this mask is Data — Part 1 (or how to make lemonade)
- Beneath this mask is Data — Part 2 (or Learning to code)
- Beneath this mask is Data — Part 3 (or Show me what is mine)
- Beneath this mask is Data — Part 4 (or Rise of the Open Data-ards)
References:
- Big Data by Kenneth Cukier and Viktor Mayer-Schönberger
