Geospatial data has always been the biggest of big data, and the deluge is not just from sensors. Each day, the volume of personal location data generated exceeds that of all data created by social media, a fact that surprises many people. If you believe even a fraction of the hype around smart sensors, mobile, and autonomous platforms, then location-aware data volumes will soon grow rapidly beyond the exabytes companies already collect.
So where are all the big data platforms for analyzing geospatial, sensor, and location data?
They largely do not exist.
It turns out that analyzing the physical world requires different computer science than analyzing the virtual world. And Hadoop, Spark, and myriad other big data platforms were purpose-built for the requirements of the virtual world. Still, billions of dollars are being invested in markets where the vision is predicated on the ability to do spatiotemporal analytics at massive scales. Few people have noticed the dearth of capable platforms.
The Unfortunate Web Bias Of Big Data
Big data platforms were originally developed to make sense of the web. The web at its core is a collection of documents containing links to other documents. For the purposes of most web analytics, a number is assigned to each document, and also to each link or word or image contained within a document, uniquely identifying each of these entities so that they can be counted and relationships sorted. Some of the largest companies in the world were built around their ability to collect these numbers and analyze the relationships among them.
Geospatial and sensor data models, on the other hand, are not reducible to numbers in this way. Their essential data relationships are built around shapes. The name “Hurricane Sandy” can be reduced to a number but the representation of its physical manifestation is a complex shape that changes and moves in space and time. To analyze this data, we must be able to quickly and directly analyze complex relationships between moving, changing shapes. Even when using data that does not have shape per se, like the coordinates of my location, analysis often involves constructing shapes from sets of those coordinates, such as my path through a city.
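To make the idea of constructing shapes from coordinates concrete, here is a minimal Python sketch. Everything in it is a hypothetical illustration: the sample fixes, the rectangular region, and the helper names `segment_hits_box` and `path_crosses_box` do not come from any real platform, and coordinates are treated as planar, which is only reasonable over short distances. It shows that a question about a path (did I pass through this area?) cannot be answered by inspecting the individual location fixes alone.

```python
def segment_hits_box(p, q, box):
    """Return True if the segment p->q touches the axis-aligned box.

    box is (xmin, ymin, xmax, ymax). Uses Liang-Barsky style clipping:
    the segment is p + t*(q - p) for t in [0, 1], and each axis narrows
    the range of t that lies inside the box.
    """
    xmin, ymin, xmax, ymax = box
    t0, t1 = 0.0, 1.0
    axes = (
        (q[0] - p[0], xmin, xmax, p[0]),
        (q[1] - p[1], ymin, ymax, p[1]),
    )
    for delta, lo, hi, start in axes:
        if delta == 0:
            # Segment is parallel to this axis: it must already lie in range.
            if start < lo or start > hi:
                return False
            continue
        ta, tb = (lo - start) / delta, (hi - start) / delta
        if ta > tb:
            ta, tb = tb, ta
        t0, t1 = max(t0, ta), min(t1, tb)
        if t0 > t1:
            return False
    return True


def point_in_box(p, box):
    xmin, ymin, xmax, ymax = box
    return xmin <= p[0] <= xmax and ymin <= p[1] <= ymax


def path_crosses_box(points, box):
    """Treat consecutive fixes as a polyline and test it against the box."""
    return any(segment_hits_box(a, b, box) for a, b in zip(points, points[1:]))


# Hypothetical location fixes and a hypothetical region of interest.
fixes = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0)]
region = (1.0, -1.0, 2.0, 1.0)

any_fix_inside = any(point_in_box(p, region) for p in fixes)
path_inside = path_crosses_box(fixes, region)
print(any_fix_inside, path_inside)
```

No single fix falls inside the region, yet the path built from those fixes crosses it. The answer only exists once the coordinates are assembled into a shape.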
And therein lies the problem. Platforms created to analyze the Internet only work well for data models where the basic elements can be represented by unique numbers. In fact, at the time platforms like Hadoop were invented, the computer science did not exist to create a platform for analyzing shape relationships at massive scales. When the computer science was developed a few years later, it was shown that the ability to do this type of shape analysis could not be retrofitted onto platforms that were not purpose-built for it. Inertia is a powerful thing and the big data ecosystem has not adapted.
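One way to see the mismatch is the following Python sketch. It is an illustrative assumption, not the specific computer science result referred to above: the grid scheme, the `CELL` size, and the helper names `cell_of` and `cells_of_box` are all hypothetical. A point maps cleanly to one key, which is what key-oriented platforms expect, while even a simple rectangle maps to many keys at once.

```python
import math

CELL = 1.0  # hypothetical grid cell size, in the same units as the coordinates


def cell_of(x, y):
    """Map a point to the single grid cell (key) that contains it."""
    return (math.floor(x / CELL), math.floor(y / CELL))


def cells_of_box(xmin, ymin, xmax, ymax):
    """Map an axis-aligned rectangle to every grid cell it overlaps."""
    cx0, cy0 = cell_of(xmin, ymin)
    cx1, cy1 = cell_of(xmax, ymax)
    return {
        (cx, cy)
        for cx in range(cx0, cx1 + 1)
        for cy in range(cy0, cy1 + 1)
    }


point_key = cell_of(2.5, 3.5)                   # exactly one key
shape_keys = cells_of_box(1.2, 0.5, 3.8, 2.1)   # one shape, many keys
print(point_key, len(shape_keys))
```

Under this assumed scheme the rectangle has no single number that identifies where it lives, so any query touching it has to consult several keys, and a larger or moving shape touches more of them.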
Bringing Big Data Into The Sensor Age
Addressing the underlying computer science problem, which was solved in 2007, is necessary but not sufficient. I’ve been designing infrastructure for geospatial analytics for a long time, starting with real-time sensor layers for Google Earth a decade ago. There are two additional requirements, the lack of which has killed as many geospatial and sensor-driven applications as the absence of a suitable platform architecture.
First, you need a database engine underneath the platform that can continuously index and store high-velocity sensor data at wire speed while being queried in real time. This performance is available in commercial disk-based systems running on inexpensive hardware but is rare in open source, and in-memory platforms are too small for the data volume. Open source could do much better here, and the shortfall adversely affects the economics.
Second, you need a geometry engine (the software that computes the mathematical relationships between shapes) with the correctness, precision, and performance required for large-scale geospatial analytics, rather than the looser requirements of making maps. As my own experience unfortunately shows, ignoring this leads to an astounding range of insidious analysis bugs. This is not a trivial challenge, and existing implementations in GIS databases are demonstrably deficient. We built an analytics-grade geometry engine at SpaceCurve, possibly the only one that exists, but it took a team of experts to research and develop it.
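A small Python sketch shows how a shortcut that is harmless for drawing a map can quietly corrupt an analysis. It is only an illustration under stated assumptions: the function names are hypothetical, the Earth is modeled as a sphere with a mean radius of 6371 km (itself an approximation), and this is not a description of how any particular geometry engine works. It compares the great-circle distance from the standard haversine formula with a naive calculation that treats degrees of longitude and latitude as a flat grid.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius; a sphere is itself a simplification


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points on a sphere, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def naive_planar_km(lat1, lon1, lat2, lon2):
    """Treat degrees as a flat grid: every degree is the same length everywhere."""
    km_per_degree = 2 * math.pi * EARTH_RADIUS_KM / 360
    return km_per_degree * math.hypot(lat2 - lat1, lon2 - lon1)


# One degree of longitude, measured at the equator and at 60 degrees north.
for lat in (0.0, 60.0):
    great = haversine_km(lat, 0.0, lat, 1.0)
    naive = naive_planar_km(lat, 0.0, lat, 1.0)
    print(f"lat {lat:>4}: great-circle {great:.1f} km, naive {naive:.1f} km")
```

At the equator the two methods agree. At 60 degrees north the naive figure is roughly twice the great-circle distance, because meridians converge toward the poles. A distance filter or join built on the naive version would return wrong results without raising any error, which is the kind of insidious bug described above.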
The good news is that it is possible to create a big data platform that is purpose-built for analyzing the physical world, as products like SpaceCurve ably demonstrate. However, I cannot point to a single open source big data platform, mature or otherwise, that adequately addresses even one of the three issues raised. Yet many new companies are being created under the assumption that scalable spatiotemporal analytics is a hurdle that can be cleared with little effort using tools they already know. In fact, there is an enormous gulf between what is required and what popular big data platforms can actually do, and that gulf will rapidly become apparent.