Saving the planet, one dataset at a time
Blue Sky Analytics is a big data and AI start-up that uses geospatial data to monitor environmental parameters. Our goal is to become the Bloomberg of Environmental data for environmental monitoring, ESG (environment, social, and governance) due-diligence and climate risk assessment.
In other words, we aim to be the go-to source for all environmental data to drive sustainable decision making. We believe that safeguarding the global market against climate risk and other climate-change-induced threats requires temporally and spatially continuous high-resolution datasets that inform us of pollution levels, water quality, emissions, fires, changes in soil composition, etc.
We asked ourselves, “what if there were a platform that presented environmental data the way we have access to financial data?” Sounds like an impossible dream, right? That’s why we need a technology stack that’s built to power our mission and a team too young to know what’s impossible (our average age is 24.8).
Our air quality and farm fire crisis solutions have won us MIT Solve (Healthy Cities Solutions) and the ‘Space Oscar’ from the European Space Agency, so there’s power in believing in and pursuing what others may deem impossible.
In this post, we’ll outline our technology, which handles historical as well as continuous data, and talk a bit about how we’ve architected our platform to give spatial-temporal context to pre-existing data, what we’ve learned along the way, and what’s next. In short, you’ll get a glimpse of how we are making sense of terabytes of raw data and turning them into environmental insights. Our hope? That you’ll be inspired to follow your dreams and build your own “impossible” projects.
What we do (and why we do it)
We started this journey by creating a high-resolution database for monitoring air pollution in India, called BreeZo, and we aim to cover surface water monitoring in 2020, land composition and urban informatics in 2021, and other environmental parameters in the coming years.
Currently, despite the alarming air pollution levels in India, monitoring is inadequate. There are around 200 government monitors across India, which are unequally distributed, and, as a result, large parts of the country go unmonitored.
Our products aim to bridge this data gap by using satellite data and public monitors to provide a more comprehensive picture of the air pollution crisis. We make our data available to customers via accessible APIs and platforms; our approach centers on ensuring public access (democratising high-resolution environmental information) and facilitating policy-making and enterprise decisions (allowing corporations and government entities to understand current environmental situations as they craft initiatives).
But, what we can’t monitor, we can’t solve. Climate change is a global challenge that we must come together to fix; it poses a huge risk and will have grave effects on public health, ecology and the financial system. Thus, it’s vital for governments, industries, and scientists to collaborate and come up with innovative solutions — leveraging AI, big data, and space technology — to prevent and mitigate negative outcomes.
That’s where we come in: we’re building a platform that delivers the high-quality data people need to make data-driven, sustainable decisions, such as how to transition from carbon-intensive economies to low-carbon approaches.
The Tech Stack: How we do it and why our time-series database matters to us
To create a large-scale environmental monitoring platform from geospatial data, we needed a time-series database (TSDB) that could handle enormous quantities of spatial time-series data. We looked at various options and landed on TimescaleDB.
The most common way to solve our “problem” is to use NoSQL databases, which can be treated as streams. However, that would mean writing our own logic to add spatial and temporal context to our data.
As we monitor environmental parameters using satellite data, the ability to handle and fill null values is especially critical. We had a lot of hacked-together ways to query data with null filling and deal with data irregularities, using multiple cron jobs for each source and code at the application layer to do spatial queries. Essentially, we were spending a lot of time managing our time-series data and not enough time analysing it.
For example, we wrote one script for null-filling that was very slow to begin with, and we had to write it in bash, separate from our main app, to make sure it was performant enough and didn’t block the main thread.
So, NoSQL databases weren’t enough. But, what could we use? We looked at options like Amazon Timestream, but found that, while it works for IoT data, it didn’t work for our scenario: we handle not just IoT, but satellite data as well.
For us, TimescaleDB stood out among TSDBs because it has its roots in Postgres and lets us run spatial queries on time-series data. The features that mattered most for our adoption were continuous aggregation, gap-filling of null values, time bucketing, and, most importantly, SQL support.
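To make the gap-filling and time-bucketing features a little more concrete, here is a minimal sketch against the measurements table shown later in this post. It assumes TimescaleDB is installed; the date range and the choice of the aod550 column are purely illustrative, and carrying the last observation forward with locf is just one possible fill strategy (interpolate is another), not necessarily the one we run in production.

```sql
-- Daily average AOD across all grid cells, with missing days filled in.
-- time_bucket_gapfill emits a row for every day in the WHERE range,
-- even for days with no observations (for example, full cloud cover).
SELECT
    time_bucket_gapfill('1 day', recorded_at) AS bucket,
    avg(aod550)       AS aod550,         -- NULL on days without data
    locf(avg(aod550)) AS aod550_filled   -- last observation carried forward
FROM measurements
WHERE recorded_at >= DATE '2020-01-01'   -- illustrative range (assumption)
  AND recorded_at <  DATE '2020-01-08'
GROUP BY bucket
ORDER BY bucket;
```

Because the start and end of the range are given as constants in the WHERE clause, the gapfill function can infer which buckets to generate without extra arguments.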
TimescaleDB has changed our development paradigm: it keeps everything ordered by time, and we can store, query, and analyze our temporal and spatial data using SQL (directly from the database).
We have a database table where we store all the pixels on a 10-kilometer grid for all of India, and we run spatial queries for individual states (taking into account cloud cover, which acts as NULL values, and averaging them over that time).
India is 3.3 million square kilometers, which, on a 10-kilometer grid (100 square kilometers per cell), comes out to 33,000 rows of daily emissions data. Now, we can run a spatial join and average it over time.
3.3 * 1,000,000 / (10 * 10) = 33,000
Measurements (raw satellite data) from multiple satellite sources, downsampled to a common denominator
-- Measurements table
CREATE TABLE measurements (
    u_wind      NUMERIC(10, 4),
    v_wind      NUMERIC(10, 4),
    albedo      NUMERIC(10, 4),
    aod469      NUMERIC(10, 4),
    aod550      NUMERIC(10, 4),
    aod670      NUMERIC(10, 4),
    aod865      NUMERIC(10, 4),
    aod1240     NUMERIC(10, 4),
    aod_s5p     NUMERIC(10, 4),
    blh         NUMERIC(10, 4),
    temperature NUMERIC(10, 4),
    recorded_at DATE,
    grid        GEOMETRY
);
Maps of districts, states, and regions
-- Enum for shape types
CREATE TYPE shapes_type AS ENUM ('Country', 'State', 'District', 'Region');

-- Create shape table
CREATE TABLE shapes (
    id    UUID NOT NULL,
    name  VARCHAR(255),
    type  shapes_type,
    shape GEOMETRY
);
So, how would someone actually apply or use this information? Let’s say we’re a District Magistrate who wants to set weekly KPIs for emissions in an area where farmers burn crops on a daily basis. To get a baseline to inform our decision, we’d want to know the average weekly emissions in each district over the last 3 months, focusing on those in our state. This is a pretty complex (and powerful) geospatial query, but it’s just one example of the type of analysis we do on a daily basis.
Crunching 33,000,330 data points
SELECT
    name,
    -- Aggregates all the points in one district
    json_agg(json_build_object(
        'datetime', datetime,
        'u_wind', u_wind, 'v_wind', v_wind, 'albedo', albedo,
        'aod469', aod469, 'aod550', aod550, 'aod670', aod670,
        'aod865', aod865, 'aod1240', aod1240, 'aod_s5p', aod_s5p,
        'temperature', temperature, 'blh', blh
    )) AS pollutants
FROM (
    -- Selects all the districts in Punjab state
    SELECT name, shape
    FROM "shapes"
    WHERE type = 'District'
      AND ST_Within(shape, (SELECT shape FROM "shapes" WHERE name = 'Punjab'))
) AS Districts
LEFT JOIN (
    -- Selects all the points in the last week, averaged daily, within Punjab
    SELECT
        time_bucket_gapfill('1 day', recorded_at, NOW() - interval '1 week', NOW()) AS datetime,
        grid,
        avg(u_wind) AS u_wind,
        avg(v_wind) AS v_wind,
        avg(albedo) AS albedo,
        avg(aod469) AS aod469,
        avg(aod550) AS aod550,
        avg(aod670) AS aod670,
        avg(aod865) AS aod865,
        avg(aod1240) AS aod1240,
        avg(aod_s5p) AS aod_s5p,
        avg(temperature) AS temperature,
        avg(blh) AS blh
    FROM "measurements"
    WHERE
        -- Get the points within 1 week
        recorded_at < NOW()
        AND recorded_at > NOW() - interval '1 week'
        -- Get the points only in Punjab
        AND ST_Within(grid, (SELECT shape FROM "shapes" WHERE name = 'Punjab'))
    GROUP BY grid, datetime
) AS Records
-- Join the points based on geometry
ON ST_Within(grid, Districts.shape)
-- Finally, group them together
GROUP BY name;
As mentioned previously, we collect an enormous amount of data from various sources, namely 1000+ ground monitors across India and satellite missions, like Sentinel 5P. To power our platform, we use TimescaleDB built-in functions to build chunks that allow us to store historical data in an easily accessible, scalable way. For example, we might have 12 values for March 1, 2014, but we only need one value for that day for historical analysis.
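As a rough sketch of how chunking and daily rollups might look in SQL, assuming TimescaleDB 2.x syntax and the measurements table defined above: create_hypertable partitions the table into time-based chunks, and a continuous aggregate keeps one averaged value per grid cell per day. The view name measurements_daily, the one-month chunk interval, the two columns chosen, and the policy offsets are all illustrative assumptions rather than our production settings.

```sql
-- Turn the plain table into a hypertable, partitioned into chunks by time.
-- The one-month chunk interval is an assumption for illustration; a table
-- that already holds rows would also need migrate_data => true.
SELECT create_hypertable('measurements', 'recorded_at',
                         chunk_time_interval => INTERVAL '1 month');

-- Hypothetical daily rollup: many readings per day collapse to one average
-- per grid cell.
CREATE MATERIALIZED VIEW measurements_daily
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 day', recorded_at) AS bucket,
    grid,
    avg(aod550)      AS aod550,
    avg(temperature) AS temperature
FROM measurements
GROUP BY bucket, grid
WITH NO DATA;

-- Refresh the rollup on a schedule (offsets are illustrative assumptions).
SELECT add_continuous_aggregate_policy('measurements_daily',
    start_offset      => INTERVAL '7 days',
    end_offset        => INTERVAL '1 day',
    schedule_interval => INTERVAL '1 day');
```

Historical analysis can then read from measurements_daily instead of scanning every raw reading.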
[Images: our setup when we started, and after the TimescaleDB transition]
Simply put, no more reading hundreds of raster files or writing custom scripts for each analysis; we just load the data into our database and use SQL to analyze it and find solutions. Compression of data and storage complexities are areas we no longer need to stress about.
Moreover, we don’t have to worry about storing data in a time-series fashion, or optimization on that time-series data. And, as mentioned earlier, we’re able to efficiently handle null filling, data aggregation, and time slicing, making our development process significantly less complicated.
All of this is to say that, if you’re building a data-intensive solution like we are or combining various types of data, TimescaleDB is a great place to start working with time-series data. It’s helped us expedite our mission of commanding the environmental data space, and it leaves plenty of room to grow: you can adopt it at your own pace and scale as your needs increase.