Data is at the centre of Blue Sky Analytics. After all, we ARE building a catalogue of environmental datasets.
Working at Blue Sky can sometimes get quite hectic. From an outsider's perspective, it may seem like a team member is juggling several tasks at any given time. Even my colleagues and I, who make up the data science team at Blue Sky, are rarely tied down to a single kind of assignment.
Some days I am building PoCs (Proofs of Concept) in Jupyter notebooks, while on other days I am deploying those PoCs into production. Some days are amazingly productive, with my entire time devoted to the backend of our geospatial refinery, while on others I am left scratching my head over code broken by a single misplaced comma! But one thing is for sure - there is never a dull day at Blue Sky.
Being a data scientist at Blue Sky has taught me to embrace flexibility, keep an open mind, and solve complex problems with simple solutions. At the end of the day, I have learned that if you can achieve the desired result with a simple Python script or SQL query, do it!
As Occam's Razor states, "The simplest explanation is usually the right one".
If solving complex problems is not enough of an incentive, working at Blue Sky lets me work on climate change - a clear and present threat to humanity's collective well-being. I believe Earth observation & geospatial analysis will play a crucial role in the coming decade.
Geospatial data not only provides visual proof of what is happening around the globe but also links all kinds of physical, social, and economic indicators, helping us understand the past and present and anticipate the future.
Now that you know what I do at Blue Sky, let me share how we do it.
At Blue Sky, individuals from diverse backgrounds actively collaborate on various projects, sharing their sector expertise and varied experiences.
This level of diversity also demands a lot of coordination in a project's workflow. To achieve this, we follow a five-step workflow that runs through the entire project timeline: Scoping, Research & Development, Data Hunt, Coding & Deployment, and Generating Insights.
On any given project, a data scientist might be involved in any one of these five steps. Below is a preview of what they look like.
We start with discussions and brainstorming sessions to broadly define the product and the possible datasets that can be a part of the product. These collaborations happen asynchronously on Notion or synchronously on a Tandem call.
Next, we run the possible datasets through several feasibility tests, including preliminary R&D. This process focuses on developing and testing the methodology for building the product defined in the first step. It involves a thorough literature review and other research to understand the existing body of work around the product we are developing.
This is an important step in the grand scheme of things, as it helps us evaluate the viability of developing a product. It lets us answer critical questions about feasibility, data availability, and ground-truthing & validation. All of this requires extensive documentation, for which we again turn to Notion.
After understanding the science behind the problem, we start hunting for the different types of input data we need to develop an MVP (Minimum Viable Product). As a geospatial data company, satellite data makes up a large share of our input data, but it is not limited to that: we also work with other types, such as IoT sensor data and weather data. At times, researching and aggregating the right input data at the desired spatial & temporal resolution can be a tedious task.
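To give a flavour of what "aggregating at the desired temporal resolution" can mean in practice, here is a minimal pure-Python sketch that rolls hypothetical hourly sensor readings up to daily means. The function name and sample data are illustrative, not from our actual pipeline:

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

def daily_means(readings):
    """Aggregate (ISO-8601 timestamp, value) pairs into daily mean values."""
    buckets = defaultdict(list)
    for ts, value in readings:
        day = datetime.fromisoformat(ts).date()  # bucket by calendar day
        buckets[day].append(value)
    return {day: mean(vals) for day, vals in sorted(buckets.items())}

# Hypothetical hourly PM2.5 readings from a single IoT sensor
readings = [
    ("2021-03-01T00:00:00", 40.0),
    ("2021-03-01T12:00:00", 60.0),
    ("2021-03-02T06:00:00", 55.0),
]
daily = daily_means(readings)
```

In production this kind of resampling is usually done with a dataframe library, but the idea - bucket by the target resolution, then reduce - is the same.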
While hunting for data sources, one important thing to remember is that raw geospatial image files are notoriously large, making them hard to store and visualise (unless you like extremely slow-loading dashboards). That's why we store them as Cloud Optimized GeoTIFFs (COGs). To learn more about COGs, you can watch this video.
The next step is building a workflow that automates the data processing pipeline. The workflow generally chains standard GIS functions that turn the data into a standardised format - for example, reprojecting, merging, and clipping rasters. Our entire data workflow runs on AWS, which means a data scientist at Blue Sky also needs to be familiar with AWS.
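As a concrete (and deliberately simplified) example of the projecting step, here is how a single WGS84 longitude/latitude pair maps to spherical Web Mercator (EPSG:3857), the projection most web maps use. In a real pipeline you would reach for a library such as pyproj or GDAL rather than hand-rolled formulas:

```python
import math

EARTH_RADIUS = 6378137.0  # WGS84 semi-major axis, in metres

def lonlat_to_web_mercator(lon, lat):
    """Project lon/lat in degrees to Web Mercator (EPSG:3857) metres."""
    x = math.radians(lon) * EARTH_RADIUS
    y = math.log(math.tan(math.pi / 4 + math.radians(lat) / 2)) * EARTH_RADIUS
    return x, y

# New Delhi, roughly 77.21 E, 28.61 N (illustrative coordinates)
x, y = lonlat_to_web_mercator(77.21, 28.61)
```

Every raster or vector layer in a workflow gets run through a transform like this so that all datasets line up in a common coordinate reference system before merging or clipping.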
In terms of our data stack, the complete data backend is written in Python because of its robust ecosystem for handling different kinds of datasets (geospatial or otherwise). Besides Python, we use Docker to test code in different environments and YAML to write configuration files for AWS.
We also use GitHub to collaborate, track all our source code, and manage tasks.
Once everything is set up - meaning the data has been processed and cleaned for exploratory data analysis and modelling - it's time to put that data to use!
This is where we explore the data and apply statistical and machine learning models to derive valuable insights. We experiment extensively with our methods before settling on the best solution for each particular use case.
Through machine learning, we not only have the power to monitor ongoing events but also have the ability to anticipate future trends.
We also hold ourselves to the highest standards when it comes to model accuracy. Before deployment, every model goes through a number of cross-validation & refinement stages, helping to ensure high accuracy and flag any anomalies early on.
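To illustrate the idea behind cross-validation, here is a minimal pure-Python sketch of a k-fold loop: shuffle the data, hold out one fold at a time, train on the rest, and average the held-out scores. The constant-mean "model" and mean-absolute-error scorer are toy stand-ins, not our actual models:

```python
import random
from statistics import mean

def cross_validate(data, train_fn, score_fn, k=5, seed=42):
    """Return the mean held-out score over k shuffled train/validation splits."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    scores = []
    for i in range(k):
        holdout = [data[j] for j in folds[i]]
        train = [data[j] for f in range(k) if f != i for j in folds[f]]
        model = train_fn(train)
        scores.append(score_fn(model, holdout))
    return mean(scores)

# Toy example: a constant-mean "model" scored by mean absolute error
train_fn = lambda rows: mean(y for _, y in rows)
score_fn = lambda m, rows: mean(abs(y - m) for _, y in rows)
data = [(x, 2 * x) for x in range(20)]
error = cross_validate(data, train_fn, score_fn, k=5)
```

A model whose held-out error stays stable across folds is far less likely to fall apart once it meets fresh data in production.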
And finally, we reach the end of the tunnel with a product that can help change the world for the better. We currently have two data families - BreeZo and Zuri - that provide real-time insights on air quality and fire emissions respectively.
At Blue Sky, the plan is to fight climate change one product at a time.