The world of data has changed dramatically over the past decade. Traditional databases, designed to store information in structured form, have given way to massive warehouses of unstructured data spread across servers in multiple locations. Not long ago, the landscape was dominated by monolithic systems from behemoths such as Oracle and IBM. For analysts and business users who need access to this data (and who doesn't?), that meant slow-moving systems that were incredibly difficult to manage.
The increasing complexity of these systems eventually drove the need for modern software stacks that could help organizations run complex applications while remaining cost-effective. The open source movement helped by dramatically lowering the cost of assembling complex applications, for example Elasticsearch for full-text search and PyTorch for modeling. Robust packaging and operations improved the usability, stability, and economics of these systems.
The Modern Data Stack (MDS), which has gained considerable traction over the past decade, builds on the open source movement: it is a collection of ideas, tools, and methodologies for building the enterprise data stack.
In the 2010s, we saw rapid adoption of open source tools within the MDS. After their initial success, however, many organizations' initiatives around these tools ran into challenges when it came to scaling them:
Points #1 and #2 have played a major role in raising stress levels across the industry and in limiting the talent available to adopt and use these technologies. We've seen a similar trend in the DevOps space, where the supply of developer talent has not kept up with the demand for new digital services. Tyler Jewell of Dell Technologies Capital has been quite vocal about this problem, which has led to high burnout and an average career span of less than 20 years for professional developers. He recently posted a thread with a deep dive into the complexity of the developer-led landscape, and we can't help but notice several parallels between his claims and the MLOps space.
Points #3 and #4 highlight the plight of today's data practitioners: as if solving problems weren't enough, they end up spending more time figuring out how to proceed than thinking about what actually needs to be done, or about the expected outcome.
We're seeing a shift in the data tools organizations use, driven by a growing recognition that many of them have no choice but to rely on third-party vendors for their infrastructure needs. This is due not only to budget constraints but to other constraints as well, such as data security and provenance.
In addition, there is increased demand for automated processes that let enterprises easily migrate workloads from one provider to another without disrupting operations or causing downtime. We're seeing the effects of this in industries like financial services, where data management is often critical to success (for example, at credit rating agencies).
As a result of all these as well as the challenges listed above, there have been several developments in the community:
Consolidation of tools and platforms, simpler platform designs, and the use of managed services are happening across the industry, stemming from businesses' need to cope with complexity. It's an exciting time to be part of this space, and I can't wait to see how the landscape evolves over the course of the year.
At Scribble Data (the company I co-founded), we are keenly aware of this evolution as it happens. We focus on one specific problem: feature engineering for advanced analytics and data science use cases. This problem space has steadily grown in importance and has evolved in ways consistent with the points above. With the right technology mix and solution focus, it is possible to align product value to use cases while achieving 5x faster time to value (TTV) for each use case.