Ronen Korman is a former commander with the elite tech units of Israel’s army and Founder and Co-CEO at Metrolink.ai.
Some 95 percent of businesses say they struggle with processing complex data, such as unstructured records and multi-source flows, according to TechJury. At the same time, about 90 percent of all data out there is unstructured. That alone illustrates the data conundrum facing businesses today: data is growing more complex by the day, and to keep up with it, companies are forced to build, operate, and constantly upgrade sophisticated low-latency pipeline infrastructure.
Building even a simple Extract, Transform, Load (ETL) pipeline requires long hours of coding, several weeks of work on average. ETL also demands effort from multiple teams. First, data engineers must build data-collection tools and ensure data validity. Software engineers must then develop the pipelines that pull the data into warehouses or lakes for analysts, the end users, who need it aligned with the business question. If it isn't, as happens far too often, the process goes back to square one, bleeding the company's budget dry.
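To make those three stages concrete, here is a minimal sketch of an ETL pipeline in Python. The table, column names, and sample records are hypothetical illustrations, not any particular company's pipeline; a production version would add scheduling, error handling, and real source connectors.

```python
# Minimal ETL sketch: extract raw records, transform (validate and
# reshape), load into a warehouse table. Data and names are invented.
import csv
import io
import sqlite3

RAW_CSV = """order_id,amount,region
1001,250.00,EMEA
1002,,APAC
1003,99.50,EMEA
"""

def extract(raw: str) -> list[dict]:
    """Extract: parse raw records from a source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows: list[dict]) -> list[tuple]:
    """Transform: validate and reshape; drop rows with missing amounts."""
    return [
        (int(r["order_id"]), float(r["amount"]), r["region"])
        for r in rows
        if r["amount"]  # data-validity check handled by data engineers
    ]

def load(rows: list[tuple], conn: sqlite3.Connection) -> None:
    """Load: write the cleaned records into a warehouse table."""
    conn.execute("CREATE TABLE orders (order_id INT, amount REAL, region TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

Even this toy version shows the coupling the article describes: the transform step hard-codes assumptions about the business question, so any change to that question ripples back through every stage.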
The resulting workflow is linear and rigid: developers are usually forced to wait until data engineers are done, and analysts, for their part, have to wait for them both. Analysts cannot generate insights without enough data, which in turn requires a pipeline up and running. The result is that too many teams sit idle while expenses continue to tick up. Even worse, in more dynamic environments, data collected with so much effort and investment can simply become outdated before the set is even complete.
Furthermore, in my own experience, even the most business-minded tech enthusiasts are seldom excited about data. Developers would much rather focus on the product’s functionality than the boring data collection itself. Data engineers prefer handling data integrity, their core task, to the company’s informational plumbing. And as for data scientists, I imagine few of them get their fix spending hours loading and cleaning the data.
The flaws of this approach stretch beyond wasted time and high costs. Today's data-management philosophy is code-centric, meaning everyone involved, including the analysts, must be a coder. Data and software engineers write code to start capturing and preprocessing data. Then, analysts and data scientists must write code to fetch it and make the extra transformations needed for analysis or for training models. Even in the rare case where all of them are outstanding coders, the result can still be sluggish, because so much of the work runs in Python and Java, languages that generally trail compiled, lower-level code in raw data-processing speed.
Another implication is that legacy companies trying to tap data, the data immigrants, face an increasingly uphill struggle. They often lack the tech aptitude of their new-age counterparts, and the colossal price tag and time associated with developing data infrastructure only weigh them down further. By the same token, companies that want to tap new data sources have to go through the full process as well, which slows their progress.
Tougher still, hardly anyone is able to innovate with data, simply because even the most basic analysis requires substantial investment. Innovation takes time and requires all hands on deck, but with today's DataOps workflow, both are a luxury. Day-to-day tasks consume them, leaving data teams no room for experimentation or more creative ideas.
The problem comes down to the reliance on the ETL mantra. Building a rigid data infrastructure made up of pipelines tailored to very specific business questions with limited transformation abilities only perpetuates the conundrum at hand. It is heavy on resources and hampers productivity and innovation while failing to drive real progress.
To truly unlock their potential, companies looking to tap data must embrace a whole new design and management paradigm. Simplistic ETL just doesn't cut it. Businesses need data playgrounds: solutions that make the necessary processed data instantly accessible to those in the enterprise who need it.
At its core, a data playground is a strategy that moves the focus from code to the data itself, making it easily accessible to the relevant teams with little need for further development. It is rooted in collaborative design that fosters cooperation between teams instead of locking them into a rigid, linear sequence of tasks. Every team retains its overall function, but the process breaks the dependencies between them, letting each focus on its own tasks.
At the heart of the strategy is the playground itself, a software platform powering all of the company's data operations. Software engineers use this platform to develop Data Apps, which feature both extraction and transformation capabilities. Data engineers then use these Apps to produce validated data sources, ready to fuel analysis by data scientists, who can make extra transformations using an App's built-in functions if needed. The approach leaves the code-heavy back-end tasks to the engineers, while analysts are free to use the low-code or no-code front end to build new pipelines and transform data on the go.
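One way to picture this division of labor is with a small sketch: engineers register reusable building blocks once, and analysts compose them declaratively without writing new plumbing. The registry, step names, and sample data below are hypothetical illustrations of the pattern, not the Metrolink.ai platform itself.

```python
# Sketch of a "data playground": engineers register code-heavy building
# blocks (Data-App-style steps); analysts assemble them from a simple
# declarative spec. All names and data here are invented for illustration.
from typing import Callable

REGISTRY: dict[str, Callable] = {}

def data_app(name: str):
    """Engineers register extraction/transformation blocks once."""
    def register(fn: Callable) -> Callable:
        REGISTRY[name] = fn
        return fn
    return register

@data_app("source_orders")
def source_orders(_):
    # Stand-in for a validated data source produced by data engineers.
    return [{"region": "EMEA", "amount": 120.0},
            {"region": "APAC", "amount": 80.0},
            {"region": "EMEA", "amount": 30.5}]

@data_app("filter_region")
def filter_region(rows, region):
    return [r for r in rows if r["region"] == region]

@data_app("total_amount")
def total_amount(rows):
    return sum(r["amount"] for r in rows)

def run_flow(spec: list) -> object:
    """Analysts chain named steps, low-code style, with no new plumbing."""
    data = None
    for step in spec:
        name, kwargs = step[0], (step[1] if len(step) > 1 else {})
        data = REGISTRY[name](data, **kwargs)
    return data

# An analyst reconfigures the flow by editing the spec, not the code:
emea_total = run_flow([("source_orders",),
                       ("filter_region", {"region": "EMEA"}),
                       ("total_amount",)])
```

The point of the sketch is the decoupling: swapping the filter for a different region, or appending another registered step, changes the spec only, so the analyst never touches the engineers' code.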
The playground brings more than basic ETL to the table: it also lets analysts configure, redesign, and create a data flow on the go, within hours rather than weeks. And by maintaining a backlog of Data Apps and data sources, the company can run integrated analysis to streamline its data operations and workflows, enriching the playground offline.
Businesses looking to scale up with data need to understand that success is not just about collecting as much data as possible. It is just as much, if not more, about how you manage what you have collected, and whether you can make changes and amendments on the go without rebuilding the entire pipeline and waiting for more data. Granted, it's a careful balancing act. But to be flexible, innovative, and ultimately effective, businesses must manage their DataOps in a way that plays to employees' strengths and makes data easy for them to access, rather than gathering it for months and years on end to no avail.