2022 saw the data space grow by leaps and bounds. There was consolidation around analytic data warehouses like Snowflake and Redshift, SQL tools like dbt, and dashboarding tools like Looker and Mode.
More generally, there was an enhanced focus on governance, transparency, and data quality.
What’s in store for this year? Here are the top 9 things our team of data experts expects to see in 2023.
Data Reliability Engineering (DRE) aligns processes, tooling, and people to keep data, and the dashboards and ML models built on it, reliable. The term is inspired by Google's Site Reliability Engineering (SRE).
Typically, DRE is done by data engineers, data scientists, and analytics engineers. Historically, these roles haven't had the standard tools and processes that DevOps and engineering teams rely on.
That's why data reliability work still involves spot checks, late-night backfills, and hand-rolled SQL-into-Grafana monitoring.
Teams invested in DRE are borrowing from SRE and DevOps to build scalable, repeatable processes like incident management and quality monitoring.
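As a concrete sketch of what that borrowing looks like, here is a minimal freshness check of the kind a DRE practice turns into a standing, automated monitor instead of a spot check. The table name, `loaded_at` column, and two-hour SLO are illustrative assumptions, and any DB-API-style warehouse connection would do.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness check: assumes a DB-API-style `cursor` connected to
# the warehouse and a table with a `loaded_at` timestamp column.
FRESHNESS_SLO = timedelta(hours=2)  # example threshold, not from the article

def check_freshness(cursor, table: str) -> bool:
    cursor.execute(f"SELECT MAX(loaded_at) FROM {table}")
    (last_loaded,) = cursor.fetchone()
    lag = datetime.now(timezone.utc) - last_loaded
    if lag > FRESHNESS_SLO:
        # In a real DRE setup this would open an incident or page the on-call.
        print(f"ALERT: {table} is {lag} behind its freshness SLO ({FRESHNESS_SLO})")
        return False
    return True
```

Run on a schedule, a check like this replaces ad-hoc spot-checking with a repeatable signal that can feed incident management.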
Typically, data teams have run much like IT organizations: a request or question comes in, and an answer goes out. That makes data teams largely reactive, leaving them overwhelmed as one-off requests clog up the day-to-day.
Innovative data teams will move toward a different model this year. Instead of servicing ad-hoc requests, they’ll proactively build toward making better business decisions through data products.
With these changes in place, data teams will have a clearer product vision, a deeper customer understanding, higher revenue, and better business outcomes.
They will staff up a multidisciplinary team with engineers, analysts, designers, and technical writers, marketing the “features” they build just like product teams would.
What do we mean by "data product"? Broadly, any information people use to make decisions: data that flows between people, systems, and processes; the analyses that teams produce; and the analytics tools themselves.
Data governance has not historically been a priority for many teams. It covers things like the discovery of data assets, data history and lineage, and general context around data and table status.
As data models grow more complex, governance can no longer be ignored. Companies like LinkedIn, Lyft, Airbnb, Spotify, Netflix, and Uber have built their own in-house solutions.
It's likely that more will follow, iterating on open-source models. We predict stricter, more thorough data governance than ever before this year.
Data contracts are API-like agreements between data producers and data consumers. They help teams export high-quality, resilient data.
With data contracts, service owners decide which data gets exposed to consumers and expose it in a structured, well-defined way (similar to an API endpoint).
As a result, data quality shifts from being the responsibility of the data scientist/analyst to the responsibility of the software engineer.
Take a ride-sharing application, for example. Production microservices write information about each trip into the "rides", "payments", "customers", and "trip request" database tables. As the business runs promos and expands, the schemas evolve.
Without intervention, all of these production tables end up in the data warehouse. Any machine learning engineer or data engineer consuming those tables will have to make sense of everything and rewrite their transformations whenever a schema changes.
The paradigm shifts with data contracts. In this case, data analysts and scientists don’t consume near-raw tables in data warehouses. Instead, they consume from an API that has already munged the data to produce a human-readable event, like a “trip request."
The trip request's metadata comes attached (pricing, whether surge pricing applied, promo, payment details, and reviews). In 2023, more teams will adopt data contracts in order to consume data more efficiently.
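To make the idea concrete, here is one way a "trip request" contract might be written down as a versioned, typed event schema that the producing service owns. The field names and the dataclass approach are illustrative assumptions; teams also express contracts as Avro, Protobuf, or JSON Schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical "trip request" event contract, owned by the rides service.
# Field names and types are illustrative; the point is that consumers read
# this stable, versioned shape instead of raw production tables.
@dataclass(frozen=True)
class TripRequestV1:
    schema_version: str          # e.g. "1.0.0"; bumped on breaking changes
    trip_id: str
    customer_id: str
    requested_at: str            # ISO-8601 timestamp
    base_price_cents: int
    surge_applied: bool
    promo_code: Optional[str]
    payment_method: str
    review_score: Optional[int]  # filled in after the trip completes
```

Because the producing service owns this shape, schema evolution becomes an explicit, versioned change rather than a surprise in the warehouse.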
Currently, most data infrastructure relies on batch operations (polling and job scheduling, for example). In 2023, companies will build for use cases that need streaming/real-time infrastructure (process automation or operational decision-making, for example).
Snowflake has spearheaded this trend with its streams functionality, and other major data warehouses are moving in the same direction. BigQuery and Redshift, for example, both offer materialized views.
There are also startups building in the space. Meroxa offers change data capture from relational data stores and webhooks. Materialize is a Postgres-compatible data store that natively supports near-real-time materialized views.
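For a flavor of what consuming such a store looks like, here is a rough sketch against a Postgres-compatible endpoint like Materialize's, using psycopg2. The connection string, the trip_requests source, and the view definition are assumptions for illustration; the idea is that the view is kept up to date incrementally rather than rebuilt on a batch schedule.

```python
import psycopg2  # a Postgres-compatible store can be reached with a standard driver

# Connection details and the trip_requests source are illustrative placeholders.
conn = psycopg2.connect("postgresql://user:pass@materialize.example.com:6875/materialize")
conn.autocommit = True

with conn.cursor() as cur:
    # Define an incrementally maintained aggregate over incoming trip requests.
    cur.execute("""
        CREATE MATERIALIZED VIEW rides_per_minute AS
        SELECT date_trunc('minute', requested_at) AS minute, count(*) AS rides
        FROM trip_requests
        GROUP BY 1
    """)
    # Reads reflect new trip_requests rows without re-running a batch job.
    cur.execute("SELECT * FROM rides_per_minute ORDER BY minute DESC LIMIT 5")
    for minute, rides in cur.fetchall():
        print(minute, rides)
```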
Continual learning is the process of iterating on ML models after they are deployed to production.
Production data helps improve models as they change in the real world. Most machine learning models today are retrained on an ad-hoc basis. Continual learning, on the other hand, periodically retrains the models or retrains them after specific triggers (like performance degradation).
In 2023, continual learning will expand, as ML adopts data observability best practices. There will be an increased push to monitor tables in data warehouses, as well as direct user outcomes and feedback.
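A small sketch of the trigger idea, assuming placeholder hooks for live evaluation, training, and promotion rather than any specific framework:

```python
from datetime import datetime, timedelta

RETRAIN_EVERY = timedelta(days=7)   # assumed periodic cadence
ACCURACY_FLOOR = 0.90               # assumed degradation threshold

def should_retrain(last_trained: datetime, live_accuracy: float) -> bool:
    """Retrain on a schedule, or sooner if live performance degrades."""
    stale = datetime.utcnow() - last_trained > RETRAIN_EVERY
    degraded = live_accuracy < ACCURACY_FLOOR
    return stale or degraded

def continual_learning_step(model, last_trained, evaluate_live, load_fresh_data, train, promote):
    live_accuracy = evaluate_live(model)       # e.g. read from an observability table
    if should_retrain(last_trained, live_accuracy):
        candidate = train(load_fresh_data())   # retrain on recent production data
        promote(candidate)                     # e.g. shadow-deploy, then swap if it wins
```

The monitoring signals described above (warehouse tables, user outcomes, feedback) are what feed the `evaluate_live` step in practice.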
The "Extraction" part of ETL is handled by middlemen services like Fivetran and Stitch. They extract data from SaaS APIs (Salesforce, Shopify, LinkedIn, Zendesk) and put them into the data warehouse.
In 2023, some SaaS apps will change the current model by striking up direct partnerships with data warehouses to deliver their service data. As a result, SaaS apps will be more diligent about updating data partners on API changes.
Customers will see fewer data extraction errors and will likely spend less money.
Data analysts love SQL and largely use it to write transformations in data warehouses. But SQL isn’t ideal for all data processing. For example, ML model training and other complex transformation logic are more easily handled with Python.
Data warehouses will start supporting more languages (like Python) in their processing engines.
For example, Snowflake recently announced Snowpark, an API that lets you build data processing applications right in Snowflake, without moving data to the system where the application code runs.
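A rough Snowpark-style sketch of what that looks like, with made-up connection parameters and a hypothetical RIDES table; the DataFrame operations are pushed down and executed inside Snowflake:

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import avg, col

# Account, credentials, and the RIDES table/columns are placeholders.
connection_parameters = {"account": "...", "user": "...", "password": "...",
                         "warehouse": "...", "database": "...", "schema": "..."}
session = Session.builder.configs(connection_parameters).create()

completed = session.table("RIDES").filter(col("STATUS") == "completed")
avg_price_by_city = completed.group_by("CITY").agg(
    avg(col("PRICE_CENTS")).alias("AVG_PRICE_CENTS")
)
avg_price_by_city.show()  # the query executes in Snowflake, not on the client
```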
T-shaped monitoring tracks the fundamentals across all your data and applies deeper monitoring only to supercritical datasets, like those used for financial planning, machine learning models, or executive-level dashboards.
T-shaped monitoring is a philosophy that helps teams avoid a major data observability problem: noisy, low-value alerts. As data teams learn to prioritize their monitoring and map it directly to business outcomes, T-shaped monitoring will be a handy tool.
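One way to express the philosophy is a small config that applies cheap baseline checks everywhere and reserves expensive checks for a short critical list. The check and dataset names below are illustrative, not a particular tool's schema.

```python
# Cheap fundamentals applied to every table.
BASELINE_CHECKS = ["freshness", "row_count", "schema_change"]

# Deeper, more expensive checks layered on top for critical datasets only.
DEEP_CHECKS = BASELINE_CHECKS + [
    "null_rate_by_column",
    "distribution_drift",
    "business_rule_assertions",
]

CRITICAL_DATASETS = {
    "finance.revenue_forecast",   # financial planning
    "ml.churn_features",          # feeds a production model
    "exec.kpi_dashboard",         # executive-level reporting
}

def checks_for(dataset: str) -> list[str]:
    """Deep monitoring for critical datasets, fundamentals for the rest."""
    return DEEP_CHECKS if dataset in CRITICAL_DATASETS else BASELINE_CHECKS
```

Keeping the critical list short is what keeps the alert volume low and the remaining alerts meaningful.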