When new things come along, I like to take a critical view of them and see where the problems might be. It’s important to be honest in your evaluations, and try to anticipate where you are going to run into problems. I like to sometimes write thought pieces on those critical views, and that’s what this article is about.
“Data Mesh” has been the buzzword du jour in the big data, data lake, and data warehouse space for the past year, although the idea has been bouncing around for a few years now. The term was coined by Zhamak Dehghani, whose book on the topic came out around mid-2022. The description reads in part:
“author Zhamak Dehghani introduces data mesh, a decentralized sociotechnical paradigm drawn from modern distributed architecture that provides a new approach to sourcing, sharing, accessing, and managing analytical data at scale.”
I haven’t read the book; at over $50 (list price $70), it’s too rich for my blood. But I’ve watched a lot of webinars with her and others discussing the topic, so let’s dig in.
At this stage, data mesh is really more of a philosophy than available technology. To briefly describe the philosophy, the idea is that data creators would essentially package and publish their data for consumers to subscribe to. Kind of reminds me of the Kafka pub/sub model. Zhamak describes it in this article as:
1) domain-oriented decentralized data ownership and architecture
2) data as a product
3) self-serve data infrastructure as a platform
4) federated computational governance
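To make the pub/sub analogy concrete, here is a minimal, purely illustrative sketch in Python. Everything here is hypothetical (this is not a real data mesh API or the Kafka client): a domain team publishes a named data product, and any consumer that has subscribed receives it.

```python
from collections import defaultdict

class DataProductHub:
    """Toy illustration of the data mesh idea: domain teams publish
    datasets as 'products'; consumers subscribe by product name.
    All names here are made up for illustration."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, product_name, callback):
        # A consumer registers interest in a domain's data product.
        self.subscribers[product_name].append(callback)

    def publish(self, product_name, records):
        # The owning domain team pushes a new batch of its product;
        # every subscriber receives the same records.
        for callback in self.subscribers[product_name]:
            callback(records)

hub = DataProductHub()
received = []
hub.subscribe("orders.daily", received.extend)
hub.publish("orders.daily", [{"order_id": 1, "total": 42.5}])
print(received)  # [{'order_id': 1, 'total': 42.5}]
```

The point of the sketch is just the shape of the interaction: ownership stays with the publishing domain, and consumers discover and subscribe rather than reaching into someone else's database.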
I appreciate that Zhamak has given this so much thought and acted on it. However, I think she is ignoring something (maybe I’ve missed her talking about it): how did we get to the point where we’re thinking about this? I think there are a couple of main contributors, and one goes much further back than the other.
1) Many of the systems we are using today like Facebook or Uber or one of these hyper-scale types of companies, were often developed quickly. They were building something that hadn’t necessarily been done before, so they were anticipating behaviors without knowing exactly what would resonate.
The speed at which the systems had to be developed also meant that planning was more of a seat-of-the-pants activity. Every group was building its part, often in isolation, with different people who had different favorite tools.
Going back to MySpace, if I recall correctly, they wrote the original version in less than a month. That led to real spaghetti systems, and rather than rewrite them, teams developed other tools to manage the mess.
That gave us things like Hudi, Presto, etc., so the initial problem was a lack of planning.
2) The rise of cloud computing, while super convenient, has created an entire ecosystem around keeping your costs down. The data lake is in part a response to that: since storage is cheaper than compute, keeping data in cheap object storage rather than a running database lowers your costs.
Data egress fees are also a big deal, so you want to push down your queries so that as much as possible is resolved before the data comes back to you. Leaving data in the lake gave rise to table formats like Hudi, Iceberg, and Delta Lake, and indexing systems like Varada.
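As a back-of-the-envelope illustration of why pushdown matters, here is a hedged Python sketch (synthetic data, made-up function names) comparing how many rows cross the wire, and thus incur egress fees, with and without filtering at the source.

```python
# Hypothetical sketch of why query pushdown matters for egress costs:
# filter at the source (inside the cloud provider) instead of shipping
# every row back and filtering locally.

rows = [{"region": "EU", "amount": i} for i in range(10_000)] + \
       [{"region": "US", "amount": i} for i in range(100)]

def scan_without_pushdown(rows):
    # Naive: transfer everything, filter locally.
    # You pay egress on all 10,100 rows.
    transferred = list(rows)
    return [r for r in transferred if r["region"] == "US"], len(transferred)

def scan_with_pushdown(rows, predicate):
    # Pushdown: the storage/query layer applies the predicate before
    # the data leaves the provider. You pay egress on 100 rows.
    transferred = [r for r in rows if predicate(r)]
    return transferred, len(transferred)

_, moved_naive = scan_without_pushdown(rows)
_, moved_pushed = scan_with_pushdown(rows, lambda r: r["region"] == "US")
print(moved_naive, moved_pushed)  # 10100 100
```

Both scans return the same answer; the only difference is where the filter runs, which is exactly what table formats and pushdown-capable engines are optimizing for.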
So, now we have another mess to deal with and are inventing clever tools to work with it instead of rewriting the core systems.
And that is the rub right there. To move to a data mesh philosophy, you have to rewrite everything. No one rewrote it to deal with the original mess; they came up with clever technology to manage the messes, so are they going to rewrite it for this philosophy? Probably not.
I think the closest you get at the moment would be her point 4, “federated computational governance”. I see Starburst talking a lot about data mesh, and they basically satisfy that point.
The company was founded by the inventors of Presto at Facebook, which they have since evolved into Trino. A big advantage of their tech is the data connectors for federated queries: you can join all sorts of things, all over the place, on-prem, in the cloud, you name it.
Sounds like a mesh to me, and I think that’s as close as you are going to get in the real world.
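To illustrate what a federated join does conceptually, here is a toy Python sketch (made-up catalogs and data, not the Trino API): rows from two independent sources, imagine one on-prem and one in the cloud, joined in the query layer with a simple hash join rather than being loaded into a single warehouse first.

```python
# Toy sketch of a federated join: two independent "catalogs" joined
# in the query layer. Names and data are invented for illustration.

onprem_customers = [
    {"customer_id": 1, "name": "Acme"},
    {"customer_id": 2, "name": "Globex"},
]
cloud_orders = [
    {"order_id": 10, "customer_id": 1, "total": 99.0},
    {"order_id": 11, "customer_id": 2, "total": 45.0},
    {"order_id": 12, "customer_id": 1, "total": 12.5},
]

def federated_join(left, right, key):
    # Hash join: build a lookup on one source, probe with the other.
    index = {row[key]: row for row in left}
    return [{**index[r[key]], **r} for r in right if r[key] in index]

joined = federated_join(onprem_customers, cloud_orders, "customer_id")
print(len(joined))  # 3
```

A real engine adds connectors, pushdown, and cost-based planning on top, but the core idea is the same: the data stays where it lives, and only the join happens centrally.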
Let’s be honest though: most companies aren’t massive organizations with high-velocity data and a need for near real-time analysis.
They don’t have sophisticated data analysts going around to company-published data stores to find what they need. And the companies that are in that space are not in a position to rewrite their systems.
It’s possible that you are building something new and want to design around that concept, but again, there aren’t necessarily the tools to do it.
I think you should spend your time doing a really solid job of designing your system to begin with. Understand it as best you can and take the time to build it so it will be extensible, flexible, and reliable.
It’s worth being aware of what is going on in the space, but don’t overcomplicate your systems with a bunch of black boxes that can fail and no one understands.