Enterprise data solutions can quickly become expensive. NetApp reports that organizations experience 30% data growth every 12 months and risk seeing their solutions fail if this growth isn’t managed.
Gartner echoes these concerns, stating that data solution costs tend to be an afterthought, not addressed until they’re already a problem.
As a data-driven organization, you already know this and understand the critical need for cost optimization in your data solutions.
So what can you do? In this article, I’ll look at the factors that increase costs, what you can do to contain them, and how stream processing can be a major source of savings.
Through it all, I’ll help you create an effective plan that leads to operational efficiencies in data solutions.
First, let’s look at the factors that contribute to the increasing costs of data solutions. You probably already face many, if not all, of them.
What can enterprises do to keep these costs down? There are many strategies. Let’s focus on one in particular, the solution we’ll look at in more detail in this article: stream processing.
Stream processing, as the name suggests, is a data management strategy that involves ingesting (and acting on) continuous data streams—such as the user’s clickstream journey, sensor data, or sentiment from a social media feed—in real time.
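To make the idea concrete, here’s a minimal, framework-free Go sketch that treats a channel as a continuous event stream and acts on each event the moment it arrives. The event type and sample data are hypothetical:

package main

import (
	"fmt"
	"time"
)

// ClickEvent is a hypothetical clickstream record.
type ClickEvent struct {
	UserID string
	Page   string
	At     time.Time
}

func main() {
	events := make(chan ClickEvent)

	// Producer goroutine: stands in for a real source such as a message queue.
	go func() {
		defer close(events)
		for _, page := range []string{"/home", "/pricing", "/signup"} {
			events <- ClickEvent{UserID: "u-123", Page: page, At: time.Now()}
		}
	}()

	// Consumer: act on each event as soon as it arrives,
	// instead of waiting for a completed batch.
	for e := range events {
		fmt.Printf("user %s visited %s at %s\n", e.UserID, e.Page, e.At.Format(time.RFC3339))
	}
}

The key property is that processing happens continuously as data flows in, rather than on periodically collected batches.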
By using stream processing, organizations can greatly reduce their data solution costs and see increased ROI over common data management tools such as Splunk, Apache Spark, and others.
Stream processing solves many of the issues encountered with other data solutions, including rising costs.
Stream processing is powerful and opens up many use cases, but managing the implementation is no easy task! Building stream processing from scratch brings its own set of challenges.
We can solve these challenges by using a stream processing framework, such as Meroxa’s Turbine. A stream processing framework connects to an upstream resource, streams its data in real time, processes it, and then sends it on to a destination.
Meroxa has a good write-up on what this looks like at a high level. For a closer look at how we might use Go to interact with Turbine, consider this code from an example application:
func (a App) Run(v turbine.Turbine) error {
	// Connect to the upstream PostgreSQL resource.
	source, err := v.Resources("demopg")
	if err != nil {
		return err
	}

	// A collection of records, which can't be inspected directly.
	records, err := source.Records("user_activity", nil)
	if err != nil {
		return err
	}

	// The second return value is the dead-letter queue.
	result, _ := v.Process(records, Anonymize{})

	// Connect to the downstream S3 resource.
	dest, err := v.Resources("s3")
	if err != nil {
		return err
	}

	err = dest.Write(result, "data-app-archive")
	if err != nil {
		return err
	}
	return nil
}

func (f Anonymize) Process(records []turbine.Record) []turbine.Record {
	for i, r := range records {
		// Overwrite each email with its hash so the same address
		// always maps to the same anonymized value.
		hashedEmail := consistentHash(r.Payload.Get("email").(string))
		err := r.Payload.Set("email", hashedEmail)
		if err != nil {
			log.Println("error setting value: ", err)
			break
		}
		records[i] = r
	}
	return records
}

func consistentHash(s string) string {
	// MD5 is used here for consistent anonymization, not for security.
	h := md5.Sum([]byte(s))
	return hex.EncodeToString(h[:])
}
In the above code, we see the following steps in action:
1. Create an upstream source (named source) from a PostgreSQL database (named demopg).
2. Fetch records from the user_activity table on that upstream source.
3. Call v.Process, which performs the stream processing. This process iterates through the list of records and overwrites the email of each record with a consistent hash.
4. Create a downstream destination (named dest) using AWS S3.
5. Write the resulting stream-processed records to the destination.
As we can see, you only need a little bit of code for Turbine to process new records and stream them to the destination.
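Because the processing logic is plain Go, you can also sanity-check it in isolation. Here’s a minimal sketch of a test for consistentHash, placed alongside the app code (the test name and assertions are my own, not part of Meroxa’s example):

package main

import "testing"

// TestConsistentHash verifies the two properties Anonymize relies on:
// the digest is deterministic, and it masks the original value.
func TestConsistentHash(t *testing.T) {
	email := "jane@example.com"

	first := consistentHash(email)
	second := consistentHash(email)

	if first != second {
		t.Fatalf("expected identical digests, got %q and %q", first, second)
	}
	if first == email {
		t.Fatal("digest should not equal the original email")
	}
	// MD5 produces 16 bytes, which hex-encodes to 32 characters.
	if len(first) != 32 {
		t.Fatalf("expected a 32-character hex digest, got %d characters", len(first))
	}
}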
Using a framework such as Turbine for stream processing brings several benefits, including:
It integrates any source database with any destination by leveraging change data capture (CDC), receiving real-time change streams and publishing them downstream. The Meroxa platform handles the data transformation or processing logic and its orchestration through the Turbine app, sparing developers from worrisome schema management and scalability issues.
Ultimately, this allows developers to focus their time and effort on core business needs. It also reduces the migration costs associated with standing up additional infrastructure.
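To give a feel for what CDC delivers, here’s a generic sketch of the shape of a change event. This illustrates the CDC pattern in general; it is not Meroxa’s actual record format:

package example

import "time"

// ChangeEvent is a generic illustration of a CDC record: each row-level
// change in the source database arrives as an event describing the
// operation and the row's state before and after it.
type ChangeEvent struct {
	Operation string                 // "insert", "update", or "delete"
	Table     string                 // source table, e.g. "user_activity"
	Before    map[string]interface{} // row state before the change (nil for an insert)
	After     map[string]interface{} // row state after the change (nil for a delete)
	Committed time.Time              // when the change was committed upstream
}

A framework that consumes events like these can keep the destination in sync with the source continuously, without repeated full-table scans.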
Data-driven organizations are betting big on data assets to fuel their transformation journeys. But they need better, and more cost-efficient, ways to manage their fast-growing datasets.
Stream processing provides the necessary benefits: improved data quality, faster decision-making, and more. But it requires the right plan and the right stream processing platform to handle the data efficiently. By following the tips in this article, you should be well on your way!