Below you can find an article by my colleague and Big Data expert Boris Trofimov.
Big Data solutions and their accessibility for small and medium companies.
One of the biggest myths that still remains is that only big companies can afford Big Data driven solutions, that they are appropriate only for massive data volumes, and that they cost a fortune. That is no longer true: several revolutions have changed this state of mind.
The first revolution is related to maturity and quality. It is no secret that ten years ago big data technologies required considerable effort to make things work and to make all the pieces fit together.
Picture 1. Typical stages that growing technologies pass through
There were countless stories in the past from developers who wasted 80% of their time trying to overcome silly glitches with Spark, Hadoop, Kafka, or others. Nowadays these technologies have become sufficiently reliable: they have outgrown their teething problems and learned how to work with each other.
Today there is a much bigger chance of hitting an infrastructure outage than an internal bug. Even infrastructure issues can be tolerated gracefully in most cases, as most big data processing frameworks are designed to be fault-tolerant. In addition, those technologies provide stable, powerful, and simple abstractions over computations, which allow developers to focus on the business side of development.
The second revolution is happening right now -- myriads of open source and proprietary technologies have been invented in recent years -- Apache Pinot, Delta Lake, Hudi, Presto, Clickhouse, Snowflake, Upsolver, Serverless, and many more. The creative energy and ideas of thousands of developers have been converted into bold and outstanding solutions with great motivating synergy around them.
Picture 2. Big Data technology stack
Let’s address a typical analytical data platform (ADP). It consists of four major tiers:
Every tier has sufficient alternatives for any taste and requirement. Half of those technologies appeared within the last 5 years.
The important thing is that these technologies are developed with the intention of being compatible with each other. For instance, a typical low-cost small ADP might consist of Apache Spark as the basis of the processing components, AWS S3 or similar as a Data Lake, Clickhouse as a Warehouse and OLAP engine for low-latency queries, and Grafana for dashboarding (see Pic. 3).
Picture 3. Typical low-cost small ADP
More complex ADPs with stronger guarantees could be composed in a different way. For instance, introducing Apache Hudi with S3 as a Data Warehouse can support a much bigger scale, while Clickhouse remains in place for low-latency access to aggregated data (see Pic. 4).
Picture 4. ADP on a bigger scale with stronger guarantees
The third revolution is driven by the cloud. Cloud services have become real game changers. They offer Big Data as a ready-to-use platform (Big Data as a Service), allowing developers to focus on feature development and letting the cloud take care of the infrastructure.
Picture 5 shows another example of an ADP, one that leverages serverless technologies from the storage and processing tiers through to the presentation tier. It follows the same design ideas, with the technologies replaced by AWS managed services.
Picture 5. Typical low-cost serverless ADP
It is worth saying that AWS is just an example here; the same ADP could be built on top of any other cloud provider.
Developers can choose the particular technologies and the degree of serverless. The more serverless a solution is, the more composable it can be; the downside is that it also becomes more vendor-locked. Solutions locked to a particular cloud provider and serverless stack can have a short time to market. A wise choice among serverless technologies can make the solution cost-effective.
This option, though, is not very useful for startups: they tend to rely on the typical $100K in cloud credits, and jumping between AWS, GCP and Azure is quite an ordinary lifestyle for them. This has to be clarified in advance, and more cloud-agnostic technologies have to be proposed instead.
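One common way to keep a solution cloud-agnostic is to hide the storage tier behind a small interface, so that moving between providers means swapping one backend class rather than rewriting the pipeline. The sketch below is only an illustration under assumptions: `ObjectStore`, `LocalObjectStore` and `archive_events` are hypothetical names, and the local filesystem stands in for a real object store such as S3.

```python
from pathlib import Path
from typing import Protocol


class ObjectStore(Protocol):
    """Hypothetical minimal storage interface the pipeline depends on."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class LocalObjectStore:
    """Stand-in backend that keeps objects on the local filesystem.

    A backend for S3, GCS or Azure Blob Storage would expose the same
    two methods, so the pipeline code below would not change.
    """

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def put(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


def archive_events(store: ObjectStore, day: str, events: list[str]) -> str:
    """Write one day's events as a newline-delimited object; return its key."""
    key = f"events/{day}.txt"
    store.put(key, "\n".join(events).encode("utf-8"))
    return key
```

Because `archive_events` only knows about the interface, the choice of provider stays a deployment decision rather than a design constraint.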
Usually, engineers distinguish the following costs:
Let’s address them one by one.
Cloud technologies definitely simplify engineering efforts. There are several areas where they have a positive impact.
The first one is architecture and design decisions. A serverless stack provides a rich set of patterns and reusable components, which give a solid and consistent foundation for the solution's architecture.
There is only one concern that might slow down the design stage: big data technologies are distributed by nature, so solutions built on them must be designed with possible failures and outages in mind in order to ensure data availability and consistency. As a bonus, such solutions require less effort to scale out.
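To make "designed with failures in mind" more concrete, here is a minimal sketch of two widely used techniques: retrying transient failures and making writes idempotent, so that a retried batch does not duplicate data. All names here (`TransientError`, `with_retries`, `IdempotentSink`, `demo`) are hypothetical, and the in-memory dictionary merely stands in for a warehouse table.

```python
import time


class TransientError(Exception):
    """Hypothetical error type for failures worth retrying (timeouts, outages)."""


def with_retries(action, attempts=3, base_delay=0.0):
    """Call action(); on TransientError, retry with exponential backoff."""
    for attempt in range(attempts):
        try:
            return action()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            time.sleep(base_delay * 2 ** attempt)


class IdempotentSink:
    """In-memory stand-in for a warehouse table keyed by batch id."""

    def __init__(self):
        self.batches = {}

    def write(self, batch_id, rows):
        # Overwriting by batch id makes a repeated delivery harmless.
        self.batches[batch_id] = list(rows)

    def total_rows(self):
        return sum(len(rows) for rows in self.batches.values())


def demo():
    """Simulate an outage after the data landed but before the acknowledgement."""
    sink = IdempotentSink()
    calls = {"n": 0}

    def flaky_write():
        calls["n"] += 1
        sink.write("batch-001", [1, 2, 3])
        if calls["n"] == 1:
            raise TransientError("acknowledgement lost")
        return "ok"

    result = with_retries(flaky_write)
    return result, calls["n"], sink.total_rows()
```

In `demo`, the first attempt writes the batch and then fails, so the retry writes the same batch again; because the sink is keyed by batch id, the table still holds three rows rather than six.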
The second one is integration and end-to-end testing. A serverless stack allows developers to create isolated sandboxes in which to play, test, and fix issues, thereby shortening the development feedback loop and time.
Another advantage is that the cloud imposes automation of the solution's deployment process. Needless to say, this is a mandatory attribute of any successful team.
One of the major goals that cloud providers claimed to achieve was reducing the effort needed to monitor production environments and keep them alive. They tried to build a kind of ideal abstraction with almost zero DevOps involvement.
The reality is a bit different, though. Despite that idea, maintenance usually still requires some effort. The table below highlights the most prominent kinds.
Beyond that, the bill depends a lot on infrastructure and license costs. The design phase is extremely important, as it gives a chance to challenge particular technologies and estimate their runtime costs in advance.
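A rough estimate of this kind can be as simple as multiplying expected usage by unit prices. The sketch below assumes a simplified pricing model with three cost drivers; the class names, fields and especially the prices are placeholders invented for illustration, not real vendor figures, and should be replaced with numbers from the chosen provider's price list.

```python
from dataclasses import dataclass


@dataclass
class MonthlyUsage:
    stored_gb: float      # average volume kept in the data lake
    scanned_tb: float     # data scanned by queries per month
    compute_hours: float  # processing hours per month


@dataclass
class PriceList:
    # Placeholder figures for illustration only, not real vendor prices.
    per_gb_stored: float = 0.02
    per_tb_scanned: float = 5.0
    per_compute_hour: float = 0.50


def estimate_monthly_cost(usage: MonthlyUsage, prices: PriceList) -> dict:
    """Return a per-driver cost breakdown plus the total, rounded to cents."""
    breakdown = {
        "storage": usage.stored_gb * prices.per_gb_stored,
        "queries": usage.scanned_tb * prices.per_tb_scanned,
        "compute": usage.compute_hours * prices.per_compute_hour,
    }
    breakdown["total"] = sum(breakdown.values())
    return {name: round(cost, 2) for name, cost in breakdown.items()}


if __name__ == "__main__":
    usage = MonthlyUsage(stored_gb=1000, scanned_tb=4, compute_hours=100)
    print(estimate_monthly_cost(usage, PriceList()))
```

Running the same usage profile against several candidate price lists during the design phase gives an early, if rough, comparison of the alternatives.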
Another important aspect of big data technologies that concerns customers is the cost of change. Our experience shows there is no difference between Big Data and any other technology. If the solution is not over-engineered, the cost of change is completely comparable to that of a non-big-data stack. There is one benefit that comes with Big Data, though: it is natural for Big Data solutions to be designed as decoupled. Properly designed solutions do not look like a monolith, so local changes can be applied quickly where they are needed, with less risk of affecting production.
In summary, we do think Big Data can be affordable. It offers developers new design patterns and approaches, which they can leverage to assemble an analytical data platform that meets the strictest business requirements and is cost-effective at the same time.
Big Data driven solutions can be a great foundation for fast-growing startups that want to be flexible, apply changes quickly and have a short TTM. Once the business demands bigger data volumes, Big Data driven solutions can scale along with it.
Big Data technologies allow implementing near-real-time analytics at small or large scale, where classic solutions struggle with performance.
Cloud providers have elevated Big Data to the next level by providing reliable, scalable and ready-to-use capabilities. It has never been easier to develop cost-effective ADPs with quick delivery. Elevate your business with Big Data.
Previously published at https://sigma.software/about/media/can-big-data-be-affordable