Do you know why microservice design is so popular in the development of BI tools? The answer is clear: it helps to build scalable and flexible solutions. But microservice architecture has a serious drawback: its performance usually needs significant improvement.
The FreshCode team also faced this problem, and I’ve decided to show how we coped with it. The article was written together with FreshCode’s CTO and is based on our recent case of developing a reporting microservice. Here you will find its technical scheme, estimates, and a list of tools for on-premise and SaaS products.
If you wonder why the microservice style is so popular, think about recent IT trends. The demand for Agile and DevOps practices led to the popularity of microservices. Today such major players as Uber, Airbnb and Netflix use microservices to solve their business problems.
The best way to explain what microservice design means is to compare it with a common monolithic app. A monolithic system runs all of its logic in a single process, while a microservice system splits that logic across several separate processes. A typical application has three parts:
UI
database
server
In a monolith, any change to the system means deploying a new version of the whole server part. Let’s consider the concept in detail.
Microservice design means building an app as a set of services, but the definition is vague. I can single out 4 features that a microservice usually has:
the decentralized control of languages and data
responsibility for a specific business need
automatic deployment
endpoints
In the picture below, you can see microservice design compared to a monolith app.
One of the main benefits of microservice design is its scalability. You can scale individual services without changing the whole system. So, you save resources and keep the app less complex. One of the most famous cases that proves this is Netflix. The company had to cope with a rapidly growing subscriber base, and microservice design was a great solution for scaling it.
Each microservice needs its own database. Otherwise, you can’t use all the benefits of the modularization pattern. But the variety of databases leads to challenges in the reporting process. We will discuss the problem later.
Microservice design speeds up app development and lets you launch the product earlier. Each part can be rolled out separately, so the deployment of microservices is quicker and easier.
1. The possibility of convenient horizontal system scaling
2. Higher productivity of development team members
3. Simplification of the debugging and maintenance processes
4. The ability to work in smaller teams and use an Agile approach
5. Flexibility in continuous integration and deployment
Despite all these benefits, microservice architecture has its own drawbacks: you have to operate many systems and complete various tasks in a distributed environment. So, the main microservice pitfalls are:
1. The complexity of microservice design makes developers plan and act more carefully.
2. External API communication between services increases the attack surface.
3. Sometimes it’s difficult to switch between services in the development and deployment processes.
We worked on a legacy EdTech project. The system was very complex and included many microservices. Its main parts were:
sophisticated financial and billing system
multi-organisation structure for large group entities
workflow management tool for business processes
integrated bulk email, SMS and live chat
online system for surveys, quizzes and examinations
flexible assessment and learning management system
FreshCode joined the project at the stage of migrating to a new interface. The product was being prepared for a global launch, and the microservice system was supposed to process large amounts of data. As for the target audience, the app was developed for:
large education networks that manage 100s of campuses
governments that have up to 200k schools, colleges and universities
Meanwhile, the EdTech app design was convenient both for large education networks and for a small school of about 100 students.
So, the FreshCode development team faced the problem of managing and improving the performance of a complex microservice architecture. It should be mentioned that the client wanted to build both SaaS and self-hosted systems, so we chose the technical solutions with this fact in mind.
The process of generating reports required engagement with different services, which caused performance issues. That’s why the FreshCode team decided to optimize the app architecture by creating a separate reporting microservice. It received data from all the databases, saved it, and transformed it into custom reports.
In the picture below, you can see the scheme of the reporting microservice system and the technologies for its implementation.
Yellow color marks all microservices in the system. Each of them has its own database. The reporting module tracks all changes in them with the help of a messaging system. Then, it stores the new data in its own report database.
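The flow described above can be sketched in a few lines of Python. This is only an illustration, not our production code: an in-process `queue.Queue` stands in for the messaging system, SQLite stands in for the report database, and the names `publish_change`, `open_report_db`, `drain` and the `report_rows` table are hypothetical.

```python
import json
import queue
import sqlite3

# Stand-in for the messaging system (a real deployment would use a broker).
bus = queue.Queue()


def publish_change(service, table, op, row):
    """Called on the service side whenever one of its rows changes."""
    bus.put({"service": service, "table": table, "op": op, "row": row})


def open_report_db():
    """Create the reporting module's own database (in-memory for this sketch)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE report_rows ("
        " service TEXT, tbl TEXT, row_id INTEGER, payload TEXT,"
        " PRIMARY KEY (service, tbl, row_id))"
    )
    return db


def drain(db):
    """Consume all pending change messages and apply them to the report database."""
    applied = 0
    while not bus.empty():
        msg = bus.get()
        key = (msg["service"], msg["table"], msg["row"]["id"])
        if msg["op"] == "delete":
            db.execute(
                "DELETE FROM report_rows WHERE service=? AND tbl=? AND row_id=?",
                key,
            )
        else:
            # Inserts and updates both become an upsert of the latest row state.
            payload = json.dumps(msg["row"], sort_keys=True)
            db.execute(
                "INSERT OR REPLACE INTO report_rows VALUES (?, ?, ?, ?)",
                key + (payload,),
            )
        applied += 1
    db.commit()
    return applied
```

The services never talk to the report database directly. They only publish changes, so the reporting module can be scaled or replaced without touching them.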
Let’s look at the 6 main parts of the reporting system, the technologies that can be used for each, and the solutions we chose.
Change data capture (CDC) tracks every single change (insert, update, delete) in a database and performs some logic on it. We considered 3 possible tools for this first step of implementing the microservice reporting system.
1. Apache NiFi
It allows creating a simple CDC pipeline without any coding. Apache NiFi has a lot of built-in processors and supports data routing, transformation and system mediation logic.
Pros:
Support of cluster mode and easy scaling
Built-in PutToKafka and PutToKinesis activities
Implementation of custom activities in any JVM language
User-friendly UI
Cons:
No predefined data format for messaging between activities
Supports only JVM languages
The quality of default activities isn’t perfect
No Oracle CDC activity
2. StreamSets Data Collector
Popular open source solution for continuous big data ingestion in a microservice reporting system. Its main advantages are simple creation of data pipelines and support of many widespread technologies.
Pros:
Built-in AWS S3, Kinesis, Kafka, Oracle, Postgres processors
Open source software can be adjusted for your needs
Simple and convenient UI
Support of most of the popular tools
Cons:
It’s a new solution that is still actively developing
It’s a little bit difficult to start working with StreamSets Data Collector
3. Matillion
The innovative ELT architecture has an easy-to-use interface. It is built specifically for Amazon Redshift, Google BigQuery and Snowflake.
Pros:
A proprietary tool
Support of the development team
Well-tested solution
Cons:
Only several databases can be used with this tool
ELT architecture doesn’t suit every project
Oracle was the main database of our microservice reporting system. So, we chose StreamSets Data Collector because it supports Oracle CDC out of the box.
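To make the idea of CDC more concrete, here is a minimal sketch that derives insert, update and delete events by comparing two snapshots of a table. It is a deliberately simplified technique chosen for illustration: production CDC tools usually read the database's change log instead of diffing snapshots, and the function name and event format below are hypothetical.

```python
def diff_snapshots(before, after):
    """Compare two {primary_key: row} snapshots and return change events."""
    events = []
    for key, row in after.items():
        if key not in before:
            events.append({"op": "insert", "key": key, "after": row})
        elif before[key] != row:
            events.append(
                {"op": "update", "key": key, "before": before[key], "after": row}
            )
    for key, row in before.items():
        if key not in after:
            events.append({"op": "delete", "key": key, "before": row})
    return events
```

Each event carries enough information (operation, key, old and new values) for a downstream consumer to replay the change against its own copy of the data.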
A messaging system allows sending messages between computer systems, as well as setting publishing standards for them.
1. Apache Kafka
One of the most famous tools for real-time analytics. Apache Kafka has high throughput and reliability characteristics.
Pros:
High throughput, fault tolerance, durability
Great scalability, high concurrency
Batch mode, native computation over stream
A great choice for on-premise microservice reporting system
Cons:
Requires DevOps knowledge for correct setup
No built-in monitoring tool
2. AWS Kinesis
It simplifies collecting, processing and analyzing streaming data. Amazon Kinesis offers key capabilities for cost-effective processing at any scale.
Pros:
Easy to manage and scale
Great integration with other AWS services
Almost no DevOps effort
Built-in monitoring and alert system
Cons:
Needs some cost optimizations
No way to use for on-premise software
Although Apache Kafka required a bit more effort to deploy and set up, we used it as a cost-efficient on-premise solution.
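A "publishing standard" for change messages can be as simple as a fixed key and value format. The sketch below is an assumption made for illustration, not the format we used: it serializes a change event into key and value bytes and maps the key to a partition so that all changes to the same row stay in order. Kafka also chooses a partition by hashing the message key, and CRC32 is used here only as a simple stand-in for that hash. The function names are hypothetical.

```python
import json
import zlib


def to_message(service, table, op, row_id, payload):
    """Serialize a change event into (key, value) bytes with a fixed envelope."""
    key = f"{service}.{table}.{row_id}".encode("utf-8")
    value = json.dumps(
        {"service": service, "table": table, "op": op, "payload": payload},
        sort_keys=True,
    ).encode("utf-8")
    return key, value


def pick_partition(key, num_partitions):
    """Map a key to a partition; the same key always lands on the same partition."""
    return zlib.crc32(key) % num_partitions
```

Keying messages by row means that an update can never overtake the insert it depends on, because both travel through the same partition.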
A stream processing engine is a high-performance system that analyzes multiple data streams from many sources. It helps to prepare data before ingestion: you can denormalize or join the streams and add any extra info if needed.
1. Spark Streaming
Brings Apache Spark’s language-integrated API for stream processing. So, it allows writing streaming jobs the same way we write batch jobs.
Pros:
Stateful exactly-once semantics out of the box
Fault-tolerance, scalability
In-memory computation
Cons:
Pretty expensive to use
Manual optimization
No built-in state management
2. Apache Flink
It is useful for stateful computations over unbounded and bounded data streams. Apache Flink runs in all common cluster environments and performs computations at in-memory speed.
Pros:
Exactly once state consistency
SQL on Stream & Batch Data
Low latency, scalability, fault-tolerance
Support of very large state
Cons:
Requires high programming skills
Complicated architecture
The Flink community is smaller than Spark’s, but growing
3. Apache Samza
The scalable data processing engine for real-time analytics that can be used in a microservice reporting system.
Pros:
Can maintain a large state
Low latency, high throughput, mature and tested at scale
Fault-tolerant and high performance
Cons:
At-least-once processing guarantee
Lack of advanced streaming features (watermarks, sessions, triggers)
4. AWS Kinesis Services
The set of tools includes Data Firehose, Data Analytics, and Data Streams. As a result, it helps to build powerful stream processing without implementing any custom code.
Pros:
Pay only for what you use
The easiest way to process data streams in real time with SQL
Handle any amount of streaming data
Cons:
No way to use on-premise
Complicated to customize
AWS provides a great set of tools for ETL and data processing. It’s a good starting point. But there is no way to deploy it on custom servers. That’s why it doesn’t fit on-premise solutions.
Apache Flink is the most feature-rich and performant solution. It allows storing a large application state (multi-terabyte). But it requires more developers to be involved, and you have to deploy it yourself.
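The denormalize/join step mentioned above can be illustrated with a tiny stateful generator. It is a sketch under stated assumptions rather than Flink or Spark code: the event shapes (`student` and `enrollment` events with these fields) and the function name `enrich` are hypothetical.

```python
def enrich(events):
    """Join enrollment events with the latest known student record."""
    students = {}  # state: student_id -> latest student event
    for event in events:
        if event["type"] == "student":
            # Remember the newest version of each student.
            students[event["id"]] = event
        elif event["type"] == "enrollment":
            student = students.get(event["student_id"], {})
            # Emit a flat, report-ready record.
            yield {
                "course": event["course"],
                "student_id": event["student_id"],
                "student_name": student.get("name"),
                "campus": student.get("campus"),
            }
```

The `students` dictionary plays the role of the application state. In a real engine this state is persisted and distributed, which is exactly where support for very large state becomes important.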
A data lake is the central repository of integrated data from one or more disparate sources. It stores current and historical data in one single place, so we can use it for analytical reports, machine learning, etc.
1. AWS S3
The object storage service offers industry-leading scalability, data availability, security and performance.
Pros:
Easy to integrate with other AWS services
Designed for 99.999999999% (11 9’s) of data durability
Cost-effective for rarely accessed data
Has an open source implementation with full API support
Cons:
High network pricing
Previously S3 met availability issues, but it’s not a problem for a Data Lake
2. Apache Hadoop
HDFS is the primary data storage system used by Hadoop applications. It allows storing and processing large amounts of data.
Pros:
Efficiently works with huge amounts of data
Integration with many analytical and operational tools (Impala, Hive, HBase, etc)
Cons:
Complicated to deploy and manage
Needs to set up monitoring and high availability
We decided to start with AWS S3. It has an open source implementation, which is why we could integrate it into the on-premise microservice reporting system as well.
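Objects in an S3-style data lake are addressed by key, so a predictable key layout makes the raw data easier to find and process later. The layout below (`raw/<service>/<table>/dt=<day>/...`) and the function name are assumptions made for illustration, not the layout of the project.

```python
from datetime import datetime, timezone


def object_key(service, table, event_time, sequence):
    """Build an object key that groups change events by service, table and UTC day."""
    day = event_time.astimezone(timezone.utc).strftime("%Y-%m-%d")
    # Zero-padded sequence keeps keys sorted in arrival order within a day.
    return f"raw/{service}/{table}/dt={day}/{sequence:012d}.json"
```

Because the key is just a string, the same layout works for AWS S3 and for any S3-compatible self-hosted storage.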
1. AWS Aurora
It is up to 5 times faster than standard MySQL databases and 3 times faster than PostgreSQL databases.
Pros:
Pretty fast SQL database
High Availability and Durability
Fully Managed
Easy to scale
Cons:
Bad performance for analytical reports in case of big data projects
The minimally available instance is too big, but we can easily replace it with plain PostgreSQL
2. AWS Redshift
Redshift is claimed to deliver up to 10 times faster performance than other data warehouses. It uses machine learning, massively parallel query execution and columnar storage on high-performance disks.
Pros:
May run queries on external S3 files
Easy to set up, use and manage
Columnar storage
Cons:
Doesn’t enforce uniqueness
Can’t be used as a live app database
It’s mostly useful for running aggregations on large amounts of data
3. Kinetica
The vectorized, columnar, memory-first database designed for analytical (OLAP) workloads. Kinetica automatically distributes any workload across CPUs and GPUs for optimal results.
Pros:
Pretty fast aggregation performance, run on GPU and CPU
Supports materialized join views, and can update them incrementally
Cons:
GPU instances still cost a lot
No way to join data between different partitions
4. Apache Druid
It generally works well with any event-oriented, clickstream, time series, or telemetry data, especially streaming datasets from Apache Kafka. Druid provides exactly once consumption semantics from Apache Kafka and is commonly used as a sink for event-oriented Kafka topics.
Pros:
Druid can be deployed in any *NIX environment on commodity hardware
Best for interactive dashboards with full drill-down capabilities
Stores only pre-aggregated data
Cons:
Isn’t perfect for custom reports that may be built by users
Works only on time series data
No full join support
All of these databases are amazing. But our client’s goal was to create reports based on all data from all microservices. So, the development team considered AWS Aurora as the best choice for this task. It simplified the workflow a lot.
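Once data from all microservices lives in one relational database, a cross-service report becomes a single SQL query. The sketch below uses an in-memory SQLite database as a stand-in for Aurora or PostgreSQL. The `students` and `invoices` tables, their columns and the sample rows are invented for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    -- Replicated from a hypothetical student service
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, campus TEXT);
    -- Replicated from a hypothetical billing service
    CREATE TABLE invoices (id INTEGER PRIMARY KEY, student_id INTEGER, total REAL);
    INSERT INTO students VALUES (1, 'Ann', 'North'), (2, 'Bob', 'North'), (3, 'Cy', 'South');
    INSERT INTO invoices VALUES (10, 1, 100.0), (11, 2, 50.0), (12, 3, 70.0), (13, 1, 30.0);
    """
)


def revenue_by_campus(conn):
    """Join data that originated in two different services in one query."""
    return conn.execute(
        "SELECT s.campus, SUM(i.total) "
        "FROM invoices i JOIN students s ON s.id = i.student_id "
        "GROUP BY s.campus ORDER BY s.campus"
    ).fetchall()
```

Without the shared reporting database, the same report would need calls to both services and a join in application code.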
The report microservice was responsible for storing information about data objects and the relations between them. It also handled security and generated the reports themselves, since these reports were based on the chosen data objects.
We prepared 2 variants of the technology stack for the microservice reporting system. For the SaaS product on AWS, we used:
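Here is a rough sketch of how a report could be generated from stored data objects and relations, with a simple permission check on the requested columns. Our service was written in Node.js, and this Python version only illustrates the idea: the `Relation` class, `build_report_sql` and the column whitelist are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """Describes how two data objects (tables) are linked."""
    left: str
    right: str
    left_column: str
    right_column: str


def build_report_sql(base, columns, relations, allowed_columns):
    """Build a SELECT for the chosen columns, joining along the given relations."""
    denied = [c for c in columns if c not in allowed_columns]
    if denied:
        raise PermissionError(f"columns not allowed: {denied}")
    # Identifiers must come from trusted metadata, never from raw user input.
    joins = [
        f"JOIN {r.right} ON {r.left}.{r.left_column} = {r.right}.{r.right_column}"
        for r in relations
    ]
    return " ".join([f"SELECT {', '.join(columns)}", f"FROM {base}", *joins])
```

A user picks data objects and columns in the UI, and the service turns that selection into a query against the reporting database.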
StreamSets for CDC
Apache Kafka as a messaging system
AWS S3 Data Lake
AWS Aurora as a reporting database
AWS ElastiCache as an in-memory data store
The reporting microservice was written in Node.js. You can see rough estimates for the SaaS solution in the table below.
Note: These are calculations for production deployment. The development process required much smaller infrastructure.
This infrastructure was the most appropriate for the client’s requirements. Its main advantage was that AWS services could easily be replaced with self-hosted solutions. It allowed us to avoid code/logic duplication for different deployment schemes.
For the on-premise version, we used Minio, PostgreSQL and Redis in place of S3, Aurora and ElastiCache respectively. Their APIs were fully compatible, so we didn’t have any significant problems in the microservice reporting system at all.
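Because the APIs are compatible, switching between the two stacks can come down to configuration. The sketch below shows one possible shape of such a config. The `Profile` class, the profile names and all hostnames are placeholders invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Profile:
    """Connection settings for one deployment scheme."""
    object_store_endpoint: Optional[str]  # None means the default AWS S3 endpoint
    database_url: str
    cache_url: str


# Hypothetical hostnames; only the endpoints differ between the two stacks.
PROFILES = {
    "saas": Profile(
        object_store_endpoint=None,
        database_url="postgresql://reports@aurora.example.internal:5432/reports",
        cache_url="redis://elasticache.example.internal:6379/0",
    ),
    "on_premise": Profile(
        object_store_endpoint="http://minio.example.internal:9000",
        database_url="postgresql://reports@postgres.example.internal:5432/reports",
        cache_url="redis://redis.example.internal:6379/0",
    ),
}


def load_profile(name):
    """Return the settings for the given deployment scheme."""
    try:
        return PROFILES[name]
    except KeyError:
        raise ValueError(f"unknown deployment profile: {name}") from None
```

The application code stays the same in both cases and only reads a different profile at startup.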
Our team solved the client’s technical challenges. The reporting microservice module was effective and convenient. It was capable of:
Generating clear and convenient reports
Providing many standard reporting templates
Adding a large number of filters
Customizing report interface
With the new microservice reporting system, FreshCode’s client achieved these goals:
to update the app’s architecture and design
to improve the product by adding new features
to optimize performance, increase flexibility and scalability
If you are interested in solving the same problem or have any other technical challenges, contact our team. We provide free expert advice for startups, small businesses and enterprises. Check the FreshCode portfolio to find other interesting projects.
Would you like to read more case-based articles? Let me know in the comments below and stay in touch!
***
The original article was published on FreshCode blog