Ingestion and Processing of Data For Big Data and IoT Solutions

Written by hackernoon-archives | Published 2017/08/08
Tech Story Tags: data-processing | iot | big-data | big-data-analytics


Introduction

In the era of the Internet of Things and mobility, with a huge volume of data becoming available at high velocity, there is a pressing need for an efficient analytics system.

Data also arrives in a variety of formats from a variety of sources, such as sensors, logs, and structured data from an RDBMS. In the past few years, the generation of new data has increased drastically: more applications are being built, and they are generating more data at a faster rate.

Earlier, data storage was costly and there was no technology that could process the data efficiently. Now storage has become cheaper, and the technology to process Big Data is readily available.

What is Big Data

According to Dr. Kirk Borne, Principal Data Scientist, Big Data is defined as everything, quantified and tracked. Let’s pick that apart -

  • Everything — Every aspect of life, work, consumerism, entertainment, and play is now recognized as a source of digital information about you, your world, and anything else we may encounter.
  • Quantified — We are storing all of that “everything” somewhere, mostly in digital form, often as numbers, but not always in such formats. The quantification of features, characteristics, patterns, and trends in all things is enabling data mining, machine learning, statistics, and discovery at an unprecedented scale on an unprecedented number of things. The Internet of Things is just one example, but the Internet of Everything is even more awesome.
  • Tracked — We don’t simply quantify and measure everything just once; we do so continuously. This includes tracking your sentiment, your web clicks, your purchase logs, your geolocation, your social media history, and so on, as well as tracking every car on the road, every motor in a manufacturing plant, and every moving part on an airplane. Consequently, we are seeing the emergence of smart cities, smart highways, personalized medicine, personalized education, precision farming, and much more.

All of these quantified and tracked data streams will enable

  • Smarter Decisions
  • Better Products
  • Deeper Insights
  • Greater Knowledge
  • Optimal Solutions
  • Customer-Centric Products
  • Increased Customer Loyalty
  • More Automated Processes, more accurate Predictive and Prescriptive Analytics
  • Better models of future behaviors and outcomes in Business, Government, Security, Science, Healthcare, Education, and more.

Big Data defines three D2D’s

  • Data-to-Decisions
  • Data-to-Discovery
  • Data-to-Dollars

The 10 V’s of Big Data

Big Data Framework

The best way to approach a solution is to “split the problem”. A Big Data solution can be understood well using a layered architecture, in which each layer performs a particular function.

This architecture helps in designing a data pipeline around the requirements of either a batch processing system or a stream processing system. It consists of six layers that together ensure a secure flow of data.

  1. Data Ingestion Layer — This is the first step for data coming from variable sources to start its journey. Data is prioritized and categorized here, which makes it flow smoothly through the later layers.
  2. Data Collector Layer — In this layer, the focus is on transporting data from the ingestion layer to the rest of the data pipeline. This is the layer where components are decoupled so that analytic capabilities can begin.
  3. Data Processing Layer — The main focus of this layer is the data pipeline’s processing system: the data collected in the previous layers is processed here. We route the data to different destinations and classify the data flows, and it is the first point where analytics may take place.
  4. Data Storage Layer — Storage becomes a challenge when the size of the data you are dealing with becomes large. There are several possible solutions that can rescue you from such problems. This layer focuses on where to store such large volumes of data efficiently.
  5. Data Query Layer — This is the layer where strong analytic processing takes place. The main focus here is to gather the data’s value so that it becomes more useful for the next layer.
  6. Data Visualization Layer — The visualization, or presentation, tier is probably the most important tier: it is where data pipeline users can feel the VALUE of the data. We need something that grabs people’s attention, pulls them in, and makes the findings well understood.

1. Data Ingestion Layer

Data ingestion is the first step in building a data pipeline and also the toughest task in a Big Data system. In this layer we plan how to ingest data flows from hundreds or thousands of sources into the data center, since the data comes from multiple sources at variable speeds and in different formats.

That’s why we should ingest the data properly to support successful business decision making. It is rightly said that “if the start goes well, half of the work is already done.”

1.1 What is Big Data Ingestion?

Big Data ingestion involves connecting to various data sources, extracting the data, and detecting changed data. It’s about moving data, and especially unstructured data, from where it originates into a system where it can be stored and analyzed.

We can also say that data ingestion means taking data coming from multiple sources and putting it somewhere it can be accessed. It is the beginning of the data pipeline, where data is obtained or imported for immediate use.

Data can be streamed in real time or ingested in batches. When data is ingested in real time, each item is ingested as soon as it arrives. When data is ingested in batches, data items are ingested in chunks at periodic intervals of time. Ingestion is the process of bringing data into the data processing system; a minimal sketch of both modes follows.
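
As a rough illustration of the two modes, the Python sketch below shows a periodic batch loop next to an event-at-a-time loop. The functions fetch_new_records, event_stream, and store are hypothetical placeholders, not part of any specific tool.

    import time

    def ingest_in_batches(fetch_new_records, store, interval_seconds=300):
        """Pull a chunk of records at a fixed interval and hand the chunk to the store."""
        while True:
            batch = fetch_new_records()        # e.g. rows added since the previous run
            if batch:
                store(batch)                   # write the whole chunk in one go
            time.sleep(interval_seconds)       # wait for the next periodic run

    def ingest_in_real_time(event_stream, store):
        """Ingest each item as soon as it arrives."""
        for event in event_stream:             # an iterator that blocks until data arrives
            store([event])                     # forward immediately, one item at a time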

An effective data ingestion process begins by prioritizing data sources, validating individual files, and routing data items to the correct destination.

1.2 Challenges Faced with Data Ingestion

As the number of IoT devices increases, both the volume and the variety of data sources are expanding rapidly. Extracting the data so that it can be used by the destination system is therefore a significant challenge in terms of time and resources. Some of the other challenges faced during data ingestion are -

  • When numerous Big Data sources exist in different formats, the biggest challenge for the business is to ingest data at a reasonable speed and then process it efficiently so that it can be prioritized and can improve business decisions.
  • Modern data sources and consuming applications evolve rapidly.
  • The data produced changes without notice, independently of the consuming application.
  • Data semantics change over time as the same data powers new use cases.
  • Detection and capture of changed data — This task is difficult, not only because of the semi-structured or unstructured nature of the data but also because of the low latency needed by certain business scenarios that require this determination.

That’s why the ingestion layer should be well designed, ensuring the following -

  • It can handle and upgrade to new data sources, technologies, and applications.
  • It assures that consuming applications work with correct, consistent, and trustworthy data.
  • It allows rapid consumption of data.
  • Capacity and reliability — The system needs to scale with the incoming load, and it should also be fault tolerant.
  • Data volume — Although storing all incoming data is preferable, in some cases only aggregated data can be stored.

1.3 Data Ingestion Parameters

  • Data Velocity — Data Velocity deals with the speed at which data flows in from different sources like machines, networks, human interaction, media sites, social media. The flow of data can be massive or continuous.
  • Data Size — Data size refers to the enormous volume of data. Data is generated by many different sources, and the volume may grow over time.
  • Data Frequency (Batch, Real-Time) — Data can be processed in real time or in batches. In real-time processing, data is processed as soon as it is received; in batch processing, data is collected in batches at fixed time intervals and then moved on for further processing.
  • Data Format (Structured, Semi-Structured, Unstructured) — Data can come in different formats: structured (i.e. tabular), unstructured (i.e. images, audio, video), or semi-structured (i.e. JSON files, CSV files, etc.).

1.4 Big Data Ingestion Key Principles

In order to complete the process of data ingestion, we should use the right tools, and those tools should be capable of supporting the key principles written below -

  • Network Bandwidth — The data pipeline must be able to compete with business traffic. Traffic sometimes increases and sometimes decreases, so network bandwidth scalability is the biggest data pipeline challenge. Tools with bandwidth throttling and compression capabilities are required.
  • Unreliable Network — The data ingestion pipeline takes in data with multiple structures (images, audio, video, text files, tabular files, XML files, log files, etc.), and because the data arrives at variable speeds, it may travel over an unreliable network. The data pipeline should be capable of handling this as well.
  • Heterogeneous Technologies and Systems — Tools for the data ingestion pipeline must be able to work with different data source technologies and different operating systems.
  • Choose the Right Data Format — Tools must provide a data serialization format; since data arrives in variable formats, converting it into a single format gives an easier view to understand and relate the data.
  • Streaming Data — Whether to process the data in batches, streams, or real time depends on business necessity; sometimes both are required, so the tools must be capable of supporting both.

1.5 Data Serialization

Different types of users have different data-consumption needs. Since we want to share highly variable data, we must plan how users can access it in a meaningful way. A single, consistent representation of that variable data also optimizes it for human readability.

Approaches used for this are -

  • Apache Thrift — An RPC framework that includes data serialization libraries.
  • Google Protocol Buffers — Generated source code makes it easy to write and read structured data to and from a variety of data streams, using a variety of languages.
  • Apache Avro — A more recent data serialization format that combines some of the best features of the formats listed above. Avro data is self-describing: a JSON schema description is included with the data itself, and the format natively supports compression. It may well become a de facto standard for data serialization (see the sketch below).
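
A minimal sketch of Avro serialization in Python, using the fastavro package (the package choice, schema, and field names are illustrative assumptions, not prescribed here). Note how the schema is written into the file alongside the records and how compression is enabled with a codec.

    from fastavro import writer, reader, parse_schema

    schema = parse_schema({
        "name": "SensorReading",
        "type": "record",
        "fields": [
            {"name": "device_id", "type": "string"},
            {"name": "temperature", "type": "float"},
            {"name": "timestamp", "type": "long"},
        ],
    })

    records = [
        {"device_id": "sensor-1", "temperature": 21.5, "timestamp": 1502150400},
        {"device_id": "sensor-2", "temperature": 19.8, "timestamp": 1502150460},
    ]

    with open("readings.avro", "wb") as out:
        writer(out, schema, records, codec="deflate")   # schema travels with the compressed data

    with open("readings.avro", "rb") as inp:
        for record in reader(inp):                      # the reader recovers the schema from the file
            print(record)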

1.6 Data Ingestion Tools

1.6.1 Apache Flume — Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms.

It uses a simple, extensible data model that allows for online analytic applications. Its functions are -

  • Stream Data — Ingest streaming data from multiple sources into Hadoop for storage and analysis.
  • Insulate System — Buffer storage platform from transient spikes, when the rate of incoming data exceeds the rate at which data can be written to the destination
  • Scale Horizontally — To ingest new data streams and additional volume as needed.

1.6.2 Apache NiFi — Apache NiFi provides an easy-to-use, powerful, and reliable system to process and distribute data. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. Its functions are -

  • Track data flow from beginning to end
  • Seamless experience between design, control, feedback, and monitoring
  • Secure because of SSL, SSH, HTTPS, encrypted content.

1.6.3 Elastic Logstash — Logstash is an open source, server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to your “stash”, i.e. Elasticsearch.

It easily ingests data from your logs, metrics, web applications, data stores, and various AWS services, in a continuous, streaming fashion. It can ingest data of all shapes, sizes, and sources.

2. Data Collector Layer

In this layer, the focus is on transporting data from the ingestion layer to the rest of the data pipeline. Here we use a messaging system that acts as a mediator between all the programs that can send and receive messages.

The tool used here is Apache Kafka, a new approach to message-oriented middleware.

2.1 Apache Kafka

It is used for building real-time data pipelines and streaming apps. It can process streams of data in real-time and store streams of data safely in a distributed replicated cluster.

Kafka works in combination with Apache Storm, Apache HBase and Apache Spark for real-time analysis and rendering of streaming data.

2.2 What is Data Pipeline?

  • A data pipeline is the main component of data integration; all transformation of the data happens in the data pipeline.
  • It streams and transforms real-time data and delivers it to the services that need it.
  • A data pipeline automates the movement and transformation of data; it is a data processing engine that runs inside your application (specific implementations range from Python-based tools to engines built on the Java Virtual Machine).
  • It is used to transform all the incoming data into a common format so that we can prepare it for analysis and visualization.
  • So, a data pipeline is a series of steps that your data moves through: the output of one step becomes the input of the next. Data, typically raw data, goes in on one side and passes through a series of steps (a minimal sketch follows this list).
  • The steps of a data pipeline can include cleaning, transforming, merging, modeling, and more, in any combination.
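
To make the “series of steps” idea concrete, here is a toy Python sketch in which each step consumes the previous step’s output. The step functions are hypothetical stand-ins for real cleaning and transformation logic.

    def clean(records):
        # Drop records that are missing a value.
        return [r for r in records if r.get("value") is not None]

    def transform(records):
        # Normalise the value field to a float.
        return [{**r, "value": float(r["value"])} for r in records]

    def run_pipeline(records, steps):
        for step in steps:              # the output of one step becomes the input of the next
            records = step(records)
        return records

    raw = [{"value": "42"}, {"value": None}, {"value": "7.5"}]
    print(run_pipeline(raw, [clean, transform]))
    # [{'value': 42.0}, {'value': 7.5}]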

2.2.1 Functions of Data Pipeline

  • Ingestion — A data pipeline helps bring data into your system; it takes unstructured data from where it originates into a system where it can be stored and analyzed for making business decisions.
  • Data Integration — A data pipeline also helps in bringing different types of data together.
  • Organization — Organizing data means arranging it, and this arrangement is also done in the data pipeline.
  • Refining the data — The process of enhancing, cleaning, and refining the raw data.
  • Analytics — After refining the useful data, the data pipeline provides processed data on which we can run analytical operations and make business decisions accurately.

2.2.2 Need Of Data Pipeline

A Data Pipeline is software that takes data from multiple sources and makes it available to be used strategically for making business decisions.

The primary reason a data pipeline is needed is that it is very hard to monitor data migration and manage data errors by hand. Other reasons are listed below -

  • Business-critical analysis — Certain business-critical analyses are only possible when combining data from multiple sources. For making business decisions we should have a single image of all the incoming data.
  • Connections — Data keeps increasing all the time: new data arrives and old data is modified, and each new integration can take anywhere from a few days to a few months to complete.
  • Accuracy — The only way to build trust with data consumers is to make sure that your data is auditable. One best practice that’s easy to implement is to never discard inputs or intermediate forms when altering data.
  • Latency — The fresher your data, the more agile your company’s decision-making can be. Extracting data from APIs and databases in real-time can be difficult, and many target data sources, including large object stores like Amazon S3 and analytics databases like Amazon Redshift, are optimized for receiving data in chunks rather than a stream.
  • Scalability — The amount of data can increase or decrease over time; we cannot assume that less data will arrive on Monday and more on the rest of the days. Usage of data is not uniform, so we should make the pipeline scalable enough to handle any amount of data arriving at variable speed.

2.2.3 Use cases for Data Pipeline

Data Pipeline is useful to a number of roles, including CTOs, CIOs, Data Scientists, Data Engineers, BI Analysts, SQL Analysts, and anyone else who derives value from a unified real-time stream of user, web, and mobile engagement data. So, use cases for data pipeline are given below -

  • For Business Intelligence Teams
  • For SQL Experts
  • For Data Scientists
  • For Data Engineers
  • For Product Teams

2.3 Apache Kafka is Good for 2 Things

  • Building Real-Time streaming Data Pipelines that reliably get data between systems or applications
  • Building Real-Time streaming applications that transform or react to the streams of data.

2.3.1 Common use cases of Apache Kafka -

  • Stream Processing
  • Website Activity Tracking
  • Metrics Collection and Monitoring
  • Log Aggregation

2.3.2 Features of Apache Kafka

  • One of the features of Kafka is durable Messaging.
  • Apache Kafka relies heavily on the filesystem for storing and caching messages: rather than maintain as much as possible in memory and flush it all out to the filesystem, all data is immediately written to a persistent log on the filesystem without necessarily flushing to disk.
  • Apache Kafka solves the situation where the producer is generating messages faster than the consumer can consume them in a reliable way.

2.3.3 How Apache Kafka Works

The Kafka system design acts as a distributed commit log, where incoming data is written sequentially to disk. Four main components are involved in moving data into and out of Apache Kafka (a minimal producer/consumer sketch follows this list) -

  • Topics — Topic is a user-defined category to which messages are published.
  • Producers — Producers publish messages to one or more topics
  • Consumers — Consumers subscribe to topics and process the published messages.
  • Brokers — Brokers that manage the persistence and replication of message data.
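
A minimal producer/consumer sketch using the kafka-python client (the client library, broker address, topic, and group name are illustrative assumptions). The producer publishes a message to a topic; the consumer subscribes to the same topic and processes whatever the broker delivers.

    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("sensor-events", b'{"device_id": "sensor-1", "temperature": 21.5}')
    producer.flush()                       # make sure the message reaches the broker

    consumer = KafkaConsumer(
        "sensor-events",
        bootstrap_servers="localhost:9092",
        group_id="analytics",              # consumers in a group share the topic's partitions
        auto_offset_reset="earliest",      # start from the beginning if no offset is stored
    )
    for message in consumer:
        print(message.topic, message.offset, message.value)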

3. Data Processing Layer

In the previous layer, we gathered the data from different sources and made it available to the rest of the pipeline.

In this layer, our task is to do the magic with the data: now that the data is ready, we only have to route it to different destinations.

The main focus of this layer is the data pipeline’s specialized processing system; in other words, the data collected by the previous layer has to be processed in this layer.

Processing can be done in three ways, described below.

3.1 Batch Processing System

A pure batch processing system is used for offline analytics. The tool used for this is Apache Sqoop.

3.2 Apache Sqoop

It efficiently transfers bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop can also be used to extract data from Hadoop and export it into external structured datastores.

Apache Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB.
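
As a sketch of what a typical import looks like, the command below pulls a relational table into HDFS with four parallel map tasks; the JDBC URL, credentials, table, and paths are illustrative placeholders, and the call is wrapped in Python only to stay consistent with the other examples.

    import subprocess

    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:mysql://db.example.com/sales",   # source RDBMS
        "--username", "etl_user",
        "--password-file", "/user/etl/.db_password",        # keep credentials off the command line
        "--table", "orders",
        "--target-dir", "/data/raw/orders",                 # destination directory in HDFS
        "--num-mappers", "4",                               # parallel transfer with 4 map tasks
    ], check=True)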

3.2.1 Functions of Apache Sqoop are -

  • Import sequential data sets from mainframe
  • Data imports
  • Parallel data Transfer
  • Fast data copies
  • Efficient data analysis
  • Load balancing

3.3 Near Real Time Processing System

A pure online processing system is used for online analytics. The tool used for this type of processing is Apache Storm. The Apache Storm cluster makes decisions about the criticality of an event and sends alerts to the alert system (dashboard, e-mail, or other monitoring systems).

3.3.1 Apache Storm — It is a system for processing streaming data in real time. It adds reliable real-time data processing capabilities to Enterprise Hadoop. Storm on YARN is powerful for scenarios requiring real-time analytics, machine learning and continuous monitoring of operations.

3.3.2 Features of Apache Storm

  • Fast — It can process one million 100 byte messages per second per node.
  • Scalable — It can do parallel calculations that run across a cluster of machines.
  • Fault-tolerant — When workers die, Storm will automatically restart them. If a node dies, the worker will be restarted on another node.
  • Reliable — Storm guarantees that each unit of data (tuple) will be processed at least once or exactly once. Messages are only replayed when there are failures.
  • Easy to operate — It consists of standard configurations that are suitable for production on day one. Once deployed, Storm is easy to operate.

A hybrid processing system combines batch and real-time processing capabilities. The tools used for this type of processing are Apache Spark and Apache Flink, described next.

3.4 Apache Spark

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets.

With Spark running on Apache Hadoop YARN, developers everywhere can now create applications to exploit Spark’s power, derive insights, and enrich their data science workloads within a single, shared data set in Hadoop.
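
A minimal PySpark sketch of the kind of workload described above: read JSON records from HDFS and compute an aggregate per device. The paths and column names are illustrative assumptions.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("sensor-aggregation").getOrCreate()

    readings = spark.read.json("hdfs:///data/raw/readings")      # illustrative input path
    avg_per_device = (readings
                      .groupBy("device_id")
                      .agg(F.avg("temperature").alias("avg_temperature")))
    avg_per_device.show()

    spark.stop()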

3.5 Apache Flink

Flink is an open-source framework for distributed stream processing that provides accurate results, even in the case of out-of-order or late-arriving data. Some of its features are listed below, followed by a minimal sketch -

  • It is stateful and fault-tolerant and can seamlessly recover from failures while maintaining exactly-once application state.
  • Performs at large scale, running on thousands of nodes with very good throughput and latency characteristics.
  • It provides a streaming dataflow execution engine, plus APIs and domain-specific libraries for batch processing, stream processing, machine learning, and graph processing.
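
A minimal PyFlink DataStream sketch (the Python API, the in-memory source, and the job name are illustrative assumptions; a production job would read from a real source such as Kafka). It builds a tiny stream, applies a transformation, and prints the result.

    from pyflink.datastream import StreamExecutionEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()

    readings = env.from_collection([("sensor-1", 21.5), ("sensor-2", 19.8)])
    readings.map(lambda r: (r[0], r[1] * 1.8 + 32)).print()   # convert Celsius to Fahrenheit

    env.execute("celsius-to-fahrenheit")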

3.5.1 Apache Flink Use Cases

  • Optimization of e-commerce search results in real-time
  • Stream processing-as-a-service for data science teams
  • Network/Sensor monitoring and error detection
  • ETL for Business Intelligence Infrastructure

4. Data Storage Layer

Next, the major issue is keeping the data in the right place, based on usage. We have relational databases, which have been a successful home for our data over the years.

But with new, strategic Big Data enterprise applications, you should no longer assume that your persistence layer has to be relational.

We need different databases to handle the different varieties of data, but using different databases creates overhead. That is why a new concept was introduced to the database world: polyglot persistence.

4.1 Polyglot Persistence

Polyglot persistence is the idea of using multiple databases to power a single application. Polyglot persistence is the way to share or divide your data into multiple databases and leverage their power together.

It takes advantage of the strengths of different databases: different types of data are arranged in different ways. In short, it means picking the right tool for the right use case.

It’s the same idea behind Polyglot Programming, which is the idea that applications should be written in a mix of languages to take advantage of the fact that different languages are suitable for tackling different problems.

4.1.1 Advantages of Polyglot Persistence -

  • Faster response times — You leverage the strengths of each database in one app, which makes your app’s response times very fast.
  • Helps your app scale well — Your app scales exceptionally well with the data; all the NoSQL databases scale well when you model them properly for the data that you want to store.
  • A rich experience — You get a very rich experience when you harness the power of multiple databases at the same time. For example, to search products in an e-commerce app you could use Elasticsearch, which returns results ranked by relevance, something MongoDB is not designed for (see the sketch after this list).
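
A sketch of that e-commerce example: the product document is written to MongoDB as the system of record, and a search copy is indexed in Elasticsearch for relevance-ranked queries. The hosts, database, index, and field names are illustrative, and pymongo and the official Elasticsearch Python client are assumptions.

    from pymongo import MongoClient
    from elasticsearch import Elasticsearch

    mongo = MongoClient("mongodb://localhost:27017")
    es = Elasticsearch("http://localhost:9200")

    product = {"sku": "SKU-1001", "name": "Wireless sensor kit", "price": 79.99}

    mongo.shop.products.insert_one(product)          # durable system of record
    es.index(index="products", id=product["sku"],    # relevance-ranked search copy
             document={"name": product["name"], "price": product["price"]})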

4.2 Tools used for Data Storage

4.2.1 HDFS

  • HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers.
  • HDFS holds a very large amount of data and provides easier access.
  • To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to rescue the system from possible data losses in case of failure.
  • HDFS also enables parallel processing by applications. It is built to support applications with large data sets, including individual files that reach into the terabytes.
  • It uses a master/slave architecture, with each cluster consisting of a single NameNode that manages file system operations and supporting DataNodes that manage data storage on individual compute nodes.
  • When HDFS takes in data, it breaks the information down into separate pieces and distributes them to different nodes in a cluster, allowing for parallel processing.
  • The file system also copies each piece of data multiple times and distributes the copies to individual nodes, placing at least one copy on a different server rack
  • HDFS and YARN form the data management layer of Apache Hadoop.

4.2.1.1 Features of HDFS

  • It is suitable for distributed storage and processing.
  • Hadoop provides a command-line interface to interact with HDFS (see the sketch after this list).
  • The built-in servers of the NameNode and DataNode help users easily check the status of the cluster.
  • Streaming access to file system data.
  • HDFS provides file permissions and authentication.
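
A sketch of basic interaction with that command-line interface, driven from Python to stay consistent with the other examples; the local file and HDFS paths are illustrative.

    import subprocess

    # Create a directory in HDFS and copy a local file into it.
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/data/raw/logs"], check=True)
    subprocess.run(["hdfs", "dfs", "-put", "app.log", "/data/raw/logs/"], check=True)

    # List the directory; the listing includes each file's replication factor.
    subprocess.run(["hdfs", "dfs", "-ls", "/data/raw/logs"], check=True)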

4.2.2 Gluster file systems (GFS)

A good storage solution must provide elasticity in both storage and performance without affecting active operations.

Scale-out storage systems based on GlusterFS are suitable for unstructured data such as documents, images, audio and video files, and log files. GlusterFS is a scalable network filesystem.

Using this, we can create large, distributed storage solutions for media streaming, data analysis, and other data- and bandwidth-intensive tasks.

  • It’s Open Source.
  • You can deploy GlusterFS with the help of commodity hardware servers.
  • Linear scaling of performance and storage capacity.
  • Scale storage size up to several petabytes, which can be accessed by thousands of servers.

4.2.2.1 Use Cases For GlusterFS include

  • Cloud Computing
  • Streaming Media
  • Content Delivery

4.2.3 Amazon S3

  • Amazon Simple Storage Service (Amazon S3) is object storage with a simple web service interface to store and retrieve any amount of data from anywhere on the web.
  • It is designed to deliver 99.999999999% durability, and scale past trillions of objects worldwide.
  • Customers use S3 as primary storage for cloud-native applications; as a bulk repository, or “data lake,” for analytics; as a target for backup & recovery and disaster recovery; and with serverless computing.
  • It’s simple to move large volumes of data into or out of Amazon S3 with Amazon’s cloud data migration options (see the sketch after this list).
  • Once data is stored in Amazon S3, it can be automatically tiered into lower-cost, longer-term cloud storage classes such as S3 Standard - Infrequent Access and Amazon Glacier for archiving.
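
A minimal Amazon S3 sketch using boto3 (the SDK choice, bucket, and key names are illustrative assumptions; credentials are expected to come from the standard AWS configuration chain).

    import boto3

    s3 = boto3.client("s3")

    # Upload a local file into the bucket that backs the data lake.
    s3.upload_file("readings.avro", "my-data-lake", "raw/2017/08/readings.avro")

    # List what has landed under that prefix.
    response = s3.list_objects_v2(Bucket="my-data-lake", Prefix="raw/2017/08/")
    for obj in response.get("Contents", []):
        print(obj["Key"], obj["Size"])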

5. Data Query Layer

This is the layer where strong analytic processing takes place. It is a field where interactive queries are necessary, and it is a zone traditionally dominated by SQL expert developers. Before Hadoop, we had very limited storage, which made the analytics process take a long time.

Continue Reading The Full Article At —XenonStack.com/Blog

