Is Your Apache NiFi Ready for Production?

by Temirlan Amanbayev (@temirlan100), August 23rd, 2024

Too Long; Didn't Read

This guide walks through preparing an Apache NiFi cluster for production, assuming a workload of up to 50 GB of data per day and aiming for a balance between performance and cost-effectiveness. The main configuration files that need to be set up for the cluster to run are nifi.properties, zookeeper.properties, authorizers.xml, and login-identity-providers.xml.

Before you declare a production cluster ready, there are a number of requirements worth considering, so that you don't panic later when something goes wrong! =)


Let's start with the first point: cluster infrastructure and architecture. Since this is a first launch and we expect to process up to 50 GB of data per day, we can size the Apache NiFi cluster to balance performance and cost-effectiveness.
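
To put the 50 GB per day figure in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes decimal units (1 GB = 1000 MB), and the peak factor of 10 is a made-up multiplier for illustration, since real traffic is rarely spread evenly over the day.

```python
# Rough sizing sketch: convert a daily volume into sustained throughput.
SECONDS_PER_DAY = 24 * 60 * 60


def average_mb_per_sec(gb_per_day: float) -> float:
    """Average throughput in MB/s, assuming decimal units (1 GB = 1000 MB)."""
    return gb_per_day * 1000 / SECONDS_PER_DAY


def peak_mb_per_sec(gb_per_day: float, peak_factor: float = 10.0) -> float:
    """Throughput during bursts; peak_factor is a hypothetical multiplier."""
    return average_mb_per_sec(gb_per_day) * peak_factor


print(f"average: {average_mb_per_sec(50):.2f} MB/s")
print(f"peak:    {peak_mb_per_sec(50):.2f} MB/s")
```

Even the assumed peak is far below what a gigabit link can carry (roughly 125 MB/s), which suggests that at this volume the node count is driven more by fault tolerance than by raw throughput.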

Number of Nodes

For our data volume at launch, two or three nodes in an Apache NiFi cluster will be sufficient, preferably three. This allows for load balancing and fault tolerance, and avoids bottlenecks as the data volume grows. A three-node configuration is preferred to ensure stability and minimize downtime.


  • Two Nodes: Suitable for start-ups with small data volumes and limited budgets.
  • Three Nodes: The optimal solution for fault tolerance and performance improvement.

Division of Roles in the Cluster

NiFi nodes can perform both work tasks and coordination tasks. ZooKeeper is used to coordinate the work of the nodes.


  • Working Nodes: Basic data processing tasks including reading, transforming, and writing data.


  • Coordination nodes (ZooKeeper): Manage coordination between NiFi nodes. It is generally recommended to run a separate ZooKeeper cluster to improve reliability. For smaller clusters, you can use the embedded ZooKeeper that ships with NiFi, but when scaling up, it is better to move ZooKeeper to dedicated servers.
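
ZooKeeper stays available only while a majority of its servers are up, which is one reason three servers are usually preferred over two. The small Python sketch below shows the arithmetic; the function names are my own.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest majority of an ensemble with the given number of servers."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can be lost while a majority is still available."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (1, 2, 3, 5):
    print(f"{n} servers: quorum {quorum_size(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Note that a two-server ensemble tolerates no failures at all, the same as a single server, while three servers survive the loss of one.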

Disk and I/O

Disk system performance is key, especially for storing content and data streams, as NiFi makes heavy use of the disk subsystem for temporary file storage.


  • SSD disks: Recommended for the directories that hold the content repository and the FlowFile repository. This will provide high read/write performance.


  • Separate disks for directories: Configure separate disks or partitions for the different types of data (content, FlowFiles, provenance, databases) to reduce contention for I/O.
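
One way to check that the repositories really ended up on separate filesystems is to compare device IDs. The following is a minimal Python sketch; the function names are hypothetical and the paths you pass in are whatever you configured. Keep in mind that two partitions of the same physical disk have different device IDs, so this only detects directories that share a filesystem.

```python
import os


def devices_by_directory(directories: dict[str, str]) -> dict[str, int]:
    """Map each repository name to the device ID of the filesystem holding it."""
    return {name: os.stat(path).st_dev for name, path in directories.items()}


def shared_devices(directories: dict[str, str]) -> list[list[str]]:
    """Return groups of repository names that live on the same device."""
    groups: dict[int, list[str]] = {}
    for name, device in devices_by_directory(directories).items():
        groups.setdefault(device, []).append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

Calling shared_devices with a dictionary such as {"content": "/path/to/content_repository", "flowfile": "/path/to/flowfile_repository"} returns an empty list when every repository has its own filesystem.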

Network Connection

A reliable and fast network is essential for proper cluster operation.

  • Network Requirements: Stable network connection with minimal latency and high bandwidth (at least a 1 Gbit/s link between nodes; 10 Gbit/s may be worthwhile).


  • Network Infrastructure Planning: Consider using a dedicated network or VLAN for communication between NiFi and ZooKeeper nodes to improve security and performance.


Let's move on to the next step and take a detailed look at the Apache NiFi configuration needed to prepare the cluster for production operation.


The main configuration files that need to be set up for the NiFi cluster to run include:

  • nifi.properties: The main configuration file for NiFi.
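  • zookeeper.properties: Configuration for ZooKeeper. NiFi reads this file for its embedded ZooKeeper; a dedicated ZooKeeper cluster takes the same settings in its own configuration file (typically zoo.cfg).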
  • authorizers.xml: Configuration for access control.
  • login-identity-providers.xml: Authentication configuration.

nifi.properties


The nifi.properties file contains the basic parameters that need to be configured for the cluster to work.


Cluster Configuration

nifi.cluster.is.node=true: Enables cluster mode for the NiFi node.
nifi.cluster.node.address=HOSTNAME: Set the IP address or hostname of the node.
nifi.cluster.node.protocol.port=PORT: Set the port on which the node will accept cluster requests. For example, 9999.
nifi.zookeeper.connect.string=ZK_HOST1:2181,ZK_HOST2:2181,ZK_HOST3:2181: Specify the addresses of the ZooKeeper nodes, separated by commas.
nifi.cluster.flow.election.max.wait.time=5 mins: The maximum time to wait at startup before the cluster settles on which flow is the correct one.
nifi.cluster.flow.election.max.candidates=1: The number of connected nodes that triggers an early flow election, before the wait time above expires. With a value of 1, the election ends as soon as the first node connects; setting it to the number of nodes in the cluster (3 in our case) lets every node take part.
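
Because a single misspelled key is enough to keep a node out of the cluster, it can help to check the file before starting NiFi. Below is a minimal Python sketch of such a check. The parser only handles plain key=value lines, and the helper names and the list of required keys are my own choice, not something NiFi provides.

```python
REQUIRED_CLUSTER_KEYS = (
    "nifi.cluster.is.node",
    "nifi.cluster.node.address",
    "nifi.cluster.node.protocol.port",
    "nifi.zookeeper.connect.string",
)


def parse_properties(text: str) -> dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def cluster_problems(props: dict[str, str]) -> list[str]:
    """Return human-readable problems with the cluster-related settings."""
    problems = [f"missing or empty: {key}" for key in REQUIRED_CLUSTER_KEYS if not props.get(key)]
    if props.get("nifi.cluster.is.node", "").lower() != "true":
        problems.append("nifi.cluster.is.node is not true")
    port = props.get("nifi.cluster.node.protocol.port", "")
    if port and not port.isdigit():
        problems.append(f"protocol port is not a number: {port}")
    return problems
```

Reading conf/nifi.properties into a string, passing it through parse_properties, and then calling cluster_problems returns an empty list when the cluster section looks sane. A key with a doubled or mistyped segment shows up as "missing or empty".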


Web Interface Configuration

nifi.web.http.host=HOSTNAME: Set the IP address or hostname for the web interface.
nifi.web.http.port=8080: Port for the web interface (if using HTTP).


For HTTPS:

nifi.web.https.host=HOSTNAME
nifi.web.https.port=8443
nifi.security.keystore=/path/to/keystore.jks
nifi.security.keystoreType=JKS
nifi.security.keystorePasswd=your_keystore_password
nifi.security.keyPasswd=your_key_password
nifi.security.truststore=/path/to/truststore.jks
nifi.security.truststoreType=JKS
nifi.security.truststorePasswd=your_truststore_password


Repository directories

nifi.content.repository.directory.default=./content_repository: Path to the directory for content storage.
nifi.flowfile.repository.directory=./flowfile_repository: Path to the directory for storing FlowFile metadata (attributes and current state).
nifi.provenance.repository.directory.default=./provenance_repository: Path to the directory to store provenance data.


Queue and Provenance Settings

nifi.queue.swap.threshold=20000: The number of FlowFiles in a single queue after which the excess is swapped out to disk.
nifi.provenance.repository.rollover.time=30 secs: The time after which new Provenance repository files will be created.
nifi.provenance.repository.max.storage.time=24 hours: Maximum storage time for Provenance.
nifi.provenance.repository.max.storage.size=10 GB: The maximum amount of data in Provenance. Adjust based on available disk space.
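
To adjust the storage limit to the disk you actually have, you can compare the configured value with the free space on the repository volume. Here is a small Python sketch. It assumes binary multiples (1 KB = 1024 bytes) for the units and a hypothetical 20% headroom, so verify both against your own setup.

```python
import shutil

# Assumption: binary multiples (1 KB = 1024 B); verify against your NiFi version.
_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_size(value: str) -> int:
    """Convert a string such as '10 GB' into a number of bytes."""
    number, unit = value.split()
    return int(float(number) * _UNITS[unit.upper()])


def fits_on_disk(max_storage: str, path: str, headroom: float = 0.2) -> bool:
    """True if max_storage fits in the free space at path with headroom left over."""
    free = shutil.disk_usage(path).free
    return parse_size(max_storage) <= free * (1 - headroom)
```

For example, fits_on_disk("10 GB", "./provenance_repository") tells you whether the limit from the text leaves the assumed headroom on that volume.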


Configuration for reliability and fault tolerance

nifi.cluster.protocol.heartbeat.interval=5 secs: Interval for sending heartbeat between nodes.
nifi.cluster.protocol.is.secure=true: Enables encryption of traffic between nodes.


zookeeper.properties

If you use an external ZooKeeper cluster, configure it to provide reliable coordination between nodes.

tickTime=2000: The basic time unit in milliseconds that ZooKeeper uses for heartbeats and timeouts.
initLimit=10: The maximum number of ticks (tickTime) a follower may take to connect to the leader and complete its initial synchronization.
syncLimit=5: The maximum number of ticks a follower may fall behind the leader before it is dropped.
server.1=HOST1:2888:3888: Defines a ZooKeeper server; add one server.N line per server in the ensemble. The first port is used for communication between followers and the leader, the second port is used for leader election.
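
The tick-based limits are easier to reason about in seconds, and the server.N lines are easy to get wrong by hand. The Python sketch below derives both. The helper names are mine, the host names are the placeholders used in this article, and the ports are the ones shown above.

```python
def zk_timeouts(tick_time_ms: int, init_limit: int, sync_limit: int) -> dict[str, float]:
    """Translate tick-based limits into seconds."""
    return {
        "init_seconds": tick_time_ms * init_limit / 1000,
        "sync_seconds": tick_time_ms * sync_limit / 1000,
    }


def zk_server_lines(hosts: list[str], peer_port: int = 2888, election_port: int = 3888) -> list[str]:
    """Build one server.N line per ensemble member, numbered from 1."""
    return [f"server.{i}={host}:{peer_port}:{election_port}" for i, host in enumerate(hosts, start=1)]


def nifi_connect_string(hosts: list[str], client_port: int = 2181) -> str:
    """Build the value for nifi.zookeeper.connect.string."""
    return ",".join(f"{host}:{client_port}" for host in hosts)


hosts = ["ZK_HOST1", "ZK_HOST2", "ZK_HOST3"]  # placeholder host names
print(zk_timeouts(2000, 10, 5))
print("\n".join(zk_server_lines(hosts)))
print(nifi_connect_string(hosts))
```

With the values above, followers get 20 seconds for the initial sync and may lag by up to 10 seconds. Each ZooKeeper server also needs a myid file whose number matches its server.N entry.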


authorizers.xml

This file manages access to NiFi resources, including users and groups.


Access Management: Define roles and groups as well as appropriate access policies to secure the system.

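<authorizers>
    <userGroupProvider>
        <identifier>file-user-group-provider</identifier>
        <class>org.apache.nifi.authorization.FileUserGroupProvider</class>
        <property name="Users File">./conf/users.xml</property>
    </userGroupProvider>
    <accessPolicyProvider>
        <identifier>file-access-policy-provider</identifier>
        <class>org.apache.nifi.authorization.FileAccessPolicyProvider</class>
        <!-- Links the policies to the user group provider defined above -->
        <property name="User Group Provider">file-user-group-provider</property>
        <property name="Authorizations File">./conf/authorizations.xml</property>
    </accessPolicyProvider>
    <!-- The authorizer that nifi.security.user.authorizer in nifi.properties refers to -->
    <authorizer>
        <identifier>managed-authorizer</identifier>
        <class>org.apache.nifi.authorization.StandardManagedAuthorizer</class>
        <property name="Access Policy Provider">file-access-policy-provider</property>
    </authorizer>
</authorizers>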
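
A malformed tag in this file is easy to miss by eye, so it is worth parsing the file before restarting a node. Here is a minimal Python sketch using only the standard library. The function name is hypothetical, and it checks well-formedness and the presence of identifiers only, not whether NiFi will accept the configuration.

```python
import xml.etree.ElementTree as ET


def provider_identifiers(xml_text: str) -> dict[str, list[str]]:
    """Parse authorizers XML and list the identifiers per provider element type.

    Raises xml.etree.ElementTree.ParseError if the document is not well-formed.
    """
    root = ET.fromstring(xml_text)
    found: dict[str, list[str]] = {}
    for provider in root:
        identifier = provider.findtext("identifier")
        if identifier is None or not identifier.strip():
            raise ValueError(f"<{provider.tag}> has no <identifier> element")
        found.setdefault(provider.tag, []).append(identifier.strip())
    return found
```

Passing the contents of conf/authorizers.xml to provider_identifiers either returns the identifiers grouped by element name or fails loudly at the first broken tag.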


login-identity-providers.xml

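This file is used to configure authentication. You can configure LDAP, Kerberos, or other authentication mechanisms. The provider you define here is activated by setting nifi.security.user.login.identity.provider in nifi.properties to its identifier.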

<loginIdentityProviders>
    <provider>
        <identifier>ldap-provider</identifier>
        <class>org.apache.nifi.ldap.LdapProvider</class>
        <property name="Authentication Strategy">SIMPLE</property>
        <property name="Manager DN">cn=admin,dc=example,dc=com</property>
        <property name="Manager Password">admin_password</property>
        <property name="Url">ldap://ldap.example.com:389</property>
        <property name="User Search Base">ou=users,dc=example,dc=com</property>
        <property name="User Search Filter">uid={0}</property>
        <property name="Identity Strategy">USE_USERNAME</property>
    </provider>
</loginIdentityProviders>

After editing the configuration files, restart all NiFi nodes for the changes to take effect. Verify that all nodes have successfully connected to the cluster and synchronized the configuration.
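
The verification step can be scripted against NiFi's REST API. The Python sketch below assumes that the /nifi-api/flow/cluster/summary endpoint returns a clusterSummary object with connectedNodeCount and totalNodeCount fields. That is my understanding of the API, so check it against your NiFi version. The base URL, bearer token, and CA file are values you supply.

```python
import json
import ssl
import urllib.request


def all_nodes_connected(summary: dict) -> bool:
    """True if the payload reports every known node as connected."""
    cluster = summary.get("clusterSummary", {})
    total = cluster.get("totalNodeCount", 0)
    return total > 0 and cluster.get("connectedNodeCount", 0) == total


def fetch_cluster_summary(base_url: str, token: str, ca_file: str) -> dict:
    """Fetch the cluster summary from one node over HTTPS."""
    context = ssl.create_default_context(cafile=ca_file)
    request = urllib.request.Request(
        f"{base_url}/nifi-api/flow/cluster/summary",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, context=context, timeout=10) as response:
        return json.load(response)
```

A restart script could call fetch_cluster_summary in a loop and stop waiting once all_nodes_connected returns True.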


The hardware and the configs are in place; now let's delve into security...


Security in Apache NiFi covers several areas: encryption, authentication, authorization, and access control. Let's look at them in order.

Encryption

Transport Encryption (TLS/SSL)

  • HTTPS for web interface: Secure the web interface, including encrypting all connections to NiFi via HTTPS. Configure certificates in nifi.properties.


  • Middleware Encryption: Enable encryption of traffic between cluster nodes, and between the nodes and ZooKeeper, to prevent data interception.


To configure TLS/SSL:

  • Create and install SSL certificates for all nodes.
  • Configure the nifi.security.keystore, nifi.security.truststore, nifi.security.keystorePasswd, nifi.security.truststorePasswd, and nifi.security.keyPasswd parameters in nifi.properties.
  • Enable encryption for all cluster communications by setting nifi.cluster.protocol.is.secure=true.
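
Certificates that silently expire are a common cause of TLS outages, so once the certificates are installed it can be useful to check how long they remain valid. Below is a minimal Python sketch using the standard ssl module. It assumes the node accepts a TLS connection without a client certificate and that ca_file contains the CA that signed the node certificates; the function names are my own.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter timestamp.

    not_after uses the format returned by getpeercert(),
    for example 'Jan  1 00:00:00 2030 GMT'.
    """
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def peer_not_after(host: str, port: int, ca_file: str) -> str:
    """Connect to host:port, verify against ca_file, and return the cert's notAfter."""
    context = ssl.create_default_context(cafile=ca_file)
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Combining the two, days_until_expiry(peer_not_after(host, 8443, ca_file)) gives the remaining validity for one node, which is easy to feed into an alert.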

Encrypt Data on Disk

  • Encrypt repositories: Consider encrypting NiFi repositories such as flowfile repository, content repository, and provenance repository to protect data at rest.


  • Encrypting with HSM: Use hardware security modules (HSMs) for key storage, or disk-level encryption (such as LUKS), to strengthen data protection.

Authentication

It is recommended to use proven and reliable mechanisms to authenticate users and systems that connect to NiFi.


LDAP

If your organization uses LDAP, you can integrate it with NiFi to manage users and groups. The LDAP configuration is defined in login-identity-providers.xml.

<loginIdentityProviders>
    <provider>
        <identifier>ldap-provider</identifier>
        <class>org.apache.nifi.ldap.LdapProvider</class>
        <property name="Authentication Strategy">SIMPLE</property>
        <property name="Manager DN">cn=admin,dc=example,dc=com</property>
        <property name="Manager Password">admin_password</property>
        <property name="Url">ldap://ldap.example.com:389</property>
        <property name="User Search Base">ou=users,dc=example,dc=com</property>
        <property name="User Search Filter">uid={0}</property>
        <property name="Identity Strategy">USE_USERNAME</property>
    </provider>
</loginIdentityProviders>
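
The {0} in the User Search Filter is a placeholder for the username typed at login. The Python sketch below illustrates the idea of filling it in, with the special characters escaped as described in RFC 4515. It is an illustration only: NiFi performs this step internally, and the function names here are hypothetical.

```python
# Characters with special meaning in LDAP search filters (RFC 4515).
_ESCAPES = {"\\": r"\5c", "*": r"\2a", "(": r"\28", ")": r"\29", "\x00": r"\00"}


def escape_filter_value(value: str) -> str:
    """Escape characters that would otherwise change the meaning of the filter."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def build_search_filter(template: str, username: str) -> str:
    """Fill the {0} placeholder of a search filter with an escaped username."""
    return template.replace("{0}", escape_filter_value(username))


print(build_search_filter("uid={0}", "jdoe"))
print(build_search_filter("uid={0}", "j*doe"))
```

The second call shows why escaping matters: without it, a username containing an asterisk would turn the filter into a wildcard search.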


Kerberos authentication is also possible, but I'd like to talk about that in the next article =)

Authorization and Access Control

Once authentication is configured, it is important to properly configure authorization - access to resources and operations in NiFi.

Access Policies

Define access policies for users and groups to restrict access to critical NiFi components. The providers that store users, groups, and policies are configured in authorizers.xml.


Example of the provider configuration:

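<authorizers>
    <userGroupProvider>
        <identifier>file-user-group-provider</identifier>
        <class>org.apache.nifi.authorization.FileUserGroupProvider</class>
        <property name="Users File">./conf/users.xml</property>
    </userGroupProvider>
    <accessPolicyProvider>
        <identifier>file-access-policy-provider</identifier>
        <class>org.apache.nifi.authorization.FileAccessPolicyProvider</class>
        <!-- Links the policies to the user group provider defined above -->
        <property name="User Group Provider">file-user-group-provider</property>
        <property name="Authorizations File">./conf/authorizations.xml</property>
    </accessPolicyProvider>
    <!-- The authorizer that nifi.security.user.authorizer in nifi.properties refers to -->
    <authorizer>
        <identifier>managed-authorizer</identifier>
        <class>org.apache.nifi.authorization.StandardManagedAuthorizer</class>
        <property name="Access Policy Provider">file-access-policy-provider</property>
    </authorizer>
</authorizers>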

After security configuration, it is recommended to test the system for vulnerabilities and perform load testing. This will help ensure that all aspects of security are working properly and the system is ready for use.

Summary of Remaining Items

  • Monitoring and Logging: Set up monitoring tools (VictoriaMetrics, Grafana) to track resource status and performance. It is important to organize centralized logging and alerting.


  • Backup and recovery: Set up regular backups of all configuration files and data. Make sure you have a disaster recovery plan in place.


  • Performance and Scalability: Optimize data flows and configure the system so that you can easily scale the cluster as data volume grows.
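
For the backup item, even a very small script goes a long way. Here is a minimal Python sketch that archives the configuration directory with a timestamp. The function name and the archive naming scheme are my own, and it covers only configuration files, not the repositories.

```python
import tarfile
import time
from pathlib import Path


def backup_conf(conf_dir: str, backup_dir: str) -> Path:
    """Write a timestamped .tar.gz of conf_dir into backup_dir and return its path."""
    source = Path(conf_dir)
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = target_dir / f"nifi-conf-{stamp}.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        # Store entries under the directory's own name, e.g. conf/nifi.properties.
        tar.add(str(source), arcname=source.name)
    return archive
```

Run from a scheduler, something like backup_conf("/path/to/nifi/conf", "/path/to/backups") keeps a history of configuration changes. In a default installation the flow definition also lives in the conf directory, so it is captured as well.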


At this point, your Apache NiFi cluster setup should be complete and ready to run in a production environment. If you have questions or need additional details, I'm here to help!