AWS Data Pipeline is one of the most convenient mechanisms for transferring data from one storage service to another, even when the two use different data formats. Several techniques can be used to optimize the copy process along the way. In this article, the scenario is copying 3 CSV files stored in an S3 bucket to 3 DynamoDB tables.
A Hive activity is used to transfer the data from S3 to the DynamoDB tables. First, a Hive script is created with the necessary input and output data nodes and the copy query, as shown below. As the resource for the Hive activity, we added an m4.large instance for the EMR cluster.
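A minimal sketch of one of these scripts is below, assuming the DynamoDBStorageHandler that ships with EMR; the column names, the S3 path, and the Orders table name are illustrative placeholders, and the other two scripts differ only in those values:

```sql
-- External table over one of the CSV files in S3 (path and columns are placeholders)
CREATE EXTERNAL TABLE s3_orders (
  order_id    string,
  customer_id string,
  amount      string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/input/orders/';

-- External table mapped onto the target DynamoDB table
CREATE EXTERNAL TABLE ddb_orders (
  order_id    string,
  customer_id string,
  amount      string
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name"     = "Orders",
  "dynamodb.column.mapping" = "order_id:order_id,customer_id:customer_id,amount:amount"
);

-- Copy the rows from S3 into DynamoDB
INSERT OVERWRITE TABLE ddb_orders
SELECT * FROM s3_orders;
```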
Scenario 1: Run 3 Hive activities under one pipeline
After writing the script, we set the write capacity of the DynamoDB tables to 25, defined 3 Hive scripts under one pipeline, and ran the first pipeline.
With the 3 Hive scripts running and the DynamoDB write capacity increased to 25, there was a clear gap between the provisioned and consumed write capacity. The performance was not what we expected.
We assume the reason is that, even though DynamoDB is provisioned with enough write capacity, the data is written by the EMR cluster created for the Hive activity, so the bottleneck sits on the EMR side. Even though the m4.large instance type we added for the EMR cluster has more than enough capability for the Hive activity, performance still lagged.
Scenario 2: Adding the write throughput percent
After seeing this performance issue, we found a line which can be added to the Hive script to increase the write throughput percentage used by the EMR cluster. By default, the EMR cluster uses only half of the provisioned write throughput. The script change is as below:
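The line added near the top of the Hive script looks like this (the property name comes from the EMR DynamoDB connector):

```sql
-- Target 150% of the table's provisioned write throughput
-- instead of the default 50%.
SET dynamodb.throughput.write.percent=1.5;
```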
The DynamoDB write throughput percent can be set from 0.1 to 1.5, with 0.5 as the default. Setting it to 1.5 makes the job target 150% of the table's provisioned write throughput, deliberately overdriving the provisioned capacity.
After the Hive script change, the consumed DynamoDB write capacity increased to the 8–15 range, but still not as much as expected.
Scenario 3: Create 3 pipelines for 3 Hive activities
Increasing the provisioned write capacity of DynamoDB would increase the cost of the process, while creating 3 pipelines costs less than that. Therefore, we tried creating 3 data pipelines, one per Hive activity, keeping the Scenario 2 setting in each Hive script and the same write capacity (25) on the DynamoDB tables.
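For reference, a rough sketch of what one of the three pipeline definitions might look like in the Data Pipeline JSON format; the bucket, script path, and object names are illustrative, and the other two pipelines differ only in those values:

```json
{
  "objects": [
    {
      "id": "Default",
      "scheduleType": "ondemand",
      "role": "DataPipelineDefaultRole",
      "resourceRole": "DataPipelineDefaultResourceRole"
    },
    {
      "id": "EmrClusterForCopy",
      "type": "EmrCluster",
      "masterInstanceType": "m4.large",
      "coreInstanceType": "m4.large",
      "coreInstanceCount": "1",
      "terminateAfter": "2 Hours"
    },
    {
      "id": "S3OrdersInput",
      "type": "S3DataNode",
      "directoryPath": "s3://my-bucket/input/orders/"
    },
    {
      "id": "DynamoDBOrders",
      "type": "DynamoDBDataNode",
      "tableName": "Orders"
    },
    {
      "id": "CopyOrders",
      "type": "HiveActivity",
      "runsOn": { "ref": "EmrClusterForCopy" },
      "input": { "ref": "S3OrdersInput" },
      "output": { "ref": "DynamoDBOrders" },
      "scriptUri": "s3://my-bucket/scripts/copy-orders.q"
    }
  ]
}
```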
With these changes, the performance was up to expectations. The consumed write throughput even exceeded the provisioned capacity in places, due to the overprovisioning setting in the Hive script, and the provisioned capacity was consumed well. Even though, according to the documentation, this performance is not constant, the results are satisfying at this level.
To summarize, the steps that increased performance:
1. Run 3 pipelines for 3 Hive activities
2. Set the write throughput percent in the Hive script to 1.5
3. Increase the DynamoDB write capacity
Thank you for reading, have a good day!