"Offset commit cannot be completed since the consumer is not part of an active group for auto partition assignment; it is likely that the consumer was kicked out of the group".
The above message was from the log when our microservice take a long time to before committing the offset. Let's explain the context first to help you get some background information about the issue. I have service A dedicates for calling REST API exposed by service B. When ever A receives message from Kafka, it calls service B's API. If B's is down, A will keep trying more several attempts. When the retry take too long, the Kafka consumer of service A will be removed from the consumer group.
This is a common scenario we might encounter when using Kafka. In this article, I will walk you through the cause as well as the approach I use to tackle this issue.
I spent a while to read the source of the java Kafka client, and come up with this diagram to brief how the Kafka consumer works.
There is a component so-called Group Coordinator which manages consumers/members of consumer groups. If the last
is more than
, then the Group Coordinator will believe that the consumer was dead and it will remove the consumer out of the group. That's why when the retry take longer than the
x > max.poll.interval.ms
, the consumer cannot commit the offset. Even worse, another consumer might poll the same message and falling to the same issue.
Some solutions I thought could resolve the problem:
Solution 1: Do not commit the offset of the message and
the entire consumer group for a while. Then create a cron job to enable those paused consumer group to retry again. But the unsuccessful delivered message never come again because each consumer group also has its own local offset repository which keeps track of the last read message.
Solution 2: simple and straightforward than solution 1, make sure that the retry will not take longer than
. For example: if retry 5 attempts, and each attempt has socket timeout as 5000 ms, then the
value should be greater than 5*5000 ms. The safe value could be 27000 ms. But what if the service B's outage takes 20 mins? Of course, I should not set the
as 20 mins.
Finally, I get a came up with the idea to enqueue the message with a header to determine when to stop the retry. The header named as FIRST_RETRY_TIMESTAMP which has the timestamp before first retry takes place. (The RETRY_COUNT is fine also)
To add a backoff between retries, instead of enqueue the message to the same topic, message should be send to a delay topic and a cron job will forward those messages to the topic.