Handling Database Failures in a Distributed System with RabbitMQ Workers
I have a worker that processes tasks from RabbitMQ and inserts data into a database. The system operates at high scale, handling thousands of messages per second, which makes proper failure handling crucial to avoid overwhelming the system.
The problem arises when the database goes down for a period of time. If the worker has already pulled a message from the queue, that message is lost unless the failure is handled properly.
Proposed Solution:
1. Use the ACK mechanism properly
• If the database is available, insert the data and send an ACK to remove the message from the queue.
• If the database is down, send a NACK (negative acknowledgement) so that RabbitMQ requeues the message for another attempt.
2. Prevent infinite retry loops
• If we simply NACK the message, it immediately returns to the queue, leading to a tight retry loop that overloads the system.
• To solve this, implement a progressive retry delay: the worker sleeps for a few seconds before retrying the same task, e.g. 1s → 3s → 5s → 30s → 60s (capped at a maximum delay).
3. Limit retry attempts
• Introduce a retry counter (e.g., 5 attempts).
• After 5 failures, move the message to a Dead Letter Queue (DLQ) instead of retrying indefinitely (a code sketch of this flow follows the list).
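A minimal sketch of how such a consumer could look with pika (a Python RabbitMQ client). The queue names, the insert_into_db placeholder, the delay schedule, and the x-retries header are assumptions for illustration only; because a plain NACK/requeue does not carry an attempt count, this sketch republishes the message with an incremented header rather than nacking it:

```python
# A minimal sketch of the Proposed Solution using pika. Queue names, the
# insert_into_db placeholder, the delay schedule, and the x-retries header
# are illustrative assumptions, not from the original post.
import time
import pika

MAX_RETRIES = 5
DELAYS = [1, 3, 5, 30, 60]  # progressive delay in seconds, capped at 60

def insert_into_db(body):
    """Placeholder for the real database insert; raises on DB failure."""
    raise NotImplementedError

def on_message(channel, method, properties, body):
    retries = (properties.headers or {}).get("x-retries", 0)
    try:
        insert_into_db(body)
        channel.basic_ack(delivery_tag=method.delivery_tag)
    except Exception:
        if retries >= MAX_RETRIES:
            # Give up: push the message to the dead letter queue and drop it.
            channel.basic_publish(exchange="", routing_key="tasks.dlq", body=body)
            channel.basic_ack(delivery_tag=method.delivery_tag)
        else:
            # Back off, then republish with an incremented retry counter so
            # the attempt count survives the round trip through the queue.
            # NOTE: sleeping here blocks this consumer; a common alternative
            # is a TTL-based retry queue that dead-letters back to "tasks".
            time.sleep(DELAYS[min(retries, len(DELAYS) - 1)])
            channel.basic_publish(
                exchange="",
                routing_key="tasks",
                body=body,
                properties=pika.BasicProperties(headers={"x-retries": retries + 1}),
            )
            channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="tasks", durable=True)
channel.queue_declare(queue="tasks.dlq", durable=True)
channel.basic_qos(prefetch_count=10)
channel.basic_consume(queue="tasks", on_message_callback=on_message)
channel.start_consuming()
```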
Alternative Approach:
Instead of relying on RabbitMQ's NACK/requeue cycle, an alternative would be to keep the message in memory and attempt up to 5 internal retries within the worker itself (sketched below):
1. The worker tries to insert the data into the DB.
2. If the DB is down, it retries up to 5 times, with a sleep interval between attempts.
3. If all retries fail, it moves the message to the DLQ.
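Under the same assumptions as the previous sketch (pika, the hypothetical insert_into_db helper, and the tasks.dlq queue), the in-worker variant could look roughly like this; the message stays unacknowledged while the retries run, so a worker crash returns it to the queue rather than losing it:

```python
# A rough sketch of the in-worker retry alternative; helper and queue names
# are the same illustrative assumptions used in the previous sketch.
import time

DELAYS = [1, 3, 5, 30, 60]  # seconds to sleep between the 5 attempts

def on_message(channel, method, properties, body):
    for attempt in range(5):
        try:
            insert_into_db(body)  # hypothetical DB insert helper
            channel.basic_ack(delivery_tag=method.delivery_tag)
            return
        except Exception:
            time.sleep(DELAYS[attempt])  # back off before the next attempt
    # All in-process retries failed: hand the message to the DLQ and drop it.
    channel.basic_publish(exchange="", routing_key="tasks.dlq", body=body)
    channel.basic_ack(delivery_tag=method.delivery_tag)
```

The trade-off is that every long sleep ties up this consumer (and one of its prefetch slots) for the duration, which is part of why the answers below lean towards Solution 1.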
Questions:
• Which approach is preferable? Should I rely on RabbitMQ to handle retries, or should I manage them within the worker itself?
• Are there better practices for handling failures in a high-scale distributed system with RabbitMQ and a database backend?
I'm not familiar with RabbitMQ but this is how I built a queue for LiteGUI using Redis.
Firstly, every job gets stored in an activity table with a retry and result column.
If the job can be run instantly without queueing, no queue entry is needed: the retry column stays 0 and the result column is filled in.
If the job fails for whatever reason, a queue entry containing the activity ID is added to let the queue worker process it later on. When the worker processes the job, the retry column will be incremented regardless of the outcome. If the worker succeeds, the activity will be updated with the result.
If the worker fails to process it and the retry number is less than 5, another queue entry will be added (as the current one has already been removed from the queue). When that queue entry gets processed depends on the retry number; the delays come out to roughly 1m, 5m, 25m, 2.1h, 10.4h, and 2.2 days using the following formula:
$interval = 300 * pow(5, ($retry - 1));
This approach also helps in case of queue failure: you can simply rebuild the queue from the activity table using the entries that have no result (or status) and a retry number less than 5.
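For concreteness, a condensed sketch of the scheduling side of this approach, assuming Python with redis-py; the sorted-set key and the helper signatures are illustrative and not taken from LiteGUI:

```python
# Delayed retries via a Redis sorted set keyed on the activity ID, with the
# score set to the time the job becomes due; names are illustrative only.
import time
import redis

r = redis.Redis()

def schedule_retry(activity_id: int, retry: int) -> None:
    # Same growth as the formula above: 300 * 5 ** (retry - 1) seconds.
    interval = 300 * 5 ** (retry - 1)
    r.zadd("job_queue", {str(activity_id): time.time() + interval})

def due_activity_ids() -> list:
    # Activity IDs whose scheduled time has passed; the worker processes
    # these, updates the activity table, and removes them from the set.
    return [m.decode() for m in r.zrangebyscore("job_queue", 0, time.time())]
```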
To be honest, I don't work with queues regularly, but I had to implement one anyway. I'm sharing this approach so we can all improve it.
Proposed Solution 1 seems sensible; using acknowledgements and a DLQ is a fairly common pattern. You might also want to monitor the size of the DLQ and, if it reaches a certain limit, stop processing altogether. You can also alert based on the size of the queue.
Don't forget a mechanism to redrive messages back from the DLQ, and consider whether ordering is important (it might be if you're using a FIFO queue, though I'm unsure whether Rabbit supports that).
Proposed Solution 1 is preferable in that it accounts for DB outages, slowness, and worker crashes, and you describe additional safety mechanisms to prevent the queue from becoming blocked by poison/invalid messages.
Proposed Solution 2 without ACKs would be vulnerable to message loss if a worker were to crash before successful message delivery.
This, I think. I also didn't see how your solution 2 recovers from worker crashes, although I was sympathetic to distributing the retry to workers.
Not your original question, but you should add some random jitter to your exponential-backoff-like delay intervals.
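For example, a tiny illustration in Python (an assumption on my part, not from the post):

```python
import random

def jittered(delay_seconds: float) -> float:
    # "Full jitter": wait somewhere between 0 and the backoff ceiling so that
    # many consumers hitting the same outage don't all retry at the same time.
    return random.uniform(0, delay_seconds)
```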
In solution 1, the consumer should cancel its consumption until the database is back up.
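One way that could look with pika, assuming the hypothetical insert_into_db helper again plus some separate logic (not shown) that re-subscribes the consumer once the database is healthy:

```python
def on_message(channel, method, properties, body):
    try:
        insert_into_db(body)
        channel.basic_ack(delivery_tag=method.delivery_tag)
    except Exception:
        # Return the message to the queue and stop this consumer from pulling
        # more work while the database is down.
        channel.basic_nack(delivery_tag=method.delivery_tag, requeue=True)
        channel.basic_cancel(method.consumer_tag)
```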
Solution 1, with a DLQ that you retry from. But what's the reason for the database failure?