Dead Letter Queues: Handling Failed Jobs Gracefully
Not every job succeeds. APIs go down, data is malformed, and bugs slip through. The question isn’t whether jobs will fail - it’s how you handle them when they do. bunqueue’s Dead Letter Queue (DLQ) gives you a systematic way to deal with failures.
What Is a Dead Letter Queue?
A DLQ is a holding area for jobs that have exhausted their retry attempts. Instead of being silently discarded, failed jobs are preserved with their full context (a sketch of the entry shape follows this list) so you can:
- Inspect why they failed
- Fix the underlying issue
- Retry them after the fix
- Purge jobs that are no longer relevant
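Taken together, the entry fields used later in this guide (job id, queue, data, attempts, reason, error, enteredAt) suggest a shape roughly like the one below. This is an illustrative TypeScript sketch inferred from those examples, not the library's published typings:

```ts
// Illustrative shape of a DLQ entry, inferred from the fields used in this guide.
interface DlqEntry {
  job: {
    id: string;       // original job id
    queue: string;    // queue the job belonged to
    data: unknown;    // original payload, preserved for retries
    attempts: number; // how many attempts were made
  };
  reason: 'max_attempts' | 'timeout' | 'stalled' | 'manual';
  error?: string;     // last error message, if any
  enteredAt: number;  // epoch milliseconds when the job entered the DLQ
}
```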
When Jobs Enter the DLQ
Jobs move to the DLQ when they meet specific failure conditions:
| Condition | Reason code | Example |
|---|---|---|
| Max attempts exhausted | `max_attempts` | Job failed 3 times with exponential backoff |
| Processing timeout | `timeout` | Job exceeded its timeout setting |
| Stall limit reached | `stalled` | Worker died 3 times while processing this job |
| Manual discard | `manual` | Explicitly moved via API |
```ts
const queue = new Queue('payments', { embedded: true });

// Add a job with 3 retry attempts
await queue.add('charge', { amount: 99.99 }, {
  attempts: 3,
  backoff: { type: 'exponential', delay: 1000 },
  timeout: 30_000,
});

// If all 3 attempts fail, the job moves to the DLQ
// with reason: 'max_attempts'
```

Configuring the DLQ
DLQ behavior is configurable per queue:
```ts
queue.setDlqConfig({
  autoRetry: true,            // Automatically retry DLQ jobs
  autoRetryInterval: 300_000, // Every 5 minutes
  maxAutoRetries: 3,          // Max 3 auto-retry cycles
  maxAge: 604_800_000,        // Expire entries after 7 days
  maxEntries: 10_000,         // Cap at 10,000 entries
});
```

Inspecting Failed Jobs
Query the DLQ to understand what’s failing:
```ts
// Get DLQ statistics
const stats = queue.getDlqStats();
console.log(stats);
// {
//   total: 47,
//   byReason: { max_attempts: 30, timeout: 12, stalled: 5 },
//   retriable: 42,
//   expired: 5,
//   oldestEntry: 1707000000000
// }
```
```ts
// List DLQ entries with filtering
const entries = queue.getDlq({
  reason: 'timeout',                  // Only timeout failures
  olderThan: Date.now() - 86_400_000, // Older than 24h
  limit: 20,
  offset: 0,
});

for (const entry of entries) {
  console.log({
    jobId: entry.job.id,
    queue: entry.job.queue,
    data: entry.job.data,
    reason: entry.reason,
    error: entry.error,
    enteredAt: new Date(entry.enteredAt),
    attempts: entry.job.attempts,
  });
}
```

Retrying Failed Jobs
Once you’ve fixed the underlying issue, retry DLQ entries:
```ts
// Retry a specific job
queue.retryDlq('job-id-123');

// Retry all retriable jobs
queue.retryDlq();

// Retry with a filter
queue.retryDlqByFilter({
  reason: 'timeout',
  newerThan: Date.now() - 3_600_000, // Only last hour
});
```

When a job is retried from the DLQ (a combined flow is sketched after this list):
- Its attempt counter is reset
- It’s placed back in the waiting queue
- Its original data and options are preserved
- Workers will pick it up normally
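As a concrete flow, suppose a downstream API outage caused a burst of timeout failures. The sketch below uses only the `getDlq` and `retryDlqByFilter` calls shown above: it inspects the affected entries first, then retries just that slice once the outage is over.

```ts
// Look at what actually failed before retrying anything
const timeouts = queue.getDlq({ reason: 'timeout', limit: 50 });
console.log(`Pending timeout failures: ${timeouts.length}`);

// After the outage is resolved, retry only the recent timeout failures.
// Each retried job gets a fresh attempt counter and re-enters the waiting queue.
queue.retryDlqByFilter({
  reason: 'timeout',
  newerThan: Date.now() - 3_600_000, // failures from the last hour
});
```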
Purging the DLQ
For jobs that are no longer relevant:
```ts
// Purge all DLQ entries
queue.purgeDlq();
```

The `maxAge` and `maxEntries` config values also handle automatic cleanup during the DLQ maintenance cycle (runs every 60 seconds).
Monitoring DLQ in Production
Watch for growing DLQ sizes as an early warning signal:
```ts
// Periodic health check
setInterval(async () => {
  const stats = queue.getDlqStats();

  if (stats.total > 100) {
    console.warn(`DLQ growing: ${stats.total} entries`);
    // Send alert to monitoring system
  }

  // Log breakdown by reason
  for (const [reason, count] of Object.entries(stats.byReason)) {
    console.log(`DLQ ${reason}: ${count}`);
  }
}, 60_000);
```

DLQ + Webhooks
Combine DLQ with webhooks for real-time alerts:
```ts
// Get notified when jobs enter the DLQ
await queue.add('critical-task', data, {
  attempts: 3,
  backoff: { type: 'exponential', delay: 2000 },
});

// Set up a webhook for failed events
// (via TCP protocol or HTTP API)
```
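The webhook wiring itself isn't shown here, so as a rough illustration, here is a minimal receiver built on Bun.serve. The `/hooks/job-failed` route and the payload fields (`jobId`, `queue`, `reason`) are hypothetical assumptions about what such an event might carry, not bunqueue's actual webhook contract:

```ts
// Hypothetical webhook receiver - the route and payload shape are illustrative.
Bun.serve({
  port: 3001,
  async fetch(req) {
    const url = new URL(req.url);
    if (req.method === 'POST' && url.pathname === '/hooks/job-failed') {
      // Assumed payload: { jobId, queue, reason }
      const event = await req.json();
      console.warn(`Job ${event.jobId} on ${event.queue} hit the DLQ: ${event.reason}`);
      // Forward to Slack / PagerDuty / your monitoring system here
      return new Response('ok');
    }
    return new Response('not found', { status: 404 });
  },
});
```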
Best Practices

- Always configure a DLQ - don't let failed jobs vanish silently
- Set `maxAge` - old DLQ entries are rarely useful, expire them
- Monitor DLQ size - it's your canary in the coal mine
- Use `autoRetry` for transient failures - API outages resolve themselves
- Set `maxEntries` to prevent unbounded growth
- Inspect before retrying - understand why jobs failed before blindly retrying them
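Putting these practices together, a production setup might look something like the sketch below. The intervals and thresholds are illustrative assumptions to tune for your workload; only `setDlqConfig` and `getDlqStats` come from the examples above.

```ts
// Retention and auto-retry policy (illustrative values)
queue.setDlqConfig({
  autoRetry: true,            // transient failures often resolve on their own
  autoRetryInterval: 600_000, // re-attempt every 10 minutes
  maxAutoRetries: 3,
  maxAge: 259_200_000,        // expire entries older than 3 days
  maxEntries: 5_000,          // hard cap to prevent unbounded growth
});

// Lightweight DLQ health check
setInterval(() => {
  const { total, byReason } = queue.getDlqStats();
  if (total > 50) {
    console.warn(`DLQ needs attention: ${total} entries`, byReason);
  }
}, 60_000);
```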