Stall Detection — Recover Unresponsive Jobs Automatically in Bun

Stall detection automatically identifies and recovers jobs that become unresponsive during processing.

  1. Workers send periodic heartbeats while processing jobs
  2. The queue manager checks for jobs without recent heartbeats
  3. Stalled jobs are either retried or moved to the DLQ

import { Queue } from 'bunqueue/client';

const queue = new Queue('my-queue', { embedded: true });

queue.setStallConfig({
  enabled: true,        // Enable stall detection (default: true)
  stallInterval: 30000, // Job is stalled after 30s without heartbeat
  maxStalls: 3,         // Move to DLQ after 3 stalls
  gracePeriod: 5000,    // Grace period after job starts
});

| Option | Default | Description |
|---|---|---|
| enabled | true | Enable/disable stall detection |
| stallInterval | 30000 | Time (ms) without heartbeat before job is stalled |
| maxStalls | 3 | Max stalls before moving to DLQ |
| gracePeriod | 5000 | Initial grace period (ms) after job starts |
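
To make the interaction between these options concrete, here is a minimal sketch of how a stall check could combine them. It is not bunqueue's internal implementation: the `ActiveJob` shape, the `isStalled` and `findStalledJobs` helpers, and the exact way `gracePeriod` suppresses checks are assumptions made for illustration.

```typescript
// Hypothetical shapes for illustration; not bunqueue's internal types.
interface StallConfig {
  enabled: boolean;
  stallInterval: number; // ms without a heartbeat before a job counts as stalled
  maxStalls: number;     // not used by the check itself, listed for completeness
  gracePeriod: number;   // ms after job start during which stall checks are skipped
}

interface ActiveJob {
  id: string;
  startedAt: number;     // epoch ms
  lastHeartbeat: number; // epoch ms
}

// Defaults taken from the options table above.
const defaultStallConfig: StallConfig = {
  enabled: true,
  stallInterval: 30000,
  maxStalls: 3,
  gracePeriod: 5000,
};

// Assumption: a job is never flagged during its grace period; afterwards it is
// flagged once the gap since its last heartbeat exceeds stallInterval.
function isStalled(
  job: ActiveJob,
  now: number,
  config: StallConfig = defaultStallConfig,
): boolean {
  if (!config.enabled) return false;
  if (now - job.startedAt < config.gracePeriod) return false;
  return now - job.lastHeartbeat > config.stallInterval;
}

// One pass of a periodic check: return the jobs that look stalled at `now`.
function findStalledJobs(
  jobs: ActiveJob[],
  now: number,
  config: StallConfig = defaultStallConfig,
): ActiveJob[] {
  return jobs.filter((job) => isStalled(job, now, config));
}
```

Under these assumptions, a job that started two seconds ago is skipped even though it has not yet sent a heartbeat, while a job whose last heartbeat is 50 seconds old is reported.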

Workers automatically send heartbeats:

import { Worker } from 'bunqueue/client';

// `processor` is your job handler function
const worker = new Worker('queue', processor, {
  embedded: true,
  heartbeatInterval: 10000, // Heartbeat every 10 seconds
});

The heartbeatInterval should be less than stallInterval to avoid false positives.
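
A small sanity check can catch a misconfigured pair of intervals before a worker starts. The helper below is a hypothetical sketch, not part of bunqueue: `checkHeartbeatSettings` and its return shape are names invented here for illustration.

```typescript
// Hypothetical helper; bunqueue does not necessarily expose anything like it.
interface HeartbeatCheck {
  ok: boolean;                // true when heartbeatInterval < stallInterval
  heartbeatsPerWindow: number; // whole heartbeats that fit into one stall window
}

function checkHeartbeatSettings(
  heartbeatInterval: number,
  stallInterval: number,
): HeartbeatCheck {
  if (heartbeatInterval <= 0 || stallInterval <= 0) {
    throw new RangeError('Both intervals must be positive numbers of milliseconds');
  }
  return {
    ok: heartbeatInterval < stallInterval,
    heartbeatsPerWindow: Math.floor(stallInterval / heartbeatInterval),
  };
}

// Values from the examples above: 10s heartbeats against a 30s stall interval.
const check = checkHeartbeatSettings(10000, 30000);
if (!check.ok) {
  console.warn('heartbeatInterval should be less than stallInterval');
}
```

A higher `heartbeatsPerWindow` leaves more room for a delayed heartbeat before the job is treated as stalled.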

When a job stalls, one of these actions is taken:

  1. Retry - The job is re-queued with an incremented stall count
  2. Move to DLQ - The job exceeds maxStalls and is moved to the Dead Letter Queue (DLQ)
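
The choice between the two actions can be sketched as a small pure function. This is an illustration only: `resolveStall` and the `StallAction` type are hypothetical names, and the threshold follows one reading of the configuration comment "Move to DLQ after 3 stalls", namely that the stall which brings the count up to `maxStalls` sends the job to the DLQ. The library may compare differently.

```typescript
// Hypothetical result type; not a bunqueue API.
type StallAction =
  | { action: 'retry'; stallCount: number }
  | { action: 'dlq'; reason: 'stalled'; stallCount: number };

// Decide what happens to a job that has just been detected as stalled.
// Assumption: the job goes to the DLQ once its stall count reaches maxStalls.
function resolveStall(previousStallCount: number, maxStalls: number): StallAction {
  const stallCount = previousStallCount + 1;
  if (stallCount >= maxStalls) {
    return { action: 'dlq', reason: 'stalled', stallCount };
  }
  return { action: 'retry', stallCount };
}

// With maxStalls = 3: the first and second stalls are retried, the third is not.
const outcomes = [0, 1, 2].map((count) => resolveStall(count, 3).action);
```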

When a job is retried after a stall or lock expiry, its internal counters (queued count, shard stats) are updated correctly and waiting workers are notified immediately. This means requeued jobs are picked up without delay.

import { QueueEvents } from 'bunqueue/client';

const events = new QueueEvents('my-queue');

events.on('stalled', ({ jobId }) => {
  console.log(`Job ${jobId} stalled`);
});

For jobs that take a long time, increase the stall interval:

// Queue for video processing (may take hours)
const videoQueue = new Queue('video-processing', { embedded: true });

videoQueue.setStallConfig({
  stallInterval: 300000, // 5 minutes
  maxStalls: 2,
  gracePeriod: 60000,    // 1 minute grace
});

// Worker with frequent heartbeats
const worker = new Worker('video-processing', async (job) => {
  // `video` and `processChunk` are application-defined placeholders
  for (const chunk of video.chunks) {
    await processChunk(chunk);
    // updateProgress() also sends a heartbeat to reset the stall timer
    await job.updateProgress(chunk.progress);
  }
}, {
  embedded: true,
  heartbeatInterval: 30000, // Automatic heartbeat every 30 seconds
});

Check stall-related stats:

const stats = queue.getDlqStats();
console.log('Stalled jobs in DLQ:', stats.byReason.stalled);

Filter DLQ by stalled reason:

const stalledJobs = queue.getDlq({ reason: 'stalled' });

SandboxedWorker automatically sends heartbeats in both embedded and TCP mode. In embedded mode, heartbeatInterval defaults to 5000ms, keeping lastHeartbeat fresh so long-running jobs are not falsely detected as stalled.

const worker = new SandboxedWorker('heavy-jobs', {
  processor: './processor.ts',
  timeout: 0,              // Disable worker-level timeout for long jobs
  heartbeatInterval: 5000, // Default in embedded mode (keeps stall detection happy)
});