Jobs and Queues
Reliable job processing and queue management for critical workflows.
Deterministic retry policies
Jobs retry with exponential backoff plus jitter, which absorbs transient failures without creating a thundering herd of simultaneous retries; see the delay sketch after this list.
- Alert dispatch: 3 retry attempts with base delays of 5s, 15s, 45s
- Digest build: 2 retry attempts with base delays of 10s, 30s
- Watcher fetch: 4 retry attempts with base delays of 3s, 10s, 30s, 60s
- All retries respect queue latency SLOs and circuit-breaker thresholds
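A minimal sketch of the delay computation in Python. The `RETRY_SCHEDULES` table and `retry_delay` helper are illustrative names rather than a shipped API, and the equal-jitter window (half to full base delay) is an assumption; only the base delays come from the list above.

```python
import random

# Base delay schedules per job type, taken from the list above.
RETRY_SCHEDULES = {
    "alert-dispatch": [5, 15, 45],
    "digest-build": [10, 30],
    "watcher-fetch": [3, 10, 30, 60],
}

def retry_delay(job_type: str, attempt: int) -> float | None:
    """Delay in seconds before retry attempt `attempt` (1-based),
    or None once retries are exhausted."""
    schedule = RETRY_SCHEDULES.get(job_type, [])
    if attempt > len(schedule):
        return None  # exhausted: hand the job to the dead-letter queue
    base = schedule[attempt - 1]
    # Equal jitter: uniform in [base/2, base], so workers that failed
    # together do not retry in lockstep.
    return random.uniform(base / 2, base)
```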
Dead-letter handling
Jobs that exhaust their retries move to dead-letter queues with full context for manual review and reprocessing; a message-building sketch follows the list.
- Dead-letter queues use workspace-scoped naming: dlq-{workspace}-{queue}
- Messages include original payload, attempt count, and final error
- Manual reprocessing available via API and dashboard
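A sketch of how a dead-letter message could be assembled, assuming a JSON body; the `to_dead_letter` name and the exact field names (`original_payload`, `attempt_count`, `final_error`) are assumptions drawn from the list above, not a confirmed wire format.

```python
import json

def to_dead_letter(workspace: str, queue: str, payload: dict,
                   attempts: int, error: str) -> tuple[str, str]:
    """Return the (dlq_name, message_body) pair for a job whose
    retries are exhausted."""
    dlq_name = f"dlq-{workspace}-{queue}"  # workspace-scoped naming
    body = json.dumps({
        "original_payload": payload,  # untouched input, for reprocessing
        "attempt_count": attempts,    # how many attempts were made
        "final_error": error,         # last error, for manual review
    })
    return dlq_name, body
```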
Queue naming conventions
Queue names follow one convention across staging and production; a naming helper is sketched below the list.
- Staging: {queue}-staging
- Production: {queue}-prod
- Dead-letter: dlq-{queue}-{env}
- Priority: {queue}-{env}-priority for high-priority workflows
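A helper that composes the patterns above; the `queue_name` signature is illustrative, assuming `env` is one of "staging" or "prod".

```python
def queue_name(queue: str, env: str, *, priority: bool = False,
               dead_letter: bool = False) -> str:
    """Compose a queue name from the conventions above."""
    if dead_letter:
        return f"dlq-{queue}-{env}"        # dlq-{queue}-{env}
    if priority:
        return f"{queue}-{env}-priority"   # high-priority workflows
    return f"{queue}-{env}"                # {queue}-staging / {queue}-prod
```

For example, `queue_name("alert-dispatch", "prod", priority=True)` yields `alert-dispatch-prod-priority`.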
Idempotency key strategy
Jobs use idempotency keys to prevent duplicate processing of the same logical operation; see the claim sketch after this list.
- Format: {workspace}-{job_type}-{entity_id}-{timestamp}
- Example: prod-chat-alert-dispatch-inc-123-202403151430
- TTL: 24 hours for most jobs, 7 days for critical operations
- Storage: Redis with workspace-scoped keys
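A claim sketch assuming redis-py; the `claim` helper and the stored value are illustrative, but the key format and TTLs follow the list above. Redis `SET` with `NX` and `EX` makes the check-and-set atomic in a single round trip.

```python
import redis

r = redis.Redis()  # connection details omitted

DEFAULT_TTL = 24 * 60 * 60       # 24 hours for most jobs
CRITICAL_TTL = 7 * 24 * 60 * 60  # 7 days for critical operations

def claim(workspace: str, job_type: str, entity_id: str,
          timestamp: str, *, critical: bool = False) -> bool:
    """Atomically claim an idempotency key; False means the same
    logical operation was already claimed or processed."""
    key = f"{workspace}-{job_type}-{entity_id}-{timestamp}"
    ttl = CRITICAL_TTL if critical else DEFAULT_TTL
    # nx=True: set only if the key does not exist; ex=ttl: expiry.
    return bool(r.set(key, "1", nx=True, ex=ttl))
```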
Job observability fields
Standard fields included in every job log line and metric; they are modeled in the sketch below the list.
- job_id: Unique identifier for the job execution
- job_type: Type of job (alert-dispatch, digest-build, etc.)
- workspace_id: Workspace context
- attempt: Current attempt number
- status: current | retry | success | failed | dead-letter
- duration_ms: Execution duration
- timestamp: Start time of execution
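A sketch of the record as a Python dataclass emitted as one JSON object per line; the `JobLogRecord` and `emit` names are illustrative, and the ISO 8601 timestamp format is an assumption, but the fields mirror the list above.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class JobLogRecord:
    job_id: str        # unique identifier for this job execution
    job_type: str      # alert-dispatch, digest-build, etc.
    workspace_id: str  # workspace context
    attempt: int       # current attempt number
    status: str        # current | retry | success | failed | dead-letter
    duration_ms: int   # execution duration
    timestamp: str     # start time of execution (ISO 8601 assumed)

def emit(record: JobLogRecord) -> None:
    print(json.dumps(asdict(record)))  # one JSON object per log line
```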
Queue latency SLO
Latency target for queue processing to ensure jobs start promptly; a measurement sketch follows the list.
- Target: 95% of jobs processed within 10 seconds of queue time
- Measurement: Time from queue entry to job start
- Exclusions: Throttled queues, paused workflows
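A sketch of checking the SLO over a window of samples; it assumes samples are milliseconds from queue entry to job start, with throttled queues and paused workflows already filtered out. The `latency_slo_met` name is illustrative.

```python
def latency_slo_met(samples_ms: list[float],
                    target_ms: float = 10_000) -> bool:
    """True when at least 95% of queue latencies (queue entry to
    job start) fall within the 10-second target."""
    if not samples_ms:
        return True  # no eligible samples in the window
    within = sum(1 for s in samples_ms if s <= target_ms)
    return within / len(samples_ms) >= 0.95
```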
Queue throughput SLO
Throughput target for queue processing to absorb workload spikes; see the counting sketch after this list.
- Target: Sustain 100 jobs/minute per queue during normal operations
- Burst: Handle 500 jobs/minute for 5-minute bursts
- Measurement: Successful job completions per minute
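A counting sketch for the throughput measurement, assuming unix-second completion timestamps for successful jobs; the resulting per-minute buckets can then be compared against the 100/min sustained and 500/min burst targets. The `completions_per_minute` name is illustrative.

```python
from collections import Counter

def completions_per_minute(completed_at: list[int]) -> dict[int, int]:
    """Bucket successful completion timestamps (unix seconds) into
    per-minute counts, keyed by minute index (ts // 60)."""
    return dict(Counter(ts // 60 for ts in completed_at))
```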