Safe Fail-Overs for Workers
Hyrex workers coordinate via Postgres leases and heartbeats. If a worker disappears (crash, deploy, eviction, network split), its in-flight tasks are detected and safely handed off to healthy workers.
The runtime uses time-bounded leases, exponential backoff, and idempotency keys to keep processing correct and to limit duplicate execution while still guaranteeing at-least-once delivery.
How it works
- Leased reservations: A worker claims a task with a short lease in Postgres. The lease contains owner and expiry.
- Heartbeats: While running, the worker refreshes the lease before it expires (lightweight update).
- Automatic takeovers: If a lease expires (no heartbeat), another worker can safely acquire and resume/redo work.
- Deterministic retries: Failures record attempt counts and next-run times with exponential backoff.
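The claim, heartbeat, and takeover steps above can be sketched with a small in-memory model. This is only an illustration of the lease idea, not Hyrex's implementation: `Lease`, `LeaseTable`, and the explicit `now` argument are hypothetical names, and a real system would do each step as a conditional update in Postgres.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Lease:
    owner: str
    expires_at: float


class LeaseTable:
    """Hypothetical in-memory stand-in for a lease table stored in Postgres."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self.leases: Dict[str, Lease] = {}

    def claim(self, task_id: str, worker: str, now: float) -> bool:
        """Claim the task if it is unowned or its lease has expired."""
        lease = self.leases.get(task_id)
        if lease is not None and lease.expires_at > now:
            return False  # another worker still holds a live lease
        self.leases[task_id] = Lease(worker, now + self.ttl)
        return True

    def heartbeat(self, task_id: str, worker: str, now: float) -> bool:
        """Extend the lease, but only for the current owner and only before expiry."""
        lease = self.leases.get(task_id)
        if lease is None or lease.owner != worker or lease.expires_at <= now:
            return False
        lease.expires_at = now + self.ttl
        return True


table = LeaseTable(ttl=10.0)
claimed_a = table.claim("task-1", "worker-a", now=0.0)        # worker-a owns the task
refreshed_a = table.heartbeat("task-1", "worker-a", now=8.0)  # lease extended to t=18
blocked_b = table.claim("task-1", "worker-b", now=12.0)       # rejected: lease is live
takeover_b = table.claim("task-1", "worker-b", now=18.0)      # worker-a went silent
stale_a = table.heartbeat("task-1", "worker-a", now=19.0)     # rejected: no longer owner
```

The last call shows why a late heartbeat from the old owner is harmless: ownership is checked on every refresh, so a worker that lost its lease cannot take it back by accident.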
Failure scenarios covered
- Process/node crash: Lease expires; work is re-queued or immediately picked up by another worker.
- Rolling deploys: Graceful shutdown drains in-flight tasks; stragglers fail over via lease expiry.
- Kubernetes eviction/OOM: Missing heartbeat triggers takeover with the recorded retry/backoff policy.
- Network partition: Only the side that can refresh leases keeps ownership; the other side’s leases expire.
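For the rolling-deploy case, a graceful drain can be sketched as a loop that finishes the task in hand and stops taking new ones once shutdown is requested. The names `run_worker`, `handle`, and `stop` are hypothetical, and this is not Hyrex's shutdown code. A real worker would typically set the event from a signal handler rather than from inside a task.

```python
import threading
from collections import deque
from typing import Callable, Deque, List


def run_worker(
    queue: Deque[str],
    handle: Callable[[str], None],
    stop: threading.Event,
) -> List[str]:
    """Process tasks until the queue is empty or a shutdown is requested."""
    done: List[str] = []
    while queue and not stop.is_set():
        task = queue.popleft()
        handle(task)  # the in-flight task always runs to completion
        done.append(task)
    return done


stop = threading.Event()
pending: Deque[str] = deque(["t1", "t2", "t3"])


def handle(task: str) -> None:
    # Simulate a shutdown request (e.g. during a deploy) arriving mid-task.
    if task == "t2":
        stop.set()


finished = run_worker(pending, handle, stop)
```

Here `t2` completes even though shutdown arrived while it was running, and `t3` stays queued for another worker. A task that cannot finish before the process is killed is the straggler case, which falls back to lease expiry.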
Correctness guarantees
- At-least-once processing: Every task is eventually processed, even after failures or restarts, and may run more than once.
- Idempotency keys: Use idempotency to protect external side effects on retries and fail-overs.
- Bounded work duplication: Overlaps are minimized by short leases and rapid detection of stale ownership.
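Because a task may run more than once, side effects need a guard. The sketch below shows one common shape for an idempotency key: record the result under a key and skip the effect when the key has been seen. `IdempotencyStore`, `run_once`, and `charge` are hypothetical names used for illustration; they are not part of Hyrex.

```python
from typing import Callable, Dict, List


class IdempotencyStore:
    """In-memory stand-in; a durable version would be a table with a unique key."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}

    def run_once(self, key: str, effect: Callable[[], str]) -> str:
        """Run effect the first time a key is seen; return the stored result afterwards."""
        if key in self._results:
            return self._results[key]
        result = effect()
        self._results[key] = result
        return result


charges: List[str] = []  # records each real call to the (hypothetical) payment provider


def charge(invoice_id: str) -> str:
    charges.append(invoice_id)
    return f"charge-for-{invoice_id}"


store = IdempotencyStore()
first = store.run_once("charge:inv-42", lambda: charge("inv-42"))
# A fail-over re-runs the task on another worker with the same key.
second = store.run_once("charge:inv-42", lambda: charge("inv-42"))
```

This guard does not cover a crash between running the effect and recording the key. When the external service accepts an idempotency key itself, passing the key through closes that gap.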
Operational behavior
- Self-healing workers: New workers can join at any time to pick up stranded or backlogged tasks.
- Zero shared locks: No external systems are required; coordination happens in the Postgres instance you already run.
- Fair rebalancing: Workers periodically rebalance claims to keep throughput high during spikes.
Example: resilient task with retries
Define a task with backoff and rely on fail-over to recover from crashes automatically.
src/hyrex/tasks.py
# Python
from hyrex import HyrexRegistry

hy = HyrexRegistry()


@hy.task(
    max_retries=5,                 # deterministic retries
    retry_backoff=lambda a: 2**a,  # exponential backoff
)
def process_invoice(payload: dict):
    """
    Resilient task. Hyrex coordinates leases/heartbeats so if a worker
    disappears mid-flight, another healthy worker can resume after the lease expires.
    Use idempotency (e.g., invoice_id) to make retries/fail-overs safe.
    """
    invoice_id = payload["invoiceId"]

    # Ensure external side effects are idempotent across retries/fail-overs.
    # charge_customer and mark_paid are application-defined helpers (not shown).
    charge_customer(invoice_id)
    mark_paid(invoice_id)

If the worker running process_invoice dies mid-flight, another worker will acquire the task once the lease expires and continue processing using the same retry/backoff policy.
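To make the backoff policy in the example concrete, the sketch below evaluates the same `lambda a: 2**a` for five retries. Two things are assumptions here rather than documented Hyrex behavior: that attempts are numbered from 1, and what unit the returned value is in. `backoff_schedule` is a hypothetical helper.

```python
from typing import Callable, List

max_retries = 5
retry_backoff: Callable[[int], int] = lambda a: 2 ** a


def backoff_schedule(
    retries: int,
    backoff: Callable[[int], int],
    first_attempt: int = 1,  # assumption: attempts are numbered from 1
) -> List[int]:
    """Return the backoff value for each retry, numbering attempts from first_attempt."""
    return [backoff(a) for a in range(first_attempt, first_attempt + retries)]


schedule = backoff_schedule(max_retries, retry_backoff)
print(schedule)       # [2, 4, 8, 16, 32]
print(sum(schedule))  # 62
```

If attempts were numbered from 0 instead, the same call with `first_attempt=0` would give `[1, 2, 4, 8, 16]`. Either way the wait doubles on each retry, so the final retry dominates the total delay.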