Safe Fail-Overs for Workers
Hyrex workers coordinate via Postgres leases and heartbeats. If a worker disappears (crash, deploy, eviction, network split), its in-flight tasks are detected and safely handed off to healthy workers.
The runtime uses time-bounded locks, exponential backoff, and idempotency keys to keep processing correct: delivery is at-least-once, and idempotency keys keep an occasional duplicate execution from repeating external side effects.
How it works
- Leased reservations: A worker claims a task by taking a short lease in Postgres. The lease records its owner and an expiry time.
- Heartbeats: While the task runs, the worker refreshes the lease before it expires, using a lightweight update.
- Automatic takeovers: If a lease expires because no heartbeat arrived, another worker can safely acquire the task and resume or redo the work.
- Deterministic retries: Each failure records the attempt count and the next run time, computed with exponential backoff.
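
The lease lifecycle above can be sketched with a small in-memory model. This is an illustration of the general technique, not Hyrex's implementation: `LeaseTable`, `try_claim`, `heartbeat`, and the 30-second `LEASE_SECONDS` value are all hypothetical names and numbers, and a dictionary stands in for the Postgres table.

```python
from dataclasses import dataclass
from typing import Dict, Optional

LEASE_SECONDS = 30.0  # assumed lease length for this sketch, not a Hyrex default


@dataclass
class Lease:
    owner: str
    expires_at: float


class LeaseTable:
    """In-memory stand-in for the Postgres table that stores task leases."""

    def __init__(self) -> None:
        self._leases: Dict[str, Lease] = {}

    def try_claim(self, task_id: str, worker: str, now: float) -> bool:
        """Claim a task if it has no lease or its lease has expired."""
        lease = self._leases.get(task_id)
        if lease is not None and lease.expires_at > now:
            return False  # a live lease exists, so the task is not up for grabs
        self._leases[task_id] = Lease(worker, now + LEASE_SECONDS)
        return True

    def heartbeat(self, task_id: str, worker: str, now: float) -> bool:
        """Extend the lease, but only if this worker still holds a live one."""
        lease = self._leases.get(task_id)
        if lease is None or lease.owner != worker or lease.expires_at <= now:
            return False  # ownership was lost; the worker should stop this task
        lease.expires_at = now + LEASE_SECONDS
        return True

    def owner(self, task_id: str) -> Optional[str]:
        """Return the worker recorded on the lease, if any."""
        lease = self._leases.get(task_id)
        return lease.owner if lease else None


if __name__ == "__main__":
    table = LeaseTable()
    table.try_claim("task-1", "worker-a", now=0.0)   # worker-a owns until t=30
    table.heartbeat("task-1", "worker-a", now=20.0)  # extended until t=50
    # worker-a crashes here and stops sending heartbeats
    table.try_claim("task-1", "worker-b", now=51.0)  # lease expired: takeover
    print(table.owner("task-1"))
```

In a real deployment the claim and the heartbeat would each need to be a single atomic statement against the database, so that two workers cannot both believe they hold the same lease.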
 
Failure scenarios covered
- Process/node crash: Lease expires; work is re-queued or immediately picked up by another worker.
- Rolling deploys: Graceful shutdown drains in-flight tasks; stragglers fail over via lease expiry.
- Kubernetes eviction/OOM: Missing heartbeat triggers takeover with the recorded retry/backoff policy.
- Network partition: Only the side that can refresh leases keeps ownership; the other side’s leases expire.
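
To make the rolling-deploy case concrete, here is a minimal sketch of a worker loop that drains on shutdown: once a stop is requested, the task already in flight finishes and no new task is started. `DrainingWorker` and `request_stop` are hypothetical names for illustration; how Hyrex wires its own shutdown handling is not shown in this document.

```python
import threading
from collections import deque
from typing import Callable, Deque, List


class DrainingWorker:
    """Runs tasks from a queue until a stop is requested, then drains."""

    def __init__(self, handler: Callable[[str], None]) -> None:
        self._handler = handler
        self._stop = threading.Event()
        self.completed: List[str] = []

    def request_stop(self, *_args: object) -> None:
        """Ask the loop to stop taking new tasks.

        Accepts and ignores extra arguments so it can be registered
        directly as a signal handler, which receives (signum, frame).
        """
        self._stop.set()

    def run(self, queue: Deque[str]) -> None:
        # The stop flag is checked only between tasks, so a task that
        # has already started always runs to completion.
        while queue and not self._stop.is_set():
            task_id = queue.popleft()
            self._handler(task_id)
            self.completed.append(task_id)
```

A process would typically register the hook from its main thread with `signal.signal(signal.SIGTERM, worker.request_stop)`. Tasks still in the queue at exit are simply never claimed by this worker, and any task cut short by a hard kill is recovered through lease expiry as described above.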
 
Correctness guarantees
- At-least-once processing: Every accepted task is processed at least once, even across failures or restarts, so a task may occasionally run more than once.
- Idempotency keys: Attach an idempotency key to each external side effect so that retries and fail-overs do not repeat it.
- Bounded work duplication: Overlaps are minimized by short leases and rapid detection of stale ownership.
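
One common way to apply idempotency keys is to record the result of each side effect under its key and replay the stored result on a repeat. The sketch below assumes a hypothetical `IdempotencyStore` backed by a dictionary; in practice the store would be a durable table, and the check-and-record step would have to be atomic (for example via a unique constraint).

```python
from typing import Callable, Dict, List


class IdempotencyStore:
    """In-memory stand-in for a durable table keyed by idempotency key."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}

    def run_once(self, key: str, effect: Callable[[], str]) -> str:
        """Run `effect` the first time `key` is seen; replay the result after."""
        if key in self._results:
            return self._results[key]
        result = effect()
        self._results[key] = result
        return result


# Hypothetical side effect: records each charge so duplicates are visible.
charges: List[str] = []


def charge(invoice_id: str) -> str:
    charges.append(invoice_id)
    return f"charged:{invoice_id}"


if __name__ == "__main__":
    store = IdempotencyStore()
    # First attempt, then a retry after a simulated fail-over.
    store.run_once("charge:inv-1", lambda: charge("inv-1"))
    store.run_once("charge:inv-1", lambda: charge("inv-1"))
    print(len(charges))  # 1
```

Deriving the key from stable task data such as the invoice ID, rather than from the attempt or the worker, is what lets a second worker recognize work the first one already completed.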
 
Operational behavior
- Self-healing workers: New workers can join at any time to pick up stranded or backlogged tasks.
- Zero shared locks: No external lock service is required; coordination happens in the Postgres you already run.
- Fair rebalancing: Workers periodically rebalance claims to keep throughput high during spikes.
 
Example: resilient task with retries
Define a task with backoff and rely on fail-over to recover from crashes automatically.
src/hyrex/tasks.py
# Python
from hyrex import HyrexRegistry

hy = HyrexRegistry()

@hy.task(
    max_retries=5,                 # deterministic retries
    retry_backoff=lambda a: 2**a   # exponential backoff
)
def process_invoice(payload: dict):
    """
    Resilient task. Hyrex coordinates leases/heartbeats so if a worker
    disappears mid-flight, another healthy worker can resume after the lease expires.
    Use idempotency (e.g., invoice_id) to make retries/fail-overs safe.
    """
    invoice_id = payload["invoiceId"]

    # Ensure external side effects are idempotent across retries/fail-overs.
    # charge_customer and mark_paid are your own application functions (not shown).
    charge_customer(invoice_id)
    mark_paid(invoice_id)

If the worker running process_invoice dies mid-flight, another worker acquires the task once the lease expires and continues processing under the same retry/backoff policy.
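
The delays produced by a backoff function such as `lambda a: 2**a` can be worked out directly. The helper below is a hypothetical illustration, not part of Hyrex; it assumes attempts are numbered from 1, and the unit of the delay (seconds or otherwise) is not stated in this document.

```python
from typing import Callable, List


def backoff_schedule(backoff: Callable[[int], float], max_retries: int) -> List[float]:
    """Delay before each retry, assuming attempts are numbered from 1."""
    return [backoff(attempt) for attempt in range(1, max_retries + 1)]


delays = backoff_schedule(lambda a: 2**a, max_retries=5)
print(delays)       # [2, 4, 8, 16, 32]
print(sum(delays))  # 62
```

Under these assumptions, five retries spread over 62 time units in total, with each wait twice as long as the previous one. If attempts were numbered from 0 instead, every delay would be halved.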