Safe Fail-Overs for Workers

Hyrex workers coordinate via Postgres leases and heartbeats. If a worker disappears (crash, deploy, eviction, network split), its in-flight tasks are detected and safely handed off to healthy workers.

The runtime uses time-bounded locks, exponential backoff, and idempotency keys to keep processing correct and to limit duplicate work while still guaranteeing at-least-once delivery.

How it works

Leased reservations: A worker claims a task with a short lease in Postgres. The lease contains owner and expiry.
Heartbeats: While running, the worker refreshes the lease before it expires (lightweight update).
Automatic takeovers: If a lease expires (no heartbeat), another worker can safely acquire and resume/redo work.
Deterministic retries: Failures record attempt counts and next-run times with exponential backoff.
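
The lease lifecycle above can be sketched with a small in-memory model. This is an illustration of the general lease-and-heartbeat technique, not Hyrex's actual implementation: the LeaseTable class, its method names, and the 30-second lease length are all assumptions made for the example, and a real system would apply the same checks as conditional updates in Postgres.

```python
from dataclasses import dataclass
from typing import Dict

LEASE_SECONDS = 30.0  # assumed lease length, not a documented Hyrex default


@dataclass
class Lease:
    owner: str
    expires_at: float


class LeaseTable:
    """Hypothetical in-memory stand-in for a Postgres table of task leases."""

    def __init__(self) -> None:
        self._leases: Dict[str, Lease] = {}

    def claim(self, task_id: str, worker: str, now: float) -> bool:
        """Claim a task if it is unowned or its lease has expired."""
        lease = self._leases.get(task_id)
        if lease is not None and lease.expires_at > now:
            return False  # still held by a live owner
        self._leases[task_id] = Lease(worker, now + LEASE_SECONDS)
        return True

    def heartbeat(self, task_id: str, worker: str, now: float) -> bool:
        """Extend the lease, but only for the current, unexpired owner."""
        lease = self._leases.get(task_id)
        if lease is None or lease.owner != worker or lease.expires_at <= now:
            return False  # ownership lost; the worker should stop
        lease.expires_at = now + LEASE_SECONDS
        return True


table = LeaseTable()
assert table.claim("task-1", "worker-a", now=0.0)           # A takes the lease
assert not table.claim("task-1", "worker-b", now=10.0)      # B is refused while A is live
assert table.heartbeat("task-1", "worker-a", now=20.0)      # A extends the lease to t=50
assert table.claim("task-1", "worker-b", now=51.0)          # A went silent; B takes over
assert not table.heartbeat("task-1", "worker-a", now=52.0)  # stale owner is rejected
```

The last assertion is the network-partition case: a worker that comes back after its lease expired finds that its heartbeat is refused, which is its signal to stop working on the task.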

Failure scenarios covered

Process/node crash: Lease expires; work is re-queued or immediately picked up by another worker.
Rolling deploys: Graceful shutdown drains in-flight tasks; stragglers fail over via lease expiry.
Kubernetes eviction/OOM: Missing heartbeat triggers takeover with the recorded retry/backoff policy.
Network partition: Only the side that can refresh leases keeps ownership; the other side’s leases expire.

Correctness guarantees

At-least-once processing: Every task is eventually processed, even after failures or restarts, which means a task can run more than once.
Idempotency keys: Use idempotency to protect external side effects on retries and fail-overs.
Bounded work duplication: Overlaps are minimized by short leases and rapid detection of stale ownership.

Operational behavior

Self-healing workers: New workers can join at any time to pick up stranded or backlogged tasks.
Zero shared locks: No external systems are required; coordination happens in the Postgres instance you already run.
Fair rebalancing: Workers periodically rebalance claims to keep throughput high during spikes.

Example: resilient task with retries

Define a task with backoff and rely on fail-over to recover from crashes automatically.

src/hyrex/tasks.py

# Python
from hyrex import HyrexRegistry

hy = HyrexRegistry()


# Placeholders for illustration; replace with your own idempotent implementations.
def charge_customer(invoice_id: str) -> None:
    ...


def mark_paid(invoice_id: str) -> None:
    ...


@hy.task(
    max_retries=5,                 # deterministic retries
    retry_backoff=lambda a: 2**a   # exponential backoff
)
def process_invoice(payload: dict):
    """
    Resilient task. Hyrex coordinates leases/heartbeats so if a worker
    disappears mid-flight, another healthy worker can resume after the lease expires.
    Use idempotency (e.g., invoice_id) to make retries/fail-overs safe.
    """
    invoice_id = payload["invoiceId"]

    # Ensure external side effects are idempotent across retries/fail-overs
    charge_customer(invoice_id)
    mark_paid(invoice_id)

If the worker running process_invoice dies mid-flight, another worker will acquire the task once the lease expires and run it again under the same retry/backoff policy.
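
To see what the backoff function in the example produces, the delays can be computed directly. The retry_delays helper below is hypothetical, and it assumes attempts are numbered from 0; whether Hyrex starts counting at 0 or 1, and the unit of the returned value, are not stated here and should be checked against the Hyrex documentation.

```python
from typing import Callable, List


def retry_delays(backoff: Callable[[int], float], max_retries: int) -> List[float]:
    """Return the delay before each retry, assuming attempts are numbered from 0."""
    return [backoff(attempt) for attempt in range(max_retries)]


delays = retry_delays(lambda a: 2**a, max_retries=5)
print(delays)  # [1, 2, 4, 8, 16]
```

Under that assumption, five retries spread over a total delay of 31 units, with each wait twice as long as the previous one.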