Resilience Engineering and Failure Management

LLM provider fallback, HTTP 429 rate-limit handling, and resilient pagination in automation workflows.

API fallback design

Relying on a single LLM provider is operational risk. The recommended pattern in n8n/Make:

Primary path — OpenAI (or preferred provider) with main credentials.
Error branch — IF/Switch node catches HTTP 4xx/5xx or timeout.
Fallback path — Anthropic (or second provider) with an adapted equivalent prompt.
Observability — structured log indicating which path was used (provider: anthropic-fallback).

Keep interface parity: normalize input/output into a { text, tokens_used, provider } object before the next node. This avoids conditional logic downstream.

Rate limit handling (HTTP 429)

LLM APIs return 429 Too Many Requests when quotas or rate limits are exceeded. Standard strategy:

Detection — check statusCode === 429 or the Retry-After header.

Exponential backoff — wait min(Retry-After, 2^n * base_ms) with random jitter (±20%). In n8n, use a Wait node + Loop or sub-workflow with an attempt counter.

Circuit breaker — after N consecutive failures (e.g. 5), automatically route to fallback for M minutes without hitting the primary.

Quota budgeting — separate batch workflows (indexing) from interactive workflows (chat) across different API keys or orgs to prevent batch jobs from blocking users.

Pagination at scale

Workflows that sync large datasets (CRM, logs, catalogs) break when they try to process everything in one execution.

Resilient patterns:

Cursor-based pagination — prefer ?cursor= or since_id over offset/limit on large tables (offset degrades performance).
Fixed batching — process 100–500 records per loop; persist the cursor in Static Data (n8n) or Data Store (Make).
Checkpointing — after each successful batch, write last_synced_at to the relational database.
Dead letter — records that fail 3x go to a manual review queue; they do not block the loop.

Operational target: a single execution should never exceed platform timeout (n8n cloud ~2h, self-hosted configurable). If needed, chain multiple executions via a "continue from cursor" webhook.