Reliability Runbook

LLM API Reliability Notes

Practical guides for LLM API reliability and fallback engineering.

Route health monitored
  1. Detect: Timeout or 5xx spike
  2. Route: Fallback policy check
  3. Verify: Response contract smoke test
  4. Escalate: Human review when evidence is thin
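
The four steps above (detect, route, verify, escalate) can be sketched as one decision function. This is a minimal illustration, not a prescribed implementation: the names `RouteHealth`, `Action`, and `decide`, and the thresholds `min_requests` and `max_failure_rate`, are assumptions chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    STAY = "stay_on_primary"
    FALLBACK = "route_to_fallback"
    ESCALATE = "escalate_to_human"


@dataclass
class RouteHealth:
    requests: int       # calls observed in the monitoring window
    timeouts: int       # calls that exceeded the timeout budget
    server_errors: int  # calls that returned a 5xx status


def decide(health: RouteHealth, fallback_allowed: bool, smoke_test_passed: bool,
           min_requests: int = 20, max_failure_rate: float = 0.05) -> Action:
    """Hypothetical detect -> route -> verify -> escalate decision."""
    # Escalate: with too few samples the evidence is thin, so ask a human.
    if health.requests < min_requests:
        return Action.ESCALATE
    # Detect: timeout or 5xx spike, measured as a failure rate over the window.
    failure_rate = (health.timeouts + health.server_errors) / health.requests
    if failure_rate <= max_failure_rate:
        return Action.STAY
    # Route: the fallback policy must permit switching for this traffic.
    if not fallback_allowed:
        return Action.ESCALATE
    # Verify: the fallback route must have passed its contract smoke test.
    if not smoke_test_passed:
        return Action.ESCALATE
    return Action.FALLBACK
```

For example, `decide(RouteHealth(100, 10, 5), True, True)` returns `Action.FALLBACK`, while the same spike with a failed smoke test returns `Action.ESCALATE`.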
Published and archived reliability notes: 26
Primary lens: SLO
Review cadence: Daily

Failure Map

Reliability notes organized by failure mode

Use these notes to decide what to retry, route, stop, or escalate before production traffic is exposed.

01

Timeouts

Timeout budgets, retries, and safe fallback boundaries for LLM API calls.
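
One way to picture a timeout budget is a retry loop that never spends more than a fixed total amount of time. The sketch below assumes a hypothetical `call` function that accepts a per-attempt timeout in seconds and raises `TimeoutError` when it is exceeded; backoff between attempts is left out for brevity.

```python
import time
from typing import Callable, Optional


def call_with_budget(call: Callable[[float], str],
                     total_budget_s: float,
                     per_attempt_s: float,
                     max_attempts: int = 3,
                     clock: Callable[[], float] = time.monotonic) -> Optional[str]:
    """Retry `call` until it succeeds, attempts run out, or the budget is spent.

    Returns None when no attempt succeeded, which is the boundary where the
    caller would consider a fallback route.
    """
    deadline = clock() + total_budget_s
    for _ in range(max_attempts):
        remaining = deadline - clock()
        if remaining <= 0:
            break  # budget exhausted: stop retrying
        try:
            # Never give one attempt more time than the budget has left.
            return call(min(per_attempt_s, remaining))
        except TimeoutError:
            continue
    return None
```

Capping each attempt at the remaining budget keeps the worst-case latency bounded by `total_budget_s` rather than by `max_attempts * per_attempt_s`.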

02

Rate limits

How to identify quota behavior without hiding customer-impacting failures.
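
As a rough illustration, quota responses can be labeled separately from other errors while still being counted as failures, so that retries do not mask customer impact. The function name and the returned field names are hypothetical; the sketch also assumes `headers` is a plain mapping with the exact key `Retry-After` and only handles the delay-in-seconds form of that header.

```python
from typing import Mapping, Optional


def classify_failure(status: int, headers: Mapping[str, str]) -> dict:
    """Label an HTTP response by failure kind without dropping it from metrics."""
    wait_s: Optional[float] = None
    retry_after = headers.get("Retry-After")
    # Retry-After may also be an HTTP date; this sketch only parses seconds.
    if retry_after is not None and retry_after.strip().isdigit():
        wait_s = float(retry_after.strip())

    if status == 429:
        kind = "rate_limited"
    elif 500 <= status <= 599:
        kind = "server_error"
    elif 400 <= status <= 499:
        kind = "client_error"
    else:
        kind = "ok"

    # A rate-limited call still counts as a failure, even if it is retried later.
    return {"kind": kind, "retry_after_s": wait_s, "count_as_failure": kind != "ok"}
```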

03

Fallback routes

Decision rules for switching providers, models, or degraded modes.

04

Incident checks

Smoke tests and evidence requirements before calling a route production-ready.
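
A contract smoke test can be as small as a function that lists what is wrong with one response. The field names in `REQUIRED_FIELDS` below are placeholders, not any provider's actual schema; take the real fields from the current official docs.

```python
from typing import Any, List, Mapping

# Hypothetical contract: field name -> expected Python type.
REQUIRED_FIELDS = {"id": str, "model": str, "text": str}


def contract_violations(response: Mapping[str, Any],
                        required: Mapping[str, type] = REQUIRED_FIELDS) -> List[str]:
    """Return a list of contract problems; an empty list means the check passed."""
    problems = []
    for name, expected_type in required.items():
        if name not in response:
            problems.append(f"missing field: {name}")
        elif not isinstance(response[name], expected_type):
            problems.append(f"wrong type for {name}: expected {expected_type.__name__}")
    # A well-typed but blank completion is still a failed smoke test.
    if isinstance(response.get("text"), str) and not response["text"].strip():
        problems.append("empty text")
    return problems
```

Returning the list of violations rather than a bare boolean gives the evidence a reviewer needs before a route is called production-ready.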

Latest Runbook Notes

Recent reliability guides

Historical archive entries remain available to readers but are excluded from the RSS feed, sitemap, and llms.txt.

CometAPI chat completions fallback runbook

A contract-first fallback runbook for operators routing chat completion traffic through CometAPI, with monitoring signals, validation steps, and fields to verify from current docs.

Fallback Decision Logs for CometAPI Gateway Calls

A practical guide for operators who want to design, emit, and interpret fallback decision logs when CometAPI gateway calls fail or degrade. Covers log field design, decision taxonomy, smoke-test workflow, and the contract areas you must verify in the official docs.
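
To make the idea of a fallback decision log concrete, here is a minimal sketch of one log record emitted as a JSON line. Every field name is an assumption made for illustration, and the decision values simply mirror the retry, route, stop, or escalate choices named earlier on this page; nothing here reflects a documented CometAPI schema.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class Decision(str, Enum):
    RETRY = "retry"
    FALLBACK = "fallback"
    STOP = "stop"
    ESCALATE = "escalate"


@dataclass
class FallbackDecisionLog:
    request_id: str     # correlation id for the failed or degraded call
    primary_route: str  # route that was tried first
    trigger: str        # observed signal, e.g. "timeout" or "5xx"
    decision: Decision  # what the policy chose to do
    chosen_route: str   # route used after the decision
    evidence: str       # short human-readable justification


def to_log_line(entry: FallbackDecisionLog) -> str:
    """Serialize one decision as a single JSON line with stable key order."""
    record = asdict(entry)
    record["decision"] = entry.decision.value  # store the plain string value
    return json.dumps(record, sort_keys=True)
```

A call such as `to_log_line(FallbackDecisionLog("req-1", "primary", "timeout", Decision.FALLBACK, "secondary", "3 timeouts in 60s"))` yields one line that can be parsed back with `json.loads` when interpreting the log.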