Chat completions fallback runbook for CometAPI
Last reviewed: 2026-05-09
Who this is for: on-call engineers, platform owners, and application operators who run user-facing chat completion workloads through CometAPI and need a practical fallback decision process.
This runbook treats fallback as an operator-owned reliability control. It does not assume that every error should be routed away. Some failures should trigger a retry, some should page the owning team, and some should be stopped because fallback would hide a bad request, expired credential, or unsupported contract assumption.
For broader reliability patterns, keep this page connected to the site index at /sites/llm-api-reliability/ and related operational drafts under /sites/llm-api-reliability/posts/.
Key takeaways
- Use fallback only after classifying the failure signal: transport, provider response, rate-limit/quota, output contract, latency, or client configuration.
- Do not fall back on deterministic client errors such as malformed requests, invalid model identifiers, missing authentication, or schema bugs.
- Keep the CometAPI endpoint path, auth format, request fields, response fields, and error shape tied to the current public documentation, especially the CometAPI API docs and chat completions reference.
- Tune thresholds to your own SLO, traffic shape, and cost controls. The numeric examples below are starting points for drills, not universal rules.
- Validate fallback with injected failures before relying on it during an incident.
Concise definition
A CometAPI chat completions fallback is an application-side policy that decides what to do when a primary chat completion request path is unhealthy.
In this runbook, fallback can mean:
- retrying the same request after a short backoff;
- switching to a secondary configured route;
- returning a degraded but explicit response;
- queueing the request for later processing;
- stopping the request and paging an operator.
It does not mean blindly replaying every failed request through another model or provider.
Evidence basis
The public CometAPI documentation landing page is the first source to check for current API navigation and contract references: https://apidoc.cometapi.com/. The chat completions endpoint reference should be treated as the source of truth for the request and response contract: https://apidoc.cometapi.com/api-13851472. For account, support, and operational questions that are not defined in the endpoint reference, check the CometAPI help center: https://apidoc.cometapi.com/help-center.
Contract details to verify
Before enabling automated fallback, record the exact contract your production client depends on. This avoids a common failure mode: the fallback layer works, but it is protecting an integration that was never aligned with the documented API.
| Contract area | What to verify before production | Operational note | Source to check |
|---|---|---|---|
| Endpoint paths | Confirm the base URL and exact chat completions path used by your client. If your code assumes an OpenAI-compatible path such as /v1/chat/completions, verify that assumption against the endpoint reference. | Log the resolved URL path without query secrets. Alert on unexpected path drift between environments. | CometAPI docs landing page and chat completions reference: docs, endpoint reference |
| Auth headers | Confirm the required authentication header format, token location, and whether any organization, project, or account-scoping header is required. | Do not fall back on 401 or 403 until credential rotation, account status, and environment variables are checked. | CometAPI docs and help center: docs, help center |
| Request fields | Confirm required and optional fields such as model identifier, messages array, streaming flag, sampling parameters, and token limits. | Treat request validation errors as client bugs unless the endpoint reference documents otherwise. | Chat completions reference: endpoint reference |
| Response fields | Confirm the fields your application parses, such as assistant message content, finish reason, usage information, request identifier, and streaming chunks if streaming is enabled. | A response can be HTTP-successful and still fail your application contract. Track parse failures separately from transport failures. | Chat completions reference: endpoint reference |
| Error behavior | Confirm documented status codes, error object shape, retryable conditions, and whether rate-limit responses include retry timing. | If error semantics are not explicit, validate with controlled tests and keep conservative retry rules. | Endpoint reference and help center: endpoint reference, help center |
| Rate-limit or billing assumptions | Confirm whether limits, quota, billing state, or usage exhaustion can affect chat completion calls, and where those events are surfaced. | Do not assume current pricing, availability, or quota behavior from memory. Use current account-facing documentation or dashboard evidence. | Help center and account documentation: help center |
| Fallback compatibility | Confirm that the secondary route accepts the same message format, safety policy, max-token behavior, streaming behavior, and response parser expectations. | A fallback target is not safe just because it accepts a similar JSON body. Run contract tests against each configured target. | Your internal contract tests plus CometAPI endpoint reference |
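A minimal sketch of a response contract check, in Python. It assumes an OpenAI-compatible body with `choices[0].message.content`, `choices[0].finish_reason`, and an optional `usage` object; those field names are assumptions to confirm against the chat completions reference, and `check_response_contract` is a hypothetical helper name.

```python
from typing import Any, List


def check_response_contract(body: Any, require_usage: bool = False) -> List[str]:
    """Return a list of contract problems; an empty list means the body passed.

    Hypothetical helper. Field names assume an OpenAI-compatible shape and
    must be confirmed against the CometAPI chat completions reference.
    """
    if not isinstance(body, dict):
        return ["body is not a JSON object"]
    choices = body.get("choices")
    if not isinstance(choices, list) or not choices:
        return ["missing or empty choices array"]
    first = choices[0]
    if not isinstance(first, dict):
        return ["choices[0] is not an object"]

    problems: List[str] = []
    message = first.get("message")
    if not isinstance(message, dict):
        problems.append("missing choices[0].message")
    else:
        content = message.get("content")
        # Whitespace-only content is treated as empty output.
        if not isinstance(content, str) or not content.strip():
            problems.append("empty or non-string assistant content")
    if first.get("finish_reason") is None:
        problems.append("missing finish_reason")
    if require_usage and not isinstance(body.get("usage"), dict):
        problems.append("missing usage object")
    return problems
```

Counting a non-empty result as a contract failure, separately from transport failures, keeps HTTP-successful but unusable responses visible.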
Monitoring signals checklist
Use this checklist to classify the signal before choosing a fallback action.
| Signal | What to measure | Example trigger to tune | Recommended action | Validation step |
|---|---|---|---|---|
| Connection timeout | Client-side connect timeout, TLS timeout, DNS failure, socket timeout | More than 2 consecutive failures for the same route, or elevated failure rate over 5 minutes | Retry once with jitter. If still failing and user SLO is at risk, route to approved fallback. | Inject connection refusal or blackhole the upstream in staging. Confirm only retryable calls are replayed. |
| HTTP 5xx | Count and rate of server-side errors returned by the API path | 5xx rate above your normal baseline for 3 to 5 minutes | Retry idempotent-safe requests with exponential backoff. Escalate to fallback if burn rate remains high. | Stub 500 and 503 responses. Confirm alert includes status, route, model identifier, and request class. |
| HTTP 429 or quota-like response | Rate-limit responses, quota exhaustion, Retry-After headers if documented | Any sustained 429 on production traffic | Respect documented retry guidance if present. Fall back only if alternate capacity and billing controls are approved. | Run a low-volume controlled test. Confirm your client does not stampede the alternate route. |
| HTTP 400 class validation error | Bad request, invalid field, unsupported parameter, context-too-large, malformed messages | Any repeated 400 on the same release or request builder version | Do not fall back by default. Roll back or fix the client request. | Send a deliberately invalid request in staging. Confirm it opens an integration alert, not a fallback event. |
| HTTP 401 or 403 | Authentication or authorization failure | Any production occurrence unless expected during rotation | Do not fall back until credential and account status are verified. Page platform owner. | Rotate a staging secret to an invalid value. Confirm no alternate provider receives the request. |
| Latency SLO breach | p50, p95, p99, timeout ratio, queue wait, time to first token for streaming | p95 above your user-facing budget for a sustained window | Prefer graceful degradation before full fallback if responses are still correct. | Inject upstream delay and verify timeout boundaries, user messaging, and cancellation behavior. |
| Empty or malformed assistant output | Empty content, invalid JSON where JSON is required, missing expected fields | Parser failure rate above baseline | Retry once if transient. Fall back only if the secondary route is contract-compatible. | Force malformed output in a test harness. Confirm the fallback path revalidates the response. |
| Finish reason or token-budget exhaustion | Responses cut off due to length or equivalent finish state | More than baseline truncation for a request class | Adjust prompt or token budget. Do not treat as provider outage unless correlated with other failures. | Run long-context test prompts. Confirm the alert names the request class and token budget. |
| Streaming interruption | Missing final event, stalled stream, chunk parse error, client disconnect | Stall beyond your stream idle timeout | Cancel and retry non-side-effecting requests if safe. For interactive users, show a clear partial-response state. | Simulate dropped SSE or chunked transfer. Confirm partial content is labeled, not silently accepted. |
| Cost or usage anomaly | Sudden token usage increase, retry amplification, fallback volume spike | Retry/fallback volume above planned budget | Circuit-break fallback and page operator if amplification is detected. | Run a drill with forced failures. Confirm dashboards show primary calls, retries, and fallback calls separately. |
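The per-attempt rows of the checklist can be encoded as a classifier. This sketch assumes `status` is `None` when no HTTP response arrived, and that `"length"` is the truncation finish reason (an OpenAI-compatible convention to verify in the endpoint reference). Latency and cost signals are aggregates, so they belong in dashboards and alerts rather than in this function. `Signal` and `classify` are hypothetical names.

```python
from enum import Enum
from typing import Optional


class Signal(Enum):
    """Per-attempt signal classes from the monitoring checklist (hypothetical names)."""
    TRANSPORT = "transport"
    SERVER_5XX = "server_5xx"
    RATE_LIMIT = "rate_limit"
    CLIENT_VALIDATION = "client_validation"
    AUTH = "auth"
    OUTPUT_CONTRACT = "output_contract"
    TRUNCATION = "truncation"
    OK = "ok"


def classify(status: Optional[int], contract_ok: bool = True,
             finish_reason: Optional[str] = None) -> Signal:
    """Map one attempt's outcome to a signal class.

    status is None when no HTTP response arrived (DNS, connect, TLS, or
    socket timeout). Treating "length" as truncation is an assumed
    OpenAI-compatible convention, not a confirmed CometAPI value.
    """
    if status is None:
        return Signal.TRANSPORT
    if status in (401, 403):
        return Signal.AUTH
    if status == 429:
        return Signal.RATE_LIMIT
    if 400 <= status < 500:
        return Signal.CLIENT_VALIDATION
    if status >= 500:
        return Signal.SERVER_5XX
    # HTTP-successful responses can still fail the application contract.
    if not contract_ok:
        return Signal.OUTPUT_CONTRACT
    if finish_reason == "length":
        return Signal.TRUNCATION
    return Signal.OK
```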
Decision rules for on-call use
Use these rules during an incident:
- Classify the failure first. Determine whether the dominant signal is transport, 5xx, 429, 4xx validation, auth, latency, output contract, or cost amplification.
- Retry only when the request is safe to replay. If the request can trigger external side effects in your application, deduplicate before retrying.
- Fall back only when the alternate route has been pre-approved. It must have compatible request fields, response parsing, safety behavior, data handling, and budget controls.
- Stop on auth and malformed request failures. Falling back on 401, 403, or repeated deterministic 400 errors usually hides the real issue.
- Log the decision, not just the error. Each event should say whether the system retried, fell back, degraded, queued, or failed closed.
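A hedged sketch of these rules as a pure function, so the chosen action can be logged together with a reason. Signal names are plain strings matching a hypothetical classifier; `replay_safe` and `fallback_approved` stand in for your own deduplication and pre-approval checks (for 429, approval should include capacity and billing). Queueing is omitted for brevity, and `decide`, `Action`, and `Decision` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    RETRY = "retry"
    FALLBACK = "fallback"
    DEGRADE = "degrade"
    FAIL_CLOSED = "fail_closed"


@dataclass(frozen=True)
class Decision:
    action: Action
    reason: str  # logged with every event: log the decision, not just the error


# Hypothetical signal names; align them with your own classifier.
RETRYABLE = {"transport", "server_5xx"}
STOP = {"auth", "client_validation"}


def decide(signal: str, retries_done: int, max_retries: int,
           replay_safe: bool, fallback_approved: bool) -> Decision:
    """Apply the on-call decision rules to one classified failure."""
    if signal in STOP:
        return Decision(Action.FAIL_CLOSED, f"{signal}: fix request or credentials, page owner")
    if not replay_safe:
        return Decision(Action.DEGRADE, f"{signal}: not safe to replay without deduplication")
    if signal in RETRYABLE and retries_done < max_retries:
        return Decision(Action.RETRY, f"{signal}: retry {retries_done + 1} of {max_retries}")
    if fallback_approved:
        return Decision(Action.FALLBACK, f"{signal}: pre-approved fallback route")
    return Decision(Action.DEGRADE, f"{signal}: no approved fallback, degrade explicitly")
```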
Minimal sanitized request example
Use this as a shape check, not as a contract guarantee. Confirm the path, required fields, and supported parameters in the CometAPI chat completions reference before production.
```shell
curl -X POST "https://<COMETAPI_BASE_URL>/v1/chat/completions" \
  -H "Authorization: Bearer ${COMETAPI_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<model-id-approved-for-this-workload>",
    "messages": [
      {
        "role": "system",
        "content": "Answer with concise operational guidance."
      },
      {
        "role": "user",
        "content": "Summarize the current incident state from these sanitized symptoms: elevated 5xx and p95 latency."
      }
    ],
    "temperature": 0.2,
    "max_tokens": 300,
    "stream": false
  }'
```
For production, attach your own request correlation identifier in application logs. Do not assume an extra API header is accepted unless the endpoint documentation explicitly supports it.
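For application code, the same request shape can be built with the Python standard library. This sketch only constructs the request; the `/v1/chat/completions` path, the Bearer header, and the sampling values are copied from the curl example and remain assumptions to verify. The correlation ID is kept for local logs and is not sent as a header. `build_chat_request` is a hypothetical name.

```python
import json
import urllib.request
import uuid
from typing import List, Tuple


def build_chat_request(base_url: str, api_key: str, model: str,
                       messages: List[dict]) -> Tuple[urllib.request.Request, str]:
    """Build (but do not send) a chat completions request.

    Hypothetical helper: path and auth header mirror the curl example and
    must be confirmed against the CometAPI endpoint reference.
    """
    body = {
        "model": model,
        "messages": messages,
        "temperature": 0.2,
        "max_tokens": 300,
        "stream": False,
    }
    req = urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    # Correlation ID for application logs only; no extra API header is assumed.
    correlation_id = str(uuid.uuid4())
    return req, correlation_id
```

Sending it would be `urllib.request.urlopen(req, timeout=...)` with a timeout taken from the workload's budget; non-success statuses surface as `urllib.error.HTTPError`, whose `code` attribute carries the status for classification.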
Practical validation steps
1. Build a contract fixture
Create a fixture with one approved request per workload class:
- normal short chat;
- long-context request near your expected token budget;
- structured-output request if your application requires JSON;
- streaming request if you use streaming;
- blocked or rejected request if your application has policy gates.
For each fixture, store:
- request body shape;
- expected required response fields;
- parser expectations;
- timeout budget;
- retry eligibility;
- fallback eligibility.
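One way to store those fixtures is a frozen dataclass per workload class. Field names, timeout values, and the `missing_fields` helper are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ContractFixture:
    """One approved request per workload class (illustrative schema)."""
    workload_class: str
    request_body: dict
    required_response_fields: tuple
    parser: str             # "text" or "json" in this sketch
    timeout_s: float        # placeholder budget; tune to your SLO
    retry_eligible: bool
    fallback_eligible: bool


FIXTURES = [
    ContractFixture(
        workload_class="short_chat",
        request_body={"model": "<model-id>",
                      "messages": [{"role": "user", "content": "ping"}]},
        required_response_fields=("choices", "usage"),
        parser="text",
        timeout_s=15.0,
        retry_eligible=True,
        fallback_eligible=True,
    ),
    ContractFixture(
        workload_class="structured_output",
        request_body={"model": "<model-id>",
                      "messages": [{"role": "user", "content": "Return JSON."}]},
        required_response_fields=("choices",),
        parser="json",
        timeout_s=30.0,
        retry_eligible=True,
        # Not eligible until the secondary route passes the same JSON contract tests.
        fallback_eligible=False,
    ),
]


def missing_fields(fixture: ContractFixture, body: dict) -> List[str]:
    """Return required top-level response fields that are absent from body."""
    return [name for name in fixture.required_response_fields if name not in body]
```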
2. Run failure injection in staging
Inject one failure mode at a time:
- DNS or connection failure;
- upstream timeout;
- HTTP 500;
- HTTP 429;
- HTTP 400 malformed request;
- HTTP 401 invalid credential;
- malformed successful response;
- delayed streaming chunks.
Expected outcome:
- retryable transport and 5xx failures may retry and then fall back;
- 400, 401, and 403 failures should not fall back automatically;
- malformed success responses should be counted as contract failures;
- cost and retry amplification should be visible in dashboards.
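A small harness can assert those expected outcomes before any real route is involved. The policy in `run_scenario` is a hypothetical stand-in for your client: each injected fault fails on every attempt, transport and 5xx faults retry up to a cap and then fall back, and 429 is surfaced as rate-limited rather than auto-routed. Timeout and streaming faults are left out of this sketch.

```python
from typing import Dict, Optional, Tuple

# Injected faults: (name, HTTP status or None for a transport failure, body passes contract)
SCENARIOS = [
    ("connection_failure", None, True),
    ("http_500", 500, True),
    ("http_429", 429, True),
    ("http_400", 400, True),
    ("http_401", 401, True),
    ("malformed_success", 200, False),
]

# Expected outcomes from step 2.
EXPECTED = {
    "connection_failure": "fallback",
    "http_500": "fallback",
    "http_429": "rate_limited",
    "http_400": "fail_closed",
    "http_401": "fail_closed",
    "malformed_success": "contract_failure",
}


def run_scenario(status: Optional[int], body_ok: bool,
                 max_retries: int = 1) -> Tuple[str, int]:
    """Hypothetical stand-in policy; returns (outcome, primary attempts made)."""
    attempts = 0
    while True:
        attempts += 1
        if status == 429:
            return "rate_limited", attempts
        if status is not None and 400 <= status < 500:
            return "fail_closed", attempts
        if status is not None and status < 400:
            return ("ok" if body_ok else "contract_failure"), attempts
        # Transport failure or 5xx: the injected fault repeats on every attempt.
        if attempts > max_retries:
            return "fallback", attempts


def run_all() -> Dict[str, Tuple[str, int]]:
    return {name: run_scenario(status, ok) for name, status, ok in SCENARIOS}
```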
3. Validate observability fields
Every chat completion attempt should emit structured telemetry. At minimum, capture:
- route name, not the secret-bearing URL;
- endpoint path;
- workload class;
- model identifier used by your configuration;
- HTTP status code;
- timeout phase if available;
- retry attempt number;
- fallback decision;
- latency and time to first token if streaming;
- response parser result;
- token usage if provided by the response and documented for the endpoint;
- sanitized error category.
Avoid logging prompts, secrets, full user data, or raw authorization headers.
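A sketch of one telemetry event per attempt, serialized as a JSON log line. Field names and the `AttemptTelemetry` and `path_only` names are illustrative assumptions; `path_only` keeps just the URL path so hosts and query-string secrets never reach logs, matching the contract note about logging the resolved path.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class AttemptTelemetry:
    """One structured event per chat completion attempt (illustrative field names)."""
    route_name: str
    endpoint_path: str
    workload_class: str
    model_id: str
    http_status: Optional[int]
    timeout_phase: Optional[str]
    retry_attempt: int
    fallback_decision: str
    latency_ms: float
    ttft_ms: Optional[float]          # time to first token, streaming only
    parser_result: str
    total_tokens: Optional[int]       # only if the response documents usage
    error_category: str


def path_only(url: str) -> str:
    """Drop scheme, host, and query string so no secret-bearing URL is logged."""
    return urlsplit(url).path


def to_log_line(event: AttemptTelemetry) -> str:
    # The record has no prompt, user data, or header fields by design.
    return json.dumps(asdict(event), sort_keys=True)
```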
4. Drill the on-call decision
Run a 30-minute drill with the application owner and on-call engineer:
- Force 5xx for a small staging traffic slice.
- Confirm retry volume stays inside the configured cap.
- Confirm fallback begins only after the configured trigger.
- Confirm fallback responses pass the same parser.
- Confirm dashboards separate primary, retry, and fallback traffic.
- Confirm the runbook tells the operator when to disable fallback.
Record the result in your internal incident-readiness notes. If this site later publishes a validation template, link it from /sites/llm-api-reliability/editorial/.
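The retry cap checked in the drill can be enforced with a retry budget: retries are allowed only while they stay under a fraction of recent primary calls, with a small floor for low traffic. This is a sketch of that common pattern; `RetryBudget` and its default numbers are hypothetical placeholders to tune.

```python
import time
from collections import deque
from typing import Deque, Optional


class RetryBudget:
    """Allow retries only while they stay under a fraction of recent primary calls.

    Hypothetical sketch: ratio, window_s, and min_retries are placeholders
    to tune against your own SLO and cost limits.
    """

    def __init__(self, ratio: float = 0.1, window_s: float = 60.0, min_retries: int = 3):
        self.ratio = ratio
        self.window_s = window_s
        self.min_retries = min_retries
        self._primary: Deque[float] = deque()
        self._retries: Deque[float] = deque()

    def _trim(self, now: float) -> None:
        # Forget events older than the sliding window.
        cutoff = now - self.window_s
        for q in (self._primary, self._retries):
            while q and q[0] < cutoff:
                q.popleft()

    def record_primary(self, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._trim(now)
        self._primary.append(now)

    def try_retry(self, now: Optional[float] = None) -> bool:
        """Return True and count the retry if the budget allows it."""
        now = time.monotonic() if now is None else now
        self._trim(now)
        allowed = max(self.min_retries, int(len(self._primary) * self.ratio))
        if len(self._retries) < allowed:
            self._retries.append(now)
            return True
        return False
```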
Suggested alert routing
| Alert | Page immediately? | Owner | Notes |
|---|---|---|---|
| Sustained 5xx plus user-visible failures | Yes | On-call application/platform engineer | Fallback may be appropriate if pre-approved. |
| Sustained timeout rate | Yes | Platform engineer | Check network, DNS, upstream latency, and client timeout changes. |
| 429 or quota-like failures | Usually | Platform or account owner | Fallback requires budget and capacity approval. |
| 400 validation spike after deploy | Yes, if production-impacting | Releasing team | Roll back or patch request builder. |
| 401 or 403 | Yes | Secret/account owner | Do not fall back automatically until credentials and account status are verified. |
| Parser failure spike | Yes, if user-facing | Application owner | Revalidate response contract and structured-output assumptions. |
| Fallback volume spike | Yes | Reliability owner | Detect retry storms and cost amplification. |
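The fallback-volume alert pairs naturally with a breaker that stops sending fallback traffic once a planned budget is exhausted. This hypothetical `FallbackBreaker` counts fallback calls in fixed windows and stays open for a cool-down; callers pass a monotonic timestamp, and the limits are placeholders.

```python
from typing import Optional


class FallbackBreaker:
    """Hypothetical breaker that disables fallback after a per-window budget."""

    def __init__(self, max_fallbacks_per_window: int, window_s: float, cooldown_s: float):
        self.max_fallbacks = max_fallbacks_per_window
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self._window_start: Optional[float] = None
        self._count = 0
        self._open_until = float("-inf")

    def allow_fallback(self, now: float) -> bool:
        """Return True if one more fallback call fits the budget at time `now`."""
        if now < self._open_until:
            return False  # breaker open: still cooling down
        if self._window_start is None or now - self._window_start >= self.window_s:
            self._window_start, self._count = now, 0  # start a new fixed window
        if self._count >= self.max_fallbacks:
            self._open_until = now + self.cooldown_s  # budget exhausted: open
            return False
        self._count += 1
        return True
```

When `allow_fallback` returns `False`, serve the degraded path and page the reliability owner rather than retrying the primary harder.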
FAQ
Should every failed chat completion request fall back automatically?
No. Automatic fallback is safest for transient transport failures, selected 5xx responses, and latency events where the alternate route is already validated. It is usually unsafe for malformed requests, auth failures, unsupported parameters, or schema bugs.
Is a retry the same as fallback?
No. A retry repeats the request against the same route, usually after a short delay. Fallback changes behavior: a secondary route, degraded response, queue, or fail-closed path. Track them separately.
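The distinction can be sketched as a wrapper that retries one route with capped exponential backoff and full jitter, then switches to a separate fallback callable. `RetryableError` is a hypothetical exception your client would raise for transport and 5xx failures; any other exception propagates, so auth and validation errors fail closed. The delay defaults are placeholders.

```python
import random
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Hypothetical: raised by a route call for transport or 5xx failures."""


def call_with_retry_then_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    max_retries: int = 1,
    base_delay_s: float = 0.2,
    max_delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[T, str]:
    """Retry the same route with full-jitter backoff, then switch routes.

    Returns (result, path) where path is "primary", "retry", or "fallback".
    Non-retryable exceptions propagate and never reach the fallback route.
    """
    for attempt in range(max_retries + 1):
        try:
            return primary(), ("primary" if attempt == 0 else "retry")
        except RetryableError:
            if attempt == max_retries:
                break
            # Full jitter: sleep a random amount up to the capped exponential delay.
            sleep(random.uniform(0, min(max_delay_s, base_delay_s * 2 ** attempt)))
    return fallback(), "fallback"
```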
What is the most important metric?
There is no single metric. Combine status code rate, timeout rate, latency, parser failure rate, fallback volume, and user-visible failure rate. A low HTTP error rate can still hide malformed successful responses.
Should 429 always trigger fallback?
Not always. A 429-style signal may indicate rate limiting, quota, or account-level constraints. Fallback can be appropriate only if the secondary route has approved capacity, compatible behavior, and cost controls.
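If the endpoint reference documents retry timing for 429 responses, a client can honor it before considering fallback. This sketch parses the standard HTTP `Retry-After` header, which may carry either delay-seconds or an HTTP-date; whether CometAPI sends this header is an assumption to verify, and `retry_after_seconds` is a hypothetical name.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(header_value: Optional[str],
                        now: Optional[datetime] = None) -> Optional[float]:
    """Parse a Retry-After value into seconds to wait.

    Returns None when there is no usable guidance, so the caller falls
    back to its own conservative backoff. Whether CometAPI sends this
    header at all is an assumption to verify.
    """
    if not header_value:
        return None
    value = header_value.strip()
    if value.isascii() and value.isdigit():
        return float(value)  # delay-seconds form
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):  # TypeError on Python < 3.10
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```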
Can fallback hide product bugs?
Yes. If the client sends an invalid request or parses the response incorrectly, fallback may make the incident harder to diagnose. That is why this runbook separates deterministic 4xx and parser failures from retryable transport failures.
How often should this runbook be reviewed?
Review it whenever the CometAPI endpoint documentation changes, when you add a new model route, after any production incident, and at least once per quarter for high-traffic workloads.
Sources checked
| Source | Access date | Purpose |
|---|---|---|
| CometAPI API documentation | 2026-05-09 | Checked as the public documentation entry point for API references and integration navigation. |
| CometAPI chat completions endpoint reference | 2026-05-09 | Checked as the evidence source for chat completion request and response contract verification. |
| CometAPI help center | 2026-05-09 | Checked for account, support, billing, and operational guidance that may affect fallback decisions. |