CometAPI chat completions incident review contract

Last reviewed: 2026-05-10

Who this is for: operators and platform engineers who already call CometAPI chat completions in production and need a repeatable post-incident review that separates provider contract issues from client-side assumptions.

This draft focuses on the API contract you should verify after an incident, not on generic uptime monitoring. Use it alongside your existing runbooks in /sites/llm-api-reliability/posts/ and keep site-level reliability notes linked from /sites/llm-api-reliability/ .

Key takeaways

Treat the CometAPI chat completions documentation as the contract baseline for endpoint, authentication, request shape, and response parsing.
During incident review, confirm whether the failure came from a contract mismatch, transient transport error, model-side refusal, malformed client request, timeout budget, or downstream parsing assumption.
Record the exact request class that failed: streaming or non-streaming, model identifier, message size, timeout, retry count, and fallback behavior.
Avoid making fallback decisions from HTTP status alone; include response body shape, retryability, latency, and idempotency risk.
Re-run a sanitized validation request after mitigation, then compare headers, status code, response fields, and usage fields against your parser expectations.

Concise definition

A chat completions reliability contract is the set of assumptions your application makes about a chat-completion API call: endpoint path, authentication, required request fields, optional request fields, response schema, error behavior, timeout limits, retry handling, billing or usage metadata, and fallback triggers.

For CometAPI, the public API reference page for chat completions is the primary source to check before you update those assumptions: CometAPI API documentation .

Why this incident review is different from a smoke test

A smoke test asks, “Can one known request succeed right now?”

An incident review asks:

Which contract assumption failed?
Did the client respond safely?
Did fallback preserve user experience without hiding a provider or integration issue?
Did observability capture enough detail to prevent recurrence?
Did the remediation update code, configuration, alerts, and documentation?

That distinction matters because a single successful request after an outage does not prove your production request classes are safe. For example, a short non-streaming prompt may pass while a large conversation, a streaming request, or a request using a less common parameter still fails.

Contract details to verify

Use this table during the post-incident review. The “expected value” column should be filled from your production configuration and checked against the CometAPI documentation page, not copied blindly from an SDK default.

Contract area	What to verify	Operational review question	Source supporting the check
Endpoint paths	Exact base URL and chat completions path used by production clients	Did the failing service call the documented CometAPI chat completions endpoint, or did it use a stale path, proxy rewrite, or environment-specific override?	CometAPI chat completions API reference: https://apidoc.cometapi.com/api-13851472
Auth headers	Required authentication header format and secret source	Was the incident caused by missing, expired, rotated, malformed, or environment-mismatched credentials?	CometAPI chat completions API reference: https://apidoc.cometapi.com/api-13851472
Request fields	Required fields such as model selection and message payload; optional fields used by your app	Did the failing request include unsupported, misspelled, null, oversized, or environment-specific fields?	CometAPI chat completions API reference: https://apidoc.cometapi.com/api-13851472
Response fields	Fields your parser requires, including choices/message content and any usage metadata your system records	Did your parser assume a field was always present when the API can return a different shape for errors, refusals, streaming, or partial responses?	CometAPI chat completions API reference: https://apidoc.cometapi.com/api-13851472
Error behavior	HTTP status codes, error response body shape, and retryability rules observed in the incident	Did the client classify the error correctly, or did it retry non-retryable failures and fail fast on retryable ones?	CometAPI chat completions API reference plus your incident logs
Rate-limit or billing assumptions	Whether usage fields, rate-limit headers, quota errors, or billing-relevant metadata are documented and captured	Did the incident involve quota, throttling, or unexpected usage accounting, and did monitoring distinguish those from generic 5xx failures?	CometAPI chat completions API reference plus your account telemetry
Timeout behavior	Client timeout, upstream proxy timeout, load balancer timeout, and fallback timeout	Did one layer cancel the request earlier than expected, causing duplicate retries or fallback races?	Your production configuration and incident traces
Streaming behavior	Whether streaming was enabled and how chunks, termination, and parser errors were handled	Did the client treat an interrupted stream as success, failure, retryable partial output, or fallback-required?	CometAPI chat completions API reference plus stream logs if applicable

Incident review checklist

1. Freeze the failing request class

Before changing code or keys, capture a sanitized version of the request class that failed.

Record:

service name and deployment version
environment, region, and egress path
endpoint path used by the client
model identifier sent in the request
streaming versus non-streaming mode
timeout values at client, proxy, queue, and worker layers
retry policy and retry count
fallback policy and fallback target
HTTP status code, error body shape, and response latency
correlation ID or trace ID
whether the response reached the parser
whether usage or billing metadata was recorded

Do not store raw user prompts in the incident ticket. Store a redacted prompt category, token-size bucket, and message-count bucket instead.

2. Compare observed behavior with the documented contract

Use the CometAPI chat completions API reference as the external contract baseline: https://apidoc.cometapi.com/api-13851472 .

Check the production request against the documentation:

Is the endpoint path still correct?
Is the authentication header formatted as documented?
Are required request fields present?
Are optional request fields supported for the request mode you used?
Is the response parser compatible with both success and error payloads?
Are streaming and non-streaming responses handled separately?
Are usage fields treated as optional unless your source confirms they are always present?
Are undocumented response fields avoided as hard dependencies?

If your client uses an OpenAI-compatible SDK or wrapper, verify the actual wire request. Do not assume SDK compatibility proves the contract.

3. Classify the incident by failure mode

Use a classification that helps action owners fix the right layer.

Failure mode	Typical evidence	Likely owner	Review action
Auth failure	401/403-like status, key rotation event, missing header	Platform or secrets owner	Confirm secret source, rotation timing, deploy order, and alert coverage
Request contract failure	400-like status, validation error, unsupported field	Application owner	Remove or gate the field; add schema validation before sending
Rate or quota pressure	Throttle-like status, burst traffic, queue growth	Platform or capacity owner	Tune concurrency, backoff, and traffic shaping
Provider/server transient	5xx-like status, elevated latency, intermittent success	Reliability owner	Confirm retry budget and fallback threshold
Client timeout	Request canceled before provider response	Application or network owner	Align client, proxy, and fallback timeouts
Parser failure	HTTP success but application error after response	Application owner	Relax parser assumptions; add contract tests
Fallback failure	Primary fails and fallback also fails or loops	Reliability owner	Verify fallback isolation, prompt compatibility, and circuit breaker state

4. Reconstruct the decision timeline

For each incident, create a minute-by-minute timeline with these events:

First elevated error, timeout, or latency observation.
First alert fired.
First fallback activation.
First customer-visible degradation, if known.
First manual mitigation.
First successful validation request after mitigation.
Full recovery time.
Post-recovery configuration or code change.

Then answer:

Did fallback start before or after the user-visible timeout?
Did retries increase load during the incident?
Did the system keep sending traffic to a known-bad path after circuit-break conditions were met?
Did operators have enough evidence to distinguish CometAPI-side behavior from local client behavior?

Sanitized validation example

The following curl-style request is intentionally generic. Replace placeholders with values from your verified CometAPI documentation and your environment. Do not paste production prompts or secrets into tickets.

curl -sS -X POST “$COMETAPI_BASE_URL/v1/chat/completions”
-H “Authorization: Bearer $COMETAPI_API_KEY”
-H “Content-Type: application/json”
–max-time 30
-d ‘{ “model”: “REPLACE_WITH_VERIFIED_MODEL_ID”, “messages”: [ { “role”: “system”, “content”: “Return a short operational acknowledgement.” }, { “role”: “user”, “content”: “Validation request after incident INC-REDACTED. No customer data.” } ], “temperature”: 0 }’

Validation notes:

Use a non-customer prompt.
Use the same endpoint path and auth method as production.
Start with non-streaming mode unless the incident was streaming-specific.
Record status code, latency, response body shape, and parser result.
Treat the 30-second timeout above as an example to tune, not a universal recommendation.
If production uses streaming, run a separate streaming validation that checks chunk handling and stream termination.

Practical validation steps after mitigation

Step 1: Validate from the same network path

Run the validation request from:

the affected service environment
the same region or cluster
the same egress proxy path
the same secret source

A request from a laptop or unrelated CI runner is useful for comparison, but it does not prove the production network path is healthy.

Step 2: Validate the parser, not only the HTTP response

A successful HTTP response is not enough. Confirm that your application can:

parse the response without null-reference errors
extract the generated message content only when present
handle empty, refused, or policy-driven outputs according to your product rules
record usage metadata only when present and documented
avoid treating an error payload as a successful completion

Step 3: Validate fallback trigger boundaries

Replay the incident class with controlled failure injection where possible.

Examples to tune:

force a synthetic timeout before the provider responds
return a synthetic 429-like throttle response from a test proxy
return a synthetic 500-like upstream failure
return a malformed success body to test parser guards
interrupt a streaming response mid-generation

For each case, verify:

one fallback decision per user request
bounded retry count
no retry storm
no duplicate charge-sensitive workflow if your business logic has side effects
user-visible error is acceptable when fallback is unsafe

Step 4: Check retry and timeout budgets together

Retries and timeouts must fit inside the user experience budget.

For example, if a user-facing endpoint has a 20-second response budget, three sequential 15-second provider attempts cannot succeed from a product perspective even if the final API call eventually returns. Use your own latency SLOs, not a generic value.

Review:

max attempts
per-attempt timeout
total deadline
backoff strategy
jitter
circuit-break threshold
fallback timeout
cancellation propagation

Step 5: Verify observability fields

At minimum, log structured metadata that lets operators answer contract questions without exposing prompt content:

provider name
endpoint class
model identifier
request mode: streaming or non-streaming
status code
error code or error type if present
latency bucket
retry attempt
fallback decision
timeout source
response parser outcome
usage metadata presence, not necessarily raw values
correlation ID

Keep prompt text, user identifiers, and secrets out of logs unless your data policy explicitly permits them.

What to update after the review

Update more than the immediate code path.

Runbook: Add the exact validation command shape and safe placeholders.
Contract tests: Add tests for success, error, timeout, parser failure, and fallback.
Configuration: Pin endpoint path, timeout, and retry values in reviewable config.
Alerts: Split auth failures, throttle-like failures, timeout failures, and parser failures where practical.
Dashboards: Add charts for fallback rate, retry rate, timeout source, and response-parse failures.
Incident template: Add contract fields from the table above.
Editorial documentation: Link the reviewed pattern from /sites/llm-api-reliability/editorial/ if it becomes a standard operating pattern.

FAQ

Is a successful CometAPI validation request enough to close an incident?

Usually no. It proves one request worked at one time. Close the incident only after you also verify the failing request class, parser behavior, retry/fallback behavior, and observability gaps.

Should we retry every failed chat completion request?

No. Retry policy should depend on failure type, request idempotency, timeout budget, and whether retrying can increase load or duplicate work. Treat retry thresholds as system-specific values to tune.

Should fallback trigger on any non-2xx response?

Not automatically. Some client-side contract errors should be fixed rather than hidden by fallback. Fallback is more appropriate for bounded transient failures, timeout conditions, or provider degradation when the fallback path is safe for the user task.

What if CometAPI returns a response shape our parser did not expect?

Capture the sanitized response shape, compare it with the documented API reference, and update your parser to handle documented variants. If the shape is undocumented or ambiguous, avoid building a hard dependency on it until verified.

Can we use production prompts for validation?

Avoid it. Use sanitized prompts that exercise the same request mode and approximate size without exposing customer data. If you must reproduce a customer-specific issue, follow your privacy and security review process.

Where should this checklist live?

Keep the operational version in your incident runbook and link it from your LLM API reliability index at /sites/llm-api-reliability/ or your post archive at /sites/llm-api-reliability/posts/ .

Sources checked

Source	Access date	Purpose
CometAPI chat completions API reference — https://apidoc.cometapi.com/api-13851472	2026-05-10	Primary contract source for checking endpoint, authentication, request fields, response fields, and documented API behavior before updating production assumptions.