CometAPI chat completions incident review contract
Last reviewed: 2026-05-10
Who this is for: operators and platform engineers who already call CometAPI chat completions in production and need a repeatable post-incident review that separates provider contract issues from client-side assumptions.
This draft focuses on the API contract you should verify after an incident, not on generic uptime monitoring. Use it alongside your existing runbooks in /sites/llm-api-reliability/posts/ and keep site-level reliability notes linked from /sites/llm-api-reliability/.
Key takeaways
- Treat the CometAPI chat completions documentation as the contract baseline for endpoint, authentication, request shape, and response parsing.
- During incident review, confirm whether the failure came from a contract mismatch, transient transport error, model-side refusal, malformed client request, timeout budget, or downstream parsing assumption.
- Record the exact request class that failed: streaming or non-streaming, model identifier, message size, timeout, retry count, and fallback behavior.
- Avoid making fallback decisions from HTTP status alone; include response body shape, retryability, latency, and idempotency risk.
- Re-run a sanitized validation request after mitigation, then compare headers, status code, response fields, and usage fields against your parser expectations.
Concise definition
A chat completions reliability contract is the set of assumptions your application makes about a chat-completion API call: endpoint path, authentication, required request fields, optional request fields, response schema, error behavior, timeout limits, retry handling, billing or usage metadata, and fallback triggers.
For CometAPI, the public API reference page for chat completions is the primary source to check before you update those assumptions: CometAPI API documentation.
Why this incident review is different from a smoke test
A smoke test asks, “Can one known request succeed right now?”
An incident review asks:
- Which contract assumption failed?
- Did the client respond safely?
- Did fallback preserve user experience without hiding a provider or integration issue?
- Did observability capture enough detail to prevent recurrence?
- Did the remediation update code, configuration, alerts, and documentation?
That distinction matters because a single successful request after an outage does not prove your production request classes are safe. For example, a short non-streaming prompt may pass while a large conversation, a streaming request, or a request using a less common parameter still fails.
Contract details to verify
Use this table during the post-incident review. The “expected value” column should be filled from your production configuration and checked against the CometAPI documentation page, not copied blindly from an SDK default.
| Contract area | What to verify | Operational review question | Source supporting the check |
|---|---|---|---|
| Endpoint paths | Exact base URL and chat completions path used by production clients | Did the failing service call the documented CometAPI chat completions endpoint, or did it use a stale path, proxy rewrite, or environment-specific override? | CometAPI chat completions API reference: https://apidoc.cometapi.com/api-13851472 |
| Auth headers | Required authentication header format and secret source | Was the incident caused by missing, expired, rotated, malformed, or environment-mismatched credentials? | CometAPI chat completions API reference: https://apidoc.cometapi.com/api-13851472 |
| Request fields | Required fields such as model selection and message payload; optional fields used by your app | Did the failing request include unsupported, misspelled, null, oversized, or environment-specific fields? | CometAPI chat completions API reference: https://apidoc.cometapi.com/api-13851472 |
| Response fields | Fields your parser requires, including choices/message content and any usage metadata your system records | Did your parser assume a field was always present when the API can return a different shape for errors, refusals, streaming, or partial responses? | CometAPI chat completions API reference: https://apidoc.cometapi.com/api-13851472 |
| Error behavior | HTTP status codes, error response body shape, and retryability rules observed in the incident | Did the client classify the error correctly, or did it retry non-retryable failures and fail fast on retryable ones? | CometAPI chat completions API reference plus your incident logs |
| Rate-limit or billing assumptions | Whether usage fields, rate-limit headers, quota errors, or billing-relevant metadata are documented and captured | Did the incident involve quota, throttling, or unexpected usage accounting, and did monitoring distinguish those from generic 5xx failures? | CometAPI chat completions API reference plus your account telemetry |
| Timeout behavior | Client timeout, upstream proxy timeout, load balancer timeout, and fallback timeout | Did one layer cancel the request earlier than expected, causing duplicate retries or fallback races? | Your production configuration and incident traces |
| Streaming behavior | Whether streaming was enabled and how chunks, termination, and parser errors were handled | Did the client treat an interrupted stream as success, failure, retryable partial output, or fallback-required? | CometAPI chat completions API reference plus stream logs if applicable |
Incident review checklist
1. Freeze the failing request class
Before changing code or keys, capture a sanitized version of the request class that failed.
Record:
- service name and deployment version
- environment, region, and egress path
- endpoint path used by the client
- model identifier sent in the request
- streaming versus non-streaming mode
- timeout values at client, proxy, queue, and worker layers
- retry policy and retry count
- fallback policy and fallback target
- HTTP status code, error body shape, and response latency
- correlation ID or trace ID
- whether the response reached the parser
- whether usage or billing metadata was recorded
Do not store raw user prompts in the incident ticket. Store a redacted prompt category, token-size bucket, and message-count bucket instead.
2. Compare observed behavior with the documented contract
Use the CometAPI chat completions API reference as the external contract baseline: https://apidoc.cometapi.com/api-13851472.
Check the production request against the documentation:
- Is the endpoint path still correct?
- Is the authentication header formatted as documented?
- Are required request fields present?
- Are optional request fields supported for the request mode you used?
- Is the response parser compatible with both success and error payloads?
- Are streaming and non-streaming responses handled separately?
- Are usage fields treated as optional unless your source confirms they are always present?
- Are undocumented response fields avoided as hard dependencies?
If your client uses an OpenAI-compatible SDK or wrapper, verify the actual wire request. Do not assume SDK compatibility proves the contract.
3. Classify the incident by failure mode
Use a classification that helps action owners fix the right layer.
| Failure mode | Typical evidence | Likely owner | Review action |
|---|---|---|---|
| Auth failure | 401/403-like status, key rotation event, missing header | Platform or secrets owner | Confirm secret source, rotation timing, deploy order, and alert coverage |
| Request contract failure | 400-like status, validation error, unsupported field | Application owner | Remove or gate the field; add schema validation before sending |
| Rate or quota pressure | Throttle-like status, burst traffic, queue growth | Platform or capacity owner | Tune concurrency, backoff, and traffic shaping |
| Provider/server transient | 5xx-like status, elevated latency, intermittent success | Reliability owner | Confirm retry budget and fallback threshold |
| Client timeout | Request canceled before provider response | Application or network owner | Align client, proxy, and fallback timeouts |
| Parser failure | HTTP success but application error after response | Application owner | Relax parser assumptions; add contract tests |
| Fallback failure | Primary fails and fallback also fails or loops | Reliability owner | Verify fallback isolation, prompt compatibility, and circuit breaker state |
4. Reconstruct the decision timeline
For each incident, create a minute-by-minute timeline with these events:
- First elevated error, timeout, or latency observation.
- First alert fired.
- First fallback activation.
- First customer-visible degradation, if known.
- First manual mitigation.
- First successful validation request after mitigation.
- Full recovery time.
- Post-recovery configuration or code change.
Then answer:
- Did fallback start before or after the user-visible timeout?
- Did retries increase load during the incident?
- Did the system keep sending traffic to a known-bad path after circuit-break conditions were met?
- Did operators have enough evidence to distinguish CometAPI-side behavior from local client behavior?
Sanitized validation example
The following curl-style request is intentionally generic. Replace placeholders with values from your verified CometAPI documentation and your environment. Do not paste production prompts or secrets into tickets.
curl -sS -X POST “$COMETAPI_BASE_URL/v1/chat/completions”
-H “Authorization: Bearer $COMETAPI_API_KEY”
-H “Content-Type: application/json”
–max-time 30
-d ‘{
“model”: “REPLACE_WITH_VERIFIED_MODEL_ID”,
“messages”: [
{
“role”: “system”,
“content”: “Return a short operational acknowledgement.”
},
{
“role”: “user”,
“content”: “Validation request after incident INC-REDACTED. No customer data.”
}
],
“temperature”: 0
}’
Validation notes:
- Use a non-customer prompt.
- Use the same endpoint path and auth method as production.
- Start with non-streaming mode unless the incident was streaming-specific.
- Record status code, latency, response body shape, and parser result.
- Treat the 30-second timeout above as an example to tune, not a universal recommendation.
- If production uses streaming, run a separate streaming validation that checks chunk handling and stream termination.
Practical validation steps after mitigation
Step 1: Validate from the same network path
Run the validation request from:
- the affected service environment
- the same region or cluster
- the same egress proxy path
- the same secret source
A request from a laptop or unrelated CI runner is useful for comparison, but it does not prove the production network path is healthy.
Step 2: Validate the parser, not only the HTTP response
A successful HTTP response is not enough. Confirm that your application can:
- parse the response without null-reference errors
- extract the generated message content only when present
- handle empty, refused, or policy-driven outputs according to your product rules
- record usage metadata only when present and documented
- avoid treating an error payload as a successful completion
Step 3: Validate fallback trigger boundaries
Replay the incident class with controlled failure injection where possible.
Examples to tune:
- force a synthetic timeout before the provider responds
- return a synthetic 429-like throttle response from a test proxy
- return a synthetic 500-like upstream failure
- return a malformed success body to test parser guards
- interrupt a streaming response mid-generation
For each case, verify:
- one fallback decision per user request
- bounded retry count
- no retry storm
- no duplicate charge-sensitive workflow if your business logic has side effects
- user-visible error is acceptable when fallback is unsafe
Step 4: Check retry and timeout budgets together
Retries and timeouts must fit inside the user experience budget.
For example, if a user-facing endpoint has a 20-second response budget, three sequential 15-second provider attempts cannot succeed from a product perspective even if the final API call eventually returns. Use your own latency SLOs, not a generic value.
Review:
- max attempts
- per-attempt timeout
- total deadline
- backoff strategy
- jitter
- circuit-break threshold
- fallback timeout
- cancellation propagation
Step 5: Verify observability fields
At minimum, log structured metadata that lets operators answer contract questions without exposing prompt content:
- provider name
- endpoint class
- model identifier
- request mode: streaming or non-streaming
- status code
- error code or error type if present
- latency bucket
- retry attempt
- fallback decision
- timeout source
- response parser outcome
- usage metadata presence, not necessarily raw values
- correlation ID
Keep prompt text, user identifiers, and secrets out of logs unless your data policy explicitly permits them.
What to update after the review
Update more than the immediate code path.
- Runbook: Add the exact validation command shape and safe placeholders.
- Contract tests: Add tests for success, error, timeout, parser failure, and fallback.
- Configuration: Pin endpoint path, timeout, and retry values in reviewable config.
- Alerts: Split auth failures, throttle-like failures, timeout failures, and parser failures where practical.
- Dashboards: Add charts for fallback rate, retry rate, timeout source, and response-parse failures.
- Incident template: Add contract fields from the table above.
- Editorial documentation: Link the reviewed pattern from /sites/llm-api-reliability/editorial/ if it becomes a standard operating pattern.
FAQ
Is a successful CometAPI validation request enough to close an incident?
Usually no. It proves one request worked at one time. Close the incident only after you also verify the failing request class, parser behavior, retry/fallback behavior, and observability gaps.
Should we retry every failed chat completion request?
No. Retry policy should depend on failure type, request idempotency, timeout budget, and whether retrying can increase load or duplicate work. Treat retry thresholds as system-specific values to tune.
Should fallback trigger on any non-2xx response?
Not automatically. Some client-side contract errors should be fixed rather than hidden by fallback. Fallback is more appropriate for bounded transient failures, timeout conditions, or provider degradation when the fallback path is safe for the user task.
What if CometAPI returns a response shape our parser did not expect?
Capture the sanitized response shape, compare it with the documented API reference, and update your parser to handle documented variants. If the shape is undocumented or ambiguous, avoid building a hard dependency on it until verified.
Can we use production prompts for validation?
Avoid it. Use sanitized prompts that exercise the same request mode and approximate size without exposing customer data. If you must reproduce a customer-specific issue, follow your privacy and security review process.
Where should this checklist live?
Keep the operational version in your incident runbook and link it from your LLM API reliability index at /sites/llm-api-reliability/ or your post archive at /sites/llm-api-reliability/posts/.
Sources checked
| Source | Access date | Purpose |
|---|---|---|
| CometAPI chat completions API reference — https://apidoc.cometapi.com/api-13851472 | 2026-05-10 | Primary contract source for checking endpoint, authentication, request fields, response fields, and documented API behavior before updating production assumptions. |