Build an On-call Evidence Packet for LLM API Incidents

Last reviewed: 2026-06-27

Direct answer

An on-call evidence packet for an LLM API incident should capture only what the operator can verify during the incident window: the affected route or integration, time range, HTTP status class, error type, retry count, fallback decision, trace or request identifier, escalation reference, and the exact public contract details that were checked before a decision was made. It should not assert vendor uptime, account limits, pricing impact, model availability, latency guarantees, rate-limit behavior, or billing behavior unless those details are confirmed in public documentation or in the team’s own account system.

The point of the packet is not to prove that an API is reliable. The point is to make the incident reviewable without guesswork. A good packet lets another engineer answer three questions: what failed, what evidence was available, and what action was taken because of that evidence. For adjacent fallback structure, compare this article with the Source-Backed LLM API Fallback Checklist and Retry Budget Evidence for Safer LLM API Calls.

A practical smoke-test workflow:

Setup assumptions: the operator has an approved test account, a non-production credential stored outside the runbook, a known test route, and a short test window agreed with the incident lead.
Happy-path request plan: send one minimal known-good request through the same integration path used by the service, using <API_KEY_PLACEHOLDER> in any shared example and recording only sanitized request metadata.
Error-path check: send one intentionally invalid or incomplete request that is safe for the test environment, then confirm the client captures a reviewable error class, HTTP status, trace identifier, and fallback decision.
Minimum assertions: record whether the client received a response, whether the response shape matched the documented contract areas checked by the operator, whether telemetry fields were present, and whether retry or fallback logic stayed within the service team’s own budget.
Pass/fail logging fields: incident_id, test_window_utc, service_route, http_status_class, error_type, trace_id, fallback_decision, retry_attempt_count, support_reference, operator_initials, result. In shared examples, fill each field with a placeholder value rather than real data.
What not to assert: do not claim current model availability, exact latency, uptime, billing impact, pricing, quotas, or rate-limit behavior from this smoke test alone.
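
The pass, fail, or inconclusive decision in the workflow above can be sketched as a small function. This is a minimal sketch, not a definitive implementation: the names ProbeResult, status_class, and evaluate_smoke_test are hypothetical, and the rule that the error-path probe should surface a 4xx is an assumption that each team should replace with its own expectation. No network call is made; the operator fills in the sanitized probe outcomes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    """Sanitized outcome of one probe; no prompt or response body is kept."""
    http_status: Optional[int]  # None means no response was received
    trace_id: Optional[str]     # None means telemetry did not capture an ID
    shape_ok: Optional[bool]    # None means the contract check was not run


def status_class(http_status: Optional[int]) -> str:
    """Collapse a status code into a low-cardinality class such as '4xx'."""
    if http_status is None or not 100 <= http_status <= 599:
        return "none"
    return f"{http_status // 100}xx"


def evaluate_smoke_test(happy: ProbeResult, error: ProbeResult) -> str:
    """Return 'pass', 'fail', or 'inconclusive' for the two-probe workflow."""
    # Missing evidence is recorded as inconclusive rather than guessed.
    if happy.http_status is None or error.http_status is None:
        return "inconclusive"
    if happy.trace_id is None or error.trace_id is None:
        return "inconclusive"
    if happy.shape_ok is None:
        return "inconclusive"
    happy_ok = status_class(happy.http_status) == "2xx" and happy.shape_ok
    # Assumption: a safe invalid request should surface as a reviewable 4xx.
    error_ok = status_class(error.http_status) == "4xx"
    return "pass" if happy_ok and error_ok else "fail"
```

The function deliberately returns inconclusive whenever a status, trace identifier, or contract check is missing, so an absent field never turns into a pass.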

Sanitized log-record template:

incident_id: INC-PLACEHOLDER
service_route: llm-gateway-placeholder
test_window_utc: 2026-06-27T00:00:00Z/2026-06-27T00:05:00Z
http_status_class: 2xx-or-4xx-or-5xx
error_type: placeholder_error_type
trace_id: trace-placeholder
fallback_decision: primary|fallback|hold
retry_attempt_count: placeholder_count
support_reference: support-placeholder
result: pass|fail|inconclusive
notes: sanitized operator note only
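
If the team prefers to generate this record from code rather than fill it in by hand, one possible shape is a small dataclass that rejects unknown decision and result values and renders the same key: value lines. The class name EvidenceRecord and its validation rules are illustrative assumptions; the field names mirror the template above.

```python
from dataclasses import dataclass, fields

FALLBACK_DECISIONS = {"primary", "fallback", "hold"}
RESULTS = {"pass", "fail", "inconclusive"}


@dataclass
class EvidenceRecord:
    incident_id: str
    service_route: str
    test_window_utc: str
    http_status_class: str
    error_type: str
    trace_id: str
    fallback_decision: str
    retry_attempt_count: int
    support_reference: str
    result: str
    notes: str = ""

    def __post_init__(self) -> None:
        # Reject values outside the template's allowed vocabulary.
        if self.fallback_decision not in FALLBACK_DECISIONS:
            raise ValueError(f"unknown fallback_decision: {self.fallback_decision}")
        if self.result not in RESULTS:
            raise ValueError(f"unknown result: {self.result}")
        if self.retry_attempt_count < 0:
            raise ValueError("retry_attempt_count must not be negative")

    def render(self) -> str:
        """Render one 'name: value' line per field, in template order."""
        return "\n".join(f"{f.name}: {getattr(self, f.name)}" for f in fields(self))
```

Validation happens at construction time, so a record with a misspelled result never reaches the incident note.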

Who this is for

This guide is for on-call engineers, platform owners, incident commanders, and reliability reviewers who need a compact evidence packet for LLM API failures. It is most useful when the incident involves HTTP errors, overload symptoms, unexpected response shapes, retry decisions, fallback promotion, or escalation to an API support channel.

It is also useful for teams that have several older runbooks and need one neutral evidence format that does not depend on a single provider. If a team already has provider-specific checks, keep those checks, but put the incident-facing evidence into a common packet so reviewers can compare primary traffic, fallback traffic, and held traffic with the same fields.

Key takeaways

Keep the packet evidence-first: timestamps, HTTP telemetry, route names, trace identifiers, retry counts, fallback decisions, and support references are safer than broad reliability claims.
Use overload guidance to avoid turning one failed request into a retry storm. A retry decision should be bounded, locally explainable, and tied to the team’s own retry budget.
Use HTTP semantic conventions to keep span and metric fields consistent enough for incident review, especially status class, route, method, error type, and low-cardinality labels.
Verify request and response contract areas in the current API documentation before relying on any endpoint, field, or response-shape assumption.
Treat pricing, quotas, model availability, latency, and account-specific support behavior as separate checks unless a current source directly supports the exact statement.
Prefer an inconclusive result over an invented conclusion. If the evidence packet cannot support a claim, record what is missing and stop there.
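
The bounded retry decision from the takeaways above could look like the following sketch. The function name call_with_retry_budget, the stop_reason labels, and the choice to retry only 5xx responses are assumptions for illustration; the send callable is a hypothetical stand-in for the service's own client and is expected to return an integer HTTP status code. Backoff and jitter are omitted to keep the example short.

```python
from typing import Callable, Dict

RETRYABLE_CLASSES = {"5xx"}  # assumption: only server-side failures are retried


def call_with_retry_budget(send: Callable[[], int], max_attempts: int) -> Dict[str, object]:
    """Call `send` until it stops failing retryably or the budget is spent."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempts = 0
    status_class = "none"
    stop_reason = "budget_exhausted"
    while attempts < max_attempts:
        attempts += 1
        status_class = f"{send() // 100}xx"
        if status_class not in RETRYABLE_CLASSES:
            stop_reason = "non_retryable_response"
            break
    return {
        "retry_attempt_count": attempts - 1,  # retries exclude the first attempt
        "http_status_class": status_class,
        "stop_reason": stop_reason,
    }
```

The returned dictionary records where retrying stopped and why, which is the part of the decision a reviewer usually cannot reconstruct from raw logs.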

Sources checked

Google SRE overload guidance - accessed 2026-06-27; purpose: verify overload and reliability risk context.
OpenTelemetry HTTP semantic conventions - accessed 2026-06-27; purpose: verify HTTP telemetry field context.
CometAPI documentation - accessed 2026-06-27; purpose: verify current CometAPI documentation navigation.
CometAPI help center - accessed 2026-06-27; purpose: verify support and escalation documentation areas.

Contract details to verify

Area	What to verify	Source URL	Accessed	Safe candidate wording
Overload response	Whether retrying is safe for this incident class and whether retries need a local cap	https://sre.google/sre-book/handling-overload/	2026-06-27	“Use a bounded retry decision and record when the client stops retrying.”
HTTP telemetry	Which HTTP attributes, status class, error type, and route labels are available in the service’s traces or metrics	https://opentelemetry.io/docs/specs/semconv/http/	2026-06-27	“Record low-cardinality HTTP telemetry fields that are present in the service.”
Documentation navigation	The current public documentation location used for API contract checks	https://apidoc.cometapi.com/	2026-06-27	“Start from the current documentation home when confirming references.”
Chat request contract	The current request method, path, required fields, and response fields for the documented chat API	https://apidoc.cometapi.com/api/text/chat	2026-06-27	“Verify request and response fields in the current chat reference before running the smoke test.”
Escalation context	The current help-center route for support or escalation notes	https://apidoc.cometapi.com/support/help-center	2026-06-27	“Attach a sanitized support reference when escalation is required.”

The packet should separate these checks instead of blending them into one conclusion. For example, an HTTP 5xx count may support a statement about observed server-side failures during the incident window, but it does not support a statement about uptime. A successful smoke test may support a statement about one route working during one short window, but it does not prove future availability or account-specific quota behavior. A support ticket may prove that escalation happened, but it does not prove root cause unless the support response says so and the team is allowed to cite it.
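
As an illustration of keeping a 5xx observation narrow, the sketch below counts status classes inside the incident window and phrases the result as an observation only. The function names and the (timestamp, status) event shape are assumptions; a real service would read these values from its own traces or metrics, and all timestamps are assumed to be timezone-aware.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Tuple


def status_counts_in_window(
    events: Iterable[Tuple[datetime, int]],
    start: datetime,
    end: datetime,
) -> Counter:
    """Count HTTP status classes for events with start <= timestamp < end."""
    counts: Counter = Counter()
    for timestamp, http_status in events:
        if start <= timestamp < end:
            counts[f"{http_status // 100}xx"] += 1
    return counts


def observed_failure_statement(counts: Counter, start: datetime, end: datetime) -> str:
    """Phrase the observation narrowly; it makes no uptime or root-cause claim."""
    total = sum(counts.values())
    return (
        f"Observed {counts['5xx']} 5xx responses out of {total} recorded requests "
        f"between {start.isoformat()} and {end.isoformat()}."
    )
```

The statement names the window and the recorded request count explicitly, so a reader cannot mistake it for a claim about availability outside that window.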

Failure modes

Evidence gap: the operator cannot inspect the failing log, trace, source page, request record, or command output. The safe action is to record the missing evidence instead of guessing.
Scope drift: the repair changes files, models, endpoints, credentials, retry settings, or fallback rules that are not connected to the observed failure. Keep the response tied to the failing signal and leave unrelated cleanup for a separate change.
Environment mismatch: the local check uses different versions, credentials, feature flags, or runtime settings than the hosted path. Record the mismatch before treating the result as proof.
Retry amplification: a client keeps retrying after the service has already shown overload symptoms. The packet should show where retrying stopped, what signal stopped it, and whether fallback or hold was chosen instead.
Weak handoff: the incident note says the issue is fixed but omits the command, result, changed files, and remaining uncertainty. That makes the next operator repeat the investigation.
Over-specific conclusions: the packet turns one narrow test into a broad statement about provider reliability, latency, pricing, billing, model availability, or rate limits. Keep those as separate checks with separate evidence.
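
For the environment mismatch case above, one way to record the difference before trusting a local result is a plain comparison of two settings maps. This is a sketch under assumptions: the function name environment_mismatches and the example keys are hypothetical, and only non-secret settings (version strings, flag names, credential aliases rather than credential values) should be passed in.

```python
from typing import Dict, List


def environment_mismatches(local: Dict[str, str], hosted: Dict[str, str]) -> List[str]:
    """List settings that differ between the local check and the hosted path."""
    mismatches = []
    for key in sorted(set(local) | set(hosted)):
        local_value = local.get(key, "<missing>")
        hosted_value = hosted.get(key, "<missing>")
        if local_value != hosted_value:
            mismatches.append(f"{key}: local={local_value} hosted={hosted_value}")
    return mismatches
```

An empty list does not prove the environments match; it only shows that the settings the operator chose to compare were equal.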

Reader next step

Before the next incident, create a one-page evidence packet template in the same place your team keeps incident notes. Add the fields from the smoke-test workflow, prefill the internal route names that are safe to share, and link it from the team’s LLM gateway runbook. Then run one non-production drill: capture a happy-path request, capture one safe error-path request, attach the HTTP telemetry fields that are actually present, and write pass, fail, or inconclusive without adding any unsupported vendor claims.

If your team already has CometAPI-specific reliability notes, use the CometAPI Timeout Evidence Pack for On-call Reviews as a narrower companion, then keep this packet as the provider-neutral incident summary. The next useful improvement is not a longer checklist; it is a shorter handoff that another engineer can audit in five minutes.

For the next comparison point, see the CometAPI chat reliability contract review. Keep Build a CometAPI Fallback Evidence Checklist nearby for setup and permission checks.

FAQ

What belongs in the first five minutes of the packet?

Start with the incident identifier, affected service route, time window, HTTP status class, trace or request identifier, customer-visible symptom, and whether traffic stayed on primary, moved to fallback, or was held. Add the first support reference only if escalation has actually started.

Should the packet include exact model IDs?

Only include model identifiers when the operator verifies them against current documentation or an internal account system. Otherwise, record a placeholder such as model_reference_checked: false and keep the packet focused on request behavior and telemetry.

Can a smoke test prove that an API is reliable?

No. A smoke test can show that a specific route and contract check behaved as expected during a narrow test window. It cannot prove uptime, latency, quota, billing, pricing, or future availability.

How much retry detail is enough?

Record the retry attempt count, where the retry decision happened, whether fallback was considered, and why the operator stopped. Avoid broad claims about retry safety unless the service’s own retry budget and overload signals support them.

When should support evidence be attached?

Attach a sanitized support reference when the incident requires vendor clarification, account-specific investigation, or contract details that are not safely answered by public documentation.

What should be left out of the packet?

Leave out real credentials, full prompts, full generated responses, private customer data, account-specific quota values, billing details, exact pricing claims, and any support text the team is not allowed to share. Use placeholders when a field matters operationally but should not be exposed in a public incident artifact.
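
A redaction pass before a record leaves the team can be sketched as follows. The key names in SENSITIVE_KEYS are assumptions chosen for illustration; each team would replace them with the field names its own client and logging pipeline actually use.

```python
from typing import Any, Dict

# Hypothetical field names; adjust to match the service's own log schema.
SENSITIVE_KEYS = {"api_key", "authorization", "prompt", "response_text", "customer_id"}


def redact(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `record` with sensitive values replaced by a marker."""
    redacted: Dict[str, Any] = {}
    for key, value in record.items():
        if key.lower() in SENSITIVE_KEYS:
            redacted[key] = "<REDACTED>"
        elif isinstance(value, dict):
            # Nested dictionaries are redacted with the same rules.
            redacted[key] = redact(value)
        else:
            redacted[key] = value
    return redacted
```

Matching on key names only catches fields the team already knows about, so free-text fields such as notes still need a manual review before sharing.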