Overload Signal Triage for LLM API On-Call Engineers

Last reviewed: 2026-06-17

Direct answer

When your LLM API call fails during an on-call incident, the first question is not “should I retry?” — it is “what kind of failure am I looking at?” Overload signals require a different response than transient network errors or client bugs, and confusing them makes incidents worse.

The core triage loop has three steps:

Read the HTTP status code and response body. A 429 Too Many Requests or a 503 Service Unavailable both suggest the server is under pressure, but they carry different semantics. A 429 generally means your client is above a request-volume threshold. A 503 generally means the upstream service itself cannot serve the request right now. A 500 can be either an overload symptom or a server-side bug — do not treat it as a safe retry target without checking the response body.
Check your telemetry spans before retrying. The OpenTelemetry HTTP semantic conventions define standard span attributes (http.response.status_code, http.request.method, error.type) that, combined with the span duration, let you distinguish a fast rejection at the gateway layer from a slow timeout at the model layer. A fast 429 on a low-latency span is a load-shedding signal; a slow 503 near your timeout budget is a capacity signal. Treat them differently.
Apply the overload handling principle before retrying. Google SRE’s overload handling guidance warns that naive retries under overload amplify the problem — every client retrying at full rate multiplies load on an already-saturated system. Before retrying, check whether Retry-After is set, apply exponential backoff with jitter, and consider whether your retry budget allows another attempt at all. See Overload Signals for LLM API Failover Runbooks for complementary runbook patterns.
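The first step of the loop, reading the status code and body, can be sketched as a small classifier. This is a minimal illustration rather than a provider-specific mapping: the Signal names and the CAPACITY_HINTS keyword list are hypothetical, and the real error-body wording should be checked in the current API documentation.

```python
from enum import Enum


class Signal(Enum):
    OK = "ok"
    OVERLOAD_CLIENT_RATE = "overload_client_rate"  # 429: client above a volume threshold
    OVERLOAD_CAPACITY = "overload_capacity"        # 503, or a 5xx whose body hints at capacity
    CLIENT_ERROR = "client_error"                  # other 4xx: fix the request, do not retry
    SERVER_AMBIGUOUS = "server_ambiguous"          # 5xx with no capacity hint: investigate
    UNKNOWN = "unknown"                            # anything outside 2xx/4xx/5xx


# Hypothetical keywords; real providers word their error bodies differently.
CAPACITY_HINTS = ("overload", "capacity", "throttl", "rate limit")


def classify_status(status_code: int, body: str = "") -> Signal:
    """Map an HTTP status code (and, for 5xx, the body text) to a triage signal."""
    if 200 <= status_code < 300:
        return Signal.OK
    if status_code == 429:
        return Signal.OVERLOAD_CLIENT_RATE
    if status_code == 503:
        return Signal.OVERLOAD_CAPACITY
    if 400 <= status_code < 500:
        return Signal.CLIENT_ERROR
    if 500 <= status_code < 600:
        text = body.lower()
        if any(hint in text for hint in CAPACITY_HINTS):
            return Signal.OVERLOAD_CAPACITY
        return Signal.SERVER_AMBIGUOUS
    return Signal.UNKNOWN
```

Keeping the 500 branch dependent on the body reflects the guidance above: a 500 is only treated as a capacity signal when the response text supports that reading.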

If you are sending requests to CometAPI, the exact endpoint path, error code semantics, and support escalation path should be verified in the current API documentation before you act on any cached assumptions.

For broader release checks, see CometAPI chat reliability contract review.

Who this is for

This guide is for:

On-call engineers responding to LLM API degradation alerts.
Platform engineers building retry, fallback, or circuit-breaker logic around LLM API calls.
SRE teams writing or reviewing LLM API runbooks for the first time.

You should already be comfortable reading HTTP response codes and basic distributed tracing output. You do not need deep ML knowledge.

Key takeaways

Distinguish overload from bugs. 429 and 503 are capacity signals. 400 is a client error. 500 is ambiguous — read the body before deciding.
Low-latency failures are usually gateway rejections. If your span ends quickly (for example, in under 100 ms) with a 429 or 503, the request was most likely shed before reaching the model. Retrying immediately at the same rate will likely produce the same outcome.
Respect Retry-After. If the server sets this header, your retry logic must honour it. Ignoring it is a common way to turn a brief overload event into a prolonged one.
Exponential backoff with jitter is not optional under overload. Thundering-herd effects from synchronized retries are well-documented in SRE literature; applying jitter breaks the synchronisation.
Telemetry fields are your audit trail. Standard OTel HTTP span attributes give you the signal classification evidence you need during a post-incident review. Log them on every failure, not just on success.
Know your escalation path. If overload persists beyond your retry budget, the next action is escalation to the API provider’s support channel — not more retries. Verify the current escalation path in the API help documentation.
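Honouring Retry-After starts with parsing it. HTTP allows the header to carry either a number of seconds or an HTTP date, so a sketch of a parser might look like the following. The function name retry_after_seconds is hypothetical, and whether a given API sets the header at all still needs to be verified in its documentation.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(header_value: Optional[str],
                        now: Optional[datetime] = None) -> Optional[float]:
    """Return the wait in seconds from a Retry-After value, or None if absent or unparseable."""
    if header_value is None:
        return None
    value = header_value.strip()
    # Form 1: delay in whole seconds, e.g. "30".
    if value.isdigit():
        return float(value)
    # Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a date without zone information as UTC.
        when = when.replace(tzinfo=timezone.utc)
    current = now or datetime.now(timezone.utc)
    # A date in the past means no further wait is required.
    return max(0.0, (when - current).total_seconds())
```

Returning None for a missing or malformed header lets the caller fall back to its own backoff instead of guessing a value.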

Smoke-test workflow

Setup assumptions

You have a valid API key and a working HTTP client.
You have access to a tracing backend that can receive OTel spans, or you can log span attributes to stdout during testing.
You are testing against a non-production environment or using a low-cost test payload.
Exact endpoint paths, auth header names, and request field names must be verified in the current API documentation before running this workflow.

Happy-path request plan

Send a minimal, well-formed chat completion request to the documented endpoint.
Assert that the response HTTP status is 200.
Assert that the response body contains the expected top-level fields (verify field names in the current documentation).
Record the span attributes: http.response.status_code, http.request.method, and the observed end-to-end latency.

Error-path check

Intentionally exceed the documented request rate (if testing in a safe environment) or send a malformed request to trigger a client-error response.
Observe whether the error code is 429 (rate/volume), 400 (client error), or 5xx (server-side).
If 429 is returned, check whether Retry-After is present in the response headers.
Apply a single retry after the indicated wait, or after a minimum backoff if Retry-After is absent.

Minimum assertions

Happy path returns 200 with a non-empty response body.
Error path returns a status code that matches the expected signal type for the fault injected.
Retry-After header is honoured when present.
End-to-end latency on rejected requests is materially lower than on successful requests.

Pass/fail logging fields

Record these fields after each smoke-test run:

run_id:
endpoint:
request_method: POST
status_code:
latency_ms:
error_type: <empty-on-success / 429 / 503 / 500 / etc.>
retry_after_s:
backoff_applied: true/false
assertions_pass: true/false
notes:
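One way to keep these fields consistent across runs is to emit them as a single JSON line per run. The sketch below assumes a hypothetical SmokeTestRecord type and to_log_line helper; the field names come from the list above, while the types and defaults are assumptions.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SmokeTestRecord:
    run_id: str
    endpoint: str
    status_code: int
    latency_ms: float
    request_method: str = "POST"
    error_type: str = ""                   # empty on success, otherwise e.g. "429"
    retry_after_s: Optional[float] = None  # None when the header was absent
    backoff_applied: bool = False
    assertions_pass: bool = False
    notes: str = ""


def to_log_line(record: SmokeTestRecord) -> str:
    """Serialise one smoke-test run as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


if __name__ == "__main__":
    # "/example" is a placeholder path, not a documented endpoint.
    print(to_log_line(SmokeTestRecord(
        run_id="run-001", endpoint="/example", status_code=429,
        latency_ms=42.0, error_type="429", retry_after_s=2.0,
        backoff_applied=True, assertions_pass=True,
    )))
```

A fixed schema makes runs comparable during a post-incident review, since a missing Retry-After shows up as an explicit null rather than an absent key.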

What the smoke test must not assert

Do not assert specific model identifiers, as model routing may change without notice.
Do not assert exact token counts, latency targets, or pricing fields — these are not contract guarantees in a smoke test.
Do not assert that Retry-After has a specific numeric value; only assert its presence or absence.
Do not use real user data or production credentials in smoke-test payloads.

Failure modes

Evidence gap: the agent cannot inspect the failing log, source page, pull request, or local command output. The safe action is to stop and record the missing evidence instead of guessing.
Scope drift: the agent edits files that are not connected to the observed failure. Keep the repair tied to the failing signal and leave unrelated cleanup for a separate task.
Environment mismatch: the local check uses different versions, credentials, feature flags, or runtime settings than the hosted path. Record the mismatch before treating the result as proof.
Unreviewed fallback: the agent changes models, endpoints, permissions, or retry behavior to make a run pass without preserving the review boundary. Treat access and provider failures as operational blockers, not topic failures.
Weak handoff: the final note says the issue is fixed but omits the command, result, changed files, and remaining uncertainty. That makes the next operator repeat the investigation.

Sources checked

Google SRE overload guidance - accessed 2026-06-17; purpose: verify overload and reliability risk context.
OpenTelemetry HTTP semantic conventions - accessed 2026-06-17; purpose: verify HTTP telemetry field context.
CometAPI documentation - accessed 2026-06-17; purpose: verify current CometAPI documentation navigation.
CometAPI help center - accessed 2026-06-17; purpose: verify support and escalation documentation areas.

Contract details to verify

Area	What to verify	Source URL	Accessed	Safe candidate wording
Endpoint path	Confirm the current chat completions endpoint path	https://apidoc.cometapi.com/api/text/chat	2026-06-17	“the endpoint documented at /api/text/chat”
Auth header	Confirm the required auth header name and format	https://apidoc.cometapi.com/api/text/chat	2026-06-17	“the auth scheme described in the current API docs”
429 semantics	Confirm whether 429 maps to request-rate, token-rate, or both	https://apidoc.cometapi.com/support/help-center	2026-06-17	“a 429 response indicating the request volume threshold was exceeded”
Retry-After header	Confirm whether the API sets Retry-After on 429 responses	https://apidoc.cometapi.com/support/help-center	2026-06-17	“check for Retry-After in the response headers”
503 vs 529	Confirm which 5xx or non-standard codes indicate capacity overload vs. server error	https://apidoc.cometapi.com/api/text/chat	2026-06-17	“a 503 or equivalent capacity signal as documented”
Escalation path	Confirm the current support escalation channel for sustained overload	https://apidoc.cometapi.com/support/help-center	2026-06-17	“contact support via the channel described in the help center”
OTel span fields	Confirm which span attributes are emitted by the API gateway	https://opentelemetry.io/docs/specs/semconv/http/	2026-06-17	“standard http.response.status_code and error.type span attributes”

Reader next step

Compare the workflow against Start with CometAPI.

Use CometAPI chat reliability contract review as the next comparison point. Keep Timeout-budget fallback checks for chat completions nearby for setup and permission checks.

FAQ

Q: How do I tell the difference between a transient 503 and a sustained overload event?

A single 503 with a fast retry that succeeds is transient. A series of 503 responses over multiple minutes, especially if latency is low (indicating gateway-level rejection rather than timeout), is a sustained overload signal. Track the count and the inter-arrival rate of errors in your telemetry — a rising error rate that does not recover after one or two backoff cycles should trigger escalation rather than continued retries.
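The distinction can be tracked with a small amount of state. The following is a sketch under stated assumptions: the OverloadTracker class, the 120-second threshold, and the minimum of three errors are all hypothetical values that a runbook would need to set for itself.

```python
OVERLOAD_CODES = {429, 503}


class OverloadTracker:
    """Tracks an unbroken run of overload responses since the last success."""

    def __init__(self, sustained_after_s: float = 120.0, min_errors: int = 3):
        self.sustained_after_s = sustained_after_s
        self.min_errors = min_errors
        self.first_error_s = None  # timestamp of the first error in the current run
        self.error_count = 0

    def record(self, timestamp_s: float, status_code: int) -> None:
        if status_code in OVERLOAD_CODES:
            if self.first_error_s is None:
                self.first_error_s = timestamp_s
            self.error_count += 1
        elif 200 <= status_code < 300:
            # A success ends the run: the earlier errors were transient.
            self.first_error_s = None
            self.error_count = 0

    def is_sustained(self, now_s: float) -> bool:
        """True when enough overload errors have persisted for long enough."""
        if self.first_error_s is None:
            return False
        long_enough = now_s - self.first_error_s >= self.sustained_after_s
        return long_enough and self.error_count >= self.min_errors
```

Requiring both a minimum count and a minimum duration avoids escalating on a short burst of errors, and also on a single old error followed by silence.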

Q: Should I retry a 500 error?

It depends on the response body. A 500 can be a transient server fault that is safe to retry once with backoff, or it can be a persistent server-side error that will not resolve with retries. Read the error message in the response body first. If it resembles a capacity or throttling message, treat it like a 503. If it looks like a code or data error, do not retry — investigate instead.

Q: What is the right backoff strategy under overload?

The Google SRE guidance recommends randomized exponential backoff; a widely used variant is exponential backoff with full jitter. “Exponential” means the wait ceiling doubles with each retry attempt (for example: 1 s, 2 s, 4 s, 8 s), usually up to a fixed cap. “Full jitter” means the actual wait time is randomised between zero and that ceiling, which prevents all clients from retrying at exactly the same moment. Always respect a Retry-After header when present: it overrides your calculated backoff.
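The schedule described above can be sketched in a few lines. The function name backoff_delay, the 1-second base, and the 30-second cap are assumptions for illustration; the rng parameter exists only so the randomness can be replaced in tests.

```python
import random
from typing import Callable, Optional


def backoff_delay(attempt: int,
                  base_s: float = 1.0,
                  cap_s: float = 30.0,
                  retry_after_s: Optional[float] = None,
                  rng: Callable[[], float] = random.random) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    # A server-provided Retry-After overrides the calculated backoff.
    if retry_after_s is not None:
        return retry_after_s
    # Exponential ceiling: base, 2*base, 4*base, ... limited by the cap.
    ceiling = min(cap_s, base_s * (2 ** attempt))
    # Full jitter: pick uniformly between zero and the ceiling.
    return rng() * ceiling
```

With the defaults, attempts 0 through 3 have ceilings of 1 s, 2 s, 4 s, and 8 s, and each actual wait is drawn from between zero and that ceiling.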

Q: When should I stop retrying and escalate?

Stop retrying when you have exhausted your retry budget (the maximum number of attempts your system allows per request) or when the overload has persisted beyond a threshold you define in your runbook (for example, five minutes of sustained 429 or 503 responses). At that point, activate your fallback route if one exists, and open a support ticket. Verify the current support escalation path in the CometAPI Help Center.
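A per-request retry budget can be sketched as a bounded loop that returns a decision instead of retrying indefinitely. Everything named here is hypothetical: send stands in for whatever function issues the request and returns its status code, wait stands in for the backoff sleep, and the three outcome labels are illustrative.

```python
from typing import Callable, Tuple

RETRYABLE = {429, 503}


def _no_wait(attempt: int) -> None:
    """Default wait hook that does nothing; replace with a real backoff sleep."""


def call_with_budget(send: Callable[[], int],
                     max_attempts: int = 3,
                     wait: Callable[[int], None] = _no_wait) -> Tuple[str, int]:
    """Try up to max_attempts times; return (decision, last_status_code)."""
    status = 0
    for attempt in range(max_attempts):
        status = send()
        if 200 <= status < 300:
            return ("ok", status)
        if status not in RETRYABLE:
            # Client errors and ambiguous server errors need investigation, not retries.
            return ("investigate", status)
        if attempt < max_attempts - 1:
            wait(attempt)  # back off before the next attempt
    # Budget exhausted on overload signals: fall back and open a ticket.
    return ("escalate", status)
```

Returning "escalate" rather than raising keeps the fallback and the support ticket as explicit next actions in the caller.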

Q: Can I use OTel HTTP spans to classify overload automatically?

Yes. If your tracing instrumentation emits http.response.status_code and records end-to-end latency per span, you can build alerting rules that distinguish fast rejections (low latency with a 429 or 503) from slow timeouts (latency near your timeout budget with a 5xx). The OpenTelemetry HTTP semantic conventions define the canonical attribute names to use. Exact attribute name support depends on the SDK version and instrumentation library in use.
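Such a rule can be sketched as a pure function over a span's attributes and duration. The attribute key http.response.status_code follows the OpenTelemetry HTTP semantic conventions; the function name classify_span, the 100 ms threshold, and the 0.9 near-timeout ratio are assumptions to tune against your own latency data. The sketch limits fast rejections to 429 and 503, the two codes this guide treats as capacity signals.

```python
from typing import Any, Mapping


def classify_span(attributes: Mapping[str, Any],
                  duration_ms: float,
                  timeout_budget_ms: float,
                  fast_threshold_ms: float = 100.0,
                  near_timeout_ratio: float = 0.9) -> str:
    """Label a finished HTTP client span for overload alerting."""
    status = attributes.get("http.response.status_code")
    if status is None:
        return "unknown"  # the instrumentation did not record a status code
    if 200 <= status < 300:
        return "ok"
    # Rejected quickly: most likely shed at the gateway before reaching the model.
    if status in (429, 503) and duration_ms < fast_threshold_ms:
        return "fast_rejection"
    # Failed close to the timeout budget: most likely a capacity problem upstream.
    if status >= 500 and duration_ms >= near_timeout_ratio * timeout_budget_ms:
        return "slow_timeout"
    return "other_failure"
```

Because the function takes a plain mapping, it can be run against exported span data or inside an alerting pipeline without depending on a particular SDK.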

Q: Does CometAPI publish a status page for overload events?

Verify whether a status page or operational-status feed exists in the CometAPI Help Center. During an active incident, a public status page is the fastest way to confirm whether the overload is on the provider side.