GitHub's auth service deployed to some hosts, not all, and every product felt it

2026-05-28by

#github #github-actions #ci-cd #postmortem #incident-response #reliability #operations #identity

TL;DR

2026-05-28, 19:07-19:16 UTC. A change partially deployed to a GitHub authentication service caused elevated error rates across the web UI, REST API, Git operations, and Actions (check window/figures against the status incident).
Peak impact: ~10% of Actions runs failed to queue or errored while downloading actions.
Fix: rollback. Nine minutes start to finish.
GitHub's stated follow-up: more test coverage and improved deployment validation (their words, not an inference).
The reusable part: auth is a blast-radius-1 dependency, and a partial-deploy state (some hosts new, some old) is its own failure class. If you ship identity services, gate partial rollouts behind health checks that fail closed before the change reaches a quorum.

What broke

Earlier today GitHub pushed a change to one of its authentication services. It didn't land everywhere at once. For about nine minutes some hosts ran the new code and some ran the old, and requests that hit the wrong combination failed.

Because the service does auth, the failures didn't stay contained. The web experience, the REST API, Git operations, and Actions all returned elevated errors at the same time. At peak, roughly 10% of Actions runs either failed to queue or errored while downloading actions.

GitHub rolled the change back and the error rates recovered. Per the status note, the stated follow-ups are expanding test coverage and improving the deployment validation process.

That's the whole incident. Window, cause, fix: 19:07-19:16 UTC, a partially-deployed auth change, a rollback.

Why a nine-minute incident is worth a post

Nine minutes is short. The severity isn't the story, and inflating it would be dishonest. The story is the shape of the failure, because it's a shape you can hit too.

Two things stack here.

Auth is a chokepoint. Web, REST, git, and Actions are different products with different teams and deploy cadences, but they share one identity dependency. When that dependency throws, the errors fan out to everything downstream at once. There's no graceful degradation path when the thing that's failing is the thing that decides whether a request is allowed to proceed. A service like that is effectively blast-radius-1: one fault, every dependent product.

Partial deploys are their own failure class. This wasn't "the new code is broken." It was "the new code is only some places." A fleet split between two versions can fail in ways neither version fails on its own: a token minted by a new host gets rejected by an old one, a request routes to a host that doesn't understand it yet, two halves of the fleet disagree about a format. You can have green tests on both versions and still break in the gap between them.

Combine the two and a routine rollout to a shared identity service becomes a multi-product incident, without anyone shipping obviously broken code.

If you run auth or identity services

The reader action here is narrow and concrete. If you operate an identity, token, or session service that other things depend on:

Gate partial rollouts behind health checks that fail closed. A canary that reaches a quorum of the fleet before anyone notices it's wrong is the failure mode. The check should block promotion before the bad version spreads, not alarm after.
Treat mixed-version states as a tested case, not an accident. New and old will coexist during every deploy. Test that an old host accepts what a new host produces and vice versa. The bug lives between versions, not inside one.
Budget for fan-out, not just for the service. When you reason about an auth service's SLO, the cost of a fault isn't its own error rate. It's every product that can't authenticate while it's down. Size alerting and rollback speed to that, not to the service in isolation.
Keep rollback boring. GitHub recovered in nine minutes because the answer was "put the old version back," not "diagnose the new one." A fast, dumb rollback beats a clever forward fix during an active auth incident.

None of this makes auth changes safe. It makes the partial-deploy window short and self-closing instead of a thing you discover from downstream error rates.

(GitHub had a separate, unrelated Actions incident two days earlier, when an account-review system suspended its own service account; that's GitHub's other May incident, a different failure with a different lesson. This post is only about the partial-deploy one.)

Sources

GitHub Status, incident history (primary; verify the 19:07-19:16 UTC window, the ~10% Actions figure, the affected surfaces, and the rollback / deployment-validation follow-up against the entry): githubstatus.com/history
Statusfield, GitHub incident archive (secondary aggregator, mirrors the status text): statusfield.com/services/github/incidents