Railway’s outage: an erroneous GCP suspension and a routing cache that did not fail over

2026-05-20

#railway #google-cloud #outage #reliability #networking #control-plane #sre #postmortem

Between 2026-05-19 22:20 UTC and 2026-05-20 07:58 UTC, Railway suffered a platform-wide incident: the dashboard and API returned 503s, logins failed, builds were blocked, and at peak every region returned 404 for customer workloads, including traffic normally served from Railway Metal and AWS burst capacity (Railway incident report). The trigger, per Railway, was not a gradual degradation inside a single region: an automated action at Google Cloud incorrectly placed Railway’s production GCP account into suspended status, with no proactive outreach before the restriction landed.

That sentence is worth sitting with if you run production on a hyperscaler: a billing-adjacent, policy-adjacent, or algorithmically scored account flag can act like a global circuit breaker for everything tied to that account ID, not just the VMs that misbehaved. Railway is explicit that the suspension was Google’s error, and says it is still awaiting Google’s internal review to explain why some recovery steps stayed slow after access returned. Even if the root mistake is rare, the shape of the failure (account-level, sudden, opaque from the outside) is a reminder that “we are multi-AZ” does not automatically mean “we survive a cloud account suspension.”

What broke, in plain routing terms

Railway’s own timeline is the cleanest map. Monitoring fired at 22:10 UTC; by 22:11 the dashboard was already unhealthy. By 22:19 they had identified GCP account suspension as the cause. They filed a P0 with Google at 22:22; Railway writes that account access was restored at 22:29, but compute stayed stopped, persistent disks were inaccessible, and networking was still down while layers were brought back in order (same report).

The part that turned a GCP outage into an every-provider outage is the admission that edge proxies cache routing tables populated from a network control plane API hosted on Google Cloud. While those caches held, Metal and AWS workloads could still answer. Once caches expired (first big cascade around 22:35 UTC in their timeline), the edge could no longer resolve routes to live instances, so customers saw 404 across the fleet even where the underlying hosts were still running. That is not a subtle footnote; it is the difference between “we run workloads in more than one cloud” and “a user request can always find a healthy instance without that one dependency.”
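One way to picture the alternative is a cache that treats a failed refresh as a reason to keep the last known-good table rather than let it lapse. The sketch below is a minimal illustration in Python, not Railway’s proxy code: `RouteCache`, `fetch`, the hostname-to-address table shape, and the TTL are all hypothetical names and values chosen for the example.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class CachedRoutes:
    routes: Dict[str, str]   # hostname -> backend address (hypothetical shape)
    fetched_at: float        # clock reading at the last successful refresh


class RouteCache:
    """Prefers a fresh routing table, but serves the stale one if refresh fails."""

    def __init__(self, fetch: Callable[[], Dict[str, str]], ttl: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch      # call to the control plane; may raise
        self._ttl = ttl          # seconds before a refresh is attempted
        self._clock = clock
        self._cached: Optional[CachedRoutes] = None

    def lookup(self, host: str) -> Optional[str]:
        now = self._clock()
        expired = (self._cached is None
                   or now - self._cached.fetched_at >= self._ttl)
        if expired:
            try:
                self._cached = CachedRoutes(self._fetch(), now)
            except Exception:
                # Control plane unreachable: keep the last known-good table
                # instead of discarding it when the TTL runs out.
                pass
        if self._cached is None:
            return None          # never had a table, so nothing to serve
        return self._cached.routes.get(host)
```

The trade-off is real: stale routes can point at instances that have since moved or died, so serving stale only buys time if health checks or a second source of routing truth sit alongside it. A production version would also space out refresh attempts instead of retrying on every lookup.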

Recovery then stacked: disks back to ready by 23:54 UTC, but core networking and edge routing did not fully return until about 01:30 UTC on May 20. After that came orchestration, builds, throttled deploy drains, and another familiar pain: GitHub rate-limited Railway’s OAuth and webhook traffic during the recovery burst, which temporarily blocked logins and builds even as other pieces healed. Railway also notes terms-of-service acceptance records reset, so the dashboard asked people to re-accept on the next visit.
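The gaps between those timestamps are easier to reason about once they are subtracted. The sketch below only does arithmetic on the times quoted in this article; the event labels are made up for the example, and the 01:30 entry is approximate because the timeline gives it as “about 01:30.”

```python
from datetime import datetime, timedelta, timezone


def utc(day: int, hhmm: str) -> datetime:
    """Build a UTC timestamp in May 2026 from a day number and an HH:MM string."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2026, 5, day, hour, minute, tzinfo=timezone.utc)


# Timestamps as quoted in this article; the keys are illustrative labels.
EVENTS = {
    "monitoring_fired": utc(19, "22:10"),
    "access_restored": utc(19, "22:29"),
    "first_cascade": utc(19, "22:35"),
    "disks_ready": utc(19, "23:54"),
    "edge_routing_back": utc(20, "01:30"),  # approximate in the source
    "incident_end": utc(20, "07:58"),
}


def elapsed(start: str, end: str) -> timedelta:
    """Time between two labeled events."""
    return EVENTS[end] - EVENTS[start]


if __name__ == "__main__":
    print("cache grace period:", elapsed("monitoring_fired", "first_cascade"))
    print("access to edge routing:", elapsed("access_restored", "edge_routing_back"))
    print("first alert to close:", elapsed("monitoring_fired", "incident_end"))
```

On those timestamps the cached tables bought roughly 25 minutes, and edge routing came back about three hours after account access did.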

GCP uncertainty without pretending we have Google’s postmortem

Railway frames the suspension as part of a broader automated action inside Google Cloud that touched many accounts, not a bespoke human misclick aimed at one customer. If that description holds after Google’s review, it is the kind of event that makes finance and security teams ask sharper questions: what automated gates can freeze an entire production account, what telemetry proves it was wrongful, and what contractual or architectural mitigations exist when the vendor’s first response is a ticket queue rather than a graceful read-only mode.

This is not a license to treat every GCP user as one outage away from doom; it is a nudge to model account suspension as its own failure mode, distinct from instance loss or AZ failure, and to read SLAs with that lens. Railway’s report is unusually blunt that they are still waiting on Google for clarity on some of the networking restoration delay.

Where Railway says the buck stops

The post includes a line worth quoting because it matches how good incident writing sounds in private chat: Railway owns vendor choices, and customers experience your uptime as your product, not a shared blame exercise (same report). The engineering follow-through they publish matches that tone: remove the hard dependency that made workload discoverability ride on a GCP-hosted control plane API, push toward a true mesh where losing one interconnect still leaves a path, extend highly available database shards across AWS and Metal so quorum survives a sudden cloud disappearance, and move Google Cloud off the data-plane hot path while keeping it for secondary or failover roles.
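The quorum point can be checked on paper before it is checked in production. The sketch below assumes a simple majority quorum over a fixed set of replicas, which is an assumption for illustration; Railway’s report does not describe its database internals, and the function name and placements are hypothetical.

```python
from collections import Counter
from typing import List


def providers_whose_loss_breaks_quorum(replica_providers: List[str]) -> List[str]:
    """List providers whose sudden loss leaves fewer than a majority of replicas.

    replica_providers holds one entry per replica, naming where it runs.
    """
    total = len(replica_providers)
    quorum = total // 2 + 1                 # strict majority of the original set
    per_provider = Counter(replica_providers)
    return sorted(
        provider
        for provider, count in per_provider.items()
        if total - count < quorum           # survivors cannot form a majority
    )


if __name__ == "__main__":
    # Two of three replicas in one cloud: losing that cloud loses quorum.
    print(providers_whose_loss_breaks_quorum(["gcp", "gcp", "aws"]))
    # One replica per provider: any single loss still leaves a majority.
    print(providers_whose_loss_breaks_quorum(["gcp", "aws", "metal"]))
```

An even split across two providers fails the check for both, which is why spreading shards across three locations matters more than adding replicas inside one.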

Those are multi-quarter projects, not a weekend flag flip. They are also the honest price of learning that redundant fiber and multi-AZ databases did not, by themselves, prevent a single upstream account action from starving route population after cache expiry.

A short checklist if you are not Railway but you recognize the graph

You cannot copy Railway’s internal roadmap, but you can steal the questions:

If your cloud account were suspended tomorrow with no warning, what exactly stops first: auth, DNS, load balancers, artifact pulls, routing metadata, secrets, builds?
Which of those dependencies are cached, and what happens after TTL when the control plane is still gone?
Do you have a second account, second project, or second legal entity separation that is real enough to survive a mistaken policy action, or is it mostly cosmetic org chart lines?
During recovery, which third parties (GitHub, Okta, Stripe, registries) will see a retry storm and rate-limit you?
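For the last question, the usual mitigation is to spread retries out so a recovering fleet does not hit a third party in lockstep. The sketch below shows exponential backoff with full jitter, a widely used general technique and not something Railway’s report describes; the function names, base delay, cap, and attempt count are all illustrative.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter delay: uniform between 0 and min(cap, base * 2**attempt)."""
    rng = rng or random.Random()
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_backoff(op: Callable[[], T], max_attempts: int = 6,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Optional[random.Random] = None) -> T:
    """Call op, sleeping a jittered, growing delay between failed attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            # Sketch only: real code would retry just on rate-limit or
            # transient errors and honor any Retry-After hint from the server.
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, rng=rng))
    raise AssertionError("unreachable")
```

Jitter is the part that matters during a mass recovery: without it, every worker that failed at the same moment retries at the same moment too.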

None of that removes the sting of a nearly ten-hour window for a platform used as default hosting. It does give builders a specific failure cartoon to sketch on a whiteboard, which beats treating “multi-cloud mesh” as a mood board.

Sources

Railway: “Incident Report: May 19, 2026 – GCP Account Suspension” (primary timeline, impact, remediation commitments)
