Not Everything Is Google’s Fault (Just Many Things)
Over the past 18 months, we’ve had more than a handful of issues with Google. I was waiting to write about them until AFTER we got off Google, but now seems as good a time as any.
Railway uses Google Cloud Platform products such as Google Compute Engine to power our application development platform.
On November 30th, at 16:40 UTC, Google unexpectedly restarted a workload running customer traffic. While this sort of restart isn’t unexpected, it is a rare occurrence.
We have automated systems in place to detect and resolve this. We’re notified in Discord, and if the box doesn’t become healthy after failover, we’re paged.
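For illustration only, here is a minimal sketch of that detect, notify, and page loop. It is not our actual tooling: `checkHealth`, `pingHost`, `triggerFailover`, `pageOnCall`, and the webhook URL are assumed placeholder names; the only real API shape used is Discord’s webhook endpoint (a JSON POST with a `content` field).

```go
// Minimal watchdog sketch: probe a host, notify Discord on failure, fail over,
// and page a human only if the box doesn't come back healthy. All helper names
// are hypothetical stand-ins, not Railway's production code.
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// Placeholder webhook URL; Discord webhooks accept {"content": "..."} bodies.
const discordWebhookURL = "https://discord.com/api/webhooks/<id>/<token>"

func notifyDiscord(msg string) {
	body, _ := json.Marshal(map[string]string{"content": msg})
	resp, err := http.Post(discordWebhookURL, "application/json", bytes.NewReader(body))
	if err == nil {
		resp.Body.Close()
	}
}

// Hypothetical stand-ins for real health probes, failover automation, and paging.
func checkHealth(host string) bool      { return pingHost(host) }
func pingHost(host string) bool         { return false /* e.g. TCP/ICMP probe */ }
func triggerFailover(host string) error { return nil /* move workloads elsewhere */ }
func pageOnCall(host string)            { log.Printf("paging on-call: %s still unhealthy", host) }

// watch probes a host on an interval, notifies Discord when it goes unhealthy,
// fails it over automatically, and pages only if it doesn't recover.
func watch(host string, interval time.Duration) {
	for range time.Tick(interval) {
		if checkHealth(host) {
			continue
		}
		notifyDiscord("host " + host + " unhealthy, failing over")
		if err := triggerFailover(host); err != nil || !checkHealth(host) {
			pageOnCall(host)
		}
	}
}

func main() {
	watch("gcp-host-1", 30*time.Second)
}
```

The intent is the same as described above: automation handles the common case, and humans only get paged when failover doesn’t bring the box back.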
By 20:53 UTC, the issue had been resolved, all workloads had failed over successfully, and service was restored.
Networking
In 2022, we experienced continual networking blips from Google’s cloud products. After escalating to Google on a number of occasions, we got frustrated.
So we built our own networking stack: a resilient eBPF/IPv6 WireGuard network that now powers all our deployments.
Suddenly, no more networking issues.
Registry
In 2023, Google randomly throttled our quota on their Artifact Registry down to nothing.
This caused our builds to be delayed, as the throughput of image distribution was cut significantly. Again, we got frustrated.
Following this, we built our own registry product.
Voila, no more registry throughput issues.
Support
After the issues above, I was fuming. How could Google do this? We paid them multiple millions of dollars per year, yet they simply couldn’t be bothered to care that their actions were affecting user workloads (both ours and many. other. customers’).
So, I did what any self-respecting founder does: I started tweeting.
Google reached out, and I took it upon myself to sit down with a few of their VPs to figure out what had happened, so it would never happen to anyone else.
As it turns out, a Google engineer was able to arbitrarily change the quota on GCP.
I expressed to the VPs that this wasn’t acceptable. They agreed and said, “We’re digging into this heavily. It’s going all the way to the top!”
That was June. To this day, I’m still following up to get that retro and an official response + policy added to prevent arbitrary quota changes.
💡
Bonus: During this whole exchange, they changed their ToS without warning, causing our costs to go up by 20%.
They also said they’d get back to us on that. 🦗
After citing Steve Yegge’s Platform Rant to any VP who would listen, I felt defeated.
Last quarter, we made the decision internally to sunset all Google Cloud services and move to our own bare metal instances. We got our first one up a couple of weeks ago, and we’ll be migrating all our instances in 2024.
In our experience, Google isn’t the place for reliable cloud compute, and it sure as heck isn’t the place for reliable customer support.
That leads us to today’s incident.
On December 1st, at 8:52am PST, a box dropped offline, inaccessible. And then, instead of automatically coming back after failover, it didn’t.
Our primary on-call engineer was alerted and dug in. While they were digging in, another box fell offline and didn’t come back.
We started manually failing over these boxes (~10 minutes of downtime each), but soon enough there were a dozen of them, and half the company was called in to work through runbooks.
Given our experience with Google Cloud, and the fact that what we saw in the serial logs lined up with what we’d previously seen during Google Cloud’s automated live migration, we filed a ticket with both our cloud reseller and the Googlers who said they’d “Help us, day or night.” By the time Google got back to us hours later, we’d already resolved the incident.
As we were manually failing these over, we kept digging.
Our first response was to look through the serial console logs; these logs come straight from the kernel via a virtualized serial device.
When we scrubbed through the serial console logs, we noticed soft-locked CPU cores as well as stack traces for the locked CPUs showing entries such as kvm_wait or __pv_queued_spin_lock_slowpath.
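Both symbols sit on the paravirtualized spinlock path: __pv_queued_spin_lock_slowpath is the slow path a contended spinlock takes, and kvm_wait is the KVM guest hook that halts the vCPU until the hypervisor kicks it. For context, serial console output for a GCE instance can be fetched with `gcloud compute instances get-serial-port-output`. Below is a rough sketch, in Go, of the kind of scan that flags these entries in a dump; the file name and the tooling itself are illustrative assumptions, not our actual setup.

```go
// Scan a serial console dump for soft-lockup indicators. Illustrative only.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// Markers that suggest a soft-locked vCPU in a KVM guest.
var markers = []string{
	"soft lockup",                    // watchdog: BUG: soft lockup - CPU#N stuck ...
	"kvm_wait",                       // vCPU parked waiting on the hypervisor
	"__pv_queued_spin_lock_slowpath", // paravirtualized spinlock slow path
}

func main() {
	f, err := os.Open("serial-console.log") // assumed dump file name
	if err != nil {
		panic(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for lineNo := 1; scanner.Scan(); lineNo++ {
		line := scanner.Text()
		for _, m := range markers {
			if strings.Contains(line, m) {
				fmt.Printf("%6d: %s\n", lineNo, line)
				break
			}
		}
	}
}
```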
The last time we had seen similar logs and behavior in the serial console was during a Google-initiated restart that occurred on three boxes last year, also in December.
As we dug into this more, we found further kernel errors that lined up with a few threads about GCP’s nested kernel virtualization causing soft lockups. You can see Google acknowledge this bug here. Additionally, other users have complained about it here and here. All on GCP.
Because we don’t use virtualization ourselves on these hosts (yet), these messages re: kvm and paravirtualization would relate to the guest kernel code that interacts with the GCP hypervisor.
The users in the 3 issues above appeared to experience the same thing, all on GCP. GCP seems to have dismissed this as “not reproducible,” but we strongly believe we experienced the same thing here.
Specifically, we believe there is a potentially fatal interaction in userspace-to-kernel memory transfer on GCP guests which causes soft lockups in rare circumstances.
More precisely, we believe this to be related to paravirtualized memory management and how those pages are mapped and remapped on the hypervisor during certain kinds of resource pressure. The one common factor in all the reports we’ve seen is that most of them come from GCP users.
💡
If the above is true, much like the quota issue we experienced before, this means there’s an arbitrary, unposted speed limit/threshold/condition at which your boxes soft-lock, despite being well below utilization on every piece of telemetry we can observe (CPU/mem/IOPS). These machines were at around 50% of their posted resource limits, which is in line with this article about GCP nested virtualization.
After a manual reboot, we disabled some internal services to decrease resource pressure on the affected instances, and the instances became stable after that point.
During manual failover of these machines, there was roughly 10 minutes of downtime per host. However, as many people run multi-service workloads, that downtime could be multiplied several times over as additional boxes subsequently went offline.
To all of our affected users: we’re deeply sorry.
This is obscenely frustrating for us and, as mentioned, we’re moving to our own bare metal to give you all increased reliability.