
Service Interruption: Cannot Destroy Machine, Deploy, or Restart – Questions / Help

2023-07-20 18:42:10

Hey folks, the health checks on our fly app have been failing this morning, so I’ve logged in to diagnose it. I can see the machine is in the “stopped” state, and there’s a little banner saying:

Service Interruption 3 hours ago
We are performing emergency maintenance on a host some of your app’s instances are running on.

and no other information. I can’t see an issue on the status page and am not sure where else to look for resolution steps or an ETA.

I’m unable to restart or destroy the stopped machine instances (it times out), and attempting to re-deploy the app throws an error:

Error: found 1 machines that are unmanaged. `fly deploy` only updates machines with fly_platform_version=v2 in their metadata. Use `fly machine list` to list machines and `fly machine update --metadata fly_platform_version=v2 <machine id>` to update individual machines with the metadata. Once done, `fly deploy` will update machines with the metadata based on your fly.toml app configuration

Not sure what else to try at this point short of re-creating a new app or putting it up on another host. Any guidance you can provide?
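For reference, the steps that error message points at are roughly the following (the machine ID and app name are placeholders):

# List machines and note the ID of the unmanaged one
fly machine list -a <app-name>

# Tag it so `fly deploy` will manage it, then redeploy
fly machine update --metadata fly_platform_version=v2 <machine-id>
fly deploy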



2 Likes

We also experienced this. One of our machines seems to have gone zombie mode, reporting unreachable for all fly machine commands with “Error: could not get machine.”

I’ve been able to restore availability to our app by using fly scale to allocate additional VMs. However, the bad VM continues to exist in an indeterminate state and can’t be destroyed or removed from the account. fly machine list shows invalid data for the VM such as a creation date of “1970-01-01T00:00:00Z”

I would appreciate any advice on how to remediate this.
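For anyone in the same spot, the scale-out workaround above was roughly the following (the app name and count are placeholders for our setup):

# Add VMs to restore capacity while the bad one is stuck
fly scale count 2 -a <app-name>

# The zombie machine still shows up, with bogus metadata
fly machine list -a <app-name>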

We’re seeing the same issue in our app, as of nearly 14 hours ago – and ours is for the database container, so it’s not as simple as scaling up to restore access :frowning:

We have a paid plan so I emailed support a few hours ago, but no reply as of yet.

For the record, our app is hosted in syd, so maybe one or two hosts are having issues there?

I’m also getting this – just like @mfwgenerics, I worked around it by scaling to create a new machine, but still have the original machine in a state where it can’t be destroyed:

Error: could not get machine [machine ID]: failed to get VM [machine ID]: unavailable: dial tcp [ipv6]:3593: connect: connection refused

My staging environment is in the same state, but for now I’m not adding more VMs that I’m surely going to be billed for until I know I can clean up the zombies.

Just like OP, I see the same error in the dashboard about emergency maintenance. That’s been there for 15 hours, with no other information.

This is the second time I’ve had this kind of issue with Fly, where my service just goes down, Fly reports everything healthy, and there’s really no information and nothing I can do apart from wait and hope it comes back up sometime (hours later, probably).

I appreciate the convenience that Fly offers, but these kinds of problems erode my trust in this platform completely. Heroku had its faults, but I was never left scouring a forum trying to get my service back up – if a host was bad, my dyno would be automatically moved, no worries. I’m running a small-scale golden-path Rails app with Postgres; I can’t imagine trying to fix these sorts of problems on a more complex app.



1 Like

Adding some more information here since I’m also surprised that this is still ongoing 12 hours later with no response.

  • We had 4 machines (app + Postgres for staging and production) running yesterday, and 3 of the 4 (including both databases) are still down and can’t be accessed. I can replicate the issues others have described here.
  • This is our company’s external API app, so the issue broke all of our integrations.
  • Our team ended up setting up a new project in fly to spin up an instance to keep us going, which took a couple of hours (backfilling environment variables and configuration and so on – not a bad test of our DR capability).
  • There is no way I can find to get the data off the db machines. Thank goodness this isn’t our main production db and we were able to reverse engineer what we needed into there.

Very keen to hear what’s happening with this and why, after so many hours, there’s no more news or updates.

As an aside, it’s kind of a kick in the teeth to see the status page for our organization reporting no incidents – the same page that lists our apps as under maintenance and inaccessible!



1 Like

Confirming my deployment is in syd too. I’m still seeing the zombie VM and observing failing CLI commands against the machine.

We have syd deployments as well for all our apps too.

I’m feeling very lucky that none of our paid production apps or databases are affected at the moment (only our development environment is), but also really surprised that the issue has been ongoing for 17 hours now with no status page update, no notifications (beyond betterstack letting us know it was down) and one note on the app with not much information as to what’s happening.

It really worries me what would happen if it was one of our paid production instances that was affected – the data we’re working with can’t simply be ‘recovered’ later, it’d just get dropped until service resumed or we migrated to another region to get things running again.

Keen to know what’s wrong and what’s being done about it.



1 Like

The message has now been updated:

Service Interruption (20 hours ago)
We are continuing to investigate an infrastructure related issue on this host.

Still no incidents listed on the status page for the SYD region though :thinking:



1 Like

Not sure if related, but I had a redis app in lhr fall into suspended status overnight, which killed an important demo.

machine is a zombie…

machine [id] was found and is currently in a stopped state, attempting to kill…
Error: could not kill machine [id]: failed to kill VM [id]: failed_precondition: machine not in known state for signaling, stopped

I got a response from support a few hours ago –

Unfortunately this host managed to get into an extremely poor state, and a fix is taking longer than expected. We have a team continuing to work on it, but no estimated resolution time to share right now. As soon as we have an update we’ll let you know.

So I guess we just wait…



1 Like

Same issue here for me, on a host in syd. It’s completely broken a pg cluster.

The absence of any proactive status updates on this issue has been really poor.

Thanks for sharing that update – surprised there isn’t a status update from Fly yet though :cold_sweat:

I can appreciate the issue might be taking up a lot of time and they want to focus on fixing it first – but even just a message from staff here earlier would put me at ease about our production apps that are still running.


We worked out we could create a new Postgres cluster from one of the snapshots of the currently-down app – so we’re back up and running for our app.

(We had to create it with a different name, and then when we tried to make another one with the previous name, flyctl put the cluster on the same currently-down host! Oops)
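In case it helps anyone else stuck with a down database, the restore path we used looks roughly like this (app names, volume ID and snapshot ID are placeholders – check `fly postgres create --help` on your flyctl version for the exact flags):

# Find the volume backing the down Postgres app and list its snapshots
fly volumes list -a <old-db-app>
fly volumes snapshots list <volume-id>

# Create a new cluster (under a different name) from a snapshot
fly postgres create --name <new-db-app> --region syd --snapshot-id <snapshot-id>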

Also having this issue. Scaling worked for the Phoenix server, but the Postgres server is also dead.

And I can’t even restore the Postgres one:

Error: failed to create volume: Could not allocate volume, not enough compute capacity left in yyz

There’s a known incident listed on the status page for YYZ, which might be related: Fly.io Status – We are undergoing emergency vendor hardware replacement in YYZ region.

Still crickets for the down host in SYD though :frowning:

Yes, mine has been fixed.

Really bizarre how radio silent it’s been on this :thinking: we’re coming up on 48 hours now.

Just a note that the status banner for me now states that the service interruption was resolved 7 hours ago:

Service Interruption resolved 7 hours ago
We are continuing to investigate an infrastructure related issue on this host.

I still had to manually restart the machine to actually bring my app back up, but I’ve been able to interact with machines again, so I guess it’s resolved.
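For completeness, the manual restart was just something like the following (machine ID and app name are placeholders):

fly machine restart <machine-id> -a <app-name>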

Never heard anything from Fly, just complete silence. No status page updates either. I’m sympathetic to the problems I can imagine Fly having scaling their service and support, but the takeaway I have from this experience is that if something outside my own control happens with Fly, there’s nothing I can do to find out what’s happening, when it’s going to be resolved, or whether there’s anything I can do to resolve the issue. It seems like even the paid email support has a multi-hour response time, and even then it’s just going to be a “we’re working on it”. I can’t recommend Fly professionally with that kind of experience, and I’m not sure I can even tolerate it for personal apps.



3 Likes

Update: my bad VM has finally been restored after a couple of days.

I’m concerned about the lack of clarity and communication around what happened, but I’m happy to put this situation down to growing pains on Fly’s part. I think I’ll be sticking to non-critical, non-stateful workloads for the near term though. :sweat_smile:



1 Like
