IWStack down?

VittG · October 2021

I just signed up here because I'd like to add something that might be important for your troubleshooting, and I'm not sure how else to contact you since the panel is still down.

You say that the problem with the SAN happened only yesterday; but, the day before yesterday, I could see that, some minutes before everything completely collapsed, my iwstack instance went into 100% iowait and was unable to read/write from the disk. I was able to ssh and run htop, possibly because the needed files were still cached in in RAM, but everything else went in uninterruptible sleep ("D").

So some issue with the SAN and/or its power supply might have been there also in the first outage, as well as the second one.

By the way, I would like to also add that, except this unfortunate outage, my very small instance has been very very stable and reliable in these three years I've been with you.

I hope everything will get sorted out, and also that Salvatore can get well soon...

P.s. consider yourself very very lucky that your infrastructure is not in the Iredios DCs. Their DCs were good before the Iredios acquisitions, but today, sudden power outages of both rails, issues with "lost"/"forgotten" inventory, unannounced unilateral contractual changes, or even random construction works that will dirty all servers with rubble, are a pretty constant nightmare from what I'm told by friends who have gear there... Plus the fact that they removed 24x7x365 technical manned presence almost without notice. CDLAN is a very serious DC provider instead...

Best regards,
VittG

Maounique · October 2021

@VittG said: CDLAN is a very serious DC provider instead...

Yes, we didn't have issues with them, in fact, Salvatore worked for them almost since the start and he designed a lot of things there.
I do not believe the switches issue and the SAN issue are unrelated, just a coincidence, there must be a power issue somewhere even as the SAN is seriously shielded from such things, by internal UPSes and regulators.
The technical team is there, they will certainly have an explanation from the SAN logs about what happened. They are now trying to restore our main partition.
We also have a smaller SAN and a few NFS servers on site (the main SAN is somewhat further away this is why we didn't think it would be affected by what we thought was an ATS rack issue in the incident with the switches) which are running so the local storage nodes should be still up, but the orchestrator is down, so it is likely errors will crop up all the time.
We do not know how long will it take, it is a huge quantity of data and the last news we had was that they are trying to find a way to restore it, were not sure whether it can be done.
Only one partition left, but it is the most important one for us.

mgiammarco · October 2021

Now I have a very big problem. I activated additional backup of my VMs in iwsea minio service. I am restoring the most urgent VMs in another cloud. Now I discover that IWStack forgot to backup the second disk of my most important VM. I could not find in the minio backup (I see other backups infact are missing but this one is the most important).
What update can you give to me? Have you lost data? Can I access at least backups to recover my data disk?

mgiammarco · October 2021

I have just seen that a new backup has been added in my minio repository. So it seems that backups works. Can you add the backup of the second disk?

mgiammarco · October 2021

@Maounique I confirm the automatic backup to minio is working. I need absolutely to contact someone to enable the backup of the missing disk to minio so I download it and I am fine.

desperand · October 2021

Damn, that's really bad case... Fuck. Sad to read all of that... Shit happens. I wish you a lot of patience and mental health for resolving all of that shit....

Maounique · October 2021

Hello!

The Hitachi people believe it is a bug in the SAN firmware and released a patch 4 hours ago. They have installed it and trying to recover the data.
Given the situation, I can't say I blame them, they were on site in a few hours, in week-end and working for 10 hours already.
They have restored 2 of the three partitions, but, of course, the one with the most load is also the most corrupted one, not to mention the most data too.
In principle, there are some internal redundancies in the SAN and we also had a running backup of sorts (I don't understand Uncle's magic, but he was able to restore many accidentally deleted VMs by their users days after the event happened) so, in theory it should be possible to restore at least some of the data, but, if the problems started days ago when the switches issue happened, then it might be worse than I originally thought.

asianbookie · October 2021

can we at least bring up the cloudstack first so we can deploy at milan2 while milan1 is being sorted?

mgiammarco · October 2021

I need to ask several questions:
1) I have ssd based VMs, are they affected by SAN failure?
2) Have you backups outside SAN?
3) Can, to avoid addtional damage, enable the backup to minio/iwsea of the second disk of my most important VM?

I have seen backups on minio of my VMs (even today there are) and they seems not corrupted so I am pretty SSD based VMs are not on SAN so they are not affected.

wcypierre · October 2021

iwstack panel is up

wcypierre · October 2021

did a shutdown and starting up the vm (as suggested in the previous outage) and the vm is up now

Maounique · October 2021

Hello!

https://prometeus.net/billing/index.php?rp=/announcements/471/SAN-issues.html

Maounique · October 2021

This SAN is a state of the art piece of hardware with multiple built-in redundancies and a top tier support contract.
The data has been restored in full albeit the Hitachi technicians didn't allow for it to go live until they finished all integrity checks they could think of.
With tens of petabytes of data this took a pretty long time.
So, while we had the good news they will and did restore all data, we had to wait 6-7 more hours until all checks completed.
I can't fault them for that, if anything, it shows they are very thorough at their job. It is better to take a longer outage than multiple little ones.
This incident is still ongoing, we are restoring services in a certain order, but if the panels see your VM and give no errors at start, it probably means your service has been restored already.

Maounique · October 2021

@mgiammarco said: Now I discover that IWStack forgot to backup the second disk of my most important VM.

Hello!

IWStack creates snapshots of the VM data (or even full states in case of Xen and VMWare) in our secondary storage.
I understand that with the orchestrator down the secondary storage was inaccessible too, but IWStack does not make backups externally unless you had a custom solution created by one of our technicians. Could you pm me your name/email, something to identify your account and check?

Maounique · October 2021

All services should be up.
As anyone knows, this does not mean individual ones started correctly, after years of uptime some systems might fail to start due to start-up files being setup wrongly at updates and whatnot.
In particular, if you have ISOs mounted, please shutdown the service, remove the ISO and boot. The NFS storage had some issues with instances hang unable to access main storage and we had to clean up the servers. In theory, this should not affect the mounted ISO as the service came back up because we restored the services in the correct order, yet it is possible your instance was linked to an obsolete ISO we have replaced and the exact copy will no longer be found, to quote just one example of things gong bad.
So if you had years of uptime, please take the minute to fully shutdown the instance and start it without any attachments to check. If there are boot issues, please check the console. Checking the console is always recommended after downtime, fsck might be going on and repeated restarts will no tmake things easier.

Maounique · October 2021

I am sorry I haven't been able to answer individual questions and requests. Here it goes:

@asianbookie said: can we at least bring up the cloudstack first so we can deploy at milan2 while milan1 is being sorted?

Unfortunately, no, IWStack is built around the SAN, we only branched it out to local storage servers and multiple datacenters after it has been designed for our B2B side. As such, the orchestrator is storing its data on the SAN itself as we figured that would be the safest place to do it (and it was until recently and even after that the restore is 100%).

@mgiammarco said: 1) I have ssd based VMs, are they affected by SAN failure?

Yes and no.
Yes because the orchestrator is down and can't manage the nodes in any way, the console, the panel, the API and the management of the secondary storage do not work.
No, because if you do not need to intervene in any way and you do not rely on the Virtual Router, external firewall to access the Internet, it should work undisturbed.

@mgiammarco said: 2) Have you backups outside SAN?

Again, yes and no.
Yes, because all the secondary storage is outside the SAN and this is everything not live, from snapshots to ISOs and templates, therefore, if you took a snapshot you can restore it in full even if 100% of the data on SAN has been lost.
No, because the redundancies in the SAN as well as our rolling snapshots are stored on the SAN, albeit in another partition. So, the snapshots were available while the main partition was being checked, but we could not restore them yet because we did no thave where to restore them on lacking access to the main partition until the Hitachi engineers gave the green light. Suppose the main live data was lost, we could have restored it at most an hour back, but that could have taken a very long time going through so much data and iterations. The fact they restored the main partition was a relief, cut some 20 hours of downtime.

@mgiammarco said: 3) Can, to avoid addtional damage, enable the backup to minio/iwsea of the second disk of my most important VM?

This is the part I do not understand, ISWtack does not provide this service, back-up out of its ecosystem automatically, it can only take snapshot to the secondary storage (which is the little SAN in the rack or some NFS servers, depending on the zone. Also, with the orchestrator down, even if it did have such a service, it could have not been done.

@VittG said: You say that the problem with the SAN happened only yesterday; but, the day before yesterday, I could see that, some minutes before everything completely collapsed, my iwstack instance went into 100% iowait and was unable to read/write from the disk. I was able to ssh and run htop, possibly because the needed files were still cached in in RAM, but everything else went in uninterruptible sleep ("D").

I did forward this, mainly in supporting my theory the incidents are related, but the technicians from Hitachi denied too it was a power issue and I know for a fact that the SAN has multiple layers of protection and it is checked regularly.
I do not see any other common cause than a power issue, otherwise the 2 affected parts, the switches and the SAN are in different rooms, pretty far apart and the SAN does not use the network other than for being monitored, not for data transfers. It has own FC switches and separate links. The Hitachi people told us squarely it was a bug in their firmware, released a patch, applied it restored and checked the data. Both the DC and them said there was no power issue so I will have to accept this was a coincidence, even as it is hard to believe 2 totally unrelated incidents strike us after years of smooth operation just 2 days apart.

mgiammarco · October 2021

@Maounique said:

@mgiammarco said: 3) Can, to avoid addtional damage, enable the backup to minio/iwsea of the second disk of my most important VM?

This is the part I do not understand, ISWtack does not provide this service, back-up out of its ecosystem automatically, it can only take snapshot to the secondary storage (which is the little SAN in the rack or some NFS servers, depending on the zone. Also, with the orchestrator down, even if it did have such a service, it could have not been done.

I clarify you @Maounique : IWStack (Salvatore?) offered me that type of service. I have written you a PM but you can read ticket 875888.
Please please please reply to this ticket and add the second disk to backup tomorrow. We are risking a legal issue because my customer got angry NOT because of the big IWStack down but because the outside datacenter backup I promised to him was not working due to missing data disk!!! If the customer sues me I will be forced to sue Prometeus and it is a thing I do not want to do so please give this simple task top priority! Thanks.

Both the DC and them said there was no power issue so I will have to accept this was a coincidence, even as it is hard to believe 2 totally unrelated incidents strike us after years of smooth operation just 2 days apart.

Probably the main routers split brain triggered the san bug. Without bug details I do not have other ideas.

Neoon · October 2021

@mgiammarco said:

@Maounique said:

@mgiammarco said: 3) Can, to avoid addtional damage, enable the backup to minio/iwsea of the second disk of my most important VM?

This is the part I do not understand, ISWtack does not provide this service, back-up out of its ecosystem automatically, it can only take snapshot to the secondary storage (which is the little SAN in the rack or some NFS servers, depending on the zone. Also, with the orchestrator down, even if it did have such a service, it could have not been done.

I clarify you @Maounique : IWStack (Salvatore?) offered me that type of service. I have written you a PM but you can read ticket 875888.
Please please please reply to this ticket and add the second disk to backup tomorrow. We are risking a legal issue because my customer got angry NOT because of the big IWStack down but because the outside datacenter backup I promised to him was not working due to missing data disk!!! If the customer sues me I will be forced to sue Prometeus and it is a thing I do not want to do so please give this simple task top priority! Thanks.

Both the DC and them said there was no power issue so I will have to accept this was a coincidence, even as it is hard to believe 2 totally unrelated incidents strike us after years of smooth operation just 2 days apart.

Probably the main routers split brain triggered the san bug. Without bug details I do not have other ideas.

" Where data is transmitted to us, the customer is to back up their data regularly. The servers will be backed up regularly by us only when this is part of the offer. In the case of data loss, the customer must reload the data. "

I hope you learn from your mistakes and do your own backups in the future.

VittG · October 2021

@Neoon said: The servers will be backed up regularly by us only when this is part of the offer.
[...]
I hope you learn from your mistakes and do your own backups in the future.

Well... He actually said that backups were part of what was offered to him, and his problem is that those backups were not backing up a secondary disk. Of course it seems he still failed to notice that the secondary disk backups were actually not working since the beginning (or were they working before?), but, it all depends on what's in the actual agreement between them.

wcypierre · October 2021

@VittG said:

@Neoon said: The servers will be backed up regularly by us only when this is part of the offer.
[...]
I hope you learn from your mistakes and do your own backups in the future.

Well... He actually said that backups were part of what was offered to him, and his problem is that those backups were not backing up a secondary disk. Of course it seems he still failed to notice that the secondary disk backups were actually not working since the beginning (or were they working before?), but, it all depends on what's in the actual agreement between them.

regardless of what was offered, I still wouldn't solely rely on only one single backup, especially when its provided by the provider (for exactly this scenario).
After everything has settled down, he should probably review and test his bcp plan

Maounique · October 2021

It was already working for another server. He asked for this server to be added, but only the root disk was added.
Either way, threats are not well received anywhere.
This is one of the reasons I tell Uncle not to do custom solutions outside of our B2B contracts where everything is clearly specified and no misunderstandings can happen. We offer plenty of off-the-shelf back-up plans, people should be able to back-up their data with us or with someone else.

@VittG said: but, it all depends on what's in the actual agreement between them.

That is the problem, there was no special contract between us so it is all up to the courts to decide whether he specified correctly what should be backed up or not, or if, indeed, the ticket can be considered a contract or not (I think he might have a case here).
In my opinion, given the ticket and IANAL, he has no case given how the April ticket went with the extra request, and with the fact the supposedly undelivered service was not paid for, nor were we notified the second disk should have been added or it was not added after the request. It can go either way, depending on luck and how sensible the judge will be. I will link to the public documents about the case so we will all see how it works in practice.

Howdy, Stranger!

Categories

In this Discussion

IWStack down?

Comments

Howdy, Stranger!

Quick Links

Categories

In this Discussion

IWStack down?

Comments