BuyVM Catastrophic Data Failure - All data lost on a node! - Page 2
Comments

  • saibalsaibal Member
    edited April 2018

    deank said: "Samsung" actually means three stars.

    What do you expect from 3-star products? :p

    Get Subaru branded products ;-)

    But seriously, good luck @Francisco. Hopefully, you recover from this soon. Once this is over, maybe you can publish drive reliability reports like Backblaze.

  • Tough luck, I guess. It happens to the best. At least with BuyVM you can be assured they're working on the smoothest recovery possible.

  • FranciscoFrancisco Top Host, Host Rep, Veteran

    jetchirag said: Bad batch of SSDs (maybe)

    As I mentioned, it's all from the same order from the same vendor, so it's near certain the same production batch.

    I'll be going to the DC later today to continue on the storage project, so I'll pull whatever ID information I can find from them. Since some people are pretty curious about this, I'll post some screenshots of what we're seeing.

    In all cases (both on lv-shared03 and now on 04) the drive activity lights stay 100% lit, no blinking, which usually means "she's dead, Jim".

    We booted into Linux and hot-plugged the drives, but they sit there trying to detect and never complete.

    We tried ICH9, ICH10, LSI 9211, Adaptec HBAs, and really whatever else we could find to see if any of them would see the drives.

    Francisco

    Thanked by 1Saragoldfarb
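For anyone wanting to reproduce the hot-plug detection step described above: on Linux you can ask the kernel to re-probe a controller's ports by writing the wildcard scan token to each SCSI host's sysfs `scan` file (AHCI/SATA ports appear as SCSI hosts too). A minimal sketch, assuming the standard sysfs layout; run as root and watch dmesg for the result:

```python
from pathlib import Path

SCAN_TOKEN = "- - -\n"  # wildcard channel/target/LUN: re-probe everything

def scan_files(base="/sys/class/scsi_host"):
    """Locate the 'scan' control file for every SCSI host."""
    return sorted(Path(base).glob("host*/scan"))

def rescan_all(files=None):
    """Ask the kernel to re-probe each host; drives that answer
    show up as new /dev/sdX nodes and in dmesg."""
    for f in (files if files is not None else scan_files()):
        f.write_text(SCAN_TOKEN)
```

If a drive hangs at IDENTIFY the way Francisco describes, the rescan stalls or logs link-reset errors instead of attaching a new device node.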
  • FranciscoFrancisco Top Host, Host Rep, Veteran
    edited April 2018

    saibal said: But seriously, good luck @Francisco. Hopefully, you recover from this soon. Once this is over, maybe you can publish drive reliability reports like Backblaze.

    Hah!

    So far it's just the 1TBs that have been cancerous. The 500GBs are solid, happy as can be. My 840s (500GB & 1TB) are both solid.

    Saragoldfarb said: Tough luck, I guess. It happens to the best. At least with BuyVM you can be assured they're working on the smoothest recovery possible.

    Restores will be done today assuming we can keep the restore scripts fed. At this point Anthony expanded on my scripts from the lv-shared03 problems and has multiple restore queues depending on account size, priority restores, etc.

    The bigger issue is that the array where the backups are stored is just RAID60s, so it's having a hard time keeping up with the IOPS demand of packing all of the backups. JetBackup has a disaster recovery option, but I'm fairly sure it works the same way (generate a tarball, transfer, restore as a cpmove). In that case our scripts are much faster.

    As of now we're over 30% restored, with an average restore rate of around 100 accounts per hour. This is going to slow down, though, since I doubt we can continue to feed the beast this quickly.

    EDIT - Looks like we're actually 45%+ done restoring.

    Francisco

    Thanked by 1Saragoldfarb
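Anthony's multi-queue restore scripts aren't public, but the scheduling idea described (separate queues by account size, with priority restores jumping ahead) can be sketched with a simple heap. Everything here, names included, is a hypothetical illustration, not BuyVM's actual tooling:

```python
import heapq

def build_queue(accounts):
    """accounts: iterable of (name, size_gb, is_priority) tuples.
    Priority accounts restore first; within a tier, smallest first,
    so many small accounts come back online while big ones transfer."""
    heap = [((0 if is_priority else 1), size_gb, name)
            for name, size_gb, is_priority in accounts]
    heapq.heapify(heap)
    return heap

def next_restore(heap):
    """Pop the next account to feed to a restore worker."""
    tier, size_gb, name = heapq.heappop(heap)
    return name
```

Ordering by (priority tier, size) is what keeps the "accounts per hour" number high early on: the queue drains the cheap restores first instead of blocking behind one huge archive.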
  • letboxletbox Member, Patron Provider

    @angstrom said:
    Anyone want to post the email?

    Hello, 
    We've suffered a catastrophic failure on this node where multiple SSD controllers failed. All data has been lost. 
    
    We're in the midst of setting up a new node with new NVME drives and are working to restore all the backups. 
    
    From the looks of it the backups are a few days old but you should be back online in a day or so. 
    
    We will send updates over email, please do not bump this ticket or reply to it until we say everything is restored and you're still having issues. 
    
    We apologize for this, this is not how we wanted to spend our week. 
    
    Francisco/Frantech Team
    
    Thanked by 1angstrom
  • hzrhzr Member

    saibal said: Get Subaru branded products ;-)

    (images not preserved)

    Didn't expect that...

  • @Lee said:

    Ewok said: 'Catastrophic'? That's a bit fucking dramatic.

    A bit like your reply.

    Their words, not mine!

  • FranciscoFrancisco Top Host, Host Rep, Veteran

    deank said: On a more serious note, is it for real? Could be a badly miss-timed April fool's joke.

    "We lost all your data".

    "lol jk"

    Francisco

  • FranciscoFrancisco Top Host, Host Rep, Veteran

    What in the fuck?

    I gotta try that when I go to the DC.

    Francisco

  • AnthonySmithAnthonySmith Member, Patron Provider
    edited April 2018

    It's a ridiculous situation. People always think, 'no way did multiple drives fail at the same time.'

    It does happen. I had it happen in the UK: 3 disks in a 4-disk RAID 10 array kicked themselves out of the array at the same time, and when we tried to recover the data it was completely beyond hope.

    I thought that was my 1-in-a-million, the impossible happened; then about 6 months later it happened again on a different node. Both were using disks from the same batch.

    Reminds me of that old serial/COM port mouse I had that would stop any PC it was connected to from POSTing... sense it makes none, yet it happens.

    Good luck with the recovery.

    Thanked by 1netomx
  • @Francisco said:

    What in the fuck?

    I gotta try that when I go to the DC.

    Francisco

    If it works, you'd owe him your shoemining rigs

  • HarambeHarambe Member, Host Rep

    @Francisco said:

    What in the fuck?

    I gotta try that when I go to the DC.

    Francisco

    SATA power only, no data cable attached. 30 mins powered on, 30 seconds unplugged, 30 mins on, 30 seconds unplugged, then plug it all back in.

    Hope it works, would at least get you the most recent copy of stuff back.

  • raindog308raindog308 Administrator, Veteran

    HBAndrei said: Yeah, I'm also affected, millions just being wasted every second...

    Dow Jones is down 40 points...

    Thanked by 2ThracianDog Ole_Juul
  • lurchlurch Member
    edited April 2018

    My very first SSD, a 60GB OCZ Vertex, never worked from new, but for some strange reason I never threw it away. I'm gonna try this.

  • deankdeank Member, Troll

    The end is nigh.

  • FranciscoFrancisco Top Host, Host Rep, Veteran

    Harambe said: no data cable attached.

    Think a backplane would cause issues, or probably not, since it'd just be doing pass-through?

    Francisco

  • @Francisco

    Will it be possible to give us the exact date and time of the backup being restored? Would be super helpful once we're back live!

  • FranciscoFrancisco Top Host, Host Rep, Veteran
    edited April 2018

    Will it be possible to give us the exact date and time of the backup being restored? Would be super helpful once we're back live!

    As of this very second we're 49% restored. Slowed down a bit but Anthony is waiting for the next batch to sync on over.

    I suspect by end of day we'll be back in action.

    Francisco

  • AnthonySmith said: It's a ridiculous situation. People always think, 'no way did multiple drives fail at the same time.'

    I think if you had all 4, from the same batch, used equally at the same time, then it's absolutely feasible that they all hit their batch's 'lifetime' write cycles or had exactly the same failure point. The general consensus I've heard among PC enthusiasts is that the 850 250GB, 500GB, and 1TB drives have fared worse than their 840 counterparts.
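The remaining-endurance figure being debated here is visible in SMART attribute 177 (Wear_Leveling_Count) on Samsung drives; its normalized VALUE column counts down from 100 as the NAND wears, which lines up with the "30-40% left" reading mentioned below. A small sketch for pulling it out of `smartctl -A` output (the column layout is smartmontools' standard attribute table):

```python
def wear_left(smartctl_attr_output):
    """Return the normalized Wear_Leveling_Count (SMART attr 177), or None.
    On Samsung 840/850 drives this roughly tracks % rated endurance left."""
    for line in smartctl_attr_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "177":
            return int(fields[3])  # VALUE column of the attribute table
    return None
```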

  • FranciscoFrancisco Top Host, Host Rep, Veteran

    eastonch said: I think if you had all 4, from the same batch, used equally at the same time, then it's absolutely feasible that they all hit their batch's 'lifetime' write cycles or had exactly the same failure point. The general consensus I've heard among PC enthusiasts is that the 850 250GB, 500GB, and 1TB drives have fared worse than their 840 counterparts.

    I checked the drives last week and they all had 30-40% left, so plenty of room.

    Francisco

  • HarambeHarambe Member, Host Rep

    @Francisco said:

    Harambe said: no data cable attached.

    Think a backplane would cause issues, or probably not, since it'd just be doing pass-through?

    Francisco

    Possibly an issue. If the backplane is disconnected from any controller it may work, but basically the SSD is looking for a power-only connection for a long period of time (~30 mins), and that triggers a factory reset of sorts on the controller. I'm not sure if something plugged into the data connector will allow it to do the same reset.

    Best bet is to crack open a server you have kicking around that has SATA power coming off the PSU and do it that way.

  • FranciscoFrancisco Top Host, Host Rep, Veteran

    Harambe said: Best bet is to crack open a server you have kicking around that has sata power coming off the PSU and do it that way.

    Yep, I'll pull some SATA plugs off a backplane and see what happens.

    I'll owe you a banana whenever I go to the big gorilla in the sky if this works.

    Francisco

    Thanked by 1netomx
  • Francisco said: I suspect by end of day we'll be back in action.

    Thanks for the info. Sorry, I meant the date and time the backup was taken. (how old the restored data will be).

  • HarambeHarambe Member, Host Rep

    @Francisco said:

    Harambe said: Best bet is to crack open a server you have kicking around that has sata power coming off the PSU and do it that way.

    Yep, I'll pull some SATA plugs off a backplane and see what happens.

    I'll owe you a banana whenever I go to the big gorilla in the sky if this works.

    Francisco

  • FranciscoFrancisco Top Host, Host Rep, Veteran

    eric1212 said: Thanks for the info. Sorry, I meant the date and time the backup was taken. (how old the restored data will be).

    I've seen people that had backups from April 2nd, so literally the day prior if not the day of.

    Others had backups from late March.

    Francisco

  • lurchlurch Member

    That's my first 30min/30sec cycle done

  • letboxletbox Member, Patron Provider
    edited April 2018

    @Francisco said:

    eastonch said: I think if you had all 4, from the same batch, used equally at the same time, then it's absolutely feasible that they all hit their batch's 'lifetime' write cycles or had exactly the same failure point. The general consensus I've heard among PC enthusiasts is that the 850 250GB, 500GB, and 1TB drives have fared worse than their 840 counterparts.

    I checked the drives last week and they all had 30-40% left, so plenty of room.

    Francisco

    The issue could be with the backplane, or the RAID card just died, but I'd check the backplane first. I had an issue like this before, but it would read 2 drives out of 4 and sometimes not read any.

  • FranciscoFrancisco Top Host, Host Rep, Veteran

    key900 said: The issue could be with the backplane, or the RAID card just died, but I'd check the backplane first. I had an issue like this before, but it would read 2 drives out of 4 and sometimes not read any.

    We tried all of those as well as bypassing the backplane, same thing on all of them.

    I'll for sure try the 30/30/30/30 trick when I go in later today/tomorrow.

    Francisco

  • Had an older Crucial M4 SSD that also liked to stall on boot sometimes; it seemed like it was stuck in some garbage-collection mode. Leaving it with a mix of power and no power for multiple days fixed it for me. I never made an exact science out of how to fix it, though.

    A firmware update later on fixed it for good.
