BuyVM Catastrophic Data Failure - All data lost on a node! - Page 2
Comments

  • saibalsaibal Member
    edited April 2018

    deank said: "Samsung" actually means three stars.

    What do you expect from 3-star products? :p

    Get Subaru branded products ;-)

    But seriously, good luck @Francisco. Hopefully, you recover from this soon. Once this is over, maybe you can publish drive reliability reports like Backblaze.

  • Tough luck, I guess. It happens to the best. At least with BuyVM you can be assured they're working on the smoothest recovery possible.

  • FranciscoFrancisco Top Host, Host Rep, Veteran

    jetchirag said: Bad batch of SSDs (maybe)

    As I mentioned, it's all from the same order from the same vendor, so it's near certain the same production batch.

    I'll be going to the DC later today to continue on the storage project, so I'll pull whatever ID information I can find from them. Since some people are pretty curious about this, I'll post some screenshots of what we're seeing.

    In all cases (both on lv-shared03 and now on 04) the drive activity lights stay 100% lit, no blinking, which usually means "she's dead, Jim".

    We booted into Linux and hot-plugged the drives, but they sit there trying to detect and never complete.

    We tried ICH9, ICH10, LSI 9211, Adaptec HBAs, and really whatever else we could find to see if any of them would see the drives.

    Francisco

    Thanked by 1Saragoldfarb
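For anyone wanting to reproduce the hot-plug detection step described above: on Linux you can ask the kernel to re-probe a controller's ports by writing the wildcard scan token to each SCSI host's sysfs `scan` file (AHCI/SATA ports appear as SCSI hosts too). A minimal sketch, assuming the standard sysfs layout; run as root and watch dmesg for the result:

```python
from pathlib import Path

SCAN_TOKEN = "- - -\n"  # wildcard channel/target/LUN: re-probe everything

def scan_files(base="/sys/class/scsi_host"):
    """Locate the 'scan' control file for every SCSI host."""
    return sorted(Path(base).glob("host*/scan"))

def rescan_all(files=None):
    """Ask the kernel to re-probe each host; drives that answer
    show up as new /dev/sdX nodes and in dmesg."""
    for f in (files if files is not None else scan_files()):
        f.write_text(SCAN_TOKEN)
```

If a drive hangs at IDENTIFY the way Francisco describes, the rescan stalls or logs link-reset errors instead of attaching a new device node.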
  • FranciscoFrancisco Top Host, Host Rep, Veteran
    edited April 2018

    saibal said: But seriously, good luck @Francisco. Hopefully, you recover from this soon. Once this is over, maybe you can publish drive reliability reports like Backblaze.

    Hah!

    So far it's just the 1TBs that have been cancerous. The 500GBs are solid, happy as can be. My 840s (500GB & 1TB) are both solid.

    Saragoldfarb said: Tough luck, I guess. It happens to the best. At least with BuyVM you can be assured they're working on the smoothest recovery possible.

    Restores will be done today assuming we can keep the restore scripts fed. At this point Anthony expanded on my scripts from the lv-shared03 problems and has multiple restore queues depending on account size, priority restores, etc.

    The bigger issue is that the array where the backups are stored is just RAID60s, so it's having a hard time keeping up with the IOPS demand of packing all of the backups. JetBackup has a disaster recovery option, but I'm fairly sure it works the same way (generate a tarball, transfer, restore as a cpmove). In that case our scripts are much faster.

    As of now we're over 30% restored, with an average restore rate of around 100 accounts per hour. This is going to slow down, though, since I doubt we can continue to feed the beast this quickly.

    EDIT - Looks like we're actually 45%+ done restoring.

    Francisco

    Thanked by 1Saragoldfarb
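Anthony's multi-queue restore scripts aren't public, but the scheduling idea described (separate queues by account size, with priority restores jumping ahead) can be sketched with a simple heap. Everything here, names included, is a hypothetical illustration, not BuyVM's actual tooling:

```python
import heapq

def build_queue(accounts):
    """accounts: iterable of (name, size_gb, is_priority) tuples.
    Priority accounts restore first; within a tier, smallest first,
    so many small accounts come back online while big ones transfer."""
    heap = [((0 if is_priority else 1), size_gb, name)
            for name, size_gb, is_priority in accounts]
    heapq.heapify(heap)
    return heap

def next_restore(heap):
    """Pop the next account to feed to a restore worker."""
    tier, size_gb, name = heapq.heappop(heap)
    return name
```

Ordering by (priority tier, size) is what keeps the "accounts per hour" number high early on: the queue drains the cheap restores first instead of blocking behind one huge archive.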
  • letboxletbox Member, Patron Provider

    @angstrom said:
    Anyone want to post the email?

    Hello, 
    We've suffered a catastrophic failure on this node where multiple SSD controllers failed. All data has been lost. 
    
    We're in the midst of setting up a new node with new NVME drives and are working to restore all the backups. 
    
    From the looks of it the backups are a few days old but you should be back online in a day or so. 
    
    We will send updates over email, please do not bump this ticket or reply to it until we say everything is restored and you're still having issues. 
    
    We apologize for this, this is not how we wanted to spend our week. 
    
    Francisco/Frantech Team
    
    Thanked by 1angstrom
  • hzrhzr Member

    saibal said: Get Subaru branded products ;-)

    (images not preserved)

    Didn't expect that...

  • @Lee said:

    Ewok said: 'Catastrophic'? That's a bit fucking dramatic.

    A bit like your reply.

    Their words, not mine!

  • FranciscoFrancisco Top Host, Host Rep, Veteran

    deank said: On a more serious note, is it for real? Could be a badly miss-timed April fool's joke.

    "We lost all your data".

    "lol jk"

    Francisco

  • FranciscoFrancisco Top Host, Host Rep, Veteran

    What in the fuck?

    I gotta try that when I go to the DC.

    Francisco

  • AnthonySmithAnthonySmith Member, Patron Provider
    edited April 2018

    It's a ridiculous situation. People always think, 'no way did multiple drives fail at the same time.'

    It does happen. I had it happen in the UK: 3 disks in a 4-disk RAID 10 array kicked themselves out of the array at the same time, and when we tried to recover the data it was completely beyond hope.

    I thought that was my 1-in-a-million, the impossible happened; then about 6 months later it happened again on a different node. Both were using disks from the same batch.

    Reminds me of that old serial/COM port mouse I had that would stop any PC it was connected to from POSTing... sense it makes none, yet it happens.

    Good luck with the recovery.

    Thanked by 1netomx
  • @Francisco said:

    What in the fuck?

    I gotta try that when I go to the DC.

    Francisco

    If it works, you'd owe him your shoemining rigs

  • HarambeHarambe Member, Host Rep

    @Francisco said:

    What in the fuck?

    I gotta try that when I go to the DC.

    Francisco

    SATA power only, no data cable attached. 30 mins powered on, 30 seconds unplugged, 30 mins on, 30 seconds unplugged, then plug it all back in.

    Hope it works, would at least get you the most recent copy of stuff back.

  • raindog308raindog308 Administrator, Veteran

    HBAndrei said: Yeah, I'm also affected, millions just being wasted every second...

    Dow Jones is down 40 points...

    Thanked by 2ThracianDog Ole_Juul
  • lurchlurch Member
    edited April 2018

    My very first SSD, a 60GB OCZ Vertex, never worked from new, but for some strange reason I never threw it away. I'm gonna try this.

  • deankdeank Member, Troll

    The end is nigh.

  • FranciscoFrancisco Top Host, Host Rep, Veteran

    Harambe said: no data cable attached.

    Think a backplane would cause issues, or probably not, since it'd just be doing pass-through?

    Francisco

  • @Francisco

    Will it be possible to give us the exact date and time of the backup being restored? Would be super helpful once we're back live!

  • FranciscoFrancisco Top Host, Host Rep, Veteran
    edited April 2018

    Will it be possible to give us the exact date and time of the backup being restored? Would be super helpful once we're back live!

    As of this very second we're 49% restored. Slowed down a bit but Anthony is waiting for the next batch to sync on over.

    I suspect by end of day we'll be back in action.

    Francisco

  • AnthonySmith said: It's a ridiculous situation. People always think, 'no way did multiple drives fail at the same time.'

    I think if you had all 4, from the same batch, used equally at the same time, then it's absolutely feasible that they all hit their batch's 'lifetime' write cycles or had exactly the same failure point. The general consensus I've heard among PC enthusiasts is that the 850 250GB, 500GB, and 1TB drives have fared worse than their 840 counterparts.
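The remaining-endurance figure being debated here is visible in SMART attribute 177 (Wear_Leveling_Count) on Samsung drives; its normalized VALUE column counts down from 100 as the NAND wears, which lines up with the "30-40% left" reading mentioned below. A small sketch for pulling it out of `smartctl -A` output (the column layout is smartmontools' standard attribute table):

```python
def wear_left(smartctl_attr_output):
    """Return the normalized Wear_Leveling_Count (SMART attr 177), or None.
    On Samsung 840/850 drives this roughly tracks % rated endurance left."""
    for line in smartctl_attr_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "177":
            return int(fields[3])  # VALUE column of the attribute table
    return None
```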

  • FranciscoFrancisco Top Host, Host Rep, Veteran

    eastonch said: I think if you had all 4, from the same batch, used equally at the same time, then it's absolutely feasible that they all hit their batch's 'lifetime' write cycles or had exactly the same failure point. The general consensus I've heard among PC enthusiasts is that the 850 250GB, 500GB, and 1TB drives have fared worse than their 840 counterparts.

    I checked the drives last week and they all had 30-40% left, so plenty of room.

    Francisco

  • HarambeHarambe Member, Host Rep

    @Francisco said:

    Harambe said: no data cable attached.

    Think a backplane would cause issues, or probably not, since it'd just be doing pass-through?

    Francisco

    Possibly an issue. If the backplane is disconnected from any controller it may work, but basically the SSD is looking for a power-only connection for a long period of time (~30 mins), and that triggers a factory reset of sorts on the controller. I'm not sure if something plugged into the data connector will allow it to do the same reset.

    Best bet is to crack open a server you have kicking around that has SATA power coming off the PSU and do it that way.

  • FranciscoFrancisco Top Host, Host Rep, Veteran

    Harambe said: Best bet is to crack open a server you have kicking around that has sata power coming off the PSU and do it that way.

    Yep, I'll pull some SATA plugs off a backplane and see what happens.

    I'll owe you a banana whenever I go to the big gorilla in the sky if this works.

    Francisco

    Thanked by 1netomx
  • Francisco said: I suspect by end of day we'll be back in action.

    Thanks for the info. Sorry, I meant the date and time the backup was taken. (how old the restored data will be).

  • HarambeHarambe Member, Host Rep

    @Francisco said:

    Harambe said: Best bet is to crack open a server you have kicking around that has sata power coming off the PSU and do it that way.

    Yep, I'll pull some SATA plugs off a backplane and see what happens.

    I'll owe you a banana whenever I go to the big gorilla in the sky if this works.

    Francisco

  • FranciscoFrancisco Top Host, Host Rep, Veteran

    eric1212 said: Thanks for the info. Sorry, I meant the date and time the backup was taken. (how old the restored data will be).

    I've seen people that had backups from April 2nd, so literally the day prior if not the day of.

    Others had backups from late March.

    Francisco

  • lurchlurch Member

    That's my first 30min/30sec cycle done

  • letboxletbox Member, Patron Provider
    edited April 2018

    @Francisco said:

    eastonch said: I think if you had all 4, from the same batch, used equally at the same time, then it's absolutely feasible that they all hit their batch's 'lifetime' write cycles or had exactly the same failure point. The general consensus I've heard among PC enthusiasts is that the 850 250GB, 500GB, and 1TB drives have fared worse than their 840 counterparts.

    I checked the drives last week and they all had 30-40% left, so plenty of room.

    Francisco

    The issue could be with the backplane, or the RAID card just died, but I'd check the backplane first. I had an issue like this before, but it would read 2 drives out of 4 and sometimes not read any.

  • FranciscoFrancisco Top Host, Host Rep, Veteran

    key900 said: The issue could be with the backplane, or the RAID card just died, but I'd check the backplane first. I had an issue like this before, but it would read 2 drives out of 4 and sometimes not read any.

    We tried all of those as well as bypassing the backplane, same thing on all of them.

    I'll for sure try the 30/30/30/30 trick when I go in later today/tomorrow.

    Francisco

  • Had an older Crucial M4 SSD that also liked to stall on boot sometimes; it seemed like it was stuck in some garbage-collection mode. Leaving it with a mix of power and no power for multiple days fixed it for me. I never made an exact science out of how to fix it, though.

    A firmware update later on fixed it for good.
