
BuyVM Catastrophic Data Failure - All data lost on a node!


Comments

  • Francisco Top Host, Host Rep, Veteran

    Zerpy said: How is jetbackup configured if it can take more than a day to perform a backup of a server? o.O Even with a raid 6 array that I have for backups, 400 accounts take only a few hours and is mostly due to some accounts having 2+ million files :')

    Yeah, that's 2 million files. For instance, lv-shared04 has ~22M inodes on it.

    Francisco

  • Francisco Top Host, Host Rep, Veteran
    edited April 2018

    @vimalware said:
    Neat. Are these weekly backups made on LUX-shared nodes too? that would make me smile.

    I'm just too lazy to backup friends n family shared hosting.

    Every node gets 3 backups a week. We stagger things a bit, so LUX is something like Tues/Thurs/Sat; that way each group of nodes doesn't have to fight the other nodes for inodes.

    As of now, all but a handful of accounts we missed on the initial rounds have been imported. We've fixed whatever IPs were wrong.

    This issue is resolved :)

    Francisco

  • raindog308 Administrator, Veteran

    Zerpy said: How is jetbackup configured if it can take more than a day to perform a backup of a server? o.O Even with a raid 6 array that I have for backups, 400 accounts take only a few hours and is mostly due to some accounts having 2+ million files :')

    I'm going to take a wild guess that BuyShared's boxes have an order of magnitude more accounts per server.

    Thanked by 3: Aidan, FHR, Clouvider
  • deank Member, Troll

    40,000 accounts per server.

  • Francisco Top Host, Host Rep, Veteran

    raindog308 said: I'm going to take a wild guess that BuyShared's boxes have an order of magnitude more accounts per server.

    Not that bad. lv-shared04 has a peak of 1,000 IPs (a /22 of IPs).

    We just have a lot of people who have never, ever cleaned their spam folders, so you end up with 50,000+ unread emails.

    Francisco

  • Zerpy Member
    edited April 2018

    @Francisco said:

    Zerpy said: How is jetbackup configured if it can take more than a day to perform a backup of a server? o.O Even with a raid 6 array that I have for backups, 400 accounts take only a few hours and is mostly due to some accounts having 2+ million files :')

    Yeah, that's 2 million files. For instance, lv-shared04 has ~22M inodes on it.

    Francisco

    That's 2 million files per account for the top ones.

    In my case, it's 403 accounts, 34.4 million inodes, and a total of 2,645 GB of files.

    Total backup time was 2 hours and 37 minutes last night.

    If it takes 24+ hours for 22 million inodes, then something must be very wrong: it means you're averaging roughly 250 IOPS on the webserver itself over a 24-hour span - that seems super low :-) But what do I know. (Rough arithmetic in the sketch after this comment.)

    Another, smaller box has 228 GB of data, 4.3 million inodes, and 125 accounts; it takes about 40 minutes, and that's even on spinning rust.

    @Francisco said: We just have a lot of people that have never, ever, cleaned their spam folders so you end up with 50,000+ unread emails.

    Consider mbox format - it would probably also greatly speed up a recovery process.
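
    A quick sanity check of the throughput arithmetic above, using only the figures quoted in these posts (a throwaway sketch, not tied to any real measurement):

        # Files-per-second implied by the backup windows discussed above.
        inodes_lv_shared04 = 22_000_000           # Francisco's figure for lv-shared04
        seconds_per_day = 24 * 60 * 60
        print(inodes_lv_shared04 / seconds_per_day)    # ~255/s if a run really took a full day

        inodes_zerpy_box = 34_400_000             # Zerpy's 403-account box
        backup_seconds = 2 * 3600 + 37 * 60       # 2 h 37 min
        print(inodes_zerpy_box / backup_seconds)       # ~3,650/s for the faster run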

  • Francisco Top Host, Host Rep, Veteran

    The issue isn't the reading on the shared side, it's that the destination is getting slammed by not only 7 other shared nodes looking to do the same work, but also BuyVM backups (though those are more stream heavy).

    I thought mbox was deprecated. If not, I'll for sure consider that.

    Francisco

  • Francisco Top Host, Host Rep, Veteran
    You can use the /scripts/convert2maildir script to perform conversions on mail storage data. Maildir is the only supported mail storage system for cPanel & WHM servers. Because of this, users who migrate data onto cPanel & WHM servers will convert any mbox data to the Maildir format.
    

    Shlonged.

    Francisco

  • Zerpy Member

    Sorry, mdbox - not mbox - I hate the fact that their naming is so close: https://documentation.cpanel.net/display/68Docs/Mailbox+Conversion

    They even considered making mdbox the default at some point - that wasn't done, but support was added in recent releases - and cPanel did migrate all their internal emails to mdbox themselves.

  • Francisco Top Host, Host Rep, Veteran

    @Zerpy said:
    Sorry, mdbox - not mbox - hate the fact their naming is close: https://documentation.cpanel.net/display/68Docs/Mailbox+Conversion

    They even considered making mdbox default at some point - however was not done, but this was added in the recent releases - and cPanel did migrate all their internal emails to mdbox themselves

    Neat :)

    Will wait and see if there are any known issues, but I'd for sure love to have something like that instead of a metric crap ton of inodes.

    Francisco

    Thanked by 1: Zerpy
  • Zerpy Member

    @Francisco said:
    Will wait and see if there's any known issues but i'd for sure love to have something like that instead of a metric crap ton of inodes.

    Francisco

    It was introduced in cPanel v56, so it has been around for at least a year - and cPanel runs it internally with terabytes of email. If they switched their own @cpanel.net mail over to it, I'd assume it's "good enough" for the company to rely on.

  • willie Member
    edited April 2018

    Might be faster to just save a compressed tarball of each account as a backup, rather than attempting a differential or file-by-file backup. That way you write just one file per account on the backup server (rough sketch below).
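
    A minimal sketch of the one-tarball-per-account idea, assuming a hypothetical cPanel-style /home/<account> layout and a backup target mounted at /mnt/backup (both paths are placeholders, not BuyShared's actual setup):

        import os
        import tarfile

        HOME_ROOT = "/home"            # placeholder: one directory per account
        BACKUP_ROOT = "/mnt/backup"    # placeholder: mounted backup destination

        os.makedirs(BACKUP_ROOT, exist_ok=True)

        for account in sorted(os.listdir(HOME_ROOT)):
            src = os.path.join(HOME_ROOT, account)
            if not os.path.isdir(src):
                continue
            # One compressed stream per account, so a single file lands on the
            # backup server instead of millions of small files.
            with tarfile.open(os.path.join(BACKUP_ROOT, f"{account}.tar.gz"), "w:gz") as tar:
                tar.add(src, arcname=account)

    The trade-off, as the replies below note, is that every run re-sends full copies, so space and transfer costs grow compared to incremental approaches.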

  • Francisco Top Host, Host Rep, Veteran

    willie said: Might be faster to just save a compressed tarball of each account as backup, rather than attempting differential or file by file backup. So you write just one file per account on the backup server.

    True, but very heavy on space usage.

    In the case of a disaster we'd likely resort to the full drive snapshots which I'll be able to restore at full line rate since it's just streaming data.

    Francisco

  • Some hours offline and a copy from a few days ago - but much better than losing everything.

    Thanked by 1: Aidan
  • willie Member

    Francisco said: True, but very heavy on space usage.

    Actually dump/restore is still a thing and can do differential dumping into a single file on the backup device. That might be a decent alternative. Space consumption stays the same with all approaches if you're not keeping multiple backups around, but the tarball approach increases traffic to the backup server, so that's not great. (A rough sketch of level-based dumps follows this comment.)

    Francisco said: In the case of a disaster we'd likely resort to the full drive snapshots which I'll be able to restore at full line rate since it's just streaming data.

    I'm not so sure you can do that on an active filesystem because of getting inconsistent snapshots as stuff changes during the snapshot process.
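
    A rough sketch of the level-based dump idea, assuming the classic dump(8) utility is installed, /home sits on a hypothetical /dev/vg0/home volume, and a backup mount exists at /mnt/backup (all names are placeholders):

        import datetime
        import subprocess

        DEVICE = "/dev/vg0/home"                   # placeholder logical volume holding /home
        today = datetime.date.today()

        # Level 0 (full) once a week, level 1 (changes since the last full) otherwise;
        # each run produces a single file on the backup mount.
        level = 0 if today.weekday() == 6 else 1
        outfile = f"/mnt/backup/home.{today.isoformat()}.level{level}.dump"

        # -u records the run in /etc/dumpdates so later levels know their baseline,
        # -f names the output file.
        subprocess.run(["dump", f"-{level}", "-u", "-f", outfile, DEVICE], check=True)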

  • Francisco Top Host, Host Rep, Veteran

    willie said: I'm not so sure you can do that on an active filesystem because of getting inconsistent snapshots as stuff changes during the snapshot process.

    I can LVM snapshot the entire node so it'd be just fine.

    Francisco

  • willie Member

    Francisco said:

    I can LVM snapshot the entire node so it'd be just fine.

    How does that work? The stuff inside the LVM partitions would still be changing during the snapshot, I would have thought. I could imagine setting up an overlay filesystem or similar to allow individual accounts to be safely snapshotted and that would be interesting, but I'm not aware of it having been done.

  • Voss Member

    Francisco said: "Awww fuck."

    "A guide to webhosting"

    More like "OVH Server + Summer Holidays: A Guide to Scamming"

  • Francisco Top Host, Host Rep, Veteran

    willie said: How does that work? The stuff inside the LVM partitions would still be changing during the snapshot,

    No. The 'changed data' gets stored in a separate area. When you make a snapshot you tell LVM how much space it can use for that very purpose (rough sketch after this post).

    https://www.thomas-krenn.com/en/wiki/LVM_Snapshots

    Francisco

    Thanked by 1: willie
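
    For anyone curious what the snapshot-then-stream flow looks like, here is a rough sketch, assuming a hypothetical volume group vg0 with a logical volume named data and enough free extents for the copy-on-write area (names, sizes, and paths are placeholders, not BuyVM's layout):

        import subprocess

        VG, LV = "vg0", "data"                     # placeholder volume group / logical volume
        SNAP = f"{LV}-backup-snap"

        def run(*cmd):
            subprocess.run(cmd, check=True)

        # 1. Freeze a point-in-time view; blocks changed afterwards go to the CoW area.
        run("lvcreate", "--snapshot", "--size", "20G", "--name", SNAP, f"/dev/{VG}/{LV}")
        try:
            # 2. Stream the frozen block device to the backup target at line rate.
            with open(f"/dev/{VG}/{SNAP}", "rb") as src, open("/mnt/backup/data.img", "wb") as dst:
                while chunk := src.read(4 * 1024 * 1024):
                    dst.write(chunk)
        finally:
            # 3. Drop the snapshot so the CoW area can't fill up and invalidate it.
            run("lvremove", "-f", f"/dev/{VG}/{SNAP}")
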
  • Nomad Member

    Why doesn't anyone mention that this might have something to do with the low TBW ratings of consumer-series SSDs?

    If they are from the same batch and are in a RAID, isn't it expected for them to fail almost simultaneously?

  • YokedEgg Member

    @Nomad said:
    Why doesn't anyone mention this might have something to do with the low amount of TBW the consumer series SSDs have?

    If they are from the same batch and are in a raid, isn't it expected for them to fail just simultaneously?

    That was actually already mentioned.

  • Nomad Member

    @YokedEgg said:

    @Nomad said:
    Why doesn't anyone mention this might have something to do with the low amount of TBW the consumer series SSDs have?

    If they are from the same batch and are in a raid, isn't it expected for them to fail just simultaneously?

    That was actually already mentioned.

    My bad then, I missed that.

    Thanked by 1: YokedEgg
  • Harambe Member, Host Rep

    Nomad said: Why doesn't anyone mention this might have something to do with the low amount of TBW the consumer series SSDs have?

    Fran mentioned he monitors the wear level on them; on the 1TB variants you're looking at something like 2PB of write life per drive. Definitely a possibility, but that seems like a lot of headroom for writes on a shared hosting node.

  • sonic Veteran

    My site's back up online without issue. Great work, Fran!

  • Francisco Top Host, Host Rep, Veteran
    edited April 2018

    Nomad said: Why doesn't anyone mention this might have something to do with the low amount of TBW the consumer series SSDs have?

    If they are from the same batch and are in a raid, isn't it expected for them to fail just simultaneously?

    I documented this earlier but I'll repeat it here :)

    The SSDs were all in the 30 - 40% remaining range on the official ratings for the drives, but Samsungs can go far beyond the limits on those drives. Still, we weren't anywhere near that.

    I know this for a fact because I SMART'd all the drives around 2 weeks ago, when Karen & I decided we wanted to give Shared a nice upgrade with bigger CPUs and a move to NVMe drives. (A rough wear-check sketch follows this post.)

    Honestly, with what we've seen I think it was just because the drives never got a firmware update. We bought those drives right when the 850s first hit the market, and pulling the node offline to flash the firmware and take a chance at losing everyone's data wasn't a happy thought to me.

    For all I know there's some subsystem that was patched in a later firmware where the drives can go cockeyed if they hit a certain wear level. Samsung knows; I don't.

    We do a lot to take as much strain off our drives as we can. We're close to an 85/15 read/write workload on our shared-node SSDs. There's absolutely no swap on our nodes, and we do some other tricks to take high-thrash areas off the drives completely.

    I'm not happy with it, but I'm incredibly proud of how quickly Anthony & I were able to diagnose the problem, build completely new nodes, get everything installed, and get every backup we had unpacked and restored.

    I think it was just over 24 hours for the whole episode. Given what I've seen from countless other hosts out there, they would be 5 - 10% into the restore by now.

    This is resolved. Anyone with outstanding issues with their site, log a ticket and I'll personally sort you out!

    Francisco
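
    On the wear-monitoring point, a rough sketch of how that check can be scripted, assuming smartmontools is installed and the drives expose a Wear_Leveling_Count attribute the way Samsung's consumer SSDs generally do (the device list is a placeholder):

        import subprocess

        DRIVES = ["/dev/sda", "/dev/sdb"]          # placeholder device list

        for dev in DRIVES:
            out = subprocess.run(["smartctl", "-A", dev],
                                 capture_output=True, text=True, check=True).stdout
            for line in out.splitlines():
                if "Wear_Leveling_Count" in line:
                    # Column 4 is the normalized VALUE, which counts down as the NAND wears.
                    print(f"{dev}: Wear_Leveling_Count value {line.split()[3]}")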

  • willie Member
    edited April 2018

    Harambe said: Fran mentioned he monitors the wear level on them, on the 1TB variants you're looking at like 2PB of write life per drive.

    https://www.anandtech.com/show/8747/samsung-ssd-850-evo-review lists the rated endurance of the 1TB 850 EVO as 150TB.
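
    Some rough endurance arithmetic against that 150TB rating (the daily write volume here is a made-up example, not a measured BuyShared figure):

        rated_endurance_tb = 150        # 1TB 850 EVO rating per the AnandTech review above
        daily_writes_tb = 0.1           # assume ~100 GB/day of writes reaching one drive

        days = rated_endurance_tb / daily_writes_tb
        print(f"{days:.0f} days (~{days / 365:.1f} years) to hit the official rating")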

  • Francisco Top Host, Host Rep, Veteran

    @willie said:

    Harambe said: Fran mentioned he monitors the wear level on them, on the 1TB variants you're looking at like 2PB of write life per drive.

    https://www.anandtech.com/show/8747/samsung-ssd-850-evo-review lists the rated endurance of the 1TB 850 EVO as 150TB.

    Correct! Even the PROs aren't that much higher.

    The enterprise stuff gets into the multi-PB range.

    Still, this is shared. shared04 already had a couple of years on it and still had a good bit left.

    Francisco

  • YokedEgg Member
    edited April 2018

    @Francisco said:

    @willie said:

    Harambe said: Fran mentioned he monitors the wear level on them, on the 1TB variants you're looking at like 2PB of write life per drive.

    https://www.anandtech.com/show/8747/samsung-ssd-850-evo-review lists the rated endurance of the 1TB 850 EVO as 150TB.

    Correct! Even the PROs' aren't that much higher.

    The enterprise stuff gets into the multi PB.

    Still, this is shared. shared04 already had a couple years on it and still had a good bit left.

    Francisco

    Quick question from the official BuyVM LET help desk: when do you think the KVM slices will get NVMe installed in 'em? (srs)

  • Francisco Top Host, Host Rep, Veteran

    YokedEgg said: Quick question from the official BuyVM LET help desk, when do you think the KVM slices will get NVMe installed in 'em? (srs)

    Unlikely any time soon.

    It's a huge cost and not enough people will care about it. I would have to do full motherboard changes, or move to a 2U chassis, since our spare PCI slots are being taken by our InfiniBand networking.

    Free nightly backups & snapshots will come out around summer or so though :) That should be fun.

    Francisco

    Thanked by 1: YokedEgg
  • Harambe Member, Host Rep

    @willie said:

    Harambe said: Fran mentioned he monitors the wear level on them, on the 1TB variants you're looking at like 2PB of write life per drive.

    https://www.anandtech.com/show/8747/samsung-ssd-850-evo-review lists the rated endurance of the 1TB 850 EVO as 150TB.

    Ah, my bad. My quick google search brought this page up earlier: https://www.anandtech.com/show/8747/samsung-ssd-850-evo-review/4
