MXroute failed, and I'm sorry

jar Patron Provider, Top Host, Veteran

Hey friends,

Let's be real, my most involved and interested customers are here and on Reddit. Fuck Reddit, so that leaves here. I come to you, customers and potential customers, in a public space to do a few things, and to do it all without my beloved ChatGPT. Let me make myself an outline so I don't forget what I'm here to do:

  1. I hope this message finds you well. Just kidding.
  2. Explain what happened
  3. Explain what we’re doing
  4. Explain what was learned from it
  5. Ask for your forgiveness
  6. Ask for a second chance
  7. Ask my resellers for a favor
  8. Beg you to not ask me to explain why I can't throw the entire business in the garbage can and start over with a brand new software stack

So here's what happened:

On September 24th the RAID controller failed in the lucy.mxrouting.net server. Through incredible late-night efforts, including such rockstars as Jeff & Fran, the OS was recovered and booted after a roughly 12-hour outage. However, its disks were transplanted into a 12-bay server. I didn't want to pay for that chassis, and Jeff didn't want to waste that chassis on a 4-disk server. We knew it was a temporary home, but it was the fastest way to get the job done by remote hands. It's not that remote hands are incapable, it's that you really want them performing as little surgery as possible. Anyway, next.

The chassis swap needed to be coordinated between myself and Jeff, and my schedule was pretty booked. We're talking about the days and weeks surrounding Black Friday; I'm pretty busy. We managed to set a time for a chassis swap back to the repaired server (new RAID controller, at least) on December 5th. It's a disk-swap scenario, barely worse than a reboot. In fact we had rebooted prior just to test that it came back fine. So we went ahead with the maintenance, and Lucy wouldn't boot in the other chassis. You see, here's the big problem with rebooting this server: MySQL will shut down as it damn well pleases, in 6-8 hours. I may be exaggerating, but any way you spin it, if we wait for OS services to spin down normally this becomes a multi-hour outage. So both the test reboot and the power-off for the chassis swap were hard power events. It's fine, a little InnoDB recovery happens sometimes, it's not a huge deal and we have backups. Little did we know, each one of those hard power events was taking more and more of a toll on an already barely stable file system (never again, XFS), and this last one was the straw that broke the camel's back.
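For what it's worth, and purely as a sketch of an alternative we didn't take: a dirty InnoDB shutdown can usually be shortened by telling it to flush ahead of the stop instead of hard powering the box. A minimal sketch of that idea, assuming socket auth as root on the box; the two variables are real MySQL settings, but the script itself is just an illustration, not what we ran:

    import subprocess

    def mysql(sql: str) -> None:
        # Run a statement through the local mysql client (assumes socket auth as root).
        subprocess.run(["mysql", "-e", sql], check=True)

    # Start flushing dirty pages aggressively *before* the stop, so the
    # eventual shutdown has far less work left to do.
    mysql("SET GLOBAL innodb_max_dirty_pages_pct = 0;")

    # innodb_fast_shutdown = 1 (the default) skips the slow purge/merge work
    # on shutdown while still flushing data pages.
    mysql("SET GLOBAL innodb_fast_shutdown = 1;")

    # Now a normal service stop should finish in minutes rather than hours.
    # (The unit may be "mysqld" or "mariadb" depending on the distro.)
    subprocess.run(["systemctl", "stop", "mysql"], check=True)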

So there we are on December 5th, around 10PM, with a Lucy that just isn't going to boot. I'm not great with boot stuff, so I ask for help. With the great minds of Jeff and Paul, and some advice from Fran along the way, we get it booted after an xfs_repair (after that segfaults a few times, and not due to a memory issue). Then the "input/output error" messages start flooding the console; it's fucked. Reboot, no go. Another xfs_repair, it boots. Input/output error. You see the pattern, and every time we fill the lost+found folder with new guests.

At this point I begin working to restore backups to a Hetzner server, because my backups are there and transferring several TB is best done within the same datacenter. Even so, because they're incremental and not archives, transferring those backups to a nearby server was going to take days (and it did take days).

So there I am with my thumbs up my ass while the guys are poking around at the original server trying to help me save face. I'm transferring backups, not exactly a process that needs me to baby it. After so much time and effort, I become convinced that it's a hardware issue with the original server. It doesn't matter to me that Jeff double checked and tested every single thing in this server. I'm not sold. I wish I could tell you I didn't cry over this, but as soon as everyone else was asleep I was in tears. This is everything to me, this is how I have a roof over our heads and food on the table. If I fail, we fail. Jeff felt that pain and drove 400 miles to build me a brand new server and put those disks in. We got Lucy back up for about 30 minutes. I was there repairing what xfs_repair took, copying key system files from an identical system, reinstalling perl, rebuilding services. Just when I get it all working and Jeff is questioning his sanity, boom. Input/output error. He was right, I was wrong, but that beautiful bastard still did this for me. Remember that when you need a server in the US, but anyway.

I booted into a recovery ISO, turned on networking, ran xfs_repair again, mounted the file system, and began using that as the framework for restoring Lucy to a new server in Virginia. I had to rebuild domain password files for a few hundred domains. I had to rebuild alias files, carefully copy what was needed into passwd, shadow, and group; you know the job. I was restoring by hand, not from the multiple TBs of backups all the way in Germany. When I finished creating the accounts and their file/folder structures, I turned it on and told users that their emails would be coming back in two rsync jobs: once from the previous server's mounted FS, and once from the maildirs in the incremental JetBackup backups.
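To make the two-pass idea concrete, here is a minimal sketch, assuming the damaged file system is mounted read-only at a hypothetical /mnt/oldlucy and the maildirs pulled from the JetBackup increments sit under a hypothetical /backup/jb path; the paths and account list are illustrative, not the real layout:

    import subprocess

    # Hypothetical locations; the real mount points and account list differed.
    OLD_FS = "/mnt/oldlucy/home"          # read-only mount of the damaged file system
    JB_MAILDIRS = "/backup/jb/maildirs"   # maildirs pulled out of the JetBackup increments
    ACCOUNTS = ["example_user"]           # in practice, every account on the box

    def rsync(src: str, dst: str) -> None:
        # -a preserves ownership/permissions/times; --ignore-existing keeps a later
        # pass from overwriting anything an earlier pass already put in place.
        subprocess.run(["rsync", "-a", "--ignore-existing", f"{src}/", dst], check=True)

    for user in ACCOUNTS:
        home = f"/home/{user}"
        rsync(f"{OLD_FS}/{user}", home)        # pass 1: whatever survived on the old disks
        rsync(f"{JB_MAILDIRS}/{user}", home)   # pass 2: older mail from the incremental backups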

We got it all settled and I declared Lucy fixed on the afternoon of December 10th. I wasn't done; I still had tickets to take care of for one-off fixes all over the box. But none of them were connected to each other or to globally shared issues, so at that point I considered it "done" but kept working.

Between traversing all directories in /home on a ~2,800-user box, resellers frantically running the DirectAdmin backup system to gzip and transfer their backups for safekeeping, and the RAID still syncing, we had about 25% iowait on average with occasional spikes up to 50%. So while I had come up with a better backup plan that would help me recover from this type of event faster, I wasn't going to bring the box to its knees to enact it. That was going to go into effect on the afternoon of December 16th, once iowait had dropped to 0% and things were settled enough to be worth taking a new backup. I went to bed the night before, set an alarm, and knew that I'd be doing backups after lunch. Murphy, unfortunately, had other plans.
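For anyone who wants to watch that number themselves, iowait is easy to read straight from /proc/stat without any extra tooling. A rough sketch of such a check, with an arbitrary 5% threshold standing in for "calm enough to start a backup run":

    import time

    def cpu_times():
        # First line of /proc/stat: cpu user nice system idle iowait irq softirq steal ...
        with open("/proc/stat") as f:
            fields = f.readline().split()[1:]
        values = list(map(int, fields))
        return sum(values), values[4]   # total jiffies, iowait jiffies

    def iowait_percent(interval=5):
        total1, wait1 = cpu_times()
        time.sleep(interval)
        total2, wait2 = cpu_times()
        return 100.0 * (wait2 - wait1) / (total2 - total1)

    if iowait_percent() < 5.0:          # arbitrary "settled enough" threshold
        print("OK to kick off the backup run")
    else:
        print("Box is still busy, try again later")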

On the morning of December 16th, bright and early (just after 5AM my time), I received a page from customers. Just a page, as Lucy was checking all of the boxes for monitoring (ping, SMTP). It seemed that parts of the server were inaccessible. In fact, I couldn't SSH into it. There were permission errors all over the IPMI console. I rebooted. After all, we're on ext4 now; it's not a little bitch like XFS. Then I got an error from the RAID controller about a "foreign configuration." As with boot problems, hardware RAID lingo isn't my strong point, so I called in help. It was determined that at least one drive had been dropped from the RAID. We had remote hands reseat the drives; no dice. After a lot of work, again by Jeff, who is a master at these things that I'm not, things weren't looking much better. We still don't know with 100% certainty what happened, and we're still working to recover what we can from that server, but this is what we think happened:

With all of the repair operations digging through the folders on /home, and with resellers running so many backup jobs at once, we believe that one of the disks took so long to sync that the controller gave up on it. I wasn't monitoring the RAID yet, and I wasn't even done with my work on the box, so I wouldn't have noticed at the time. We then believe a second drive started failing, as it began showing SMART errors that weren't there when we provisioned the box, and as you probably know, that's the death moment for RAID10: you can survive one drive dropping out of a mirror, but lose its partner too and that's the end.
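The lesson baked into that paragraph is obvious in hindsight: watch the drives before they get a chance to surprise you. A minimal sketch of the kind of check I mean, assuming smartctl is installed and the drives are visible as plain /dev/sd* devices (behind a hardware RAID controller you typically need smartctl's -d option to reach the member disks, which this sketch skips):

    import glob
    import subprocess

    def smart_healthy(device: str) -> bool:
        # "smartctl -H" prints the drive's overall health assessment and exits
        # non-zero when it detects a problem (or can't talk to the drive).
        result = subprocess.run(["smartctl", "-H", device], capture_output=True)
        return result.returncode == 0

    for dev in sorted(glob.glob("/dev/sd?")):
        if not smart_healthy(dev):
            # In practice this would page someone instead of printing.
            print(f"ALERT: {dev} is reporting SMART problems (or is unreachable)")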

Here’s what we’re doing:

The backups that I was transferring to a new nearby server in Germany, just in case we couldn't resurrect Lucy quickly enough in Virginia, had finished transferring to that new server days ago. However, since JetBackup requires that you check an additional box to enable a feature that makes restoring backups not absolute hell (and leaves it off by default. Fuckers.), I still had to package them all up and restore them one by one, account by account. After scripting it, I broke the job into 12 tasks (each fed a list of 250 or fewer accounts) to perform the tar and restore on each account. Those restores are still running; as I'm writing this sentence, we've restored 1,009 of them. I expect the rest to be done by the end of tomorrow (Monday, US/Central).
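For the curious, the batching itself is the simple part. A sketch of that shape of job, where /root/restore_accounts.sh and accounts.txt are hypothetical stand-ins for the real per-account tar-and-restore wrapper and account list, not the actual scripts:

    import subprocess

    accounts = open("accounts.txt").read().split()   # ~2,800 usernames, one per line

    # Split into lists of at most 250 accounts each (12 lists for ~2,800 users).
    batches = [accounts[i:i + 250] for i in range(0, len(accounts), 250)]

    jobs = []
    for n, batch in enumerate(batches):
        list_file = f"/root/restore_batch_{n}.txt"
        with open(list_file, "w") as f:
            f.write("\n".join(batch) + "\n")
        # Hypothetical wrapper that tars and restores every account named in the file.
        jobs.append(subprocess.Popen(["/root/restore_accounts.sh", list_file]))

    # Let all 12 batch jobs run in parallel and wait for them to finish.
    for job in jobs:
        job.wait()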

We're cloning all 4 disks from the RAID on the previous Lucy server in hopes that we might get around any possible disk issues and be able to focus on recovering whatever we can from its file system, because that's where the last week of email for those users resides. Because, again, the first backup of the new server was scheduled to take place after this outage occurred.

We're removing JetBackup on all systems and moving to an rsync backup. With this, we can rebuild a server's framework more quickly, get users back online, and then transfer email data over afterward with ease. Of course an rsync of MySQL while it's running is a little shitty, but again, a little InnoDB recovery isn't too bad.
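Roughly speaking, the new backup is nothing fancier than a recurring rsync of the paths needed to rebuild a box. A minimal sketch, with a hypothetical backup1.example.com standing in for the real backup host and a path list that is illustrative rather than exact:

    import subprocess

    # Paths needed to rebuild a server's framework and then its mail data.
    PATHS = ["/etc", "/home", "/var/lib/mysql", "/usr/local/directadmin"]
    DEST = "backup@backup1.example.com:/backups/lucy"   # hypothetical target

    for path in PATHS:
        # -a preserves metadata, --delete keeps the copy an exact mirror,
        # --relative recreates the full path under the destination.
        subprocess.run(
            ["rsync", "-a", "--delete", "--relative", path, DEST],
            check=True,
        )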

We also credited every user $25, because our most expensive reseller plan is $25/m, and while I wanted to make a grander gesture, I do need to keep the lights on.

Here’s what we learned:

Servers of this size cannot be backed up using traditional, easy backup systems. A whole snapshot of the server would take weeks; a gzipped archive of each account would take a week to finish one run. I can recover from a copy of the server's file system within a few hours, and it's not like this is an everyday situation. Hell, it took 10 years to see one like it. Also, fuck JetBackup for making it optional, and off by default, to export the JB config to the backup server.

I ask for your forgiveness:

As a long-time provider in this community, I've failed to be the best version of me that I could be. I wasn't as ready for this as I thought I was. I'm sorry. I hope that you can forgive me.

A second chance?

I ask for a second chance. I’ve learned a lot from this. I will be more ready for this type of event in 1-2 weeks and I SWEAR TO GOD THAT IS NOT A FUCKING CHALLENGE, MURPHY.

A favor from my resellers:

If you are a reseller of mine and your customers are experiencing any degree of outage from the Lucy server, and especially if you’re losing money, please post in this thread links to where I can order your services. I want to personally purchase service from you. I have no intent to use it. I want to give you cash, not credits. I want to at least help mitigate this for you.

Please don’t:

I get it. Email storage should be replicated. All servers should have a failover. Everything should be WebHostingTalk’s retarded and incorrect definition of “cloud” (high availability). I didn’t set out to just make another email service, I set out to model it after shared web hosting because that’s what I knew, that’s how I could outsource the frontend to a company like DirectAdmin (or previously, cPanel). I wanted to master outbound delivery, not remake Gmail and charge their prices. This is what I made, this is what I’m committed to. I can do better without throwing the entire business plan in the garbage and starting from a clean slate with a set of investors and developers, only to be “yet another $5/m per email user” provider. I appreciate the thoughts and concern, I do appreciate advice, but I’m really not in the mood to hear all of that right now. Say what you want, I just don’t “want” (not that I can have what I want) to hear “Why is each server a standalone box with no failover connected to a $400,000 storage array?”


Comments

  • Apologizes

  • febryanvaldo Member

    Me: Please, summarize the thread for me.

    AI: The email outlines a series of technical issues with a server named Lucy, leading to significant downtime. The problems include a RAID controller failure, a temporary fix that later exacerbated issues, and subsequent hardware failures.

    Despite extensive efforts to restore the system, further complications arose, resulting in a prolonged outage. The sender details the steps taken to address the situation, including restoring from backups, hardware cloning, and transitioning to a new backup system.

    The email concludes with an apology for the disruption, a request for forgiveness, a promise to be better prepared in the future, and a plea for a second chance. Additionally, the sender offers financial assistance to affected resellers and discourages advice on radically changing the business model.

  • PineappleM Member

    Shit happens, no point beating yourself up over it. It is not your fault. Just learn what you need to and move forward.

    Murphy's Law is real: if something has the possibility to shit all over itself, it's all but certain to do so one day, and that day just so happened to be a few days ago.

  • I have lost trillions of dollars

  • God bless Lucy. o:)

  • yoursunny Member, IPv6 Advocate

    TLDR but title says MXroute is deadpooling and we should open disputes for Black Friday $3/3y service.

  • On a serious note, this reminds me that I actually need to set up a proper email backup solution for my MXRoute accounts instead of using Thunderbird's email storage.

    Wish you the best of luck in getting everything sorted out :smile:

  • @Moopah said:
    On a serious note, this reminds me that I actually need to set up a proper email backup solution for my MXRoute accounts instead of using Thunderbird's email storage.

    What are your backup options?

  • @jar This clearly sounds like MXRoute's biggest mess up in its young history. That said, what was the second biggest mess after this, and how long ago?

  • Moopah Member

    @josephf said:

    @Moopah said:
    On a serious note, this reminds me that I actually need to set up a proper email backup solution for my MXRoute accounts instead of using Thunderbird's email storage.

    What are your backup options?

    Currently looking at imapsync, offlineimap3, and piler (not free) and backing it up to my repuc VPS.

  • Francisco Top Host, Host Rep, Veteran

    @josephf said:
    @jar This clearly sounds like MXRoute's biggest mess up in its young history. That said, what was the second biggest mess after this, and how long ago?

    Starting mxroute.

    I kid I kid.

    Francisco

  • HostEONS Member, Patron Provider

    @jar these kinds of issues are faced by almost all providers at some point, and this may not be the last time, but it happens.

    Moreover, regarding the MySQL rsync, I think setting up a MySQL master and slave might be a better option, because if the master crashes you can simply promote the slave to be the new master. You can even do the rsync from the MySQL slave: just stop the slave before doing the rsync, then start replication again once the rsync is done.

    A few years ago I was handling a huge MySQL cluster for a dynamic DNS service provider with millions of records in it, and that slave saved me on multiple occasions, because restoring a huge DB could take days; if the slave has all the data synced, just changing the MySQL server in your scripts (or whatever you are using) brings everything back online quickly.
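    A minimal sketch of that stop-the-slave-then-rsync idea (the host name and paths below are placeholders, not anyone's real setup, and it assumes socket auth to the local mysql client):

        import subprocess

        def mysql_on_slave(sql: str) -> None:
            # Run a statement on the replica via the local mysql client.
            subprocess.run(["mysql", "-e", sql], check=True)

        # Pause replication so nothing is applying changes while we copy.
        mysql_on_slave("STOP SLAVE;")
        try:
            # Copy the now-quiet data directory off-box.
            subprocess.run(
                ["rsync", "-a", "--delete", "/var/lib/mysql/",
                 "backup@backup1.example.com:/backups/mysql/"],
                check=True,
            )
        finally:
            # Resume replication; the slave catches up from where it left off.
            mysql_on_slave("START SLAVE;")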

  • MannDude Host Rep, Veteran

    MXRoute didn't fail. One server did. The server we're on with you has been, and continues to be, fantastic. Shit happens.

  • darkimmortal Member

    The missing part for me is why not use a copy of the half working system as a starting point to restore the backup to, to reduce restoration time?

    It seems the two worst case approaches were taken: fixing the half working system and restoring the backup from scratch

  • DP Administrator, The Domain Guy

    @Francisco said: Starting mxroute.

    BuyMX.

  • @DP said:

    @Francisco said: Starting mxroute.

    BuyMX.

    MXSolutions and MXBytes

  • DP Administrator, The Domain Guy

    @Moopah said:

    @DP said:

    @Francisco said: Starting mxroute.

    BuyMX.

    MXSolutions and MXBytes

    MXDoc.

  • MXServers thanks you for the feedback.

  • @DP said:

    @Moopah said:

    @DP said:

    @Francisco said: Starting mxroute.

    BuyMX.

    MXSolutions and MXBytes

    MXDoc.

    AlphaMX
    WootMX
    HXMX
    NFPMX
    QuadraMX
    PsychzMX
    MXCrossing

  • HostEONS Member, Patron Provider

    @cybertech said:
    MXServers thanks you for the feedback.

    Your MailBoxes have been doubled

  • @febryanvaldo said:
    Me: Please, summarize the thread for me.

    AI: The email outlines a series of technical issues with a server named Lucy, leading to significant downtime. The problems include a RAID controller failure, a temporary fix that later exacerbated issues, and subsequent hardware failures.

    Despite extensive efforts to restore the system, further complications arose, resulting in a prolonged outage. The sender details the steps taken to address the situation, including restoring from backups, hardware cloning, and transitioning to a new backup system.

    The email concludes with an apology for the disruption, a request for forgiveness, a promise to be better prepared in the future, and a plea for a second chance. Additionally, the sender offers financial assistance to affected resellers and discourages advice on radically changing the business model.

    My lack of attention spans thanks you.

  • jar Patron Provider, Top Host, Veteran

    @darkimmortal said:
    The missing part for me is why not use a copy of the half working system as a starting point to restore the backup to, to reduce restoration time?

    It seems the two worst case approaches were taken: fixing the half working system and restoring the backup from scratch

    It's all choices and everyone will have an opinion, but I also got railed on for the data I had to reconstruct after restoring first from the damaged FS. Someone is mad no matter what I do, but it'll be nice this time to not have to rebuild passwd files and ask users to reset their passwords again (because I can't just copy those from JB; the only way I know how is to formally restore its backup). But I had the backups on the server that was most ready to play the role, so I made a choice.

  • This is unacceptable.
    Outrageous!

    I can't believe I RECENTLY PAID THE WHOLE $6 for three years of this "service".

    Terrible.
    TERRIBLE.

    Has anyone started the class action lawsuit yet?
    Count me in please.

  • that's rough. a damaged FS, especially from RAID, needs a lot of time to repair. I personally would just restore from backup; if you didn't have any decent backup yet, consider your backup has been doubled

    i recall you have dedicated storage with churchbit so i guess we don't really have to worry about this in future?

  • @ScreenReader said:
    that's rough. a damaged FS, especially from RAID, needs a lot of time to repair. I personally would just restore from backup; if you didn't have any decent backup yet, consider your backup has been doubled

    i recall you have dedicated storage with churchbit so i guess we don't really have to worry about this in future?

    churchbit? @crunchbits , I hereby proclaim that you are a tax-exempt religious organization

    Thanked by 1ScreenReader
  • @Moopah said:

    @ScreenReader said:
    that's rough. a damaged FS, especially from RAID, needs a lot of time to repair. I personally would just restore from backup; if you didn't have any decent backup yet, consider your backup has been doubled

    i recall you have dedicated storage with churchbit so i guess we don't really have to worry about this in future?

    churchbit? @crunchbits , I hereby proclaim that you are a tax-exempt religious organization

    a surprise, to be sure, but a welcome one

  • Bought a lifetime in 2019, and picked up a $30/3yr on BF. MXRoute has been a rock-solid service that I've not even had to think about since I started... it's always just worked for me. I used to host my main email domain with iCloud custom domain, but moved it over to MXroute recently. I really appreciate what you've provided me, @jar

    Sorry to hear of your recent server woes...I hope that things stabilize enough to get back some of that sleep you've missed.

  • @MannDude said: One server did

    yeah, indeed!

  • DataRecovery Member

    @josephf said:
    @jar This clearly sounds like MXRoute's biggest mess up in its young history

    Domain Name: MXROUTE.COM
    Creation Date: 2013-10-14T22:37:59Z

    MXroute turned 10 this year.

    So... For the first time one server went down for several hours.

    OMG! TEH DRAMA!!
    Billions lost!
    Fresh dickpics not delivered! Hard-earned ones are in danger of extinction.
    Doomsday!

    P.S.
    Sorry guys for reminding you how time flies.

  • i'm not using mxroute because i missed the best promo for the cheapest plans, but i'm sure you already have a well-thought-out plan in action to avoid this scenario if it occurs again. also, i'm a little bit curious about the RAID/FS being bombed; care to share a detailed explanation?
