lucy.mxrouting.net

Neoon · December 2023

@yoursunny said:

@josephf said:
I thought all the servers were named after days of the week.

A glance of the server names:
https://crt.sh/?q=mxrouting.net

"longhorn" any chance you still using Vista @jar ?

hdpixel · December 2023

Who setup secondary MX? Several of my users are going back to Google $6 plan. The timing is really bad.

jar · December 2023

Got it back online, repaired every service that was broken by xfs_repair, got all email flowing beautifully, was just about to mark the status resolved and...

input/output error

Still copying backups, but I wasn't ready to give up on the main server yet.

jar · December 2023

I'd like to point out that Jeff @qps drove 4 hours today and built me a new server because I was convinced that a hardware problem was making everything worse for the original server. I was obviously wrong. He knew I was wrong. He still did it.

I don't know how many of your providers would drive 4 hours and drop their whole day for you. He's got another 4 hour drive back.

JosephF · December 2023

@yoursunny said:

@josephf said:
I thought all the servers were named after days of the week.

A glance of the server names:
https://crt.sh/?q=mxrouting.net

I wonder if some servers are better than others. And if higher end plans are put on the better servers.

aj_potc · December 2023

@jar said:
I'd like to point out that Jeff @qps drove 4 hours today and built me a new server because I was convinced that a hardware problem was making everything worse for the original server. I was obviously wrong. He knew I was wrong. He still did it.

So, if not the other hardware, it was the storage in the original server that went bad?

jar · December 2023

@aj_potc said:

@jar said:
I'd like to point out that Jeff @qps drove 4 hours today and built me a new server because I was convinced that a hardware problem was making everything worse for the original server. I was obviously wrong. He knew I was wrong. He still did it.

So, if not the other hardware, it was the storage in the original server that went bad?

It’s hard for me to believe that a system boots beautifully, runs like a dream, and then spits out “input/output error” like someone just ripped out the disk, and it not be a hardware problem. But it has to be the file system doing it, because the disks always report as connected, we replaced 1 of the disks and let it rebuild the array for having only a few reallocated sectors, and then we replaced every single thing other than the drives themselves.

I guess it really does have to be the file system. Never again with xfs.

aj_potc · December 2023

@jar said:

I guess it really does have to be the file system. Never again with xfs.

I had a similar thing happen to me on a large storage VPS. Ran fine for a long time, then suddenly started throwing IO errors. The provider swore up and down that the RAID array was fine and blamed XFS, the file system I was using. I told him that he's crazy, that XFS is mature and stable, and that most likely some underlying issues with the hardware were to blame. He wasn't having it.

On the provider's advice, I reinstalled from scratch (though I stuck with xfs and didn't use ext4, as he demanded). Had no trouble booting the new system, but while restoring my multi-TB data over the next day, started to get IO errors again. I reinstalled once again, and the same thing happened within 48 hours: more IO errors.

Finally, I begged the provider to move me to a new node, where I set up the same system for a third time (again with XFS). This time all good. All data, some 16 million+ files, could be restored, and the system ran happily until I decommissioned it.

So, I tend to blame the hardware.

jar · December 2023

So to close out the problem that this thread is about:

We were able to gain better access to Original Lucy's file system and use that to rebuild the skeleton of the server. Accounts, email accounts, all that jazz. By withholding everyone's email we're able to get it online faster, and we'll sync everyone's previous email data back over next, which may take a few days to completely finish. In this, I think I've discovered a unique new method for quick disaster recovery but we'll talk about the details of that later in the postmortem.

Users can login to DirectAdmin now, and will be able to login to email within the hour (many already can). We're not accepting inbound email until directory permissions finish setting so it can be delivered. The mail queue on Original Lucy will be placed into the spool right before we open port 25. Crossbox will be down for longer (MySQL versions incompatible, we upgraded from Cent7 to Alma9, need to export DBs via chroot and that's going to take longer).

bikegremlin · December 2023

@hdpixel said:
Who setup secondary MX? Several of my users are going back to Google $6 plan. The timing is really bad.

This may not sound very comforting now, but maybe those users were not a good fit.

Situations like this can help you filter out any "Karens," and also strengthen your relations with the other clients by being open about what's going on and offering reasonable compensations (this does not imply that I expect, nor require any refunds from @jar - just get some sleep man ).

Relja

jar · December 2023

@bikegremlin said:

@hdpixel said:
Who setup secondary MX? Several of my users are going back to Google $6 plan. The timing is really bad.

This may not sound very comforting now, but maybe those users were not a good fit.

Situations like this can help you filter out any "Karens," and also strengthen your relations with the other clients by being open about what's going on and offering reasonable compensations (this does not imply that I expect, nor require any refunds from @jar - just get some sleep man ).

Relja

There’s nothing wrong with users moving on due to an outage like this. It’s very reasonable.

jar · December 2023

So I've got 2 postmortems for anyone interested. First, here's what I wrote over the course of 2 days. I did not go back and read it over after finishing it, I did not clean it up, it's a complete mess and was never intended to be posted anywhere as it is: https://mxbin.io/VDL3G7

The reason I am posting that here is that the more technical people may appreciate it for being a bit more detailed, even if a jumbled mess. But for the blog, I let ChatGPT clean it up, added a few edits to it of my own, and posted it here: https://blog.mxroute.com/postmortem-report-lucy-mxrouting-net-server-outage/

aj_potc · December 2023

So what's the final verdict on hardware failure? Do you suspect some issue with the RAID subsystem on the original server corrupted your file system?

Following from that, have you been successful so far in restoring from the disks swapped out of the original Lucy?

And last, have you considered a backup software such as Veeam or Acronis that uses block-level rather than file-level cloning? This may not make the recovery faster, but it would provide an alternative backup. You can run those backups to some external location in addition to maintaining rsync'ed hot spares. This is what I do, because I don't like to rely on just one backup method/software.

jar · December 2023

@aj_potc said:
So what's the final verdict on hardware failure? Do you suspect some issue with the RAID subsystem on the original server corrupted your file system?

Following from that, have you been successful so far in restoring from the disks swapped out of the original Lucy?

And last, have you considered a backup software such as Veeam or Acronis that uses block-level rather than file-level cloning? This may not make the recovery faster, but it would provide an alternative backup. You can run those backups to some external location in addition to maintaining rsync'ed hot spares. This is what I do, because I don't like to rely on just one backup method/software.

We're calling it filesystem failure. It's funny I can leave it mounted all day long and rsync data out of it, but if I boot from it it'll be "input/output error" in 30 minutes or less. But after a full rebuild of a RAID10 array and replacing every single thing but the 4 drives, hardware issue just doesn't seem like a reasonable conclusion.

From now on I'm going to use rsync for the whole OS. It's what I used to do, and if I did it here I would've been back online in 3-4 hours and just had some delay syncing user emails from the backup. Too many backup plans cost too much at our size, and I don't intend to raise prices over this. The chances of this kind of failure are low enough, the chances of this kind of failure and the failure of another server in another datacenter at the same time, those are odds I'm comfortable with.

darkimmortal · December 2023

It smells like the drives became inconsistent from the rest of the mirror at some point or maybe still are (would explain the mount+rsync vs boot behaviour difference)

Basically every fs will explode in that case with the exception of btrfs native raid 1, but that has its own downsides

qps · December 2023

@jar said: I'd like to point out that Jeff @qps drove 4 hours today and built me a new server because I was convinced that a hardware problem was making everything worse for the original server. I was obviously wrong. He knew I was wrong. He still did it.

I don't know how many of your providers would drive 4 hours and drop their whole day for you. He's got another 4 hour drive back.

It's actually a bit over 6 hours each way.

Anyway, I got to eat some gas station food (Wawa) in the middle of the night. Sadly, I think maybe that was the highlight of the trip?

bikegremlin · December 2023

@jar said:

@aj_potc said:
So what's the final verdict on hardware failure? Do you suspect some issue with the RAID subsystem on the original server corrupted your file system?

Following from that, have you been successful so far in restoring from the disks swapped out of the original Lucy?

And last, have you considered a backup software such as Veeam or Acronis that uses block-level rather than file-level cloning? This may not make the recovery faster, but it would provide an alternative backup. You can run those backups to some external location in addition to maintaining rsync'ed hot spares. This is what I do, because I don't like to rely on just one backup method/software.

We're calling it filesystem failure. It's funny I can leave it mounted all day long and rsync data out of it, but if I boot from it it'll be "input/output error" in 30 minutes or less. But after a full rebuild of a RAID10 array and replacing every single thing but the 4 drives, hardware issue just doesn't seem like a reasonable conclusion.

From now on I'm going to use rsync for the whole OS. It's what I used to do, and if I did it here I would've been back online in 3-4 hours and just had some delay syncing user emails from the backup. Too many backup plans cost too much at our size, and I don't intend to raise prices over this. The chances of this kind of failure are low enough, the chances of this kind of failure and the failure of another server in another datacenter at the same time, those are odds I'm comfortable with.

For practically every email account (either mine, or clients'), I've been using either Gmail POP3, or Thunderbird.

MXroute is awesome for reliable sending and receiving. But I thought that email backups should be "my" responsibility, so no emails are stored on MXroute.

I rely on your service for what it does best - deliver(abilit)y. The rest should be my concern, or you should charge accordingly. Hence, as far as I'm concerned, you should raise prices to match what works (and is reasonably profitable) for you.

P.S.

I've read your KB article about the downsides of using Gmail:
https://blog.mxroute.com/the-flaws-of-using-gmail-as-a-frontend-for-mxroute/

It's good, spot on. And forwarding is a bad idea (I called it "dangerous"). But POP3 has worked for me, and dozens of non-tech savvy clients just fine, for years now, with MXroute.

I wrote this (the sections 5 - and 6) primarily for my clients:
https://io.bikegremlin.com/10364/website-gmail/#5

I think it might be a good idea to link to that, for your clients who "insist" on using Gmail (could be wrong, but that's my opinion based on my experience so far).

Relja

mike1s · December 2023

@qps said:

@jar said: I'd like to point out that Jeff @qps drove 4 hours today and built me a new server because I was convinced that a hardware problem was making everything worse for the original server. I was obviously wrong. He knew I was wrong. He still did it.

I don't know how many of your providers would drive 4 hours and drop their whole day for you. He's got another 4 hour drive back.

It's actually a bit over 6 hours each way.

Anyway, I got to eat some gas station food (Wawa) in the middle of the night. Sadly, I think maybe that was the highlight of the trip?

Yep, that’s one thing I miss about when I lived in Florida… no Wawa in NC supposedly they’re coming out here tho

Turbo_Pascal · December 2023

After all this, I hope jar is looking forward to some much deserved rest during the holidays.

jar · December 2023

@Turbo_Pascal said:
After all this, I hope jar is looking forward to some much deserved rest during the holidays.

I’m happy for what I learned to do better, because it was quite a number of things. But damn am I ready to put it behind me. Ten years, that’s how long I went without a disaster of this magnitude.

Francisco · December 2023

@qps said: Anyway, I got to eat some gas station food (Wawa) in the middle of the night. Sadly, I think maybe that was the highlight of the trip?

Or the lowlight depending on how quickly it cut through you.

Francisco

default · December 2023

I have an idling account with Lucy, therefore I did not get to lose millions and be part of this drama.

Howdy, Stranger!

Categories

In this Discussion

lucy.mxrouting.net

Comments

Howdy, Stranger!

Quick Links

Categories

In this Discussion

lucy.mxrouting.net

Comments