Comments
You can joke all you want but this is seriously outrageous. I've been a customer for I don't know how many years and I've never experienced any downtime and therefore was never eligible for any compensation. But on top of that I just found out some people are paying $2-3/year while I've been paying $5. Ridiculous, scam, avoid
jar is a man of honour!
I haven't paid attention to the minute-by-minute saga, but have any customers lost email data (once all is said and done)?
If we don't recover anything from the previous Lucy server, they will have lost a week of new data.
As a customer on the Lucy server, I found the handling of the incident very professional, both technically and in terms of communication. By the way, I was surprised by the $25 compensation for my $50 annual plan. I didn't expect much.
I can only reiterate my confidence in the service.
@jar
Do you have HDDs or SSDs in the Lucy server?
Is the OS partition on the same drives as the email datastore?
Is it a crazy idea to have some automatic configuration so that customers can still send and receive email if a server goes down (without access to their existing email folders until the main problem is solved)?
With that, you could minimize the impact of a full server shutdown, and the recovery process would be less stressful.
Hope you find a good and cost-effective way to continue with your solid product.
MariaDB states that the SHUTDOWN command will go through and kill all client connections in random order. However, I think the engine still has to go through and roll back all their transactions, so if you have a lot of uncommitted data, that could take some time.
If it makes you feel any better, $50,000-a-core Oracle has the same issue even though it has a SHUTDOWN IMMEDIATE command. They provide a SHUTDOWN ABORT which is prompt, but it's essentially the same as a kill -TERM.
Now that I read that MariaDB KB, it looks like a shutdown can strand non-replicated data from the replicas, and this is the default. If you shut down via systemd, this is what happens - you have to go in and do a SHUTDOWN WAIT FOR ALL REPLICAS (or WAIT FOR ALL SLAVES if you're not running woke code) to prevent it. But then you're waiting for rollback and replication.
MariaDB doesn't really say that this data will then be properly replicated once the instance is restarted...hopefully it is...? Oracle, SQL Server, and others properly pick up replication where it left off.
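If it helps anyone reading along, here's a minimal sketch of what a more careful manual shutdown could look like, assuming MariaDB 10.5.1 or later (where WAIT FOR ALL REPLICAS exists) and a root login over the local socket. This is just an illustration of the KB behaviour discussed above, not anyone's actual procedure:

```bash
# Sketch only: drain replicas before stopping the primary, rather than
# letting systemd issue a plain shutdown that can strand unreplicated data.

# 1. Optionally stop new writes so the rollback phase stays short.
mysql -e "SET GLOBAL read_only = ON;"

# 2. Ask the server to wait until every connected replica has received all
#    binlog events before exiting (use WAIT FOR ALL SLAVES on pre-10.5.1).
mysql -e "SHUTDOWN WAIT FOR ALL REPLICAS;"

# 3. Only then let systemd clean up the already-stopped unit.
systemctl stop mariadb
```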
The title of this thread is unfair. The company didn't fail; only one particular server, serving a proportion of customers, did. The title is much too broad.
I'm not a MXroute customer yet I already lost quadrillions because of this.
When MXroute goes down it takes down half of the world with it. Cascading failures everywhere.
This email server failure has resulted in the start of the next doomsday apocalypse, with catastrophic world-wide impacts.
Jarland - I am so very sorry to hear about this. I went through something much the same myself in 2018. I remember working 96 hours straight without a break and even then - there was still work to be done.
You will make it past this and you will learn from the mistakes made. I often say that you don't know what you don't know - and this is definitely a case of that.
Once all of the dust has settled reach out to me and I can share with you some of the lessons I learned back in 2018 - as they may help you.
If there is anything I can do to assist between now and then - please do not hesitate to reach out. I am happy to help you however I can.
Ouch! Good luck getting the server going, sounds like a big task. My two services on Taylor and another one are working fine as always, thankfully.
@jar
I'm not a customer of yours, but I sincerely wish you and your business well!
If you feel that I could somehow be of help don't hesitate to shoot me a PM!
I’m not on the affected server but anyway, please buy a service from me
Great write-up of a very tough, unfortunate situation, great transparency, great honesty, also nice to see the community be so supportive.
Big reminder to all of us about disaster recovery too... worth spending some time reviewing our own houses.
Wishing @jar well, don't let your blood pressure get the better of you.
For what it's worth, the way MDDHosting handled their 2018 outage put them on my shortlist of trustworthy providers I should try.
And now I am a very happy customer.
Likewise, Jarland and MXroute are my favourite email provider. I was not in the least surprised to read this (the OP post of this thread): a no-nonsense, no-bullshit owning of the mistake(s) and learning from them, while doing all in their power to fix them.
Only those who never do any work can afford to never make any mistakes. The difference between the good and the bad providers is how they handle mistakes, and that says a lot about the business and the character.
I feel very sorry for the amount of work and stress that Jar must be going through. Wish I could help somehow (if there is something I can do, let me know).
I still use & recommend MXroute to anyone who asks about email. Came for the low price, stayed for the awesome service and integrity - that's the kind of folks I like doing business with.
I'll wrap this up with the story about Magic boxes.
Stay cool, watch out for your health, and test your backups every now and then.
Relja
I would bid
@jar sorry to hear about the crisis and best of luck with solving everything
I've always admired your work ethic, generosity and style, even though I've never been a client of yours (yet), just from following the threads you post in.
But make sure you remember to get enough rest/sleep while you try to get everything perfect again
By the sounds of it, and if you're like me, you might be so focused on fixing everything during a crisis and trying to make everyone happy as quickly as possible that you neglect your own well-being
so schedule sleep /rest / meals, even if you have to set alarms as reminders! :-)
only for death is there no solution
@jar Kudos to you for being so transparent and honest, I wish all providers were like you with this stuff!
You have stated quite clearly that you are not interested in comments about redundancy etc. and that you are not willing to change your business model, so sorry in advance.
But to me, if I can be honest, this sounds like a deliberate decision to sacrifice or reduce reliability for your customers in order to keep the service as affordable/cheap as possible. I question whether that's the right approach for an email service, considering how long this outage was and the problems it may have caused your customers. I guess some customers use the service for their businesses.
If I were you, given what happened on this occasion I would definitely take steps to make things more resilient than they are now, and not rely just on backups that might take a long time to restore.
Shit happens, but while it's good to have backups etc., I would also prefer to have some redundancy and minimise as much as possible the chances of this kind of outage taking so long. Ceph or whatever for a distributed file system with multiple replicas, database replication with automated failover, etc. - the usual stuff.
If that means raising the pricing accordingly, I would definitely do that if I were you. I am pretty sure that most of your customers would be happy to pay somewhat more for a service that can be trusted to be more reliable. Several people have even told you this on various occasions, including in this thread. The customers who would complain if you raised your already very low prices would probably be better off with some shitty service instead, if price is all that matters.
Personally, when I signed up for the service I was obviously extremely happy with the price, but I didn't expect the lack of redundancy, since it's a managed email service. This makes me realise that if I self-host my email, it's quicker for me to restore from backup - since it's just my data, I am literally up and running on another server within half an hour - than to wait for a restore if my MXroute server goes kaput. And that is something that surprises me, to be honest, in a managed service.
I guess you thought I was kidding, but alright. You want to go there. I warned you how much I didn't want to hear it.
We went 10 years straight without an issue this bad, just keep that in proper context. I've never seen a giant Ceph cluster go a decade without user-impacting issues. In fact, one of our servers is on a giant Ceph cluster and it's the server with the most storage-related issues by far. All that complexity doesn't guarantee that failures never happen, it just guarantees that prices are higher and repairs are more complex. I doubt anyone would want to pay a 4000% increase in price in an overreaction to provide a false sense of security that promises 8% of users won't experience a major issue in 10 years (and most likely gets that promise wrong too, thus the false part). Especially when I can promise a better future response to a "once in a decade" issue without a price increase, further reducing the value of increasing prices to enact such an unimaginative solution. This is exactly why I don't want to hear this shit. I'm not stupid, it's not like I don't know that replicated storage systems exist. I weigh the pros and cons; this was just poorly planned disaster recovery.
My new plans address the extremely rare issue much better, I just didn't get a chance to enact them all yet because of the most incredibly bad luck in the history of servers. That's all. Please, leave me alone with that shit. I don't need the opinion of a hobbyist that over replicates everything for fun. Your data is not like my data, your challenges are not like my challenges. I don't get the beauty of running a company like an armchair admin, I have to actually be the admin. For you this is all a hobby, everything you see is awesome one day but then subpar the next when you see a new cool thing. I get it. But unless you have about $500,000 I can have to build out a totally new stack for the fleet, I don't want to hear it. I wasn't joking.
Maybe you shouldn't judge me by how a multi-billion-dollar business runs today, but by how they were running when they dealt with their FIRST major failure. I'll judge myself just fine. If you want to go build another Gmail and charge $5/m per user, feel free. That's not what I'm doing here, and fuck you for implying that I'm intentionally trying to screw people by not doing it. I've been honest about my plans the entire time. When I figure out better solutions that allow us to keep our pricing strategy, I'll enact them, but creative solutions to keep prices low have always been the focus. Creative solutions aren't low-hanging fruit. What you're talking about is low-hanging fruit. Yes, I could set fire to the entire thing and start over with 4x the infrastructure and 400x the price, or I could just recommend any number of email services that already do that. That's boring. That's not the business I'm trying to run. If it got me this far in 10 years, there must be something to it.
No matter how many people try to discourage me from going down the path I’m on, it’s clear to me that more than 90% of my users are with me because of the path I’ve chosen. I won’t let them down by giving up.
As someone who idles MXroute, I think you do a good job
I still have an old cpanel MXroute service, works well enough
Second that, 8 years and counting by the way.
You don't necessarily have to use Ceph, or expensive storage.
Purchase another copy of the main server with exactly the same specs and replicate every few hours. In that case, the cost increases by 100%.
Or use a replication server with lower specifications than the main physical server, and the cost increase is probably 30-50%.
Get up and running in minutes/hours with minimal data loss.
We're getting into your business a lot, but maybe you'll find a simple solution that works.
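To make that suggestion concrete, a rough sketch of the "replicate every few hours" idea could be as small as a cron-driven rsync plus a database dump to a cheaper standby box. The hostname, paths and database name below are made-up placeholders for illustration, not MXroute's actual layout:

```bash
#!/bin/sh
# Sketch: periodic one-way sync of the mail store to a warm standby.
# "standby.example.com", /var/vmail and the "mailserver" database are
# hypothetical names. A cron entry such as
#   0 */4 * * * root /usr/local/bin/sync-to-standby.sh
# would run this every few hours.

# Copy the maildirs, preserving permissions, hard links, ACLs and xattrs.
rsync -aHAX --delete /var/vmail/ root@standby.example.com:/var/vmail/

# Dump the account/domain metadata so the standby can recreate users.
mysqldump --single-transaction mailserver | \
    ssh root@standby.example.com 'cat > /root/mailserver.sql'
```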
That's basically a slight variation of my new backup plan if you haven't been paying attention. But 200% cost for every server over the course of 10 years so that I might save everyone 2-4 hours of downtime per decade is fucking ridiculous and you know it. Given my new plans, 2-4 hours of downtime is what we're looking at when I finish enacting the new backup plan.
Working directly with end customers, I think that being operational is more of a priority than the data at the beginning. Customers want to receive that email someone just sent them, and it is always urgent (the word I hear most).
I mean it as a question rather than a suggestion: if a similar problem happened, would it be possible to keep a secondary server running without the data, add the data back little by little, or synchronize everything once the data is recovered?
For example, when Lucy died, could you recreate only the users (and domains) on other servers so that customers at least receive the newest emails until the backup is restored?
Once I finish enacting the new backup plans, the plan for disaster recovery will be to get users online first and repopulate their previous data second. This should limit us to 2-4 hours of downtime in the event of a full system failure. Given that it took 10 years to see this dramatic of a failure, the only one to date that would even benefit from any variation on that strategy, I think it’s the right plan for retaining our prices while providing a better response in such a rare event.
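For anyone curious what a "users online first, data second" order of operations can look like in practice, here's a generic sketch assuming a Dovecot-based stack; the paths, hostnames and database name are placeholders, not MXroute's actual tooling or plan:

```bash
# Sketch: bring accounts back up empty, then backfill old mail from backups.
# All names and paths here are hypothetical.

# 1. Recreate domains and mailboxes on a fresh server from the most recent
#    account-metadata dump, so customers can send/receive new mail at once.
mysql mailserver < /root/mailserver.sql

# 2. Repoint DNS / MX records at the new box (provider-specific, not shown).

# 3. Backfill the old mail in the background; doveadm can merge a restored
#    maildir into a live mailbox without disturbing newly arrived messages.
doveadm import -u user@example.com maildir:/restore/user@example.com "" all
```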
That sounds a lot better. With this backup strategy, can you also easily recover data for a single customer?
Probably not any easier than the current plan. Which isn't that difficult, if that were something I set out to do at any particular moment.