LEGACY-03.LV.BUYVM.NET is down now. - Page 3
Comments

  • @msg7086 said:

    @sin said:

    Francisco said: Some of it was QCOW being stupid.

    My slabs weren't in use or partitioned at the time of the power outage, they seem fine though and my VPS is recognizing them - are they safe to use now? Thanks :)

    You should open a ticket to see if yours are in qcow or raw format. I believe Fran has started converting people to raw, so if you haven't started using it, it's a good time to do so -- just a re-provision.

    Thank you! :) I opened a ticket and they're going to convert them to RAW for me.

  • @Francisco said:

    @PieNotEvenEaten said:
    Well this gives the whole story now thanks! Sounds like someone was impatient.

    No, I was simply asleep when the initial ticket came in.

    The OP's not at fault at all, this is squarely on me for not addressing the 2nd power feed back in August.

    Truth be told I came back from Vegas and was in the midst of finalizing things but ran into some annoying medical issues so i've been preoccupied with that for the entire year (still dealing with it now). It simply slipped my mind and bit me.

    Francisco

    OP said he waited 24 hours. You say around 6 hours. Those are in major disagreement.

    If OP received an email status update and didn't wait the 24 hours as he stated he did, then OP did tell a different story and was impatient.

  • @eol said:

    @tcp6 said:
    Condoms are for pussies.

    You want a condom during penetration tests to avoid SQL injection.

    I don't want to know what your coworkers have been telling you, but you're doing it wrong. If they tell you that you need to "firewall" their penis with your mouth, don't fall for that!

  • Question: what exactly does this mean? Who does the certification and how often?

    SAS-70 / SSAE16 Type II certified datacenter facility with redundant power and HVAC systems.
    
  • @Xenos said:
    @Francisco was transparent about the situation and was on top of the issue with my block storage. I feel like I should be paying him a lot more for the service I receive.

    Holy fucking Kool-aid drinker, Batman! This is an epic fuck up that led to data loss, and the customers feel guilty and think they should pay more.

    Francisco, future president in the making.

  • Daniel15Daniel15 Veteran
    edited December 2018

    @deank said:
    Backups are for pussies. Real men take real risks and shed manly tears later.

    Be a man and don't make backups.

    My first dedicated server had a single 80 GB IDE hard drive, and I never took backups. The hardware wasn't even enterprise-grade, it was a WD Green HDD and I'm pretty sure the 'server' was just a tower PC sitting on a shelf in the DC somewhere. I'm lucky I never lost any data. It's a bit scary to think about now :)

    BuyVM uses RAID so you're less likely to lose data, and yet I still have daily backups and verify my backups at least monthly.
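    The daily-backup-plus-monthly-verify routine Daniel15 describes can be sketched with nothing beyond tar and sha256sum; all paths here are illustrative:

```shell
# Daily: archive the data and record a checksum alongside it
mkdir -p data backups
echo "hello" > data/file.txt
day=$(date +%F)
tar -czf "backups/backup-$day.tar.gz" data
sha256sum "backups/backup-$day.tar.gz" > "backups/backup-$day.tar.gz.sha256"

# Monthly: verify the checksum, then do a test read of the whole archive
sha256sum -c "backups/backup-$day.tar.gz.sha256"
tar -tzf "backups/backup-$day.tar.gz" > /dev/null && echo "backup verified"
```

    A backup that is never test-read is exactly the kind that turns out to be corrupt when an incident like this one forces a restore.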

  • deankdeank Member, Troll

    Vote f0r Fran in 2020.

    2024, 2028, etc.

  • @TimboJones said:
    Holy fucking Kool-aid drinker, Batman! This is an epic fuck up that led to data loss, and the customers feel guilty and think they should pay more.

    Making a mistake is not that horrible. The point is to gain the experience and not repeatedly make the same mistakes.

  • deankdeank Member, Troll

    Which is lost on the majority of LET lurkers wh0 sign up for the cheapest deals.

  • sidewindersidewinder Member
    edited December 2018

    @TimboJones said:

    There was a post the other day from a guy pissed he didn't get an email after several days of downtime and didn't file a ticket or look at the network status page but demanded being told what was in the network status update already. You'd have a case with that guy bashing provider, but not this guy.

    That was me and guess what? The status page said there was "an incident". It had no ETA for recovery and no explanation for what was going on. Additionally, there was a link you could click for a more detailed explanation, but guess what? Hyperlinks are supposed to be underlined for a fucking reason, and all these assholes and their CSS have ruined the web for colorblind people like me when they ** DON'T UNDERLINE THEIR FUCKING HYPERLINKS **. Why? Because we can't tell it's supposed to be fucking clicked on. Why? Because it's not fucking underlined, per the original HTML spec.

    IF they are going to throw up all over the Internet with their retarded CSS, hosts should at least send an email with an explanation, end of discussion.

    "There was an incident in Los Angeles" with a fucking invisible hyperlink isn't good enough.

    YOU DIG?

  • I got an email dated Sat, 22 Dec 2018 17:15:42 +0000 titled "Partial Power Outage in Las Vegas" that described in detail what had happened.

  • I got an email dated Fri, 34 Nov 1864 25:69:42 +0000 titled "BUY PENIS PILLS NOW 30% OFF" that described in detail what could have happened.

  • @sidewinder said:

    @TimboJones said:

    There was a post the other day from a guy pissed he didn't get an email after several days of downtime and didn't file a ticket or look at the network status page but demanded being told what was in the network status update already. You'd have a case with that guy bashing provider, but not this guy.

    That was me and guess what? The status page said there was "an incident". It had no ETA for recovery and no explanation for what was going on. Additionally, there was a link you could click for a more detailed explanation, but guess what? Hyperlinks are supposed to be underlined for a fucking reason, and all these assholes and their CSS have ruined the web for colorblind people like me when they ** DON'T UNDERLINE THEIR FUCKING HYPERLINKS **. Why? Because we can't tell it's supposed to be fucking clicked on. Why? Because it's not fucking underlined, per the original HTML spec.

    IF they are going to throw up all over the Internet with their retarded CSS, hosts should at least send an email with an explanation, end of discussion.

    "There was an incident in Los Angeles" with a fucking invisible hyperlink isn't good enough.

    YOU DIG?

    Sucks to be colorblind. You might need to look at accessibility options, or maybe an add-on that will override the website's styles. Some Google searching made it sound like browser accessibility options should override the website's stylesheet. Since people have been complaining ever since this started, someone must have come up with a workaround for colorblindness.
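    One concrete workaround along those lines (my suggestion, not something named in the thread): Firefox still honours a per-profile userContent.css, which can force link underlines regardless of what a site's CSS does. The profile directory name below is a placeholder:

```shell
# Force underlined hyperlinks on every site via a Firefox user stylesheet.
# The profile directory name is a placeholder; find yours under ~/.mozilla/firefox.
profile="$HOME/.mozilla/firefox/example.default-release"
mkdir -p "$profile/chrome"
cat >> "$profile/chrome/userContent.css" <<'EOF'
/* Underline all hyperlinks, overriding site CSS */
a, a:visited { text-decoration: underline !important; }
EOF
# Firefox 69+ also requires toolkit.legacyUserProfileCustomizations.stylesheets
# to be set to true in about:config before user stylesheets are loaded.
```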

  • @msg7086 said:

    @TimboJones said:
    Holy fucking Kool-aid drinker, Batman! This is an epic fuck up that led to data loss, and the customers feel guilty and think they should pay more.

    Making a mistake is not that horrible. The point is to gain the experience and not repeatedly make the same mistakes.

    No, my point is data loss IS horrible and not as trivial as some are treating it. A mistake that leads to downtime is not that horrible. Data loss is.

    Learning mistakes are great - when it's not at the expense of your money and data.

    (I don't have services with BuyVM, just speaking in general. People are really chill about this and that's not normal).

  • Email from BuyVM:

    Hello .......,
    
    We wanted to give you an update on our Storage cluster.
    
    As many of you have seen, we've had some issues where end user volumes
    required reattaching, needed an FSCK, or in a few rare cases, restoring
    from user backups.
    
    While the UPS dropping its power load is the main reason this all happened,
    a lot of the blame falls squarely on myself for not laying out a "Plan Of
    Action" to make sure the cluster was ready to come back online. Truth be
    told, I simply panicked once I saw the amount of services affected and
    started booting services as soon as possible.
    
    While the filesystems that make up our cluster all claimed to be fine, that
    really wasn't the case. A forced FSCK has been required to make sure things
    are *really* OK. Unfortunately this wasn't realized until Sunday when issues
    had already appeared.
    
    Since then we've had to pause each storage node to run FSCK's and clear page
    caches just to make sure things are in good shape. Physical hardware has
    been inspected to make sure there's nothing damaged, with full array
    check/verify passes scheduled for Christmas Day.
    
    An additional power feed (to round out our A+B configuration) has been
    ordered and we expect it to be installed for this storage cluster within the
    next 2 weeks, possibly even the end of this week. This power installation
    won't require any downtime to complete. This should have been in place
    before the cluster ever took any customers, but personal medical issues have
    kept me preoccupied for much of the year, leaving a few projects like
    this lingering.
    
    We truly apologize to anyone and everyone that has had a rough few days with
    us. I've personally spent over a year researching, designing, testing,
    retesting, and finally putting it live this past May. To say I'm upset that
    something outside of our control has caused this big of a headache would be
    an understatement.
    
    We'll continue to work to assist any customers still having issues. We'll
    also use our findings to better improve our Vegas offers, as well as our
    future roll outs in both New York & Luxembourg.
    
    Please don't hesitate to ticket for any support you require. While we're
    short staffed for the holidays, I'll personally be around helping as best I
    can when I can.
    
    We'd like to wish everyone a Merry Christmas, a Happy Holidays, and a
    prosperous New Year.
    
    Thank you,
    
    Francisco
    BuyVM
    

    Don't complain about BuyVM, there are too many fanboys around here :)

    I'm a bit disappointed with this problem, but luckily I had a spare backup somewhere else, and it only took a few minutes to restore my interrupted services. But I'm glad they're transparent about this problem; looking forward to better service from BuyVM.

    Good luck with the fsck.

    cheers
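    The per-node pass the email describes (forced FSCK, then clearing page caches) looks roughly like this. It is demonstrated on a scratch file-backed ext4 image so nothing real is touched, and assumes e2fsprogs is installed:

```shell
# Scratch ext4 image standing in for a storage node's filesystem
truncate -s 64M scratch.img
mkfs.ext4 -q -F scratch.img

# Forced check: -f checks even when the filesystem is marked clean,
# which is exactly the trap described in the email above
e2fsck -f -y scratch.img

# On a real node (root required): flush dirty pages, then drop the page cache
# sync && echo 3 > /proc/sys/vm/drop_caches
```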

  • @Daniel15 said:

    @dajiba said:
    I spent the whole day, and I found that my 500g of data was unrecoverable, damn it.

    Restore from backups? You do have backups, right?

    My VPS was fine after the power outage, but my slab was corrupted to the point where it won't even mount any more. No worries, I do daily backups so I just restored from backup onto the VPS, and will wipe the slab, reformat it, and move everything back once I've got some free time.

    If you don't have backups, maybe this is a wakeup call. :)

    Yes, I always have a remote backup, but the number of files is huge and it will take a long time to back up again. Hope it won't happen again :smiley:
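    A minimal restore-and-verify sketch along the lines Daniel15 describes, with a plain directory standing in for the reformatted slab (all files and paths illustrative):

```shell
# Pretend this archive is the latest off-site backup
echo "payload" > original.txt
tar -czf latest-backup.tar.gz original.txt

# The "wiped and reformatted slab" stands in as a fresh, empty target directory
mkdir -p restore-target
tar -xzf latest-backup.tar.gz -C restore-target

# Verify the restored copy is byte-identical before trusting it
cmp original.txt restore-target/original.txt && echo "restore verified"
```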

  • FranciscoFrancisco Top Host, Host Rep, Veteran

    @TimboJones said:

    @msg7086 said:

    @TimboJones said:
    Holy fucking Kool-aid drinker, Batman! This is an epic fuck up that led to data loss, and the customers feel guilty and think they should pay more.

    Making a mistake is not that horrible. The point is to gain the experience and not repeatedly make the same mistakes.

    No, my point is data loss IS horrible and not as trivial as some are treating it. A mistake that leads to downtime is not that horrible. Data loss is.

    Learning mistakes are great - when it's not at the expense of your money and data.

    (I don't have services with BuyVM, just speaking in general. People are really chill about this and that's not normal).

    I'm doing my very best to not gloss over what happened and if you read the big email I sent on Christmas day, you'll see it was just a lot of fumbling on my part and not having a "plan of action" for such cases.

    The vast majority of users with issues have been fixed with a minor FSCK. Some users had serious damage, but a chunk of that was related to QCOW not checking properly. We had to apply a lot of experimental patches to qemu-img in hopes of repairing some users' volumes, but they didn't work. There are outstanding issues in qemu-img where it tries to allocate TBs of memory to work. QCOW was originally picked so we could offer snapshots/backups of block storage in the New Year, but that's not going to happen. We've swapped all new provisions to RAW-based images and will find a way to snapshot/backup those instead.

    Some users had total failure, there's no excusing that. We've done our best to accommodate people, usually with extensive credits, or in some cases, an extra volume on the house so they can do their own RAID1 in their setup. If you had a failure and haven't talked to us, do so. Most people get around 3 months credit.

    The additional power feed goes in early next week, so this particular issue won't happen again. It should never have happened, but with me having medical issues for the past 6 months, it simply got lost on my TODO list. A "plan of attack" has also been written so we know what did/didn't work if there's ever another issue.

    The platform itself didn't fail, it kept chugging along. The underlying XFS filesystems didn't fail either; they simply didn't get the xfs_repair they should have. The failure is squarely on me for panicking to get things resolved and not thinking clearly. It happens when you're half asleep.

    We fucked up, we'll make sure it doesn't happen again.

    Francisco

  • @ersite Santa Claus was bringing presents earlier in the datacenter, tripped on a cord, and fucked up the server, and the whole server room floor to be more precise. They're changing the piping as we speak.

  • deankdeank Member, Troll

    Has Santa Claus been arrested?

  • No, he is still at large.

  • deankdeank Member, Troll

    I've been hearing that GDPR is after Santa Claus. I also hear Google is after him for his data.

    A message to Santa Claus: Stay strong. Don't let them have your data, which includes your name, address, and sexual orientation.

  • "Glorified hobby shop" my ass. Good work Mr
    San @Francisco :)

  • Frantastic.

  • Out of curiosity, did you observe any difference in recovery success between encrypted and non-encrypted volumes?

  • @Francisco said:

    We fucked up, we'll make sure it doesn't happen again.

    Francisco

    Reminds me of the parable where a new hire fucked something up and expected to be fired. When asked, the boss said, "why would I hire someone else to repeat that same mistake, when I know this guy won't do it again?"

  • FranciscoFrancisco Top Host, Host Rep, Veteran
    edited December 2018

    @dahartigan said:
    Out of curiosity, did you observe any difference in recovery success between encrypted and non-encrypted volumes?

    Nope. The few people I talked to with encrypted volumes usually got whacked by qcow issues, so recovery wasn't possible.

    One customer had me transfer him a copy of his broken QCOW so he could see what he can fix. As mentioned, there have been lots of patches trying to address the memory allocation issues, but we haven't had much luck. One patch set is from just last week.

    Francisco

  • @Francisco said:

    @dahartigan said:
    Out of curiosity, did you observe any difference in recovery success between encrypted and non-encrypted volumes?

    Nope. The few people I talked to with encrypted volumes usually got whacked by qcow issues, so recovery wasn't possible.

    One customer had me transfer him a copy of his broken QCOW so he could see what he can fix. As mentioned, there have been lots of patches trying to address the memory allocation issues, but we haven't had much luck. One patch set is from just last week.

    Francisco

    Thanks for answering that :)

  • FranciscoFrancisco Top Host, Host Rep, Veteran
    edited December 2018

    dahartigan said: Thanks for answering that

    No problem.

    One of the patch sets I applied was https://patchwork.kernel.org/patch/10731187/ but it doesn't actually fix anything. It doesn't OOM, but it more or less marks every single cluster/block in the image as bad, which is incorrect.

    Any drives that have been affected by it I've put away for safekeeping, and I'll test future patch sets against them.

    Francisco

  • Do any of these patches claim to fix issues after the fact? I'd have expected them to stop corrupted data from being written, rather than fixing corruption that already happened.

    That they were released so recently suggests qcow2 wasn't/isn't really ready for production.

    I've been a bit tied up with holiday travel and haven't had a chance to mess with the qcow image much yet, but the conversion utilities don't work on it (they report corruption) and it looks like the file offset table has gotten messed up. I want to look into how those table entries are allocated (hopefully something simple) and see if there is any hope of reconstructing them.
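    For anyone poking at a broken image, the offset tables sidewinder mentions are reachable straight from the qcow2 header, which is big-endian with fixed field positions (magic at byte 0, cluster_bits at byte 20, l1_size at byte 36, l1_table_offset at byte 40, per QEMU's qcow2 format spec). A self-contained sketch that builds a minimal fake header and reads those fields back with coreutils only:

```shell
# Build a fake 48-byte qcow2 v2 header (big-endian fields, octal escapes)
{
  printf 'QFI\373'                           # magic "QFI\xfb"
  printf '\000\000\000\002'                  # version = 2
  printf '\000\000\000\000\000\000\000\000'  # backing_file_offset
  printf '\000\000\000\000'                  # backing_file_size
  printf '\000\000\000\020'                  # cluster_bits = 16 (64 KiB clusters)
  printf '\000\000\000\000\004\000\000\000'  # virtual size = 64 MiB
  printf '\000\000\000\000'                  # crypt_method
  printf '\000\000\000\001'                  # l1_size (number of L1 entries)
  printf '\000\000\000\000\000\003\000\000'  # l1_table_offset = 0x30000
} > hdr.bin

# Read the fields back: od -j skips to the offset, -N limits the byte count
od -An -tx1 -j 20 -N 4 hdr.bin | tr -d ' \n'; echo '  <- cluster_bits'
od -An -tx1 -j 40 -N 8 hdr.bin | tr -d ' \n'; echo '  <- l1_table_offset'
```

    If the L1 table at that offset is trashed, each surviving L2 table still maps guest clusters to host file offsets, which is what makes partial reconstruction at least conceivable.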

  • I think this just shows that no matter how well a host's storage array is set up, a diligent VPS customer should have a backup in another location to mitigate rare events like this.
