Kernel panic - help is needed

Corey · September 2012

@Alex_LiquidHost you need to be proactively monitoring for bad disks man!

AlexBarakov · September 2012

I am.. Smart did not return anything disturbing.

Corey · September 2012

@Alex_LiquidHost and this was a raid1 array?

AlexBarakov · September 2012

RAID10 array. As far as I understood from the 3 system admins that took a look into it, it definitely is an HDD failure. The whole filesystem got corrupted and the RAID array itself has failed. Steven actually tried to rebuild it; however, I think he was not able to run fsck at the end. Same with Fran (failed fsck).

Corey · September 2012

@Alex_LiquidHost it is actually hard to believe that "Smart did not return anything disturbing." Especially if you were verifying the results with the hard disk manufacturer.

AlexBarakov · September 2012

@Corey said: it is actually hard to believe that "Smart did not return anything disturbing." Especially if you were verifying the results with the hard disk manufacturer.

I was not verifying the results with the manufacturer, I was looking for any errors reported by it. I guess this is a mistake on my side. We all learn from our mistakes, don't we?

Corey · September 2012

@Alex_LiquidHost yes we do..... were you paying attention to 'reallocated sector count' ?
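For anyone wanting to watch that attribute automatically, here is a minimal sketch, assuming the usual ten-column attribute table printed by `smartctl -A`. The sample text and the choice of watched attributes are illustrative assumptions, not output from the server in this thread. Note that a raw value of zero is not proof that a drive is healthy.

```python
# Attribute names as printed by `smartctl -A`; which ones to watch is an assumption.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} parsed from `smartctl -A` table output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            raw = fields[9]
            # Skip raw values that are not plain integers (some vendors print extras).
            if raw.isdigit():
                attrs[fields[1]] = int(raw)
    return attrs


def suspicious(attrs, watched=WATCHED):
    """Return the watched attributes whose raw value is above zero."""
    return {name: attrs[name] for name in watched if attrs.get(name, 0) > 0}


# Made-up sample in the layout smartctl uses; not taken from the failed node.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       12
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       8423
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    print(suspicious(parse_smart_attributes(SAMPLE)))
```

In practice the text would come from running `smartctl -A` against each disk on a schedule and alerting whenever `suspicious()` returns anything.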

Damian · September 2012

@Alex_LiquidHost said: The whole filesystem got corrupted and the RAID array itself has failed

Any RCA on that?

Francisco · September 2012

@Corey said: @Alex_LiquidHost it is actually hard to believe that "Smart did not return anything disturbing." Especially if you were verifying the results with the hard disk manufacturer.

It didn't.

Sentris, because they're awesome and all, used 4 100% different brands to build the array. SMART reported no reallocs or bad sectors, yet it refused to bring the 4th partition into play to bring the array back into a working state.

The FSCK was started in read-only mode and was simply a wall of "THIS FIX WILL LIKELY DESTROY A LOT OF DATA" scrolling by with 'NO' appended to it.

MDADM refused to stop the array so I could check into it more, so I rebooted the box to free up whatever locks were persisting. After a LOT of jury-rigging due to the HORRIBLE KVM they provided Alex, I got back into a maintenance Linux I could work in.

While this time it assembled the array (still broken), it refused to even see an ext3 partition existing. tune2fs showed nothing and testdisk couldn't find any more superblocks.

Now, the system was able to re-assemble his / partition (/vz having its own dedicated hunk), but it was trashed. Pre-FSCK, /etc/ was missing and most of /bin & /sbin were input/output erroring. Unmounted, FSCK'd and remounted. /bin/ wasn't input/output erroring any more, but the files were all scrambled and not actual binaries anymore. /etc didn't get restored and got plastered all over lost+found.

While the drives didn't report any SMART errors at the get-go, mdadm simply refused to assemble any of the arrays properly and marked a full physical drive as unsuitable for inclusion. We didn't run any long-term SMART tests, nor could I access /var/log to look for any SCSI disconnect errors from before his hard crash.
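Since the first visible sign here was mdadm dropping a drive, a degraded array can be caught from `/proc/mdstat` even when SMART looks clean. Below is a rough sketch, assuming the common `[total/active] [UU_]` status layout; the sample text is invented for illustration and is not from this box.

```python
import re

# Matches the status part of an mdstat line, e.g. "[4/3] [UUU_]".
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")
# Matches the line that names an array, e.g. "md1 : active raid10 ...".
HEADER_RE = re.compile(r"^(md\S+)\s*:")


def degraded_arrays(mdstat_text):
    """Return {array_name: (expected, active)} for arrays with missing members."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            # An underscore marks a slot whose member is missing or failed.
            if active < expected or "_" in status.group(3):
                degraded[current] = (expected, active)
    return degraded


# Invented sample: md0 is healthy, md1 has lost one of four members.
SAMPLE = """\
Personalities : [raid1] [raid10]
md0 : active raid1 sdb1[1] sda1[0]
      524224 blocks [2/2] [UU]

md1 : active raid10 sdc2[2] sdb2[1] sda2[0]
      1952987136 blocks 64K chunks 2 near-copies [4/3] [UUU_]

unused devices: <none>
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

On a live system the input would be the contents of `/proc/mdstat`, read periodically; mdadm also has its own monitor mode that can send alerts, which may be the simpler option.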

Many, many years ago I managed a very large gameserver that decided to order a bunch of nodes from them because they wanted to move to the west coast. Not only did we get bad drives installed in our arrays (noticed from the get-go, when we would clean reboot and drives were removed from the array), but they couldn't get our private network working properly. It took them the better part of a week to wire up our ports for a private LAN, and they kept giving us 100Mbit switches instead of the gbit ports we paid for (and which were listed as such on our invoice).

It's a mess and I blame it fully on Sentris. I am so sorry that I wasn't of more help, Alex.

Francisco

Mon5t3r · September 2012

@Francisco said: used 4 100% different brands to build the array.

Thankfully I decided not to go there, since their sales responses seemed to be hiding something. Btw, thanks Fran for your explanation; if you don't mind, I'll keep this for future reference.

AlexBarakov · September 2012

@Francisco - thank you for all the help today, as already said - it is really appreciated!

About Sentris - well, I do not really know what to say. Generally I was satisfied with them, until now. They lack support; however, for 1 year I had not had any hardware or other major problems at all, so I did not need to contact their support for pretty much anything. Their network is stable, so far with only one outage. No power cuts in the DC. Generally they are not that bad, if we cut the bad support from the equation. However, I did not notice the different brands in this server till yesterday; honestly, it never crossed my mind that there would be 4 different brands. And it never crossed my mind that a RAID10 array would fail because of bad drives; however, I guess it was my mistake.

They are currently rebuilding the node with new hard drives, as I requested all of them to be replaced with new drives. From that point on, I am not absolutely sure how I will proceed; however, once I get the ~40 clients on that node up, I will start thinking of alternatives. At this point I am leaning towards buying and colocating hardware somewhere in Seattle; however, I cannot make any actual promises to my clients, as I myself am not absolutely sure. The main priority is to get everyone back online, give proper compensation and start thinking about how to proceed from there.

I wanted to thank @Francisco one more time for trying to help me out.

emilv · September 2012

Did you make regular backups?

AlexBarakov · September 2012

@emilv said: Did you make regular backups?

The last backups of this node are from 2 weeks ago.

My TOS are clear that we do not take backups and that they are the sole responsibility of the clients; however, as a gesture of goodwill I have a local backup server that is synced to an external location as well. However, this particular server was rebuilt with a RAID10 array 2 weeks ago and I somehow did not enable the backups on this exact node. Extremely bad luck, I guess.
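A node that silently drops out of the backup rotation after a rebuild is the kind of gap a small freshness check can catch. This is only a sketch under assumptions: the node names, the one-day threshold and the way timestamps are collected are all hypothetical, not a description of my actual setup.

```python
from datetime import datetime, timedelta


def stale_nodes(last_backup, now, max_age=timedelta(days=1)):
    """Return sorted names of nodes whose last backup is missing or too old.

    last_backup maps node name -> datetime of the newest backup, or None
    if no backup has ever been recorded for that node.
    """
    stale = []
    for node, stamp in last_backup.items():
        if stamp is None or now - stamp > max_age:
            stale.append(node)
    return sorted(stale)


if __name__ == "__main__":
    # Hypothetical inventory: node names and dates are made up.
    now = datetime(2012, 9, 20, 12, 0)
    inventory = {
        "node-a": datetime(2012, 9, 20, 3, 0),   # backed up last night
        "node-b": datetime(2012, 9, 6, 3, 0),    # two weeks old
        "node-c": None,                          # never backed up
    }
    print(stale_nodes(inventory, now))
```

The point of listing every node explicitly, including ones with no backup at all, is that a freshly rebuilt server shows up as stale instead of being absent from the report.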

The data is at this point lost. I will get one of the working drives from the last array attached to another server and see if I am able to recover anything from it, and will of course provide access to the files to the customers on request. However, I have a feeling that this is a long shot.

emilv · September 2012

Feel for you man, sucks to be in this situation. Hope all goes well in the end.

AlexBarakov · September 2012

@emilv said: Feel for you man, sucks to be in this situation. Hope all goes well in the end.

Well, shit tends to happen from time to time.

The main priority at the moment is to get everyone online. Once done, I will finally be able to get some hours of rest, as it has been a tough 48 hours for me. From that point on I will start working with a "fresh head" on recovering any of the data and on other scenarios.

seanho · September 2012

Thanks so much, Alex and Fran, for the updates. I am sorry you had to go through all this headache; I guess those must have been pretty flaky hard drives to have up and died with no warning signs from SMART.

The data in my container was unimportant, but I do value your having a location in Seattle. Thanks for letting us know when the node is back up.

AlexBarakov · September 2012

As I know that a couple of my users are monitoring this thread, I will post it in here as well - the node with new hard drives is here, and I will wait for the RAID array to be rebuilt. Then I will recreate the VPS, and in case you have a new IP, you will get an email that contains it.

AlexBarakov · September 2012

An update: all the VMs were re-created. Actually, there were only ~20 clients affected by this. And I found another bug in my solusvm/whmcs - a bunch of VPS were not terminated months after the due date, so in the end, at least they got terminated.

seanho · September 2012

Thanks, Alex. I can confirm my VPS is up and running again with a new IP; I appreciate your hard work!

AlexBarakov · September 2012

At the moment, the VPS are created on a Sentris node with new hard drives. I cannot confirm anything yet; however, I will make sure to keep the location (Seattle or something close to it in the worst case), so the clients should not be affected by any eventual move.
