
256MB @ $10.50/yr ; 1GB @ $40/yr ; 2GB @ $7/mo ; KVM SSD+RAID10 / SAN+HA ; ALL NEW Chicago Location

Comments

  • Hello Customers et al,

    This declaration goes out to all customers who were affected by the Chi1 cluster outage we recently experienced and worked through. A lot has happened over these past three days, and I intend to dispel any rumours or confusion that may have arisen during this time with some concrete facts of the matter.

    What Happened?

    On September 3rd at approximately 6PM Pacific Standard Time, our technicians alerted me to an issue with our Storage Area Network (consisting of two standalone storage servers, each with multiple dedicated storage chassis and its own OS).

    The following explanation of events (and disclosure thereof) is slightly technical, and will use the following acronyms:

    • NFS: Network File System; a software (virtual) means of mounting (using) a hard drive on a server over Ethernet, as opposed to it being physically plugged into that server. Our NFS is served over iSCSI, using tgtd as the target daemon (a minimal, hypothetical export sketch follows this acronym list).

    • SAN: Storage Area Network; a group of servers, connected via Ethernet, which constitutes the storage backend and connects to our compute nodes via dedicated interfaces.

    • SNI: Storage Network Interface(s); a collective term for the multiple bonded-pair 10GbE fiber interlinks that make up the Ethernet connections in our SAN. SNI covers the dedicated network interfaces as well as the dedicated SAN switches.
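
    For context, here is a minimal, hypothetical sketch (in Python, driving the tgtadm utility) of how a block device can be exported over iSCSI with tgtd, roughly the mechanism described in the NFS item above. The target name, target ID, and backing device are made-up examples, not our production values.

      # Hypothetical sketch: export a block device over iSCSI via tgtd's tgtadm
      # tool. Target ID, IQN, and backing device are illustrative examples only.
      import subprocess

      TID = "1"                                  # hypothetical target ID
      IQN = "iqn.2014-09.example:chi1-s1a.lun0"  # hypothetical target name
      BACKING_DEV = "/dev/mapper/cache0"         # hypothetical backing device

      def tgtadm(*args):
          """Run tgtadm with the iSCSI driver and fail loudly on errors."""
          subprocess.run(["tgtadm", "--lld", "iscsi", *args], check=True)

      # Create the target, attach the backing store as LUN 1, and allow initiators.
      tgtadm("--op", "new", "--mode", "target", "--tid", TID, "-T", IQN)
      tgtadm("--op", "new", "--mode", "logicalunit", "--tid", TID,
             "--lun", "1", "-b", BACKING_DEV)
      tgtadm("--op", "bind", "--mode", "target", "--tid", TID, "-I", "ALL")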

    When I logged in to establish a baseline understanding of the situation, I found that our cache device had begun showing signs of failure, and we immediately put our worst-case-scenario recovery plan into action. We brought the SNI offline for the affected server (CHI1-S1A) to prevent data degradation and begin a baseline recovery effort. We then proceeded to offline and flush all cache buffers to the underlying storage devices, which took some time. Unfortunately, our secondary storage server (CHI1-S1B) was not capable of operating as the storage master in a catastrophic failure situation such as this. We were unable to bring up a new SNI and NFS daemon on this server to take over, resulting in the disk input/output errors and downtime that some customers experienced.
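
    For illustration only, a rough Python sketch of what "offlining the SNI and flushing cache buffers to the underlying storage" can look like on a Linux storage node; the interface and device names are hypothetical, and this is not our exact procedure.

      # Hypothetical illustration of the recovery steps described above: take the
      # storage-network bond offline, then flush dirty buffers to the backing
      # devices. Interface and device names are made-up examples.
      import subprocess

      SNI_BOND = "bond1"                       # hypothetical bonded 10GbE interface
      BACKING_DEVS = ["/dev/sdb", "/dev/sdc"]  # hypothetical backing devices

      def run(*cmd):
          subprocess.run(list(cmd), check=True)

      run("ip", "link", "set", SNI_BOND, "down")  # isolate the SAN interface
      run("sync")                                 # push dirty pages toward disk
      for dev in BACKING_DEVS:
          run("blockdev", "--flushbufs", dev)     # flush block-layer buffers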

    To make matters worse, and the recovery effort far more difficult, we experienced several issues with the proprietary code that powers our cache device and were unable to debug it in-house. We had to do our best to stabilize the situation without access to the source code, and had to wait for the manufacturer to assist with the restoration; that did not begin until regular business hours the next morning (September 4th, 2014, at 8AM Eastern Standard Time).

    The excellent staff at the cache manufacturer's company were able to assist with the baseline stabilization, data offloading, and cache replaying, so we were able to recover 99.958% of our data successfully, even though the situation started out as the worst case scenario. We were luckier still in that the only data lost was, in fact, not customer data but our templates, of which we had ample backups.

    What's the current state?

    We are currently working around the clock to restore the original integrity of both storage servers and bring them both online. Once both systems have been scrutinized and given the A-OK to return to production, we will online the SNI and restore SAN connectivity to the compute nodes. This process may take up to 48 hours, as the data in question exceeds 10 TB in size. We will be restoring access to the Virtual Machines in a rolling fashion, starting them in rounds so as not to cause undue load on the storage servers. We will have four staff members watching each storage server and continuously monitoring the success and progress of the restoration efforts.
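
    For those curious, a rolling, round-based start might be scripted roughly as below; the batch size, delay, and VM ID range are hypothetical, and onevm is the standard OpenNebula command-line tool.

      # Hypothetical sketch of starting VMs in rounds via the OpenNebula CLI,
      # so the storage servers are not hit by every VM booting at once.
      # Batch size, delay, and VM IDs are made-up examples.
      import subprocess
      import time

      VM_IDS = list(range(0, 200))  # hypothetical VM ID range on the Chi1 cluster
      BATCH_SIZE = 10               # VMs started per round
      ROUND_DELAY = 300             # seconds to wait between rounds

      for i in range(0, len(VM_IDS), BATCH_SIZE):
          for vm_id in VM_IDS[i:i + BATCH_SIZE]:
              # "onevm resume" brings a stopped or undeployed VM back up.
              subprocess.run(["onevm", "resume", str(vm_id)], check=False)
          time.sleep(ROUND_DELAY)   # let storage load settle before the next round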

    We will have staff check each Virtual Machine via the built-in VNC functionality of OpenNebula Sunstone to ensure the Operating System does not show any signs of corruption or data loss. However, we will NOT log in to the instances; we will only observe the boot process. Unfortunately, this means we will be unable to assess the integrity of Windows Virtual Machines, as they do not have a visible boot log. If you have reason to believe your instance has suffered any data integrity failure, please contact our support staff immediately so we can help you work through the situation, such as by helping you conduct filesystem checking and repair operations (fsck, e2fsck, xfs_check, xfs_repair, and friends).
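
    If you do suspect corruption, the kind of non-destructive check our staff would walk you through looks roughly like the Python sketch below; device paths are hypothetical, and checks should only ever be run against unmounted filesystems.

      # Hypothetical sketch of read-only filesystem checks on unmounted volumes,
      # along the lines of the tools named above. Device paths are examples only.
      import subprocess

      EXT_DEV = "/dev/vda1"  # hypothetical ext4 volume
      XFS_DEV = "/dev/vdb1"  # hypothetical XFS volume

      # e2fsck -n: check an ext2/3/4 filesystem without making any changes.
      subprocess.run(["e2fsck", "-n", EXT_DEV], check=False)

      # xfs_repair -n: scan an XFS filesystem and report problems, repairing nothing.
      subprocess.run(["xfs_repair", "-n", XFS_DEV], check=False)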

    What caused this failure?

    Upon later inspection, we found that the culprit that initially caused the stress leading up to the cache failure was, in fact, the networking driver for our 10GbE network cards. Unfortunately, there is a known bug in the QLogic/XenNet Incorporated drivers that ship with RedHat (and thus CentOS) which causes memory leaks and other issues directly in the kernel. This bug is still OPEN on the RedHat bug tracker and, unfortunately, there is currently no known solution. That being said, we are looking to replace these QLogic 10GbE network cards with cards from another manufacturer that do not rely on this buggy driver package.
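
    As an aside, it is easy to see which kernel driver each network interface is bound to; a small, generic Python sketch (not tied to any particular vendor) is below.

      # Hypothetical sketch: report the kernel driver behind each network
      # interface, useful for spotting NICs bound to a problematic driver.
      import os

      SYS_NET = "/sys/class/net"

      for iface in sorted(os.listdir(SYS_NET)):
          driver_link = os.path.join(SYS_NET, iface, "device", "driver")
          if os.path.islink(driver_link):
              print(f"{iface}: {os.path.basename(os.readlink(driver_link))}")
          else:
              print(f"{iface}: (virtual interface, no device driver)")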

    How will you prevent this in the future?

    We have taken extensive steps to prevent a situation such as this from ever happening again, including modifying our hardware and software infrastructure to provide earlier warnings of conditions similar to those that led up to this incident, and to be more resilient to failures of this kind.

    • We will be reconfiguring both the storage master and storage slave servers to act in a "Multi-Master" fashion, so that future isolated failures will not bring down any Virtual Machines as they did this time.

    • We have upgraded the hardware RAID cards installed on all of our compute nodes, storage servers, disk chassis, backup nodes, and control nodes to newer cards that will better survive a host Operating System failure such as this.

    • We have hired additional staff to watch over the safety and security of the SAN during off hours, and are organizing training seminars for our existing staff on position management and how best to respond during a crisis. We will re-brief all of our staff on our existing mitigation and response plans.

    • We will replace the 10GbE network adapters on our storage servers with ones that were not made by QLogic/XenNet Incorporated, so as to no longer rely on the buggy driver that ships with the RedHat kernel.

    Will I be credited?

    We understand that the trust placed in us to safeguard your data and Virtual Machines may have been lost during this situation. We hope that this full disclosure of events and proceedings will help reinforce that we take situations like these very seriously. As such, we fully acknowledge that, as the service provider, we are the ones responsible for the situation, and declare that we will be providing the following reparations to affected clients:

    • Every client affected shall receive a downtime credit for their service at double the rate regularly assessed for hardware issues, in recognition of the severity of the situation (a worked example with hypothetical figures follows this list).

    • In addition to the above credits, 50% of the total amount of every invoice due for a service affected by this situation shall be paid towards your account as credit.

    • Every client deployed on our Chi1 cluster, even if they were not affected by this situation, shall receive a conveniently packaged copy of all their Virtual Machine data in a downloadable format, should they wish to keep a local backup.

    • Any client whose service was recently cancelled, or suspended by our system over invoices that would be credited under the above reparations, shall be reinstated for a period of seven days at no extra charge (we pay the difference).
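
    To make the credit calculation concrete, here is a small worked example with hypothetical figures (a $7/mo plan, three days of downtime, and a pro-rated daily price assumed as the regular hardware-issue credit); actual amounts depend on your plan and invoice.

      # Worked example of the reparations above, using hypothetical numbers.
      MONTHLY_PRICE = 7.00   # hypothetical $7/month plan
      DAYS_DOWN = 3          # hypothetical outage length in days
      INVOICE_DUE = 7.00     # hypothetical next invoice for the affected service

      daily_rate = MONTHLY_PRICE / 30               # pro-rated daily price
      normal_credit = daily_rate * DAYS_DOWN        # assumed regular downtime credit
      doubled_credit = normal_credit * 2            # double rate for this incident
      invoice_credit = INVOICE_DUE * 0.50           # 50% of the invoice as credit

      total_credit = doubled_credit + invoice_credit
      print(f"Downtime credit: ${doubled_credit:.2f}")   # $1.40
      print(f"Invoice credit:  ${invoice_credit:.2f}")   # $3.50
      print(f"Total credited:  ${total_credit:.2f}")     # $4.90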

    In Closing

    We hope that, given our handling of the situation and our attempts at being both forthcoming and completely transparent, we have demonstrated that we take these situations very seriously. We appreciate your business and wish to continue providing outstanding service with unparalleled quality and performance, as we did in the months preceding this incident.

    In the future, during situations such as this, we plan to update our customers much earlier in the recovery process, so they know that the issue has, in fact, occurred, that we are fully aware of the situation, and that we are already working on a solution.

    Sincerely,

    Acting Director Damon Blais

    Albino Geek Services Ltd.

    Hours of contact: 9AM - 5PM PST

    e: [email protected]

    p: +1 (912) 330-4222

    Skype: GoodHosting

    Thanked by sleddog and diszk
  • @vpnarea, @sleddog, @hosein4213, @FrankZ: please see the above declaration.

    Thanked by wendell
  • GoodHosting said: please see the above declaration.

    Thanks. A real PITA for you, best wishes getting it all sorted.

  • @goodhosting
    Here is the whole history of our last ticket

  • Silvenga Member
    edited September 2014

    hosein4213 said: Here is the whole history of our last ticket http://oi58.tinypic.com/w0qiv6.jpg

    Love how you saved that so the letters are 1.5px in height.

    @GoodHosting Could I recommend using Twitter? I would prefer more frequent, 80-word updates that show progress over one detailed 1,400-word update.

  • FrankZ Veteran
    edited September 2014

    So glad to see it did not have anything to do with changing your product to suit one big compulsive customer. http://lowendtalk.com/discussion/comment/716573/#Comment_716573

    Am I correct that service should resume sometime on the 8th of September?

  • @FrankZ said:
    So glad to see it did not have anything to do with changing your product to suit one big compulsive customer. http://lowendtalk.com/discussion/comment/716573/#Comment_716573

    Am I correct that service should resume sometime on the 8th of September?

    We took this opportunity to compile our own kernel and deploy it on all the compute nodes, since they were down during the migration anyway; we figured we might as well make the most of the time and get all of the maintenance and patch work done while the migration was under way. But yeah, the downtime was NOT because of this; we're not that dumb :).

    @hosein4213 said:
    goodhosting
    Here is the whole history of our last ticket

    I'm sorry mate, but while that proves you may have an account, I can't read a single thing in that screenshot. All I asked you for was a ticket ID, but you couldn't even give me that? Instead I get a tinypic (which can't be zoomed into, by the way, because of their crap site) that was zoomed out to all hell.

    @Silvenga said:
    GoodHosting Could I recommend using Twitter? I would prefer more, 80 word updates, that show progress, over one detailed 1,400 word update.

    I will never touch Twatter as long as I live. We will have a status page shortly, however.

  • @goodhosting
    So your whole excuse is that you need the ticket number?!!!
    OK, here is the ticket number: #NLC-7230-UQW

  • FrankZ Veteran
    edited September 2014

    nm

  • @hosein4213 said:

    goodhosting

    So your whole excuse is that you need the ticket number?!!!

    OK, here is the ticket number: #NLC-7230-UQW

    You know very well why your ticket wasn't dealt with: you still refuse to give the information required to debug the issue, such as proof that the issue is even on our side. The last MTR you provided showed that your connection has 50% packet loss to Google's DNS servers, which you then blamed on our services.

    Your Internet Service Provider being bad has nothing to do with our servers.

  • The "I will bring my own" license option will install 180 day trial by default ?

  • @theweblover007 said:
    The "I will bring my own" license option will install 180 day trial by default ?

    Yes.


    As for the Chicago1 cluster issue: this has now been resolved.

    Thanked by FrankZ
  • @goodhosting As I told you before, all my clients have the same problem. You were supposed to check using my ID and password, but we got nothing?!!!! Meanwhile, all of our information, including our Windows server, has been removed and we cannot reinstall it either. In your last ticket you said you needed 48 hours to fix it, but we have not got anything yet either.

  • timnboys Member
    edited September 2014

    Okay @goodhosting, please just restore my VM quota, as I don't want to have a nested VM anymore since it is too much trouble for me and you; so please just put my VM quota back to what it was.
    Here are some screenshots to show you what I am asking for:

  • @GoodHosting said:

    Featured YEARLY KVM Plans

    What time does this offer end?
