GreenCloudVPS Node Down & All Data Lost

I have a VPS on GreenCloudVPS's node DC2-SG2 in Singapore. Recently, I received an email saying that they had found a bad memory module and that there would be a scheduled maintenance of roughly 1 hour. The VPS then went offline, and after a while they announced that they had found an issue with the RAID array and were trying to rebuild it, which would take around 8-10 hours. After that I forgot about it, as the VPS was being used by my dev team.
Then, out of nowhere, I received an email with new KVM VPS information and also found that they had extended the service period.
I have backups of the data, but many others might not. My concern is: if RAID can't save you, then what is the point of using RAID? Are there any statistics on how often RAID has actually saved someone? Also, what can VPS providers do to recover from data loss in situations like this?
Comments
This can happen, keep backups.
Enjoy your 6 months free service.
RAID is not backup.
It is not a backup. I agree. But then what's the advantage of doing RAID?
I think this is very good behaviour from GreenCloud. They seem to have kept you informed, and, without you asking for it, they've generously extended your service period.
As others have said: RAID is not backup. I don't know what RAID configuration they used, but even if it wasn't RAID 0, you can always have a failure that loses your storage. RAID with some kind of redundancy can prevent a lot of scenarios, but not all of them.
https://www.yoloraid.com/
@NDTN might share some insights so that we can learn more about the incident and take something away from it.
Some protection against disk failure, plus an increase in IO and storage space; nothing more.
It's for some resiliency and some performance, depending on the arrangement. If the machine had been provisioned with just a bunch of disks and no RAID, or a simple RAID 0, you would likely have experienced more issues, and every one of them would have been catastrophic.
But as with everything in life, some things just don't work out as planned and fail catastrophically. Six months of free service is good compensation.
This is true. Even RAID 10 has a weak spot: if you lose 2 drives belonging to the same mirror, you lose the whole pool, even if it is a 10, 16, or 20 drive pool.
GreenCloudVPS handled this like a champ, giving you 6 months of free VPS.
If this is something that bothers you, next time get something that has a backup plan included.
A 6-month extension is very generous. Shit happens.
At the risk of being obvious, everyone who ever uses RAID should at least read this:
https://en.wikipedia.org/wiki/Standard_RAID_levels
Then, figure out why you might want to use a particular RAID level, and then you'll have your list of advantages for your own use cases. YMMV in all cases.
For me, I use RAID 1, 5, and 10 for different use-cases when needed. The primary benefits are all listed in the article above.
In ALL cases, I have actual backups, of course. Again, "RAID is not backup," as the mantra goes. Say it 10 times. Put it on a post-it on your computer monitor.
GreenCloud did not have a backup in this case. Nor did they promise they would. You, as a user, should have your backup/failure plan, and assume a level of risk.
It is extremely important; it gives you performance (IOPS) and data redundancy, depending on the RAID type used and the number of drives.
A typical 7200 RPM SATA HDD will give you around 140-160 IOPS.
A typical 7200 RPM SAS HDD will give you around 160-200 IOPS.
SSDs start at roughly 5,000 to 10,000 IOPS.
Now, if I want large capacity and a higher number of IOPS, I have to 'bond' the drives together; this is a must.
My options are quite limited:
RAID 5 - survives 1 drive failure; I would not touch this, not even with a stick!
RAID 6 - survives 2 drive failures; this is something that is recommended for backup servers, and typically with no more than 12 drives.
RAID 10 - can endure 50% of the drives being lost, as long as you do not lose both drives of the same mirror. This is definitely the way to go for VPS/VM workloads, but you lose 50% of the raw capacity, as each pair of drives is mirrored.
Of these, RAID 10 has the highest performance. RAID 5/6 will have roughly the write performance of a single drive, though reads will be good, as all drives participate in reading. With RAID 10, all drives participate in reads, and writes run at roughly half the combined speed of all the drives.
How many providers do you think will burn 50% of their 10/12 TB drives on a RAID 10 array at today's prices?
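To put rough numbers on it, here is the usable-capacity math for a hypothetical 6x 4 TB array (drive count and size are purely illustrative):

```
#!/usr/bin/env bash
# Usable capacity for N drives of SIZE TB under the RAID levels discussed above
N=6; SIZE=4
echo "RAID 5 : $(( (N - 1) * SIZE )) TB usable, survives exactly 1 drive failure"
echo "RAID 6 : $(( (N - 2) * SIZE )) TB usable, survives any 2 drive failures"
echo "RAID 10: $(( N * SIZE / 2 )) TB usable, survives 1 failure per mirror pair"
```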
Get a backup plan.
I'm not a hosting provider, so I use RAID 5 for certain use cases and it's great. However, if I were a hosting provider, I agree, I would definitely NOT use it.
And agreed about RAID 10. My preference when I have enough drives.
200% agree. The typical 3-2-1 approach is a good starting place for everyone to consider.
https://www.crashplan.com/en-us/business/resources/3-2-1-backup-method/
I do not wish to spam this thread, but again, even for personal files, I **highly recommend** a RAID 6 or RAID 10, even if it is only 4 drives.
I personally lost a lot of data that was important to me on a RAID 5 array.
The biggest problem is that disks from the same batch tend to fail around the same time. Maybe it's different with SSDs, but that was certainly the case with HDDs. Maybe not immediately, but often within a few days or weeks of each other.
Many years back when I used to manage lots of servers, we'd always have disks from multiple batches in stock for this reason, and every RAID would be built with 3 or 4 disks ordered at least 2 or 3 months apart from each other.
RAID is not a backup; it just gives you a bigger window to notice a failing disk and hopefully repair the array before another failure happens. Rebuilding the array is typically quite tough on the disks too, so there's an increased risk of failure during that time, which is usually when the array is most vulnerable unless you have multiple redundant disks.
The other main advantage of RAID is that when a disk fails you can often hot-swap it without loss of service, other than reduced performance during the background rebuild.
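With Linux software RAID (mdadm), for example, that hot-swap flow looks roughly like the following; the array and device names are placeholders:

```
# Spot the failed member and replace it without taking the array offline
cat /proc/mdstat                      # a degraded array shows a "_" in its [UU] status
mdadm --detail /dev/md0               # confirm which member dropped out
mdadm /dev/md0 --fail /dev/sdb1       # mark it failed (if the kernel hasn't already)
mdadm /dev/md0 --remove /dev/sdb1     # remove it from the array
# physically swap the disk, partition the replacement to match, then:
mdadm /dev/md0 --add /dev/sdc1        # add it back; the rebuild starts automatically
watch cat /proc/mdstat                # keep an eye on the resync progress
```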
It's NEVER intended to be a replacement for a backup. You should have MULTIPLE backups as well.
Gotcha, I know the risks. I was pretty careful to indicate in my posts that it's great for certain use cases, that in ALL cases I have actual backups, and that I strongly suggest backups, of course. And I certainly would not do it as a hosting provider, which I'm not. I thought I was pretty clear in my posts, but maybe not. The point of my post was NOT to push RAID 5 on anyone. This is a sidebar distraction. But to clarify, I was suggesting people become acquainted with the different RAID levels and make their own choices, understand the levels of risk, and, as I also said, YMMV.
As for you losing data on RAID5, that sucks, been there, I've lost data on every kind of RAID configuration, and what matters more than anything is what backups I had. Obviously different RAID levels are more or less resistant to failure. Just understand the risk for each situation.
Small practical example, at the risk of further distracting from the thread, if I have a total of ONLY 3 drives or bays or ports to use, and nothing else, no other options, which has happened to me more than once, then in that situation, I would NEVER run RAID 6 on that, it's impossible. You need 4 drives. So that's an obvious one. I would also NEVER run RAID 5 on it, since there's no local backup, UNLESS I was very comfortable with the risk of losing everything. So for me, in that case, I would probably run RAID 1, and use the third drive as a backup. Not very efficient, but I'd still have a local backup, which is better than no backup at all.
The same logic can be applied for whatever number of drives you have, whatever resources you have, from little guy up to huge cloud provider. Just understand the risks, benefits. But I'm not pushing a specific level of RAID. And I thought I was clear about that, but I could have been more clear for sure.
Back to the regular discussion.
Every low-end server needs standalone backups, just in case. I remember losing my data once, and after that I don't even trust any backup solution that comes with the VPS. Always, and I mean always, have a remote backup available on hand. I do my backups to S3, which lets me spend my days without worries. Enjoy your 6-month service extension.
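For reference, a cron'd S3 backup can be as simple as something like this (the bucket name and paths are examples, and it assumes the AWS CLI is already configured; enable bucket versioning if you want point-in-time restores):

```
# Nightly push of important paths to an S3 bucket
aws s3 sync /etc      s3://my-backup-bucket/$(hostname)/etc
aws s3 sync /var/www  s3://my-backup-bucket/$(hostname)/var/www
```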
This is why I love providers like Vultr/DigitalOcean/Hetzner for production. They make backups really easy to set up, and it's cheap.
I would never rely solely on the provider's backups. You never know if the provider will deadpool, their backup systems will fail, or they'll have had enough of you and boot you out the door.
Having an additional remote backup site is always a fine plan for keeping data loss to a minimum.
Relying on RAID is reckless. Relying on the provider's backup is careless. Having no backup just means the data was worthless.
An email should already have been sent to affected customers on that node. We do not use RAID-0, RAID-5 or RAID-6. On that node we have 8x drives: 2x drives in RAID-1 for the OS and 6x drives in RAID-10 for the VMs.
A scheduled maintenance took place to replace a bad RAM stick. The OS booted fine after that, but the RAID-10 became corrupted, disappeared and could not be mounted back, even though all 8x drives are healthy. It's a really rare case that happened for the first time in our 10 years in business; we looked around and it might be related to both software and hardware issues. We have reprovisioned the node and tested all scenarios to prevent it from happening again. We only have 2 nodes with this hardware and setup; VMs on the other node will be migrated and then we will apply the fix.
Was it node KVMSG22?
It was node NVMeDC2SG2. Node KVMSG22 also had a RAM replacement scheduled, all went fine.
Nothing on their network status page:
https://status.greencloudvps.com/
Seems their status page only monitors the datacenter/network, not individual server issues/maintenance.
Then you didn't obey rule #1: RAID is not backup. Always backup. And if it's really really really important data: double backup. Triple backup.
Every RAID type, even RAID 0, has its use case. It depends on the choices made by the user, the characteristics the user wants, and the risks the user is willing to take.
Depending on use case, that's about all we deploy in production: RAID 1, 6, or 10. It's just a buffer, and we monitor the RAID arrays closely. As soon as we see a drive drop, it's top priority to hot-swap it and immediately rebuild. Then everyone sweats for ~1-12 hours.
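For Linux software RAID, that kind of monitoring can be as simple as the following (the mail address and schedule below are placeholders):

```
# Have mdadm mail you the moment an array degrades
mdadm --monitor --scan --daemonise --mail=ops@example.com --delay=300

# Or run a one-shot check from cron and let mdadm.conf's MAILADDR handle alerts:
# */15 * * * * root /sbin/mdadm --monitor --scan --oneshot
```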
@rafathossain as others have said it's not a backup, but it is used to buy you some time to potentially rebuild/recover by removing a SPOF (single point of failure: 1 disk down = all data lost). It also adds performance benefits and scaling benefits.
It sounds like GreenCloud had a different issue affecting the entire array, which is rare but absolutely does happen; that is exactly why RAID is not a replacement for backups. It is just an extra tool in your belt against data loss. This is why we include backup slots for all our VMs, which go to a separate cluster. It's not fast disks or pretty, but they are RAID 10, can be automatically scheduled, and remove one more excuse for a customer to have data loss. The ideal situation is:
1. Host uses RAID(1/6/10) on the node
2. User makes religious backups using host-provided resources
3. User also backs data up locally or to another separate provider
Personally, I think @NDTN handled it professionally if they kept you informed and gave half a year of free service. It's really not something you ever want to see as a service provider, but it is just a risk of doing business as you get more TBs under management.
I quite clearly see the provider as owing something here, even if it would only be a drop in the bucket. A backup from a week before would have been enough for me, even on magnetic tape. A disaster backup can't be too much to ask.
I am one of the victims of this incident.
As mentioned by the team, they sent a notification email about the maintenance on 10 March, but I did not receive any email on my end, so no backup was taken.
I asked another victim and he also didn't receive any planned maintenance email.
The first email I received was on 14/3 at 1.42 PM (UTC+8), after I had raised a ticket when the server went down.
Lesson learned: Do backup always.
...
... a bit dramatic, don't you think?
You had no backup plan and got bitten - I'm sure you'll take backups regularly from now on...
For many people, having a 3-day-old backup automatically reinstalled over a live system can be a worse outcome: they'll have missing data for that period, maybe without even being informed, and if they have taken their own backups and want to restore from those, they then have to worry about new data that has come in and sort out divergent databases, etc.
Do your own backups. No ifs or buts. Even if your provider offers free backups, if you don't back up your data yourself you have no idea what's being backed up, how often, or whether the restore process actually works.
It's not like there's much of a reason not to do it. Get another "storage VPS" from somewhere for a couple of dollars a month (e.g. I have a 2TB hosthatch and 1TB interservermike just for backups) and set up borg. It's easy. I have a backup script like this on every machine that's run by cron:
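A stripped-down sketch of that kind of script looks something like this (repo URL, passphrase and paths are placeholders):

```
#!/usr/bin/env bash
# Minimal borg backup run from cron -- one repo per VM/location pair
set -euo pipefail

export BORG_REPO='ssh://backup@storage.example.com/./backups/this-vm'
export BORG_PASSPHRASE='per-vm-passphrase-goes-here'

# Create a new archive named after the host and date
borg create --stats --compression zstd \
    ::"$(hostname)-$(date +%Y-%m-%d)" \
    /etc /home /var/www

# Thin out old archives (skip this on append-only repos you never compact)
borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6
```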
Very simple to set up; it typically takes me about 5 minutes per new VM to set up backups to 4 different locations. FWIW, I use a different passphrase for every VM/backup combination, which isn't strictly required but means that if one of my servers AND one of my backup servers were ever compromised, the attacker would still only have the data from the original compromised machine.
I also run them all in `--append-only` mode so that someone on a compromised machine can't delete any backups, and then manually run `borg compact $REPO` on the server every few months on the couple of backup locations where the disk is somewhat small. For the locations where I don't ever plan to do that, I also don't bother running the prune.
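And it's worth occasionally checking that a restore actually works; with borg that's quick (the archive name and mountpoint below are examples):

```
borg list ::                                    # see which archives exist in the repo
borg extract --dry-run ::myvm-2024-01-01        # walk an archive without writing anything
borg mount ::myvm-2024-01-01 /mnt/restore-test  # or mount it and browse the files directly
ls /mnt/restore-test
borg umount /mnt/restore-test
```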