Don't build SSD in RAID5

concerto49 · April 2013

I think it's been talked about here by some providers. So don't do it!

RAID5 = writing to all the SSDs multiple times per write, which kills them. Doesn't help. Hurts performance. Hurts reliability.

Full article: http://thessdreview.com/daily-news/latest-buzz/skyera-reveals-raid-5-hinders-reliability-of-ssd-arrays/

MrAndroid · April 2013

IMHO for an SSD RAID, you'd be better off with RAID 1 or RAID 10, and unless your RAID card supports TRIM, use software RAID.

rds100 · April 2013

And even with RAID1 or RAID10 make sure you use 4K block size, otherwise you are also causing write amplification.

concerto49 · April 2013

@rds100 said: And even with RAID1 or RAID10 make sure you use 4K block size, otherwise you are also causing write amplification.

Most SSDs have natively used 8K blocks for a while now, and the more recent ones have 16K blocks.

@MrAndroid said: IMHO for an SSD RAID, you'd be better off with RAID 1 or RAID 10, and unless your RAID card supports TRIM, use software RAID.

I agree and never promoted RAID5 or even RAID6 on SSD. It's just I've read some providers suggesting it here, so hopefully this is a warning.

rds100 · April 2013

@concerto49 said: Most SSDs have natively used 8K blocks for a while now, and the more recent ones have 16K blocks.

Which ones? Any official source, table, classification, etc. ?

concerto49 · April 2013

@rds100 said: Which ones? Any official source, table, classification, etc. ?

Since 20nm it's 16K. Since maybe 25nm it's 8K. Can't remember. As to official source, Anand has a nice table:

http://anandtech.com/show/6884/crucial-micron-m500-review-960gb-480gb-240gb-120gb

Scroll down. Page size in the table.

This is for Intel/Micron.

SanDisk/Toshiba have a similar transition. They make Toggle NAND.

rds100 · April 2013

@concerto49 interesting, thanks for the table! Now if we could convince the filesystem to use 8/16K blocks... oh, well.
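
A rough sketch of the page-size point in this exchange, assuming a simplified model in which every NAND page that a write overlaps has to be programmed in full. Real controllers buffer and coalesce writes, so treat the result as a worst-case illustration. `pages_touched` and `write_amplification` are hypothetical helper names, not part of any tool.

```python
def pages_touched(offset_bytes: int, length_bytes: int, page_bytes: int) -> int:
    """Count the NAND pages that a write of `length_bytes` at `offset_bytes` overlaps."""
    first_page = offset_bytes // page_bytes
    last_page = (offset_bytes + length_bytes - 1) // page_bytes
    return last_page - first_page + 1


def write_amplification(offset_bytes: int, length_bytes: int, page_bytes: int) -> float:
    """Bytes programmed divided by bytes requested, assuming whole pages are rewritten."""
    programmed = pages_touched(offset_bytes, length_bytes, page_bytes) * page_bytes
    return programmed / length_bytes


# Assumed example: one aligned 4K filesystem block on drives with 4K, 8K and 16K pages.
for page_kib in (4, 8, 16):
    factor = write_amplification(0, 4096, page_kib * 1024)
    print(f"{page_kib}K page: amplification {factor}")
```

Under this model a 4K write that is misaligned by 2K on a 4K-page drive also touches two pages, which is the alignment half of the same problem.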

MrAndroid · April 2013

@concerto49 said: I agree and never promoted RAID5 or even RAID6 on SSD. It's just I've read some providers suggesting it here, so hopefully this is a warning.

Raid 6 on SSDs sounds very horrible. I can imagine them dying very quickly.

Damian · April 2013

Reading the article, it appears to be the kind of article written by a company selling things.

"Of COURSE this already existing technology is going to kill your SSDs and set your house on fire... the only solution is to buy our proprietary hardware."

So it may not be pure gospel here.

marcm · April 2013

@concerto49 - Thanks for the tip. I have never liked RAID 5 anyway, even with HDDs. Dual-parity RAID 6 with several hot spares is the way to go if you want a halfway decent RAID array for storage and/or backup.

concerto49 · April 2013

@Damian said: "Of COURSE this already existing technology is going to kill your SSDs and set your house on fire... the only solution is to buy our proprietary hardware."

Ignore the marketing parts. It's still a good warning over RAID5. There are probably other sources saying so, but let's not get there.

Maounique · April 2013

If we were to worry about the wear and tear of SSDs, we wouldn't be using any SSD-cached nodes, would we? I can't think of a more horrible cycle generator than that. Yet they are in production and have worked well for months, if not more than a year, in various setups, including hardware ones such as CacheCade.
When the drives fail, they will be replaced. This is like saying: hey, don't drive your ATV on rough roads, it will break faster.
By the same logic we shouldn't use servers for virtualization: that means higher CPU load, a higher average temperature and faster breakdown.
Servers and enterprise-grade SSDs are meant for heavy duty. However, we are now considering phasing out local storage and moving to SAN wherever possible.
"DD tests" will give lower results, but I think enterprise-grade storage is better for everyone, and whoever needs consistency and reliability will choose that over local storage every day. We don't like RAID failures, and as the number of nodes increases it is more and more likely that we will have them.

Shados · May 2013

The reality is that with wear leveling and redundant internal space, most SSDs will far outlast HDDs even in write-heavy environments - you essentially have to write to every block up to the maximum number of per-block writes before any of them start to fail. Well, ideally, anyway.
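
A back-of-the-envelope version of this endurance argument, assuming perfect wear leveling. The function names and every number below (capacity, rated program/erase cycles, write amplification, daily write volume) are made-up placeholders, not figures for any real drive.

```python
def endurance_gb(capacity_gb: float, pe_cycles: int, write_amplification: float) -> float:
    """Host data (GB) that can be written before every cell reaches its rated cycle count."""
    return capacity_gb * pe_cycles / write_amplification


def lifetime_days(capacity_gb: float, pe_cycles: int,
                  write_amplification: float, host_gb_per_day: float) -> float:
    """Days until the rated endurance is used up at a constant daily write volume."""
    return endurance_gb(capacity_gb, pe_cycles, write_amplification) / host_gb_per_day


# Hypothetical drive: 240 GB, 3000 P/E cycles, write amplification 2.0, 100 GB/day.
print(endurance_gb(240, 3000, 2.0), "GB of host writes")
print(lifetime_days(240, 3000, 2.0, 100), "days")
```

The sketch only covers cell wear; it says nothing about controller or board failures.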

Besides, there are better reasons to avoid RAID5:
1. The RAID5 write hole
2. Performance loss due to partial-stripe writes being horrible (synchronously read stripe, modify & generate parity, write out instead of just generating parity and writing out as in a full-stripe write)

Although if you really want it, you could always just use ZFS and go with RAID-Z, which neatly solves both those particular issues.
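
To make the partial-stripe cost in point 2 concrete, here is a minimal sketch of XOR parity, assuming equal-sized blocks and the common read-modify-write shortcut (read the old data block and the old parity instead of the whole stripe). The function names and the returned dictionaries are illustrative only, not any controller's API.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def full_stripe_write(new_data: list) -> dict:
    """All data blocks are new: parity comes straight from them, no reads needed."""
    return {"reads": 0, "writes": len(new_data) + 1, "parity": xor_blocks(*new_data)}


def partial_stripe_write(old_block: bytes, old_parity: bytes, new_block: bytes) -> dict:
    """One block changes: read old data and old parity, then write new data and new parity."""
    new_parity = xor_blocks(old_parity, old_block, new_block)
    return {"reads": 2, "writes": 2, "parity": new_parity}


stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
before = full_stripe_write(stripe)

# Update only the middle block; the parity must match a full recomputation.
update = partial_stripe_write(stripe[1], before["parity"], b"\xaa\xbb")
after = full_stripe_write([stripe[0], b"\xaa\xbb", stripe[2]])
print(update["parity"] == after["parity"])
```

The two reads have to finish before the two writes can be issued, which is where the synchronous penalty comes from.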

FRCorey · May 2013

Intel DC S3700s are what people should be using. They have onboard DRAM caches that are backed up by capacitors, and they are HET (high endurance) drives, so they're designed to be abused.

tjb · March 2014

RAID 5 and 6 do not require a full stripe write. Only the blocks that have changed, along with the parity blocks, need to be updated. Controllers optimised for HDDs try to do full stripe writes because that avoids the expensive reads from drives in the stripe, but only if the OS/app sends writes that allow this.

If the OS/app doesn't update all the blocks in a stripe, then the controller will read in those blocks not already cached so that the parity can be calculated; then the changed blocks and the parity can be written out.

There’s no value in writing the blocks on disks that haven’t changed, in fact it’s detrimental because those drives could be servicing reads from other IO threads at the time.

RAID 5 should need the same number of disk writes per host write as RAID 10, or fewer. Consider the case where a stripe consisting of 5 data blocks and one parity block has two of those data blocks updated - that means a total of 3 blocks need to be written, while for RAID 10 you would have 4 writes. For RAID 6 it will vary, depending on how often you update multiple blocks per stripe.
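
The counting in this example can be sketched as follows, assuming all changed blocks sit in a single stripe and that only changed data plus parity is written. `device_writes` is a hypothetical helper, and the default of 5 data blocks per stripe mirrors the example above.

```python
def device_writes(level: str, changed_blocks: int, data_blocks_per_stripe: int = 5) -> int:
    """Device writes needed when `changed_blocks` data blocks in ONE stripe are updated."""
    if not 1 <= changed_blocks <= data_blocks_per_stripe:
        raise ValueError("changed_blocks must fit inside one stripe")
    if level == "raid5":
        return changed_blocks + 1   # changed data plus one parity block
    if level == "raid6":
        return changed_blocks + 2   # changed data plus two parity blocks
    if level == "raid10":
        return changed_blocks * 2   # every changed block is written to both mirrors
    raise ValueError(f"unknown RAID level: {level}")


print("changed  raid5  raid6  raid10")
for changed in range(1, 6):
    row = [device_writes(level, changed) for level in ("raid5", "raid6", "raid10")]
    print(changed, *row)
```

With one changed block RAID 6 writes more than RAID 10, with two they tie, and from three onwards RAID 6 writes less, which is the "it will vary" case.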

Maounique · March 2014

In short, it depends on the level of redundancy and number of drives.
Leaving necromancy aside, I would be interested in some statistics from providers: what percentage of their drives are SSDs, and what percentage of drives has failed in each category. We have had no SSD failures yet, but quite a few HDD failures. On the other hand, those were in places where we colo or rent, so it probably depends a lot on quality too.

Gunter · March 2014

Damn it DigitalOcean. °-_-

rds100 · March 2014

Just had our first SSD failure yesterday. Apparently the SSD went completely dead - not detectable at all.

Maounique · March 2014

rds100 said: the SSD went completely dead - not detectable at all.

So that looks like a board problem, not really the cells wearing out.

rds100 · March 2014

Yes, I don't think it saw much wear; it was a relatively new SSD. It's even still under warranty.

nonuby · March 2014

There was an interesting article on Hacker News, or perhaps via Twitter: the lead devops at a major startup was asked some questions (after a talk/conference) by a feisty young hardware boy. They raised a number of points about SSD usage in the startup's database cluster and made challenging statements about brands, firmware, OS tuning and general reliability, alluding to the unreliability and poor cost-effectiveness of SSDs in the long term, and to other negative edge cases.

The devops thought for a while and responded "I don't care, I just rent them". A moment of enlightenment had occurred.

You shouldn't expect anything to last for a particular length of time. Whether it's 2 months, 2 years or 8 years, drives can fail at any time in any configuration; there's no such thing as single-node durability in absolute terms. It seems a lot of effort (days of review reading, testing, tuning) goes into selecting SSD brands/models for optimal reliability, and then when disaster does occur there is complete shock that the RAID 10 of samsungz 747s has failed, followed by 3 days of painful, poorly planned recovery via some untested r1soft or other poorly thought-out backup solution.

In the startup's case, it wasn't noticeable: the cluster kicked the node out, AWS marked it as degraded (or dead in the instant-failure scenario), their controlling software noted this via the API, a new one was spun up, and a little light changed from green to red to green on some 100x100 matrix somewhere.

tl;dr - failure can happen anytime. Do you put the same amount of effort into disaster recovery (and into testing that solution)?

In practical terms, do you have standby SSDs ready (a warranty turnaround won't suffice), or standby cold nodes, or capacity on the grid to migrate guest VMs, and a method to recover quickly with minimal disruption to clients? So WHEN a failure does occur, it's a minor blip rather than a shitstorm.

Maounique · March 2014

nonuby said: So WHEN a failure does occur, it's a minor blip rather than a shitstorm

Yep, shit will happen; what matters is how you get out of it. So far we have had one total RAID failure and some 3-4 disks replaced. The failed array was SSD, but the controller was to blame, so it wasn't an SSD vs HDD issue. The failed RAID meant one node had to be restored from a backup some 12 hours old onto a standby node, which took a couple of hours or so.

The cloud is now at least N+2; I mean, when it approaches that load, we will add more pods. But stricto sensu, the SAN can fail too, even if it is one of those expensive Hitachi ones with redundant everything, from controllers to firmware. If that happens, we will need to ask Hitachi to come and restore it, as they do not allow outside intervention. However, the data should be safe even in those situations, and it will take at most 24 hours to get back on track. We do have another one, recently bought and even more expensive, with a 100% SLA compared to 99.99% for the current one, but the data is not replicated there. It costs 1 mil EUR and only the enterprise sector is using it.

enrique750 · March 2014

Both (R1 and R5) will do two writes for every single write. The real difference between them is that in R5 you have to read both blocks (the old data and the old parity) before writing them, so:

R1: one write means two write iops
R5: one write means two read iops and two write iops

So I am not sure that choosing R1 will affect disk longevity. R6 would do three reads and three writes, so in that case the disks would last less.
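
The per-write costs listed in this post can be tabulated in a few lines. This is only a sketch of the post's own numbers for small random writes; the `SMALL_WRITE_COST` table and `backend_ops` helper are hypothetical names, and caching or full-stripe writes would change the picture.

```python
# (backend reads, backend writes) caused by ONE small host write, as stated in the post.
SMALL_WRITE_COST = {
    "raid1": (0, 2),
    "raid5": (2, 2),
    "raid6": (3, 3),
}


def backend_ops(level: str, host_writes: int) -> dict:
    """Backend reads and writes generated by `host_writes` small random host writes."""
    reads, writes = SMALL_WRITE_COST[level]
    return {"reads": reads * host_writes, "writes": writes * host_writes}


for level in SMALL_WRITE_COST:
    print(level, backend_ops(level, 1000))
```

Only the write column wears flash cells, which is why R1 and R5 come out equal for longevity here while R5 still costs more IOPS overall.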

Maounique · March 2014

2 more IOPS under heavy disk access will not help, but for SSDs you do not need many disks. Most people will use 4, and in that case you would be better off with RAID 10. If you need a lot of space with many disks, say over 6-8, then you will do RAID 5 or 6 because you need the capacity, and the extra reads/writes are distributed across many disks; there will be plenty of IOPS, so it won't really matter.

pcan · March 2014

For what it's worth, I recently put into service a $100K IBM Power server (a high-performance database server) with 6 SSD drives. I was aware of the issue listed by the OP, and I escalated a support request about the preferred RAID level to the IBM main offices. The answer was that RAID5 is the best choice for SSD drives on that machine; RAID 10 was explicitly not recommended.
I believe that the RAID5 issue listed in the original paper is either outdated or biased to promote other products.

In my experience, the SSD failure rate is far lower than that of mechanical drives. Over about 100 drives, I had one failure due to a known Intel SSD BIOS bug and another failure almost immediately after putting the drive into production (an obvious manufacturing flaw).

MikeIn · March 2014

Leaving RAID setups aside, which are the best SSDs right now?

Or is it still the same: Intel > Samsung > any other?

enrique750 · March 2014

I said that from the LONGEVITY perspective, it should be similar. One write IOP really does convert to two backend write IOPS, but with RAID 5 it can sometimes be a full stripe write. In that case, it is an advantage. For example:

RAID 10 with 4+4 (for example): 8 sequential write IOPS should become 16 write IOPS.
RAID 5 with 8+1 (for example): 8 sequential write IOPS could become (in some cases) 9 write IOPS.

I mean that RAID 5 could have some advantages and write less.
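
The two example layouts above can be checked with a small sketch. It assumes the sequential writes start on a stripe boundary and that a leftover partial stripe costs its changed blocks plus one parity block; both function names are hypothetical.

```python
def raid10_sequential_writes(host_blocks: int) -> int:
    """Every host block is written to both sides of a mirror."""
    return 2 * host_blocks


def raid5_sequential_writes(host_blocks: int, data_disks: int) -> int:
    """Full stripes cost data_disks + 1 writes; a trailing partial stripe costs its blocks + 1."""
    full_stripes, remainder = divmod(host_blocks, data_disks)
    writes = full_stripes * (data_disks + 1)
    if remainder:
        writes += remainder + 1
    return writes


# The post's example: 8 sequential block writes on RAID 10 (4+4) versus RAID 5 (8+1).
print(raid10_sequential_writes(8), raid5_sequential_writes(8, data_disks=8))
```

The advantage depends on the writes actually arriving as whole stripes, which is the "in some cases" caveat.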

But of course, with a random small-write pattern it is better to have it on RAID 1, and that goes for SSD, SAS, NL-SAS, FC and every other kind of drive.

Really, in every environment, depending on its size, the best option is a combination of R1, R5 and R6, distributing the load across them according to the I/O pattern.

In a small environment, if you do not know the pattern, it is safe to go with R1.

debinski · March 2015

When ganging up multiple SSDs for striping, more often than not it's the controller that becomes the bottleneck. So you're never going to realize the "theoretical" IOPS and/or throughput anyway. Sometimes an R5 makes sense (empirically) so you can utilize the added capacity you gain. Adamantly going with an R10 in this case may not gain you anything but lost capacity. (Is that a double negative?)

I'm guessing this is why IBM recommended the R5 in the previous post.

I'm sure someone is going to say change the controller, but chasing the bottlenecks can be cost-prohibitive in the real world. So sometimes one just needs to make the best of what he has and call it a day.

dragon2611 · March 2015

I'd go with the "who cares" argument.

SSD prices are falling, capacities are increasing and anything important should have redundancy and backups.

So when the SSD does die, you replace it and get on with life, just as you would if an HDD died.

debinski · March 2015

In our case I guess we don't care if it dies; it's covered under Dell gold support. But we do care about tweaking the maximum throughput for a given set of hardware. I'm presently setting up a multi-user environment (terminal server) that runs an application that is never CPU or memory bound, but is always I/O bound. We've got everything from 16 SLC SSDs on two controllers (which is actually the slower tier), to (2) 1.2TB Fusion IO PCIe16x cards, to a 384GB RAM drive (the fastest tier) for temp files, and 384GB RAM for the CPUs (768GB total). The server cost $100 grand. The application cost more. The application uses all of these paths simultaneously while processing. It can really move stuff around. (Not a "lowend" box.)

The point I was making in a nutshell was sometimes a raid5 makes sense (over a raid10), especially if the workload is mostly read IOPS or if the controller supporting R10 can't deliver - hence the recommendation from IBM. (not the case with the Dell R720)

linuxthefish · March 2015

Sorry if this is off topic, but @Maounique, won't one SAN thingy failing for a load of nodes be worse than the local storage in one node failing?
