Pros and cons of software Parity-RAID (e.g. RAID5)
Written by: J Dawg
I was recently told about some problems concerning parity RAIDs without a non-volatile cache. More expensive HW controllers have battery-powered caches so they can finish write operations after a power failure. Now, some people say that such a failure, perhaps in combination with a degraded array, may kill your whole filesystem. Others claim that those issues are outdated and/or misconceptions.
Unfortunately, nobody gives hard references, and neither a search for md RAID and non-volatile cache nor one for bitmap caching gives a reliable answer as to whether md RAID5 is advisable.
Any information about that?
I assume Linux’s software RAID is as reliable as a hardware RAID card without a BBU and with write-back caching enabled. After all, uncommitted data in a software RAID system resides in the kernel’s buffer cache, which is a form of write-back caching without battery backup.
Since every hardware RAID-5 card I have ever used allows you to enable write-back caching without having a BBU, I expect software RAID-5 can work okay for people with a certain level of risk tolerance.
That having been said, I have personally experienced serious data loss from running a RAID-5 card with write-back caching enabled and no BBU installed. (No UPS, either. Don’t yell at me, not my call.)
My boss called me in a panic while I was on vacation because one of our production systems wouldn’t come back up after a power outage. He’d run out of things to try. I had to pull off to the side of the road, pull out the laptop, turn on WiFi tethering on my phone, ssh into the stricken system, and fix it, while my family sat there with me on the side of the road until I finished restoring a roached database table from backup. (We were about a mile away from losing cell reception at the time.)
So tell me: how much would you pay for a RAID card + BBU now?
Just a warning: RAID-5/6 operations take significant CPU time while your array is degraded, because missing blocks have to be reconstructed from parity. If your server is already fully loaded when a disk fails, it may drop into an abyss of unresponsiveness. That problem won’t happen with a hardware RAID controller. So I’d strongly advise against using software RAID-5/6 on a production server. For a workstation or lightly loaded server, it’s OK though.
SW RAID does have a failure mode – if the server goes down halfway through a write you can get a corrupted stripe. A HW RAID controller with a BBU isn’t all that expensive, and it will retain dirty blocks until you can restart the disks.
The BBU on the cache does not guarantee writes in the event of power failure (i.e. it does not power the disks). It powers the cache for a few days until you can re-start the disks. Then the controller will flush any dirty buffers to disk.
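To make the corrupted-stripe failure mode described above concrete, here is a minimal sketch in Python. It is a toy model, not how md or any controller is implemented: the four-byte blocks, the three-data-plus-one-parity layout and the `xor_blocks` helper are all assumptions made for illustration. It shows a stripe whose parity goes stale when power fails between the data write and the parity write, and what a later rebuild from that stale parity produces.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together (the RAID-5 parity operation)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# A toy stripe: three data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(*data)

# Normal case: disk 1 is lost, its block is rebuilt from the survivors plus parity.
rebuilt = xor_blocks(data[0], data[2], parity)
assert rebuilt == b"BBBB"

# Interrupted write: new data reaches disk 0, but power fails before the
# matching parity update is written, so parity is now stale.
data[0] = b"ZZZZ"

# If disk 1 fails before the array is resynchronized, its block is rebuilt
# from the stale parity and no longer matches what was stored there.
corrupted = xor_blocks(data[0], data[2], parity)
assert corrupted != b"BBBB"
```

Note that the damaged block is one that was never being written. That is why an interrupted write is most dangerous when the array is already degraded or loses a disk before parity has been recomputed.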
Some notes about SW vs. HW RAID-5
- Writes on a SW RAID-5 volume can be slow if write-through caching is used with blocking I/O, as the call doesn’t return until all the I/O has completed. A HW RAID controller with a BBWC can optimise this considerably, so you can see substantially better performance.
- The last time I looked you couldn’t do direct I/O (i.e. zero-copy DMA) on a SW RAID volume. This may have changed and is really only relevant to applications like database managers using raw partitions.
- A modern SAS RAID controller can pull or push 1 GB/sec or more of data off a disk array, particularly if formatted with a large (say 256 KB) stripe size. I’ve even benchmarked an older Adaptec ASR-2200s at speeds that indicated it was pretty much saturating both its SCSI channels at 600 MB/sec+ in aggregate (10x 15k disks) with very little CPU load on the host machine. I’m not sure you could get that out of software RAID-5 without a lot of CPU load, if at all, even on a modern machine. Maybe you could read that quickly.
- Configuration for booting off a HW RAID volume is simple – the RAID volume is transparent to the O/S.
A low-end RAID controller from a tier-1 vendor such as Adaptec is not that expensive at retail street prices and can be purchased for peanuts off eBay. But remember, if you buy secondhand, stick to tier-1 vendors, make sure you know the model, and verify the availability of drivers on the vendor’s web site.
Edit: From @psusi’s comment, make sure you don’t get a fakeraid (transparent SW RAID hidden in the driver) controller, but most of the offerings from the bigger names (Adaptec, 3Ware or LSI) aren’t fakeraid units. Anything that can take a BBU won’t be fakeraid.
If you have data in cache that is not on the disk yet and the power fails, that data is going to disappear, and your disk is most likely going to be in an inconsistent state. The probability of that isn’t very high unless you have a system that’s constantly writing, but I still wouldn’t want to bet my data on probability games.
An interesting twist would be to make a main filesystem on RAID5/6 but put a journal on a regular drive, so the data is first dumped on the regular drive. The performance would probably go to the crapper as you’d be limited to the write speed of a single drive, but the reliability would go up. So I guess in a situation where your write performance isn’t important, but your read is, that might work just fine.
Or you could just spend another $100 and get the card with BBU, or a small UPS, and avoid all these complications altogether ;)
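The separate-journal idea above can be sketched roughly as follows. This is a toy model under stated assumptions, not a real filesystem: `JournaledStore`, its `journal` list (standing in for the single plain drive) and its `array` dict (standing in for the RAID5/6 volume) are hypothetical names, and the `crash_before_apply` flag only simulates power loss.

```python
class JournaledStore:
    """Toy write-ahead journal: record the intent first, then write to the array."""

    def __init__(self):
        self.journal = []  # stands in for the journal on a regular drive
        self.array = {}    # stands in for the filesystem on the RAID volume

    def write(self, block_no, payload, crash_before_apply=False):
        # Step 1: the write is committed to the journal first.
        self.journal.append((block_no, payload))
        if crash_before_apply:
            return  # simulated power loss after the journal commit
        # Step 2: the write goes to the array.
        self.array[block_no] = payload
        # Step 3: the journal entry is retired once the array write is done.
        self.journal.remove((block_no, payload))

    def recover(self):
        """Replay every journal entry that never reached the array."""
        for block_no, payload in self.journal:
            self.array[block_no] = payload
        self.journal.clear()


store = JournaledStore()
store.write(1, b"old")
store.write(1, b"new", crash_before_apply=True)
state_after_crash = store.array[1]  # still the old contents
store.recover()
state_after_recovery = store.array[1]  # the journaled write has been replayed
```

Every write hits the journal drive before the array, which is where the single-drive write-speed limit mentioned above comes from.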
Linux mdadm software raid is designed to be just as reliable as a hardware raid with battery backed cache. There are no problems with sudden loss of power, beyond those that also apply to sudden power loss on a single disk.
When the system comes back up after a power failure, the array will be resynchronized, which basically means that the parity is recomputed to match the data that was written before the power failure. It takes some time, but really, no big deal. The resynchronization time can be greatly reduced by enabling the write-intent bitmap.
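As a rough illustration of why the write-intent bitmap shortens resynchronization, here is a small Python model. It is a sketch under assumptions, not md’s actual implementation: `ToyArray`, its region count and its method names are hypothetical, and a real array persists the bitmap on disk and clears bits lazily. On a real system the bitmap is, as far as I know, usually enabled with something like `mdadm --grow --bitmap=internal /dev/mdX`.

```python
class ToyArray:
    """Hypothetical model tracking which regions may have stale parity."""

    def __init__(self, regions: int, use_bitmap: bool):
        self.regions = regions
        self.use_bitmap = use_bitmap
        self.dirty = set()  # region numbers with writes in flight

    def begin_write(self, region: int) -> None:
        # A real array would persist this mark before touching the data.
        if self.use_bitmap:
            self.dirty.add(region)

    def end_write(self, region: int) -> None:
        # Data and parity are both on disk, so the region is clean again.
        self.dirty.discard(region)

    def regions_to_resync(self) -> list:
        """Regions whose parity must be recomputed after an unclean shutdown."""
        if self.use_bitmap:
            return sorted(self.dirty)
        return list(range(self.regions))  # no bitmap: check everything


with_bitmap = ToyArray(regions=1000, use_bitmap=True)
without_bitmap = ToyArray(regions=1000, use_bitmap=False)
for arr in (with_bitmap, without_bitmap):
    arr.begin_write(3)
    arr.end_write(3)    # this write completed cleanly
    arr.begin_write(7)  # power fails while this write is in flight

short_resync = with_bitmap.regions_to_resync()    # only region 7
full_resync = without_bitmap.regions_to_resync()  # all 1000 regions
```

The bitmap does not prevent an interrupted write from leaving stale parity. It only tells the resync where to look.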