Friday, April 23, 2010

Battery backed cache for Linux software raid (md / mdadm)?

Linux's software RAID implementation is absolutely wonderful. Sufficiently so that I no longer use hardware RAID controllers unless I need write caching for database workloads, in which case a battery backed cache is a necessity. I'm extremely thankful to those involved in its creation and maintenance.

Alas, when I do need write-back mode (write caching), I can't use mdadm software RAID. There's actually no technical reason hardware already on the market (like "RAM drives") can't be used as the write cache; it's just that the Linux `md' layer doesn't know how to do it.

I say "it's just that" in the same way that I "just" don't know how to fly a high performance fighter jet with my eyes closed. Being possible doesn't make it easy to implement or practical to implement in a safe, reliable and robust way.

This would be a really interesting project to tackle to bring software RAID truly on par with hardware RAID, but I can't help wondering if there's a reason nobody's already done it.


Are you wondering what write caching is, why write caching is so important for databases, or how on earth you can safely write-cache with software RAID? Read on...

WHAT IS WRITE CACHING?

Write caching (often called "write-back mode") refers to a storage device that copies data written to it by the host OS into some kind of fast non-disk storage and then tells the operating system that the data has hit the disk before it truly has. Not only does this provide incredibly fast fsync()s, but it also lets the storage device intelligently bunch up many small writes into fewer, bigger writes to nearby disk regions, massively speeding up random write performance. If your workload is mostly random writes with frequent fsync()s, you can expect speedups of tens or hundreds of times when enabling write caching.
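To make that concrete, here's a minimal sketch (in C, with an arbitrary file name and arbitrary sizes) of the kind of workload in question: small writes scattered randomly across a big file, each followed by an fsync(). On bare rotating disks every iteration pays for a seek and a platter write before fsync() returns; with a battery-backed write cache, fsync() returns as soon as the data is sitting in cache memory.

    /* Minimal sketch of a random-write, fsync-heavy workload.
     * The file name and sizes are arbitrary illustrative values. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const off_t file_size = 1024 * 1024 * 1024;  /* 1 GiB working set */
        const int   n_writes  = 10000;
        char block[4096];

        int fd = open("testfile", O_RDWR | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }
        memset(block, 'x', sizeof block);

        for (int i = 0; i < n_writes; i++) {
            /* A random 4 KiB-aligned offset: a worst case for rotating disks. */
            off_t offset = (random() % (file_size / 4096)) * 4096;

            if (pwrite(fd, block, sizeof block, offset) != (ssize_t)sizeof block) {
                perror("pwrite"); return 1;
            }
            /* Force the data to stable storage before continuing, as a
             * database does at COMMIT time. This is the expensive part. */
            if (fsync(fd) != 0) { perror("fsync"); return 1; }
        }
        close(fd);
        return 0;
    }

Run against a plain disk or a BBU-less array, each loop iteration costs a seek plus rotational latency; run against a write-back cache, the loop is limited mostly by how fast the cache can accept data.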

For write caching to be safe, the fast storage used for cache has to be persistent even in the face of system power loss, sudden OS reset, removal of the RAID card/hard drive from the computer, etc. If it is not, any sort of failure will leave your data in a horrifying half-written half-lost state made even worse by the out-of-order writes done to improve random write speeds. It doesn't bear thinking about.

So - all it takes to make write caching perfectly safe is robust, persistent storage for the cache, and with such storage you can achieve orders of magnitude improvements in performance. Sound interesting?


WHY USE WRITE CACHING FOR DATABASES?

Database workloads benefit from write caching because they tend to require strict guarantees about what has hit disk. They also tend to do a lot of small writes scattered randomly across the storage. Because they need to ensure the ordering of their writes, they force those random writes to disk one by one, preventing the RAID controller from accumulating them and combining them into bigger, more sequential writes.

Given how bad rotational storage is at random writes, this is pretty much a worst-case workload for storage performance.

Even if your particular workload doesn't care about a few lost transactions, so it doesn't require the database to promise that data has hit disk before returning from COMMIT, the database itself tends to require strictly ordered writes to avoid severe corruption of its storage in the case of unexpected power loss or reboot. This is particularly true for database systems like PostgreSQL that do write-ahead logging, a crash-integrity strategy somewhat like file system data journaling.
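To illustrate the ordering requirement, here's a minimal sketch of the write-ahead logging idea - this is not PostgreSQL's actual code, just the general shape, with invented file names and record layout. The point is that the log record describing a change must be durable before the data page it describes is overwritten, and each of those forced writes is exactly the sort of small synchronous I/O a battery-backed write cache absorbs.

    /* Write-ahead logging in miniature (illustrative only): the WAL record
     * must reach stable storage before the data page is overwritten, so a
     * crash between the two steps can be repaired by replaying the WAL. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    struct wal_record { long page_no; char new_contents[4096]; };   /* made-up layout */

    static int append_wal(int wal_fd, const struct wal_record *rec)
    {
        if (write(wal_fd, rec, sizeof *rec) != (ssize_t)sizeof *rec) return -1;
        return fsync(wal_fd);                  /* the log must hit disk first */
    }

    static int write_page(int data_fd, const struct wal_record *rec)
    {
        off_t off = rec->page_no * 4096L;
        if (pwrite(data_fd, rec->new_contents, 4096, off) != 4096) return -1;
        return fsync(data_fd);                 /* then the page change is made durable */
    }

    int main(void)
    {
        int wal_fd  = open("wal",  O_WRONLY | O_CREAT | O_APPEND, 0644);
        int data_fd = open("data", O_RDWR   | O_CREAT,            0644);
        if (wal_fd < 0 || data_fd < 0) { perror("open"); return 1; }

        struct wal_record rec = { .page_no = 42 };
        memset(rec.new_contents, 'y', sizeof rec.new_contents);

        /* The order of these two calls is the whole point: reverse them and an
         * ill-timed crash leaves a half-written page with no log to repair it. */
        if (append_wal(wal_fd, &rec) != 0) { perror("append_wal"); return 1; }
        if (write_page(data_fd, &rec) != 0) { perror("write_page"); return 1; }

        close(wal_fd);
        close(data_fd);
        return 0;
    }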

WHY DO YOU NEED A HARDWARE RAID CONTROLLER FOR WRITE CACHING?

Pure software RAID working with only hard disks cannot provide write caching. It can cache writes in RAM, but if the OS suddenly crashes, the user hard-resets the machine, or the power goes out, your data is left in a truly messy half-written, half-lost state. The software RAID system simply has no suitable storage for the write cache.

Hardware RAID controllers generally use ordinary DDR/DDR2 DIMMs (standard PC memory) plugged into the controller as their cache. Because this memory is erased when it loses power, they also include a small battery on the RAID card that maintains power to the memory for many hours, giving you time to get power back to the system before the cache contents are lost and your data is corrupted. When power is restored, the RAID controller resumes where it left off, writing any data it finds in the battery-backed cache memory.

There is nothing that inherently prevents software RAID from using the same strategy. All it needs is somewhere persistent to put the cache, and there's an obvious way to provide it: exactly the same option used by hardware RAID controllers. A DIMM or two on a PCI Express card, or even a SATA "RAM drive", would do, so long as it had a battery to keep that memory alive across power outages.


So, in fact, you don't need a hardware RAID controller for write caching at all; you just need somewhere to put the cache.
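To make the idea concrete, here's a back-of-the-envelope sketch of how such a cache could be used - none of this is real md code, and the device names, record format and policy are all invented for illustration. The essence is a persistent journal: append each incoming write to the battery-backed device and acknowledge it straight away, copy it out to the real array at leisure, and replay whatever is left in the cache after an unclean shutdown.

    /* Illustrative sketch only - not md code. Device names are hypothetical.
     *   write path : append record to cache device, fsync it, ack the caller
     *   flush path : later, copy the record to its real place on the array
     *   recovery   : after a crash, replay any valid records still cached   */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define CACHE_DEV "/dev/bbu_cache"   /* hypothetical battery-backed RAM drive */
    #define ARRAY_DEV "/dev/md0"         /* the real software RAID array */
    #define MAGIC     0x57424343u        /* marks a valid cache record */

    struct cache_record {
        uint32_t magic;
        uint64_t array_offset;           /* where the data really belongs */
        uint32_t length;                 /* assumed <= 4096 in this sketch */
        char     data[4096];
    };

    /* Write path: the data is "durable" as soon as it sits in battery-backed RAM. */
    static int cached_write(int cache_fd, uint64_t array_offset,
                            const char *buf, uint32_t len)
    {
        struct cache_record rec = { .magic = MAGIC,
                                    .array_offset = array_offset,
                                    .length = len };
        memcpy(rec.data, buf, len);
        if (write(cache_fd, &rec, sizeof rec) != (ssize_t)sizeof rec) return -1;
        return fsync(cache_fd);          /* cheap: it's RAM behind a battery */
    }

    /* Flush path: push one cached record out to its real location on the array. */
    static int flush_record(int array_fd, const struct cache_record *rec)
    {
        if (rec->magic != MAGIC) return 0;   /* empty or torn slot - skip it */
        if (pwrite(array_fd, rec->data, rec->length,
                   (off_t)rec->array_offset) != (ssize_t)rec->length) return -1;
        return fsync(array_fd);
    }

    /* Recovery: after an unclean shutdown, scan the cache and reapply everything. */
    static int replay_cache(int cache_fd, int array_fd)
    {
        struct cache_record rec;
        lseek(cache_fd, 0, SEEK_SET);
        while (read(cache_fd, &rec, sizeof rec) == (ssize_t)sizeof rec)
            if (flush_record(array_fd, &rec) != 0) return -1;
        return 0;
    }

    int main(void)
    {
        /* In a real implementation this would live in the kernel's md layer;
         * here we just open the (hypothetical) devices to show the flow. */
        int cache_fd = open(CACHE_DEV, O_RDWR);
        int array_fd = open(ARRAY_DEV, O_RDWR);
        if (cache_fd < 0 || array_fd < 0) { perror("open"); return 1; }

        char block[4096] = "some dirty data";
        if (cached_write(cache_fd, 123 * 4096ULL, block, sizeof block) != 0)
            perror("cached_write");

        /* ...imagine the machine loses power here; on the next boot: */
        if (replay_cache(cache_fd, array_fd) != 0)
            perror("replay_cache");
        return 0;
    }

A real version would also have to get write ordering and barriers right, reclaim cache space as records are flushed, cope with the cache device itself failing, and so on - which may go some way towards explaining why nobody has done it yet.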

IF SOFTWARE RAID CAN DO WRITE CACHING WITH SUITABLE CACHE MODULES, WHERE ARE THOSE MODULES?

Given the huge costs of true hardware RAID controllers, especially with the large premium charged by most vendors for an add-on battery backup unit, you'd expect that there'd be plenty of options out there, and that software RAID implementations would take advantage of them.

Yet, oddly, the only software RAID implementations I can find that support write-back (caching) mode with a battery-backed cache are vendor-specific "host RAID" implementations embedded in the drivers for "fakeraid" cards. These cards pretend to be hardware RAID, but really do the work in a driver in the operating system. The hardware is just a plain old SATA or SAS controller with an on-board BIOS Option ROM that understands enough of the RAID layout to read the boot loader and get the OS up and running to the point where the drivers load. It's easy for the vendors of these cards to add BBU / write cache support, since the software RAID is tightly bound to particular hardware to which they can just add a RAM slot and a battery. Unfortunately, they tend to add a rather impressive price tag as well.

I find myself wondering where the "write-back cache cards" for the built-in software RAID features in Linux, Mac OS X and Windows are. The success of fakeraid cards with a BBU shows there's a market, so where are the devices?

Well, they're right here: "RAM drives" are pretty much perfect for the job. They're fast, battery-backed storage that's just what's needed. It'd be nice if they offered direct PCIe access to the memory via an onboard AHCI-compatible host interface (or a custom driver) rather than a real SATA interface, but that's far from vital.

Unfortunately, the RAID implementations in Windows, Mac OS X and Linux don't know how to use persistent cache like this to enable write-back mode! It's out there, cheap, and ready for the taking, but the software just doesn't support it. md/mdadm does support storing a write-intent bitmap for faster crash recovery, but that's not remotely the same thing.
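For contrast, here's roughly what a write-intent bitmap buys you - again a sketch, not md's real on-disk format. The bitmap persistently records which regions have writes in flight, so after a crash only those regions need to be resynced between the array members; it never stores the data itself, so it can't acknowledge a write any earlier than the disks can.

    /* Illustrative sketch of the write-intent bitmap idea (not md's format).
     * Only "a write was in flight somewhere in this region" is ever persisted,
     * never the data - so it shortens resync, but it is not a write cache. */
    #include <stdint.h>

    #define REGION_SIZE (64 * 1024 * 1024)        /* e.g. 64 MiB per bitmap bit */

    struct bitmap { uint8_t bits[1024]; };        /* covers 8192 regions */

    static void set_bit(struct bitmap *bm, uint64_t offset)
    {
        uint64_t region = offset / REGION_SIZE;
        bm->bits[region / 8] |= (uint8_t)(1u << (region % 8));
    }

    static void clear_bit(struct bitmap *bm, uint64_t offset)
    {
        uint64_t region = offset / REGION_SIZE;
        bm->bits[region / 8] &= (uint8_t)~(1u << (region % 8));
    }

    int main(void)
    {
        struct bitmap bm = {0};
        uint64_t offset = 5ULL * 1024 * 1024 * 1024;   /* a write at the 5 GiB mark */

        set_bit(&bm, offset);     /* persisted *before* the data writes start     */
        /* ... the data is written to every member of the array here ...          */
        clear_bit(&bm, offset);   /* persisted lazily once every member has it    */

        /* After a crash, only regions whose bits are still set get resynced. */
        return 0;
    }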

Wouldn't it be awfully handy if md/mdadm supported an external cache module? Software RAID at hardware-RAID speeds.

An aside: WHY NOT USE AN SSD FOR WRITE CACHE STORAGE?

"Why not use an SSD?" you ask? Good question. Their random write performance is awful, but used as a cyclic write buffer the workload becomes mostly sequential, and since they don't need a backup battery they'd seem a near-perfect choice. Unfortunately, they're still generally quite bad at doing lots of small writes, even sequential ones, because of their erase blocks. Only high-end SSDs with an internal cache and a big capacitor to let them finish writing after power loss would be suitable, and they're way more expensive than a board with some DIMMs and a battery would be. You're better off with a hardware RAID card or a dedicated cache module for software RAID.

Before you say "Intel's X-25 series addresses the issues with small writes"... try getting one of them, doing a bunch of writes, and yanking the power. See if what you find when you restore the power makes any sense at all. Those drives apparently don't have the juice to finish writing out their internal cache when they lose power - they're very much like software RAID doing write-back caching without a BBU, i.e. incredibly unsafe. I've heard horror stories about them on the PostgreSQL mailing list, and wouldn't use them except on a machine with a really good UPS. All Intel would need to add is a great honking capacitor or an itsy little backup battery to give the drive enough juice to do an emergency write-out, but presumably for cost reasons the devices don't offer anything like that.
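If you'd like to run that experiment yourself, something along these lines will do - a rough sketch with an arbitrary file name. It writes numbered, checksummed records and fsync()s after each one, printing the last sequence number the drive claimed was durable before you pulled the plug; after power comes back, every record up to that number should be present and intact, and on a drive that lies about fsync() some of them won't be.

    /* Rough "pull the plug" durability tester (illustrative). Run it, yank the
     * power mid-run, then check the file against the last acknowledged number. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    struct record { uint64_t seq; uint64_t checksum; char pad[4080]; };  /* 4 KiB */

    static uint64_t trivial_checksum(uint64_t seq) { return seq ^ 0xdeadbeefULL; }

    int main(void)
    {
        int fd = open("pull-the-plug.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }

        struct record rec;
        memset(&rec, 0, sizeof rec);

        for (uint64_t seq = 1; ; seq++) {            /* runs until you cut the power */
            rec.seq      = seq;
            rec.checksum = trivial_checksum(seq);
            if (write(fd, &rec, sizeof rec) != (ssize_t)sizeof rec) { perror("write"); return 1; }
            if (fsync(fd) != 0) { perror("fsync"); return 1; }
            /* Only report a record once the drive has claimed it is durable. */
            printf("acknowledged %llu\n", (unsigned long long)seq);
            fflush(stdout);
        }
    }

A matching checker just rereads the file and verifies that every record up to the last acknowledged sequence number is present with the right checksum.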

RELATED LINKS

Very experimental write-back cache support patches that use only unsafe, non-BBU main memory as the cache: http://lwn.net/Articles/229976/

1 comment:

  1. Interesting. We implemented this idea of battery-backed cache for the Agami NAS in 2004 (which did software RAID via a custom battery-backed cache on the SATA controller cards), and Intransa implemented it for their Building Block series in 2008. So the idea of battery-backed cache for software RAID is out there. I wonder if perhaps the patent trolls that got the Agami IP at the end of the day have anything to do with it? Hmm...
