* [PATCH V2] MD: add doc for raid5-cache
@ 2017-02-06 19:23 Shaohua Li
  2017-02-06 21:02 ` Anthony Youngman
  2017-02-12  0:16 ` Nix
  0 siblings, 2 replies; 4+ messages in thread
From: Shaohua Li @ 2017-02-06 19:23 UTC (permalink / raw)
  To: linux-raid
  Cc: antlists, philip, songliubraving, neilb, jure.erznoznik, rramesh2400

I'm starting a document for the raid5-cache feature. Please note this is a
kernel doc rather than an mdadm manual, so I haven't added details about how
to use the feature on the mdadm side. Please let me know what else we should
put into the document. Of course, comments are welcome!

Signed-off-by: Shaohua Li <shli@fb.com>
---
 Documentation/md/raid5-cache.txt | 109 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 109 insertions(+)
 create mode 100644 Documentation/md/raid5-cache.txt

diff --git a/Documentation/md/raid5-cache.txt b/Documentation/md/raid5-cache.txt
new file mode 100644
index 0000000..2b210f2
--- /dev/null
+++ b/Documentation/md/raid5-cache.txt
@@ -0,0 +1,109 @@
+RAID5 cache
+
+A RAID 4/5/6 array can include an extra disk that serves as a data cache, in
+addition to the normal RAID disks. The cache disk doesn't change the role of
+the RAID disks; it caches data destined for them. The cache can operate in
+write-through mode (supported since 4.4) or write-back mode (supported since
+4.10). mdadm (since 3.4) has a new option '--write-journal' to create an
+array with a cache; please refer to the mdadm manual for details (a brief
+example is also shown below). By default, when the RAID array starts, the
+cache is in write-through mode. A user can switch it to write-back mode by:
+
+echo "write-back" > /sys/block/md0/md/journal_mode
+
+And switch it back to write-through mode by:
+
+echo "write-through" > /sys/block/md0/md/journal_mode
+
+In both modes, all writes to the array go to the cache disk first. This means
+the cache disk must be fast and able to sustain a high write load.
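+
+For illustration only (device names are placeholders and the mdadm manual is
+the authoritative reference), an array with a cache/journal device can be
+created along these lines:
+
+mdadm --create /dev/md0 --level=5 --raid-devices=4 \
+      /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 \
+      --write-journal /dev/nvme0n1p1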
+
+-------------------------------------
+write-through mode:
+
+This mode mainly fixes the 'write hole' issue. For a RAID 4/5/6 array, an
+unclean shutdown can leave some stripes in an inconsistent state, e.g. data
+and parity don't match. The reason is that a stripe write involves several
+RAID disks, and the writes may not have reached all of them before the
+unclean shutdown. MD resyncs such an array to bring it back to a normal
+state, but until the resync completes the inconsistent parity cannot be
+trusted: if a disk fails (or the array is already degraded), data rebuilt
+from that parity can be silently corrupted. This problem is called the
+'write hole'.
+
+The write-through cache stores all data on the cache disk first. After the
+data is safe on the cache disk, it is flushed onto the RAID disks. This
+two-step write guarantees that MD can recover correct data after an unclean
+shutdown, even if the array is degraded. Thus the cache can close the 'write
+hole'.
+
+In write-through mode, MD reports IO completion to the upper layer (usually a
+filesystem) only after the data is safe on the RAID disks, so a cache disk
+failure doesn't cause data loss. Of course, a cache disk failure means the
+array is exposed to the 'write hole' again.
+
+In write-through mode, the cache disk doesn't need to be big; a few hundred
+megabytes are enough.
+
+--------------------------------------
+write-back mode:
+
+Write-back mode fixes the 'write hole' issue too, since all write data is
+cached on the cache disk. But the main goal of the write-back cache is to
+speed up writes. If a write covers all RAID disks of a stripe, we call it a
+full-stripe write. For non-full-stripe writes, MD must read old data before
+the new parity can be calculated, and these synchronous reads hurt write
+throughput. Writes which are sequential but not dispatched at the same time
+suffer from this overhead too. The write-back cache aggregates such data and
+flushes it to the RAID disks only after it amounts to a full-stripe write.
+This avoids the read overhead entirely, so it's very helpful for some
+workloads; a workload doing sequential writes followed by fsync is a typical
+example.
+
+In write-back mode, MD reports IO completion to the upper layer (usually a
+filesystem) right after the data hits the cache disk. The data is flushed to
+the RAID disks later, once certain conditions are met, so a cache disk
+failure will cause data loss.
+
+In write-back mode, MD also caches data in memory. The memory cache holds
+the same data that is stored on the cache disk, so a power loss doesn't
+cause data loss. The memory cache size has a performance impact on the
+array, and a large size is recommended. A user can configure the size by:
+
+echo "2048" > /sys/block/md0/md/stripe_cache_size
+
+A cache disk that is too small makes write aggregation less efficient in this
+mode, depending on the workload. A cache disk of at least several gigabytes
+is recommended for write-back mode.
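+
+To check which device is acting as the cache and how large it is, the array
+members can be listed (md0 here is just an example name); the journal device
+appears among them in the output:
+
+mdadm --detail /dev/md0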
+
+--------------------------------------
+The implementation:
+
+The write-through and write-back caches use the same disk format. The cache
+disk is organized as a simple write log, consisting of 'metadata' and 'data'
+pairs. The metadata describes the data, and also includes a checksum and a
+sequence ID used during recovery. Data can be IO data or parity data, and it
+is checksummed too; the checksum is stored in the metadata ahead of the data.
+Storing the checksums is an optimization: it lets MD write metadata and data
+freely, without worrying about ordering. The MD superblock has a field that
+points to the valid metadata at the log head.
+
+The log implementation is pretty straightforward. The difficult part is the
+order in which MD writes data to the cache disk and the RAID disks.
+Specifically, in write-through mode, MD calculates parity for the IO data,
+writes both the IO data and the parity to the log, writes them to the RAID
+disks once they have settled down in the log, and only then completes the IO.
+Reads simply go to the RAID disks as usual.
+
+In write-back mode, MD writes the IO data to the log and reports IO
+completion. At that point the data is also fully cached in memory, so reads
+must consult the memory cache. When certain conditions are met, MD flushes
+the data to the RAID disks: it calculates parity for the data, writes the
+parity into the log, then writes both data and parity to the RAID disks and
+finally releases the memory cache. The flush conditions include a stripe
+becoming a full-stripe write, free cache disk space running low, and free
+in-kernel memory cache space running low.
+
+After an unclean shutdown, MD performs recovery. It reads all metadata and
+data from the log; the sequence IDs and checksums are used to detect
+corrupted metadata and data. If MD finds a stripe with data and valid parity
+(1 parity block for raid4/5 and 2 for raid6), it writes the data and parity
+to the RAID disks. Incomplete parity is discarded, and so is any corrupted
+data. MD then loads the remaining valid data and writes it to the RAID disks
+in the normal way.
-- 
2.9.3



* Re: [PATCH V2] MD: add doc for raid5-cache
  2017-02-06 19:23 [PATCH V2] MD: add doc for raid5-cache Shaohua Li
@ 2017-02-06 21:02 ` Anthony Youngman
  2017-02-08  1:03   ` Shaohua Li
  2017-02-12  0:16 ` Nix
  1 sibling, 1 reply; 4+ messages in thread
From: Anthony Youngman @ 2017-02-06 21:02 UTC (permalink / raw)
  To: Shaohua Li, linux-raid
  Cc: philip, songliubraving, neilb, jure.erznoznik, rramesh2400



On 06/02/17 19:23, Shaohua Li wrote:
> I'm starting a document for the raid5-cache feature. Please note this is a
> kernel doc rather than an mdadm manual, so I haven't added details about how
> to use the feature on the mdadm side. Please let me know what else we should
> put into the document. Of course, comments are welcome!
>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  Documentation/md/raid5-cache.txt | 109 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 109 insertions(+)
>  create mode 100644 Documentation/md/raid5-cache.txt

Note that the kernel documentation is moving over to a new format - 
.rst. They're using a new system called Sphinx. There was an article on 
lwn about it. I've lost the reference to lwn, but if you look at 
https://www.kernel.org/doc/html/latest/ you will see that it is a match 
to Documentation/index.rst.

So of course, converting all the md stuff has made its way onto my "to 
do" list, but I don't want to start until I've got a converted kernel 
running on my system. I'm currently running 4.4.6 and I think it came 
out with something like 4.6. Problem is, running gentoo, my system is 
well out of date because I need a helper system to get me to upgrade and 
going from KDE4 to KDE5 is likely to give me grief :-)

You're probably working with an up-to-date kernel, but it's up to you - 
do you want to create a .rst file, or submit it as .txt and let me 
convert it and go through the grief of working out how to do it :-). 
Conversion is probably rather more than just converting the one file, 
it'll need converting the directory structure too, I expect. Any 
problems, Jon Corbet's probably our go-to guy.

Cheers,
Wol


* Re: [PATCH V2] MD: add doc for raid5-cache
  2017-02-06 21:02 ` Anthony Youngman
@ 2017-02-08  1:03   ` Shaohua Li
  0 siblings, 0 replies; 4+ messages in thread
From: Shaohua Li @ 2017-02-08  1:03 UTC (permalink / raw)
  To: Anthony Youngman
  Cc: Shaohua Li, linux-raid, philip, songliubraving, neilb,
	jure.erznoznik, rramesh2400

On Mon, Feb 06, 2017 at 09:02:32PM +0000, Anthony Youngman wrote:
> 
> 
> On 06/02/17 19:23, Shaohua Li wrote:
> > I'm starting a document for the raid5-cache feature. Please note this is a
> > kernel doc rather than an mdadm manual, so I haven't added details about how
> > to use the feature on the mdadm side. Please let me know what else we should
> > put into the document. Of course, comments are welcome!
> > 
> > Signed-off-by: Shaohua Li <shli@fb.com>
> > ---
> >  Documentation/md/raid5-cache.txt | 109 +++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 109 insertions(+)
> >  create mode 100644 Documentation/md/raid5-cache.txt
> 
> Note that the kernel documentation is moving over to a new format - .rst.
> They're using a new system called Sphinx. There was an article on lwn about
> it. I've lost the reference to lwn, but if you look at
> https://www.kernel.org/doc/html/latest/ you will see that it is a match to
> Documentation/index.rst.
> 
> So of course, converting all the md stuff has made its way onto my "to do"
> list, but I don't want to start until I've got a converted kernel running on
> my system. I'm currently running 4.4.6 and I think it came out with
> something like 4.6. Problem is, running gentoo, my system is well out of
> date because I need a helper system to get me to upgrade and going from KDE4
> to KDE5 is likely to give me grief :-)
> 
> You're probably working with an up-to-date kernel, but it's up to you - do
> you want to create a .rst file, or submit it as .txt and let me convert it
> and go through the grief of working out how to do it :-). Conversion is
> probably rather more than just converting the one file, it'll need
> converting the directory structure too, I expect. Any problems, Jon Corbet's
> probably our go-to guy.

Yep, I saw some directories have been converted to .rst, but most haven't yet.
I'll commit this as-is if nobody objects, being too lazy to learn/install the
rst stuff :). It would be great if you could help convert the md directory to
the new format, but I don't think there's any hurry.

Thanks,
Shaohua


* Re: [PATCH V2] MD: add doc for raid5-cache
  2017-02-06 19:23 [PATCH V2] MD: add doc for raid5-cache Shaohua Li
  2017-02-06 21:02 ` Anthony Youngman
@ 2017-02-12  0:16 ` Nix
  1 sibling, 0 replies; 4+ messages in thread
From: Nix @ 2017-02-12  0:16 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-raid, antlists, philip, songliubraving, neilb,
	jure.erznoznik, rramesh2400

On 6 Feb 2017, Shaohua Li stated:

> +write-back mode:
> +
> +Write-back mode fixes the 'write hole' issue too, since all write data is
> +cached on the cache disk. But the main goal of the write-back cache is to
> +speed up writes. If a write covers all RAID disks of a stripe, we call it a
> +full-stripe write. For non-full-stripe writes, MD must read old data before
> +the new parity can be calculated, and these synchronous reads hurt write
> +throughput. Writes which are sequential but not dispatched at the same time
> +suffer from this overhead too. The write-back cache aggregates such data and
> +flushes it to the RAID disks only after it amounts to a full-stripe write.
> +This avoids the read overhead entirely, so it's very helpful for some
> +workloads; a workload doing sequential writes followed by fsync is a typical
> +example.
> +
> +In write-back mode, MD reports IO completion to the upper layer (usually a
> +filesystem) right after the data hits the cache disk. The data is flushed to
> +the RAID disks later, once certain conditions are met, so a cache disk
> +failure will cause data loss.
> +
> +In write-back mode, MD also caches data in memory. The memory cache holds
> +the same data that is stored on the cache disk, so a power loss doesn't
> +cause data loss. The memory cache size has a performance impact on the
> +array, and a large size is recommended. A user can configure the size by:
> +
> +echo "2048" > /sys/block/md0/md/stripe_cache_size

I'm missing something. Won't a big stripe_cache_size have the same
effect on reducing the read size of RMW as the writeback cache has?
That's the entire point of it: to remember stripes so you don't need to
take the R hit so often. I mean, sure, it won't survive a power loss: is
this just to avoid RMWs for the first write after a power loss to
stripes that were previously written before the power loss? Or is it
because the raid5-cache can be much bigger than the in-memory cache,
caching many thousands of stripes? (in which case, the raid5-cache is
preferable for any workload in which random or sub-stripe sequential
writes are scattered across very many distinct stripes rather than being
concentrated in a few, or a few dozen. This is probably a very common
case even for things like compilations or git checkouts, because new
file creation tends to be fairly scattered: every new object file might
well be in a different stripe from every other, so virtually every write
of less than the stripe size would have to block on the completion of a
read.)

(... this question is because I'm re-entering the world of md5 after
years wandering in the wilderness of hardware RAID: the writethrough
mode looks very compelling, particularly now your docs have described
how big it needs to be, or rather how big it doesn't need to be. But I
don't quite see the point of writeback mode yet.)

Hm. This is probably also a reason to keep your stripes not too large:
it's more likely that smallish writes will fill whole stripes and avoid
the read entirely. I was considering it pointless to make the stripe
size smaller than the average size of a disk track (if you can figure
that out these days), but making it much smaller seems like it's still
worthwhile.

Does anyone have recentish performance figures on the effect of changing
chunk, and thus, stripe sizes on things like file creations for a range
of sizes, or is picking a stripe size, stripe cache size, and readahead
value still basically guesswork like it was when I did this last? The
RAID performance pages show figures all over the shop, with most people
apparently agreeing on chunk sizes of 128--256KiB and *nobody* agreeing
on readahead or stripe cache sizes :( is there anything resembling a
consensus here yet?

-- 
NULL && (void)

