linux-kernel.vger.kernel.org archive mirror
* O_DIRECT to md raid 6 is slow
@ 2012-08-15  0:49 Andy Lutomirski
  2012-08-15  1:07 ` kedacomkernel
  2012-08-15 11:50 ` John Robinson
  0 siblings, 2 replies; 21+ messages in thread
From: Andy Lutomirski @ 2012-08-15  0:49 UTC (permalink / raw)
  To: linux-kernel, linux-raid

If I do:
# dd if=/dev/zero of=/dev/md0p1 bs=8M
then iostat -m 5 says:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00   26.88   35.27    0.00   37.85

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb             265.20         1.16        54.79          5        273
sdc             266.20         1.47        54.73          7        273
sdd             264.20         1.38        54.54          6        272
sdf             286.00         1.84        54.74          9        273
sde             266.60         1.04        54.75          5        273
sdg             265.00         1.02        54.74          5        273
md0           55808.00         0.00       218.00          0       1090

If I do:
# dd if=/dev/zero of=/dev/md0p1 bs=8M oflag=direct
then iostat -m 5 says:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00   11.70   12.94    0.00   75.36

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb             831.00         8.58        30.42         42        152
sdc             832.80         8.05        29.99         40        149
sdd             832.00         9.10        29.78         45        148
sdf             838.40         9.11        29.72         45        148
sde             828.80         7.91        29.79         39        148
sdg             850.80         8.00        30.18         40        150
md0            1012.60         0.00       101.27          0        506

It looks like md isn't recognizing that I'm writing whole stripes when
I'm in O_DIRECT mode.
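
A rough way to read the two iostat samples is to compare how many bytes reach the member disks with how many bytes were written to md0. The sketch below only re-does that arithmetic on the figures quoted above; the "ideal" ratio assumes a 6-member RAID6 (4 data chunks plus 2 parity chunks per stripe), and the variable names are illustrative.

```python
# Write rates in MB/s, copied from the two iostat samples above.
buffered_disks = [54.79, 54.73, 54.54, 54.74, 54.75, 54.74]
buffered_md0 = 218.00
direct_disks = [30.42, 29.99, 29.78, 29.72, 29.79, 30.18]
direct_md0 = 101.27
direct_reads = [8.58, 8.05, 9.10, 9.11, 7.91, 8.00]


def write_amplification(disk_rates, array_rate):
    """Ratio of bytes hitting the member disks to bytes written to the array."""
    return sum(disk_rates) / array_rate


# Assumption: 6 members in RAID6 means 4 data + 2 parity chunks per stripe,
# so a workload made only of full-stripe writes costs 6/4 = 1.5x.
ideal = 6 / 4

buffered_amp = write_amplification(buffered_disks, buffered_md0)
direct_amp = write_amplification(direct_disks, direct_md0)

print(f"ideal full-stripe amplification: {ideal:.2f}")
print(f"buffered: {buffered_amp:.2f}")
print(f"O_DIRECT: {direct_amp:.2f}, plus {sum(direct_reads):.2f} MB/s of member reads")
```

The buffered run lands close to the ideal ratio, while the O_DIRECT run writes noticeably more per byte of payload and also reads from every member, which is what partial-stripe read-modify-write would look like.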

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: O_DIRECT to md raid 6 is slow
  2012-08-15  0:49 O_DIRECT to md raid 6 is slow Andy Lutomirski
@ 2012-08-15  1:07 ` kedacomkernel
  2012-08-15  1:12   ` Andy Lutomirski
  2012-08-15 11:50 ` John Robinson
  1 sibling, 1 reply; 21+ messages in thread
From: kedacomkernel @ 2012-08-15  1:07 UTC (permalink / raw)
  To: Andy Lutomirski, linux-kernel, linux-raid


On 2012-08-15 08:49 Andy Lutomirski <luto@amacapital.net> Wrote:
>If I do:
># dd if=/dev/zero of=/dev/md0p1 bs=8M
>then iostat -m 5 says:
>
>avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           0.00    0.00   26.88   35.27    0.00   37.85
>
>Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
>sdb             265.20         1.16        54.79          5        273
>sdc             266.20         1.47        54.73          7        273
>sdd             264.20         1.38        54.54          6        272
>sdf             286.00         1.84        54.74          9        273
>sde             266.60         1.04        54.75          5        273
>sdg             265.00         1.02        54.74          5        273
>md0           55808.00         0.00       218.00          0       1090
>
>If I do:
># dd if=/dev/zero of=/dev/md0p1 bs=8M oflag=direct
>then iostat -m 5 says:
>avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           0.00    0.00   11.70   12.94    0.00   75.36
>
>Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
>sdb             831.00         8.58        30.42         42        152
>sdc             832.80         8.05        29.99         40        149
>sdd             832.00         9.10        29.78         45        148
>sdf             838.40         9.11        29.72         45        148
>sde             828.80         7.91        29.79         39        148
>sdg             850.80         8.00        30.18         40        150
>md0            1012.60         0.00       101.27          0        506
>
>It looks like md isn't recognizing that I'm writing whole stripes when
>I'm in O_DIRECT mode.
>
kernel version?

>--Andy
>
>-- 
>Andy Lutomirski
>AMA Capital Management, LLC
>--
>To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: O_DIRECT to md raid 6 is slow
  2012-08-15  1:07 ` kedacomkernel
@ 2012-08-15  1:12   ` Andy Lutomirski
  2012-08-15  1:23     ` kedacomkernel
  0 siblings, 1 reply; 21+ messages in thread
From: Andy Lutomirski @ 2012-08-15  1:12 UTC (permalink / raw)
  To: kedacomkernel; +Cc: linux-kernel, linux-raid

Ubuntu's 3.2.0-27-generic.  I can test on a newer kernel tomorrow.

--Andy

On Tue, Aug 14, 2012 at 6:07 PM, kedacomkernel <kedacomkernel@gmail.com> wrote:
> On 2012-08-15 08:49 Andy Lutomirski <luto@amacapital.net> Wrote:
>>If I do:
>># dd if=/dev/zero of=/dev/md0p1 bs=8M
>>then iostat -m 5 says:
>>
>>avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>           0.00    0.00   26.88   35.27    0.00   37.85
>>
>>Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
>>sdb             265.20         1.16        54.79          5        273
>>sdc             266.20         1.47        54.73          7        273
>>sdd             264.20         1.38        54.54          6        272
>>sdf             286.00         1.84        54.74          9        273
>>sde             266.60         1.04        54.75          5        273
>>sdg             265.00         1.02        54.74          5        273
>>md0           55808.00         0.00       218.00          0       1090
>>
>>If I do:
>># dd if=/dev/zero of=/dev/md0p1 bs=8M oflag=direct
>>then iostat -m 5 says:
>>avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>           0.00    0.00   11.70   12.94    0.00   75.36
>>
>>Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
>>sdb             831.00         8.58        30.42         42        152
>>sdc             832.80         8.05        29.99         40        149
>>sdd             832.00         9.10        29.78         45        148
>>sdf             838.40         9.11        29.72         45        148
>>sde             828.80         7.91        29.79         39        148
>>sdg             850.80         8.00        30.18         40        150
>>md0            1012.60         0.00       101.27          0        506
>>
>>It looks like md isn't recognizing that I'm writing whole stripes when
>>I'm in O_DIRECT mode.
>>
> kernel version?
>
>>--Andy
>>
>>--
>>Andy Lutomirski
>>AMA Capital Management, LLC
>>--
>>To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>>the body of a message to majordomo@vger.kernel.org
>>More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: Re: O_DIRECT to md raid 6 is slow
  2012-08-15  1:12   ` Andy Lutomirski
@ 2012-08-15  1:23     ` kedacomkernel
  0 siblings, 0 replies; 21+ messages in thread
From: kedacomkernel @ 2012-08-15  1:23 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: linux-kernel, linux-raid


On 2012-08-15 09:12 Andy Lutomirski <luto@amacapital.net> Wrote:
>Ubuntu's 3.2.0-27-generic.  I can test on a newer kernel tomorrow.
I guess the blk_plug call may be missing for direct writes.
Can you apply this patch and retest?

Move unplugging for direct I/O from around ->direct_IO() down to
do_blockdev_direct_IO(). This implicitly adds plugging for direct
writes.
 
CC: Li Shaohua <shli@fusionio.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/direct-io.c |    5 +++++
 mm/filemap.c   |    4 ----
 2 files changed, 5 insertions(+), 4 deletions(-)
 
--- linux-next.orig/mm/filemap.c 2012-08-05 16:24:47.859465122 +0800
+++ linux-next/mm/filemap.c 2012-08-05 16:24:48.407465135 +0800
@@ -1412,12 +1412,8 @@ generic_file_aio_read(struct kiocb *iocb
 			retval = filemap_write_and_wait_range(mapping, pos,
 					pos + iov_length(iov, nr_segs) - 1);
 			if (!retval) {
-				struct blk_plug plug;
-
-				blk_start_plug(&plug);
 				retval = mapping->a_ops->direct_IO(READ, iocb,
 							iov, pos, nr_segs);
-				blk_finish_plug(&plug);
 			}
 			if (retval > 0) {
 				*ppos = pos + retval;
--- linux-next.orig/fs/direct-io.c 2012-07-07 21:46:39.531508198 +0800
+++ linux-next/fs/direct-io.c 2012-08-05 16:24:48.411465136 +0800
@@ -1062,6 +1062,7 @@ do_blockdev_direct_IO(int rw, struct kio
 	unsigned long user_addr;
 	size_t bytes;
 	struct buffer_head map_bh = { 0, };
+	struct blk_plug plug;
 
 	if (rw & WRITE)
 		rw = WRITE_ODIRECT;
@@ -1177,6 +1178,8 @@ do_blockdev_direct_IO(int rw, struct kio
 				PAGE_SIZE - user_addr / PAGE_SIZE);
 	}
 
+	blk_start_plug(&plug);
+
 	for (seg = 0; seg < nr_segs; seg++) {
 		user_addr = (unsigned long)iov[seg].iov_base;
 		sdio.size += bytes = iov[seg].iov_len;
@@ -1235,6 +1238,8 @@ do_blockdev_direct_IO(int rw, struct kio
 	if (sdio.bio)
 		dio_bio_submit(dio, &sdio);
 
+	blk_finish_plug(&plug);
+
 	/*
 	 * It is possible that, we return short IO due to end of file.
 	 * In that case, we need to release all the pages we got hold on.
 
 


* Re: O_DIRECT to md raid 6 is slow
  2012-08-15  0:49 O_DIRECT to md raid 6 is slow Andy Lutomirski
  2012-08-15  1:07 ` kedacomkernel
@ 2012-08-15 11:50 ` John Robinson
  2012-08-15 17:57   ` Andy Lutomirski
  1 sibling, 1 reply; 21+ messages in thread
From: John Robinson @ 2012-08-15 11:50 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: linux-kernel, linux-raid

On 15/08/2012 01:49, Andy Lutomirski wrote:
> If I do:
> # dd if=/dev/zero of=/dev/md0p1 bs=8M
[...]
> It looks like md isn't recognizing that I'm writing whole stripes when
> I'm in O_DIRECT mode.

I see your md device is partitioned. Is the partition itself stripe-aligned?

Cheers,

John.



* Re: O_DIRECT to md raid 6 is slow
  2012-08-15 11:50 ` John Robinson
@ 2012-08-15 17:57   ` Andy Lutomirski
  2012-08-15 22:00     ` Stan Hoeppner
  0 siblings, 1 reply; 21+ messages in thread
From: Andy Lutomirski @ 2012-08-15 17:57 UTC (permalink / raw)
  To: John Robinson; +Cc: linux-kernel, linux-raid

On Wed, Aug 15, 2012 at 4:50 AM, John Robinson
<john.robinson@anonymous.org.uk> wrote:
> On 15/08/2012 01:49, Andy Lutomirski wrote:
>>
>> If I do:
>> # dd if=/dev/zero of=/dev/md0p1 bs=8M
>
> [...]
>
>> It looks like md isn't recognizing that I'm writing whole stripes when
>> I'm in O_DIRECT mode.
>
>
> I see your md device is partitioned. Is the partition itself stripe-aligned?

Crud.

md0 : active raid6 sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1] sdb1[0]
      11720536064 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6] [UUUUUU]

IIUC this means that I/O should be aligned on 2MB boundaries (512k
chunk * 4 non-parity disks).  gdisk put my partition on a 2048 sector
(i.e. 1MB) boundary.
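
The alignment arithmetic above can be checked with a few lines. This is a minimal sketch, assuming 512-byte logical sectors and taking the chunk size and member count from the /proc/mdstat output; the helper name is_stripe_aligned is made up for illustration.

```python
SECTOR = 512            # bytes; assumed logical sector size
CHUNK = 512 * 1024      # "512k chunk" from /proc/mdstat
MEMBERS = 6             # [6/6]
PARITY = 2              # RAID6 keeps two parity chunks per stripe

# Bytes of user data in one full stripe: chunk * non-parity disks.
STRIPE = CHUNK * (MEMBERS - PARITY)


def is_stripe_aligned(start_sector, stripe_bytes=STRIPE, sector=SECTOR):
    """True if a partition starting at start_sector begins on a stripe boundary."""
    return (start_sector * sector) % stripe_bytes == 0


print(STRIPE // (1024 * 1024), "MiB of data per full stripe")
print("start at sector 2048 aligned:", is_stripe_aligned(2048))
print("start at sector 4096 aligned:", is_stripe_aligned(4096))
```

A start at sector 2048 is 1 MiB into the device, which is half a stripe; 4096 sectors (2 MiB) would sit on a stripe boundary.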

Sadly, /sys/block/md0/md0p1/alignment_offset reports 0 (instead of 1MB).

Fixing this has no effect, though.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: O_DIRECT to md raid 6 is slow
  2012-08-15 17:57   ` Andy Lutomirski
@ 2012-08-15 22:00     ` Stan Hoeppner
  2012-08-15 22:10       ` Andy Lutomirski
  2012-08-15 23:07       ` Miquel van Smoorenburg
  0 siblings, 2 replies; 21+ messages in thread
From: Stan Hoeppner @ 2012-08-15 22:00 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: John Robinson, linux-kernel, linux-raid

On 8/15/2012 12:57 PM, Andy Lutomirski wrote:
> On Wed, Aug 15, 2012 at 4:50 AM, John Robinson
> <john.robinson@anonymous.org.uk> wrote:
>> On 15/08/2012 01:49, Andy Lutomirski wrote:
>>>
>>> If I do:
>>> # dd if=/dev/zero of=/dev/md0p1 bs=8M
>>
>> [...]
>>
>>> It looks like md isn't recognizing that I'm writing whole stripes when
>>> I'm in O_DIRECT mode.
>>
>>
>> I see your md device is partitioned. Is the partition itself stripe-aligned?
> 
> Crud.
> 
> md0 : active raid6 sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1] sdb1[0]
>       11720536064 blocks super 1.2 level 6, 512k chunk, algorithm 2
> [6/6] [UUUUUU]
> 
> IIUC this means that I/O should be aligned on 2MB boundaries (512k
> chunk * 4 non-parity disks).  gdisk put my partition on a 2048 sector
> (i.e. 1MB) boundary.

It's time to blow away the array and start over.  You're already
misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
but for a handful of niche all streaming workloads with little/no
rewrite, such as video surveillance or DVR workloads.

Yes, 512KB is the md 1.2 default.  And yes, it is insane.  Here's why:
Deleting a single file changes only a few bytes of directory metadata.
With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
modify the directory block in question, calculate parity, then write out
3MB of data to rust.  So you consume 6MB of bandwidth to write less than
a dozen bytes.  With a 12 drive RAID6 that's 12MB of bandwidth to modify
a few bytes of metadata.  Yes, insane.

Parity RAID sucks in general because of RMW, but it is orders of
magnitude worse when one chooses to use an insane chunk size to boot,
and especially so with a large drive count.

It seems people tend to use large chunk sizes because array
initialization is a bit faster, and running block x-fer "tests" with dd
buffered sequential reads/writes makes their Levi's expand.  Then they
are confused when their actual workloads are horribly slow.

Recreate your array, partition aligned, and manually specify a sane
chunk size of something like 32KB.  You'll be much happier with real
workloads.

-- 
Stan




* Re: O_DIRECT to md raid 6 is slow
  2012-08-15 22:00     ` Stan Hoeppner
@ 2012-08-15 22:10       ` Andy Lutomirski
  2012-08-15 23:50         ` Stan Hoeppner
  2012-08-15 23:07       ` Miquel van Smoorenburg
  1 sibling, 1 reply; 21+ messages in thread
From: Andy Lutomirski @ 2012-08-15 22:10 UTC (permalink / raw)
  To: stan; +Cc: John Robinson, linux-kernel, linux-raid

On Wed, Aug 15, 2012 at 3:00 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> On 8/15/2012 12:57 PM, Andy Lutomirski wrote:
>> On Wed, Aug 15, 2012 at 4:50 AM, John Robinson
>> <john.robinson@anonymous.org.uk> wrote:
>>> On 15/08/2012 01:49, Andy Lutomirski wrote:
>>>>
>>>> If I do:
>>>> # dd if=/dev/zero of=/dev/md0p1 bs=8M
>>>
>>> [...]
>>>
>>>> It looks like md isn't recognizing that I'm writing whole stripes when
>>>> I'm in O_DIRECT mode.
>>>
>>>
>>> I see your md device is partitioned. Is the partition itself stripe-aligned?
>>
>> Crud.
>>
>> md0 : active raid6 sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1] sdb1[0]
>>       11720536064 blocks super 1.2 level 6, 512k chunk, algorithm 2
>> [6/6] [UUUUUU]
>>
>> IIUC this means that I/O should be aligned on 2MB boundaries (512k
>> chunk * 4 non-parity disks).  gdisk put my partition on a 2048 sector
>> (i.e. 1MB) boundary.
>
> It's time to blow away the array and start over.  You're already
> misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
> but for a handful of niche all streaming workloads with little/no
> rewrite, such as video surveillance or DVR workloads.
>
> Yes, 512KB is the md 1.2 default.  And yes, it is insane.  Here's why:
> Deleting a single file changes only a few bytes of directory metadata.
> With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
> modify the directory block in question, calculate parity, then write out
> 3MB of data to rust.  So you consume 6MB of bandwidth to write less than
> a dozen bytes.  With a 12 drive RAID6 that's 12MB of bandwidth to modify
> a few bytes of metadata.  Yes, insane.

Grr.  I thought the bad old days of filesystem and related defaults
sucking were over.  cryptsetup aligns sanely these days, xfs is
sensible, etc.  wtf?  <rant>Why is there no sensible filesystem for
huge disks?  zfs can't cp --reflink and has all kinds of source
availability and licensing issues, xfs can't dedupe at all, and btrfs
isn't nearly stable enough.</rant>

Anyhow, I'll try the patch from Wu Fengguang.  There's still a bug here...

--Andy


* Re: O_DIRECT to md raid 6 is slow
  2012-08-15 22:00     ` Stan Hoeppner
  2012-08-15 22:10       ` Andy Lutomirski
@ 2012-08-15 23:07       ` Miquel van Smoorenburg
  2012-08-16 11:05         ` Stan Hoeppner
  1 sibling, 1 reply; 21+ messages in thread
From: Miquel van Smoorenburg @ 2012-08-15 23:07 UTC (permalink / raw)
  To: stan; +Cc: linux-kernel

In article <xs4all.502C1C01.1040509@hardwarefreak.com> you write:
>It's time to blow away the array and start over.  You're already
>misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
>but for a handful of niche all streaming workloads with little/no
>rewrite, such as video surveillance or DVR workloads.
>
>Yes, 512KB is the md 1.2 default.  And yes, it is insane.  Here's why:
>Deleting a single file changes only a few bytes of directory metadata.
>With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
>modify the directory block in question, calculate parity, then write out
>3MB of data to rust.  So you consume 6MB of bandwidth to write less than
>a dozen bytes.  With a 12 drive RAID6 that's 12MB of bandwidth to modify
>a few bytes of metadata.  Yes, insane.

Ehrm no. If you modify, say, a 4K block on a RAID5 array, you just have
to read that 4K block, and the corresponding 4K block on the
parity drive, recalculate parity, and write back 4K of data and 4K
of parity. (read|read) modify (write|write). You do not have to
do I/O in chunksize, ehm, chunks, and you do not have to rmw all disks.
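
The (read|read) modify (write|write) sequence works because RAID5 parity is a plain XOR, so the new parity can be derived from the old parity, the old data block and the new data block alone. The following is a toy sketch of that identity, not md's implementation: it uses three in-memory "data disks" and random 4K blocks. RAID6's second parity block uses Galois-field arithmetic and is not covered here.

```python
import os


def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


BLOCK = 4096

# One 4K block from each of three data disks, plus their XOR parity.
data = [os.urandom(BLOCK) for _ in range(3)]
parity = xor(xor(data[0], data[1]), data[2])

# Small write to disk 1: only the old data block and old parity are read.
new_block = os.urandom(BLOCK)
new_parity = xor(xor(parity, data[1]), new_block)
data[1] = new_block

# Cross-check against parity recomputed from every data block.
full = xor(xor(data[0], data[1]), data[2])
print("incremental parity matches full recompute:", new_parity == full)
```

The other data blocks never need to be read, which is why the cost of a small write does not have to scale with the chunk size or the number of disks.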

>Parity RAID sucks in general because of RMW, but it is orders of
>magnitude worse when one chooses to use an insane chunk size to boot,
>and especially so with a large drive count.

If you have a lot of parallel readers (readers >> disks) then
you want chunk sizes of about 2*mean_read_size, so that for each
read you just have 1 seek on 1 disk.

If you have just a few readers (readers <<<< disks) that read
really large blocks then you want a small chunk size to keep
all disks busy.

If you have no readers and just writers and you write large
blocks, then you might want a small chunk size too, so that
you can write data+parity over the stripe in one go, bypassing rmw.
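
To make the trade-off concrete, one can count how many chunks a single I/O spans for a given chunk size. This is only a sketch: chunks_touched is a made-up helper, the offsets and sizes are illustrative, and it assumes consecutive chunks sit on different disks, so the count approximates the number of disks involved (capped at the number of data disks).

```python
def chunks_touched(offset, length, chunk):
    """Number of distinct chunks covered by an I/O of `length` bytes at `offset`."""
    if length <= 0:
        return 0
    first = offset // chunk
    last = (offset + length - 1) // chunk
    return last - first + 1


K = 1024

# A 64K read inside one 512K chunk stays on a single disk.
print(chunks_touched(0, 64 * K, 512 * K))        # 1
# The same read straddling a chunk boundary needs two disks.
print(chunks_touched(480 * K, 64 * K, 512 * K))  # 2
# A 1M streaming read over 32K chunks is spread across many chunks.
print(chunks_touched(0, 1024 * K, 32 * K))       # 32
```

Large chunks keep small random reads on one disk; small chunks spread one large request across many disks.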

Also, 256K or 512K isn't all that big nowadays, there's not much
latency difference between reading 32K or 512K.

>Recreate your array, partition aligned, and manually specify a sane
>chunk size of something like 32KB.  You'll be much happier with real
>workloads.

Aligning is a good idea, and on modern distributions partitions,
LVM lv's etc are generally created with 1MB alignment. But using
a small chunksize like 32K? That depends on the workload, but
in most cases I'd advise against it.

Mike.


* Re: O_DIRECT to md raid 6 is slow
  2012-08-15 22:10       ` Andy Lutomirski
@ 2012-08-15 23:50         ` Stan Hoeppner
  2012-08-16  1:08           ` Andy Lutomirski
  2012-08-16  6:41           ` Roman Mamedov
  0 siblings, 2 replies; 21+ messages in thread
From: Stan Hoeppner @ 2012-08-15 23:50 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: John Robinson, linux-kernel, linux-raid

On 8/15/2012 5:10 PM, Andy Lutomirski wrote:
> On Wed, Aug 15, 2012 at 3:00 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>> On 8/15/2012 12:57 PM, Andy Lutomirski wrote:
>>> On Wed, Aug 15, 2012 at 4:50 AM, John Robinson
>>> <john.robinson@anonymous.org.uk> wrote:
>>>> On 15/08/2012 01:49, Andy Lutomirski wrote:
>>>>>
>>>>> If I do:
>>>>> # dd if=/dev/zero of=/dev/md0p1 bs=8M
>>>>
>>>> [...]
>>>>
>>>>> It looks like md isn't recognizing that I'm writing whole stripes when
>>>>> I'm in O_DIRECT mode.
>>>>
>>>>
>>>> I see your md device is partitioned. Is the partition itself stripe-aligned?
>>>
>>> Crud.
>>>
>>> md0 : active raid6 sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1] sdb1[0]
>>>       11720536064 blocks super 1.2 level 6, 512k chunk, algorithm 2
>>> [6/6] [UUUUUU]
>>>
>>> IIUC this means that I/O should be aligned on 2MB boundaries (512k
>>> chunk * 4 non-parity disks).  gdisk put my partition on a 2048 sector
>>> (i.e. 1MB) boundary.
>>
>> It's time to blow away the array and start over.  You're already
>> misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
>> but for a handful of niche all streaming workloads with little/no
>> rewrite, such as video surveillance or DVR workloads.
>>
>> Yes, 512KB is the md 1.2 default.  And yes, it is insane.  Here's why:
>> Deleting a single file changes only a few bytes of directory metadata.
>> With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
>> modify the directory block in question, calculate parity, then write out
>> 3MB of data to rust.  So you consume 6MB of bandwidth to write less than
>> a dozen bytes.  With a 12 drive RAID6 that's 12MB of bandwidth to modify
>> a few bytes of metadata.  Yes, insane.
> 
> Grr.  I thought the bad old days of filesystem and related defaults
> sucking were over.  

The previous md chunk default of 64KB wasn't horribly bad, though still
maybe a bit high for a lot of common workloads.  I didn't have eyes/ears
on the discussion and/or testing process that led to the 'new' 512KB
default.  Obviously something went horribly wrong here.  512KB isn't a
show stopper as a default for 0/1/10, but is 8-16 times too large for
parity RAID.

> cryptsetup aligns sanely these days, xfs is
> sensible, etc.  

XFS won't align with the 512KB chunk default of metadata 1.2.  The
largest XFS journal stripe unit (su--chunk) is 256KB, and even that
isn't recommended.  Thus mkfs.xfs throws an error due to the 512KB
stripe.  See the md and xfs archives for more details, specifically Dave
Chinner's colorful comments on the md 512KB default.

> wtf?  <rant>Why is there no sensible filesystem for
> huge disks?  zfs can't cp --reflink and has all kinds of source
> availability and licensing issues, xfs can't dedupe at all, and btrfs
> isn't nearly stable enough.</rant>

Deduplication isn't a responsibility of a filesystem.  TTBOMK there are
two, and only two, COW filesystems in existence:  ZFS and BTRFS.  And
these are the only two to offer a native dedupe capability.  They did it
because they could, with COW, not necessarily because they *should*.
There are dozens of other single node, cluster, and distributed
filesystems in use today and none of them support COW, and thus none
support dedup.  So to *expect* a 'sensible' filesystem to include dedupe
is wishful thinking at best.

> Anyhow, I'll try the patch from Wu Fengguang.  There's still a bug here...

Always one somewhere.

-- 
Stan



* Re: O_DIRECT to md raid 6 is slow
  2012-08-15 23:50         ` Stan Hoeppner
@ 2012-08-16  1:08           ` Andy Lutomirski
  2012-08-16  6:41           ` Roman Mamedov
  1 sibling, 0 replies; 21+ messages in thread
From: Andy Lutomirski @ 2012-08-16  1:08 UTC (permalink / raw)
  To: stan; +Cc: John Robinson, linux-kernel, linux-raid

On Wed, Aug 15, 2012 at 4:50 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> On 8/15/2012 5:10 PM, Andy Lutomirski wrote:
>> On Wed, Aug 15, 2012 at 3:00 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>>> On 8/15/2012 12:57 PM, Andy Lutomirski wrote:
>>>> On Wed, Aug 15, 2012 at 4:50 AM, John Robinson
>>>> <john.robinson@anonymous.org.uk> wrote:
>>>>> On 15/08/2012 01:49, Andy Lutomirski wrote:
>>>>>>
>>>>>> If I do:
>>>>>> # dd if=/dev/zero of=/dev/md0p1 bs=8M
>>>>>
>>>>> [...]
>>
>> Grr.  I thought the bad old days of filesystem and related defaults
>> sucking were over.
>
> The previous md chunk default of 64KB wasn't horribly bad, though still
> maybe a bit high for a lot of common workloads.  I didn't have eyes/ears
> on the discussion and/or testing process that led to the 'new' 512KB
> default.  Obviously something went horribly wrong here.  512KB isn't a
> show stopper as a default for 0/1/10, but is 8-16 times too large for
> parity RAID.
>
>> cryptsetup aligns sanely these days, xfs is
>> sensible, etc.
>
> XFS won't align with the 512KB chunk default of metadata 1.2.  The
> largest XFS journal stripe unit (su--chunk) is 256KB, and even that
> isn't recommended.  Thus mkfs.xfs throws an error due to the 512KB
> stripe.  See the md and xfs archives for more details, specifically Dave
> Chinner's colorful comments on the md 512KB default.

Heh -- that's why the math didn't make any sense :)

>
>> wtf?  <rant>Why is there no sensible filesystem for
>> huge disks?  zfs can't cp --reflink and has all kinds of source
>> availability and licensing issues, xfs can't dedupe at all, and btrfs
>> isn't nearly stable enough.</rant>
>
> Deduplication isn't a responsibility of a filesystem.  TTBOMK there are
> two, and only two, COW filesystems in existence:  ZFS and BTRFS.  And
> these are the only two to offer a native dedupe capability.  They did it
> because they could, with COW, not necessarily because they *should*.
> There are dozens of other single node, cluster, and distributed
> filesystems in use today and none of them support COW, and thus none
> support dedup.  So to *expect* a 'sensible' filesystem to include dedupe
> is wishful thinking at best.

I should clarify my rant for the record.  I don't care about in-fs
dedupe.  I want COW so userspace can dedupe and generally replace
hardlinks with sensible cowlinks.  I'm also working on some fun tools
that *require* reflinks for anything resembling decent performance.

--Andy


* Re: O_DIRECT to md raid 6 is slow
  2012-08-15 23:50         ` Stan Hoeppner
  2012-08-16  1:08           ` Andy Lutomirski
@ 2012-08-16  6:41           ` Roman Mamedov
  1 sibling, 0 replies; 21+ messages in thread
From: Roman Mamedov @ 2012-08-16  6:41 UTC (permalink / raw)
  To: stan; +Cc: Andy Lutomirski, John Robinson, linux-kernel, linux-raid


On Wed, 15 Aug 2012 18:50:44 -0500
Stan Hoeppner <stan@hardwarefreak.com> wrote:

> TTBOMK there are two, and only two, COW filesystems in existence:  ZFS and BTRFS.

There is also NILFS2: http://www.nilfs.org/en/
And in general, any https://en.wikipedia.org/wiki/Log-structured_file_system
is COW by design, but afaik of those only NILFS is also in the mainline Linux
kernel AND is not aimed just for some niche like flash-based devices, but for
general-purpose usage.

-- 
With respect,
Roman

~~~~~~~~~~~~~~~~~~~~~~~~~~~
"Stallman had a printer,
with code he could not see.
So he began to tinker,
and set the software free."



* Re: O_DIRECT to md raid 6 is slow
  2012-08-15 23:07       ` Miquel van Smoorenburg
@ 2012-08-16 11:05         ` Stan Hoeppner
  2012-08-16 21:50           ` Miquel van Smoorenburg
  0 siblings, 1 reply; 21+ messages in thread
From: Stan Hoeppner @ 2012-08-16 11:05 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: linux-kernel

On 8/15/2012 6:07 PM, Miquel van Smoorenburg wrote:
> In article <xs4all.502C1C01.1040509@hardwarefreak.com> you write:
>> It's time to blow away the array and start over.  You're already
>> misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
>> but for a handful of niche all streaming workloads with little/no
>> rewrite, such as video surveillance or DVR workloads.
>>
>> Yes, 512KB is the md 1.2 default.  And yes, it is insane.  Here's why:
>> Deleting a single file changes only a few bytes of directory metadata.
>> With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
>> modify the directory block in question, calculate parity, then write out
>> 3MB of data to rust.  So you consume 6MB of bandwidth to write less than
>> a dozen bytes.  With a 12 drive RAID6 that's 12MB of bandwidth to modify
>> a few bytes of metadata.  Yes, insane.
> 
> Ehrm no. If you modify, say, a 4K block on a RAID5 array, you just have
> to read that 4K block, and the corresponding 4K block on the
> parity drive, recalculate parity, and write back 4K of data and 4K
> of parity. (read|read) modify (write|write). You do not have to
> do I/O in chunksize, ehm, chunks, and you do not have to rmw all disks.

See:  http://www.spinics.net/lists/xfs/msg12627.html

Dave usually knows what he's talking about, and I didn't see Neil nor
anyone else correcting him on his description of md RMW behavior.  What
I stated above is pretty much exactly what Dave stated, but for the fact
I got the RMW read bytes wrong--should be 2MB/3MB for a 6 drive md/RAID6
and 5MB/6MB for 12 drives.
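
The corrected figures are just chunk size times disk count. A small sketch of that arithmetic, assuming a 512KB chunk and two parity chunks per stripe (RAID6); stripe_sizes_kib is a hypothetical helper name.

```python
def stripe_sizes_kib(drives, chunk_kib=512, parity=2):
    """Return (data KiB, data + parity KiB) for one full stripe."""
    data = (drives - parity) * chunk_kib
    total = drives * chunk_kib
    return data, total


for drives in (6, 12):
    data, total = stripe_sizes_kib(drives)
    print(f"{drives} drives: {data // 1024}MB data, {total // 1024}MB with parity")
```

These are the sizes of a whole stripe; whether md actually reads and rewrites a whole stripe for a small update is the point under dispute in this thread.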

>> Parity RAID sucks in general because of RMW, but it is orders of
>> magnitude worse when one chooses to use an insane chunk size to boot,
>> and especially so with a large drive count.
[snip]
> Also, 256K or 512K isn't all that big nowadays, there's not much
> latency difference between reading 32K or 512K.

You're forgetting 3 very important things:

1.  All filesystems have metadata
2.  All (worth using) filesystems have a metadata journal
3.  All workloads include some, if not major, metadata operations

When writing journal and directory metadata there is a huge difference
between a 32KB and 512KB chunk especially as the drive count in the
array increases.  Rarely does a filesystem pack enough journal
operations into a single writeout to fill a 512KB stripe, let alone a
4MB stripe.  With a 32KB chunk you see full stripe width journal writes
frequently, minimizing the number of RMW writes to the journal, even up
to 16 data spindle parity arrays (18 drive RAID6).   Using a 512KB chunk
will cause most journal writes to be partial stripe writes, triggering
RMW for most journal writes.  The same is true for directory metadata
writes.
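
The stripe widths behind this argument can be sketched as follows. The helper names are made up, the 256K journal write is a hypothetical size chosen only for illustration, and pairing the "4MB stripe" with 8 data disks at a 512KB chunk is an assumption, since the drive count for that figure is not stated.

```python
K = 1024


def full_stripe_bytes(chunk, data_disks):
    """User data held by one full stripe."""
    return chunk * data_disks


def is_full_stripe_write(offset, length, chunk, data_disks):
    """True if the write starts on a stripe boundary and covers whole stripes only."""
    stripe = full_stripe_bytes(chunk, data_disks)
    return length > 0 and offset % stripe == 0 and length % stripe == 0


# 32KB chunk on 16 data spindles (18 drive RAID6): 512K stripe.
print(full_stripe_bytes(32 * K, 16) // K, "K")
# 512KB chunk on an assumed 8 data spindles: 4096K (4MB) stripe.
print(full_stripe_bytes(512 * K, 8) // K, "K")

# Hypothetical 256K journal write on a 4-data-disk array such as the one in this thread.
print(is_full_stripe_write(0, 256 * K, 32 * K, 4))    # True: two whole 128K stripes
print(is_full_stripe_write(0, 256 * K, 512 * K, 4))   # False: a fraction of a 2M stripe
```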

Everyone knows that parity RAID sucks for anything but purely streaming
workloads with little metadata.  With most/all other workloads, using a
large chunk size, such as the md metadata 1.2 default of 512KB, with
parity RAID, simply makes it much worse, whether the RMW cycle affects
all disks or just one data disk and one parity disk.

>> Recreate your array, partition aligned, and manually specify a sane
>> chunk size of something like 32KB.  You'll be much happier with real
>> workloads.
> 
> Aligning is a good idea, 

Understatement of the century.  Just as critical, if not more so, is FS
stripe alignment: it is mandatory with parity RAID, because without it
even a full stripe writeout can trigger RMW.

> and on modern distributions partitions,
> LVM lv's etc are generally created with 1MB alignment. But using
> a small chunksize like 32K? That depends on the workload, but
> in most cases I'd advise against it.

People should ignore your advice in this regard.  A small chunk size is
optimal for nearly all workloads on a parity array for the reasons I
stated above.  It's the large chunk that is extremely workload
dependent, as again, it only fits well with low metadata streaming
workloads.

-- 
Stan


* Re: O_DIRECT to md raid 6 is slow
  2012-08-16 11:05         ` Stan Hoeppner
@ 2012-08-16 21:50           ` Miquel van Smoorenburg
  2012-08-17  7:31             ` Stan Hoeppner
  0 siblings, 1 reply; 21+ messages in thread
From: Miquel van Smoorenburg @ 2012-08-16 21:50 UTC (permalink / raw)
  To: stan; +Cc: linux-kernel

On 16-08-12 1:05 PM, Stan Hoeppner wrote:
> On 8/15/2012 6:07 PM, Miquel van Smoorenburg wrote:
>> Ehrm no. If you modify, say, a 4K block on a RAID5 array, you just have
>> to read that 4K block, and the corresponding 4K block on the
>> parity drive, recalculate parity, and write back 4K of data and 4K
>> of parity. (read|read) modify (write|write). You do not have to
>> do I/O in chunksize, ehm, chunks, and you do not have to rmw all disks.
>
> See:  http://www.spinics.net/lists/xfs/msg12627.html
>
> Dave usually knows what he's talking about, and I didn't see Neil nor
> anyone else correcting him on his description of md RMW behavior.

Well he's wrong, or you're interpreting it incorrectly.

I did a simple test:

* created a 1G partition on 3 separate disks
* created a md raid5 array with 512K chunksize:
   mdadm -C /dev/md0 -l 5 -c $((1024*512)) -n 3 /dev/sdb1 /dev/sdc1 /dev/sdd1
* ran disk monitoring using 'iostat -k 5 /dev/sdb1 /dev/sdc1 /dev/sdd1'
* wrote a single 4K block:
   dd if=/dev/zero bs=4K count=1 oflag=direct seek=30 of=/dev/md0

Output from iostat over the period in which the 4K write was done. Look 
at kB read and kB written:

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdb1              0.60         0.00         1.60          0          8
sdc1              0.60         0.80         0.80          4          4
sdd1              0.60         0.00         1.60          0          8

As you can see, a single 4K read, and a few writes. You see a few more
blocks written than you'd expect because the superblock is updated too.
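
For readers following along, the parity arithmetic behind a small RAID5 write can be sketched in a few lines of Python. This is only an illustration of the XOR algebra, not md's code; the helper name and the toy 4-byte blocks are made up. It shows that updating parity from the old data and old parity gives the same result as recomputing it from the other data block in the stripe, so only a block-sized region on each disk has to be touched.

```python
def xor_blocks(*blocks):
    """XOR equally sized byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# Toy stripe on a 3 disk RAID5: two data blocks and one parity block.
d0_old = bytes([0x10, 0x20, 0x30, 0x40])
d1 = bytes([0xAA, 0xBB, 0xCC, 0xDD])
parity_old = xor_blocks(d0_old, d1)

d0_new = bytes([0x00, 0x00, 0x00, 0x00])  # the block being overwritten

# Variant 1: use the old data block and the old parity block.
parity_from_old = xor_blocks(parity_old, d0_old, d0_new)

# Variant 2: use the data block that is not being written.
parity_from_rest = xor_blocks(d0_new, d1)

assert parity_from_old == parity_from_rest
```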

Mike.

* Re: O_DIRECT to md raid 6 is slow
  2012-08-16 21:50           ` Miquel van Smoorenburg
@ 2012-08-17  7:31             ` Stan Hoeppner
  2012-08-17 11:16               ` Miquel van Smoorenburg
  0 siblings, 1 reply; 21+ messages in thread
From: Stan Hoeppner @ 2012-08-17  7:31 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: linux-kernel

On 8/16/2012 4:50 PM, Miquel van Smoorenburg wrote:
> On 16-08-12 1:05 PM, Stan Hoeppner wrote:
>> On 8/15/2012 6:07 PM, Miquel van Smoorenburg wrote:
>>> Ehrm no. If you modify, say, a 4K block on a RAID5 array, you just have
>>> to read that 4K block, and the corresponding 4K block on the
>>> parity drive, recalculate parity, and write back 4K of data and 4K
>>> of parity. (read|read) modify (write|write). You do not have to
>>> do I/O in chunksize, ehm, chunks, and you do not have to rmw all disks.
>>
>> See:  http://www.spinics.net/lists/xfs/msg12627.html
>>
>> Dave usually knows what he's talking about, and I didn't see Neil nor
>> anyone else correcting him on his description of md RMW behavior.
> 
> Well he's wrong, or you're interpreting it incorrectly.
> 
> I did a simple test:
> 
> * created a 1G partition on 3 separate disks
> * created a md raid5 array with 512K chunksize:
>   mdadm -C /dev/md0 -l 5 -c $((1024*512)) -n 3 /dev/sdb1 /dev/sdc1 /dev/sdd1
> * ran disk monitoring using 'iostat -k 5 /dev/sdb1 /dev/sdc1 /dev/sdd1'
> * wrote a single 4K block:
>   dd if=/dev/zero bs=4K count=1 oflag=direct seek=30 of=/dev/md0
> 
> Output from iostat over the period in which the 4K write was done. Look
> at kB read and kB written:
> 
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> sdb1              0.60         0.00         1.60          0          8
> sdc1              0.60         0.80         0.80          4          4
> sdd1              0.60         0.00         1.60          0          8
> 
> As you can see, a single 4K read, and a few writes. You see a few more
> blocks written than you'd expect because the superblock is updated too.

I'm no dd expert, but this looks like you're simply writing a 4KB block
to a new stripe, using an offset, but not to an existing stripe, as the
array is in a virgin state.  So it doesn't appear this test is going to
trigger RMW.  Don't you now need to do another write in the same
stripe to trigger RMW?  Maybe I'm just reading this wrong.

-- 
Stan


* Re: O_DIRECT to md raid 6 is slow
  2012-08-17  7:31             ` Stan Hoeppner
@ 2012-08-17 11:16               ` Miquel van Smoorenburg
       [not found]                 ` <502F237D.6060806@hardwarefreak.com>
  0 siblings, 1 reply; 21+ messages in thread
From: Miquel van Smoorenburg @ 2012-08-17 11:16 UTC (permalink / raw)
  To: stan; +Cc: linux-kernel

On 08/17/2012 09:31 AM, Stan Hoeppner wrote:
> On 8/16/2012 4:50 PM, Miquel van Smoorenburg wrote:
>> I did a simple test:
>>
>> * created a 1G partition on 3 separate disks
>> * created a md raid5 array with 512K chunksize:
>>    mdadm -C /dev/md0 -l 5 -c $((1024*512)) -n 3 /dev/sdb1 /dev/sdc1 /dev/sdd1
>> * ran disk monitoring using 'iostat -k 5 /dev/sdb1 /dev/sdc1 /dev/sdd1'
>> * wrote a single 4K block:
>>    dd if=/dev/zero bs=4K count=1 oflag=direct seek=30 of=/dev/md0
>>
>> Output from iostat over the period in which the 4K write was done. Look
>> at kB read and kB written:
>>
>> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
>> sdb1              0.60         0.00         1.60          0          8
>> sdc1              0.60         0.80         0.80          4          4
>> sdd1              0.60         0.00         1.60          0          8
>>
>> As you can see, a single 4K read, and a few writes. You see a few more
>> blocks written than you'd expect because the superblock is updated too.
>
> I'm no dd expert, but this looks like you're simply writing a 4KB block
> to a new stripe, using an offset, but not to an existing stripe, as the
> array is in a virgin state.  So it doesn't appear this test is going to
> trigger RMW.  Don't you now need to do another write in the same
> stripe to trigger RMW?  Maybe I'm just reading this wrong.

That shouldn't matter, but that is easily checked of course, by writing
some random data first, then doing the dd 4K write, also with
random data, somewhere in the same area:

# dd if=/dev/urandom bs=1M count=3 of=/dev/md0
3+0 records in
3+0 records out
3145728 bytes (3.1 MB) copied, 0.794494 s, 4.0 MB/s

Now the first 6 chunks are filled with random data, let's write 4K
somewhere in there:

# dd if=/dev/urandom bs=4k count=1 seek=25 of=/dev/md0
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.10149 s, 40.4 kB/s

Output from iostat over the period in which the 4K write was done:

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdb1              0.60         0.00         1.60          0          8
sdc1              0.60         0.80         0.80          4          4
sdd1              0.60         0.00         1.60          0          8
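
As a sanity check on the offsets used above, this small Python sketch maps a byte offset on the array to a data chunk number and a stripe number. It assumes the 512K chunk and the 3 disk RAID5 (two data chunks per stripe) described for this test. The function name is hypothetical, and it deliberately says nothing about which physical disk holds the chunk, since that depends on the parity layout.

```python
CHUNK = 512 * 1024   # chunk size stated for the test array
DATA_CHUNKS = 2      # 3 disk RAID5: two data chunks plus one parity chunk per stripe

def locate(offset_bytes):
    """Return (data chunk index, stripe index) for an array byte offset."""
    chunk_index = offset_bytes // CHUNK
    stripe_index = chunk_index // DATA_CHUNKS
    return chunk_index, stripe_index

# The 3M of random data covers chunks 0..5, i.e. stripes 0..2.
last_byte = 3 * 1024 * 1024 - 1
print(locate(last_byte))      # (5, 2)

# dd bs=4k seek=25 starts at byte 25 * 4096 = 102400 (100K).
print(locate(25 * 4096))      # (0, 0)
```

So the 4K write lands inside the first chunk of the first stripe, well within the area that was filled beforehand.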

Mike.

* Re: O_DIRECT to md raid 6 is slow
       [not found]                       ` <5030F1C6.90205@hesbynett.no>
@ 2012-08-19 23:34                         ` Stan Hoeppner
  2012-08-20  0:01                           ` NeilBrown
  2012-08-21 14:51                           ` Miquel van Smoorenburg
  0 siblings, 2 replies; 21+ messages in thread
From: Stan Hoeppner @ 2012-08-19 23:34 UTC (permalink / raw)
  To: David Brown; +Cc: Michael Tokarev, Miquel van Smoorenburg, Linux RAID, LKML

On 8/19/2012 9:01 AM, David Brown wrote:
> I'm sort of jumping in to this thread, so my apologies if I repeat
> things other people have said already.

I'm glad you jumped in David.  You made a critical statement of fact
below which clears some things up.  If you had stated it early on,
before Miquel stole the thread and moved it to LKML proper, it would
have short circuited a lot of this discussion.  Which is:

> AFAIK, there is scope for a few performance optimisations in raid6.  One
> is that for small writes which only need to change one block, raid5 uses
> a "short-cut" RMW cycle (read the old data block, read the old parity
> block, calculate the new parity block, write the new data and parity
> blocks).  A similar short-cut could be implemented in raid6, though it
> is not clear how much a difference it would really make.

Thus my original statement was correct, or at least half correct[1], as
it pertained to md/RAID6.  Then Miquel switched the discussion to
md/RAID5 and stated I was all wet.  I wasn't, and neither was Dave
Chinner.  I was simply unaware of this md/RAID5 single block write RMW
shortcut.  I'm copying lkml proper on this simply to set the record
straight.  Not that anyone was paying attention, but it needs to be in
the same thread in the archives.  The takeaway:

md/RAID6 must read all devices in a RMW cycle.

md/RAID5 takes a shortcut for single block writes, and must only read
one drive for the RMW cycle.

[1] The only thing that's not clear at this point is if md/RAID6 also
always writes back all chunks during RMW, or only the chunk that has
changed.

-- 
Stan


* Re: O_DIRECT to md raid 6 is slow
  2012-08-19 23:34                         ` Stan Hoeppner
@ 2012-08-20  0:01                           ` NeilBrown
  2012-08-20  7:47                             ` David Brown
  2012-08-21 14:51                           ` Miquel van Smoorenburg
  1 sibling, 1 reply; 21+ messages in thread
From: NeilBrown @ 2012-08-20  0:01 UTC (permalink / raw)
  To: stan
  Cc: David Brown, Michael Tokarev, Miquel van Smoorenburg, Linux RAID, LKML

On Sun, 19 Aug 2012 18:34:28 -0500 Stan Hoeppner <stan@hardwarefreak.com>
wrote:

> On 8/19/2012 9:01 AM, David Brown wrote:
> > I'm sort of jumping in to this thread, so my apologies if I repeat
> > things other people have said already.
> 
> I'm glad you jumped in David.  You made a critical statement of fact
> below which clears some things up.  If you had stated it early on,
> before Miquel stole the thread and moved it to LKML proper, it would
> have short circuited a lot of this discussion.  Which is:
> 
> > AFAIK, there is scope for a few performance optimisations in raid6.  One
> > is that for small writes which only need to change one block, raid5 uses
> > a "short-cut" RMW cycle (read the old data block, read the old parity
> > block, calculate the new parity block, write the new data and parity
> > blocks).  A similar short-cut could be implemented in raid6, though it
> > is not clear how much a difference it would really make.
> 
> Thus my original statement was correct, or at least half correct[1], as
> it pertained to md/RAID6.  Then Miquel switched the discussion to
> md/RAID5 and stated I was all wet.  I wasn't, and neither was Dave
> Chinner.  I was simply unaware of this md/RAID5 single block write RMW
> shortcut.  I'm copying lkml proper on this simply to set the record
> straight.  Not that anyone was paying attention, but it needs to be in
> the same thread in the archives.  The takeaway:
> 

Since we are trying to set the record straight....

> md/RAID6 must read all devices in a RMW cycle.

md/RAID6 must read all data devices (i.e. not parity devices) which it is not
going to write to, in an RMW cycle (which the code actually calls RCW -
reconstruct-write).

> 
> md/RAID5 takes a shortcut for single block writes, and must only read
> one drive for the RMW cycle.

md/RAID5 uses an alternate mechanism when the number of data blocks that need
to be written is less than half the number of data blocks in a stripe.  In
this alternate mechanism (which the code calls RMW - read-modify-write),
md/RAID5 reads all the blocks that it is about to write to, plus the parity
block.  It then computes the new parity and writes it out along with the new
data.
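
A rough Python sketch of the choice described above, for illustration only. It follows the rule exactly as stated in this message (RMW when fewer than half of the data blocks in the stripe are being written, reconstruct-write otherwise) and ignores anything the real code may also take into account, such as blocks already held in the stripe cache. The function name and the return format are made up.

```python
def raid5_partial_write_reads(data_blocks_in_stripe, blocks_to_write):
    """Return (method, number of blocks that must be read before writing)."""
    if blocks_to_write >= data_blocks_in_stripe:
        # Full stripe write: parity is computed from the new data alone.
        return ("full-stripe", 0)
    if blocks_to_write < data_blocks_in_stripe / 2:
        # RMW: read the blocks about to be overwritten, plus the parity block.
        return ("rmw", blocks_to_write + 1)
    # RCW: read the data blocks that are NOT being written.
    return ("rcw", data_blocks_in_stripe - blocks_to_write)

print(raid5_partial_write_reads(2, 1))    # ('rcw', 1)
print(raid5_partial_write_reads(16, 1))   # ('rmw', 2)
```

The first call corresponds to a single block write on a 3 disk RAID5. Under this simplified rule it needs one read, which would be consistent with the single 4K read seen in the iostat output earlier in the thread, though that is an inference and not something checked against the md source.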

> 
> [1] The only thing that's not clear at this point is if md/RAID6 also
> always writes back all chunks during RMW, or only the chunk that has
> changed.

Do you seriously imagine anyone would write code to write out data which it
is known has not changed?  Sad. :-)

NeilBrown


* Re: O_DIRECT to md raid 6 is slow
  2012-08-20  0:01                           ` NeilBrown
@ 2012-08-20  7:47                             ` David Brown
  0 siblings, 0 replies; 21+ messages in thread
From: David Brown @ 2012-08-20  7:47 UTC (permalink / raw)
  To: NeilBrown; +Cc: stan, Michael Tokarev, Miquel van Smoorenburg, Linux RAID, LKML

On 20/08/2012 02:01, NeilBrown wrote:
> On Sun, 19 Aug 2012 18:34:28 -0500 Stan Hoeppner <stan@hardwarefreak.com>
> wrote:
>
>
> Since we are trying to set the record straight....
>
>> md/RAID6 must read all devices in a RMW cycle.
>
> md/RAID6 must read all data devices (i.e. not parity devices) which it is not
> going to write to, in an RMW cycle (which the code actually calls RCW -
> reconstruct-write).
>
>>
>> md/RAID5 takes a shortcut for single block writes, and must only read
>> one drive for the RMW cycle.
>
> md/RAID5 uses an alternate mechanism when the number of data blocks that need
> to be written is less than half the number of data blocks in a stripe.  In
> this alternate mechanism (which the code calls RMW - read-modify-write),
> md/RAID5 reads all the blocks that it is about to write to, plus the parity
> block.  It then computes the new parity and writes it out along with the new
> data.
>

I've learned something here too - I thought this mechanism was only used 
for a single block write.  Thanks for the correction, Neil.

If you (or anyone else) are ever interested in implementing the same 
thing in raid6, the maths is not actually too bad (now that I've thought 
about it).  (I understand the theory here, but I'm afraid I don't have 
the experience with kernel programming to do the implementation.)

To change a few data blocks, you need to read in the old data blocks 
(Da, Db, etc.) and the old parities (P, Q).

Calculate the xor differences Xa = Da + D'a, Xb = Db + D'b, etc.

The new P parity is P' = P + Xa + Xb +...

The new Q parity is Q' = Q + (g^a).Xa + (g^b).Xb + ...
The power series there is just the normal raid6 Q-parity calculation 
with most entries set to 0, and the Xa, Xb, etc. in the appropriate spots.

If the raid6 Q-parity function already has short-cuts for handling zero 
entries (I haven't looked, but the mechanism might be in place to 
slightly speed up dual-failure recovery), then all the blocks are in place.
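
The algebra above can be checked with a short Python sketch. It is only an illustration on single bytes, not kernel code: it assumes the usual RAID6 field GF(2^8) with polynomial 0x11d and generator g = 2, and the function names are invented for the example. It confirms that applying the xor differences to the old P and Q gives the same parities as recomputing both from scratch.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8), reducing by the polynomial 0x11d."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return result

def g_pow(n):
    """Return g^n for the generator g = 2."""
    value = 1
    for _ in range(n):
        value = gf_mul(value, 2)
    return value

def full_pq(data):
    """Compute P and Q from scratch over every data byte in the stripe."""
    p = q = 0
    for index, d in enumerate(data):
        p ^= d
        q ^= gf_mul(g_pow(index), d)
    return p, q

def delta_pq(p, q, changes):
    """Update P and Q given {disk index: (old byte, new byte)}."""
    for index, (old, new) in changes.items():
        x = old ^ new                  # Xa = Da + D'a
        p ^= x                         # P' = P + Xa + ...
        q ^= gf_mul(g_pow(index), x)   # Q' = Q + (g^a).Xa + ...
    return p, q

old = [0x11, 0x22, 0x33, 0x44]
new = [0x11, 0xAB, 0x33, 0x05]         # disks 1 and 3 change
p, q = full_pq(old)
assert delta_pq(p, q, {1: (0x22, 0xAB), 3: (0x44, 0x05)}) == full_pq(new)
```

Only the changed data bytes and the old P and Q are needed as input, which is the point of the short-cut: the unchanged disks never have to be read.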



* Re: O_DIRECT to md raid 6 is slow
  2012-08-19 23:34                         ` Stan Hoeppner
  2012-08-20  0:01                           ` NeilBrown
@ 2012-08-21 14:51                           ` Miquel van Smoorenburg
  2012-08-22  3:59                             ` Stan Hoeppner
  1 sibling, 1 reply; 21+ messages in thread
From: Miquel van Smoorenburg @ 2012-08-21 14:51 UTC (permalink / raw)
  To: stan; +Cc: David Brown, Michael Tokarev, Linux RAID, LKML

On 08/20/2012 01:34 AM, Stan Hoeppner wrote:
> I'm glad you jumped in David.  You made a critical statement of fact
> below which clears some things up.  If you had stated it early on,
> before Miquel stole the thread and moved it to LKML proper, it would
> have short circuited a lot of this discussion.  Which is:

I'm sorry about that; it's because of the software that I use to
follow most mailing lists. I didn't notice that the discussion was cc'ed
to both lkml and l-r. I should fix that.

> Thus my original statement was correct, or at least half correct[1], as
> it pertained to md/RAID6.  Then Miquel switched the discussion to
> md/RAID5 and stated I was all wet.  I wasn't, and neither was Dave
> Chinner.  I was simply unaware of this md/RAID5 single block write RMW
> shortcut

Well, all I tried to say is that a small write of, say, 4K, to a 
raid5/raid6 array does not need to re-write the whole stripe (i.e. 
chunksize * nr_disks) but just 4K * nr_disks, or the RMW variant of that.

Mike.

* Re: O_DIRECT to md raid 6 is slow
  2012-08-21 14:51                           ` Miquel van Smoorenburg
@ 2012-08-22  3:59                             ` Stan Hoeppner
  0 siblings, 0 replies; 21+ messages in thread
From: Stan Hoeppner @ 2012-08-22  3:59 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: David Brown, Michael Tokarev, Linux RAID, LKML

On 8/21/2012 9:51 AM, Miquel van Smoorenburg wrote:
> On 08/20/2012 01:34 AM, Stan Hoeppner wrote:
>> I'm glad you jumped in David.  You made a critical statement of fact
>> below which clears some things up.  If you had stated it early on,
>> before Miquel stole the thread and moved it to LKML proper, it would
>> have short circuited a lot of this discussion.  Which is:
> 
> I'm sorry about that; it's because of the software that I use to
> follow most mailing lists. I didn't notice that the discussion was cc'ed
> to both lkml and l-r. I should fix that.

Oh, my bad.  I thought it was intentional.

Don't feel too bad about it.  When I tried to copy lkml back in on the
one message I screwed up as well.  I thought Tbird had filled in the full
address but it didn't.

>> Thus my original statement was correct, or at least half correct[1], as
>> it pertained to md/RAID6.  Then Miquel switched the discussion to
>> md/RAID5 and stated I was all wet.  I wasn't, and neither was Dave
>> Chinner.  I was simply unaware of this md/RAID5 single block write RMW
>> shortcut
> 
> Well, all I tried to say is that a small write of, say, 4K, to a
> raid5/raid6 array does not need to re-write the whole stripe (i.e.
> chunksize * nr_disks) but just 4K * nr_disks, or the RMW variant of that.

And I'm glad you did.  Before that I didn't know about these efficiency
shortcuts and exactly how md does writeback on partial stripe updates.

Even with these optimizations, a default 512KB chunk is too big, for the
reasons I stated, the big one being the fact that you'll rarely fill a
full stripe, meaning nearly every write will incur an RMW cycle.

-- 
Stan


end of thread (~2012-08-22  4:00 UTC)

Thread overview: 21+ messages
2012-08-15  0:49 O_DIRECT to md raid 6 is slow Andy Lutomirski
2012-08-15  1:07 ` kedacomkernel
2012-08-15  1:12   ` Andy Lutomirski
2012-08-15  1:23     ` kedacomkernel
2012-08-15 11:50 ` John Robinson
2012-08-15 17:57   ` Andy Lutomirski
2012-08-15 22:00     ` Stan Hoeppner
2012-08-15 22:10       ` Andy Lutomirski
2012-08-15 23:50         ` Stan Hoeppner
2012-08-16  1:08           ` Andy Lutomirski
2012-08-16  6:41           ` Roman Mamedov
2012-08-15 23:07       ` Miquel van Smoorenburg
2012-08-16 11:05         ` Stan Hoeppner
2012-08-16 21:50           ` Miquel van Smoorenburg
2012-08-17  7:31             ` Stan Hoeppner
2012-08-17 11:16               ` Miquel van Smoorenburg
     [not found]                 ` <502F237D.6060806@hardwarefreak.com>
     [not found]                   ` <502F698C.9010507@msgid.tls.msk.ru>
     [not found]                     ` <50305AB9.5080302@hardwarefreak.com>
     [not found]                       ` <5030F1C6.90205@hesbynett.no>
2012-08-19 23:34                         ` Stan Hoeppner
2012-08-20  0:01                           ` NeilBrown
2012-08-20  7:47                             ` David Brown
2012-08-21 14:51                           ` Miquel van Smoorenburg
2012-08-22  3:59                             ` Stan Hoeppner