* O_DIRECT to md raid 6 is slow
@ 2012-08-15  0:49 Andy Lutomirski
  2012-08-15  1:07   ` kedacomkernel
  2012-08-15 11:50 ` John Robinson
  0 siblings, 2 replies; 31+ messages in thread
From: Andy Lutomirski @ 2012-08-15  0:49 UTC (permalink / raw)
  To: linux-kernel, linux-raid

If I do:
# dd if=/dev/zero of=/dev/md0p1 bs=8M
then iostat -m 5 says:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00   26.88   35.27    0.00   37.85

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb             265.20         1.16        54.79          5        273
sdc             266.20         1.47        54.73          7        273
sdd             264.20         1.38        54.54          6        272
sdf             286.00         1.84        54.74          9        273
sde             266.60         1.04        54.75          5        273
sdg             265.00         1.02        54.74          5        273
md0           55808.00         0.00       218.00          0       1090

If I do:
# dd if=/dev/zero of=/dev/md0p1 bs=8M oflag=direct
then iostat -m 5 says:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00   11.70   12.94    0.00   75.36

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb             831.00         8.58        30.42         42        152
sdc             832.80         8.05        29.99         40        149
sdd             832.00         9.10        29.78         45        148
sdf             838.40         9.11        29.72         45        148
sde             828.80         7.91        29.79         39        148
sdg             850.80         8.00        30.18         40        150
md0            1012.60         0.00       101.27          0        506

It looks like md isn't recognizing that I'm writing whole stripes when
I'm in O_DIRECT mode.
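
For reference, a quick way to double-check the geometry md exports to
writers (a sketch using the device names above; the sysfs paths may
vary by kernel version):

# chunk size, level and device count as md reports them
mdadm --detail /dev/md0 | grep -Ei 'level|chunk|devices'
# I/O size hints the block layer advertises for this array, in bytes
cat /sys/block/md0/queue/minimum_io_size   # normally the chunk size
cat /sys/block/md0/queue/optimal_io_size   # normally the full data stripe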

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: O_DIRECT to md raid 6 is slow
  2012-08-15  0:49 O_DIRECT to md raid 6 is slow Andy Lutomirski
@ 2012-08-15  1:07   ` kedacomkernel
  2012-08-15 11:50 ` John Robinson
  1 sibling, 0 replies; 31+ messages in thread
From: kedacomkernel @ 2012-08-15  1:07 UTC (permalink / raw)
  To: Andy Lutomirski, linux-kernel, linux-raid

On 2012-08-15 08:49 Andy Lutomirski <luto@amacapital.net> Wrote:
>If I do:
># dd if=/dev/zero of=/dev/md0p1 bs=8M
>then iostat -m 5 says:
>
>avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           0.00    0.00   26.88   35.27    0.00   37.85
>
>Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
>sdb             265.20         1.16        54.79          5        273
>sdc             266.20         1.47        54.73          7        273
>sdd             264.20         1.38        54.54          6        272
>sdf             286.00         1.84        54.74          9        273
>sde             266.60         1.04        54.75          5        273
>sdg             265.00         1.02        54.74          5        273
>md0           55808.00         0.00       218.00          0       1090
>
>If I do:
># dd if=/dev/zero of=/dev/md0p1 bs=8M oflag=direct
>then iostat -m 5 says:
>avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           0.00    0.00   11.70   12.94    0.00   75.36
>
>Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
>sdb             831.00         8.58        30.42         42        152
>sdc             832.80         8.05        29.99         40        149
>sdd             832.00         9.10        29.78         45        148
>sdf             838.40         9.11        29.72         45        148
>sde             828.80         7.91        29.79         39        148
>sdg             850.80         8.00        30.18         40        150
>md0            1012.60         0.00       101.27          0        506
>
>It looks like md isn't recognizing that I'm writing whole stripes when
>I'm in O_DIRECT mode.
>
kernel version?

>--Andy
>
>-- 
>Andy Lutomirski
>AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: O_DIRECT to md raid 6 is slow
  2012-08-15  1:07   ` kedacomkernel
@ 2012-08-15  1:12   ` Andy Lutomirski
  2012-08-15  1:23       ` kedacomkernel
  -1 siblings, 1 reply; 31+ messages in thread
From: Andy Lutomirski @ 2012-08-15  1:12 UTC (permalink / raw)
  To: kedacomkernel; +Cc: linux-kernel, linux-raid

Ubuntu's 3.2.0-27-generic.  I can test on a newer kernel tomorrow.

--Andy

On Tue, Aug 14, 2012 at 6:07 PM, kedacomkernel <kedacomkernel@gmail.com> wrote:
> On 2012-08-15 08:49 Andy Lutomirski <luto@amacapital.net> Wrote:
>>If I do:
>># dd if=/dev/zero of=/dev/md0p1 bs=8M
>>then iostat -m 5 says:
>>
>>avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>           0.00    0.00   26.88   35.27    0.00   37.85
>>
>>Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
>>sdb             265.20         1.16        54.79          5        273
>>sdc             266.20         1.47        54.73          7        273
>>sdd             264.20         1.38        54.54          6        272
>>sdf             286.00         1.84        54.74          9        273
>>sde             266.60         1.04        54.75          5        273
>>sdg             265.00         1.02        54.74          5        273
>>md0           55808.00         0.00       218.00          0       1090
>>
>>If I do:
>># dd if=/dev/zero of=/dev/md0p1 bs=8M oflag=direct
>>then iostat -m 5 says:
>>avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>           0.00    0.00   11.70   12.94    0.00   75.36
>>
>>Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
>>sdb             831.00         8.58        30.42         42        152
>>sdc             832.80         8.05        29.99         40        149
>>sdd             832.00         9.10        29.78         45        148
>>sdf             838.40         9.11        29.72         45        148
>>sde             828.80         7.91        29.79         39        148
>>sdg             850.80         8.00        30.18         40        150
>>md0            1012.60         0.00       101.27          0        506
>>
>>It looks like md isn't recognizing that I'm writing whole stripes when
>>I'm in O_DIRECT mode.
>>
> kernel version?
>
>>--Andy
>>
>>--
>>Andy Lutomirski
>>AMA Capital Management, LLC



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Re: O_DIRECT to md raid 6 is slow
  2012-08-15  1:12   ` Andy Lutomirski
@ 2012-08-15  1:23       ` kedacomkernel
  0 siblings, 0 replies; 31+ messages in thread
From: kedacomkernel @ 2012-08-15  1:23 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: linux-kernel, linux-raid

On 2012-08-15 09:12 Andy Lutomirski <luto@amacapital.net> Wrote:
>Ubuntu's 3.2.0-27-generic.  I can test on a newer kernel tomorrow.
I guess it may be missing the blk_plug functionality.
Can you apply this patch and retest?

Move unplugging for direct I/O from around ->direct_IO() down to
do_blockdev_direct_IO(). This implicitly adds plugging for direct
writes.
 
CC: Li Shaohua <shli@fusionio.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/direct-io.c |    5 +++++
 mm/filemap.c   |    4 ----
 2 files changed, 5 insertions(+), 4 deletions(-)
 
--- linux-next.orig/mm/filemap.c 2012-08-05 16:24:47.859465122 +0800
+++ linux-next/mm/filemap.c 2012-08-05 16:24:48.407465135 +0800
@@ -1412,12 +1412,8 @@ generic_file_aio_read(struct kiocb *iocb
  retval = filemap_write_and_wait_range(mapping, pos,
  pos + iov_length(iov, nr_segs) - 1);
  if (!retval) {
- struct blk_plug plug;
-
- blk_start_plug(&plug);
  retval = mapping->a_ops->direct_IO(READ, iocb,
  iov, pos, nr_segs);
- blk_finish_plug(&plug);
  }
  if (retval > 0) {
  *ppos = pos + retval;
--- linux-next.orig/fs/direct-io.c 2012-07-07 21:46:39.531508198 +0800
+++ linux-next/fs/direct-io.c 2012-08-05 16:24:48.411465136 +0800
@@ -1062,6 +1062,7 @@ do_blockdev_direct_IO(int rw, struct kio
  unsigned long user_addr;
  size_t bytes;
  struct buffer_head map_bh = { 0, };
+ struct blk_plug plug;
 
  if (rw & WRITE)
  rw = WRITE_ODIRECT;
@@ -1177,6 +1178,8 @@ do_blockdev_direct_IO(int rw, struct kio
  PAGE_SIZE - user_addr / PAGE_SIZE);
  }
 
+ blk_start_plug(&plug);
+
  for (seg = 0; seg < nr_segs; seg++) {
  user_addr = (unsigned long)iov[seg].iov_base;
  sdio.size += bytes = iov[seg].iov_len;
@@ -1235,6 +1238,8 @@ do_blockdev_direct_IO(int rw, struct kio
  if (sdio.bio)
  dio_bio_submit(dio, &sdio);
 
+ blk_finish_plug(&plug);
+
  /*
   * It is possible that, we return short IO due to end of file.
   * In that case, we need to release all the pages we got hold on.
 
 
--

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: O_DIRECT to md raid 6 is slow
  2012-08-15  0:49 O_DIRECT to md raid 6 is slow Andy Lutomirski
  2012-08-15  1:07   ` kedacomkernel
@ 2012-08-15 11:50 ` John Robinson
  2012-08-15 17:57   ` Andy Lutomirski
  1 sibling, 1 reply; 31+ messages in thread
From: John Robinson @ 2012-08-15 11:50 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: linux-kernel, linux-raid

On 15/08/2012 01:49, Andy Lutomirski wrote:
> If I do:
> # dd if=/dev/zero of=/dev/md0p1 bs=8M
[...]
> It looks like md isn't recognizing that I'm writing whole stripes when
> I'm in O_DIRECT mode.

I see your md device is partitioned. Is the partition itself stripe-aligned?

Cheers,

John.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: O_DIRECT to md raid 6 is slow
  2012-08-15 11:50 ` John Robinson
@ 2012-08-15 17:57   ` Andy Lutomirski
  2012-08-15 22:00     ` Stan Hoeppner
  0 siblings, 1 reply; 31+ messages in thread
From: Andy Lutomirski @ 2012-08-15 17:57 UTC (permalink / raw)
  To: John Robinson; +Cc: linux-kernel, linux-raid

On Wed, Aug 15, 2012 at 4:50 AM, John Robinson
<john.robinson@anonymous.org.uk> wrote:
> On 15/08/2012 01:49, Andy Lutomirski wrote:
>>
>> If I do:
>> # dd if=/dev/zero of=/dev/md0p1 bs=8M
>
> [...]
>
>> It looks like md isn't recognizing that I'm writing whole stripes when
>> I'm in O_DIRECT mode.
>
>
> I see your md device is partitioned. Is the partition itself stripe-aligned?

Crud.

md0 : active raid6 sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1] sdb1[0]
      11720536064 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6] [UUUUUU]

IIUC this means that I/O should be aligned on 2MB boundaries (512k
chunk * 4 non-parity disks).  gdisk put my partition on a 2048 sector
(i.e. 1MB) boundary.

Sadly, /sys/block/md0/md0p1/alignment_offset reports 0 (instead of 1MB).

Fixing this has no effect, though.
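
As a sanity check, a sketch using the same names (the stripe arithmetic
is the 512k chunk times 4 data disks from above):

# where the partition starts, in 512-byte sectors
cat /sys/block/md0/md0p1/start
# a full data stripe is 512 KiB * 4 = 2 MiB = 4096 sectors, so the
# partition is stripe-aligned only if the start is a multiple of 4096
echo $(( $(cat /sys/block/md0/md0p1/start) % 4096 ))   # 0 means aligned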

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: O_DIRECT to md raid 6 is slow
  2012-08-15 17:57   ` Andy Lutomirski
@ 2012-08-15 22:00     ` Stan Hoeppner
  2012-08-15 22:10       ` Andy Lutomirski
  2012-08-15 23:07       ` Miquel van Smoorenburg
  0 siblings, 2 replies; 31+ messages in thread
From: Stan Hoeppner @ 2012-08-15 22:00 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: John Robinson, linux-kernel, linux-raid

On 8/15/2012 12:57 PM, Andy Lutomirski wrote:
> On Wed, Aug 15, 2012 at 4:50 AM, John Robinson
> <john.robinson@anonymous.org.uk> wrote:
>> On 15/08/2012 01:49, Andy Lutomirski wrote:
>>>
>>> If I do:
>>> # dd if=/dev/zero of=/dev/md0p1 bs=8M
>>
>> [...]
>>
>>> It looks like md isn't recognizing that I'm writing whole stripes when
>>> I'm in O_DIRECT mode.
>>
>>
>> I see your md device is partitioned. Is the partition itself stripe-aligned?
> 
> Crud.
> 
> md0 : active raid6 sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1] sdb1[0]
>       11720536064 blocks super 1.2 level 6, 512k chunk, algorithm 2
> [6/6] [UUUUUU]
> 
> IIUC this means that I/O should be aligned on 2MB boundaries (512k
> chunk * 4 non-parity disks).  gdisk put my partition on a 2048 sector
> (i.e. 1MB) boundary.

It's time to blow away the array and start over.  You're already
misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
but for a handful of niche all streaming workloads with little/no
rewrite, such as video surveillance or DVR workloads.

Yes, 512KB is the md 1.2 default.  And yes, it is insane.  Here's why:
Deleting a single file changes only a few bytes of directory metadata.
With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
modify the directory block in question, calculate parity, then write out
3MB of data to rust.  So you consume 6MB of bandwidth to write less than
a dozen bytes.  With a 12 drive RAID6 that's 12MB of bandwidth to modify
a few bytes of metadata.  Yes, insane.

Parity RAID sucks in general because of RMW, but it is orders of
magnitude worse when one chooses to use an insane chunk size to boot,
and especially so with a large drive count.

It seems people tend to use large chunk sizes because array
initialization is a bit faster, and running block x-fer "tests" with dd
buffered sequential reads/writes makes their Levi's expand.  Then they
are confused when their actual workloads are horribly slow.

Recreate your array, partition aligned, and manually specify a sane
chunk size of something like 32KB.  You'll be much happier with real
workloads.
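
For example, something along these lines (a sketch only, reusing the
member devices from earlier in the thread; this destroys the existing
array and its contents):

mdadm --stop /dev/md0
mdadm --create /dev/md0 --level=6 --raid-devices=6 --chunk=32 \
      /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1
# the data stripe is now 32 KiB * 4 = 128 KiB; start the partition on a
# multiple of that (a 1 MiB boundary works) before running mkfs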

-- 
Stan

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: O_DIRECT to md raid 6 is slow
  2012-08-15 22:00     ` Stan Hoeppner
@ 2012-08-15 22:10       ` Andy Lutomirski
  2012-08-15 23:50         ` Stan Hoeppner
  2012-08-15 23:07       ` Miquel van Smoorenburg
  1 sibling, 1 reply; 31+ messages in thread
From: Andy Lutomirski @ 2012-08-15 22:10 UTC (permalink / raw)
  To: stan; +Cc: John Robinson, linux-kernel, linux-raid

On Wed, Aug 15, 2012 at 3:00 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> On 8/15/2012 12:57 PM, Andy Lutomirski wrote:
>> On Wed, Aug 15, 2012 at 4:50 AM, John Robinson
>> <john.robinson@anonymous.org.uk> wrote:
>>> On 15/08/2012 01:49, Andy Lutomirski wrote:
>>>>
>>>> If I do:
>>>> # dd if=/dev/zero of=/dev/md0p1 bs=8M
>>>
>>> [...]
>>>
>>>> It looks like md isn't recognizing that I'm writing whole stripes when
>>>> I'm in O_DIRECT mode.
>>>
>>>
>>> I see your md device is partitioned. Is the partition itself stripe-aligned?
>>
>> Crud.
>>
>> md0 : active raid6 sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1] sdb1[0]
>>       11720536064 blocks super 1.2 level 6, 512k chunk, algorithm 2
>> [6/6] [UUUUUU]
>>
>> IIUC this means that I/O should be aligned on 2MB boundaries (512k
>> chunk * 4 non-parity disks).  gdisk put my partition on a 2048 sector
>> (i.e. 1MB) boundary.
>
> It's time to blow away the array and start over.  You're already
> misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
> but for a handful of niche all streaming workloads with little/no
> rewrite, such as video surveillance or DVR workloads.
>
> Yes, 512KB is the md 1.2 default.  And yes, it is insane.  Here's why:
> Deleting a single file changes only a few bytes of directory metadata.
> With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
> modify the directory block in question, calculate parity, then write out
> 3MB of data to rust.  So you consume 6MB of bandwidth to write less than
> a dozen bytes.  With a 12 drive RAID6 that's 12MB of bandwidth to modify
> a few bytes of metadata.  Yes, insane.

Grr.  I thought the bad old days of filesystem and related defaults
sucking were over.  cryptsetup aligns sanely these days, xfs is
sensible, etc.  wtf?  <rant>Why is there no sensible filesystem for
huge disks?  zfs can't cp --reflink and has all kinds of source
availability and licensing issues, xfs can't dedupe at all, and btrfs
isn't nearly stable enough.</rant>

Anyhow, I'll try the patch from Wu Fengguang.  There's still a bug here...

--Andy

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: O_DIRECT to md raid 6 is slow
  2012-08-15 22:00     ` Stan Hoeppner
  2012-08-15 22:10       ` Andy Lutomirski
@ 2012-08-15 23:07       ` Miquel van Smoorenburg
  2012-08-16 11:05         ` Stan Hoeppner
  1 sibling, 1 reply; 31+ messages in thread
From: Miquel van Smoorenburg @ 2012-08-15 23:07 UTC (permalink / raw)
  To: stan; +Cc: linux-kernel

In article <xs4all.502C1C01.1040509@hardwarefreak.com> you write:
>It's time to blow away the array and start over.  You're already
>misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
>but for a handful of niche all streaming workloads with little/no
>rewrite, such as video surveillance or DVR workloads.
>
>Yes, 512KB is the md 1.2 default.  And yes, it is insane.  Here's why:
>Deleting a single file changes only a few bytes of directory metadata.
>With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
>modify the directory block in question, calculate parity, then write out
>3MB of data to rust.  So you consume 6MB of bandwidth to write less than
>a dozen bytes.  With a 12 drive RAID6 that's 12MB of bandwidth to modify
>a few bytes of metadata.  Yes, insane.

Ehrm no. If you modify, say, a 4K block on a RAID5 array, you just have
to read that 4K block, and the corresponding 4K block on the
parity drive, recalculate parity, and write back 4K of data and 4K
of parity. (read|read) modify (write|write). You do not have to
do I/O in chunksize, ehm, chunks, and you do not have to rmw all disks.
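
That single-block update works because xor is its own inverse: the new
parity is the old parity xor the old data xor the new data.  A toy
sketch with one-byte stand-ins for the 4K blocks:

d_old=0x5a; d_new=0x3c; p_old=0x77   # stand-ins: old data, new data, old parity
p_new=$(( p_old ^ d_old ^ d_new ))   # P_new = P_old xor D_old xor D_new
printf 'new parity byte: 0x%02x\n' "$p_new"
# none of the other data chunks in the stripe need to be read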

>Parity RAID sucks in general because of RMW, but it is orders of
>magnitude worse when one chooses to use an insane chunk size to boot,
>and especially so with a large drive count.

If you have a lot of parallel readers (readers >> disks) then
you want chunk sizes of about 2*mean_read_size, so that for each
read you just have 1 seek on 1 disk.

If you have just a few readers (readers <<<< disks) that read
really large blocks then you want a small chunk size to keep
all disks busy.

If you have no readers and just writers and you write large
blocks, then you might want a small chunk size too, so that
you can write data+parity over the stripe in one go, bypassing rmw.

Also, 256K or 512K isn't all that big nowadays, there's not much
latency difference between reading 32K or 512K..

>Recreate your array, partition aligned, and manually specify a sane
>chunk size of something like 32KB.  You'll be much happier with real
>workloads.

Aligning is a good idea, and on modern distributions partitions,
LVM lv's etc are generally created with 1MB alignment. But using
a small chunksize like 32K? That depends on the workload, but
in most cases I'd advise against it.

Mike.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: O_DIRECT to md raid 6 is slow
  2012-08-15 22:10       ` Andy Lutomirski
@ 2012-08-15 23:50         ` Stan Hoeppner
  2012-08-16  1:08           ` Andy Lutomirski
  2012-08-16  6:41           ` Roman Mamedov
  0 siblings, 2 replies; 31+ messages in thread
From: Stan Hoeppner @ 2012-08-15 23:50 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: John Robinson, linux-kernel, linux-raid

On 8/15/2012 5:10 PM, Andy Lutomirski wrote:
> On Wed, Aug 15, 2012 at 3:00 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>> On 8/15/2012 12:57 PM, Andy Lutomirski wrote:
>>> On Wed, Aug 15, 2012 at 4:50 AM, John Robinson
>>> <john.robinson@anonymous.org.uk> wrote:
>>>> On 15/08/2012 01:49, Andy Lutomirski wrote:
>>>>>
>>>>> If I do:
>>>>> # dd if=/dev/zero of=/dev/md0p1 bs=8M
>>>>
>>>> [...]
>>>>
>>>>> It looks like md isn't recognizing that I'm writing whole stripes when
>>>>> I'm in O_DIRECT mode.
>>>>
>>>>
>>>> I see your md device is partitioned. Is the partition itself stripe-aligned?
>>>
>>> Crud.
>>>
>>> md0 : active raid6 sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1] sdb1[0]
>>>       11720536064 blocks super 1.2 level 6, 512k chunk, algorithm 2
>>> [6/6] [UUUUUU]
>>>
>>> IIUC this means that I/O should be aligned on 2MB boundaries (512k
>>> chunk * 4 non-parity disks).  gdisk put my partition on a 2048 sector
>>> (i.e. 1MB) boundary.
>>
>> It's time to blow away the array and start over.  You're already
>> misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
>> but for a handful of niche all streaming workloads with little/no
>> rewrite, such as video surveillance or DVR workloads.
>>
>> Yes, 512KB is the md 1.2 default.  And yes, it is insane.  Here's why:
>> Deleting a single file changes only a few bytes of directory metadata.
>> With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
>> modify the directory block in question, calculate parity, then write out
>> 3MB of data to rust.  So you consume 6MB of bandwidth to write less than
>> a dozen bytes.  With a 12 drive RAID6 that's 12MB of bandwidth to modify
>> a few bytes of metadata.  Yes, insane.
> 
> Grr.  I thought the bad old days of filesystem and related defaults
> sucking were over.  

The previous md chunk default of 64KB wasn't horribly bad, though still
maybe a bit high for a lot of common workloads.  I didn't have eyes/ears
on the discussion and/or testing process that led to the 'new' 512KB
default.  Obviously something went horribly wrong here.  512KB isn't a
show stopper as a default for 0/1/10, but is 8-16 times too large for
parity RAID.

> cryptsetup aligns sanely these days, xfs is
> sensible, etc.  

XFS won't align with the 512KB chunk default of metadata 1.2.  The
largest XFS journal stripe unit (su--chunk) is 256KB, and even that
isn't recommended.  Thus mkfs.xfs throws an error due to the 512KB
stripe.  See the md and xfs archives for more details, specifically Dave
Chinner's colorful comments on the md 512KB default.

> wtf?  <rant>Why is there no sensible filesystem for
> huge disks?  zfs can't cp --reflink and has all kinds of source
> availability and licensing issues, xfs can't dedupe at all, and btrfs
> isn't nearly stable enough.</rant>

Deduplication isn't a responsibility of a filesystem.  TTBOMK there are
two, and only two, COW filesystems in existence:  ZFS and BTRFS.  And
these are the only two to offer a native dedupe capability.  They did it
because they could, with COW, not necessarily because they *should*.
There are dozens of other single node, cluster, and distributed
filesystems in use today and none of them support COW, and thus none
support dedup.  So to *expect* a 'sensible' filesystem to include dedupe
is wishful thinking at best.

> Anyhow, I'll try the patch from Wu Fengguang.  There's still a bug here...

Always one somewhere.

-- 
Stan

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: O_DIRECT to md raid 6 is slow
  2012-08-15 23:50         ` Stan Hoeppner
@ 2012-08-16  1:08           ` Andy Lutomirski
  2012-08-16  6:41           ` Roman Mamedov
  1 sibling, 0 replies; 31+ messages in thread
From: Andy Lutomirski @ 2012-08-16  1:08 UTC (permalink / raw)
  To: stan; +Cc: John Robinson, linux-kernel, linux-raid

On Wed, Aug 15, 2012 at 4:50 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> On 8/15/2012 5:10 PM, Andy Lutomirski wrote:
>> On Wed, Aug 15, 2012 at 3:00 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>>> On 8/15/2012 12:57 PM, Andy Lutomirski wrote:
>>>> On Wed, Aug 15, 2012 at 4:50 AM, John Robinson
>>>> <john.robinson@anonymous.org.uk> wrote:
>>>>> On 15/08/2012 01:49, Andy Lutomirski wrote:
>>>>>>
>>>>>> If I do:
>>>>>> # dd if=/dev/zero of=/dev/md0p1 bs=8M
>>>>>
>>>>> [...]
>>
>> Grr.  I thought the bad old days of filesystem and related defaults
>> sucking were over.
>
> The previous md chunk default of 64KB wasn't horribly bad, though still
> maybe a bit high for alot of common workloads.  I didn't have eyes/ears
> on the discussion and/or testing process that led to the 'new' 512KB
> default.  Obviously something went horribly wrong here.  512KB isn't a
> show stopper as a default for 0/1/10, but is 8-16 times too large for
> parity RAID.
>
>> cryptsetup aligns sanely these days, xfs is
>> sensible, etc.
>
> XFS won't align with the 512KB chunk default of metadata 1.2.  The
> largest XFS journal stripe unit (su--chunk) is 256KB, and even that
> isn't recommended.  Thus mkfs.xfs throws an error due to the 512KB
> stripe.  See the md and xfs archives for more details, specifically Dave
> Chinner's colorful comments on the md 512KB default.

Heh -- that's why the math didn't make any sense :)

>
>> wtf?  <rant>Why is there no sensible filesystem for
>> huge disks?  zfs can't cp --reflink and has all kinds of source
>> availability and licensing issues, xfs can't dedupe at all, and btrfs
>> isn't nearly stable enough.</rant>
>
> Deduplication isn't a responsibility of a filesystem.  TTBOMK there are
> two, and only two, COW filesystems in existence:  ZFS and BTRFS.  And
> these are the only two to offer a native dedupe capability.  They did it
> because they could, with COW, not necessarily because they *should*.
> There are dozens of other single node, cluster, and distributed
> filesystems in use today and none of them support COW, and thus none
> support dedup.  So to *expect* a 'sensible' filesystem to include dedupe
> is wishful thinking at best.

I should clarify my rant for the record.  I don't care about in-fs
dedupe.  I want COW so userspace can dedupe and generally replace
hardlinks with sensible cowlinks.  I'm also working on some fun tools
that *require* reflinks for anything resembling decent performance.

--Andy

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: O_DIRECT to md raid 6 is slow
  2012-08-15 23:50         ` Stan Hoeppner
  2012-08-16  1:08           ` Andy Lutomirski
@ 2012-08-16  6:41           ` Roman Mamedov
  1 sibling, 0 replies; 31+ messages in thread
From: Roman Mamedov @ 2012-08-16  6:41 UTC (permalink / raw)
  To: stan; +Cc: Andy Lutomirski, John Robinson, linux-kernel, linux-raid

[-- Attachment #1: Type: text/plain, Size: 648 bytes --]

On Wed, 15 Aug 2012 18:50:44 -0500
Stan Hoeppner <stan@hardwarefreak.com> wrote:

> TTBOMK there are two, and only two, COW filesystems in existence:  ZFS and BTRFS.

There is also NILFS2: http://www.nilfs.org/en/
And in general, any https://en.wikipedia.org/wiki/Log-structured_file_system
is COW by design, but afaik of those only NILFS is also in the mainline Linux
kernel AND is not aimed just for some niche like flash-based devices, but for
general-purpose usage.

-- 
With respect,
Roman

~~~~~~~~~~~~~~~~~~~~~~~~~~~
"Stallman had a printer,
with code he could not see.
So he began to tinker,
and set the software free."

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: O_DIRECT to md raid 6 is slow
  2012-08-15 23:07       ` Miquel van Smoorenburg
@ 2012-08-16 11:05         ` Stan Hoeppner
  2012-08-16 21:50           ` Miquel van Smoorenburg
  0 siblings, 1 reply; 31+ messages in thread
From: Stan Hoeppner @ 2012-08-16 11:05 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: linux-kernel

On 8/15/2012 6:07 PM, Miquel van Smoorenburg wrote:
> In article <xs4all.502C1C01.1040509@hardwarefreak.com> you write:
>> It's time to blow away the array and start over.  You're already
>> misaligned, and a 512KB chunk is insanely unsuitable for parity RAID,
>> but for a handful of niche all streaming workloads with little/no
>> rewrite, such as video surveillance or DVR workloads.
>>
>> Yes, 512KB is the md 1.2 default.  And yes, it is insane.  Here's why:
>> Deleting a single file changes only a few bytes of directory metadata.
>> With your 6 drive md/RAID6 with 512KB chunk, you must read 3MB of data,
>> modify the directory block in question, calculate parity, then write out
>> 3MB of data to rust.  So you consume 6MB of bandwidth to write less than
>> a dozen bytes.  With a 12 drive RAID6 that's 12MB of bandwidth to modify
>> a few bytes of metadata.  Yes, insane.
> 
> Ehrm no. If you modify, say, a 4K block on a RAID5 array, you just have
> to read that 4K block, and the corresponding 4K block on the
> parity drive, recalculate parity, and write back 4K of data and 4K
> of parity. (read|read) modify (write|write). You do not have to
> do I/O in chunksize, ehm, chunks, and you do not have to rmw all disks.

See:  http://www.spinics.net/lists/xfs/msg12627.html

Dave usually knows what he's talking about, and I didn't see Neil nor
anyone else correcting him on his description of md RMW behavior.  What
I stated above is pretty much exactly what Dave stated, but for the fact
I got the RMW read bytes wrong--should be 2MB/3MB for a 6 drive md/RAID6
and 5MB/6MB for 12 drives.

>> Parity RAID sucks in general because of RMW, but it is orders of
>> magnitude worse when one chooses to use an insane chunk size to boot,
>> and especially so with a large drive count.
[snip]
> Also, 256K or 512K isn't all that big nowadays, there's not much
> latency difference between reading 32K or 512K..

You're forgetting 3 very important things:

1.  All filesystems have metadata
2.  All (worth using) filesystems have a metadata journal
3.  All workloads include some, if not major, metadata operations

When writing journal and directory metadata there is a huge difference
between a 32KB and 512KB chunk especially as the drive count in the
array increases.  Rarely does a filesystem pack enough journal
operations into a single writeout to fill a 512KB stripe, let alone a
4MB stripe.  With a 32KB chunk you see full stripe width journal writes
frequently, minimizing the number of RMW writes to the journal, even on
parity arrays with up to 16 data spindles (an 18 drive RAID6).  Using a 512KB chunk
will cause most journal writes to be partial stripe writes, triggering
RMW for most journal writes.  The same is true for directory metadata
writes.
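
A toy comparison, assuming a hypothetical 256KB journal writeout and
the 4 data disks of the array in this thread:

writeout_kb=256                   # hypothetical size of one journal writeout
data_disks=4                      # 6-drive RAID6, 2 parity
for chunk_kb in 32 512; do
    stripe_kb=$(( chunk_kb * data_disks ))
    echo "chunk ${chunk_kb}K: data stripe ${stripe_kb}K," \
         "$(( writeout_kb / stripe_kb )) full-stripe write(s) per writeout"
done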

Everyone knows that parity RAID sucks for anything but purely streaming
workloads with little metadata.  With most/all other workloads, using a
large chunk size, such as the md metadata 1.2 default of 512KB, with
parity RAID, simply makes it much worse, whether the RMW cycle affects
all disks or just one data disk and one parity disk.

>> Recreate your array, partition aligned, and manually specify a sane
>> chunk size of something like 32KB.  You'll be much happier with real
>> workloads.
> 
> Aligning is a good idea, 

Understatement of the century.  Just as critical, if not more so: FS
stripe alignment is mandatory with parity RAID; otherwise even full
stripe writeouts will trigger RMW.

> and on modern distributions partitions,
> LVM lv's etc are generally created with 1MB alignment. But using
> a small chunksize like 32K? That depends on the workload, but
> in most cases I'd advise against it.

People should ignore your advice in this regard.  A small chunk size is
optimal for nearly all workloads on a parity array for the reasons I
stated above.  It's the large chunk that is extremely workload
dependent, as again, it only fits well with low metadata streaming
workloads.

-- 
Stan


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: O_DIRECT to md raid 6 is slow
  2012-08-16 11:05         ` Stan Hoeppner
@ 2012-08-16 21:50           ` Miquel van Smoorenburg
  2012-08-17  7:31             ` Stan Hoeppner
  0 siblings, 1 reply; 31+ messages in thread
From: Miquel van Smoorenburg @ 2012-08-16 21:50 UTC (permalink / raw)
  To: stan; +Cc: linux-kernel

On 16-08-12 1:05 PM, Stan Hoeppner wrote:
> On 8/15/2012 6:07 PM, Miquel van Smoorenburg wrote:
>> Ehrm no. If you modify, say, a 4K block on a RAID5 array, you just have
>> to read that 4K block, and the corresponding 4K block on the
>> parity drive, recalculate parity, and write back 4K of data and 4K
>> of parity. (read|read) modify (write|write). You do not have to
>> do I/O in chunksize, ehm, chunks, and you do not have to rmw all disks.
>
> See:  http://www.spinics.net/lists/xfs/msg12627.html
>
> Dave usually knows what he's talking about, and I didn't see Neil nor
> anyone else correcting him on his description of md RMW behavior.

Well he's wrong, or you're interpreting it incorrectly.

I did a simple test:

* created a 1G partition on 3 separate disks
* created a md raid5 array with 512K chunksize:
   mdadm -C /dev/md0 -l 5 -c $((1024*512)) -n 3 /dev/sdb1 /dev/sdc1 
/dev/sdd1
* ran disk monitoring using 'iostat -k 5 /dev/sdb1 /dev/sdc1 /dev/sdd1'
* wrote a single 4K block:
   dd if=/dev/zero bs=4K count=1 oflag=direct seek=30 of=/dev/md0

Output from iostat over the period in which the 4K write was done. Look 
at kB read and kB written:

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdb1              0.60         0.00         1.60          0          8
sdc1              0.60         0.80         0.80          4          4
sdd1              0.60         0.00         1.60          0          8

As you can see, a single 4K read, and a few writes. You see a few blocks 
more written than you'd expect because the superblock is updated too.
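
For contrast, a sketch: on this 3-disk, 512K-chunk RAID5 a full data
stripe is 2 chunks = 1 MiB, so an aligned 1 MiB O_DIRECT write should
show only writes in iostat, with no reads:

# 1 MiB at a 4 MiB offset, i.e. aligned to a full-stripe boundary
dd if=/dev/zero bs=1M count=1 oflag=direct seek=4 of=/dev/md0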

Mike.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: O_DIRECT to md raid 6 is slow
  2012-08-16 21:50           ` Miquel van Smoorenburg
@ 2012-08-17  7:31             ` Stan Hoeppner
  2012-08-17 11:16               ` Miquel van Smoorenburg
  0 siblings, 1 reply; 31+ messages in thread
From: Stan Hoeppner @ 2012-08-17  7:31 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: linux-kernel

On 8/16/2012 4:50 PM, Miquel van Smoorenburg wrote:
> On 16-08-12 1:05 PM, Stan Hoeppner wrote:
>> On 8/15/2012 6:07 PM, Miquel van Smoorenburg wrote:
>>> Ehrm no. If you modify, say, a 4K block on a RAID5 array, you just have
>>> to read that 4K block, and the corresponding 4K block on the
>>> parity drive, recalculate parity, and write back 4K of data and 4K
>>> of parity. (read|read) modify (write|write). You do not have to
>>> do I/O in chunksize, ehm, chunks, and you do not have to rmw all disks.
>>
>> See:  http://www.spinics.net/lists/xfs/msg12627.html
>>
>> Dave usually knows what he's talking about, and I didn't see Neil nor
>> anyone else correcting him on his description of md RMW behavior.
> 
> Well he's wrong, or you're interpreting it incorrectly.
> 
> I did a simple test:
> 
> * created a 1G partition on 3 seperate disks
> * created a md raid5 array with 512K chunksize:
>   mdadm -C /dev/md0 -l 5 -c $((1024*512)) -n 3 /dev/sdb1 /dev/sdc1
> /dev/sdd1
> * ran disk monitoring using 'iostat -k 5 /dev/sdb1 /dev/sdc1 /dev/sdd1'
> * wrote a single 4K block:
>   dd if=/dev/zero bs=4K count=1 oflag=direct seek=30 of=/dev/md0
> 
> Output from iostat over the period in which the 4K write was done. Look
> at kB read and kB written:
> 
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> sdb1              0.60         0.00         1.60          0          8
> sdc1              0.60         0.80         0.80          4          4
> sdd1              0.60         0.00         1.60          0          8
> 
> As you can see, a single 4K read, and a few writes. You see a few blocks
> more written that you'd expect because the superblock is updated too.

I'm no dd expert, but this looks like you're simply writing a 4KB block
to a new stripe, using an offset, but not to an existing stripe, as the
array is in a virgin state.  So it doesn't appear this test is going to
trigger RMW.  Don't you now need to do another write in the same
stripe to trigger RMW?  Maybe I'm just reading this wrong.

-- 
Stan


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: O_DIRECT to md raid 6 is slow
  2012-08-17  7:31             ` Stan Hoeppner
@ 2012-08-17 11:16               ` Miquel van Smoorenburg
  2012-08-18  5:09                 ` Stan Hoeppner
  0 siblings, 1 reply; 31+ messages in thread
From: Miquel van Smoorenburg @ 2012-08-17 11:16 UTC (permalink / raw)
  To: stan; +Cc: linux-kernel

On 08/17/2012 09:31 AM, Stan Hoeppner wrote:
> On 8/16/2012 4:50 PM, Miquel van Smoorenburg wrote:
>> I did a simple test:
>>
>> * created a 1G partition on 3 seperate disks
>> * created a md raid5 array with 512K chunksize:
>>    mdadm -C /dev/md0 -l 5 -c $((1024*512)) -n 3 /dev/sdb1 /dev/sdc1
>> /dev/sdd1
>> * ran disk monitoring using 'iostat -k 5 /dev/sdb1 /dev/sdc1 /dev/sdd1'
>> * wrote a single 4K block:
>>    dd if=/dev/zero bs=4K count=1 oflag=direct seek=30 of=/dev/md0
>>
>> Output from iostat over the period in which the 4K write was done. Look
>> at kB read and kB written:
>>
>> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
>> sdb1              0.60         0.00         1.60          0          8
>> sdc1              0.60         0.80         0.80          4          4
>> sdd1              0.60         0.00         1.60          0          8
>>
>> As you can see, a single 4K read, and a few writes. You see a few blocks
>> more written that you'd expect because the superblock is updated too.
>
> I'm no dd expert, but this looks like you're simply writing a 4KB block
> to a new stripe, using an offset, but not to an existing stripe, as the
> array is in a virgin state.  So it doesn't appear this test is going to
> trigger RMW.  Don't you need now need to do another write in the same
> stripe to to trigger RMW?  Maybe I'm just reading this wrong.

That shouldn't matter, but that is easily checked of course, by writing 
some random data first, then doing the dd 4K write also with 
random data somewhere in the same area:

# dd if=/dev/urandom bs=1M count=3 of=/dev/md0
3+0 records in
3+0 records out
3145728 bytes (3.1 MB) copied, 0.794494 s, 4.0 MB/s

Now the first 6 chunks are filled with random data, let's write 4K 
somewhere in there:

# dd if=/dev/urandom bs=4k count=1 seek=25 of=/dev/md0
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.10149 s, 40.4 kB/s

Output from iostat over the period in which the 4K write was done:

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdb1              0.60         0.00         1.60          0          8
sdc1              0.60         0.80         0.80          4          4
sdd1              0.60         0.00         1.60          0          8

Mike.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: O_DIRECT to md raid 6 is slow
  2012-08-17 11:16               ` Miquel van Smoorenburg
@ 2012-08-18  5:09                 ` Stan Hoeppner
  2012-08-18 10:08                   ` Michael Tokarev
  0 siblings, 1 reply; 31+ messages in thread
From: Stan Hoeppner @ 2012-08-18  5:09 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: Linux RAID

On 8/17/2012 6:16 AM, Miquel van Smoorenburg wrote:
> On 08/17/2012 09:31 AM, Stan Hoeppner wrote:
>> On 8/16/2012 4:50 PM, Miquel van Smoorenburg wrote:
>>> I did a simple test:
>>>
>>> * created a 1G partition on 3 seperate disks
>>> * created a md raid5 array with 512K chunksize:
>>>    mdadm -C /dev/md0 -l 5 -c $((1024*512)) -n 3 /dev/sdb1 /dev/sdc1
>>> /dev/sdd1
>>> * ran disk monitoring using 'iostat -k 5 /dev/sdb1 /dev/sdc1 /dev/sdd1'
>>> * wrote a single 4K block:
>>>    dd if=/dev/zero bs=4K count=1 oflag=direct seek=30 of=/dev/md0
>>>
>>> Output from iostat over the period in which the 4K write was done. Look
>>> at kB read and kB written:
>>>
>>> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
>>> sdb1              0.60         0.00         1.60          0          8
>>> sdc1              0.60         0.80         0.80          4          4
>>> sdd1              0.60         0.00         1.60          0          8
>>>
>>> As you can see, a single 4K read, and a few writes. You see a few blocks
>>> more written that you'd expect because the superblock is updated too.
>>
>> I'm no dd expert, but this looks like you're simply writing a 4KB block
>> to a new stripe, using an offset, but not to an existing stripe, as the
>> array is in a virgin state.  So it doesn't appear this test is going to
>> trigger RMW.  Don't you need now need to do another write in the same
>> stripe to to trigger RMW?  Maybe I'm just reading this wrong.
> 
> That shouldn't matter, but that is easily checked ofcourse, by writing
> some random random data first, then doing the dd 4K write also with
> random data somewhere in the same area:
> 
> # dd if=/dev/urandom bs=1M count=3 of=/dev/md0
> 3+0 records in
> 3+0 records out
> 3145728 bytes (3.1 MB) copied, 0.794494 s, 4.0 MB/s
> 
> Now the first 6 chunks are filled with random data, let write 4K
> somewhere in there:
> 
> # dd if=/dev/urandom bs=4k count=1 seek=25 of=/dev/md0
> 1+0 records in
> 1+0 records out
> 4096 bytes (4.1 kB) copied, 0.10149 s, 40.4 kB/s
> 
> Output from iostat over the period in which the 4K write was done:
> 
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> sdb1              0.60         0.00         1.60          0          8
> sdc1              0.60         0.80         0.80          4          4
> sdd1              0.60         0.00         1.60          0          8

According to your iostat output, the IO is identical for both tests.  So
either you triggered an RMW in the first test, or you haven't triggered
an RMW with either test.  Your first test shouldn't have triggered RMW.
The second one should have.

BTW, I'm curious why you replied to my message posted to linux-raid,
then stripped linux-raid from the CC list and added lkml proper.  What
was the reason for this?  I'm adding linux-raid and removing lkml.

-- 
Stan


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: O_DIRECT to md raid 6 is slow
  2012-08-18  5:09                 ` Stan Hoeppner
@ 2012-08-18 10:08                   ` Michael Tokarev
  2012-08-19  3:17                     ` Stan Hoeppner
  0 siblings, 1 reply; 31+ messages in thread
From: Michael Tokarev @ 2012-08-18 10:08 UTC (permalink / raw)
  To: stan; +Cc: Miquel van Smoorenburg, Linux RAID

On 18.08.2012 09:09, Stan Hoeppner wrote:
[]
>>>> Output from iostat over the period in which the 4K write was done. Look
>>>> at kB read and kB written:
>>>>
>>>> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
>>>> sdb1              0.60         0.00         1.60          0          8
>>>> sdc1              0.60         0.80         0.80          4          4
>>>> sdd1              0.60         0.00         1.60          0          8
>>>>
>>>> As you can see, a single 4K read, and a few writes. You see a few blocks
>>>> more written that you'd expect because the superblock is updated too.
>>>
>>> I'm no dd expert, but this looks like you're simply writing a 4KB block
>>> to a new stripe, using an offset, but not to an existing stripe, as the
>>> array is in a virgin state.  So it doesn't appear this test is going to
>>> trigger RMW.  Don't you need now need to do another write in the same
>>> stripe to to trigger RMW?  Maybe I'm just reading this wrong.

What is a "new stripe" and "existing stripe" ?  For md raid, all stripes
are equally existing as long as they fall within device boundaries, and
the rest are non-existing (outside of the device).  Unlike for an SSD for
example, there's no distinction between places already written and "fresh",
unwritten areas - all are treated exactly the same way.

>> That shouldn't matter, but that is easily checked ofcourse, by writing
>> some random random data first, then doing the dd 4K write also with
>> random data somewhere in the same area:
>>
>> # dd if=/dev/urandom bs=1M count=3 of=/dev/md0
>> 3+0 records in
>> 3+0 records out
>> 3145728 bytes (3.1 MB) copied, 0.794494 s, 4.0 MB/s
>>
>> Now the first 6 chunks are filled with random data, let write 4K
>> somewhere in there:
>>
>> # dd if=/dev/urandom bs=4k count=1 seek=25 of=/dev/md0
>> 1+0 records in
>> 1+0 records out
>> 4096 bytes (4.1 kB) copied, 0.10149 s, 40.4 kB/s
>>
>> Output from iostat over the period in which the 4K write was done:
>>
>> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
>> sdb1              0.60         0.00         1.60          0          8
>> sdc1              0.60         0.80         0.80          4          4
>> sdd1              0.60         0.00         1.60          0          8
> 
> According to your iostat output, the IO is identical for both tests.  So
> either you triggered an RMW in the first test, or you haven't triggered
> an RMW with either test.  Your fist test shouldn't have triggered RMW.
> The second one should have.

Both tests did exactly the same, since in both cases the I/O requests
were the same, and md treats all (written and yet unwritten) areas the
same.

In this test, there IS an RMW cycle, which is clearly shown.  I'm not sure
why md wrote 8Kb to sdb and sdd, and why it wrote the "extra" 4kb to
sdc.  Maybe it is the metadata/superblock update.  But it clearly read
data from sdc and wrote new data to all drives.  Assuming that all drives
received a 4kb write of metadata and excluding these, we'll have 4
kb written to sdb, 4kb read from sdc and 4kb written to sdd.  Which is
a clear RMW - suppose our new 4kb went to sdb, sdc is a second data disk
for this place and sdd is the parity.  It all works nicely.

Overall, in order to update parity for a small write, there's no need to
read and rewrite whole stripe, only the small read+write is sufficient.

There are, however, 2 variants of RMW possible, and one can be chosen
over the other based on the number of drives, the amount of data being
written and the amount of data available in the cache.  It can either
read the "missing" data blocks to calculate new parity (based on the
new blocks and the read "missing" ones), or it can read the parity
block only, subtract the data being replaced from it (xor is nice for
that), add the new data and write the new parity back.  When you have
an array with a large number of drives and you write only a small
amount, the second approach (reading the old data (which might even be
in cache already!), reading the parity block, subtracting the old data
and adding the new to it, and writing new data + new parity) will be
chosen much more often than reading from all the other components.  I
guess.
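
A toy read-cost comparison of those two variants, for a hypothetical
12-drive RAID5 with a single chunk's worth of data being rewritten:

n=12; k=1                             # drives in the array, chunks rewritten
reconstruct_reads=$(( (n - 1) - k ))  # variant 1: read the untouched data chunks
rmw_reads=$(( k + 1 ))                # variant 2: read old data + old parity
echo "reconstruct-write reads: $reconstruct_reads chunks"
echo "read-modify-write reads: $rmw_reads chunks"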

So.. a large chunk size is actually good, as it allows large I/Os
in one go.  There's a tradeoff of course: the smaller the chunk size
is, the more chances we have to write full stripe without RMW at
all, but at the same time, I/O size becomes very small too, which
is inefficient from the drive point of view.   So there's a balance,
but I guess on a realistic-sized raid5 array (with good number of
drives, like 5), I/O size will likely be less than 256Kb (with
64Kb minimum realistic chunk size and 4 data drives), so expecting
full-stripe writes is not wise (unless it is streaming some large
data, in which case 512Kb chunk size (resulting in 2Mb stripes)
will do just as well).

Also, large chunks may have a negative impact on alignment requirements
(i.e. it might be more difficult to fulfil the requirement), but
this is a different story.

Overall, I think 512Kb is quite a good chunk size, even for a raid5
array.

/mjt

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: O_DIRECT to md raid 6 is slow
  2012-08-18 10:08                   ` Michael Tokarev
@ 2012-08-19  3:17                     ` Stan Hoeppner
  2012-08-19 14:01                       ` David Brown
  2012-08-19 17:02                       ` Chris Murphy
  0 siblings, 2 replies; 31+ messages in thread
From: Stan Hoeppner @ 2012-08-19  3:17 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: Miquel van Smoorenburg, Linux RAID

On 8/18/2012 5:08 AM, Michael Tokarev wrote:
> On 18.08.2012 09:09, Stan Hoeppner wrote:
> []
>>>>> Output from iostat over the period in which the 4K write was done. Look
>>>>> at kB read and kB written:
>>>>>
>>>>> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
>>>>> sdb1              0.60         0.00         1.60          0          8
>>>>> sdc1              0.60         0.80         0.80          4          4
>>>>> sdd1              0.60         0.00         1.60          0          8
>>>>>
>>>>> As you can see, a single 4K read, and a few writes. You see a few blocks
>>>>> more written that you'd expect because the superblock is updated too.
>>>>
>>>> I'm no dd expert, but this looks like you're simply writing a 4KB block
>>>> to a new stripe, using an offset, but not to an existing stripe, as the
>>>> array is in a virgin state.  So it doesn't appear this test is going to
>>>> trigger RMW.  Don't you need now need to do another write in the same
>>>> stripe to to trigger RMW?  Maybe I'm just reading this wrong.
> 
> What is a "new stripe" and "existing stripe" ?  For md raid, all stripes
> are equally existing as long as they fall within device boundaries, and
> the rest are non-existing (outside of the device).  Unlike for an SSD for
> example, there's no distinction between places already written and "fresh",
> unwritten areas - all are treated exactly the same way.
> 
>>> That shouldn't matter, but that is easily checked ofcourse, by writing
>>> some random random data first, then doing the dd 4K write also with
>>> random data somewhere in the same area:
>>>
>>> # dd if=/dev/urandom bs=1M count=3 of=/dev/md0
>>> 3+0 records in
>>> 3+0 records out
>>> 3145728 bytes (3.1 MB) copied, 0.794494 s, 4.0 MB/s
>>>
>>> Now the first 6 chunks are filled with random data, let write 4K
>>> somewhere in there:
>>>
>>> # dd if=/dev/urandom bs=4k count=1 seek=25 of=/dev/md0
>>> 1+0 records in
>>> 1+0 records out
>>> 4096 bytes (4.1 kB) copied, 0.10149 s, 40.4 kB/s
>>>
>>> Output from iostat over the period in which the 4K write was done:
>>>
>>> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
>>> sdb1              0.60         0.00         1.60          0          8
>>> sdc1              0.60         0.80         0.80          4          4
>>> sdd1              0.60         0.00         1.60          0          8
>>
>> According to your iostat output, the IO is identical for both tests.  So
>> either you triggered an RMW in the first test, or you haven't triggered
>> an RMW with either test.  Your fist test shouldn't have triggered RMW.
>> The second one should have.
> 
> Both tests did exactly the same, since in both cases the I/O requests
> were the same, and md treats all (written and yet unwritten) areas the
> same.

Interesting.  So md always performs RMW unless writing a full stripe.
This is worse behavior than I'd assumed, as RMW will occur nearly all of
the time with most workloads.  I'd assumed writes to "virgin" stripes
wouldn't trigger RMW.

> In this test, there IS RMW cycle which is clearly shown.  I'm not sure
> why md wrote 8Kb to sdb and sdd, and why it wrote the "extra" 4kb to
> sdc.  Maybe it is the metadata/superblock update.  But it clearly read
> data from sdc and wrote new data to all drives.  Assuming that all drives
> received a 4kb write of metadata and excluding these, we'll have 4
> kb written to sdb, 4kb read from sdc and 4kb written to sdd.  Which is
> a clear RMW - suppose our new 4kb went to sdb, sdc is a second data disk
> for this place and sdd is the parity.  It all works nicely.

Makes sense.  It's doing RMW in both tests.  It would work much more
nicely if a RMW wasn't required on partial writes to virgin stripes.  I
guess this isn't possible?

> Overall, in order to update parity for a small write, there's no need to
> read and rewrite the whole stripe; a small read+write is sufficient.

I find it interesting that parity for an entire stripe can be
recalculated using only the changed chunk and the existing parity value
as input to the calculation.  I would think the calculation would need
all chunks as input to generate the new parity value.  Then again I was
never a great mathematician.

> There are, however, 2 variants of RMW possible, and one can be chosen
> over another based on the number of drives, the amount of data being written
> and the amount of data available in the cache.  It can either read the
> "missing" data blocks to calculate new parity (based on the new blocks
> and the read "missing" ones), or it can read the parity block only,
> subtract the data being replaced from there (xor is nice for that),
> add the new data and write the new parity back.  When you have an array
> with a large number of drives and you write only a small amount, the second
> approach (reading the old data (which might even be in cache already!),
> reading the parity block, subtracting the old data and adding the new to
> there, and writing the new data + new parity) will be chosen much more often
> than reading from all the other components.  I guess.

If that's the way it actually works, it's obviously better than having
to read all the chunks.

> So.. large chunk size is actually good, as it allows large I/Os
> in one go.  There's a tradeoff of course: the smaller the chunk size
> is, the more chances we have to write a full stripe without RMW at

Which is the way I've always approached striping with parity--smaller
chunks are better so we avoid RMW more often.

> all, but at the same time, I/O size becomes very small too, which
> is inefficient from the drive's point of view.

Most spinning drives these days have 16-64MB of cache and fast onboard
IO ASICs, thus quantity vs size of IOs shouldn't make much difference
unless you're constantly hammering your arrays.  If that's the case
you're very likely not using parity RAID anyway.

> So there's a balance,
> but I guess on a realistic-sized raid5 array (with a good number of
> drives, like 5), I/O size will likely be less than 256Kb (with a
> 64Kb minimum realistic chunk size and 4 data drives), so expecting
> full-stripe writes is not wise (unless it is streaming some large
> data, in which case 512Kb chunk size (resulting in 2Mb stripes)
> will do just as well).
> 
> Also, large chunks may have a negative impact on alignment requirements
> (i.e., it might be more difficult to fulfil the requirement), but
> this is a different story.

Yes, as in the case of XFS journal alignment, where the maximum stripe
unit (chunk) size is 256KB and the recommended size is 32KB.  This is a
100% metadata workload, making full stripe writes difficult even with a
small stripe unit (chunk).  Large chunks simply make it much worse.  And
every modern filesystem uses a journal...

> Overall, I think 512Kb is quite a good chunk size, even for a raid5
> array.

I emphatically disagree.  For the vast majority of workloads, with a
512KB chunk RAID5/6, nearly every write will trigger RMW, and RMW is
what kills parity array performance.  And RMW is *far* more costly than
sending smaller vs larger IOs to the drives.

I recommend against using parity RAID in all cases where the write
workload is nontrivial, or the workload is random write heavy (most
workloads).  But if someone must use RAID5/6 for reason X, I recommend
the smallest chunk size they can get away with to increase the odds for
full stripe writes, decreasing the odds of RMW, and increasing overall
performance.

-- 
Stan


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: O_DIRECT to md raid 6 is slow
  2012-08-19  3:17                     ` Stan Hoeppner
@ 2012-08-19 14:01                       ` David Brown
  2012-08-19 23:34                         ` Stan Hoeppner
  2012-08-19 17:02                       ` Chris Murphy
  1 sibling, 1 reply; 31+ messages in thread
From: David Brown @ 2012-08-19 14:01 UTC (permalink / raw)
  To: stan; +Cc: Michael Tokarev, Miquel van Smoorenburg, Linux RAID

I'm sort of jumping in to this thread, so my apologies if I repeat 
things other people have said already.

On 19/08/12 05:17, Stan Hoeppner wrote:
> On 8/18/2012 5:08 AM, Michael Tokarev wrote:
>> On 18.08.2012 09:09, Stan Hoeppner wrote:
>> []
>>>>>> Output from iostat over the period in which the 4K write was done. Look
>>>>>> at kB read and kB written:
>>>>>>
>>>>>> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
>>>>>> sdb1              0.60         0.00         1.60          0          8
>>>>>> sdc1              0.60         0.80         0.80          4          4
>>>>>> sdd1              0.60         0.00         1.60          0          8
>>>>>>
>>>>>> As you can see, a single 4K read, and a few writes. You see a few blocks
>>>>>> more written than you'd expect because the superblock is updated too.
>>>>>
>>>>> I'm no dd expert, but this looks like you're simply writing a 4KB block
>>>>> to a new stripe, using an offset, but not to an existing stripe, as the
>>>>> array is in a virgin state.  So it doesn't appear this test is going to
>>>>> trigger RMW.  Don't you now need to do another write in the same
>>>>> stripe to trigger RMW?  Maybe I'm just reading this wrong.
>>
>> What is a "new stripe" and "existing stripe" ?  For md raid, all stripes
>> are equally existing as long as they fall within device boundaries, and
>> the rest are non-existing (outside of the device).  Unlike for an SSD for
>> example, there's no distinction between places already written and "fresh",
>> unwritten areas - all are treated exactly the same way.
>>
>>>> That shouldn't matter, but that is easily checked of course, by writing
>>>> some random data first, then doing the dd 4K write also with
>>>> random data somewhere in the same area:
>>>>
>>>> # dd if=/dev/urandom bs=1M count=3 of=/dev/md0
>>>> 3+0 records in
>>>> 3+0 records out
>>>> 3145728 bytes (3.1 MB) copied, 0.794494 s, 4.0 MB/s
>>>>
>>>> Now the first 6 chunks are filled with random data, let's write 4K
>>>> somewhere in there:
>>>>
>>>> # dd if=/dev/urandom bs=4k count=1 seek=25 of=/dev/md0
>>>> 1+0 records in
>>>> 1+0 records out
>>>> 4096 bytes (4.1 kB) copied, 0.10149 s, 40.4 kB/s
>>>>
>>>> Output from iostat over the period in which the 4K write was done:
>>>>
>>>> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
>>>> sdb1              0.60         0.00         1.60          0          8
>>>> sdc1              0.60         0.80         0.80          4          4
>>>> sdd1              0.60         0.00         1.60          0          8
>>>
>>> According to your iostat output, the IO is identical for both tests.  So
>>> either you triggered an RMW in the first test, or you haven't triggered
>>> an RMW with either test.  Your first test shouldn't have triggered RMW.
>>> The second one should have.
>>
>> Both tests did exactly the same, since in both cases the I/O requests
>> were the same, and md treats all (written and yet unwritten) areas the
>> same.
>
> Interesting.  So md always performs RMW unless writing a full stripe.
> This is worse behavior than I'd assumed, as RMW will occur nearly all of
> the time with most workloads.  I'd assumed writes to "virgin" stripes
> wouldn't trigger RMW.
>

You need an RMW to make sure the stripe is consistent - "virgin" or not - 
unless you are re-writing the whole stripe.

AFAIK, there is scope for a few performance optimisations in raid6.  One 
is that for small writes which only need to change one block, raid5 uses 
a "short-cut" RMW cycle (read the old data block, read the old parity 
block, calculate the new parity block, write the new data and parity 
blocks).  A similar short-cut could be implemented in raid6, though it 
is not clear how much of a difference it would really make.

Also, once the bitmap of non-sync regions is implemented (as far as I 
know, it is still on the roadmap), it should be easy to implement a 
short-cut for RMW for non-sync regions by simply replacing the reads 
with zeros.  Of course, that only makes a difference for new arrays - 
once it has been in use for a while, it will all be in sync.


>> In this test, there IS an RMW cycle, which is clearly shown.  I'm not sure
>> why md wrote 8Kb to sdb and sdd, and why it wrote the "extra" 4kb to
>> sdc.  Maybe it is the metadata/superblock update.  But it clearly read
>> data from sdc and wrote new data to all drives.  Assuming that all drives
>> received a 4kb write of metadata and excluding these, we'll have 4
>> kb written to sdb, 4kb read from sdc and 4kb written to sdd.  Which is
>> a clear RMW - suppose our new 4kb went to sdb, sdc is a second data disk
>> for this place and sdd is the parity.  It all works nicely.
>
> Makes sense.  It's doing RMW in both tests.  It would work much more
> nicely if an RMW wasn't required on partial writes to virgin stripes.  I
> guess this isn't possible?
>
>> Overall, in order to update parity for a small write, there's no need to
>> read and rewrite the whole stripe; a small read+write is sufficient.
>
> I find it interesting that parity for an entire stripe can be
> recalculated using only the changed chunk and the existing parity value
> as input to the calculation.  I would think the calculation would need
> all chunks as input to generate the new parity value.  Then again I was
> never a great mathematician.
>

I don't consider myself a "great mathematician", but I /do/ understand 
how the parities are generated, and I can assure you that you can 
calculate the new parities using the old data, the old parities (2 
blocks for raid6), and the new data.  It is already done this way for 
raid5.  For a simple example, consider changing D1 in a 4+1 raid5 array:

 From before, we have:
Pold = D0old ^ D1old ^ D2old ^ D3old

What we want is:
Pnew = D0old ^ D1new ^ D2old ^ D3old

Since the xor is its own inverse, we have :

D0old ^ D2old ^ D3old = Pold ^ D1old

So :

Pnew = Pold ^ D1new ^ D1old

And there is no need to read D0old, D2old or D3old.
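
To make that concrete, here is a small sketch in plain C of the same
identity applied to whole blocks (illustrative only - this is not the md
code, which works on stripe_head pages with the kernel's optimised xor
routines - but the arithmetic is the same):

/*
 * RAID5 parity update for one changed data block:
 *   Pnew = Pold ^ Dold ^ Dnew
 * All buffers are 'len' bytes (e.g. one 4K page).
 */
#include <stddef.h>
#include <stdint.h>

static void raid5_update_parity(uint8_t *parity,
                                const uint8_t *old_data,
                                const uint8_t *new_data,
                                size_t len)
{
        size_t i;

        /* Fold out the old data and fold in the new data, byte by byte. */
        for (i = 0; i < len; i++)
                parity[i] ^= old_data[i] ^ new_data[i];
}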

Theoretically, this could be done for any RMW writes, to reduce the 
number of reads - in practice in md raid it is only done in raid5 for 
single block writes.  Implementing it for more blocks would complicate 
the code for very little benefit in practice.

Currently, there is no such short-cut for raid6 - all modifying writes are done
as RMW.  It is certainly possible to do it - and if anyone wants the 
details of the maths then I am happy to explain it.  But it is quite a 
bit more complicated than in the raid5 case, and you would need three 
reads (old data, and two old parities).


>> There are, however, 2 variants of RMW possible, and one can be chosen
>> over another based on the number of drives, the amount of data being written
>> and the amount of data available in the cache.  It can either read the
>> "missing" data blocks to calculate new parity (based on the new blocks
>> and the read "missing" ones), or it can read the parity block only,
>> subtract the data being replaced from there (xor is nice for that),
>> add the new data and write the new parity back.  When you have an array
>> with a large number of drives and you write only a small amount, the second
>> approach (reading the old data (which might even be in cache already!),
>> reading the parity block, subtracting the old data and adding the new to
>> there, and writing the new data + new parity) will be chosen much more often
>> than reading from all the other components.  I guess.
>
> If that's the way it actually works, it's obviously better than having
> to read all the chunks.
>
>> So.. large chunk size is actually good, as it allows large I/Os
>> in one go.  There's a tradeoff of course: the smaller the chunk size
>> is, the more chances we have to write a full stripe without RMW at
>
> Which is the way I've always approached striping with parity--smaller
> chunks are better so we avoid RMW more often.
>
>> all, but at the same time, I/O size becomes very small too, which
>> is inefficient from the drive's point of view.
>
> Most spinning drives these days have 16-64MB of cache and fast onboard
> IO ASICs, thus quantity vs size of IOs shouldn't make much difference
> unless you're constantly hammering your arrays.  If that's the case
> you're very likely not using parity RAID anyway.
>
>> So there's a balance,
>> but I guess on a realistic-sized raid5 array (with a good number of
>> drives, like 5), I/O size will likely be less than 256Kb (with a
>> 64Kb minimum realistic chunk size and 4 data drives), so expecting
>> full-stripe writes is not wise (unless it is streaming some large
>> data, in which case 512Kb chunk size (resulting in 2Mb stripes)
>> will do just as well).
>>
>> Also, large chunks may have a negative impact on alignment requirements
>> (i.e., it might be more difficult to fulfil the requirement), but
>> this is a different story.
>
> Yes, as in the case of XFS journal alignment, where the maximum stripe
> unit (chunk) size is 256KB and the recommended size is 32KB.  This is a
> 100% metadata workload, making full stripe writes difficult even with a
> small stripe unit (chunk).  Large chunks simply make it much worse.  And
> every modern filesystem uses a journal...
>
>> Overall, I think 512Kb is quite a good chunk size, even for a raid5
>> array.
>
> I emphatically disagree.  For the vast majority of workloads, with a
> 512KB chunk RAID5/6, nearly every write will trigger RMW, and RMW is
> what kills parity array performance.  And RMW is *far* more costly than
> sending smaller vs larger IOs to the drives.
>
> I recommend against using parity RAID in all cases where the write
> workload is nontrivial, or the workload is random write heavy (most
> workloads).  But if someone must use RAID5/6 for reason X, I recommend
> the smallest chunk size they can get away with to increase the odds for
> full stripe writes, decreasing the odds of RMW, and increasing overall
> performance.
>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: O_DIRECT to md raid 6 is slow
  2012-08-19  3:17                     ` Stan Hoeppner
  2012-08-19 14:01                       ` David Brown
@ 2012-08-19 17:02                       ` Chris Murphy
  1 sibling, 0 replies; 31+ messages in thread
From: Chris Murphy @ 2012-08-19 17:02 UTC (permalink / raw)
  To: Linux RAID


On Aug 18, 2012, at 9:17 PM, Stan Hoeppner wrote:
>> 
> 
> Yes, as in the case of XFS journal alignment, where the maximum stripe
> unit (chunk) size is 256KB and the recommended size is 32KB.  This is a
> 100% metadata workload, making full stripe writes difficult even with a
> small stripe unit (chunk).  Large chunks simply make it much worse.  And
> every modern filesystem uses a journal…

I agree that a bigger chunk size is not inherently better. I suspect 512K is selected as the default because it suits most people's storage loads, which aren't spectacularly heavy (either data or metadata). But all the documentation I find on mdadm fairly well hits home that to get the best performance, you have to test.

One small quibble, however, is that the three newest filesystems don't use journals: ZFS, btrfs, ReFS.

> 
>> Overall, I think 512Kb is quite a good chunk size, even for a raid5
>> array.
> 
> I emphatically disagree.  For the vast majority of workloads, with a
> 512KB chunk RAID5/6, nearly every write will trigger RMW, and RMW is
> what kills parity array performance.  And RMW is *far* more costly than
> sending smaller vs larger IOs to the drives.

I thought that default seemed a bit high, but I'll bet you dollars to donuts the vast majority of workloads using default settings for parity RAID are 4+MB files like music and video. I think if you have a really busy mail server with lots of tiny files, then you've got a pretty strong case that 512K across maybe 6 disks is going to lead to a lot of unnecessary RMW, and a lower chunk size will help a lot.


Chris Murphy

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: O_DIRECT to md raid 6 is slow
  2012-08-19 14:01                       ` David Brown
@ 2012-08-19 23:34                         ` Stan Hoeppner
  2012-08-20  0:01                           ` NeilBrown
  2012-08-21 14:51                           ` Miquel van Smoorenburg
  0 siblings, 2 replies; 31+ messages in thread
From: Stan Hoeppner @ 2012-08-19 23:34 UTC (permalink / raw)
  To: David Brown; +Cc: Michael Tokarev, Miquel van Smoorenburg, Linux RAID, LKML

On 8/19/2012 9:01 AM, David Brown wrote:
> I'm sort of jumping in to this thread, so my apologies if I repeat
> things other people have said already.

I'm glad you jumped in David.  You made a critical statement of fact
below which clears some things up.  If you had stated it early on,
before Miquel stole the thread and moved it to LKML proper, it would
have short circuited a lot of this discussion.  Which is:

> AFAIK, there is scope for a few performance optimisations in raid6.  One
> is that for small writes which only need to change one block, raid5 uses
> a "short-cut" RMW cycle (read the old data block, read the old parity
> block, calculate the new parity block, write the new data and parity
> blocks).  A similar short-cut could be implemented in raid6, though it
> is not clear how much of a difference it would really make.

Thus my original statement was correct, or at least half correct[1], as
it pertained to md/RAID6.  Then Miquel switched the discussion to
md/RAID5 and stated I was all wet.  I wasn't, and neither was Dave
Chinner.  I was simply unaware of this md/RAID5 single block write RMW
shortcut.  I'm copying lkml proper on this simply to set the record
straight.  Not that anyone was paying attention, but it needs to be in
the same thread in the archives.  The takeaway:

md/RAID6 must read all devices in a RMW cycle.

md/RAID5 takes a shortcut for single block writes, and must only read
one drive for the RMW cycle.

[1] The only thing that's not clear at this point is if md/RAID6 also
always writes back all chunks during RMW, or only the chunk that has
changed.

-- 
Stan


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: O_DIRECT to md raid 6 is slow
  2012-08-19 23:34                         ` Stan Hoeppner
@ 2012-08-20  0:01                           ` NeilBrown
  2012-08-20  4:44                             ` Stan Hoeppner
  2012-08-20  7:47                             ` David Brown
  2012-08-21 14:51                           ` Miquel van Smoorenburg
  1 sibling, 2 replies; 31+ messages in thread
From: NeilBrown @ 2012-08-20  0:01 UTC (permalink / raw)
  To: stan
  Cc: David Brown, Michael Tokarev, Miquel van Smoorenburg, Linux RAID, LKML

[-- Attachment #1: Type: text/plain, Size: 2541 bytes --]

On Sun, 19 Aug 2012 18:34:28 -0500 Stan Hoeppner <stan@hardwarefreak.com>
wrote:

> On 8/19/2012 9:01 AM, David Brown wrote:
> > I'm sort of jumping in to this thread, so my apologies if I repeat
> > things other people have said already.
> 
> I'm glad you jumped in David.  You made a critical statement of fact
> below which clears some things up.  If you had stated it early on,
> before Miquel stole the thread and moved it to LKML proper, it would
> have short circuited a lot of this discussion.  Which is:
> 
> > AFAIK, there is scope for a few performance optimisations in raid6.  One
> > is that for small writes which only need to change one block, raid5 uses
> > a "short-cut" RMW cycle (read the old data block, read the old parity
> > block, calculate the new parity block, write the new data and parity
> > blocks).  A similar short-cut could be implemented in raid6, though it
> > is not clear how much of a difference it would really make.
> 
> Thus my original statement was correct, or at least half correct[1], as
> it pertained to md/RAID6.  Then Miquel switched the discussion to
> md/RAID5 and stated I was all wet.  I wasn't, and neither was Dave
> Chinner.  I was simply unaware of this md/RAID5 single block write RMW
> shortcut.  I'm copying lkml proper on this simply to set the record
> straight.  Not that anyone was paying attention, but it needs to be in
> the same thread in the archives.  The takeaway:
> 

Since we are trying to set the record straight....

> md/RAID6 must read all devices in a RMW cycle.

md/RAID6 must read all data devices (i.e. not parity devices) which it is not
going to write to, in an RMW cycle (which the code actually calls RCW -
reconstruct-write).

> 
> md/RAID5 takes a shortcut for single block writes, and must only read
> one drive for the RMW cycle.

md/RAID5 uses an alternate mechanism when the number of data blocks that need
to be written is less than half the number of data blocks in a stripe.  In
this alternate mechanism (which the code calls RMW - read-modify-write),
md/RAID5 reads all the blocks that it is about to write to, plus the parity
block.  It then computes the new parity and writes it out along with the new
data.
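
As a rough sketch of that choice (hypothetical names, not the actual
drivers/md/raid5.c logic, which also accounts for blocks already sitting
in the stripe cache), the decision comes down to comparing how many
reads each path would need:

/*
 * Illustrative only.  'data_disks' excludes the parity device;
 * 'to_write' is how many data blocks of the stripe are being modified.
 */
enum update_path { PATH_RMW, PATH_RCW };

static enum update_path choose_update_path(int data_disks, int to_write)
{
        int rmw_reads = to_write + 1;          /* blocks being overwritten + parity */
        int rcw_reads = data_disks - to_write; /* the untouched data blocks */

        return rmw_reads < rcw_reads ? PATH_RMW : PATH_RCW;
}

With to_write + 1 < data_disks - to_write, the read-modify-write path
wins roughly when fewer than half the data blocks are being written,
which matches the rule described above.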

> 
> [1] The only thing that's not clear at this point is if md/RAID6 also
> always writes back all chunks during RMW, or only the chunk that has
> changed.

Do you seriously imagine anyone would write code to write out data which it
is known has not changed?  Sad. :-)

NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: O_DIRECT to md raid 6 is slow
  2012-08-20  0:01                           ` NeilBrown
@ 2012-08-20  4:44                             ` Stan Hoeppner
  2012-08-20  5:19                               ` Dave Chinner
  2012-08-20  7:47                             ` David Brown
  1 sibling, 1 reply; 31+ messages in thread
From: Stan Hoeppner @ 2012-08-20  4:44 UTC (permalink / raw)
  To: NeilBrown
  Cc: David Brown, Michael Tokarev, Miquel van Smoorenburg, Linux RAID,
	LKML, Dave Chinner

I'm copying Dave C. as he apparently misunderstood the behavior of
md/RAID6 as well.  My statement was based largely on Dave's information.
 See [1] below.

On 8/19/2012 7:01 PM, NeilBrown wrote:
> On Sun, 19 Aug 2012 18:34:28 -0500 Stan Hoeppner <stan@hardwarefreak.com>
> wrote:

> Since we are trying to set the record straight....

Thank you for finally jumping in Neil--had hoped to see your
authoritative information sooner.

> md/RAID6 must read all data devices (i.e. not parity devices) which it is not
> going to write to, in an RMW cycle (which the code actually calls RCW -
> reconstruct-write).

> md/RAID5 uses an alternate mechanism when the number of data blocks that need
> to be written is less than half the number of data blocks in a stripe.  In
> this alternate mechanism (which the code calls RMW - read-modify-write),
> md/RAID5 reads all the blocks that it is about to write to, plus the parity
> block.  It then computes the new parity and writes it out along with the new
> data.

>> [1] The only thing that's not clear at this point is if md/RAID6 also
>> always writes back all chunks during RMW, or only the chunk that has
>> changed.

> Do you seriously imagine anyone would write code to write out data which it
> is known has not changed?  Sad. :-)

From a performance standpoint, absolutely not.  Though I wouldn't be
surprised if there are a few parity RAID implementations out there that
do always write a full stripe for other reasons, such as catching media
defects as early as possible, i.e. those occasions where bits in a
sector may read just fine but can't be re-magnetized.  I'm not
championing such an idea, merely stating that others may use this method
for this or other reasons.


[1]
On 6/25/2012 9:30 PM, Dave Chinner wrote:
> You can't, simple as that. The maximum supported is 256k. As it is,
> a default chunk size of 512k is probably harmful to most workloads -
> large chunk sizes mean that just about every write will trigger a
> RMW cycle in the RAID because it is pretty much impossible to issue
> full stripe writes. Writeback doesn't do any alignment of IO (the
> generic page cache writeback path is the problem here), so we will
> almost always be doing unaligned IO to the RAID, and there will be
> little opportunity for sequential IOs to merge and form full stripe
> writes (24 disks @ 512k each on RAID6 is a 11MB full stripe write).
>
> IOWs, every time you do a small isolated write, the MD RAID volume
> will do a RMW cycle, reading 11MB and writing 12MB of data to disk.
> Given that most workloads are not doing lots and lots of large
> sequential writes this is, IMO, a pretty bad default given typical
> RAID5/6 volume configurations we see....


-- 
Stan


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: O_DIRECT to md raid 6 is slow
  2012-08-20  4:44                             ` Stan Hoeppner
@ 2012-08-20  5:19                               ` Dave Chinner
  2012-08-20  5:42                                 ` Stan Hoeppner
  0 siblings, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2012-08-20  5:19 UTC (permalink / raw)
  To: Stan Hoeppner
  Cc: NeilBrown, David Brown, Michael Tokarev, Miquel van Smoorenburg,
	Linux RAID, LKML

On Sun, Aug 19, 2012 at 11:44:25PM -0500, Stan Hoeppner wrote:
> I'm copying Dave C. as he apparently misunderstood the behavior of
> md/RAID6 as well.  My statement was based largely on Dave's information.
>  See [1] below.

Not sure what I'm supposed to have misunderstood...

> On 8/19/2012 7:01 PM, NeilBrown wrote:
> > On Sun, 19 Aug 2012 18:34:28 -0500 Stan Hoeppner <stan@hardwarefreak.com>
> > wrote:
> 
> > Since we are trying to set the record straight....
> 
> Thank you for finally jumping in Neil--had hoped to see your
> authoritative information sooner.
> 
> > md/RAID6 must read all data devices (i.e. not parity devices) which it is not
> > going to write to, in an RMW cycle (which the code actually calls RCW -
> > reconstruct-write).

That's an RMW cycle from an IO point of view, i.e. a synchronous read
must take place before the data can be modified and written...

> > md/RAID5 uses an alternate mechanism when the number of data blocks that need
> > to be written is less than half the number of data blocks in a stripe.  In
> > this alternate mechanism (which the code calls RMW - read-modify-write),
> > md/RAID5 reads all the blocks that it is about to write to, plus the parity
> > block.  It then computes the new parity and writes it out along with the new
> > data.

And by the same definition, that's also an RMW cycle.

> >> [1] The only thing that's not clear at this point is if md/RAID6 also
> >> always writes back all chunks during RMW, or only the chunk that has
> >> changed.
> 
> > Do you seriously imagine anyone would write code to write out data which it
> > is known has not changed?  Sad. :-)

Two words: media scrubbing.

> On 6/25/2012 9:30 PM, Dave Chinner wrote:
> > IOWs, every time you do a small isolated write, the MD RAID volume
> > will do a RMW cycle, reading 11MB and writing 12MB of data to disk.

Oh, you're probably complaining about that write number.  All I was
trying to do was demonstrate what a worst case RMW cycle looks like.
So by the above, that occurs when you have a small isolated write to
each chunk of the stripe. A single write is read 11MB, write 1.5MB
(data + 2 parity). It doesn't really change the IO latency or load,
though, you've still got the same read-all, modify, write-multiple
IO pattern....

> > Given that most workloads are not doing lots and lots of large
> > sequential writes this is, IMO, a pretty bad default given typical
> > RAID5/6 volume configurations we see....

Either way, the point I was making in the original post stands -
RAID6 sucks balls for most workloads as they only do small writes in
comparison to the stripe width of the volume....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: O_DIRECT to md raid 6 is slow
  2012-08-20  5:19                               ` Dave Chinner
@ 2012-08-20  5:42                                 ` Stan Hoeppner
  0 siblings, 0 replies; 31+ messages in thread
From: Stan Hoeppner @ 2012-08-20  5:42 UTC (permalink / raw)
  To: Dave Chinner
  Cc: NeilBrown, David Brown, Michael Tokarev, Miquel van Smoorenburg,
	Linux RAID, LKML

On 8/20/2012 12:19 AM, Dave Chinner wrote:

> Oh, you're probably complaining about that write number.

I wasn't complaining.  You know me better than that.

The cops caught me robbing a bank.  Under pressure, I pointed at you and
said "He gave me the gun and ammo!". ;)  Of course it was completely my
responsibility for shooting myself in the foot with it. :)

> Either way, the point I was making in the original post stands -
> RAID6 sucks balls for most workloads as they only do small writes in
> comparison to the stripe width of the volume....

Agreed.  Which was the point I originally made in this thread.  "We"
simply got a minor detail wrong.  It seems folks on the linux-raid list
like to split every hair 2^6 times, with extremely valid points getting
lost when the discussion drops into the weeds.

Sorry I dragged you into this Dave.

-- 
Stan


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: O_DIRECT to md raid 6 is slow
  2012-08-20  0:01                           ` NeilBrown
  2012-08-20  4:44                             ` Stan Hoeppner
@ 2012-08-20  7:47                             ` David Brown
  1 sibling, 0 replies; 31+ messages in thread
From: David Brown @ 2012-08-20  7:47 UTC (permalink / raw)
  To: NeilBrown; +Cc: stan, Michael Tokarev, Miquel van Smoorenburg, Linux RAID, LKML

On 20/08/2012 02:01, NeilBrown wrote:
> On Sun, 19 Aug 2012 18:34:28 -0500 Stan Hoeppner <stan@hardwarefreak.com>
> wrote:
>
>
> Since we are trying to set the record straight....
>
>> md/RAID6 must read all devices in a RMW cycle.
>
> md/RAID6 must read all data devices (i.e. not parity devices) which it is not
> going to write to, in an RMW cycle (which the code actually calls RCW -
> reconstruct-write).
>
>>
>> md/RAID5 takes a shortcut for single block writes, and must only read
>> one drive for the RMW cycle.
>
> md/RAID5 uses an alternate mechanism when the number of data blocks that need
> to be written is less than half the number of data blocks in a stripe.  In
> this alternate mechanism (which the code calls RMW - read-modify-write),
> md/RAID5 reads all the blocks that it is about to write to, plus the parity
> block.  It then computes the new parity and writes it out along with the new
> data.
>

I've learned something here too - I thought this mechanism was only used 
for a single block write.  Thanks for the correction, Neil.

If you (or anyone else) are ever interested in implementing the same 
thing in raid6, the maths is not actually too bad (now that I've thought 
about it).  (I understand the theory here, but I'm afraid I don't have 
the experience with kernel programming to do the implementation.)

To change a few data blocks, you need to read in the old data blocks 
(Da, Db, etc.) and the old parities (P, Q).

Calculate the xor differences Xa = Da + D'a, Xb = Db + D'b, etc.

The new P parity is P' = P + Xa + Xb +...

The new Q parity is Q' = Q + (g^a).Xa + (g^b).Xb + ...
The power series there is just the normal raid6 Q-parity calculation 
with most entries set to 0, and the Xa, Xb, etc. in the appropriate spots.

If the raid6 Q-parity function already has short-cuts for handling zero 
entries (I haven't looked, but the mechanism might be in place to 
slightly speed up dual-failure recovery), then all the blocks are in place.
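
For anyone who wants the arithmetic spelled out, here is a self-contained
sketch (illustrative only - the kernel's raid6 library uses lookup tables
and SIMD rather than a bit-by-bit multiply, and the names below are made
up).  X is the xor difference between old and new data, and 'slot' is the
block's position in the stripe (the a, b, ... above):

#include <stddef.h>
#include <stdint.h>

/* Multiply in GF(2^8) with the RAID-6 polynomial x^8+x^4+x^3+x^2+1. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
        uint8_t p = 0;

        while (b) {
                if (b & 1)
                        p ^= a;
                a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
                b >>= 1;
        }
        return p;
}

/* g^n for the RAID-6 generator g = 2. */
static uint8_t gf_pow2(unsigned int n)
{
        uint8_t r = 1;

        while (n--)
                r = gf_mul(r, 2);
        return r;
}

/*
 * Fold one changed data block into both parities:
 *   P' = P ^ X,   Q' = Q ^ (g^slot * X),   where X = Dold ^ Dnew
 */
static void raid6_update_pq(uint8_t *p, uint8_t *q,
                            const uint8_t *old_data,
                            const uint8_t *new_data,
                            unsigned int slot, size_t len)
{
        uint8_t coef = gf_pow2(slot);
        size_t i;

        for (i = 0; i < len; i++) {
                uint8_t x = old_data[i] ^ new_data[i];

                p[i] ^= x;
                q[i] ^= gf_mul(coef, x);
        }
}

For a write touching several blocks the same loop just runs once per
changed block, i.e. three reads per stripe (old data, P and Q) instead of
reading all the untouched data blocks.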



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: O_DIRECT to md raid 6 is slow
  2012-08-19 23:34                         ` Stan Hoeppner
  2012-08-20  0:01                           ` NeilBrown
@ 2012-08-21 14:51                           ` Miquel van Smoorenburg
  2012-08-22  3:59                             ` Stan Hoeppner
  1 sibling, 1 reply; 31+ messages in thread
From: Miquel van Smoorenburg @ 2012-08-21 14:51 UTC (permalink / raw)
  To: stan; +Cc: David Brown, Michael Tokarev, Linux RAID, LKML

On 08/20/2012 01:34 AM, Stan Hoeppner wrote:
> I'm glad you jumped in David.  You made a critical statement of fact
> below which clears some things up.  If you had stated it early on,
> before Miquel stole the thread and moved it to LKML proper, it would
> have short circuited a lot of this discussion.  Which is:

I'm sorry about that, that's because of the software that I use to
follow most mailing lists. I didn't notice that the discussion was cc'ed
to both lkml and l-r. I should fix that.

> Thus my original statement was correct, or at least half correct[1], as
> it pertained to md/RAID6.  Then Miquel switched the discussion to
> md/RAID5 and stated I was all wet.  I wasn't, and neither was Dave
> Chinner.  I was simply unaware of this md/RAID5 single block write RMW
> shortcut

Well, all I tried to say is that a small write of, say, 4K, to a 
raid5/raid6 array does not need to re-write the whole stripe (i.e. 
chunksize * nr_disks) but just 4K * nr_disks, or the RMW variant of that.

Mike.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: O_DIRECT to md raid 6 is slow
  2012-08-21 14:51                           ` Miquel van Smoorenburg
@ 2012-08-22  3:59                             ` Stan Hoeppner
  0 siblings, 0 replies; 31+ messages in thread
From: Stan Hoeppner @ 2012-08-22  3:59 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: David Brown, Michael Tokarev, Linux RAID, LKML

On 8/21/2012 9:51 AM, Miquel van Smoorenburg wrote:
> On 08/20/2012 01:34 AM, Stan Hoeppner wrote:
>> I'm glad you jumped in David.  You made a critical statement of fact
>> below which clears some things up.  If you had stated it early on,
>> before Miquel stole the thread and moved it to LKML proper, it would
>> have short circuited a lot of this discussion.  Which is:
> 
> I'm sorry about that, that's because of the software that I use to
> follow most mailinglist. I didn't notice that the discussion was cc'ed
> to both lkml and l-r. I should fix that.

Oh, my bad.  I thought it was intentional.

Don't feel too bad about it.  When I tried to copy lkml back in on the
one message I screwed up as well.  I thought Tbird had filled in the full
address but it didn't.

>> Thus my original statement was correct, or at least half correct[1], as
>> it pertained to md/RAID6.  Then Miquel switched the discussion to
>> md/RAID5 and stated I was all wet.  I wasn't, and neither was Dave
>> Chinner.  I was simply unaware of this md/RAID5 single block write RMW
>> shortcut
> 
> Well, all I tried to say is that a small write of, say, 4K, to a
> raid5/raid6 array does not need to re-write the whole stripe (i.e.
> chunksize * nr_disks) but just 4K * nr_disks, or the RMW variant of that.

And I'm glad you did.  Before that I didn't know about these efficiency
shortcuts and exactly how md does writeback on partial stripe updates.

Even with these optimizations, a default 512KB chunk is too big, for the
reasons I stated, the big one being the fact that you'll rarely fill a
full stripe, meaning nearly every write will incur an RMW cycle.

-- 
Stan

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2012-08-22  3:59 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-08-15  0:49 O_DIRECT to md raid 6 is slow Andy Lutomirski
2012-08-15  1:07 ` kedacomkernel
2012-08-15  1:07   ` kedacomkernel
2012-08-15  1:12   ` Andy Lutomirski
2012-08-15  1:23     ` kedacomkernel
2012-08-15  1:23       ` kedacomkernel
2012-08-15 11:50 ` John Robinson
2012-08-15 17:57   ` Andy Lutomirski
2012-08-15 22:00     ` Stan Hoeppner
2012-08-15 22:10       ` Andy Lutomirski
2012-08-15 23:50         ` Stan Hoeppner
2012-08-16  1:08           ` Andy Lutomirski
2012-08-16  6:41           ` Roman Mamedov
2012-08-15 23:07       ` Miquel van Smoorenburg
2012-08-16 11:05         ` Stan Hoeppner
2012-08-16 21:50           ` Miquel van Smoorenburg
2012-08-17  7:31             ` Stan Hoeppner
2012-08-17 11:16               ` Miquel van Smoorenburg
2012-08-18  5:09                 ` Stan Hoeppner
2012-08-18 10:08                   ` Michael Tokarev
2012-08-19  3:17                     ` Stan Hoeppner
2012-08-19 14:01                       ` David Brown
2012-08-19 23:34                         ` Stan Hoeppner
2012-08-20  0:01                           ` NeilBrown
2012-08-20  4:44                             ` Stan Hoeppner
2012-08-20  5:19                               ` Dave Chinner
2012-08-20  5:42                                 ` Stan Hoeppner
2012-08-20  7:47                             ` David Brown
2012-08-21 14:51                           ` Miquel van Smoorenburg
2012-08-22  3:59                             ` Stan Hoeppner
2012-08-19 17:02                       ` Chris Murphy
