* raid6: rmw writes all the time?
@ 2013-05-23 12:55 Bernd Schubert
  2013-05-23 13:11 ` Chris Mason
  0 siblings, 1 reply; 12+ messages in thread
From: Bernd Schubert @ 2013-05-23 12:55 UTC (permalink / raw)
  To: linux-btrfs

Hi all,

we got a new test system here and I just also tested btrfs raid6 on 
that. Write performance is slightly lower than hw-raid (LSI megasas) and 
md-raid6, but it probably would be much better than any of these two, if 
it wouldn't read all the time during the writes. Is this a known issue? This 
is with linux-3.9.2.

Thanks,
Bernd



* Re: raid6: rmw writes all the time?
  2013-05-23 12:55 raid6: rmw writes all the time? Bernd Schubert
@ 2013-05-23 13:11 ` Chris Mason
  2013-05-23 13:22   ` Bernd Schubert
  0 siblings, 1 reply; 12+ messages in thread
From: Chris Mason @ 2013-05-23 13:11 UTC (permalink / raw)
  To: Bernd Schubert, linux-btrfs

Quoting Bernd Schubert (2013-05-23 08:55:47)
> Hi all,
> 
> we got a new test system here and I just also tested btrfs raid6 on 
> that. Write performance is slightly lower than hw-raid (LSI megasas) and 
> md-raid6, but it probably would be much better than any of these two, if 
> it wouldn't read all the time during the writes. Is this a known issue? This 
> is with linux-3.9.2.

Hi Bernd,

Any time you do a write smaller than a full stripe, we'll have to do a
read/modify/write cycle to satisfy it.  This is true of md raid6 and the
hw-raid as well, but their reads don't show up in vmstat (try iostat
instead).
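
For example, something along these lines while the write test is running
(pass the raid member devices as appropriate) makes those reads visible
per device:

  # with a pure write workload, r/s and rMB/s on the member disks are
  # the read half of the read/modify/write cycles
  iostat -xm 1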

So the bigger question is where are your small writes coming from.  If
they are metadata, you can use raid1 for the metadata.
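
A minimal sketch of that layout, with placeholder devices:

  # data striped as raid6, metadata mirrored as raid1
  mkfs.btrfs -f -d raid6 -m raid1 /dev/sd[a-l]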

-chris



* Re: raid6: rmw writes all the time?
  2013-05-23 13:11 ` Chris Mason
@ 2013-05-23 13:22   ` Bernd Schubert
  2013-05-23 13:34     ` Chris Mason
  2013-05-23 13:41     ` Bob Marley
  0 siblings, 2 replies; 12+ messages in thread
From: Bernd Schubert @ 2013-05-23 13:22 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

On 05/23/2013 03:11 PM, Chris Mason wrote:
> Quoting Bernd Schubert (2013-05-23 08:55:47)
>> Hi all,
>>
>> we got a new test system here and I just also tested btrfs raid6 on
>> that. Write performance is slightly lower than hw-raid (LSI megasas) and
>> md-raid6, but it probably would be much better than any of these two, if
>> it wouldn't read all the time during the writes. Is this a known issue? This
>> is with linux-3.9.2.
>
> Hi Bernd,
>
> Any time you do a write smaller than a full stripe, we'll have to do a
> read/modify/write cycle to satisfy it.  This is true of md raid6 and the
> hw-raid as well, but their reads don't show up in vmstat (try iostat
> instead).

Yeah, I know and I'm using iostat already. md raid6 does not do rmw, but 
does not fill the device queue, afaik it flushes the underlying devices 
quickly as it does not have barrier support - that is another topic, but 
was the reason why I started to test btrfs.

>
> So the bigger question is where are your small writes coming from.  If
> they are metadata, you can use raid1 for the metadata.

I used this command

/tmp/mkfs.btrfs -L test2 -f -d raid6 -m raid10 /dev/sd[m-x]

so meta-data should be raid10. And I'm using this iozone command:


> iozone -e -i0 -i1 -r1m -l 5 -u 5 -s20g -+n \
>         -F /data/fhgfs/storage/md126/testfile1 /data/fhgfs/storage/md126/testfile2 /data/fhgfs/storage/md126/testfile3 \
>            /data/fhgfs/storage/md127/testfile1 /data/fhgfs/storage/md127/testfile2 /data/fhgfs/storage/md127/testfile3


Higher IO sizes (e.g. -r16m) don't make a difference; it goes through 
the page cache anyway.
I'm not familiar with the btrfs code at all, but maybe writepages() 
submits IOs that are too small?

Hrmm, I just wanted to try direct IO, but then noticed that it had 
already gone into RO mode:

> May 23 14:59:33 c8220a kernel: WARNING: at fs/btrfs/super.c:255 __btrfs_abort_transaction+0xdf/0x100 [btrfs]()

> May 23 14:59:33 c8220a kernel: [<ffffffff8105db76>] warn_slowpath_fmt+0x46/0x50
> May 23 14:59:33 c8220a kernel: [<ffffffffa0b5428a>] ? btrfs_free_path+0x2a/0x40 [btrfs]
> May 23 14:59:33 c8220a kernel: [<ffffffffa0b4e18f>] __btrfs_abort_transaction+0xdf/0x100 [btrfs]
> May 23 14:59:33 c8220a kernel: [<ffffffffa0b70b2f>] btrfs_save_ino_cache+0x22f/0x310 [btrfs]
> May 23 14:59:33 c8220a kernel: [<ffffffffa0b793e2>] commit_fs_roots+0xd2/0x1c0 [btrfs]
> May 23 14:59:33 c8220a kernel: [<ffffffff815eb3fe>] ? mutex_lock+0x1e/0x50
> May 23 14:59:33 c8220a kernel: [<ffffffffa0b7a555>] btrfs_commit_transaction+0x495/0xa40 [btrfs]
> May 23 14:59:33 c8220a kernel: [<ffffffffa0b7af7b>] ? start_transaction+0xab/0x4d0 [btrfs]
> May 23 14:59:33 c8220a kernel: [<ffffffff81082f30>] ? wake_up_bit+0x40/0x40
> May 23 14:59:33 c8220a kernel: [<ffffffffa0b72b96>] transaction_kthread+0x1a6/0x220 [btrfs]

> May 23 14:59:33 c8220a kernel: ---[ end trace 3d91874abeab5984 ]---
> May 23 14:59:33 c8220a kernel: BTRFS error (device sdx) in btrfs_save_ino_cache:471: error 28
> May 23 14:59:33 c8220a kernel: btrfs is forced readonly
> May 23 14:59:33 c8220a kernel: BTRFS warning (device sdx): Skipping commit of aborted transaction.
> May 23 14:59:33 c8220a kernel: BTRFS error (device sdx) in cleanup_transaction:1455: error 28

errno 28 - ENOSPC, out of disk space?

Going to recreate it and will play with it later on again.


Thanks,
Bernd


* Re: raid6: rmw writes all the time?
  2013-05-23 13:22   ` Bernd Schubert
@ 2013-05-23 13:34     ` Chris Mason
  2013-05-23 19:33       ` Bernd Schubert
  2013-05-23 13:41     ` Bob Marley
  1 sibling, 1 reply; 12+ messages in thread
From: Chris Mason @ 2013-05-23 13:34 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: linux-btrfs

Quoting Bernd Schubert (2013-05-23 09:22:41)
> On 05/23/2013 03:11 PM, Chris Mason wrote:
> > Quoting Bernd Schubert (2013-05-23 08:55:47)
> >> Hi all,
> >>
> >> we got a new test system here and I just also tested btrfs raid6 on
> >> that. Write performance is slightly lower than hw-raid (LSI megasas) and
> >> md-raid6, but it probably would be much better than any of these two, if
> >> it wouldn't read all the time during the writes. Is this a known issue? This 
> >> is with linux-3.9.2.
> >
> > Hi Bernd,
> >
> > Any time you do a write smaller than a full stripe, we'll have to do a
> > read/modify/write cycle to satisfy it.  This is true of md raid6 and the
> > hw-raid as well, but their reads don't show up in vmstat (try iostat
> > instead).
> 
> Yeah, I know and I'm using iostat already. md raid6 does not do rmw, but 
> does not fill the device queue, afaik it flushes the underlying devices 
> quickly as it does not have barrier support - that is another topic, but 
> was the reason why I started to test btrfs.

md should support barriers with recent kernels.  You might want to
verify with blktrace that md raid6 isn't doing r/m/w.
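
Something like this on one of the member disks while the write test runs
should make it obvious (with a pure write workload, any read traffic on
the members comes from r/m/w):

  # trace only read events on one raid member, live
  blktrace -a read -d /dev/sdm -o - | blkparse -i -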

> 
> >
> > So the bigger question is where are your small writes coming from.  If
> > they are metadata, you can use raid1 for the metadata.
> 
> I used this command
> 
> /tmp/mkfs.btrfs -L test2 -f -d raid6 -m raid10 /dev/sd[m-x]

Ok, the stripe size is 64KB, so you want to do IO in multiples of 64KB
times the number of devices on the FS.  If you have 13 devices, that's
832K.

Using buffered writes makes it much more likely the VM will break up the
IOs as they go down.  The btrfs writepages code does try to do full
stripe IO, and it also caches stripes as the IO goes down.  But for
buffered IO it is surprisingly hard to get a 100% hit rate on full
stripe IO at larger stripe sizes.

> 
> so meta-data should be raid10. And I'm using this iozone command:
> 
> 
> > iozone -e -i0 -i1 -r1m -l 5 -u 5 -s20g -+n \
> >         -F /data/fhgfs/storage/md126/testfile1 /data/fhgfs/storage/md126/testfile2 /data/fhgfs/storage/md126/testfile3 \
> >            /data/fhgfs/storage/md127/testfile1 /data/fhgfs/storage/md127/testfile2 /data/fhgfs/storage/md127/testfile3
> 
> 
> Higher IO sizes (e.g. -r16m) don't make a difference, it goes through 
> the page cache anyway.
> I'm not familiar with btrfs code at all, but maybe writepages() submits 
> too small IOs?
> 
> Hrmm, just wanted to try direct IO, but then just noticed it went into 
> RO mode before already:

Direct IO will make it easier to get full stripe writes.  I thought I
had fixed this abort, but it is just running out of space to write the
inode cache.  For now, please just don't mount with the inode cache
enabled, I'll send in a fix for the next rc.
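
As far as I know the inode cache is only enabled when you explicitly pass
-o inode_cache, so leaving that out of the mount options should be enough.
For a quick direct IO run, iozone can open the files with O_DIRECT itself,
roughly like this (path and sizes are placeholders):

  iozone -I -e -i0 -i1 -r1m -s20g -+n -f /mnt/btrfs/testfile1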

-chris


* Re: raid6: rmw writes all the time?
  2013-05-23 13:22   ` Bernd Schubert
  2013-05-23 13:34     ` Chris Mason
@ 2013-05-23 13:41     ` Bob Marley
  2013-05-23 16:30       ` Bernd Schubert
  1 sibling, 1 reply; 12+ messages in thread
From: Bob Marley @ 2013-05-23 13:41 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: Chris Mason, linux-btrfs

On 23/05/2013 15:22, Bernd Schubert wrote:
>
> Yeah, I know and I'm using iostat already. md raid6 does not do rmw, 
> but does not fill the device queue, afaik it flushes the underlying 
> devices quickly as it does not have barrier support - that is another 
> topic, but was the reason why I started to test btrfs.

MD raid6 DOES have barrier support!



* Re: raid6: rmw writes all the time?
  2013-05-23 13:41     ` Bob Marley
@ 2013-05-23 16:30       ` Bernd Schubert
  0 siblings, 0 replies; 12+ messages in thread
From: Bernd Schubert @ 2013-05-23 16:30 UTC (permalink / raw)
  To: Bob Marley; +Cc: linux-btrfs

On 05/23/2013 03:41 PM, Bob Marley wrote:
> On 23/05/2013 15:22, Bernd Schubert wrote:
>>
>> Yeah, I know and I'm using iostat already. md raid6 does not do rmw,
>> but does not fill the device queue, afaik it flushes the underlying
>> devices quickly as it does not have barrier support - that is another
>> topic, but was the reason why I started to test btrfs.
> 
> MD raid6 DOES have barrier support!
> 

For the underlying devices yes, but it does not make further use of it
for additional buffering.


* Re: raid6: rmw writes all the time?
  2013-05-23 13:34     ` Chris Mason
@ 2013-05-23 19:33       ` Bernd Schubert
  2013-05-23 19:37         ` Chris Mason
  0 siblings, 1 reply; 12+ messages in thread
From: Bernd Schubert @ 2013-05-23 19:33 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

On 05/23/2013 03:34 PM, Chris Mason wrote:
> Quoting Bernd Schubert (2013-05-23 09:22:41)
>> On 05/23/2013 03:11 PM, Chris Mason wrote:
>>> Quoting Bernd Schubert (2013-05-23 08:55:47)
>>>> Hi all,
>>>>
>>>> we got a new test system here and I just also tested btrfs raid6 on
>>>> that. Write performance is slightly lower than hw-raid (LSI megasas) and
>>>> md-raid6, but it probably would be much better than any of these two, if
>>>> it wouldn't read all the time during the writes. Is this a known issue? This
>>>> is with linux-3.9.2.
>>>
>>> Hi Bernd,
>>>
>>> Any time you do a write smaller than a full stripe, we'll have to do a
>>> read/modify/write cycle to satisfy it.  This is true of md raid6 and the
>>> hw-raid as well, but their reads don't show up in vmstat (try iostat
>>> instead).
>>
>> Yeah, I know and I'm using iostat already. md raid6 does not do rmw, but 
>> does not fill the device queue, afaik it flushes the underlying devices 
>> quickly as it does not have barrier support - that is another topic, but 
>> was the reason why I started to test btrfs.
> 
> md should support barriers with recent kernels.  You might want to
> verify with blktrace that md raid6 isn't doing r/m/w.
> 
>>
>>>
>>> So the bigger question is where are your small writes coming from.  If
>>> they are metadata, you can use raid1 for the metadata.
>>
>> I used this command
>>
>> /tmp/mkfs.btrfs -L test2 -f -d raid6 -m raid10 /dev/sd[m-x]
> 
> Ok, the stripe size is 64KB, so you want to do IO in multiples of 64KB
> times the number of devices on the FS.  If you have 13 devices, that's
> 832K.

Actually I have 12 devices, but we have to subtract 2 parity disks. In
the mean time I also patched btrfsprogs to use a chunksize of 256K. So
that should be 2560kiB now if I found the right places.
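
Just to spell the arithmetic out (my numbers, as a sanity check):

  # full data stripe = (devices - parity) * chunk size
  echo $(( (12 - 2) * 256 ))    # -> 2560 (KiB)
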
Btw, any chance to generally use chunksize/chunklen instead of stripe,
as the md layer does? IMHO it is less confusing to use
n-datadisks * chunksize = stripesize.

> 
> Using buffered writes makes it much more likely the VM will break up the
> IOs as they go down.  The btrfs writepages code does try to do full
> stripe IO, and it also caches stripes as the IO goes down.  But for
> buffered IO it is surprisingly hard to get a 100% hit rate on full
> stripe IO at larger stripe sizes.

I have not found that part yet; somehow it looks as if writepages
submits single pages to another layer. I'm going to look into it
again during the weekend. I can reserve the hardware that long, but I
think we first need to fix striped writes in general.

> 
>>
>> so meta-data should be raid10. And I'm using this iozone command:
>>
>>
>>> iozone -e -i0 -i1 -r1m -l 5 -u 5 -s20g -+n \
>>>         -F /data/fhgfs/storage/md126/testfile1 /data/fhgfs/storage/md126/testfile2 /data/fhgfs/storage/md126/testfile3 \
>>>            /data/fhgfs/storage/md127/testfile1 /data/fhgfs/storage/md127/testfile2 /data/fhgfs/storage/md127/testfile3
>>
>>
>> Higher IO sizes (e.g. -r16m) don't make a difference, it goes through 
>> the page cache anyway.
>> I'm not familiar with btrfs code at all, but maybe writepages() submits 
>> too small IOs?
>>
>> Hrmm, just wanted to try direct IO, but then just noticed it went into 
>> RO mode before already:
> 
> Direct IO will make it easier to get full stripe writes.  I thought I
> had fixed this abort, but it is just running out of space to write the
> inode cache.  For now, please just don't mount with the inode cache
> enabled, I'll send in a fix for the next rc.

Thanks, I already noticed and disabled the inode cache.

Direct-io works as expected and without any RMW cycles. And that
provides more than 40% better performance than the Megasas controller or
buffered MD writes (I didn't compare with direct-io MD, as that is very
slow).


Cheers,
Bernd



* Re: raid6: rmw writes all the time?
  2013-05-23 19:33       ` Bernd Schubert
@ 2013-05-23 19:37         ` Chris Mason
  2013-05-23 19:45           ` Bernd Schubert
  0 siblings, 1 reply; 12+ messages in thread
From: Chris Mason @ 2013-05-23 19:37 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: linux-btrfs

Quoting Bernd Schubert (2013-05-23 15:33:24)
> On 05/23/2013 03:34 PM, Chris Mason wrote:
> > Quoting Bernd Schubert (2013-05-23 09:22:41)
> >> On 05/23/2013 03:11 PM, Chris Mason wrote:
> >>> Quoting Bernd Schubert (2013-05-23 08:55:47)
> >>>> Hi all,
> >>>>
> >>>> we got a new test system here and I just also tested btrfs raid6 on
> >>>> that. Write performance is slightly lower than hw-raid (LSI megasas) and
> >>>> md-raid6, but it probably would be much better than any of these two, if
> >>>> it wouldn't read all the time during the writes. Is this a known issue? This
> >>>> is with linux-3.9.2.
> >>>
> >>> Hi Bernd,
> >>>
> >>> Any time you do a write smaller than a full stripe, we'll have to do a
> >>> read/modify/write cycle to satisfy it.  This is true of md raid6 and the
> >>> hw-raid as well, but their reads don't show up in vmstat (try iostat
> >>> instead).
> >>
> >> Yeah, I know and I'm using iostat already. md raid6 does not do rmw, but 
> >> does not fill the device queue, afaik it flushes the underlying devices 
> >> quickly as it does not have barrier support - that is another topic, but 
> >> was the reason why I started to test btrfs.
> > 
> > md should support barriers with recent kernels.  You might want to
> > verify with blktrace that md raid6 isn't doing r/m/w.
> > 
> >>
> >>>
> >>> So the bigger question is where are your small writes coming from.  If
> >>> they are metadata, you can use raid1 for the metadata.
> >>
> >> I used this command
> >>
> >> /tmp/mkfs.btrfs -L test2 -f -d raid6 -m raid10 /dev/sd[m-x]
> > 
> > Ok, the stripe size is 64KB, so you want to do IO in multiples of 64KB
> > times the number of devices on the FS.  If you have 13 devices, that's
> > 832K.
> 
> Actually I have 12 devices, but we have to subtract 2 parity disks. In
> the mean time I also patched btrfsprogs to use a chunksize of 256K. So
> that should be 2560kiB now if I found the right places.

Sorry, thanks for filling in for my pre-coffee email.

> Btw, any chance to generally use chunksize/chunklen instead of stripe,
> such as the md layer does it? IMHO it is less confusing to use
> n-datadisks * chunksize = stripesize.

Definitely, it will become much more configurable.

> 
> > 
> > Using buffered writes makes it much more likely the VM will break up the
> > IOs as they go down.  The btrfs writepages code does try to do full
> > stripe IO, and it also caches stripes as the IO goes down.  But for
> > buffered IO it is surprisingly hard to get a 100% hit rate on full
> > stripe IO at larger stripe sizes.
> 
> I have not found that part yet, somehow it looks like as if writepages
> would submit single pages to another layer. I'm going to look into it
> again during the weekend. I can reserve the hardware that long, but I
> think we first need to fix striped writes in general.

The VM calls writepages and btrfs tries to suck down all the pages that
belong to the same extent.  And we try to allocate the extents on
boundaries.  There is definitely some bleeding into rmw when I do it
here, but overall it does well.

But I was using 8 drives.  I'll try with 12.

> 
> > 
> >>
> >> so meta-data should be raid10. And I'm using this iozone command:
> >>
> >>
> >>> iozone -e -i0 -i1 -r1m -l 5 -u 5 -s20g -+n \
> >>>         -F /data/fhgfs/storage/md126/testfile1 /data/fhgfs/storage/md126/testfile2 /data/fhgfs/storage/md126/testfile3 \
> >>>            /data/fhgfs/storage/md127/testfile1 /data/fhgfs/storage/md127/testfile2 /data/fhgfs/storage/md127/testfile3
> >>
> >>
> >> Higher IO sizes (e.g. -r16m) don't make a difference, it goes through 
> >> the page cache anyway.
> >> I'm not familiar with btrfs code at all, but maybe writepages() submits 
> >> too small IOs?
> >>
> >> Hrmm, just wanted to try direct IO, but then just noticed it went into 
> >> RO mode before already:
> > 
> > Direct IO will make it easier to get full stripe writes.  I thought I
> > had fixed this abort, but it is just running out of space to write the
> > inode cache.  For now, please just don't mount with the inode cache
> > enabled, I'll send in a fix for the next rc.
> 
> Thanks, I already noticed and disabled the inode cache.
> 
> Direct-io works as expected and without any RMW cycles. And that
> provides more than 40% better performance than the Megasas controller or
> buffered MD writes (I didn't compare with direct-io MD, as that is very
> slow).

You can improve MD performance quite a lot by increasing the size of the
stripe cache.
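
For reference, that is this knob (md device name is a placeholder; the
value is the number of cached stripes per array, one 4K page per member
device each, default 256):

  echo 8192 > /sys/block/md126/md/stripe_cache_size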

-chris



* Re: raid6: rmw writes all the time?
  2013-05-23 19:37         ` Chris Mason
@ 2013-05-23 19:45           ` Bernd Schubert
  2013-05-23 20:33             ` Chris Mason
  0 siblings, 1 reply; 12+ messages in thread
From: Bernd Schubert @ 2013-05-23 19:45 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

On 05/23/2013 09:37 PM, Chris Mason wrote:
> Quoting Bernd Schubert (2013-05-23 15:33:24)
>> Btw, any chance to generally use chunksize/chunklen instead of stripe,
>> such as the md layer does it? IMHO it is less confusing to use
>> n-datadisks * chunksize = stripesize.
> 
> Definitely, it will become much more configurable.

Actually I meant in the code. I'm going to write a patch during the weekend.

> 
>>
>>>
>>> Using buffered writes makes it much more likely the VM will break up the
>>> IOs as they go down.  The btrfs writepages code does try to do full
>>> stripe IO, and it also caches stripes as the IO goes down.  But for
>>> buffered IO it is surprisingly hard to get a 100% hit rate on full
>>> stripe IO at larger stripe sizes.
>>
>> I have not found that part yet, somehow it looks like as if writepages
>> would submit single pages to another layer. I'm going to look into it
>> again during the weekend. I can reserve the hardware that long, but I
>> think we first need to fix striped writes in general.
> 
> The VM calls writepages and btrfs tries to suck down all the pages that
> belong to the same extent.  And we try to allocate the extents on
> boundaries.  There is definitely some bleeding into rmw when I do it
> here, but overall it does well.
> 
> But I was using 8 drives.  I'll try with 12.

Hmm, I already tried with 10 drives (8+2), doesn't make a difference for
RMW.

> 
>> Direct-io works as expected and without any RMW cycles. And that
>> provides more than 40% better performance than the Megasas controller or
>> buffered MD writes (I didn't compare with direct-io MD, as that is very
>> slow).
> 
> You can improve MD performance quite a lot by increasing the size of the
> stripe cache.

I'm already doing that; without a higher stripe cache the performance is
much lower.



* Re: raid6: rmw writes all the time?
  2013-05-23 19:45           ` Bernd Schubert
@ 2013-05-23 20:33             ` Chris Mason
  2013-05-24  8:35               ` Bernd Schubert
  0 siblings, 1 reply; 12+ messages in thread
From: Chris Mason @ 2013-05-23 20:33 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: linux-btrfs

Quoting Bernd Schubert (2013-05-23 15:45:36)
> On 05/23/2013 09:37 PM, Chris Mason wrote:
> > Quoting Bernd Schubert (2013-05-23 15:33:24)
> >> Btw, any chance to generally use chunksize/chunklen instead of stripe,
> >> such as the md layer does it? IMHO it is less confusing to use
> >> n-datadisks * chunksize = stripesize.
> > 
> > Definitely, it will become much more configurable.
> 
> Actually I meant in the code. I'm going to write a patch during the weekend.

The btrfs raid code refers to stripes because a chunk is a very large
(~1GB) slice of a set of drives that we allocate into raid levels. 

We have full stripes and device stripes; I'm afraid there are so many
different terms in other projects that it is hard to pick something clear.

> 
> > 
> >>
> >>>
> >>> Using buffered writes makes it much more likely the VM will break up the
> >>> IOs as they go down.  The btrfs writepages code does try to do full
> >>> stripe IO, and it also caches stripes as the IO goes down.  But for
> >>> buffered IO it is surprisingly hard to get a 100% hit rate on full
> >>> stripe IO at larger stripe sizes.
> >>
> >> I have not found that part yet, somehow it looks like as if writepages
> >> would submit single pages to another layer. I'm going to look into it
> >> again during the weekend. I can reserve the hardware that long, but I
> >> think we first need to fix striped writes in general.
> > 
> > The VM calls writepages and btrfs tries to suck down all the pages that
> > belong to the same extent.  And we try to allocate the extents on
> > boundaries.  There is definitely some bleeding into rmw when I do it
> > here, but overall it does well.
> > 
> > But I was using 8 drives.  I'll try with 12.
> 
> Hmm, I already tried with 10 drives (8+2), doesn't make a difference for
> RMW.

My benchmarks were on flash, so the rmw I was seeing may not have had as
big an impact.

-chris



* Re: raid6: rmw writes all the time?
  2013-05-23 20:33             ` Chris Mason
@ 2013-05-24  8:35               ` Bernd Schubert
  2013-05-24 12:40                 ` Chris Mason
  0 siblings, 1 reply; 12+ messages in thread
From: Bernd Schubert @ 2013-05-24  8:35 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

Hello Chris,

On 05/23/2013 10:33 PM, Chris Mason wrote:
> But I was using 8 drives.  I'll try with 12.
>>
> My benchmarks were on flash, so the rmw I was seeing may not have had as
> big an impact.


I just played with it further and simply introduced a requeue in 
raid56_rmw_stripe() if the rbio is 'younger' than 50 jiffies. I can 
still see reads, but by a factor of 10 lower than before. And this is 
sufficient to bring performance almost to that of direct-io.
This is certainly not upstream-ready code; I hope I find some time over 
the weekend to come up with something better.

Btw, I also noticed that the cache logic copies pages from those 
rmw-threads. Well, this is a NUMA system and memory bandwidth is terribly 
bad from the remote CPU. These worker threads should probably be NUMA 
aware and only handle rbios from their own CPU.
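
For completeness, this is roughly how I look at the topology and pin a
test run to one node to separate out the remote-node effect (paths are
placeholders):

  numactl --hardware                # node layout and distances
  numactl --cpunodebind=0 --membind=0 \
      dd if=/dev/zero of=/mnt/btrfs/testfile bs=1M count=4096 oflag=direct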


Cheers,
Bernd


* Re: raid6: rmw writes all the time?
  2013-05-24  8:35               ` Bernd Schubert
@ 2013-05-24 12:40                 ` Chris Mason
  0 siblings, 0 replies; 12+ messages in thread
From: Chris Mason @ 2013-05-24 12:40 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: linux-btrfs

Quoting Bernd Schubert (2013-05-24 04:35:37)
> Hello Chris,
> 
> On 05/23/2013 10:33 PM, Chris Mason wrote:
> > But I was using 8 drives.  I'll try with 12.
> >>
> > My benchmarks were on flash, so the rmw I was seeing may not have had as
> > big an impact.
> 
> 
> I just further played with it and simply introduced a requeue in 
> raid56_rmw_stripe() if the rbio is 'younger' than 50 jiffies. I can 
> still see reads, but by a factor 10 lower than before. And this is 
> sufficient to bring performance almost to that of direct-io.
> This is certainly no upstream code, I hope I find some time over the 
> weekend to come up with something better.

Interesting.  This probably shows that we need to do a better job of
maintaining a plug across the writepages calls, or that we need to be
much more aggressive in writepages to add more pages once we've started.

> 
> Btw, I also noticed the cache logic copies pages from those rmw-threads. 
> Well, this is a numa system and memory bandwidth is terribly bad from the 
> remote cpu. These worker threads should probably be numa aware and only 
> handle rbios from their own cpu.

Yes, all of the helpers (especially crc and parity) should be made NUMA aware.

-chris



