* Question about RAID1 plug/unplug code
@ 2014-09-08 13:55 Alexander Lyakas
  2014-09-09  1:45 ` NeilBrown
  0 siblings, 1 reply; 9+ messages in thread
From: Alexander Lyakas @ 2014-09-08 13:55 UTC (permalink / raw)
  To: linux-raid, Neil Brown

Hi Neil,
We have been seeing high latency on the md/raid1 block device, due to
the fact that all WRITEs are handed off to the raid1d thread. This thread
also calls bitmap_unplug(), which writes the bitmap synchronously.
While it waits for the bitmap, it cannot trigger other WRITEs waiting
in its pending_bio_list. This is especially seen with SSDs: MD's
latency is much higher than SSD latency (I have been stoned by Peter
Grandi when I brought up this issue previously for raid5).

Then I have noticed the commit:

commit f54a9d0e59c4bea3db733921ca9147612a6f292c
Author: NeilBrown <neilb@suse.de>
Date:   Thu Aug 2 08:33:20 2012 +1000

    md/raid1: submit IO from originating thread instead of md thread.

Looking at the code, I learned that to avoid switching into raid1d,
the caller has to use blk_start_plug/blk_finish_plug. So I added these
calls in our kernel module, which submits bios to MD. Results were
awesome: MD latency went down significantly.
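
For illustration, a minimal sketch of what our module now does (the
wrapper name is ours, not from raid1.c, and this assumes the 3.8-era
two-argument submit_bio()):

        #include <linux/blkdev.h>
        #include <linux/bio.h>

        /* Submit one bio inside a plug window, so that raid1's
         * make_request queues it on this thread's plug callback list
         * instead of conf->pending_bio_list (which only raid1d drains). */
        static void my_submit_bio_plugged(int rw, struct bio *bio)
        {
            struct blk_plug plug;

            blk_start_plug(&plug);
            submit_bio(rw, bio);
            /* blk_finish_plug() runs raid1_unplug() in this thread, so
             * bitmap_unplug() and generic_make_request() happen here,
             * not in raid1d. */
            blk_finish_plug(&plug);
        }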

So I have several questions about this plug/unplug thing.

1/ Originally this infrastructure was supposed to help IO schedulers
in merging requests. It is useful when one has a bunch of requests to
submit in one shot.
But in the MD case, this infrastructure is used for a different purpose:
not to merge requests (which may help bandwidth, but probably not
latency), but to avoid making raid1d a bottleneck, to be able to
submit requests from multiple threads in parallel, which brings down
latency significantly in our case. Indeed "struct blk_plug" has a
special "cb_list", which is used only by MD.
In my case I have only individual bios (not a bunch of bios), and
after I wrap them with plug/unplug, MD latency gets better. So we are
using the plug infrastructure for a different purpose.
Is my understanding correct? Was this your intention?

2/ Now that md/raid1 submits WRITEs from several threads in parallel,
is there any issue you can think of? Like for example, multiple
threads now call bitmap_unplug() in parallel. Is this alright?

Thanks!
Alex.

* Re: Question about RAID1 plug/unplug code
  2014-09-08 13:55 Question about RAID1 plug/unplug code Alexander Lyakas
@ 2014-09-09  1:45 ` NeilBrown
  2014-09-09  8:33   ` Alexander Lyakas
  0 siblings, 1 reply; 9+ messages in thread
From: NeilBrown @ 2014-09-09  1:45 UTC (permalink / raw)
  To: Alexander Lyakas; +Cc: linux-raid

On Mon, 8 Sep 2014 16:55:52 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:

> Hi Neil,
> We have been seeing high latency on the md/raid1 block device, due to
> the fact that all WRITEs are handed off to the raid1d thread. This thread
> also calls bitmap_unplug(), which writes the bitmap synchronously.
> While it waits for the bitmap, it cannot trigger other WRITEs waiting
> in its pending_bio_list. This is especially seen with SSDs: MD's
> latency is much higher than SSD latency (I have been stoned by Peter
> Grandi when I brought up this issue previously for raid5).
> 
> Then I have noticed the commit:
> 
> commit f54a9d0e59c4bea3db733921ca9147612a6f292c
> Author: NeilBrown <neilb@suse.de>
> Date:   Thu Aug 2 08:33:20 2012 +1000
> 
>     md/raid1: submit IO from originating thread instead of md thread.
> 
> Looking at the code, I learned that to avoid switching into raid1d,
> the caller has to use blk_start_plug/blk_finish_plug. So I added these
> calls in our kernel module, which submits bios to MD. Results were
> awesome: MD latency went down significantly.

That's good to hear.

> 
> So I have several questions about this plug/unplug thing.
> 
> 1/ Originally this infrastructure was supposed to help IO schedulers
> in merging requests. It is useful when one has a bunch of requests to
> submit in one shot.

That is exactly the whole point of plugging:  allow the device to handle a
batch of requests together instead of one at a time.

> But in the MD case, this infrastructure is used for a different purpose:
> not to merge requests (which may help bandwidth, but probably not
> latency), but to avoid making raid1d a bottleneck, to be able to
> submit requests from multiple threads in parallel, which brings down
> latency significantly in our case. Indeed "struct blk_plug" has a
> special "cb_list", which is used only by MD.

I don't think the way md uses plugging is conceptually different from any
other use: it is always about gathering a batch together.
"cb_list" is handled by blk_check_plugged() which is also used by
block/umem.c and btrfs.

The base plugging code assumes that it is only gathering a batch of requests
for a single device - if the target device changes then the batch is flushed.
It also assumed that it was "struct request" that was batched.
For devices like md that want to queue 'struct bio', something else was needed.
Also with layered devices it can be useful to gather multiple batches for
multiple layers.
So I created "cb_list" etc and a more generic interface.
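
The hook itself is small; roughly, from that era's include/linux/blkdev.h
(paraphrased, with my comments):

        struct blk_plug_cb;
        typedef void (*blk_plug_cb_fn)(struct blk_plug_cb *, bool);

        struct blk_plug_cb {
            struct list_head list;   /* linked on plug->cb_list */
            blk_plug_cb_fn callback; /* e.g. raid1_unplug */
            void *data;              /* e.g. the mddev */
        };

        /* Return the (unplug, data) entry already on the current
         * process's plug, or allocate one of 'size' bytes and add it;
         * NULL if the process is not plugged.  A driver embeds
         * blk_plug_cb in a larger struct (as raid1_plug_cb does) and
         * recovers it with container_of(). */
        struct blk_plug_cb *blk_check_plugged(blk_plug_cb_fn unplug,
                                              void *data, int size);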

> In my case I have only individual bios (not a bunch of bios), and
> after I wrap them with plug/unplug, MD latency gets better. So we are
> using the plug infrastructure for a different purpose.
> Is my understanding correct? Was this your intention?

I don't really understand what you are doing.  There is no point in using
plugging for individual bios.  The main point for raid1 writes is to gather
a lot of writes together so that multiple bitmap bits can be set all at
once.
It should be possible to submit individual bios directly from make_request
without passing them to raid1d and without using plugging.  You might end up
making more bitmap updates, but you might not.


> 
> 2/ Now that md/raid1 submits WRITEs from several threads in parallel,
> is there any issue you can think of? Like for example, multiple
> threads now call bitmap_unplug() in parallel. Is this alright?

Hmmm... there could be an issue there.  It is possible that some callers of
bitmap_unplug won't block when they should.  bitmap_unplug should probably
wait unconditionally.

NeilBrown


> 
> Thanks!
> Alex.



* Re: Question about RAID1 plug/unplug code
  2014-09-09  1:45 ` NeilBrown
@ 2014-09-09  8:33   ` Alexander Lyakas
  2014-09-09  9:45     ` NeilBrown
  0 siblings, 1 reply; 9+ messages in thread
From: Alexander Lyakas @ 2014-09-09  8:33 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, Yair Hershko

Hi Neil,


On Tue, Sep 9, 2014 at 4:45 AM, NeilBrown <neilb@suse.de> wrote:
> On Mon, 8 Sep 2014 16:55:52 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> wrote:
>
>> Hi Neil,
>> We have been seeing high latency on the md/raid1 block device, due to
>> the fact that all WRITEs are handed off to the raid1d thread. This thread
>> also calls bitmap_unplug(), which writes the bitmap synchronously.
>> While it waits for the bitmap, it cannot trigger other WRITEs waiting
>> in its pending_bio_list. This is especially seen with SSDs: MD's
>> latency is much higher than SSD latency (I have been stoned by Peter
>> Grandi when I brought up this issue previously for raid5).
>>
>> Then I have noticed the commit:
>>
>> commit f54a9d0e59c4bea3db733921ca9147612a6f292c
>> Author: NeilBrown <neilb@suse.de>
>> Date:   Thu Aug 2 08:33:20 2012 +1000
>>
>>     md/raid1: submit IO from originating thread instead of md thread.
>>
>> Looking at the code, I learned that to avoid switching into raid1d,
>> the caller has to use blk_start_plug/blk_finish_plug. So I added these
>> calls in our kernel module, which submits bios to MD. Results were
>> awesome: MD latency went down significantly.
>
> That's good to hear.
>
>>
>> So I have several questions about this plug/unplug thing.
>>
>> 1/ Originally this infrastructure was supposed to help IO schedulers
>> in merging requests. It is useful when one has a bunch of requests to
>> submit in one shot.
>
> That is exactly the whole point of plugging:  allow the device to handle a
> batch of requests together instead of one at a time.
>
>> But in the MD case, this infrastructure is used for a different purpose:
>> not to merge requests (which may help bandwidth, but probably not
>> latency), but to avoid making raid1d a bottleneck, to be able to
>> submit requests from multiple threads in parallel, which brings down
>> latency significantly in our case. Indeed "struct blk_plug" has a
>> special "cb_list", which is used only by MD.
>
> I don't think the way md uses plugging is conceptually different from any
> other use: it is always about gathering a batch together.
> "cb_list" is handled by blk_check_plugged() which is also used by
> block/umem.c and btrfs.
>
> The base plugging code assumes that it is only gathering a batch of requests
> for a single device - if the target device changes then the batch is flushed.
> It also assumed that it was "struct request" that was batched.
> For devices like md that want to queue 'struct bio', something else was needed.
> Also with layered devices it can be useful to gather multiple batches for
> multiple layers.
> So I created "cb_list" etc and a more generic interface.
>
>> In my case I have only individual bios (not a bunch of bios), and
>> after I wrap them with plug/unplug, MD latency gets better. So we are
>> using the plug infrastructure for a different purpose.
>> Is my understanding correct? Was this your intention?
>
> I don't really understand what you are doing.  There is no point in using
> plugging for individual bios.  The main point for raid1 writes is to gather
> a lot of writes together so that multiple bitmap bits can be set all at
> once.
> It should be possible to submit individual bios directly from make_request
> without passing them to raid1d and without using plugging.
Can you please explain how it is possible?
You have this code for WRITEs:
        cb = blk_check_plugged(raid1_unplug, mddev, sizeof(*plug));
        if (cb)
            plug = container_of(cb, struct raid1_plug_cb, cb);
        else
            plug = NULL;
        spin_lock_irqsave(&conf->device_lock, flags);
        if (plug) {
            bio_list_add(&plug->pending, mbio);
            plug->pending_cnt++;
        } else {
            bio_list_add(&conf->pending_bio_list, mbio);
            conf->pending_count++;
        }
        spin_unlock_irqrestore(&conf->device_lock, flags);

If blk_check_plugged returns NULL, then you always hand the
WRITE to raid1d. So the only option to avoid handoff to raid1d is for
the caller to plug. Otherwise, all WRITEs are handed off to raid1d and
latency becomes terrible.
So in my case, I use plug/unplug for individual bios only to avoid the
handoff to raid1d.
What am I missing in this analysis?


> You might end up
> making more bitmap updates, but you might not.
>
>
>>
>> 2/ Now that md/raid1 submits WRITEs from several threads in parallel,
>> is there any issue you can think of? Like for example, multiple
>> threads now call bitmap_unplug() in parallel. Is this alright?
>
> Hmmm... there could be an issue there.  It is possible that some callers of
> bitmap_unplug won't block when they should.  bitmap_unplug should probably
> wait unconditionally.
Ok, so you say there is a bug with bitmap updates when a WRITE is
submitted directly from make_request. Is this something we should fix?

Alex.



>
> NeilBrown
>
>
>>
>> Thanks!
>> Alex.
>

* Re: Question about RAID1 plug/unplug code
  2014-09-09  8:33   ` Alexander Lyakas
@ 2014-09-09  9:45     ` NeilBrown
  2014-09-10  8:01       ` Alexander Lyakas
  0 siblings, 1 reply; 9+ messages in thread
From: NeilBrown @ 2014-09-09  9:45 UTC (permalink / raw)
  To: Alexander Lyakas; +Cc: linux-raid, Yair Hershko

On Tue, 9 Sep 2014 11:33:13 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:

> Hi Neil,
> 
> 
> On Tue, Sep 9, 2014 at 4:45 AM, NeilBrown <neilb@suse.de> wrote:
> > On Mon, 8 Sep 2014 16:55:52 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> > wrote:
> >
> >> Hi Neil,
> >> We have been seeing high latency on the md/raid1 block device, due to
> >> the fact that all WRITEs are handed off to the raid1d thread. This thread
> >> also calls bitmap_unplug(), which writes the bitmap synchronously.
> >> While it waits for the bitmap, it cannot trigger other WRITEs waiting
> >> in its pending_bio_list. This is especially seen with SSDs: MD's
> >> latency is much higher than SSD latency (I have been stoned by Peter
> >> Grandi when I brought up this issue previously for raid5).
> >>
> >> Then I have noticed the commit:
> >>
> >> commit f54a9d0e59c4bea3db733921ca9147612a6f292c
> >> Author: NeilBrown <neilb@suse.de>
> >> Date:   Thu Aug 2 08:33:20 2012 +1000
> >>
> >>     md/raid1: submit IO from originating thread instead of md thread.
> >>
> >> Looking at the code, I learned that to avoid switching into raid1d,
> >> the caller has to use blk_start_plug/blk_finish_plug. So I added these
> >> calls in our kernel module, which submits bios to MD. Results were
> >> awesome: MD latency went down significantly.
> >
> > That's good to hear.
> >
> >>
> >> So I have several questions about this plug/unplug thing.
> >>
> >> 1/ Originally this infrastructure was supposed to help IO schedulers
> >> in merging requests. It is useful when one has a bunch of requests to
> >> submit in one shot.
> >
> > That is exactly the whole point of plugging:  allow the device to handle a
> > batch of requests together instead of one at a time.
> >
> >> But in the MD case, this infrastructure is used for a different purpose:
> >> not to merge requests (which may help bandwidth, but probably not
> >> latency), but to avoid making raid1d a bottleneck, to be able to
> >> submit requests from multiple threads in parallel, which brings down
> >> latency significantly in our case. Indeed "struct blk_plug" has a
> >> special "cb_list", which is used only by MD.
> >
> > I don't think the way md uses plugging is conceptually different from any
> > other use: it is always about gathering a batch together.
> > "cb_list" is handled by blk_check_plugged() which is also used by
> > block/umem.c and btrfs.
> >
> > The base plugging code assumes that it is only gathering a batch of requests
> > for a single device - if the target device changes then the batch is flushed.
> > It also assumed that it was "struct request" that was batched.
> > For devices like md that want to queue 'struct bio', something else was needed.
> > Also with layered devices it can be useful to gather multiple batches for
> > multiple layers.
> > So I created "cb_list" etc and a more generic interface.
> >
> >> In my case I have only individual bios (not a bunch of bios), and
> >> after I wrap them with plug/unplug, MD latency gets better. So we are
> >> using the plug infrastructure for a different purpose.
> >> Is my understanding correct? Was this your intention?
> >
> > I don't really understand what you are doing.  There is no point in using
> > plugging for individual bios.  The main point for raid1 writes is to gather
> > a lot of writes together so that multiple bitmap bits can be set all at
> > once.
> > It should be possible to submit individual bios directly from make_request
> > without passing them to raid1d and without using plugging.
> Can you please explain how it is possible?
> You have this code for WRITEs:
>         cb = blk_check_plugged(raid1_unplug, mddev, sizeof(*plug));
>         if (cb)
>             plug = container_of(cb, struct raid1_plug_cb, cb);
>         else
>             plug = NULL;
>         spin_lock_irqsave(&conf->device_lock, flags);
>         if (plug) {
>             bio_list_add(&plug->pending, mbio);
>             plug->pending_cnt++;
>         } else {
>             bio_list_add(&conf->pending_bio_list, mbio);
>             conf->pending_count++;
>         }
>         spin_unlock_irqrestore(&conf->device_lock, flags);
> 
> If blk_check_plugged returns NULL, then you always hand the
> WRITE to raid1d. So the only option to avoid handoff to raid1d is for
> the caller to plug. Otherwise, all WRITEs are handed off to raid1d and
> latency becomes terrible.
> So in my case, I use plug/unplug for individual bios only to avoid the
> handoff to raid1d.
> What am I missing in this analysis?

if blk_check_plugged succeeds then it has arranged for raid1_unplug to be
called a little later by that same process.
So there is nothing to stop you calling raid1_unplug immediately.

raid1_unplug essentially does:
  bitmap_unplug()
  generic_make_request()

so you can very nearly just do that, without any plugging.
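
In outline (paraphrasing the current raid1.c, not a verbatim copy; the
from_schedule/current->bio_list branch is the subtlety I mention below):

        static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
        {
            struct raid1_plug_cb *plug =
                container_of(cb, struct raid1_plug_cb, cb);
            struct mddev *mddev = plug->cb.data;
            struct r1conf *conf = mddev->private;
            struct bio *bio;

            if (from_schedule || current->bio_list) {
                /* not safe to submit from this context, so hand
                 * the batch to raid1d after all */
                spin_lock_irq(&conf->device_lock);
                bio_list_merge(&conf->pending_bio_list, &plug->pending);
                conf->pending_count += plug->pending_cnt;
                spin_unlock_irq(&conf->device_lock);
                md_wakeup_thread(mddev->thread);
                kfree(plug);
                return;
            }

            /* normal process context: write the bitmap synchronously,
             * then submit the queued WRITEs from this thread */
            bio = bio_list_get(&plug->pending);
            bitmap_unplug(mddev->bitmap);
            wake_up(&conf->wait_barrier);

            while (bio) {
                struct bio *next = bio->bi_next;
                bio->bi_next = NULL;
                generic_make_request(bio);
                bio = next;
            }
            kfree(plug);
        }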

There is a bit of extra subtlety but I can't really know how relevant that
might be to you without actually seeing your code.

NeilBrown

* Re: Question about RAID1 plug/unplug code
  2014-09-09  9:45     ` NeilBrown
@ 2014-09-10  8:01       ` Alexander Lyakas
  2014-09-10  9:36         ` NeilBrown
  0 siblings, 1 reply; 9+ messages in thread
From: Alexander Lyakas @ 2014-09-10  8:01 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, Yair Hershko

Hello Neil,

On Tue, Sep 9, 2014 at 12:45 PM, NeilBrown <neilb@suse.de> wrote:
> On Tue, 9 Sep 2014 11:33:13 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> wrote:
>
>> Hi Neil,
>>
>>
>> On Tue, Sep 9, 2014 at 4:45 AM, NeilBrown <neilb@suse.de> wrote:
>> > On Mon, 8 Sep 2014 16:55:52 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
>> > wrote:
>> >
>> >> Hi Neil,
>> >> We have been seeing high latency on the md/raid1 block device, due to
>> >> the fact that all WRITEs are handed off to the raid1d thread. This thread
>> >> also calls bitmap_unplug(), which writes the bitmap synchronously.
>> >> While it waits for the bitmap, it cannot trigger other WRITEs waiting
>> >> in its pending_bio_list. This is especially seen with SSDs: MD's
>> >> latency is much higher than SSD latency (I have been stoned by Peter
>> >> Grandi when I brought up this issue previously for raid5).
>> >>
>> >> Then I have noticed the commit:
>> >>
>> >> commit f54a9d0e59c4bea3db733921ca9147612a6f292c
>> >> Author: NeilBrown <neilb@suse.de>
>> >> Date:   Thu Aug 2 08:33:20 2012 +1000
>> >>
>> >>     md/raid1: submit IO from originating thread instead of md thread.
>> >>
>> >> Looking at the code, I learned that to avoid switching into raid1d,
>> >> the caller has to use blk_start_plug/blk_finish_plug. So I added these
>> >> calls in our kernel module, which submits bios to MD. Results were
>> >> awesome: MD latency went down significantly.
>> >
>> > That's good to hear.
>> >
>> >>
>> >> So I have several questions about this plug/unplug thing.
>> >>
>> >> 1/ Originally this infrastructure was supposed to help IO schedulers
>> >> in merging requests. It is useful when one has a bunch of requests to
>> >> submit in one shot.
>> >
>> > That is exactly the whole point of plugging:  allow the device to handle a
>> > batch of requests together instead of one at a time.
>> >
>> >> But in the MD case, this infrastructure is used for a different purpose:
>> >> not to merge requests (which may help bandwidth, but probably not
>> >> latency), but to avoid making raid1d a bottleneck, to be able to
>> >> submit requests from multiple threads in parallel, which brings down
>> >> latency significantly in our case. Indeed "struct blk_plug" has a
>> >> special "cb_list", which is used only by MD.
>> >
>> > I don't think the way md uses plugging is conceptually different from any
>> > other use: it is always about gathering a batch together.
>> > "cb_list" is handled by blk_check_plugged() which is also used by
>> > block/umem.c and btrfs.
>> >
>> > The base plugging code assumes that it is only gathering a batch of requests
>> > for a single device - if the target device changes then the batch is flushed.
>> > It also assumed that it was "struct request" that was batched.
>> > For devices like md that want to queue 'struct bio', something else was needed.
>> > Also with layered devices it can be useful to gather multiple batches for
>> > multiple layers.
>> > So I created "cb_list" etc and a more generic interface.
>> >
>> >> In my case I have only individual bios (not a bunch of bios), and
>> >> after I wrap them with plug/unplug, MD latency gets better. So we are
>> >> using the plug infrastructure for a different purpose.
>> >> Is my understanding correct? Was this your intention?
>> >
>> > I don't really understand what you are doing.  There is no point in using
>> > plugging for individual bios.  The main point for raid1 writes is to gather
>> > a lot of writes together so that multiple bitmap bits can be set all at
>> > once.
>> > It should be possible to submit individual bios directly from make_request
>> > without passing them to raid1d and without using plugging.
>> Can you please explain how it is possible?
>> You have this code for WRITEs:
>>         cb = blk_check_plugged(raid1_unplug, mddev, sizeof(*plug));
>>         if (cb)
>>             plug = container_of(cb, struct raid1_plug_cb, cb);
>>         else
>>             plug = NULL;
>>         spin_lock_irqsave(&conf->device_lock, flags);
>>         if (plug) {
>>             bio_list_add(&plug->pending, mbio);
>>             plug->pending_cnt++;
>>         } else {
>>             bio_list_add(&conf->pending_bio_list, mbio);
>>             conf->pending_count++;
>>         }
>>         spin_unlock_irqrestore(&conf->device_lock, flags);
>>
>> If blk_check_plugged returns NULL, then you always hand the
>> WRITE to raid1d. So the only option to avoid handoff to raid1d is for
>> the caller to plug. Otherwise, all WRITEs are handed off to raid1d and
>> latency becomes terrible.
>> So in my case, I use plug/unplug for individual bios only to avoid the
>> handoff to raid1d.
>> What am I missing in this analysis?
>
> if blk_check_plugged succeeds then it has arranged for raid1_unplug to be
> called a little later by that same process.
> So there is nothing to stop you calling raid1_unplug immediately.
>
> raid1_unplug essentially does:
>   bitmap_unplug()
>   generic_make_request()
>
> so you can very nearly just do that, without any plugging.
I am sorry, but I did not understand your reply. Maybe I did not
explain myself; I will try again.

I am not changing raid1.c code. I just want to avoid the handoff to
raid1d on WRITEs. According to your code, there are only two possible
flows:

Flow 1 - with plugging
# caller calls blk_start_plug
# caller calls submit_bio
# blk_check_plugged succeeds, and bio is put onto plug->pending list
# caller calls blk_finish_plug
# raid1_unplug is called in the same caller's thread, so it does
bitmap_unplug and generic_make_request

Flow 2 - without plugging
# caller calls submit_bio
# blk_check_plugged fails, and bio is put onto conf->pending_bio_list,
which means it will be submitted by raid1d

My conclusion from that: to avoid the handoff to raid1d, the caller always
needs to plug, even if it has a single bio to submit. But you said "it
should be possible to submit individual bios directly from
make_request without passing them to raid1d and without using
plugging". So can you explain how it is possible? I prefer not to
change raid1.c code.

>
> There is a bit of extra subtlety but I can't really know how relevant that
> might be to you without actually seeing your code.
My code (in a different kernel module, not in raid1.c) is simply doing
submit_bio. I want to wrap this with plug/unplug to avoid the handoff
to raid1d and improve raid1 latency.

Thanks,
Alex.


>
> NeilBrown

* Re: Question about RAID1 plug/unplug code
  2014-09-10  8:01       ` Alexander Lyakas
@ 2014-09-10  9:36         ` NeilBrown
  2014-09-11  8:22           ` Alexander Lyakas
  0 siblings, 1 reply; 9+ messages in thread
From: NeilBrown @ 2014-09-10  9:36 UTC (permalink / raw)
  To: Alexander Lyakas; +Cc: linux-raid, Yair Hershko

On Wed, 10 Sep 2014 11:01:30 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:

> Hello Neil,
> 
> On Tue, Sep 9, 2014 at 12:45 PM, NeilBrown <neilb@suse.de> wrote:
> > On Tue, 9 Sep 2014 11:33:13 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> > wrote:
> >
> >> Hi Neil,
> >>
> >>
> >> On Tue, Sep 9, 2014 at 4:45 AM, NeilBrown <neilb@suse.de> wrote:
> >> > On Mon, 8 Sep 2014 16:55:52 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> >> > wrote:
> >> >
> >> >> Hi Neil,
> >> >> We have been seeing high latency on the md/raid1 block device, due to
> >> >> the fact that all WRITEs are handed off to the raid1d thread. This thread
> >> >> also calls bitmap_unplug(), which writes the bitmap synchronously.
> >> >> While it waits for the bitmap, it cannot trigger other WRITEs waiting
> >> >> in its pending_bio_list. This is especially seen with SSDs: MD's
> >> >> latency is much higher than SSD latency (I have been stoned by Peter
> >> >> Grandi when I brought up this issue previously for raid5).
> >> >>
> >> >> Then I have noticed the commit:
> >> >>
> >> >> commit f54a9d0e59c4bea3db733921ca9147612a6f292c
> >> >> Author: NeilBrown <neilb@suse.de>
> >> >> Date:   Thu Aug 2 08:33:20 2012 +1000
> >> >>
> >> >>     md/raid1: submit IO from originating thread instead of md thread.
> >> >>
> >> >> Looking at the code, I learned that to avoid switching into raid1d,
> >> >> the caller has to use blk_start_plug/blk_finish_plug. So I added these
> >> >> calls in our kernel module, which submits bios to MD. Results were
> >> >> awesome: MD latency went down significantly.
> >> >
> >> > That's good to hear.
> >> >
> >> >>
> >> >> So I have several questions about this plug/unplug thing.
> >> >>
> >> >> 1/ Originally this infrastructure was supposed to help IO schedulers
> >> >> in merging requests. It is useful when one has a bunch of requests to
> >> >> submit in one shot.
> >> >
> >> > That is exactly the whole point of plugging:  allow the device to handle a
> >> > batch of requests together instead of one at a time.
> >> >
> >> >> But in the MD case, this infrastructure is used for a different purpose:
> >> >> not to merge requests (which may help bandwidth, but probably not
> >> >> latency), but to avoid making raid1d a bottleneck, to be able to
> >> >> submit requests from multiple threads in parallel, which brings down
> >> >> latency significantly in our case. Indeed "struct blk_plug" has a
> >> >> special "cb_list", which is used only by MD.
> >> >
> >> > I don't think the way md uses plugging is conceptually different from any
> >> > other use: it is always about gathering a batch together.
> >> > "cb_list" is handled by blk_check_plugged() which is also used by
> >> > block/umem.c and btrfs.
> >> >
> >> > The base plugging code assumes that it is only gathering a batch of requests
> >> > for a single device - if the target device changes then the batch is flushed.
> >> > It also assumed that it was "struct request" that was batched.
> >> > For devices like md that want to queue 'struct bio', something else was needed.
> >> > Also with layered devices it can be useful to gather multiple batches for
> >> > multiple layers.
> >> > So I created "cb_list" etc and a more generic interface.
> >> >
> >> >> In my case I have only individual bios (not a bunch of bios), and
> >> >> after I wrap them with plug/unplug, MD latency gets better. So we are
> >> >> using the plug infrastructure for a different purpose.
> >> >> Is my understanding correct? Was this your intention?
> >> >
> >> > I don't really understand what you are doing.  There is no point in using
> >> > plugging for individual bios.  The main point for raid1 writes is to gather
> >> > a lot of writes together so that multiple bitmap bits can be set all at
> >> > once.
> >> > It should be possible to submit individual bios directly from make_request
> >> > without passing them to raid1d and without using plugging.
> >> Can you please explain how it is possible?
> >> You have this code for WRITEs:
> >>         cb = blk_check_plugged(raid1_unplug, mddev, sizeof(*plug));
> >>         if (cb)
> >>             plug = container_of(cb, struct raid1_plug_cb, cb);
> >>         else
> >>             plug = NULL;
> >>         spin_lock_irqsave(&conf->device_lock, flags);
> >>         if (plug) {
> >>             bio_list_add(&plug->pending, mbio);
> >>             plug->pending_cnt++;
> >>         } else {
> >>             bio_list_add(&conf->pending_bio_list, mbio);
> >>             conf->pending_count++;
> >>         }
> >>         spin_unlock_irqrestore(&conf->device_lock, flags);
> >>
> >> If blk_check_plugged returns NULL, then you always hand the
> >> WRITE to raid1d. So the only option to avoid handoff to raid1d is for
> >> the caller to plug. Otherwise, all WRITEs are handed off to raid1d and
> >> latency becomes terrible.
> >> So in my case, I use plug/unplug for individual bios only to avoid the
> >> handoff to raid1d.
> >> What am I missing in this analysis?
> >
> > if blk_check_plugged succeeds then it has arranged for raid1_unplug to be
> > called a little later by that same process.
> > So there is nothing to stop you calling raid1_unplug immediately.
> >
> > raid1_unplug essentially does:
> >   bitmap_unplug()
> >   generic_make_request()
> >
> > so you can very nearly just do that, without any plugging.
> I am sorry, but I did not understand your reply. Maybe I did not
> explain myself; I will try again.
> 
> I am not changing raid1.c code. I just want to avoid the handoff to
> raid1d on WRITEs. According to your code, there are only two possible
> flows:
> 
> Flow 1 - with plugging
> # caller calls blk_start_plug
> # caller calls submit_bio
> # blk_check_plugged succeeds, and bio is put onto plug->pending list
> # caller calls blk_finish_plug
> # raid1_unplug is called in the same caller's thread, so it does
> bitmap_unplug and generic_make_request
> 
> Flow 2 - without plugging
> # caller calls submit_bio
> # blk_check_plugged fails, and bio is put onto conf->pending_bio_list,
> which means it will be submitted by raid1d
> 
> My conclusion from that: to avoid the handoff to raid1d, the caller always
> needs to plug, even if it has a single bio to submit. But you said "it
> should be possible to submit individual bios directly from
> make_request without passing them to raid1d and without using
> plugging". So can you explain how it is possible? I prefer not to
> change raid1.c code.
> 
> >
> > There is a bit of extra subtlety but I can't really know how relevant that
> > might be to you without actually seeing your code.
> My code (in a different kernel module, not in raid1.c) is simply doing
> submit_bio. I want to wrap this with plug/unplug to avoid the handoff
> to raid1d and improve raid1 latency.
> 

I think I need to see the code you are working with to be able to suggest
anything useful.
But if it works with plugging, then just do it that way(?).

NeilBrown

* Re: Question about RAID1 plug/unplug code
  2014-09-10  9:36         ` NeilBrown
@ 2014-09-11  8:22           ` Alexander Lyakas
  2014-09-12  6:16             ` NeilBrown
  0 siblings, 1 reply; 9+ messages in thread
From: Alexander Lyakas @ 2014-09-11  8:22 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, Yair Hershko

Hi Neil,

On Wed, Sep 10, 2014 at 12:36 PM, NeilBrown <neilb@suse.de> wrote:
> On Wed, 10 Sep 2014 11:01:30 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> wrote:
>
>> Hello Neil,
>>
>> On Tue, Sep 9, 2014 at 12:45 PM, NeilBrown <neilb@suse.de> wrote:
>> > On Tue, 9 Sep 2014 11:33:13 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
>> > wrote:
>> >
>> >> Hi Neil,
>> >>
>> >>
>> >> On Tue, Sep 9, 2014 at 4:45 AM, NeilBrown <neilb@suse.de> wrote:
>> >> > On Mon, 8 Sep 2014 16:55:52 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
>> >> > wrote:
>> >> >
>> >> >> Hi Neil,
>> >> >> We have been seeing high latency on the md/raid1 block device, due to
>> >> >> the fact that all WRITEs are handed off to the raid1d thread. This thread
>> >> >> also calls bitmap_unplug(), which writes the bitmap synchronously.
>> >> >> While it waits for the bitmap, it cannot trigger other WRITEs waiting
>> >> >> in its pending_bio_list. This is especially seen with SSDs: MD's
>> >> >> latency is much higher than SSD latency (I have been stoned by Peter
>> >> >> Grandi when I brought up this issue previously for raid5).
>> >> >>
>> >> >> Then I have noticed the commit:
>> >> >>
>> >> >> commit f54a9d0e59c4bea3db733921ca9147612a6f292c
>> >> >> Author: NeilBrown <neilb@suse.de>
>> >> >> Date:   Thu Aug 2 08:33:20 2012 +1000
>> >> >>
>> >> >>     md/raid1: submit IO from originating thread instead of md thread.
>> >> >>
>> >> >> Looking at the code, I learned that to avoid switching into raid1d,
>> >> >> the caller has to use blk_start_plug/blk_finish_plug. So I added these
>> >> >> calls in our kernel module, which submits bios to MD. Results were
>> >> >> awesome: MD latency went down significantly.
>> >> >
>> >> > That's good to hear.
>> >> >
>> >> >>
>> >> >> So I have several questions about this plug/unplug thing.
>> >> >>
>> >> >> 1/ Originally this infrastructure was supposed to help IO schedulers
>> >> >> in merging requests. It is useful when one has a bunch of requests to
>> >> >> submit in one shot.
>> >> >
>> >> > That is exactly the whole point of plugging:  allow the device to handle a
>> >> > batch of requests together instead of one at a time.
>> >> >
>> >> >> But in the MD case, this infrastructure is used for a different purpose:
>> >> >> not to merge requests (which may help bandwidth, but probably not
>> >> >> latency), but to avoid making raid1d a bottleneck, to be able to
>> >> >> submit requests from multiple threads in parallel, which brings down
>> >> >> latency significantly in our case. Indeed "struct blk_plug" has a
>> >> >> special "cb_list", which is used only by MD.
>> >> >
>> >> > I don't think the way md uses plugging is conceptually different from any
>> >> > other use: it is always about gathering a batch together.
>> >> > "cb_list" is handled by blk_check_plugged() which is also used by
>> >> > block/umem.c and btrfs.
>> >> >
>> >> > The base plugging code assumes that it is only gathering a batch of requests
>> >> > for a single device - if the target device changes then the batch is flushed.
>> >> > It also assumed that it was "struct request" that was batched.
>> >> > For devices like md that want to queue 'struct bio', something else was needed.
>> >> > Also with layered devices it can be useful to gather multiple batches for
>> >> > multiple layers.
>> >> > So I created "cb_list" etc and a more generic interface.
>> >> >
>> >> >> In my case I have only individual bios (not a bunch of bios), and
>> >> >> after I wrap them with plug/unplug, MD latency gets better. So we are
>> >> >> using the plug infrastructure for a different purpose.
>> >> >> Is my understanding correct? Was this your intention?
>> >> >
>> >> > I don't really understand what you are doing.  There is no point in using
>> >> > plugging for individual bios.  The main point for raid1 writes is to gather
>> >> > a lot of writes together so that multiple bitmap bits can be set all at
>> >> > once.
>> >> > It should be possible to submit individual bios directly from make_request
>> >> > without passing them to raid1d and without using plugging.
>> >> Can you please explain how it is possible?
>> >> You have this code for WRITEs:
>> >>         cb = blk_check_plugged(raid1_unplug, mddev, sizeof(*plug));
>> >>         if (cb)
>> >>             plug = container_of(cb, struct raid1_plug_cb, cb);
>> >>         else
>> >>             plug = NULL;
>> >>         spin_lock_irqsave(&conf->device_lock, flags);
>> >>         if (plug) {
>> >>             bio_list_add(&plug->pending, mbio);
>> >>             plug->pending_cnt++;
>> >>         } else {
>> >>             bio_list_add(&conf->pending_bio_list, mbio);
>> >>             conf->pending_count++;
>> >>         }
>> >>         spin_unlock_irqrestore(&conf->device_lock, flags);
>> >>
>> >> If blk_check_plugged returns NULL, then you always hand the
>> >> WRITE to raid1d. So the only option to avoid handoff to raid1d is for
>> >> the caller to plug. Otherwise, all WRITEs are handed off to raid1d and
>> >> latency becomes terrible.
>> >> So in my case, I use plug/unplug for individual bios only to avoid the
>> >> handoff to raid1d.
>> >> What am I missing in this analysis?
>> >
>> > if blk_check_plugged succeeds then it has arranged for raid1_unplug to be
>> > called a little later by that same process.
>> > So there is nothing to stop you calling raid1_unplug immediately.
>> >
>> > raid1_unplug essentially does:
>> >   bitmap_unplug()
>> >   generic_make_request()
>> >
>> > so you can very nearly just do that, without any plugging.
>> I am sorry, but I did not understand your reply. Maybe I did not
>> explain myself; I will try again.
>>
>> I am not changing raid1.c code. I just want to avoid the handoff to
>> raid1d on WRITEs. According to your code, there are only two possible
>> flows:
>>
>> Flow 1 - with plugging
>> # caller calls blk_start_plug
>> # caller calls submit_bio
>> # blk_check_plugged succeeds, and bio is put onto plug->pending list
>> # caller calls blk_finish_plug
>> # raid1_unplug is called in the same caller's thread, so it does
>> bitmap_unplug and generic_make_request
>>
>> Flow 2 - without plugging
>> # caller calls submit_bio
>> # blk_check_plugged fails, and bio is put onto conf->pending_bio_list,
>> which means it will be submitted by raid1d
>>
>> My conclusion from that: to avoid the handoff to raid1d, the caller always
>> needs to plug, even if it has a single bio to submit. But you said "it
>> should be possible to submit individual bios directly from
>> make_request without passing them to raid1d and without using
>> plugging". So can you explain how it is possible? I prefer not to
>> change raid1.c code.
>>
>> >
>> > There is a bit of extra subtlety but I can't really know how relevant that
>> > might be to you without actually seeing your code.
>> My code (in a different kernel module, not in raid1.c) is simply doing
>> submit_bio. I want to wrap this with plug/unplug to avoid the handoff
>> to raid1d and improve raid1 latency.
>>
>
> I think I need to see the code you are working with to be able to suggest
> anything useful.
I am working with kernel 3.8.13. But your master branch has the same
code with respect to plug/unplug logic.

> But if it works with plugging, then just do it that way(?).
It works perfectly, and latency is much better. The only doubt is with
bitmap_unplug being called from multiple threads now. However, it can
happen for anybody that uses plug/unplug on top of MD raid1 (like ext4
for example). So the question is whether it is safe for MD users to
plug/unplug when submitting bios to MD. If not, would you be fixing
this?

Alex.

* Re: Question about RAID1 plug/unplug code
  2014-09-11  8:22           ` Alexander Lyakas
@ 2014-09-12  6:16             ` NeilBrown
  2014-09-14 17:52               ` Alexander Lyakas
  0 siblings, 1 reply; 9+ messages in thread
From: NeilBrown @ 2014-09-12  6:16 UTC (permalink / raw)
  To: Alexander Lyakas; +Cc: linux-raid, Yair Hershko

On Thu, 11 Sep 2014 11:22:40 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
wrote:

> Hi Neil,
> 
> On Wed, Sep 10, 2014 at 12:36 PM, NeilBrown <neilb@suse.de> wrote:
> > On Wed, 10 Sep 2014 11:01:30 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> > wrote:
> >
> >> Hello Neil,
> >>
> >> On Tue, Sep 9, 2014 at 12:45 PM, NeilBrown <neilb@suse.de> wrote:
> >> > On Tue, 9 Sep 2014 11:33:13 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> >> > wrote:
> >> >
> >> >> Hi Neil,
> >> >>
> >> >>
> >> >> On Tue, Sep 9, 2014 at 4:45 AM, NeilBrown <neilb@suse.de> wrote:
> >> >> > On Mon, 8 Sep 2014 16:55:52 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> >> >> > wrote:
> >> >> >
> >> >> >> Hi Neil,
> >> >> >> We have been seeing high latency on the md/raid1 block device, due to
> >> >> >> the fact that all WRITEs are handed off to the raid1d thread. This thread
> >> >> >> also calls bitmap_unplug(), which writes the bitmap synchronously.
> >> >> >> While it waits for the bitmap, it cannot trigger other WRITEs waiting
> >> >> >> in its pending_bio_list. This is especially seen with SSDs: MD's
> >> >> >> latency is much higher than SSD latency (I have been stoned by Peter
> >> >> >> Grandi when I brought up this issue previously for raid5).
> >> >> >>
> >> >> >> Then I have noticed the commit:
> >> >> >>
> >> >> >> commit f54a9d0e59c4bea3db733921ca9147612a6f292c
> >> >> >> Author: NeilBrown <neilb@suse.de>
> >> >> >> Date:   Thu Aug 2 08:33:20 2012 +1000
> >> >> >>
> >> >> >>     md/raid1: submit IO from originating thread instead of md thread.
> >> >> >>
> >> >> >> Looking at the code, I learned that to avoid switching into raid1d,
> >> >> >> the caller has to use blk_start_plug/blk_finish_plug. So I added these
> >> >> >> calls in our kernel module, which submits bios to MD. Results were
> >> >> >> awesome: MD latency went down significantly.
> >> >> >
> >> >> > That's good to hear.
> >> >> >
> >> >> >>
> >> >> >> So I have several questions about this plug/unplug thing.
> >> >> >>
> >> >> >> 1/ Originally this infrastructure was supposed to help IO schedulers
> >> >> >> in merging requests. It is useful when one has a bunch of requests to
> >> >> >> submit in one shot.
> >> >> >
> >> >> > That is exactly the whole point of plugging:  allow the device to handle a
> >> >> > batch of requests together instead of one at a time.
> >> >> >
> >> >> >> But in the MD case, this infrastructure is used for a different purpose:
> >> >> >> not to merge requests (which may help bandwidth, but probably not
> >> >> >> latency), but to avoid making raid1d a bottleneck, to be able to
> >> >> >> submit requests from multiple threads in parallel, which brings down
> >> >> >> latency significantly in our case. Indeed "struct blk_plug" has a
> >> >> >> special "cb_list", which is used only by MD.
> >> >> >
> >> >> > I don't think the way md uses plugging is conceptually different from any
> >> >> > other use: it is always about gathering a batch together.
> >> >> > "cb_list" is handled by blk_check_plugged() which is also used by
> >> >> > block/umem.c and btrfs.
> >> >> >
> >> >> > The base plugging code assumes that it is only gathering a batch of requests
> >> >> > for a single device - if the target device changes then the batch is flushed.
> >> >> > It also assumed that it was "struct request" that was batched.
> >> >> > For devices like md that want to queue 'struct bio', something else was needed.
> >> >> > Also with layered devices it can be useful to gather multiple batches for
> >> >> > multiple layers.
> >> >> > So I created "cb_list" etc and a more generic interface.
> >> >> >
> >> >> >> In my case I have only individual bios (not a bunch of bios), and
> >> >> >> after I wrap them with plug/unplug, MD latency gets better. So we are
> >> >> >> using the plug infrastructure for a different purpose.
> >> >> >> Is my understanding correct? Was this your intention?
> >> >> >
> >> >> > I don't really understand what you are doing.  There is no point in using
> >> >> > plugging for individual bios.  The main point for raid1 writes is to gather
> >> >> > a lot of writes together so that multiple bitmap bits can be set all at
> >> >> > once.
> >> >> > It should be possible to submit individual bios directly from make_request
> >> >> > without passing them to raid1d and without using plugging.
> >> >> Can you please explain how it is possible?
> >> >> You have this code for WRITEs:
> >> >>         cb = blk_check_plugged(raid1_unplug, mddev, sizeof(*plug));
> >> >>         if (cb)
> >> >>             plug = container_of(cb, struct raid1_plug_cb, cb);
> >> >>         else
> >> >>             plug = NULL;
> >> >>         spin_lock_irqsave(&conf->device_lock, flags);
> >> >>         if (plug) {
> >> >>             bio_list_add(&plug->pending, mbio);
> >> >>             plug->pending_cnt++;
> >> >>         } else {
> >> >>             bio_list_add(&conf->pending_bio_list, mbio);
> >> >>             conf->pending_count++;
> >> >>         }
> >> >>         spin_unlock_irqrestore(&conf->device_lock, flags);
> >> >>
> >> >> If blk_check_plugged returns NULL, then you always hand the
> >> >> WRITE to raid1d. So the only option to avoid handoff to raid1d is for
> >> >> the caller to plug. Otherwise, all WRITEs are handed off to raid1d and
> >> >> latency becomes terrible.
> >> >> So in my case, I use plug/unplug for individual bios only to avoid the
> >> >> handoff to raid1d.
> >> >> What am I missing in this analysis?
> >> >
> >> > if blk_check_plugged succeeds then it has arranged for raid1_unplug to be
> >> > called a little later by that same process.
> >> > So there is nothing to stop you calling raid1_unplug immediately.
> >> >
> >> > raid1_unplug essentially does:
> >> >   bitmap_unplug()
> >> >   generic_make_request()
> >> >
> >> > so you can very nearly just do that, without any plugging.
> >> I am sorry, but I did not understand your reply. Maybe I did not
> >> explain myself; I will try again.
> >>
> >> I am not changing raid1.c code. I just want to avoid the handoff to
> >> raid1d on WRITEs. According to your code, there are only two possible
> >> flows:
> >>
> >> Flow 1 - with plugging
> >> # caller calls blk_start_plug
> >> # caller calls submit_bio
> >> # blk_check_plugged succeeds, and bio is put onto plug->pending list
> >> # caller calls blk_finish_plug
> >> # raid1_unplug is called in the same caller's thread, so it does
> >> bitmap_unplug and generic_make_request
> >>
> >> Flow 2 - without plugging
> >> # caller calls submit_bio
> >> # blk_check_plugged fails, and bio is put onto conf->pending_bio_list,
> >> which means it will be submitted by raid1d
> >>
> >> My conclusion from that: to avoid the handoff to raid1d, the caller always
> >> needs to plug, even if it has a single bio to submit. But you said "it
> >> should be possible to submit individual bios directly from
> >> make_request without passing them to raid1d and without using
> >> plugging". So can you explain how it is possible? I prefer not to
> >> change raid1.c code.
> >>
> >> >
> >> > There is a bit of extra subtlety but I can't really know how relevant that
> >> > might be to you without actually seeing your code.
> >> My code (in a different kernel module, not in raid1.c) is simply doing
> >> submit_bio. I want to wrap this with plug/unplug to avoid the handoff
> >> to raid1d and improve raid1 latency.
> >>
> >
> > I think I need to see the code you are working with to be able to suggest
> > anything useful.
> I am working with kernel 3.8.13. But your master branch has the same
> code with respect to plug/unplug logic.
> 
> > But if it works with plugging, then just do it that way(?).
> It works perfectly, and latency is much better. The only doubt is with
> bitmap_unplug being called from multiple threads now. However, it can
> happen for anybody that uses plug/unplug on top of MD raid1 (like ext4
> for example). So the question is whether it is safe for MD users to
> plug/unplug when submitting bios to MD. If not, would you be fixing
> this?

Didn't I already answer that question?

  Date: Tue, 9 Sep 2014 11:45:38 +1000

  Hmmm... there could be an issue there.  It is possible that some callers of
  bitmap_unplug won't block when they should.  bitmap_unplug should probably
  wait unconditionally.

See
http://git.neil.brown.name/?p=md.git;a=commitdiff;h=339f60d943f848eb516aa4b82b5e187dbbe088dc

(not tested yet).
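
The change is essentially to make the tail of bitmap_unplug() wait
unconditionally, instead of only when this caller wrote out a DIRTY
page itself; sketched against the 3.8-era code (paraphrased, not the
verbatim patch):

        /* tail of bitmap_unplug(), before: */
        if (wait) { /* set only if *we* submitted a DIRTY page above */
            if (bitmap->storage.file)
                wait_event(bitmap->write_wait,
                           atomic_read(&bitmap->pending_writes) == 0);
            else
                md_super_wait(bitmap->mddev);
        }

        /* after: wait unconditionally, because with several threads
         * unplugging in parallel, the page covering our WRITE may have
         * been submitted by another thread and still be in flight: */
        if (bitmap->storage.file)
            wait_event(bitmap->write_wait,
                       atomic_read(&bitmap->pending_writes) == 0);
        else
            md_super_wait(bitmap->mddev);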

NeilBrown


> 
> Alex.

* Re: Question about RAID1 plug/unplug code
  2014-09-12  6:16             ` NeilBrown
@ 2014-09-14 17:52               ` Alexander Lyakas
  0 siblings, 0 replies; 9+ messages in thread
From: Alexander Lyakas @ 2014-09-14 17:52 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, Yair Hershko

Thanks, Neil.
We will try to see whether it is possible to reproduce the corruption
without the patch, and also test what performance implications the
patch has.

Alex.


On Fri, Sep 12, 2014 at 9:16 AM, NeilBrown <neilb@suse.de> wrote:
> On Thu, 11 Sep 2014 11:22:40 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
> wrote:
>
>> Hi Neil,
>>
>> On Wed, Sep 10, 2014 at 12:36 PM, NeilBrown <neilb@suse.de> wrote:
>> > On Wed, 10 Sep 2014 11:01:30 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
>> > wrote:
>> >
>> >> Hello Neil,
>> >>
>> >> On Tue, Sep 9, 2014 at 12:45 PM, NeilBrown <neilb@suse.de> wrote:
>> >> > On Tue, 9 Sep 2014 11:33:13 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
>> >> > wrote:
>> >> >
>> >> >> Hi Neil,
>> >> >>
>> >> >>
>> >> >> On Tue, Sep 9, 2014 at 4:45 AM, NeilBrown <neilb@suse.de> wrote:
>> >> >> > On Mon, 8 Sep 2014 16:55:52 +0300 Alexander Lyakas <alex.bolshoy@gmail.com>
>> >> >> > wrote:
>> >> >> >
>> >> >> >> Hi Neil,
>> >> >> >> We have been seeing high latency on the md/raid1 block device, due to
>> >> >> >> the fact that all WRITEs are handed off to the raid1d thread. This thread
>> >> >> >> also calls bitmap_unplug(), which writes the bitmap synchronously.
>> >> >> >> While it waits for the bitmap, it cannot trigger other WRITEs waiting
>> >> >> >> in its pending_bio_list. This is especially seen with SSDs: MD's
>> >> >> >> latency is much higher than SSD latency (I have been stoned by Peter
>> >> >> >> Grandi when I brought up this issue previously for raid5).
>> >> >> >>
>> >> >> >> Then I have noticed the commit:
>> >> >> >>
>> >> >> >> commit f54a9d0e59c4bea3db733921ca9147612a6f292c
>> >> >> >> Author: NeilBrown <neilb@suse.de>
>> >> >> >> Date:   Thu Aug 2 08:33:20 2012 +1000
>> >> >> >>
>> >> >> >>     md/raid1: submit IO from originating thread instead of md thread.
>> >> >> >>
>> >> >> >> Looking at the code, I learned that to avoid switching into raid1d,
>> >> >> >> the caller has to use blk_start_plug/blk_finish_plug. So I added these
>> >> >> >> calls in our kernel module, which submits bios to MD. Results were
>> >> >> >> awesome: MD latency went down significantly.
>> >> >> >
>> >> >> > That's good to hear.
>> >> >> >
>> >> >> >>
>> >> >> >> So I have several questions about this plug/unplug thing.
>> >> >> >>
>> >> >> >> 1/ Originally this infrastructure was supposed to help IO schedulers
>> >> >> >> in merging requests. It is useful when one has a bunch of requests to
>> >> >> >> submit in one shot.
>> >> >> >
>> >> >> > That is exactly the whole point of plugging:  allow the device to handle a
>> >> >> > batch of requests together instead of one at a time.
>> >> >> >
>> >> >> >> But in the MD case, this infrastructure is used for a different purpose:
>> >> >> >> not to merge requests (which may help bandwidth, but probably not
>> >> >> >> latency), but to avoid making raid1d a bottleneck, to be able to
>> >> >> >> submit requests from multiple threads in parallel, which brings down
>> >> >> >> latency significantly in our case. Indeed "struct blk_plug" has a
>> >> >> >> special "cb_list", which is used only by MD.
>> >> >> >
>> >> >> > I don't think the way md uses plugging is conceptually different from any
>> >> >> > other use: it is always about gathering a batch together.
>> >> >> > "cb_list" is handled by blk_check_plugged() which is also used by
>> >> >> > block/umem.c and btrfs.
>> >> >> >
>> >> >> > The base plugging code assumes that it is only gathering a batch of requests
>> >> >> > for a single device - if the target device changes then the batch is flushed.
>> >> >> > It also assumed that it was "struct request" that was batched.
>> >> >> > For devices like md that want to queue 'struct bio', something else was needed.
>> >> >> > Also with layered devices it can be useful to gather multiple batches for
>> >> >> > multiple layers.
>> >> >> > So I created "cb_list" etc and a more generic interface.
>> >> >> >
>> >> >> >> In my case I have only individual bios (not a bunch of bios), and
>> >> >> >> after I wrap them with plug/unplug, MD latency gets better. So we are
>> >> >> >> using the plug infrastructure for a different purpose.
>> >> >> >> Is my understanding correct? Was this your intention?
>> >> >> >
>> >> >> > I don't really understand what you are doing.  There is no point in using
>> >> >> > plugging for individual bios.  The main point for raid1 writes is to gather
>> >> >> > a lot of writes together so that multiple bitmap bits can be set all at
>> >> >> > once.
>> >> >> > It should be possible to submit individual bios directly from make_request
>> >> >> > without passing them to raid1d and without using plugging.
>> >> >> Can you please explain how it is possible?
>> >> >> You have this code for WRITEs:
>> >> >>         cb = blk_check_plugged(raid1_unplug, mddev, sizeof(*plug));
>> >> >>         if (cb)
>> >> >>             plug = container_of(cb, struct raid1_plug_cb, cb);
>> >> >>         else
>> >> >>             plug = NULL;
>> >> >>         spin_lock_irqsave(&conf->device_lock, flags);
>> >> >>         if (plug) {
>> >> >>             bio_list_add(&plug->pending, mbio);
>> >> >>             plug->pending_cnt++;
>> >> >>         } else {
>> >> >>             bio_list_add(&conf->pending_bio_list, mbio);
>> >> >>             conf->pending_count++;
>> >> >>         }
>> >> >>         spin_unlock_irqrestore(&conf->device_lock, flags);
>> >> >>
>> >> >> If blk_check_plugged returns NULL, then you always hand the
>> >> >> WRITE to raid1d. So the only option to avoid handoff to raid1d is for
>> >> >> the caller to plug. Otherwise, all WRITEs are handed off to raid1d and
>> >> >> latency becomes terrible.
>> >> >> So in my case, I use plug/unplug for individual bios only to avoid the
>> >> >> handoff to raid1d.
>> >> >> What am I missing in this analysis?
>> >> >
>> >> > if blk_check_plugged succeeds then it has arranged for raid1_unplug to be
>> >> > called a little later by that same process.
>> >> > So there is nothing to stop you calling raid1_unplug immediately.
>> >> >
>> >> > raid1_unplug essentially does:
>> >> >   bitmap_unplug()
>> >> >   generic_make_request()
>> >> >
>> >> > so you can very nearly just do that, without any plugging.
>> >> I am sorry, but I did not understand your reply. Maybe I did not
>> >> explain myself; I will try again.
>> >>
>> >> I am not changing raid1.c code. I just want to avoid the handoff to
>> >> raid1d on WRITEs. According to your code, there are only two possible
>> >> flows:
>> >>
>> >> Flow 1 - with plugging
>> >> # caller calls blk_start_plug
>> >> # caller calls submit_bio
>> >> # blk_check_plugged succeeds, and bio is put onto plug->pending list
>> >> # caller calls blk_finish_plug
>> >> # raid1_unplug is called in the same caller's thread, so it does
>> >> bitmap_unplug and generic_make_request
>> >>
>> >> Flow 2 - without plugging
>> >> # caller calls submit_bio
>> >> # blk_check_plugged fails, and bio is put onto conf->pending_bio_list,
>> >> which means it will be submitted by raid1d
>> >>
>> >> My conclusion from that: to avoid the handoff to raid1d, the caller always
>> >> needs to plug, even if it has a single bio to submit. But you said "it
>> >> should be possible to submit individual bios directly from
>> >> make_request without passing them to raid1d and without using
>> >> plugging". So can you explain how it is possible? I prefer not to
>> >> change raid1.c code.
>> >>
>> >> >
>> >> > There is a bit of extra subtlety but I can't really know how relevant that
>> >> > might be to you without actually seeing your code.
>> >> My code (in a different kernel module, not in raid1.c) is simply doing
>> >> submit_bio. I want to wrap this with plug/unplug to avoid the handoff
>> >> to raid1d and improve raid1 latency.
>> >>
>> >
>> > I think I need to see the code you are working with to be able to suggest
>> > anything useful.
>> I am working with kernel 3.8.13. But your master branch has the same
>> code with respect to plug/unplug logic.
>>
>> > But if it works with plugging, then just do it that way(?).
>> It works perfectly, and latency is much better. The only doubt is with
>> bitmap_unplug being called from multiple threads now. However, it can
>> happen for anybody that uses plug/unplug on top of MD raid1 (like ext4
>> for example). So the question is whether it is safe for MD users to
>> plug/unplug when submitting bios to MD. If not, would you be fixing
>> this?
>
> Didn't I already answer that question?
>
>   Date: Tue, 9 Sep 2014 11:45:38 +1000
>
>   Hmmm... there could be an issue there.  It is possible that some callers of
>   bitmap_unplug won't block when they should.  bitmap_unplug should probably
>   wait unconditionally.
>
> See
> http://git.neil.brown.name/?p=md.git;a=commitdiff;h=339f60d943f848eb516aa4b82b5e187dbbe088dc
>
> (not tested yet).
>
> NeilBrown
>
>
>>
>> Alex.

Thread overview: 9+ messages
2014-09-08 13:55 Question about RAID1 plug/unplug code Alexander Lyakas
2014-09-09  1:45 ` NeilBrown
2014-09-09  8:33   ` Alexander Lyakas
2014-09-09  9:45     ` NeilBrown
2014-09-10  8:01       ` Alexander Lyakas
2014-09-10  9:36         ` NeilBrown
2014-09-11  8:22           ` Alexander Lyakas
2014-09-12  6:16             ` NeilBrown
2014-09-14 17:52               ` Alexander Lyakas
